Commit 486c2c4e authored by Jan Niklas Böhm's avatar Jan Niklas Böhm

Move the chapters to separate files

also rename the label to viz
parent 2413ccd0
@@ -43,352 +43,11 @@
% \setcounter{tocdepth}{0}
\tableofcontents
\chapter*{Preface}\label{preface}
\addcontentsline{toc}{chapter}{Preface}
\pagenumbering{arabic}
\setcounter{page}{1}
The work presented here was done over the course of an internship
at the DFKI in 2017 under the supervision of Dr.~Damian Borth and
Dr.~Christian Schulze.
The internship took place within the
MADM\footnote{\url{http://madm.dfki.de/}} group, more specifically as
part of the Multimedia Opinion Mining (MOM) project, which is funded
by the Federal Ministry of Education and Research (BMBF). The goal of
the work is to create a social media corporate genome that can
display information about any company with a sufficient online
presence.
For this particular project the focus was on the analysis of Twitter
data and the links present in tweets, connecting companies and other
Twitter users. Despite narrowing down the data, the resulting dataset
was large enough to cause problems that took considerable time to
resolve or required workarounds.
\chapter{Motivation}\label{motivation}
Companies use social media to interact with potential customers and
generally to advertise themselves online. The way a company presents
itself online becomes increasingly important as social networks like
Twitter or Facebook expand and their influence grows, since users
continue to spend more and more time on these platforms.
% cite something (maybe sources from statista?)
The online presence of a company usually grows organically and is
often split up over multiple accounts, especially in big international
conglomerates. Additionally, a company can hold multiple subsidiaries
that do not show any obvious connection to the parent company. An
example would be Nestlé, a company which owns many food brands, for
example Maggi or Hot Pockets, that share no part of the name or
corporate styling that would suggest a connection to the parent
company, other than their legal status, which a normal person is
usually not aware of. By linking all the available information
%
together and displaying a relation – if one can be found in the data –
a more holistic overview of a company can be formed, and ulterior
motives could possibly be detected from such a display.
Twitter does not show changes that are made to an account over time,
for example by how much its follower count increased or how many
tweets were posted by the account in a certain time span.
Only the current number of
%
followers or tweets is displayed, but not the rate of change of these
figures. By observing these numbers over time, more insight can be
gained about the strategy of a company; for example, the start of a
media campaign could be pinpointed to a specific date.
Additionally, sudden jumps in these statistics can signal a
significant event or the purchase of followers.
\medskip\noindent
%
Tweets are ephemeral in nature and rarely of interest to others after
some time, especially since Twitter tries to brand itself as a
real-time social media service. But the way Twitter users communicate
with other accounts, and whom they communicate with, follows a more or
less obvious pattern which, when observed over time, can offer insight
into how companies communicate and possibly uncover some implicit
strategies.
Analyzing social media data is currently en vogue and mainstream media
is picking up on the possibilities for journalistic
work.\footnote{\url{http://digitalpresent.tagesspiegel.de/afd-unterstuetzernetzwerk}}
\chapter{Data}\label{data}
The basis for the analysis is data made available by Twitter via its API.\footnote{\url{https://dev.twitter.com/overview/api}}
Since the entirety of the data from Twitter is neither accessible nor
feasible to learn on as a whole, it has been narrowed down to data
from accounts that
%
have officially been verified by Twitter. In general, the accounts
belonging to this subset present more information about themselves and
are more active than the average Twitter account.
The most interesting parts of the data are the links between accounts
or tweets, since they represent relations between those entities. The
other information embedded in the responses is also saved and is used
for visualizing key facts about a user entity (see
chapter~\ref{visualization}).
Twitter offers an API for selected data.
Unfortunately, the API is too restrictive for some interesting types
of data, which is why the first task was to develop a distributed
program that can download data from Twitter in parallel with the help
of multiple accounts. Responses from the API come in the form of JSON
objects or arrays of JSON objects, which can be stored natively in
MongoDB, the database used for persisting the responses.
% sample response for user or narrowed list of tweets?
\section{The Twitter API}
% describe the API and what the limits are.
While the API offers free access to the data that users post to
Twitter, Twitter restricts it in order to limit the load on its servers
and to entice users of the API to buy commercial access.
All limitations are reset after fifteen minutes elapse, starting with
the first request posted to the specified API endpoint.
%
An endpoint is a certain suffix to the root URL
\texttt{https://api.twitter.com/1.1/} and determines what is going to
be returned by the Twitter API. An excerpt is shown in
Table~\ref{tbl:limits}; the listed endpoints deliver data that is
interesting for further processing. Note that the limits for
\texttt{followers/ids} and \texttt{friends/ids} are considerably
lower than the others. This limitation makes it infeasible to
download data that shows the friendship status.
As a result, the data that is saved is limited to the profiles of all
verified users, which are re-downloaded every six hours, and all
tweets posted by them. Additionally, all tweets that mention those
users are saved as well. The API limits the tweets originating from a
user to the most recent 3200, while the tweets that mention a certain
user are limited to tweets that are younger than a week.
% table for the interesting endpoints
\begin{table}
\centering\begin{tabular}{l r p{6.5cm}}
\hline
Endpoint & Limit & Description \\ \hline\smallskip
\texttt{followers/ids} & {\setmainfont[Numbers={Uppercase,Monospaced}]{Vollkorn}15} & Returns up to 5000 IDs of users that follow the specified user. \\ \smallskip
\texttt{friends/ids} & {\setmainfont[Numbers={Uppercase,Monospaced}]{Vollkorn}15} & Returns up to 5000 IDs of users that the specified user is following.\\ \smallskip
\texttt{search/tweets} & {\setmainfont[Numbers={Uppercase,Monospaced}]{Vollkorn}900} & Returns up to 100 tweets mentioning a specified user. \\ \smallskip
\texttt{statuses/user\_timeline} & {\setmainfont[Numbers={Uppercase,Monospaced}]{Vollkorn}1500} & Returns up to 200 tweets from a specified user in reverse chronological order.\\\smallskip
\texttt{users/lookup} & {\setmainfont[Numbers={Uppercase,Monospaced}]{Vollkorn}300} & Returns data for up to 100 users. \\ \smallskip
\texttt{users/show} & {\setmainfont[Numbers={Uppercase,Monospaced}]{Vollkorn}900} & Returns data for a single user. \\
\hline
\end{tabular}
\caption{\label{tbl:limits} Excerpt of the number of requests that
can be sent to the Twitter API in a fifteen-minute time frame.
The values displayed are for app authentication – in contrast to
user authentication – since the limits are not shared across a
single account but instead are counted individually for each app,
of which a user can have multiple.}
\end{table}
% Is there a good way to cite the online manual?
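\medskip\noindent
To make the limits in Table~\ref{tbl:limits} concrete, the following
sketch derives the delay a single app has to keep between two requests
to the same endpoint. The module and function names are illustrative
and not part of the actual crawler; the hard-coded limits are the
app-authentication values from the table.
\begin{verbatim}
%% Minimal sketch: per-request delay that stays inside Twitter's
%% fifteen-minute rate window (module name is illustrative).
-module(rate_limit).
-export([delay_ms/1]).

%% Requests allowed per fifteen-minute window (app authentication).
limit("followers/ids")          -> 15;
limit("friends/ids")            -> 15;
limit("search/tweets")          -> 900;
limit("statuses/user_timeline") -> 1500;
limit("users/lookup")           -> 300;
limit("users/show")             -> 900.

%% Milliseconds to wait between two requests to the same endpoint.
delay_ms(Endpoint) ->
    WindowMs = 15 * 60 * 1000,
    WindowMs div limit(Endpoint).
\end{verbatim}
For \texttt{followers/ids} this amounts to roughly one request per
minute and app, which illustrates why the friendship data cannot be
collected in a reasonable time frame.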
\section{Program Architecture}
The first iteration for pulling in data from Twitter was carried out
in Python, running in a single thread. The approach worked for
downloading user data, but was not sufficient for downloading tweets,
as there are far more of them available and they need to be downloaded
with a higher throughput.
As the list of verified users grows over time, it is also debatable
whether the single-threaded program would be future-proof.
Since the program has to run in parallel, it made sense to use a
language conceived for such tasks. The choice fell on Erlang, which
has been designed for systems that execute procedures in parallel and
in general helps with developing distributed applications~\citep{lyse}.
Initially, the language was developed for telecommunication systems % by Ericsson Laboratories
that need to run with high availability and low latency. For the task
at hand, the primitives for distributed communication were the most
important factors in the decision to switch to Erlang.
The language has a concept of “behaviors” that are in some aspects
similar to interfaces as they exist in Java, where certain functions
need to be implemented so that a module can be plugged into another
component. For example, this allows the creation of a server by only
specifying what calls it should handle and what responses it should
send back. The generic part has already been implemented and takes
care of persisting the state and waiting for the next call. The
behaviors that are used in the project are the following:
\begin{description}
\item [gen\_server] A generic server to which messages can be sent
synchronously. The interface for using it does not expose the
network communication between the client that calls a function and
the listening server. The queues are implemented with this type of
behavior – they respond to calls by sending back payload
information (see the sketch after this list).
\item [gen\_fsm\footnotemark]\footnotetext{The module has been
deprecated in the most recent version of Erlang in favor of
gen\_statem.} This behavior is used to write a
custom finite state machine. The worker has been implemented with
this behavior; the finite state machine is displayed in
Figure~\ref{fig:worker-fsm}.
\item [gen\_event] A module that implements gen\_event can handle
events that are sent to it. The event handler is in use as it
allows the data to be processed multiple times without having to copy
it more than once. Originally it was also supposed to compress some of
the data and save it directly to disk, but this was not implemented
because it would prevent the deduplication that is enabled in the
database. Still, by generating an event with the received data, it is
inserted into the database asynchronously, which gave the program a
considerable speedup.
\item [supervisor] Supervisors are concerned with nothing but
starting, stopping, and restarting the components of an application.
They have detailed information about how the started processes
(called children) are supposed to be handled in case they crash, how
the surviving children are supposed to be treated, and how many times
a restart should be attempted.
Figure~\ref{fig:overview-arch} details the hierarchy and structure
of the program. All nodes containing the word “sup” are
supervisors, as is the custom for most Erlang programs.
\end{description}
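\medskip\noindent
As a minimal sketch of the gen\_server behavior, the following module
implements a queue that hands out the next payload on a synchronous
call. The module name, the stored state, and the client function are
illustrative and simplified compared to the actual implementation.
\begin{verbatim}
%% Minimal sketch of a queue as a gen_server (names are
%% illustrative, not the project's actual modules).
-module(payload_queue).
-behaviour(gen_server).
-export([start_link/1, next/0]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link(Payloads) ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, Payloads, []).

%% Client API: a worker asks the queue for its next unit of work.
next() ->
    gen_server:call(?MODULE, next).

%% Callbacks: the framework keeps the state (the remaining payloads)
%% and waits for the next call.
init(Payloads) ->
    {ok, Payloads}.

handle_call(next, _From, [Payload | Rest]) ->
    {reply, {ok, Payload}, Rest};
handle_call(next, _From, []) ->
    {reply, empty, []}.

handle_cast(_Msg, State) ->
    {noreply, State}.
\end{verbatim}
A worker would simply call \texttt{payload\_queue:next()} and receive
either \texttt{\{ok, Payload\}} or \texttt{empty}, without ever
dealing with the message passing underneath.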
\input{tikz/prog-arch}
\input{tikz/worker-fsm}
The program is split into multiple queue, worker, event-receiving, and
supervisor processes. Additionally, there is a database pool and
another process that sends the signals to start the download
procedure.
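\medskip\noindent
A supervisor for such a structure could look roughly like the
following sketch. The child modules and restart parameters are
illustrative and do not mirror the exact tree shown in
Figure~\ref{fig:overview-arch}.
\begin{verbatim}
%% Minimal sketch (child modules are illustrative) of a supervisor
%% wiring together the queue, the event manager, and a worker.
-module(crawler_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    SupFlags = #{strategy => one_for_one,
                 intensity => 5,   %% at most five restarts ...
                 period => 60},    %% ... within sixty seconds
    Children =
        [#{id => payload_queue,
           start => {payload_queue, start_link, [[]]}},
         #{id => event_manager,
           start => {gen_event, start_link, [{local, crawl_events}]}},
         #{id => worker,                        %% hypothetical module
           start => {crawl_worker, start_link, []}}],
    {ok, {SupFlags, Children}}.
\end{verbatim}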
A worker process retrieves the parameters for the API call from the
queue and issues a request to the assigned API endpoint. After
receiving a response, an event is generated that is picked up by the
event-receiving process. This process interacts with the database pool
and inserts the data synchronously while the payload for the next API
request is computed. The supervisors restart failing processes and
ensure an orderly start-up and shutdown.
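\medskip\noindent
The event-receiving side can be sketched as a gen\_event handler that
writes every received response to the database. The event shape and
the \texttt{db\_pool} module are assumptions for illustration; the
important point is that \texttt{gen\_event:notify/2} returns
immediately, so the worker is not blocked by the insert.
\begin{verbatim}
%% Minimal sketch of the event handler (db_pool is a hypothetical
%% database-pool API, not the project's actual module).
-module(store_handler).
-behaviour(gen_event).
-export([init/1, handle_event/2, handle_call/2]).

%% The handler would be attached to the manager with
%% gen_event:add_handler(crawl_events, store_handler, PoolName).
init(PoolName) ->
    {ok, PoolName}.

%% Each {response, Collection, Docs} event is inserted through the
%% database pool; the worker that emitted it has already moved on.
handle_event({response, Collection, Docs}, PoolName) ->
    ok = db_pool:insert(PoolName, Collection, Docs),
    {ok, PoolName};
handle_event(_Other, PoolName) ->
    {ok, PoolName}.

handle_call(_Request, State) ->
    {ok, ok, State}.
\end{verbatim}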
\medskip\noindent%
Further analysis concerns itself with the relations between users and
tweets. From every downloaded tweet the author, the mentioned users,
and the tweet that the current one is replying to are extracted and
saved as edges. These edges form the basis of the social network graph
that is of interest for the following analysis. In the graph, a user
or a tweet is represented by a node, and the edges consist of the
mentioned associations contained in a tweet object. Unfortunately, the
friendship between users – who follows whom – could not be incorporated
into this graph, as the data cannot be retrieved in a suitable time
frame. %
Currently there are about 1.6 billion edges.
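\medskip\noindent
A sketch of the edge extraction, assuming the tweet has been decoded
into an Erlang map with binary keys (field handling is simplified and
the module name is illustrative):
\begin{verbatim}
%% Minimal sketch: extract the edges implied by one decoded tweet --
%% author, mentioned users, and the replied-to tweet.
-module(edges).
-export([from_tweet/1]).

from_tweet(Tweet = #{<<"id">> := TweetId,
                     <<"user">> := #{<<"id">> := AuthorId}}) ->
    Entities = maps:get(<<"entities">>, Tweet, #{}),
    Mentions = [ {TweetId, mentions, Id}
                 || #{<<"id">> := Id}
                    <- maps:get(<<"user_mentions">>, Entities, []) ],
    Reply = case maps:get(<<"in_reply_to_status_id">>, Tweet, null) of
                null    -> [];
                ReplyId -> [{TweetId, replies_to, ReplyId}]
            end,
    [{AuthorId, authored, TweetId} | Mentions ++ Reply].
\end{verbatim}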
Additionally, the user profile, together with some statistics, is
downloaded every six hours. This is later used to create a timeline of
changes to the account and can help reveal unusual account behavior.
\chapter{Visualization}\label{visualization}
The captured information needs to be visualized to create a meaningful
representation. Since the ultimate goal is to create a genome, the
visualization should incorporate an overview of the most vital
statistics that could be gathered from the available data.
As the visualization should be interactive, the choice fell on
JavaScript and the D3 library for
visualization.\footnote{\url{https://d3js.org/}} The library works by
generating elements standardized by the W3C and is thus very flexible,
albeit a bit complex. The complexity stems from the web standards –
HTML, CSS, and especially SVG and HTML5-\texttt{<canvas>} –
themselves, as they take some time to learn, as do the concepts of D3
like the “data join”\footnote{\url{https://bost.ocks.org/mike/join/}}
or “update patterns”. The concepts of D3 may seem foreign at first,
but once learned they ease the process of extending a visualization by
abstracting away details of the data.
Still, the process of visualizing elements is a bottom-up approach
that takes considerable time, as the interface to D3 is very low-level
and barely any plotting primitives are provided out of the box. As an
example, instead of calling a \texttt{scatterplot()} function that
takes two-dimensional data, the axes have to be specified, their
scales need to have their domain and range set up – i.~e.\ mapping the
values 0 to 5000 to a range of 0 to 900 pixels – and the shape of the
elements needs to be defined as an SVG element. The detailed
declaration of the arrangement gives a lot of control over the layout
and the display of the result.
Some help exists in the form of its active community, which provides
plenty of examples that can be used and
customized.\footnote{\url{https://bl.ocks.org/}} But even after
finding a suitable example, it still has to be amended in order to
work with the real data instead of the provided sample data, which
increases the effort tremendously when encountering a new situation
where no guidance is available. Despite the complexity, it feels
rewarding to create visualizations with D3, as it gives the developer
full control over the layout and is respectably fast: the rendering
of SVG and HTML5-\texttt{<canvas>} elements happens natively in the
browser and is thus a great deal faster than rendering in plain
JavaScript. The generated SVG elements are part of the DOM in the
browser, just like other HTML elements, and can be inspected just the
same. This property makes logic mistakes easier to resolve, as they
become obvious not only through the faulty placement of graphic
elements on the page but also through a wrong order or position in the
DOM, which is often more precise feedback.
\section{Parallel coordinates}
\label{sec:pcoords}
To visualize multidimensional data, parallel coordinates were
implemented in D3. The technique works by drawing each datum as a
polyline that intersects each axis at the datum's value in the
corresponding dimension.
% cite the origin (maybe also tufte?)
% show a picture of it (how can I input an svg?) maybe with this:
% https://developer.mozilla.org/en-US/docs/Web/API/Canvas_API/Drawing_DOM_objects_into_a_canvas
\chapter{Analysis}\label{analysis}
Different approaches have been discussed or considered for
implementation. The implementation itself is not part of the
internship but will be part of a bachelor's thesis project.
Nevertheless, it was already discussed during the internship, since
the further analysis will process the acquired data, and without this
step the effort of collecting the data would have been futile.
Another student is concerned with differentiating between businesses
and normal users based on their profile picture. If this approach
succeeds, it could on the one hand be used to narrow the relevant data
down further and generate a richer profile of the remaining accounts,
as this makes more queries feasible. On the other hand, it could
eliminate persons who are strictly speaking not part of the company
but employed by it, thus stripping away some valuable information.
While employees usually represent themselves as natural persons on
social media, they are still affiliated with their employer and thus
could offer interesting information or a different perspective.
Since the amount of data is already quite big, the resulting graph can
be sampled in order to generate a smaller sub-graph. There are
algorithms, like Forest Fire~\citep{graph-sampling}, that preserve
properties of a graph while reducing its size.
The relationships are used to construct a graph, on which either
word2vec\penalty5000\citep{word2vec-nips} or RDF2Vec~\citep{rdf2vec}
will be applied. These networks take random walks as input, which need
to be generated first. An approach similar to the random surfer model
can be employed to generate the random walks more efficiently.
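\medskip\noindent
A minimal sketch of generating uniform random walks over an adjacency
map, as they could serve as input for such an embedding (this is an
illustration only, not the more efficient random-surfer variant
mentioned above):
\begin{verbatim}
%% Minimal sketch (uniform random walks, illustrative only): generate
%% a walk of fixed length over an adjacency map.
-module(walks).
-export([random_walk/3]).

%% Graph is a map from a node to the list of its neighbouring nodes.
random_walk(_Graph, Node, 0) ->
    [Node];
random_walk(Graph, Node, Steps) ->
    case maps:get(Node, Graph, []) of
        [] ->
            [Node];
        Neighbours ->
            Next = lists:nth(rand:uniform(length(Neighbours)),
                             Neighbours),
            [Node | random_walk(Graph, Next, Steps - 1)]
    end.
\end{verbatim}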
The idea is that the latent representation from the network will be
able to place related accounts in close proximity. On the latent
representation a clustering algorithm could be employed that shows
those related accounts to a user, who can then observe a more holistic
view of a company. A colleague suggested hierarchical clustering,
since there is an unknown number of clusters in the data.
Furthermore, the hierarchy could also show a more refined structure of
accounts and how close the relationships between subgroups are.
\input{tex/preface}
\input{tex/motivation}
\input{tex/data}
\input{tex/viz}
\input{tex/analysis}
%% \appendix
%% \input{./tex/appendix.tex}