Commit a4b74c9a authored by Jan Niklas Böhm

Write boatloads of text, everywhere

what a messy commit, whoops!
parent 9b4d2260
@inproceedings{egonets,
author = {Julian J. McAuley and
Jure Leskovec},
title = {Learning to Discover Social Circles in Ego Networks},
booktitle = {Advances in Neural Information Processing Systems 25: 26th Annual
Conference on Neural Information Processing Systems 2012. Proceedings
of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States.},
pages = {548--556},
year = {2012},
}
@inproceedings{twitter-dataset2011,
author = {Jaewon Yang and
Jure Leskovec},
title = {Patterns of temporal variation in online media},
booktitle = {Proceedings of the Fourth International Conference on Web Search and
Web Data Mining, {WSDM} 2011, Hong Kong, China, February 9-12, 2011},
pages = {177--186},
year = {2011},
}
@book{dlbook,
title={Deep Learning},
author={Ian Goodfellow and Yoshua Bengio and Aaron Courville},
publisher={MIT Press},
year={2016}
}
@article{node2vec,
author = {Aditya Grover and
Jure Leskovec},
title = {node2vec: Scalable Feature Learning for Networks},
journal = {CoRR},
volume = {abs/1607.00653},
year = {2016},
}
@techreport{pagerank,
number = {1999-66},
month = {November},
author = {Lawrence Page and Sergey Brin and Rajeev Motwani and Terry Winograd},
note = {Previous number = SIDL-WP-1999-0120},
title = {The PageRank Citation Ranking: Bringing Order to the Web.},
type = {Technical Report},
publisher = {Stanford InfoLab},
year = {1999},
institution = {Stanford InfoLab},
url = {http://ilpubs.stanford.edu:8090/422/},
abstract = {The importance of a Web page is an inherently subjective matter, which depends on the reader's interests, knowledge and attitudes. But there is still much that can be said objectively about the relative importance of Web pages. This paper describes PageRank, a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them. We compare PageRank to an idealized random Web surfer. We show how to efficiently compute PageRank for large numbers of pages. And, we show how to apply PageRank to search and to user navigation.}
}
......
......@@ -28,7 +28,7 @@ content employees are while working for their company.
\medskip\noindent
%
By considering properties of a tweet as an edge that links to other
nodes, like users, tweets, or external hyperlinks, a graph can be
nodes like users, tweets, or external hyperlinks, a graph can be
constructed, similar to a social network graph. This is the final
data structure that should be processed by the machine learning
algorithm.
......@@ -43,14 +43,45 @@ representation of how it could have looked at a previous time. The
relationships are used to construct a graph, on which either
word2vec\penalty5000\ \citep{word2vec-nips} or RDF2Vec~\citep{rdf2vec}
will be applied. % autoencoders, what do they do, how do they work?
%
A third option is node2vec, which works similarly to the previous two
approaches but is also somewhat robust to missing edges in the graph
network, judging by its F$_1$ score~\citep{node2vec}.
The options listed above are all autoencoders, which means that they
attempt to learn a representation from the data in an unsupervised
way. Autoencoders are similar to neural networks in that they can be
trained with the same methods as these. But instead of having a
target that has previously been assigned to the training data, autoencoders
have the input as their target that needs to be optimized. Since this
problem could be perfectly solved by applying the identity function,
further restrictions need to be employed that force the autoencoder
to learn the underlying representation~\citep{dlbook}.
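One common restriction of this kind, sketched here for illustration only, is to make the learned code smaller than the input. The symbols $f$, $g$, $L$, $n$, and $k$ are introduced for this sketch and do not appear elsewhere in the report. An encoder $f$ maps an input $x$ of dimension $n$ to a code of dimension $k$, a decoder $g$ maps the code back to dimension $n$, and both are trained to minimize the reconstruction loss
\[
  L\bigl(x,\, g(f(x))\bigr),
\]
where $L$ penalizes the difference between the input and its reconstruction, for example the squared error $\lVert x - g(f(x)) \rVert^2$. Choosing $k < n$ means that the composition of $g$ and $f$ cannot simply be the identity function on all inputs.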
The learning process needs to be unsupervised since the data from
Twitter is not labeled. Still, the output from the algorithms
needs to be evaluated in order to have a meaning and concept
of worth attached to it. Thus a benchmark is needed. The baseline
for it will be created by establishing a rank for each of the
thirty constituents of the DAX, if they have a verified Twitter
account – which further narrows it down to twenty-three. The
ranking will display how connected other accounts are to one of the
DAX accounts. By sorting the accounts by their rank, a list can be
assembled for each account, which can be scanned by a human for
actual relatedness between the accounts.
The graph is currently being created but poses some problems due to
the size of the data. It may be necessary to use a graph database to
construct a graph and then query it in order to compile the list, as
it looks like the graph will not fit entirely into memory.
The networks take random walks as input. These random walks need to
be generated first. Since random walks have a lot of similarities with
the “random surfer” in \citet{pagerank}, an approach for generating
random walks would be to calculate the transition probability matrices
and draw the random walks from this instead of creating them over the
The neural networks all take random walks as input. These random walks need to
be generated first. Since random walks share a lot of properties with
the “random surfer” described in \citet{pagerank}, an approach for generating
random walks would be to calculate the transition probability matrix
and generate random walks by reading the probabilities from the matrix instead of creating them over the
graph. This approach should allow a more efficient generation process
as the matrix operations are implemented more efficiently in the
as matrix operations are implemented more efficiently in the
language than custom code.
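The idea of drawing walks from a transition probability matrix can be sketched as follows. This is a minimal illustration and not the implementation used in the project: the function names, the toy graph, and the handling of nodes without outgoing edges are assumptions made for the example, and it uses only the Python standard library, whereas a real implementation would rely on an optimized matrix library.

```python
import random


def transition_matrix(adjacency):
    """Row-normalize a (weighted) adjacency matrix into transition probabilities."""
    matrix = []
    for i, row in enumerate(adjacency):
        total = sum(row)
        if total == 0:
            # Assumption for this sketch: a node without outgoing edges stays put.
            matrix.append([1.0 if j == i else 0.0 for j in range(len(row))])
        else:
            matrix.append([weight / total for weight in row])
    return matrix


def random_walk(matrix, start, length, rng):
    """Draw a walk of `length` nodes by sampling from the current node's row."""
    walk = [start]
    nodes = range(len(matrix))
    while len(walk) < length:
        current = walk[-1]
        walk.append(rng.choices(nodes, weights=matrix[current])[0])
    return walk


# Hypothetical toy graph with edges 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
adjacency = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
]
P = transition_matrix(adjacency)

# One walk of five nodes per start node, each with its own seeded generator.
walks = [random_walk(P, start, 5, random.Random(start)) for start in range(3)]
```

Each row of the matrix is computed once and then reused for every step that leaves the corresponding node, which is the part that a matrix library could carry out for the whole graph at once.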
The idea is that the latent representation that has been created from the network will be
......
......@@ -49,7 +49,7 @@ As a result, the kinds of data that is saved is limited to all
verified users that are being repeatedly downloaded every six hours
and all tweets that were tweeted by these. Additionally, all tweets
that mention those users are saved as well. The API limits the tweets
originating from a user to the last 3200, while the tweets that
originating from a user to the last 3200 tweets, while the tweets that
mention a certain user are limited to tweets that are younger than a
week.
......@@ -289,7 +289,7 @@ As an easy fix, the application and database could be run on a
different machine that has 492~GB of RAM available. This was
attempted but ultimately proved unfitting. On this machine, named
“kolmar”, the memory did not run out, but the application still
crashed and some kernel messages were visible when calling
crashed and kernel messages were generated because of some unknown problem. They became apparent when calling
\texttt{dmesg}. The reasons for this were unknown but they left the
entire machine in an unstable state, which forced a restart of kolmar.
Even after the restart, the application still did not run
......
......@@ -4,8 +4,10 @@ Companies use social media to interact with potential customers and
generally to advertise themselves online. The way a company presents
itself online gets more and more important as the social networks like
Twitter or Facebook expand and the influence over users increases as
they continue to spend more and more time on the social outlets.
% cite something (maybe sources from statista?)
they continue to spend more and more time online and due to that on
social outlets as
well.\footnote{\url{https://www.destatis.de/DE/ZahlenFakten/GesellschaftStaat/EinkommenKonsumLebensbedingungen/ITNutzung/Tabellen/ZeitvergleichComputernutzung_IKT.html}}
The structure of the online presence for a company grows most of the
time organically, and is often split up over multiple accounts, especially
......@@ -71,8 +73,30 @@ and serve as a weight and thus form a weighted graph between users
only. This representation will be more sparse than one only
considering friendship but can be able to detect influences better.
Another point for tweets is that they can also capture negative
emotions, in contrast to friendship, which most of the time has a
Another point for tweets is that they may also be generated when negative
emotions are expressed towards a company or another user. This contrasts with friendship, which most of the time has a
positive connotation. By tweeting out to a company in anger over the
latest scandal, a connection is formed that would not be visible by
narrowing the scope down to followers.
There has been a dataset consisting of Twitter data published and
analyzed in~\citet{twitter-dataset2011}. This dataset had to be taken
down due to a request from
Twitter.\footnote{\url{https://snap.stanford.edu/data/twitter7.html}}
Their research focused on the temporal aspects of different hashtags
or URLs that have been posted during the eight-month period during which
they were creating the dataset.
Another dataset from the same group that still is available has been
presented in~\citet{egonets}. The detection of social circles is the
goal in the paper, but they do admit that the data from Twitter may
not be up-to-date. Furthermore, the gathered data is mostly attached
to users and the stated goal for this project is the analysis of
company accounts.
This is also a more general problem; most research based on social
media data puts its focus on the social aspect, whereas company
interests are not social, and it could be argued that the commercial
gain they want from their interest goes against the benefit that
social media promises a normal user, in essence keeping up with
friends, family, and acquaintances.
......@@ -6,13 +6,18 @@ to me and presented some novel problems that I did not anticipate.
Solving these was educational. In general, I did not expect the
development to take as long as it did. This leaves me a bit
dissatisfied with the result of my internship, as I had hoped to get
at the very least some results from a preliminiary analysis. Instead,
the storage of the data took much longer than expected, as did the
visualization. This is partly to blame on the technology stack as I
at the very least some results from a preliminary analysis. Instead,
the process of getting and storing the data took much longer than expected, as did the
visualization. More generally, even simple tasks like iterating over the entire
collection become quite expensive once the data grows into the first billion.
Nevertheless,
the technology stack is also partly to blame as I
did not try the best available options, but instead often went with
my gut feeling. I am grateful that the environment in which I
completed my internship allowed such experiments as I still learned
much along the way.
much along the way, but in some cases they ate up time for a questionable return.
Personally, I am glad that I got to try out the technologies I picked,
but maybe I should have started asking other colleagues earlier for advice.
Starting to build up an experiment from the ground up shows how much
work actually goes into producing scientific results. There are many
......@@ -27,13 +32,28 @@ could arise in many domains that are related to computer science.
Personally, a bit more guidance would have helped. I was not the
first person to work on this topic; another student worked on this
topic in the scope of his master's thesis. He has since left the DFKI
topic already in the scope of his master's thesis. He has since left the DFKI
so I could only inspect the documentation he left behind. The work he
invested, together with his written texts proved unhelpful. There
invested, together with his written texts, proved unhelpful. There
were neither references to other scientific work in the field that
could have served as a starting point, nor was there any reference to
the data he used, which might have been used as a stepping stone.
In general I was not able to find related literature easily, and as
Still, I am glad that I was allowed to work independently and had
largely free rein over the project I have been assigned. I believe
using Erlang as the language to interact with the API proved helpful
as the handling of errors is a core concept of the language and even
restarts due to grave ones are a common way to deal with errors in
Erlang. I am less happy about the choice of using D3 for visualizing
some values. While I am happy with the output it produces, and the
underlying concepts are carefully thought out, using it requires a lot
of upfront work to get started since the library allows control over
every meticulous detail by simply generating elements defined by web
standards, which in turn carry their own complexity. If I had more
experience with web development in general or had used the library
before, its usage might have proved more successful.
In general it was not easy to find related literature, and as
the majority of the effort went into creating a dataset there was not
much time left to exercise an in-depth literature study.
enough time left to exercise an in-depth literature study.
It is still planned for the future to take another look into related work.
......@@ -39,7 +39,7 @@ work around.
The work and its progress have been discussed weekly in group meetings,
where the entire group was able to give input. Besides two one-on-one
meetings with Damian, some longer personal discussions have been had
with Christian Schulze, Jörn Hees, and Benjamin Bischke. In addition
with Christian Schulze, who also gave technical support, as well as Jörn Hees and Benjamin Bischke. In addition
to these discussions, more informal discussions with Marco Schreyer, Patrick
Helber, Tim Hertweck, and Sebastian Schreiber helped to further
clarify some aspects of scientific work.
......@@ -28,7 +28,7 @@ the shape of the elements need to be defined as an SVG element. The
detailed declaration of the arrangement gives a lot of control to the
layout and the display of the result.
Some help exists on the form of its active community, where plenty of
Help exists in the form of its active community, where plenty of
examples are provided that can be used and
customized.\footnote{\url{https://bl.ocks.org/}} But even after
finding a suitable example, it still has to be amended in order to
......