Automatic identification of representative content on Twitter

Vijayaraghavan, Prashanth

Author(s)

Vijayaraghavan, Prashanth

Other Contributors

Program in Media Arts and Sciences (Massachusetts Institute of Technology)

Advisor

Deb Roy.

Terms of use

M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. http://dspace.mit.edu/handle/1721.1/7582

Abstract

Microblogging services, most notably Twitter, have become popular avenues for voicing opinions and participating actively in discourse on a wide range of topics. As a consequence, Twitter has become an important part of the political battleground that journalists and political analysts can harness to analyze and understand the narratives that organically form, spread and decline among the public in a political campaign. A challenge with social media is that important discussions around certain issues can be overpowered by majoritarian or controversial topics that provoke strong reactions and attract large audiences. In this thesis we develop a method to identify the specific ideas and sentiments that represent the overall conversation surrounding a topic or event as reflected in collections of tweets. We have developed this method in the context of the 2016 US presidential elections. We present and evaluate a large-scale data analytics framework, based on recent advances in deep neural networks, for identifying and analyzing election-related conversation on Twitter on a continuous, longitudinal basis in order to identify representative tweets across prominent election issues. The framework consists of two main components: (1) a dynamic topic model that identifies all tweets related to election issues using knowledge from news stories and continuous learning of Twitter's evolving vocabulary, and (2) a semantic model of tweets called Tweet2Vec that generates general-purpose tweet embeddings used for identifying representative tweets by robust semantic clustering. The topic model achieved an average F1 score of 0.90 across 22 different election topics on a manually annotated dataset. Tweet2Vec outperformed state-of-the-art algorithms on widely used semantic relatedness and sentiment classification evaluation tasks.
To demonstrate the value of the framework, we analyzed tweets leading up to a primary debate and contrasted the automatically identified representative tweets with those that were actually used in the debate. The system was able to identify tweets that represented more semantically diverse conversations around each of the major election issues, in comparison to those that were presented during the debate. This framework may have a broad range of applications, from enabling exemplar-based methods for understanding the gist of large collections of tweets, extensible perhaps to other forms of short text documents, to providing an input for new forms of data-grounded journalism and debate.
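
As a rough illustration of what "identifying representative tweets by semantic clustering" could look like, the sketch below groups tweet embeddings by cosine similarity and returns the most central tweet of each group. It is not the thesis's method: the abstract does not say which clustering algorithm is used, so the greedy leader clustering, the medoid selection rule, the 0.8 threshold, the function names, and the toy 2-D vectors standing in for Tweet2Vec embeddings are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def leader_clusters(vectors, threshold=0.8):
    """Greedy clustering (an assumed stand-in, not the thesis's algorithm).

    Each vector joins the first cluster whose leader (its first member)
    is at least `threshold` similar; otherwise it starts a new cluster.
    """
    clusters = []
    for i, v in enumerate(vectors):
        for members in clusters:
            if cosine(v, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def medoid(members, vectors):
    """Index of the member with the highest total similarity to its cluster."""
    return max(
        members,
        key=lambda i: sum(cosine(vectors[i], vectors[j]) for j in members),
    )


def representative_tweets(tweets, vectors, threshold=0.8):
    """Return one representative (medoid) tweet per cluster, largest first."""
    clusters = leader_clusters(vectors, threshold)
    clusters.sort(key=len, reverse=True)  # stable: ties keep discovery order
    return [tweets[medoid(c, vectors)] for c in clusters]


# Hypothetical tweets with made-up 2-D embeddings (real embeddings would
# come from a trained model and have far more dimensions).
tweets = [
    "We need a real jobs plan",
    "Jobs and wages should be the top issue",
    "Where is the plan for jobs?",
    "Immigration reform cannot wait",
    "Candidates must address immigration",
    "Immigration policy needs a serious debate",
]
vectors = [
    (1.0, 0.1), (1.0, 0.0), (1.0, -0.1),
    (0.1, 1.0), (0.0, 1.0), (-0.1, 1.0),
]

for tweet in representative_tweets(tweets, vectors):
    print(tweet)
```

Picking a medoid rather than a centroid means each representative is an actual tweet from the collection, which fits the exemplar-based reading of the abstract; a production system would likely use a more robust clustering method than this greedy pass.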

Description

Thesis: S.M., Massachusetts Institute of Technology, School of Architecture and Planning, Program in Media Arts and Sciences, 2016.

Cataloged from PDF version of thesis.

Includes bibliographical references (pages 97-103).

Date issued

2016

URI

http://hdl.handle.net/1721.1/106045

Department

Program in Media Arts and Sciences (Massachusetts Institute of Technology)

Publisher

Massachusetts Institute of Technology

Keywords

Program in Media Arts and Sciences

Collections

Graduate Theses