Content modeling for social media text

Sauper, Christina (Christina Joan)

Author(s)

Sauper, Christina (Christina Joan)

DownloadFull printable version (17.27Mb)

Other Contributors

Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science.

Advisor

Regina Barzilay.

Terms of use

M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. http://dspace.mit.edu/handle/1721.1/7582

Metadata

Show full item record

Abstract

This thesis focuses on machine learning methods for extracting information from user-generated content. Instances of this data such as product and restaurant reviews have become increasingly valuable and influential in daily decision making. In this work, I consider a range of extraction tasks such as sentiment analysis and aspect-based review aggregation. These tasks have been well studied in the context of newswire documents, but the informal and colloquial nature of social media poses significant new challenges. The key idea behind our approach is to automatically induce the content structure of individual documents given a large, noisy collection of user-generated content. This structure enables us to model the connection between individual documents and effectively aggregate their content. The models I propose demonstrate that content structure can be utilized at both document and phrase level to aid in standard text analysis tasks. At the document level, I capture this idea by joining the original task features with global contextual information. The coupling of the content model and the task-specific model allows the two components to mutually influence each other during learning. At the phrase level, I utilize a generative Bayesian topic model where a set of properties and corresponding attribute tendencies are represented as hidden variables. The model explains how the observed text arises from the latent variables, thereby connecting text fragments with corresponding properties and attributes.

Description

Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012.

Cataloged from PDF version of thesis.

Includes bibliographical references (p. 129-136).

Date issued

2012

URI

http://hdl.handle.net/1721.1/75648

Department

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science

Publisher

Massachusetts Institute of Technology

Keywords

Electrical Engineering and Computer Science.

Collections

Doctoral Theses