
dc.contributor.advisor: Andrew W. Lo. (en_US)
dc.contributor.author: Li, William (William Pui Lum) (en_US)
dc.contributor.other: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science. (en_US)
dc.date.accessioned: 2016-07-18T19:11:42Z
dc.date.available: 2016-07-18T19:11:42Z
dc.date.copyright: 2016 (en_US)
dc.date.issued: 2016 (en_US)
dc.identifier.uri: http://hdl.handle.net/1721.1/103673
dc.description: Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2016. (en_US)
dc.description: This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. (en_US)
dc.description: Cataloged from student-submitted PDF version of thesis. (en_US)
dc.description: Includes bibliographical references (pages 205-209). (en_US)
dc.description.abstract: This thesis focuses on the development of machine learning and natural language processing methods and their application to large, text-based open government datasets. We focus on models that uncover patterns and insights by inferring the origins of legal and political texts, with a particular emphasis on identifying text reuse and text similarity in these document collections. First, we present an authorship attribution model on unsigned U.S. Supreme Court opinions, offering insights into the authorship of important cases and the dynamics of Supreme Court decision-making. Second, we apply software engineering metrics to analyze the complexity of the United States Code of Laws, thereby illustrating the structure and evolution of the U.S. Code over the past century. Third, we trace policy trajectories of legislative bills in the United States Congress, enabling us to visualize the contents of four key bills during the Financial Crisis. These applications on diverse open government datasets reveal that text reuse occurs widely in legal and political texts: similar ideas often repeat in the same corpus, different historical versions of documents are usually quite similar, or legitimate reasons for copying or borrowing text may exist. Motivated by this observation, we present a novel statistical text model, Probabilistic Text Reuse (PTR), for finding repeated passages of text in large document collections. We illustrate the utility of PTR by finding template ideas, less-common voices, and insights into document structure in a large collection of public comments on regulations proposed by the U.S. Federal Communications Commission (FCC) on net neutrality. These techniques aim to help citizens better understand political processes and help governments better understand political speech. (en_US)
dc.description.statementofresponsibility: by William P. Li. (en_US)
dc.format.extent: 209 pages (en_US)
dc.language.iso: eng (en_US)
dc.publisher: Massachusetts Institute of Technology (en_US)
dc.rights: M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. (en_US)
dc.rights.uri: http://dspace.mit.edu/handle/1721.1/7582 (en_US)
dc.subject: Electrical Engineering and Computer Science. (en_US)
dc.title: Language technologies for understanding law, politics, and public policy (en_US)
dc.type: Thesis (en_US)
dc.description.degree: Ph. D. (en_US)
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
dc.identifier.oclc: 953524878 (en_US)

