Syntactically annotated Ngrams for Google Books
Author(s)
Lin, Yuri, M. Eng. Massachusetts Institute of Technology
Other Contributors
Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science.
Advisor
Dorothy Curtis and Slav Petrov.
Abstract
In this thesis, we present a new edition of the Google Books Ngram Corpus, describing how often words and phrases were used over a period of five centuries, in eight languages; it aggregates data from 6% of all books ever published. This new edition introduces syntactic annotations: words are tagged with their part of speech, and head-modifier dependency relationships are recorded. We generate these annotations automatically from the Google Books text, using statistical models that are specifically adapted to the historical text found in these books. The new edition will facilitate the study of linguistic trends, especially those related to the evolution of syntax. We present our initial findings from the annotated Ngrams in the new edition, including studies of how various words' primary parts of speech have changed over time and of which words are most closely related to a given set of topics.
Description
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012. Cataloged from PDF version of thesis. Includes bibliographical references (p. 101-102).
Date issued
2012
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology
Keywords
Electrical Engineering and Computer Science.