Unsupervised learning of lexical subclasses from phonotactics

Morita, Takashi, Ph. D. Massachusetts Institute of Technology

dc.contributor.advisor	Adam Albright.	en_US
dc.contributor.author	Morita, Takashi, Ph. D. Massachusetts Institute of Technology	en_US
dc.contributor.other	Massachusetts Institute of Technology. Department of Linguistics and Philosophy.	en_US
dc.date.accessioned	2019-03-01T19:34:06Z
dc.date.available	2019-03-01T19:34:06Z
dc.date.copyright	2018	en_US
dc.date.issued	2018	en_US
dc.identifier.uri	http://hdl.handle.net/1721.1/120612
dc.description	Thesis: Ph. D. in Linguistics, Massachusetts Institute of Technology, Department of Linguistics and Philosophy, 2018.	en_US
dc.description	This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.	en_US
dc.description	Cataloged from student-submitted PDF version of thesis.	en_US
dc.description	Includes bibliographical references (pages 203-215).	en_US
dc.description.abstract	Languages are constantly borrowing words from one another. Since the donor and recipient languages typically differ in their phonology and phonotactics, the native words and the loanwords of the borrower language can also exhibit different phonology/phonotactics. Accordingly, it has been proposed that the phonotactics of languages such as Japanese is better explained if words are classified into etymologically defined sublexica. However, this sublexical analysis is challenged by a learnability problem: the sublexical membership of words is not directly observable. This study applies a state-of-the-art clustering method (a Dirichlet process mixture model) to a substantial number of Japanese and English words extracted from corpora. It turns out that the predicted clusters largely correspond to the etymologically defined sublexica. Since the clustering method is domain-general and not specialized to sublexicon identification, the results can be taken as statistical evidence for the heterogeneous lexica of the two languages. Moreover, the unsupervised nature of the clustering method demonstrates the learnability of sublexica from naturalistic data. The learned sublexica also replicate linguistic characterizations of actual sublexica proposed in previous literature, such as the biased distribution of (certain substrings of) segments to particular sublexica. In addition, the learned sublexica make informative predictions based on previous experimental studies. These results suggest that the predicted sublexica are linguistically sound. Finally, the predicted sublexica reveal hitherto unnoticed phonotactic properties. These discoveries can be used for further investigation of native speakers' knowledge.	en_US
dc.description.statementofresponsibility	by Takashi Morita.	en_US
dc.format.extent	215 pages	en_US
dc.language.iso	eng	en_US
dc.publisher	Massachusetts Institute of Technology	en_US
dc.rights	MIT theses are protected by copyright. They may be viewed, downloaded, or printed from this source but further reproduction or distribution in any format is prohibited without written permission.	en_US
dc.rights.uri	http://dspace.mit.edu/handle/1721.1/7582	en_US
dc.subject	Linguistics and Philosophy.	en_US
dc.title	Unsupervised learning of lexical subclasses from phonotactics	en_US
dc.type	Thesis	en_US
dc.description.degree	Ph. D. in Linguistics	en_US
dc.contributor.department	Massachusetts Institute of Technology. Department of Linguistics and Philosophy
dc.identifier.oclc	1088558202	en_US

Files in this item

Name: 1088558202-MIT.pdf
Size: 3.587Mb
Format: PDF
Description: Full printable version

This item appears in the following Collection(s)

Doctoral Theses