Synthetic-to-real transfer for natural language processing

Marzoev, Michelle Alana.

dc.contributor.advisor	Jacob Andreas.	en_US
dc.contributor.author	Marzoev, Michelle Alana.	en_US
dc.contributor.other	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.	en_US
dc.date.accessioned	2021-05-24T20:23:53Z
dc.date.available	2021-05-24T20:23:53Z
dc.date.copyright	2021	en_US
dc.date.issued	2021	en_US
dc.identifier.uri	https://hdl.handle.net/1721.1/130787
dc.description	Thesis: S.M., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, February, 2021	en_US
dc.description	Cataloged from the official PDF version of thesis.	en_US
dc.description	Includes bibliographical references (pages 41-42).	en_US
dc.description.abstract	Large, human-annotated datasets are central to the development of natural language processing models. Collecting these datasets is often the most challenging part of the development process. In this thesis, I explore different strategies for learning models that can interpret natural utterances without natural training data through "simulation-to-real" transfer techniques suited to language understanding problems with a delimited set of target behaviors. Each of the transfer techniques requires access to a manually-specified synthetic data generation procedure (i.e. a "synthetic grammar") as a source of unlimited but linguistically homogeneous training data. This data is used to train models that can accurately interpret utterances from the synthetic grammar. Through experiments, I demonstrate that the most effective method for sim-to-real transfer involves automatically finding projections of natural language utterances onto the support of the synthetic language, using learned sentence embeddings to define a distance metric. With only synthetic training data, the projections approach matches or outperforms state-of-the-art models trained on natural language data on grounded instruction following and semantic parsing problems. These results suggest that simulation-to-real transfer could be a practical framework for developing NLP applications with defined target behaviors in cases where natural in-domain training data is not readily available.	en_US
dc.description.statementofresponsibility	by Michelle Alana Marzoev.	en_US
dc.format.extent	42 pages	en_US
dc.language.iso	eng	en_US
dc.publisher	Massachusetts Institute of Technology	en_US
dc.rights	MIT theses may be protected by copyright. Please reuse MIT thesis content according to the MIT Libraries Permissions Policy, which is available through the URL provided.	en_US
dc.rights.uri	http://dspace.mit.edu/handle/1721.1/7582	en_US
dc.subject	Electrical Engineering and Computer Science.	en_US
dc.title	Synthetic-to-real transfer for natural language processing	en_US
dc.type	Thesis	en_US
dc.description.degree	S.M.	en_US
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science	en_US
dc.identifier.oclc	1252064308	en_US
dc.description.collection	S.M. Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science	en_US
dspace.imported	2021-05-24T20:23:53Z	en_US
mit.thesis.degree	Master	en_US
mit.thesis.department	EECS	en_US

Files in this item

Name: 1252064308-MIT.pdf
Size: 1.366Mb
Format: PDF

This item appears in the following Collection(s)

Graduate Theses