Synthetic-to-real transfer for natural language processing

Marzoev, Michelle Alana.

Other Contributors

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.

Advisor

Jacob Andreas.

Terms of use

MIT theses may be protected by copyright. Please reuse MIT thesis content according to the MIT Libraries Permissions Policy, which is available through the URL provided. http://dspace.mit.edu/handle/1721.1/7582

Abstract

Large, human-annotated datasets are central to the development of natural language processing models. Collecting these datasets is often the most challenging part of the development process. In this thesis, I explore different strategies for learning models that can interpret natural utterances without natural training data through "simulation-to-real" transfer techniques suited to language understanding problems with a delimited set of target behaviors. Each of the transfer techniques requires access to a manually-specified synthetic data generation procedure (i.e. a "synthetic grammar") as a source of unlimited but linguistically homogeneous training data. This data is used to train models that can accurately interpret utterances from the synthetic grammar. Through experiments, I demonstrate that the most effective method for sim-to-real transfer involves automatically finding projections of natural language utterances onto the support of the synthetic language, using learned sentence embeddings to define a distance metric. With only synthetic training data, the projections approach matches or outperforms state-of-the-art models trained on natural language data on grounded instruction following and semantic parsing problems. These results suggest that simulation-to-real transfer could be a practical framework for developing NLP applications with defined target behaviors in cases where natural in-domain training data is not readily available.
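
The projection idea described in the abstract can be illustrated with a minimal sketch: a natural utterance is mapped to its nearest neighbor among the utterances a synthetic grammar can produce, and the interpretation of that neighbor is reused. Everything concrete here is an assumption made for illustration and is not taken from the thesis: the three-sentence grammar, the behavior strings, the function names, and the bag-of-words embedding, which stands in for the learned sentence embeddings the abstract mentions.

```python
import math
from collections import Counter

# Hypothetical toy "synthetic grammar": each synthetic utterance is paired
# with the target behavior it denotes. Not taken from the thesis.
SYNTHETIC = {
    "go to the red door": "GOTO(red, door)",
    "pick up the blue key": "PICKUP(blue, key)",
    "open the green box": "OPEN(green, box)",
}


def embed(sentence):
    """Stand-in embedding: bag-of-words counts.

    Assumption: the thesis uses learned sentence embeddings; word counts
    are used here only to keep the sketch self-contained.
    """
    return Counter(sentence.lower().split())


def cosine_distance(a, b):
    """Cosine distance between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - dot / norm if norm else 1.0


def project(utterance):
    """Return the synthetic utterance closest to the natural one."""
    query = embed(utterance)
    return min(SYNTHETIC, key=lambda s: cosine_distance(query, embed(s)))


def interpret(utterance):
    """Interpret a natural utterance via its synthetic projection."""
    return SYNTHETIC[project(utterance)]
```

In this sketch, `project("please walk over to the red door")` selects "go to the red door", so the natural utterance inherits that sentence's interpretation without any natural training data. A realistic grammar would have far too large a support to enumerate, so the exhaustive search in `project` should be read only as the simplest possible instance of the idea.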

Description

Thesis: S.M., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, February, 2021

Cataloged from the official PDF version of thesis.

Includes bibliographical references (pages 41-42).

Date issued

2021

URI

https://hdl.handle.net/1721.1/130787

Department

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science

Publisher

Massachusetts Institute of Technology

Keywords

Electrical Engineering and Computer Science.

Collections

Graduate Theses