dc.contributor.advisor: Katz, Boris
dc.contributor.author: Ross, Candace Cheronda
dc.date.accessioned: 2022-08-29T16:02:30Z
dc.date.available: 2022-08-29T16:02:30Z
dc.date.issued: 2022-05
dc.date.submitted: 2022-06-21T19:15:08.087Z
dc.identifier.uri: https://hdl.handle.net/1721.1/144654
dc.description.abstract: Language acquisition by children and machines is remarkable. Yet while children learn from hearing a relatively modest amount of language and by interacting with people and the environment around them, neural language models require far more data and supervision, struggle to generalize to new domains, and overwhelmingly learn from text alone. This thesis explores how knowledge about child language acquisition – particularly the scale and type of linguistic information children receive, how they use feedback, and how they generalize in systematic ways beyond the language input they have been exposed to – can be applied to multimodal language models. In particular, this work focuses on (1) training language models with weak supervision and less data by grounding language in vision and (2) exploring the generalization abilities of models in multimodal domains. The first approach trains a semantic parser to map from natural language to logical forms using captioned videos, learning without parse trees or any other annotations. The second approach moves from simply observing videos to a more dynamic setup, using a robotic simulator and world states to validate the generated logical forms. These two approaches focus on evaluating weak supervision with training and inference data that are relatively similar; lastly, we explore evaluation settings where the inference data differ substantially from training and require systematic generalization. One approach tests the role of pretraining and a novel decoding strategy for navigating in a grid world, where inference commands and action sequences differ in systematic ways from training. The final approach tests the extent to which pretrained multimodal Transformer models generalize when the demographics in the input images or text differ from their learned social biases.
dc.publisher: Massachusetts Institute of Technology
dc.rights: In Copyright - Educational Use Permitted
dc.rights: Copyright MIT
dc.rights.uri: http://rightsstatements.org/page/InC-EDU/1.0/
dc.title: Learning Language with Multimodal Models
dc.type: Thesis
dc.description.degree: Ph.D.
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree: Doctoral
thesis.degree.name: Doctor of Philosophy

