Learning Language with Multimodal Models
Author(s)
Ross, Candace Cheronda
Download
Thesis PDF (4.522Mb)
Advisor
Katz, Boris
Terms of use
Abstract
Language acquisition by children and machines is remarkable. Yet while children learn from hearing a relatively modest amount of language and by interacting with people and the environment around them, neural language models require far more data and supervision, struggle to generalize to new domains, and overwhelmingly learn from text alone. This thesis explores how knowledge about child language acquisition – particularly the scale and type of linguistic information children receive, how they use feedback, and how they generalize in systematic ways beyond the language input they have been exposed to – can be applied to multimodal language models. In particular, this work focuses on (1) training language models with weak supervision using less data by grounding in vision and (2) exploring the generalization abilities of models in multimodal domains. The first approach trains a semantic parser to map from natural language to logical forms using captioned videos, learning without parse trees or any other annotations. The second approach moves from simply observing videos to a more dynamic setup, using a robotic simulator and world states to validate the generated logical forms. These approaches focus on evaluating weak supervision, with training and inference data that are relatively similar; lastly, we explore evaluation where the inference data differs substantially from the training data and requires systematic generalization. One approach tests the role of pretraining and a novel decoding strategy for navigating a grid world, where inference commands and action sequences differ in systematic ways from training. The final approach tests the extent to which pretrained multimodal Transformer models generalize when the demographics in the input images or text differ from their learned social biases.
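The weak-supervision idea the abstract describes can be made concrete with a toy sketch: candidate logical forms for a caption are validated against the paired scene, and that binary grounding signal replaces annotated parse trees. The sketch below is purely illustrative; the scene representation, candidate grammar, and predicate names are invented here and are not the thesis's actual model or data.

```python
# Toy illustration of weakly supervised semantic parsing: candidate
# logical forms are checked against the paired (here: symbolic) scene
# rather than against gold parse trees. All names are hypothetical.

from itertools import product

# A "scene" stands in for a captioned video: a set of true ground facts.
SCENE = {("pick_up", "person", "ball"), ("red", "ball")}

# A tiny candidate space of logical forms, produced by enumerating
# predicate/argument combinations (no parse-tree annotations needed).
PREDICATES = ["pick_up", "put_down"]
AGENTS = ["person", "dog"]
OBJECTS = ["ball", "cup"]

def candidates():
    """Enumerate every candidate logical form the toy grammar allows."""
    for pred, agent, obj in product(PREDICATES, AGENTS, OBJECTS):
        yield (pred, agent, obj)

def consistent(logical_form, scene):
    """Weak supervision signal: does the form hold in the paired scene?"""
    return logical_form in scene

caption = "a person picks up the ball"
# Keep only candidates the scene validates; this binary signal is the
# only supervision the toy parser ever receives for the caption.
viable = [lf for lf in candidates() if consistent(lf, SCENE)]
print(viable)  # [('pick_up', 'person', 'ball')]
```

In the same spirit, the thesis's second approach replaces the static scene check with execution in a robotic simulator, validating generated logical forms against evolving world states.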
Date issued
2022-05
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology