Composing Foundation Models for Decision Making

Ajay, Anurag

Author(s)

Ajay, Anurag

DownloadThesis PDF (24.81Mb)

Advisor

Agrawal, Pulkit

Terms of use

Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) Copyright retained by author(s) https://creativecommons.org/licenses/by-nc-nd/4.0/

Metadata

Show full item record

Abstract

Recent advancements in conditional generative modeling have enabled models like DALLE and GPT-4 to generate high-resolution images and coherent text from brief prompts. However, developing a foundation model for decision-making is hindered by the scarcity and expense of collecting paired visual, language, and action data. To address this challenge, this thesis proposes a scalable alternative: a compositional model architecture that leverages separately trained expert models specializing in language, vision, and action. By reducing the need for extensive paired data collection, this approach maintains efficiency in solving novel decision-making tasks while mitigating the data scarcity problem. Our compositional foundation model employs a large language model for task planning, a video diffusion model to generate detailed video trajectories, and an inverse dynamics model to map videos into actions. We demonstrate the effectiveness of this approach in the context of table-top manipulation tasks. Furthermore, given the application of foundation models across various embodied agents, there is a growing need for systematically evaluating these models’ "common sense" understanding of the world. This evaluation is crucial for the successful deployment of embodied agents in real-world scenarios. To address this need, we introduce the first open-vocabulary benchmark for Embodied Question Answering (EQA). This benchmark assesses the foundation models’ ability to comprehend and reason about the world. In summary, by addressing data scarcity in developing foundation models for decision-making and establishing a benchmark for evaluating the reasoning capabilities of embodied agents, this thesis aims to advance the development of foundation models for decision-making.

Date issued

2024-09

URI

https://hdl.handle.net/1721.1/158501

Department

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science

Publisher

Massachusetts Institute of Technology

Collections

Doctoral Theses