Composing Foundation Models for Decision Making

Ajay, Anurag

dc.contributor.advisor	Agrawal, Pulkit
dc.contributor.author	Ajay, Anurag
dc.date.accessioned	2025-03-12T16:55:56Z
dc.date.available	2025-03-12T16:55:56Z
dc.date.issued	2024-09
dc.date.submitted	2025-03-04T18:28:53.831Z
dc.identifier.uri	https://hdl.handle.net/1721.1/158501
dc.description.abstract	Recent advancements in conditional generative modeling have enabled models like DALLE and GPT-4 to generate high-resolution images and coherent text from brief prompts. However, developing a foundation model for decision-making is hindered by the scarcity and expense of collecting paired visual, language, and action data. To address this challenge, this thesis proposes a scalable alternative: a compositional model architecture that leverages separately trained expert models specializing in language, vision, and action. By reducing the need for extensive paired data collection, this approach maintains efficiency in solving novel decision-making tasks while mitigating the data scarcity problem. Our compositional foundation model employs a large language model for task planning, a video diffusion model to generate detailed video trajectories, and an inverse dynamics model to map videos into actions. We demonstrate the effectiveness of this approach in the context of table-top manipulation tasks. Furthermore, given the application of foundation models across various embodied agents, there is a growing need for systematically evaluating these models’ "common sense" understanding of the world. This evaluation is crucial for the successful deployment of embodied agents in real-world scenarios. To address this need, we introduce the first open-vocabulary benchmark for Embodied Question Answering (EQA). This benchmark assesses the foundation models’ ability to comprehend and reason about the world. In summary, by addressing data scarcity in developing foundation models for decision-making and establishing a benchmark for evaluating the reasoning capabilities of embodied agents, this thesis aims to advance the development of foundation models for decision-making.
dc.publisher	Massachusetts Institute of Technology
dc.rights	Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
dc.rights	Copyright retained by author(s)
dc.rights.uri	https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.title	Composing Foundation Models for Decision Making
dc.type	Thesis
dc.description.degree	Ph.D.
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree	Doctoral
thesis.degree.name	Doctor of Philosophy

Files in this item

Name: ajay-aajay-phd-eecs-2024-thesis.pdf
Size: 24.81Mb
Format: PDF
Description: Thesis PDF

This item appears in the following Collection(s)

Doctoral Theses
