Composing Foundation Models for Decision Making
Author(s)
Ajay, Anurag
DownloadThesis PDF (24.81Mb)
Advisor
Agrawal, Pulkit
Terms of use
Metadata
Show full item recordAbstract
Recent advancements in conditional generative modeling have enabled models like DALLE and GPT-4 to generate high-resolution images and coherent text from brief prompts. However, developing a foundation model for decision-making is hindered by the scarcity and expense of collecting paired visual, language, and action data. To address this challenge, this thesis proposes a scalable alternative: a compositional model architecture that leverages separately trained expert models specializing in language, vision, and action. By reducing the need for extensive paired data collection, this approach maintains efficiency in solving novel decision-making tasks while mitigating the data scarcity problem. Our compositional foundation model employs a large language model for task planning, a video diffusion model to generate detailed video trajectories, and an inverse dynamics model to map videos into actions. We demonstrate the effectiveness of this approach in the context of table-top manipulation tasks. Furthermore, given the application of foundation models across various embodied agents, there is a growing need for systematically evaluating these models’ "common sense" understanding of the world. This evaluation is crucial for the successful deployment of embodied agents in real-world scenarios. To address this need, we introduce the first open-vocabulary benchmark for Embodied Question Answering (EQA). This benchmark assesses the foundation models’ ability to comprehend and reason about the world. In summary, by addressing data scarcity in developing foundation models for decision-making and establishing a benchmark for evaluating the reasoning capabilities of embodied agents, this thesis aims to advance the development of foundation models for decision-making.
Date issued
2024-09Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer SciencePublisher
Massachusetts Institute of Technology