DSpace@MIT

Composing Foundation Models for Decision Making

Author(s)
Ajay, Anurag
Advisor
Agrawal, Pulkit
Terms of use
Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) Copyright retained by author(s) https://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract
Recent advancements in conditional generative modeling have enabled models like DALL-E and GPT-4 to generate high-resolution images and coherent text from brief prompts. However, developing a foundation model for decision-making is hindered by the scarcity and expense of collecting paired visual, language, and action data. To address this challenge, this thesis proposes a scalable alternative: a compositional model architecture that leverages separately trained expert models specializing in language, vision, and action. By reducing the need for extensive paired data collection, this approach mitigates the data scarcity problem while remaining efficient at solving novel decision-making tasks. Our compositional foundation model employs a large language model for task planning, a video diffusion model to generate detailed video trajectories, and an inverse dynamics model to map videos into actions. We demonstrate the effectiveness of this approach on table-top manipulation tasks. Furthermore, as foundation models are applied across various embodied agents, there is a growing need to systematically evaluate these models’ "common sense" understanding of the world. This evaluation is crucial for the successful deployment of embodied agents in real-world scenarios. To address this need, we introduce the first open-vocabulary benchmark for Embodied Question Answering (EQA), which assesses foundation models’ ability to comprehend and reason about the world. In summary, by addressing data scarcity and by establishing a benchmark for evaluating the reasoning capabilities of embodied agents, this thesis aims to advance the development of foundation models for decision-making.
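The three-stage composition described in the abstract (language model for planning, video model for trajectories, inverse dynamics model for actions) can be pictured as a chain of functions. The sketch below is only an illustration under stated assumptions, not the thesis's implementation: `ComposedPolicy`, `plan_subgoals`, `generate_video`, `infer_action`, and the toy stand-ins are all hypothetical names, and the open-loop execution (each generated video is assumed to be followed exactly) is a simplification made for this example.

```python
from dataclasses import dataclass
from typing import Callable, List

Frame = List[float]   # hypothetical stand-in for an image observation
Action = List[float]  # hypothetical stand-in for a robot command


@dataclass
class ComposedPolicy:
    """Chains three separately built components (all names are illustrative)."""
    plan_subgoals: Callable[[str], List[str]]            # language-model stage
    generate_video: Callable[[Frame, str], List[Frame]]  # video-generation stage
    infer_action: Callable[[Frame, Frame], Action]       # inverse-dynamics stage

    def act(self, task: str, observation: Frame) -> List[Action]:
        actions: List[Action] = []
        current = observation
        for subgoal in self.plan_subgoals(task):
            # Imagine a sequence of future frames for this subgoal.
            frames = self.generate_video(current, subgoal)
            previous = current
            for frame in frames:
                # Recover the action linking each consecutive pair of frames.
                actions.append(self.infer_action(previous, frame))
                previous = frame
            # Assumption: the imagined video is executed exactly (open loop).
            current = previous
        return actions


# Toy stand-ins so the sketch runs; real components would be learned models.
def toy_planner(task: str) -> List[str]:
    return [step.strip() for step in task.split(" then ")]


def toy_video(frame: Frame, subgoal: str) -> List[Frame]:
    return [[x + 1.0 for x in frame], [x + 2.0 for x in frame]]


def toy_inverse_dynamics(prev: Frame, nxt: Frame) -> Action:
    return [b - a for a, b in zip(prev, nxt)]


policy = ComposedPolicy(toy_planner, toy_video, toy_inverse_dynamics)
actions = policy.act("pick block then place block", [0.0])
print(len(actions), actions[0])
```

Because each stage is passed in as a plain callable, any one of them can be swapped out without touching the other two, which is the property the abstract attributes to composing separately trained expert models.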
Date issued
2024-09
URI
https://hdl.handle.net/1721.1/158501
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Doctoral Theses
