DSpace@MIT
Video as the Language of Embodied Intelligence

Author(s)
Chen, Boyuan
Download
Thesis PDF (36.00 MB)
Advisor
Sitzmann, Vincent
Tedrake, Russ
Terms of use
In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/
Abstract
Achieving general-purpose embodied intelligence remains a central challenge in artificial intelligence. While recent efforts have extended Large Language Models (LLMs) to robotics by incorporating additional modalities, these adaptations face critical limitations in perception, grounding, and control. For example, spatial reasoning—a simple yet indispensable capability for robots—reveals one such shortcoming clearly: multimodal LLMs often fail even basic spatial perception tasks like estimating distances. This thesis begins by examining these failures through SpatialVLM, a system that augments vision-language models with 3D spatial reasoning. Although SpatialVLM is more effective at spatial estimation, this work reveals a deeper issue: the fundamental expressive limitations of language-only outputs in capturing sensorimotor dynamics. Based on these findings, the thesis advocates for a ground-up methodology for robot foundation models, starting with identifying an appropriate “language” for embodied AI, then architecting models and training regimes accordingly. We investigate video as the foundational language, integrated with model-based planning for decision-making. This new paradigm is instantiated through two core contributions. The first is Diffusion Forcing, a hybrid modeling framework that combines causal next-token prediction with full-sequence diffusion. This approach supports stable, coherent rollouts far beyond the training horizon and allows guided generation for decision-making tasks, bridging predictive modeling and planning. Building on Diffusion Forcing, we introduce the Diffusion Forcing Transformer (DFoT), a natural architectural extension designed for flexible video generation conditioned on variable-length histories. To further support long-horizon world modeling, we propose History Guidance, a set of techniques that enhance sample fidelity, temporal consistency, and compositional generalization. Together, these methods enable robust modeling of visual dynamics across extended timeframes. Finally, we present a preliminary yet promising video foundation model for zero-shot robot motion planning, highlighting the potential of video as the foundational language of embodied intelligence.
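The sketch below is an illustrative reading of the hybrid idea named in the abstract, not code from the thesis: a causal sequence model is trained to denoise tokens that each carry an independently sampled noise level, spanning the range between next-token prediction (clean past, noisy future) and full-sequence diffusion (all tokens noisy). The model class, dimensions, and noise schedule (`CausalDenoiser`, `T`, `D`, `K`, `betas`) are assumptions chosen for brevity.

```python
import torch
import torch.nn as nn

# Illustrative sketch of per-token noise levels in a causal denoiser.
# Hyperparameters below are placeholders, not values from the thesis.
T, D, K = 16, 64, 1000                         # sequence length, token dim, diffusion steps
betas = torch.linspace(1e-4, 0.02, K)
alpha_bars = torch.cumprod(1.0 - betas, dim=0) # standard DDPM-style schedule


class CausalDenoiser(nn.Module):
    """Causal transformer that predicts the noise added to each token,
    conditioned on the (noisy) tokens before it and each token's noise level."""

    def __init__(self, dim):
        super().__init__()
        self.level_emb = nn.Embedding(K, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, dim)

    def forward(self, noisy_tokens, levels):
        h = noisy_tokens + self.level_emb(levels)
        mask = nn.Transformer.generate_square_subsequent_mask(noisy_tokens.size(1))
        return self.head(self.backbone(h, mask=mask))


model = CausalDenoiser(D)
x = torch.randn(1, T, D)                       # clean latent "video" sequence
levels = torch.randint(0, K, (1, T))           # independent noise level per token
a = alpha_bars[levels].unsqueeze(-1)
eps = torch.randn_like(x)
x_noisy = a.sqrt() * x + (1 - a).sqrt() * eps  # per-token forward diffusion
loss = nn.functional.mse_loss(model(x_noisy, levels), eps)
loss.backward()
```

At sampling time, the same model can keep already-generated past frames at low noise while denoising future frames, which is one way to realize the stable long-horizon rollouts and guided generation described above.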
Date issued
2025-09
URI
https://hdl.handle.net/1721.1/164584
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Doctoral Theses
