Video as the Language of Embodied Intelligence
Author(s)
Chen, Boyuan
Advisor
Sitzmann, Vincent
Tedrake, Russ
Abstract
Achieving general-purpose embodied intelligence remains a central challenge in artificial intelligence. While recent efforts have extended Large Language Models (LLMs) to robotics by incorporating additional modalities, these adaptations face critical limitations in perception, grounding, and control. For example, spatial reasoning, a simple yet indispensable capability for robots, reveals one such shortcoming clearly: multimodal LLMs often fail at even basic spatial perception tasks such as estimating distances. This thesis begins by examining these failures through SpatialVLM, a system that augments vision-language models with 3D spatial reasoning. Although SpatialVLM is more effective at spatial estimation, this work reveals a deeper issue: the fundamental expressive limitations of language-only outputs in capturing sensorimotor dynamics. Based on these findings, the thesis advocates for a ground-up methodology for robot foundation models: first identifying an appropriate “language” for embodied AI, then architecting models and training regimes accordingly. We investigate video as that foundational language, integrated with model-based planning for decision-making. This new paradigm is instantiated through two core contributions. The first is Diffusion Forcing, a hybrid modeling framework that combines causal next-token prediction with full-sequence diffusion. This approach supports stable, coherent rollouts far beyond the training horizon and allows guided generation for decision-making tasks, bridging predictive modeling and planning. Building on Diffusion Forcing, we introduce the Diffusion Forcing Transformer (DFoT), a natural architectural extension designed for flexible video generation conditioned on variable-length histories. To further support long-horizon world modeling, we propose History Guidance, a set of techniques that enhance sample fidelity, temporal consistency, and compositional generalization. Together, these methods enable robust modeling of visual dynamics across extended timeframes. Finally, we present a preliminary yet promising video foundation model for zero-shot robot motion planning, highlighting the potential of video as the foundational language of embodied intelligence.
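To make the Diffusion Forcing idea from the abstract concrete, the sketch below illustrates the mechanism the published framework is built on: each token in a sequence is corrupted with an independently sampled noise level, and a causal model is trained to denoise every token given its noisy history, which lets one training objective interpolate between next-token prediction (clean history, noisy future) and full-sequence diffusion (all tokens noisy). This is a minimal illustrative sketch, not the thesis implementation; `CausalDenoiser`, `add_noise`, and the simple variance-preserving corruption schedule are hypothetical placeholders.

```python
# Minimal, hypothetical sketch of a Diffusion Forcing-style training step:
# independent noise levels per token, a causal denoiser over the sequence.
# Not the thesis code; names and the noise schedule are illustrative only.
import torch
import torch.nn as nn

class CausalDenoiser(nn.Module):
    """Toy causal sequence model: predicts the noise added to each token,
    conditioned on all (noisy) tokens up to and including that position."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.rnn = nn.GRU(dim + 1, hidden, batch_first=True)  # +1 for the noise-level input
        self.head = nn.Linear(hidden, dim)

    def forward(self, noisy_tokens, noise_levels):
        # noisy_tokens: (B, T, D); noise_levels: (B, T) in [0, 1]
        inp = torch.cat([noisy_tokens, noise_levels.unsqueeze(-1)], dim=-1)
        h, _ = self.rnn(inp)          # causal over the time dimension
        return self.head(h)           # predicted noise for every token

def add_noise(x, k):
    """Simple variance-preserving corruption at continuous noise level k in [0, 1]."""
    alpha = (1.0 - k).unsqueeze(-1)
    eps = torch.randn_like(x)
    return alpha.sqrt() * x + (1.0 - alpha).sqrt() * eps, eps

def diffusion_forcing_step(model, x, optimizer):
    """One training step on a clean token sequence x of shape (B, T, D)."""
    B, T, _ = x.shape
    k = torch.rand(B, T)                       # independent noise level per token
    noisy_x, eps = add_noise(x, k)
    pred = model(noisy_x, k)
    loss = nn.functional.mse_loss(pred, eps)   # standard epsilon-prediction objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the noise level is an explicit per-token input, a model trained this way can, at sampling time, keep already-generated history near zero noise while denoising newly appended tokens, which is what supports rollouts beyond the training horizon and guidance-based planning described in the abstract.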
Date issued
2025-09
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology