Video as the Language of Embodied Intelligence
Author(s)
Chen, Boyuan
Advisor
Sitzmann, Vincent
Tedrake, Russ
Abstract
Achieving general-purpose embodied intelligence remains a central challenge in artificial intelligence. While recent efforts have extended Large Language Models (LLMs) to robotics by incorporating additional modalities, these adaptations face critical limitations in perception, grounding, and control. For example, spatial reasoning, a simple yet indispensable capability for robots, reveals one such shortcoming clearly: multimodal LLMs often fail at even basic spatial perception tasks such as estimating distances. This thesis begins by examining these failures through SpatialVLM, a system that augments vision-language models with 3D spatial reasoning. Although SpatialVLM is more effective at spatial estimation, this work reveals a deeper issue: the fundamental expressive limitations of language-only outputs in capturing sensorimotor dynamics. Based on these findings, the thesis advocates for a ground-up methodology for robot foundation models: first identifying an appropriate “language” for embodied AI, then architecting models and training regimes accordingly. We investigate video as that foundational language, integrated with model-based planning for decision-making. This new paradigm is instantiated through two core contributions. The first is Diffusion Forcing, a hybrid modeling framework that combines causal next-token prediction with full-sequence diffusion. This approach supports stable, coherent rollouts far beyond the training horizon and allows guided generation for decision-making tasks, bridging predictive modeling and planning. Building on Diffusion Forcing, we introduce the Diffusion Forcing Transformer (DFoT), a natural architectural extension designed for flexible video generation conditioned on variable-length histories. To further support long-horizon world modeling, we propose History Guidance, a set of techniques that enhance sample fidelity, temporal consistency, and compositional generalization. Together, these methods enable robust modeling of visual dynamics across extended timeframes. Finally, we present a preliminary yet promising video foundation model for zero-shot robot motion planning, highlighting the potential of video as the foundational language of embodied intelligence.
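To make the Diffusion Forcing idea from the abstract concrete, the sketch below illustrates the mechanism the published framework is built on: each token in a sequence is corrupted with an independently sampled noise level, and a causal model is trained to denoise every token given its noisy history, which lets one training objective interpolate between next-token prediction (clean history, noisy future) and full-sequence diffusion (all tokens noisy). This is a minimal illustrative sketch, not the thesis implementation; `CausalDenoiser`, `add_noise`, and the simple variance-preserving corruption schedule are hypothetical placeholders.

```python
# Minimal, hypothetical sketch of a Diffusion Forcing-style training step:
# independent noise levels per token, a causal denoiser over the sequence.
# Not the thesis code; names and the noise schedule are illustrative only.
import torch
import torch.nn as nn

class CausalDenoiser(nn.Module):
    """Toy causal sequence model: predicts the noise added to each token,
    conditioned on all (noisy) tokens up to and including that position."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.rnn = nn.GRU(dim + 1, hidden, batch_first=True)  # +1 for the noise-level input
        self.head = nn.Linear(hidden, dim)

    def forward(self, noisy_tokens, noise_levels):
        # noisy_tokens: (B, T, D); noise_levels: (B, T) in [0, 1]
        inp = torch.cat([noisy_tokens, noise_levels.unsqueeze(-1)], dim=-1)
        h, _ = self.rnn(inp)          # causal over the time dimension
        return self.head(h)           # predicted noise for every token

def add_noise(x, k):
    """Simple variance-preserving corruption at continuous noise level k in [0, 1]."""
    alpha = (1.0 - k).unsqueeze(-1)
    eps = torch.randn_like(x)
    return alpha.sqrt() * x + (1.0 - alpha).sqrt() * eps, eps

def diffusion_forcing_step(model, x, optimizer):
    """One training step on a clean token sequence x of shape (B, T, D)."""
    B, T, _ = x.shape
    k = torch.rand(B, T)                       # independent noise level per token
    noisy_x, eps = add_noise(x, k)
    pred = model(noisy_x, k)
    loss = nn.functional.mse_loss(pred, eps)   # standard epsilon-prediction objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the noise level is an explicit per-token input, a model trained this way can, at sampling time, keep already-generated history near zero noise while denoising newly appended tokens, which is what supports rollouts beyond the training horizon and guidance-based planning described in the abstract.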
Date issued
2025-09
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology