Show simple item record

dc.contributor.advisor	Sitzmann, Vincent
dc.contributor.advisor	Tedrake, Russ
dc.contributor.author	Chen, Boyuan
dc.date.accessioned	2026-01-20T19:47:01Z
dc.date.available	2026-01-20T19:47:01Z
dc.date.issued	2025-09
dc.date.submitted	2025-09-15T14:39:45.908Z
dc.identifier.uri	https://hdl.handle.net/1721.1/164584
dc.description.abstract	Achieving general-purpose embodied intelligence remains a central challenge in artificial intelligence. While recent efforts have extended Large Language Models (LLMs) to robotics by incorporating additional modalities, these adaptations face critical limitations in perception, grounding, and control. For example, spatial reasoning—a simple yet indispensable capability for robots—clearly reveals one such shortcoming: multimodal LLMs often fail even basic spatial perception tasks such as estimating distances. This thesis begins by examining these failures through SpatialVLM, a system that augments vision-language models with 3D spatial reasoning. Although SpatialVLM proves more effective at spatial estimation, this work reveals a deeper issue: the fundamental expressive limitations of language-only outputs in capturing sensorimotor dynamics. Based on these findings, the thesis advocates a ground-up methodology for robot foundation models: first identifying an appropriate “language” for embodied AI, then architecting models and training regimes accordingly. We investigate video as this foundational language, integrated with model-based planning for decision-making. This new paradigm is instantiated through two core contributions. The first is Diffusion Forcing, a hybrid modeling framework that combines causal next-token prediction with full-sequence diffusion. This approach supports stable, coherent rollouts far beyond the training horizon and allows guided generation for decision-making tasks, bridging predictive modeling and planning. Building on Diffusion Forcing, we introduce the Diffusion Forcing Transformer (DFoT), a natural architectural extension designed for flexible video generation conditioned on variable-length histories. To further support long-horizon world modeling, we propose History Guidance, a set of techniques that enhance sample fidelity, temporal consistency, and compositional generalization. Together, these methods enable robust modeling of visual dynamics across extended timeframes. Finally, we present a preliminary yet promising video foundation model for zero-shot robot motion planning, highlighting the potential of video as the foundational language of embodied intelligence.
dc.publisher	Massachusetts Institute of Technology
dc.rights	In Copyright - Educational Use Permitted
dc.rights	Copyright retained by author(s)
dc.rights.uri	https://rightsstatements.org/page/InC-EDU/1.0/
dc.title	Video as the Language of Embodied Intelligence
dc.type	Thesis
dc.description.degree	Ph.D.
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree	Doctoral
thesis.degree.name	Doctor of Philosophy
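
Note on the abstract's central technique: the abstract describes Diffusion Forcing as a hybrid of causal next-token prediction and full-sequence diffusion. The short sketch below illustrates one way such a hybrid can be set up for training: a causal Transformer denoises a sequence of frame tokens in which every token carries its own independently sampled noise level. This is a minimal illustration only; the module names, shapes, noise schedule, and epsilon-prediction loss are assumptions made for exposition, not the implementation described in the thesis.

import torch
import torch.nn as nn

class CausalDenoiser(nn.Module):
    """Toy causal Transformer that predicts the noise added to each frame token."""
    def __init__(self, dim=64, n_heads=4, n_layers=2, n_steps=1000):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, n_heads, dim * 4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.noise_emb = nn.Embedding(n_steps, dim)  # embedding of each token's own noise level
        self.out = nn.Linear(dim, dim)

    def forward(self, noisy_tokens, noise_levels):
        T = noisy_tokens.size(1)
        # Causal mask: token t attends only to tokens <= t (next-token structure).
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = noisy_tokens + self.noise_emb(noise_levels)
        return self.out(self.backbone(h, mask=mask))

def diffusion_forcing_step(model, frames, alphas_cumprod):
    """One training step: independent noise level per frame, epsilon-prediction loss."""
    B, T, _ = frames.shape
    k = torch.randint(0, alphas_cumprod.size(0), (B, T))   # per-token noise level
    a = alphas_cumprod[k].unsqueeze(-1)                     # (B, T, 1)
    eps = torch.randn_like(frames)
    noisy = a.sqrt() * frames + (1 - a).sqrt() * eps        # DDPM-style forward process
    return nn.functional.mse_loss(model(noisy, k), eps)

if __name__ == "__main__":
    model = CausalDenoiser()
    alphas_cumprod = torch.linspace(0.999, 0.01, 1000)      # toy noise schedule
    frames = torch.randn(2, 8, 64)                          # (batch, time, feature) frame tokens
    loss = diffusion_forcing_step(model, frames, alphas_cumprod)
    loss.backward()
    print(float(loss))

With per-token noise levels, sampling can in principle interpolate between autoregressive rollout (denoising one new token while past tokens stay clean) and full-sequence diffusion (denoising all tokens jointly), which is the bridge between predictive modeling and planning that the abstract refers to.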

