Show simple item record

dc.contributor.advisor	Sitzmann, Vincent
dc.contributor.advisor	Tedrake, Russ
dc.contributor.author	Chen, Boyuan
dc.date.accessioned	2026-01-20T19:47:01Z
dc.date.available	2026-01-20T19:47:01Z
dc.date.issued	2025-09
dc.date.submitted	2025-09-15T14:39:45.908Z
dc.identifier.uri	https://hdl.handle.net/1721.1/164584
dc.description.abstract	Achieving general-purpose embodied intelligence remains a central challenge in artificial intelligence. While recent efforts have extended Large Language Models (LLMs) to robotics by incorporating additional modalities, these adaptations face critical limitations in perception, grounding, and control. For example, spatial reasoning—a simple yet indispensable capability for robots—clearly reveals one such shortcoming: multimodal LLMs often fail even basic spatial perception tasks such as estimating distances. This thesis begins by examining these failures through SpatialVLM, a system that augments vision-language models with 3D spatial reasoning. Although SpatialVLM proves more effective at spatial estimation, this work reveals a deeper issue: the fundamental expressive limitations of language-only outputs in capturing sensorimotor dynamics. Based on these findings, the thesis advocates a ground-up methodology for robot foundation models: first identifying an appropriate “language” for embodied AI, then architecting models and training regimes accordingly. We investigate video as this foundational language, integrated with model-based planning for decision-making. This new paradigm is instantiated through two core contributions. The first is Diffusion Forcing, a hybrid modeling framework that combines causal next-token prediction with full-sequence diffusion. This approach supports stable, coherent rollouts far beyond the training horizon and allows guided generation for decision-making tasks, bridging predictive modeling and planning. Building on Diffusion Forcing, we introduce the Diffusion Forcing Transformer (DFoT), a natural architectural extension designed for flexible video generation conditioned on variable-length histories. To further support long-horizon world modeling, we propose History Guidance, a set of techniques that enhance sample fidelity, temporal consistency, and compositional generalization. Together, these methods enable robust modeling of visual dynamics across extended timeframes. Finally, we present a preliminary yet promising video foundation model for zero-shot robot motion planning, highlighting the potential of video as the foundational language of embodied intelligence.
dc.publisher	Massachusetts Institute of Technology
dc.rights	In Copyright - Educational Use Permitted
dc.rights	Copyright retained by author(s)
dc.rights.uri	https://rightsstatements.org/page/InC-EDU/1.0/
dc.title	Video as the Language of Embodied Intelligence
dc.type	Thesis
dc.description.degree	Ph.D.
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree	Doctoral
thesis.degree.name	Doctor of Philosophy
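
Note on the abstract's central technique: the abstract describes Diffusion Forcing as a hybrid of causal next-token prediction and full-sequence diffusion. The short sketch below illustrates one way such a hybrid can be set up for training: a causal Transformer denoises a sequence of frame tokens in which every token carries its own independently sampled noise level. This is a minimal illustration only; the module names, shapes, noise schedule, and epsilon-prediction loss are assumptions made for exposition, not the implementation described in the thesis.

import torch
import torch.nn as nn

class CausalDenoiser(nn.Module):
    """Toy causal Transformer that predicts the noise added to each frame token."""
    def __init__(self, dim=64, n_heads=4, n_layers=2, n_steps=1000):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, n_heads, dim * 4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.noise_emb = nn.Embedding(n_steps, dim)  # embedding of each token's own noise level
        self.out = nn.Linear(dim, dim)

    def forward(self, noisy_tokens, noise_levels):
        T = noisy_tokens.size(1)
        # Causal mask: token t attends only to tokens <= t (next-token structure).
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = noisy_tokens + self.noise_emb(noise_levels)
        return self.out(self.backbone(h, mask=mask))

def diffusion_forcing_step(model, frames, alphas_cumprod):
    """One training step: independent noise level per frame, epsilon-prediction loss."""
    B, T, _ = frames.shape
    k = torch.randint(0, alphas_cumprod.size(0), (B, T))   # per-token noise level
    a = alphas_cumprod[k].unsqueeze(-1)                     # (B, T, 1)
    eps = torch.randn_like(frames)
    noisy = a.sqrt() * frames + (1 - a).sqrt() * eps        # DDPM-style forward process
    return nn.functional.mse_loss(model(noisy, k), eps)

if __name__ == "__main__":
    model = CausalDenoiser()
    alphas_cumprod = torch.linspace(0.999, 0.01, 1000)      # toy noise schedule
    frames = torch.randn(2, 8, 64)                          # (batch, time, feature) frame tokens
    loss = diffusion_forcing_step(model, frames, alphas_cumprod)
    loss.backward()
    print(float(loss))

With per-token noise levels, sampling can in principle interpolate between autoregressive rollout (denoising one new token while past tokens stay clean) and full-sequence diffusion (denoising all tokens jointly), which is the bridge between predictive modeling and planning that the abstract refers to.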

