Model-based Planning for Efficient Task Execution
Author(s)
Ding, Wenqi
Advisor
Balakrishnan, Hamsa
Abstract
Robotic agents navigating 3D environments must continuously decide their next moves by reasoning about both visual observations and high-level language instructions. However, these agents typically plan in a high-dimensional latent space that is opaque to human collaborators, making it difficult for humans to understand the agent’s decision-making process. This lack of interpretability hinders effective collaboration between humans and robots. The key question this thesis seeks to answer is: Can we build a unified planning framework that fuses visual and language inputs into a single, interpretable representation, so that humans can interpret a robot’s decisions? We propose a model-based planning framework built around pretrained vision-language models (VLMs). We show that VLMs can be used to plan in a unified embedding space, where visual and language representations can be decoded back to human-interpretable forms. Empirical evaluation on vision-language navigation benchmarks demonstrates both improved sample efficiency and transparent decision making, enabling human-in-the-loop planning and more effective human-robot collaboration.
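As a rough illustration of the idea (a minimal sketch, not the thesis's actual implementation), the snippet below plans one step in the shared embedding space of a CLIP-style VLM: it embeds the current observation and the instruction, rolls each language-described candidate action through a small hypothetical latent dynamics model, and selects the action whose predicted next state best matches the instruction. The `LatentDynamics` module, the averaging-free fusion, and the cosine-similarity scoring are all assumptions made for the sketch; because candidate actions are natural-language phrases, the chosen plan step remains human-readable.

```python
import torch
import torch.nn as nn
import clip  # OpenAI CLIP; any pretrained VLM with paired image/text encoders would do
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
vlm, preprocess = clip.load("ViT-B/32", device=device)


class LatentDynamics(nn.Module):
    """Hypothetical learned dynamics model: predicts the next state embedding
    from the current state embedding and an action embedding."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))


@torch.no_grad()
def plan_step(obs_image: Image.Image, instruction: str,
              candidate_actions: list[str], dynamics: LatentDynamics) -> str:
    """One planning step: embed observation and instruction with the VLM,
    roll each language-described action through the dynamics model, and
    pick the action whose predicted next state best matches the goal."""
    state = vlm.encode_image(preprocess(obs_image).unsqueeze(0).to(device)).float()
    goal = vlm.encode_text(clip.tokenize([instruction]).to(device)).float()
    actions = vlm.encode_text(clip.tokenize(candidate_actions).to(device)).float()

    # Predict the next latent state under each candidate action.
    next_states = dynamics(state.expand(len(candidate_actions), -1), actions)

    # Score by cosine similarity to the instruction embedding; since the
    # candidates are natural-language phrases, the selected step is readable.
    next_states = next_states / next_states.norm(dim=-1, keepdim=True)
    goal = goal / goal.norm(dim=-1, keepdim=True)
    scores = (next_states @ goal.T).squeeze(-1)
    return candidate_actions[scores.argmax().item()]
```

In such a setup, interpretability comes from keeping plan steps expressed in language rather than raw latent vectors: a human collaborator can read, veto, or edit the proposed action before the robot executes it.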
Date issued
2025-05
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology