dc.contributor.advisor | Rus, Daniela | |
dc.contributor.author | Mishra, Kartikesh | |
dc.date.accessioned | 2025-10-06T17:39:52Z | |
dc.date.available | 2025-10-06T17:39:52Z | |
dc.date.issued | 2025-05 | |
dc.date.submitted | 2025-06-23T14:03:03.667Z | |
dc.identifier.uri | https://hdl.handle.net/1721.1/163018 | |
dc.description.abstract | Recent vision-language navigation (VLN) approaches leverage large models, prompt engineering, and/or explicit reasoning for instruction interpretation and agent guidance. We introduce MiniNav, a minimalist framework employing frozen vision-language foundation models as patch-wise feature extractors, avoiding data- and compute-heavy fine-tuning and cumbersome language model reasoning. Our lightweight control policies (∼10⁵ trainable parameters) are trained on a compact dataset of language-specified navigational behaviors (∼10² runs, ∼10⁴ frames per behavior). We demonstrate generalization to novel objects and scenes, including direct real-world transfer, despite training on only two objects in a single simulated environment. Through its simple and scalable design, MiniNav provides an alternative to computationally intensive pipelines for robust real-world instruction following. Our solution can provide a reference for evaluating the effective edge of more complex and larger VLN policies. | |
dc.publisher | Massachusetts Institute of Technology | |
dc.rights | In Copyright - Educational Use Permitted | |
dc.rights | Copyright retained by author(s) | |
dc.rights.uri | https://rightsstatements.org/page/InC-EDU/1.0/ | |
dc.title | Minimalist Approach to End-to-End Vision Language Navigation with Multi-Modal Foundation Model Features | |
dc.type | Thesis | |
dc.description.degree | M.Eng. | |
dc.contributor.department | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science | |
mit.thesis.degree | Master | |
thesis.degree.name | Master of Engineering in Electrical Engineering and Computer Science | |