
dc.contributor.advisor: Rus, Daniela
dc.contributor.author: Mishra, Kartikesh
dc.date.accessioned: 2025-10-06T17:39:52Z
dc.date.available: 2025-10-06T17:39:52Z
dc.date.issued: 2025-05
dc.date.submitted: 2025-06-23T14:03:03.667Z
dc.identifier.uri: https://hdl.handle.net/1721.1/163018
dc.description.abstract: Recent vision-language navigation (VLN) approaches leverage large models, prompt engineering, and/or explicit reasoning for instruction interpretation and agent guidance. We introduce MiniNav, a minimalist framework that employs frozen vision-language foundation models as patch-wise feature extractors, avoiding data- and compute-heavy fine-tuning and cumbersome language-model reasoning. Our lightweight control policies (∼10⁵ trainable parameters) are trained on a compact dataset of language-specified navigational behaviors (∼10² runs, ∼10⁴ frames per behavior). We demonstrate generalization to novel objects and scenes, including direct real-world transfer, despite training on only two objects in a single simulated environment. Through its simple and scalable design, MiniNav provides an alternative to computationally intensive pipelines for robust real-world instruction following. Our solution can also serve as a reference point for evaluating the effective advantage of larger and more complex VLN policies.
dc.publisher: Massachusetts Institute of Technology
dc.rights: In Copyright - Educational Use Permitted
dc.rights: Copyright retained by author(s)
dc.rights.uri: https://rightsstatements.org/page/InC-EDU/1.0/
dc.title: Minimalist Approach to End-to-End Vision Language Navigation with Multi-Modal Foundation Model Features
dc.type: Thesis
dc.description.degree: M.Eng.
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree: Master
thesis.degree.name: Master of Engineering in Electrical Engineering and Computer Science
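
The abstract describes an architecture in which a frozen vision-language foundation model supplies patch-wise features to a small trainable control policy. Below is a minimal sketch of that idea, not the thesis's actual implementation: the CLIP checkpoint, the mean-pooling of patch tokens, the action space, and all layer sizes are illustrative assumptions.

import torch
import torch.nn as nn
from transformers import CLIPTokenizer, CLIPVisionModel, CLIPTextModel

class MiniNavSketch(nn.Module):
    """Hypothetical sketch: frozen VLM features -> small trainable policy head."""

    def __init__(self, num_actions: int = 4, hidden: int = 64):
        super().__init__()
        # Frozen vision-language foundation model: no fine-tuning, no gradients.
        self.vision = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
        self.text = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
        for p in self.vision.parameters():
            p.requires_grad = False
        for p in self.text.parameters():
            p.requires_grad = False
        # Lightweight control policy, on the order of 10^5 trainable parameters.
        feat_dim = self.vision.config.hidden_size + self.text.config.hidden_size
        self.policy = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, pixel_values, input_ids, attention_mask):
        with torch.no_grad():
            # Patch-wise image tokens: (B, 1 + num_patches, D); drop the CLS
            # token and mean-pool the patches (the pooling is an assumption).
            patches = self.vision(pixel_values=pixel_values).last_hidden_state[:, 1:, :]
            vis = patches.mean(dim=1)
            # Pooled embedding of the language instruction.
            txt = self.text(input_ids=input_ids,
                            attention_mask=attention_mask).pooler_output
        # Only this head is trained: fused features -> action logits.
        return self.policy(torch.cat([vis, txt], dim=-1))

# Usage with a placeholder camera frame and an example instruction:
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
model = MiniNavSketch()
tokens = tokenizer(["go to the red chair"], padding=True, return_tensors="pt")
frames = torch.randn(1, 3, 224, 224)  # stand-in for a real observation
logits = model(frames, tokens.input_ids, tokens.attention_mask)

Freezing both backbones keeps the trainable parameter count of the policy head near 10⁵, matching the scale reported in the abstract.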

