
dc.contributor.advisor: Rus, Daniela
dc.contributor.author: Mishra, Kartikesh
dc.date.accessioned: 2025-10-06T17:39:52Z
dc.date.available: 2025-10-06T17:39:52Z
dc.date.issued: 2025-05
dc.date.submitted: 2025-06-23T14:03:03.667Z
dc.identifier.uri: https://hdl.handle.net/1721.1/163018
dc.description.abstract: Recent vision-language navigation (VLN) approaches leverage large models, prompt engineering, and/or explicit reasoning for instruction interpretation and agent guidance. We introduce MiniNav, a minimalist framework that employs frozen vision-language foundation models as patch-wise feature extractors, avoiding data- and compute-heavy fine-tuning and cumbersome language-model reasoning. Our lightweight control policies (∼10⁵ trainable parameters) are trained on a compact dataset of language-specified navigational behaviors (∼10² runs, ∼10⁴ frames per behavior). We demonstrate generalization to novel objects and scenes, including direct real-world transfer, despite training on only two objects in a single simulated environment. Through its simple and scalable design, MiniNav provides an alternative to computationally intensive pipelines for robust real-world instruction following. Our solution can also serve as a reference point for evaluating the effective advantage of larger and more complex VLN policies.
dc.publisher: Massachusetts Institute of Technology
dc.rights: In Copyright - Educational Use Permitted
dc.rights: Copyright retained by author(s)
dc.rights.uri: https://rightsstatements.org/page/InC-EDU/1.0/
dc.title: Minimalist Approach to End-to-End Vision Language Navigation with Multi-Modal Foundation Model Features
dc.type: Thesis
dc.description.degree: M.Eng.
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree: Master
thesis.degree.name: Master of Engineering in Electrical Engineering and Computer Science
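
The abstract describes an architecture in which a frozen vision-language foundation model supplies patch-wise features to a small trainable control policy. Below is a minimal sketch of that idea, not the thesis's actual implementation: the CLIP checkpoint, the mean-pooling of patch tokens, the action space, and all layer sizes are illustrative assumptions.

import torch
import torch.nn as nn
from transformers import CLIPTokenizer, CLIPVisionModel, CLIPTextModel

class MiniNavSketch(nn.Module):
    """Hypothetical sketch: frozen VLM features -> small trainable policy head."""

    def __init__(self, num_actions: int = 4, hidden: int = 64):
        super().__init__()
        # Frozen vision-language foundation model: no fine-tuning, no gradients.
        self.vision = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
        self.text = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
        for p in self.vision.parameters():
            p.requires_grad = False
        for p in self.text.parameters():
            p.requires_grad = False
        # Lightweight control policy, on the order of 10^5 trainable parameters.
        feat_dim = self.vision.config.hidden_size + self.text.config.hidden_size
        self.policy = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, pixel_values, input_ids, attention_mask):
        with torch.no_grad():
            # Patch-wise image tokens: (B, 1 + num_patches, D); drop the CLS
            # token and mean-pool the patches (the pooling is an assumption).
            patches = self.vision(pixel_values=pixel_values).last_hidden_state[:, 1:, :]
            vis = patches.mean(dim=1)
            # Pooled embedding of the language instruction.
            txt = self.text(input_ids=input_ids,
                            attention_mask=attention_mask).pooler_output
        # Only this head is trained: fused features -> action logits.
        return self.policy(torch.cat([vis, txt], dim=-1))

# Usage with a placeholder camera frame and an example instruction:
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
model = MiniNavSketch()
tokens = tokenizer(["go to the red chair"], padding=True, return_tensors="pt")
frames = torch.randn(1, 3, 224, 224)  # stand-in for a real observation
logits = model(frames, tokens.input_ids, tokens.attention_mask)

Freezing both backbones keeps the trainable parameter count of the policy head near 10⁵, matching the scale reported in the abstract.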

