DSpace@MIT


Applications of Large Language Models for Robot Navigation and Scene Understanding

Author(s)
Chen, William
Thesis PDF (1.32 MB)
Advisor
Carlone, Luca
Terms of use
In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/
Abstract
Common-sense reasoning is a key challenge in robot navigation and 3D scene understanding. Humans tend to reason about their environments in abstract terms, backed by a wealth of common sense about object and spatial relations. If robots are to see widespread deployment, they must likewise be able to reason with such knowledge to support tasks specified in those terms. As modern language models trained on large text corpora encode much worldly knowledge, we investigate methods for extracting common sense from these models for use in non-linguistic, semantically grounded robotics tasks. We start by examining how language models can attach abstract room classes to locations based on visual percepts and lower-level object classes, as commonly generated by spatial perception systems. We detail three language-only approaches (zero-shot, embedding-based, and structured-language) as well as two vision-and-language approaches (zero-shot and fine-tuned), finding that language-leveraging systems outperform both standard pure-vision classifiers and scene graph neural classifiers while exhibiting impressive generalization and transfer abilities. We then consider a simple robot semantic navigation task to see how an agent can act upon prior knowledge encoded within language models to find goal objects by reasoning about where such objects are likely to be found. Our framework, Language Models as Probabilistic Priors (LaMPP), uses the language model to fill in parameters of standard probabilistic graphical models. We also touch upon use cases outside of robotics, namely semantic segmentation and video action segmentation. Lastly, we show how common-sense knowledge can be extracted from language models and encoded in abstract spatial ontology graphs, and we measure how well language model scores align with human common-sense judgements about object and spatial relationships.
Ultimately, we hope this work paves the way for more advanced robot semantic scene understanding and navigation algorithms that leverage language models.
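The LaMPP idea described in the abstract — using a language model to fill in the parameters of a standard probabilistic graphical model — can be illustrated with a minimal naive-Bayes sketch for room classification from detected objects. The conditional probabilities below are illustrative stand-ins; in the actual framework they would come from querying a language model, and the specific rooms, objects, and values here are hypothetical, not taken from the thesis.

```python
import math

# Hypothetical LM-derived parameters P(object | room). In a LaMPP-style
# setup, a language model would supply these numbers; the values below
# are illustrative stand-ins only.
P_OBJ_GIVEN_ROOM = {
    "kitchen":     {"stove": 0.30, "sofa": 0.01, "bed": 0.01},
    "bedroom":     {"stove": 0.01, "sofa": 0.05, "bed": 0.40},
    "living_room": {"stove": 0.01, "sofa": 0.35, "bed": 0.02},
}
# Uniform prior over room labels.
P_ROOM = {room: 1.0 / len(P_OBJ_GIVEN_ROOM) for room in P_OBJ_GIVEN_ROOM}

def room_posterior(observed_objects):
    """Naive-Bayes posterior over room labels given detected objects."""
    log_scores = {
        room: math.log(P_ROOM[room])
        + sum(math.log(P_OBJ_GIVEN_ROOM[room][o]) for o in observed_objects)
        for room in P_ROOM
    }
    # Normalize in log space (log-sum-exp) for numerical stability.
    m = max(log_scores.values())
    z = sum(math.exp(s - m) for s in log_scores.values())
    return {room: math.exp(s - m) / z for room, s in log_scores.items()}

posterior = room_posterior(["stove"])
best = max(posterior, key=posterior.get)  # "kitchen"
```

The point of the sketch is the division of labor: the graphical model structure (here, naive Bayes) stays standard, while the language model only supplies its numeric parameters.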
Date issued
2023-06
URI
https://hdl.handle.net/1721.1/151450
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Graduate Theses
