Applications of Large Language Models for Robot Navigation and Scene Understanding
Author(s)
Chen, William
Advisor
Carlone, Luca
Abstract
Common-sense reasoning is a key challenge in robot navigation and 3D scene understanding. Humans tend to reason about their environments in abstract terms, drawing on a wealth of common sense about object and spatial relations to back up such inferences. If robots are to see widespread deployment, they too must be able to reason with this kind of knowledge to support tasks specified in abstract terms. Because modern language models trained on large text corpora encode much worldly knowledge, we investigate methods for extracting common sense from such models for use in non-linguistic, semantically grounded robotics tasks. We start by examining how language models can attach abstract room classes to locations based on visual percepts and the lower-level object classes commonly produced by spatial perception systems. We detail three language-only approaches (zero-shot, embedding-based, and structured-language) as well as two vision-and-language approaches (zero-shot and fine-tuned), finding that language-leveraging systems outperform both standard pure-vision classifiers and scene graph neural classifiers while exhibiting strong generalization and transfer abilities. We then consider a simple semantic navigation task to see how a robot can act on the prior knowledge encoded in language models to find goal objects by reasoning about where such objects are likely to be found. Our framework, Language Models as Probabilistic Priors (LaMPP), uses the language model to fill in parameters of standard probabilistic graphical models. We also touch on use cases outside of robotics, namely semantic segmentation and video action segmentation. Lastly, we show how common-sense knowledge can be extracted from language models and encoded in abstract spatial ontology graphs, measuring how well language-model scores align with human common-sense judgments about object and spatial relationships. Ultimately, we hope this work paves the way for more advanced robot semantic scene understanding and navigation algorithms that leverage language models.
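To illustrate the language-only zero-shot approach, here is a minimal sketch, not the thesis's actual pipeline: each candidate room label is scored by the likelihood an off-the-shelf language model assigns to a template sentence listing the detected objects. The model choice (GPT-2 via Hugging Face `transformers`), the template wording, and the room list are all assumptions made for illustration.

```python
# A minimal sketch of zero-shot room classification from object labels:
# score each candidate room by the average per-token log-likelihood the
# LM assigns to a template sentence, and pick the highest-scoring room.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Hypothetical label set for illustration.
ROOMS = ["bedroom", "kitchen", "bathroom", "living room", "office"]

def score_room(objects, room):
    """Average per-token log-likelihood of a template sentence under the LM."""
    text = f"A room containing {', '.join(objects)} is called a {room}."
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)  # .loss is mean cross-entropy
    return -out.loss.item()  # higher means more likely

def classify_room(objects):
    """Pick the room label whose template sentence the LM finds most likely."""
    return max(ROOMS, key=lambda room: score_room(objects, room))

print(classify_room(["a bed", "a lamp", "a dresser"]))  # e.g. "bedroom"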
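```

LaMPP is described above only at a high level; the following is a hedged sketch of the underlying idea under simplifying assumptions. The numeric LM scores, the room list, the template implied by the scores, and the detector confidences are all hypothetical: the point is only that LM scores are normalized into a prior P(room | goal object) for a simple probabilistic model and combined with perception likelihoods to choose where to search.

```python
# A minimal sketch of the LaMPP idea: the language model supplies the
# prior P(room | goal object) of a probabilistic model, which is combined
# with a (hypothetical) detector's confidence in each room label to pick
# the posterior-maximizing room to search.
import math

# Hypothetical LM log-scores for sentences like "You are likely to find
# a {obj} in the {room}." -- in practice these come from an LM as above.
lm_log_scores = {
    ("mug", "kitchen"): -2.1,
    ("mug", "bedroom"): -5.3,
    ("mug", "bathroom"): -6.0,
}

def room_prior(goal, rooms):
    """Softmax-normalize LM scores into a prior distribution P(room | goal)."""
    logits = [lm_log_scores[(goal, r)] for r in rooms]
    z = max(logits)  # subtract max for numerical stability
    exps = [math.exp(l - z) for l in logits]
    total = sum(exps)
    return {r: e / total for r, e in zip(rooms, exps)}

def best_room(goal, rooms, detector_likelihood):
    """Room maximizing LM prior times the detector's label likelihood."""
    prior = room_prior(goal, rooms)
    return max(rooms, key=lambda r: prior[r] * detector_likelihood[r])

rooms = ["kitchen", "bedroom", "bathroom"]
detector = {"kitchen": 0.7, "bedroom": 0.9, "bathroom": 0.6}  # hypothetical
print(best_room("mug", rooms, detector))  # "kitchen": the prior dominates
```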
Date issued
2023-06
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology