Enhancing 3D Scene Graph Generation with Multimodal Embeddings
Author(s)
Morales, Joseph
Advisor
Carlone, Luca
Abstract
3D Scene Graphs are expressive map representations for scene understanding in robotics and computer vision. Current approaches for automated zero-shot 3D Scene Graph generation rely on spatial ontologies that relate objects to the semantic locations in which they are found (e.g., a fork is found in a kitchen). While these ontologies confer impressive zero-shot performance, the resulting approaches depend on the presence of disambiguating objects in a scene, on the expressiveness of the generated spatial ontology, and on knowing at data-collection time which specific objects the robot must observe. This thesis proposes a method for zero-shot scene graph generation that leverages Vision-Language Models (VLMs) to construct a layer of Viewpoints in the scene graph, enabling after-the-fact open-vocabulary querying over the scene. Methods for utilizing different VLM features are explored; these yield improvements over the ontological approach on region segmentation tasks.
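To make the querying idea concrete, the following is a minimal sketch (not the thesis implementation, whose details the abstract does not specify) of after-the-fact open-vocabulary retrieval over a Viewpoints layer: each viewpoint image is embedded once with a CLIP-style VLM, and a free-form text query is later ranked against those stored embeddings by cosine similarity. The model checkpoint, the helper names, and the image paths are illustrative assumptions.

# Sketch: open-vocabulary querying over stored viewpoint embeddings.
# Assumes a CLIP-style VLM via Hugging Face transformers; the checkpoint
# and file paths below are hypothetical placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_viewpoints(images: list[Image.Image]) -> torch.Tensor:
    # Embed each viewpoint image once, at mapping time.
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def query_viewpoints(query: str, viewpoint_feats: torch.Tensor) -> torch.Tensor:
    # Rank stored viewpoints against an after-the-fact text query.
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        text_feat = model.get_text_features(**inputs)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    return (viewpoint_feats @ text_feat.T).squeeze(-1)  # cosine similarities

# Hypothetical viewpoint images captured along a robot's trajectory.
images = [Image.open(p) for p in ["viewpoint_000.png", "viewpoint_001.png"]]
feats = embed_viewpoints(images)
scores = query_viewpoints("a kitchen with a fork on the counter", feats)
best = int(scores.argmax())  # index of the best-matching viewpoint

Because the image embeddings are computed during mapping and the text query is embedded only at query time, the scene can be interrogated with vocabulary that was never anticipated during data collection.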
Date issued
2024-05
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology