| dc.contributor.advisor | Carlone, Luca | |
| dc.contributor.author | Morales, Joseph | |
| dc.date.accessioned | 2024-09-16T13:46:25Z | |
| dc.date.available | 2024-09-16T13:46:25Z | |
| dc.date.issued | 2024-05 | |
| dc.date.submitted | 2024-07-11T14:36:33.616Z | |
| dc.identifier.uri | https://hdl.handle.net/1721.1/156743 | |
| dc.description.abstract | 3D Scene Graphs are expressive map representations for scene understanding in robotics and computer vision. Current approaches for automated zero-shot 3D Scene Graph generation rely on spatial ontologies that relate objects to the semantic locations in which they are found (e.g., a fork is found in a kitchen). While conferring impressive zero-shot performance, these approaches depend on the presence of disambiguating objects in a scene, on the expressiveness of the generated spatial ontologies, and on knowing during data collection which specific objects a robot needs to observe in the environment. This thesis proposes a method for zero-shot scene graph generation that leverages Vision-Language Models (VLMs) to construct a layer of Viewpoints in the scene graph, enabling after-the-fact open-vocabulary querying over the scene. Methods for utilizing different VLM features are explored, yielding improvements over the ontological approach on region segmentation tasks. | |
| dc.publisher | Massachusetts Institute of Technology | |
| dc.rights | In Copyright - Educational Use Permitted | |
| dc.rights | Copyright retained by author(s) | |
| dc.rights.uri | https://rightsstatements.org/page/InC-EDU/1.0/ | |
| dc.title | Enhancing 3D Scene Graph Generation with Multimodal Embeddings | |
| dc.type | Thesis | |
| dc.description.degree | M.Eng. | |
| dc.contributor.department | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science | |
| mit.thesis.degree | Master | |
| thesis.degree.name | Master of Engineering in Electrical Engineering and Computer Science | |