Scene Perception for Simulated Intuitive Physics via Bayesian Inverse Graphics

Shehada, Khaled K.

Advisor

Roy, Nicholas

Terms of use

In Copyright - Educational Use Permitted. Copyright retained by author(s). https://rightsstatements.org/page/InC-EDU/1.0/

Abstract

Humans have a wide range of cognitive capacities that make us adept at interpreting our physical world. Every day, we encounter new environments, yet we can parse them with limited visual exposure and make fairly accurate inferences about unfamiliar objects. Emulating these scene understanding capacities in computational models has numerous applications, ranging from autonomous driving to virtual reality. Despite the proficiency of deep neural networks in pattern recognition, recent work has uncovered challenges in their ability to encode prior physical knowledge, form visual concepts, and perform compositional reasoning, such as inferring inter-object relations like containment. To examine these challenges, this thesis introduces the Simulated COgnitive Tasks (SCOT) benchmark, a large-scale synthetic dataset and data-creation codebase that procedurally generates videos of simulated cognitive tasks targeting intuitive physics understanding. These tasks are adapted from tests in the literature used to comparatively assess the cognitive capacities of non-human primates. Additionally, the thesis presents an analysis of several deep learning models on the benchmark, underlining their limitations in tasks involving object permanence, quantities, and compositionality, and their inability to generalize learned knowledge to complex dynamic scenes. In response to these limitations, we propose a probabilistic generative approach that leverages Bayesian inverse graphics to learn structured scene representations that facilitate learning new objects and tracking objects in dynamic scenes. Our evaluation of this model on SCOT revealed near-perfect performance on most tasks with significant data efficiency, suggesting that structured representations and symbolic inference can cooperate with deep learning methods to interpret complex 3D scenes accurately.
Overall, this thesis contributes to the field of artificial intelligence (AI) by presenting a new method for improving scene understanding in AI models and providing a benchmark for assessing the visual cognitive capacities of computational models.

Date issued

2023-09

URI

https://hdl.handle.net/1721.1/152837

Department

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science

Publisher

Massachusetts Institute of Technology

Collections

Graduate Theses