Scene Perception for Simulated Intuitive Physics via Bayesian Inverse Graphics
Author(s)
Shehada, Khaled K.
Advisor
Roy, Nicholas
Abstract
Humans have a wide range of cognitive capacities that make us adept at interpreting our physical world. Every day, we encounter new environments, yet we can parse those environments with limited visual exposure and make fairly accurate inferences about unfamiliar objects. Emulating scene understanding capacities in computational models has numerous applications ranging from autonomous driving to virtual reality. Despite the proficiency demonstrated by deep neural networks in pattern recognition, recent works have uncovered challenges in their abilities to encode prior physical knowledge, form visual concepts, and perform compositional reasoning, such as inferring inter-object relations like containment. To this end, the thesis introduces the Simulated COgnitive Tasks (SCOT) benchmark, a large-scale synthetic dataset and data creation codebase allowing for the procedural generation of videos of simulated cognitive tasks targeting intuitive physics understanding. Those cognitive tasks are adapted from tests in the literature used to comparatively assess the cognitive capacities of non-human primates. Additionally, the thesis presents an analysis of several deep learning models on the benchmark, underlining their limitations in tasks involving object permanence comprehension, quantities, and compositionality and their inability to generalize learned knowledge to complex dynamic scenes. In response to these limitations, we propose a probabilistic generative approach that leverages Bayesian inverse graphics to learn structured scene representations that facilitate learning new objects and tracking objects in dynamic scenes. Our evaluation of this model on SCOT revealed near-perfect performance on most tasks with significant data efficiency, suggesting that structured representations and symbolic inference can cooperate with deep learning methods to interpret complex 3D scenes accurately. 
Overall, this thesis contributes to the field of artificial intelligence (AI) by presenting a new method for improving scene understanding in AI models and providing a benchmark for assessing the visual cognitive capacities of computational models.
Date issued
2023-09
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology