DSpace@MIT

Learning to see the physical world

Author(s)
Wu, Jiajun, Ph.D., Massachusetts Institute of Technology
Download 1201541074-MIT.pdf (62.04 MB)
Other Contributors
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.
Advisor
William T. Freeman and Joshua B. Tenenbaum.
Terms of use
MIT theses may be protected by copyright. Please reuse MIT thesis content according to the MIT Libraries Permissions Policy, which is available through the URL provided. http://dspace.mit.edu/handle/1721.1/7582
Abstract
Human intelligence goes beyond pattern recognition. From a single image, we can explain what we see, reconstruct the scene in 3D, predict what is going to happen, and plan our actions accordingly. Despite its phenomenal development over the past decade, artificial intelligence, and deep learning in particular, still falls short of human intelligence in several prominent respects: it generally tackles specific problems, requires large amounts of training data, and breaks easily when generalizing to new tasks or environments. In this dissertation, we study the problem of physical scene understanding: building versatile, data-efficient, and generalizable machines that learn to see, reason about, and interact with the physical world. The core idea is to exploit the generic, causal structure behind the world, including knowledge from computer graphics, physics, and language, in the form of approximate simulation engines, and to integrate them with deep learning.
 
Here, learning plays a multifaceted role: models may learn to invert simulation engines for efficient inference; they may also learn to approximate or augment simulation engines for more powerful forward simulation. This dissertation consists of three parts, where we investigate the use of such a hybrid model for perception, dynamics modeling, and cognitive reasoning, respectively. In Part I, we use learning in conjunction with graphics engines to build an object-centered scene representation for object shape, pose, and texture. In Part II, in addition to graphics engines, we pair learning with physics engines to simultaneously infer physical object properties. We also explore learning approximate simulation engines for better flexibility and expressiveness. In Part III, we leverage and extend the models introduced in Parts I and II for concept discovery and cognitive reasoning by looping in a program execution engine.
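The idea of inverting a simulation engine can be illustrated with a toy analysis-by-synthesis loop. This is a minimal sketch, not the thesis code: the renderer, loss, and grid search below are hypothetical stand-ins for the graphics engines and learned inference networks the dissertation actually uses.

```python
# Toy analysis-by-synthesis: a "graphics engine" renders a 1-D image from a
# scene parameter, and inference inverts the engine by searching for the
# parameter whose rendering best matches the observation.

def render(position, width=16):
    """Toy graphics engine: a blurred 'object' centered at `position`."""
    return [max(0.0, 1.0 - abs(x - position) / 3.0) for x in range(width)]

def invert(observation, width=16):
    """Invert the engine by minimizing reconstruction error over candidates.
    (In the dissertation, a neural network amortizes this search.)"""
    def loss(pos):
        rendering = render(pos, width)
        return sum((a - b) ** 2 for a, b in zip(rendering, observation))
    candidates = [p / 10.0 for p in range(width * 10)]
    return min(candidates, key=loss)

observed = render(7.0)       # ground-truth scene: object at x = 7
estimate = invert(observed)  # recovered scene parameter
print(estimate)              # → 7.0
```

A learned inference network replaces the grid search with a single forward pass, which is what makes inference efficient; learned forward models play the complementary role of replacing or augmenting `render` itself.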
 
The enhanced models discover program-like structures in objects and scenes and, in turn, exploit them for downstream tasks such as visual question answering and scene manipulation.
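The kind of program execution described above can be sketched with a toy symbolic executor. The object attributes, operation names, and question here are hypothetical illustrations, not the representations used in the thesis.

```python
# Toy program executor for visual question answering: a question is parsed
# into a short program, which is run step by step against an object-centered
# scene representation.
SCENE = [
    {"shape": "cube", "color": "red"},
    {"shape": "sphere", "color": "red"},
    {"shape": "cube", "color": "blue"},
]

def filter_attr(objs, key, value):
    """Keep only the objects whose attribute `key` equals `value`."""
    return [o for o in objs if o[key] == value]

# "How many red objects are there?" parsed into a two-step program:
program = [("filter", "color", "red"), ("count",)]

result = SCENE
for op, *args in program:
    if op == "filter":
        result = filter_attr(result, *args)
    elif op == "count":
        result = len(result)
print(result)  # → 2
```

The point of the structure is compositionality: the same primitive operations recombine to answer new questions or to specify scene edits, without retraining.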
 
Description
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2020
 
Cataloged from PDF of thesis.
 
Includes bibliographical references (pages 271-303).
 
Date issued
2020
URI
https://hdl.handle.net/1721.1/128332
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology
Keywords
Electrical Engineering and Computer Science.

Collections
  • Doctoral Theses

Content created by the MIT Libraries, CC BY-NC unless otherwise noted.