Learning through looking and listening
Author(s)
Recasens Continente, Adrià.
Other Contributors
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.
Advisor
Antonio Torralba.
Abstract
In order to read emotions, understand actions, or anticipate intentions, humans need efficient ways of gathering information about each other. In particular, gaze and speech are rich sources of information about other people's thoughts. This thesis investigates these two modalities. In the first part of the thesis, we describe our work on predicting human gaze. We introduce a series of methods to follow gaze across different modalities. First, we present GazeFollow, a dataset and model to predict the location of people's gaze in an image. We then extend this method to video, where the system predicts when and where in the video the attended object appears. Finally, we introduce Gaze360, a large-scale gaze-tracking dataset and method for robust 3D gaze direction estimation in unconstrained scenes. To improve processing efficiency, we also propose a saliency-based sampling layer designed to improve performance on arbitrary tasks by efficiently zooming into the relevant parts of the input image. In the second part of the thesis, we present our work on learning spoken words from raw audio descriptions of images. We describe a multi-modal system capable of learning correspondences between segments of audio - nouns - and specific visual concepts. To investigate how to extend this system beyond learning nouns, we present a novel training procedure that learns abstract visual attributes (i.e., size, material, or color) by using a generative model to produce the training images. Building upon recent findings that GAN representations can be manipulated to edit semantic concepts in the generated output, our method uses GAN-generated images to train the model with a triplet loss. Finally, we present three extensions and applications derived from our work: a dataset to jointly model speech and gaze; a system for gaze-tracking for behavioral research in children; and gaze-following in the classroom.
Together, the methods presented in this thesis demonstrate the potential for human understanding through gaze and speech in images and videos.
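The triplet loss mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation; the function name, toy embeddings, and margin value are illustrative assumptions. The idea is the standard hinge formulation: an anchor embedding should be closer to a matching (positive) example than to a mismatched (negative) one by at least a margin; in the thesis the triplets are drawn from GAN-generated images that differ in a single attribute.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Hinge loss on squared L2 distances: penalize cases where the
    # anchor is not closer to the positive than to the negative
    # by at least `margin`.
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

# Toy 2-D embeddings: the anchor matches the positive, not the negative.
anchor   = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])
negative = np.array([0.0, 1.0])

loss = triplet_loss(anchor, positive, negative)          # satisfied triplet
violating = triplet_loss(anchor, negative, positive)     # violated triplet
```

With the toy values above, the satisfied triplet incurs zero loss, while swapping the positive and negative yields a positive loss that a training procedure would minimize.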
Description
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2020 Cataloged from student-submitted PDF of thesis. Includes bibliographical references (pages 147-163).
Date issued
2020
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology
Keywords
Electrical Engineering and Computer Science.