Learning through looking and listening
Author(s)
Recasens Continente, Adrià.
Other Contributors
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.
Advisor
Antonio Torralba.
Abstract
In order to read emotions, understand actions, or anticipate intentions, humans need efficient ways of gathering information about each other. In particular, gaze and speech are rich sources of information about other people's thoughts. This thesis investigates these two modalities. In the first part of the thesis, we describe our work on predicting human gaze. We introduce a series of methods to follow gaze across different modalities. First, we present GazeFollow, a dataset and model to predict the location of people's gaze in an image. We then extend this method to video, where the system predicts when and where in the video the attended object appears. Finally, we introduce Gaze360, a large-scale gaze-tracking dataset and method for robust 3D gaze direction estimation in unconstrained scenes. To improve processing efficiency, we also propose a saliency-based sampling layer designed to improve performance on arbitrary tasks by efficiently zooming into the relevant parts of the input image. In the second part of the thesis, we present our work on learning spoken words from raw audio descriptions of images. We describe a multi-modal system capable of learning correspondences between segments of audio - nouns - and specific visual concepts. To investigate how to extend this system beyond learning nouns, we present a novel training procedure that learns abstract visual attributes (i.e., size, material, or color) by using a generative model to produce the training images. Building upon recent findings that GAN representations can be manipulated to edit semantic concepts in the generated output, our method uses GAN-generated images to train the model with a triplet loss. Finally, we present three extensions and applications derived from our work: a dataset to jointly model speech and gaze; a system for gaze-tracking for behavioral research in children; and gaze-following in the classroom.
Together, the methods presented in this thesis demonstrate the potential for human understanding through gaze and speech in images and videos.
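The triplet loss mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation; the function name, toy embeddings, and margin value are illustrative assumptions. The idea is the standard hinge formulation: an anchor embedding should be closer to a matching (positive) example than to a mismatched (negative) one by at least a margin; in the thesis the triplets are drawn from GAN-generated images that differ in a single attribute.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Hinge loss on squared L2 distances: penalize cases where the
    # anchor is not closer to the positive than to the negative
    # by at least `margin`.
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

# Toy 2-D embeddings: the anchor matches the positive, not the negative.
anchor   = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])
negative = np.array([0.0, 1.0])

loss = triplet_loss(anchor, positive, negative)          # satisfied triplet
violating = triplet_loss(anchor, negative, positive)     # violated triplet
```

With the toy values above, the satisfied triplet incurs zero loss, while swapping the positive and negative yields a positive loss that a training procedure would minimize.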
Description
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2020 Cataloged from student-submitted PDF of thesis. Includes bibliographical references (pages 147-163).
Date issued
2020
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology
Keywords
Electrical Engineering and Computer Science.