DSpace@MIT

Learning through looking and listening

Author(s)
Recasens Continente, Adrià.
Download: 1201522300-MIT.pdf (100.3Mb)
Other Contributors
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.
Advisor
Antonio Torralba.
Terms of use
MIT theses may be protected by copyright. Please reuse MIT thesis content according to the MIT Libraries Permissions Policy, which is available through the URL provided. http://dspace.mit.edu/handle/1721.1/7582
Abstract
In order to read emotions, understand actions, or anticipate intentions, humans need efficient ways of gathering information about each other. In particular, gaze and speech are rich sources of information about other people's thoughts. This thesis investigates these two modalities. In the first part of the thesis, we describe our work on predicting human gaze. We introduce a series of methods to follow gaze across different modalities. First, we present GazeFollow, a dataset and model to predict the location of people's gaze in an image. We then extend this method to work on video, where the system predicts when and where in the video the attended object appears. Finally, we introduce Gaze360, a large-scale gaze-tracking dataset and method for robust 3D gaze direction estimation in unconstrained scenes.
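As a rough, hedged illustration of the 3D gaze-direction estimation task mentioned above (this is not the Gaze360 architecture from the thesis; the backbone, loss, and input shapes are assumptions), one common formulation regresses a unit gaze vector from a head crop and penalizes the angular error:

```python
# Illustrative sketch only: a generic 3D gaze-direction regressor,
# not the Gaze360 model described in the thesis.
import torch
import torch.nn as nn
from torchvision import models

class GazeRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=None)              # any image backbone works
        backbone.fc = nn.Linear(backbone.fc.in_features, 3)   # predict (x, y, z)
        self.backbone = backbone

    def forward(self, head_crop):
        g = self.backbone(head_crop)
        return nn.functional.normalize(g, dim=-1)             # unit gaze vector

def angular_loss(pred, target):
    # Mean angle (radians) between predicted and ground-truth gaze directions.
    cos = (pred * nn.functional.normalize(target, dim=-1)).sum(-1)
    return torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7)).mean()

model = GazeRegressor()
crops = torch.randn(8, 3, 224, 224)   # batch of head crops (assumed input size)
gt = torch.randn(8, 3)                # ground-truth 3D gaze vectors
loss = angular_loss(model(crops), gt)
```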
 
In order to improve processing efficiency, we also propose a saliency-based sampling layer designed to improve performance in arbitrary tasks by efficiently zooming into the relevant parts of the input image. In the second part of the thesis, we present our work on learning spoken words from raw audio descriptions of images. We describe a multi-modal system capable of learning correspondences between segments of audio (nouns) and specific visual concepts. To investigate how to extend this system beyond learning nouns, we present a novel training procedure to learn abstract visual attributes (e.g., size, material, or color) by using a generative model to generate the training images. Building upon recent findings that GAN representations can be manipulated to edit semantic concepts in the generated output, our method uses GAN-generated images to train the model with a triplet loss.
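As a minimal sketch of the triplet-loss idea referenced above (the encoder, the stand-in GAN images, and the attribute example are illustrative assumptions, not the thesis's implementation), the anchor and positive would share a visual attribute while the negative differs in it:

```python
# Minimal triplet-loss sketch; the encoder and "GAN-edited" images here are
# placeholders, not the method or models described in the thesis.
import torch
import torch.nn as nn
from torchvision import models

encoder = models.resnet18(weights=None)
encoder.fc = nn.Identity()            # use pooled features as the embedding

triplet = nn.TripletMarginLoss(margin=1.0)

# Stand-ins for GAN outputs: anchor and positive share an attribute (e.g. the
# same material), while the negative is the same scene edited to a different one.
anchor   = torch.randn(4, 3, 224, 224)
positive = torch.randn(4, 3, 224, 224)
negative = torch.randn(4, 3, 224, 224)

loss = triplet(encoder(anchor), encoder(positive), encoder(negative))
loss.backward()
```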
 
Finally, we present three extensions and applications derived from our work: a dataset for jointly modeling speech and gaze, a gaze-tracking system for behavioral research in children, and gaze-following in the classroom. Together, the methods presented in this thesis demonstrate the potential for understanding humans through gaze and speech in images and videos.
 
Description
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
 
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2020
 
Cataloged from student-submitted PDF of thesis.
 
Includes bibliographical references (pages 147-163).
 
Date issued
2020
URI
https://hdl.handle.net/1721.1/128297
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology
Keywords
Electrical Engineering and Computer Science.

Collections
  • Doctoral Theses
