Ambient Sound Provides Supervision for Visual Learning
Author(s)
Owens, Andrew Hale; Wu, Jiajun; McDermott, Joshua H.; Freeman, William T.; Torralba, Antonio
Terms of use
Creative Commons Attribution-Noncommercial-Share Alike (Open Access Policy)
Abstract
The sound of crashing waves, the roar of fast-moving cars – sound conveys important information about the objects in our surroundings. In this work, we show that ambient sounds can be used as a supervisory signal for learning visual models. To demonstrate this, we train a convolutional neural network to predict a statistical summary of the sound associated with a video frame. We show that, through this process, the network learns a representation that conveys information about objects and scenes. We evaluate this representation on several recognition tasks, finding that its performance is comparable to that of other state-of-the-art unsupervised learning methods. Finally, we show through visualizations that the network learns units that are selective to objects that are often associated with characteristic sounds.
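The recipe described in the abstract — summarize each clip's audio statistically, then train an image network to predict that summary from a single frame — can be illustrated with a short sketch. The code below is a minimal, hypothetical PyTorch illustration, not the authors' implementation: it assumes the sound summaries have already been quantized into a fixed number of clusters and treats prediction as classification; `FrameToSoundNet`, `NUM_SOUND_CLUSTERS`, and the ResNet-18 backbone are all illustrative assumptions.

```python
# Illustrative sketch (not the authors' released code): train an image CNN to
# predict a summary of the audio that accompanies each video frame, posed here
# as classification over pre-computed sound-statistic clusters.
import torch
import torch.nn as nn
import torchvision.models as models

NUM_SOUND_CLUSTERS = 30  # hypothetical number of audio clusters


class FrameToSoundNet(nn.Module):
    def __init__(self, num_clusters=NUM_SOUND_CLUSTERS):
        super().__init__()
        backbone = models.resnet18(weights=None)  # any image CNN would do
        backbone.fc = nn.Linear(backbone.fc.in_features, num_clusters)
        self.backbone = backbone

    def forward(self, frames):        # frames: (B, 3, H, W)
        return self.backbone(frames)  # logits over sound clusters


def train_step(model, optimizer, frames, sound_cluster_ids):
    """One optimization step of the frame -> sound-statistic prediction task."""
    model.train()
    logits = model(frames)
    loss = nn.functional.cross_entropy(logits, sound_cluster_ids)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    model = FrameToSoundNet()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
    # Dummy batch: 8 frames with random cluster labels, just to show the shapes.
    frames = torch.randn(8, 3, 224, 224)
    labels = torch.randint(0, NUM_SOUND_CLUSTERS, (8,))
    print(train_step(model, optimizer, frames, labels))
```

After training on this proxy task, the convolutional layers of `backbone` could be kept and evaluated as a learned visual representation on downstream recognition tasks, which is the role the sound-prediction objective plays in the paper.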
Date issued
2016-09
Department
Massachusetts Institute of Technology. Department of Brain and Cognitive Sciences; Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Journal
Lecture Notes in Computer Science
Publisher
Springer-Verlag
Citation
Owens, Andrew, et al. “Ambient Sound Provides Supervision for Visual Learning.” Lecture Notes in Computer Science 9905 (September 2016): 801–816. © 2016 Springer International Publishing AG
Version: Original manuscript
ISBN
978-3-319-46447-3
978-3-319-46448-0
ISSN
0302-9743
1611-3349