DSpace@MIT

SoundNet: learning sound representations from unlabeled video

Author(s)
Aytar, Yusuf; Vondrick, Carl; Torralba, Antonio
Download: Published version (5.576 MB)
Terms of use
Article is made available in accordance with the publisher's policy and may be subject to US copyright law. Please refer to the publisher's site for terms of use.
Abstract
We learn rich natural sound representations by capitalizing on large amounts of unlabeled sound data collected in the wild. We leverage the natural synchronization between vision and sound to learn an acoustic representation using two million unlabeled videos. Unlabeled video has the advantage that it can be economically acquired at massive scales, yet contains useful signals about natural sound. We propose a student-teacher training procedure which transfers discriminative visual knowledge from well-established visual recognition models into the sound modality using unlabeled video as a bridge. Our sound representation yields significant performance improvements over the state-of-the-art results on standard benchmarks for acoustic scene/object classification. Visualizations suggest some high-level semantics automatically emerge in the sound network, even though it is trained without ground truth labels.

©2016 Paper presented at the 30th Conference on Neural Information Processing Systems (NeurIPS), Dec. 5-10, 2016, Barcelona, Spain.
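The student-teacher procedure the abstract describes trains the sound network to match the class posteriors that pretrained visual networks produce on frames of the same video, so no sound labels are needed. Below is a minimal PyTorch sketch of that idea, assuming a KL-divergence objective between the two distributions; the tiny 1-D convolutional network and the helper names are illustrative placeholders, not the paper's actual SoundNet architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoundStudent(nn.Module):
    """Tiny 1-D conv net over raw waveforms (placeholder for SoundNet)."""
    def __init__(self, num_classes: int = 1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=64, stride=8), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=32, stride=8), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Linear(32, num_classes)

    def forward(self, wav):                  # wav: (batch, 1, samples)
        h = self.features(wav).squeeze(-1)   # (batch, 32)
        return self.head(h)                  # unnormalized class scores

def distillation_step(student, teacher_probs, wav, optimizer):
    """One training step: pull the sound net's output distribution
    toward the visual teacher's posterior over classes, so the only
    supervision comes from the paired video frames."""
    optimizer.zero_grad()
    log_p_student = F.log_softmax(student(wav), dim=1)
    # F.kl_div expects log-probabilities as input and probabilities
    # as target; "batchmean" averages the divergence over the batch.
    loss = F.kl_div(log_p_student, teacher_probs, reduction="batchmean")
    loss.backward()
    optimizer.step()
    return loss.item()
```

Here `teacher_probs` would be the softmaxed predictions of a pretrained visual recognition model (for example, over ImageNet or Places categories) evaluated on frames drawn from the same clip as the waveform `wav`; both helper names are hypothetical.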
Date issued
2016
URI
https://hdl.handle.net/1721.1/124993
Department
Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory; Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Journal
Advances in Neural Information Processing Systems
Citation
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: learning sound representations from unlabeled video." In Lee, D.D., et al., eds., Advances in Neural Information Processing Systems 29 (San Diego, Calif.: Neural Information Processing Systems Foundation, 2016). https://papers.nips.cc/paper/6146-soundnet-learning-sound-representations-from-unlabeled-video ©2016 Author(s)
Version: Final published version

Collections
  • MIT Open Access Articles
