SoundNet: learning sound representations from unlabeled video
Author(s)
Aytar, Yusuf; Vondrick, Carl; Torralba, Antonio
Abstract
We learn rich natural sound representations by capitalizing on large amounts of unlabeled sound data collected in the wild. We leverage the natural synchronization between vision and sound to learn an acoustic representation using two million unlabeled videos. Unlabeled video has the advantage that it can be economically acquired at massive scales, yet contains useful signals about natural sound. We propose a student-teacher training procedure which transfers discriminative visual knowledge from well-established visual recognition models into the sound modality using unlabeled video as a bridge. Our sound representation yields significant performance improvements over the state-of-the-art results on standard benchmarks for acoustic scene/object classification. Visualizations suggest some high-level semantics automatically emerge in the sound network, even though it is trained without ground truth labels.
©2016. Paper presented at the 30th Conference on Neural Information Processing Systems (NeurIPS), Dec. 5-10, 2016, Barcelona, Spain.
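As a rough illustration of the student-teacher idea in the abstract, the sketch below scores how closely a sound network (student) matches the class distribution that a vision network (teacher) predicts for the same video. It assumes the transfer objective is a KL divergence between the two distributions; the abstract does not state the loss, and every name and number here (`transfer_loss`, the toy logits) is hypothetical rather than taken from the authors' code.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q))

def transfer_loss(teacher_probs_batch, student_logits_batch):
    """Mean KL divergence between teacher (vision) and student (sound) outputs.

    Assumption: the teacher's probabilities act as soft targets, so no
    ground-truth labels are needed, only paired frames and audio.
    """
    losses = [kl_divergence(teacher, softmax(student))
              for teacher, student in zip(teacher_probs_batch,
                                          student_logits_batch)]
    return sum(losses) / len(losses)

# Toy batch: two videos, three hypothetical visual categories.
teacher_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
teacher_probs = [softmax(row) for row in teacher_logits]

# A student that reproduces the teacher exactly versus one that guesses uniformly.
loss_match = transfer_loss(teacher_probs, teacher_logits)
loss_uniform = transfer_loss(teacher_probs, [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
```

In a real training loop the student logits would come from a network applied to the audio track, and this loss would be minimized by gradient descent while the teacher stays fixed.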
Date issued
2016
Department
Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory; Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Journal
Advances in Neural Information Processing Systems
Citation
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba, "SoundNet: learning sound representations from unlabeled video." In Lee, D.D., et al., eds., Advances in Neural Information Processing Systems 29 (San Diego, Calif.: Neural Information Processing Systems Foundation, 2016). https://papers.nips.cc/paper/6146-soundnet-learning-sound-representations-from-unlabeled-video ©2016 Author(s)
Version: Final published version