The Sound of Pixels

Zhao, Hang; Gan, Chuang; Rouditchenko, Andrew; Vondrick, Carl Martin; McDermott, Joshua Hartman; Torralba, Antonio

dc.contributor.author	Zhao, Hang
dc.contributor.author	Gan, Chuang
dc.contributor.author	Rouditchenko, Andrew
dc.contributor.author	Vondrick, Carl Martin
dc.contributor.author	McDermott, Joshua Hartman
dc.contributor.author	Torralba, Antonio
dc.date.accessioned	2020-01-20T20:55:35Z
dc.date.available	2020-01-20T20:55:35Z
dc.date.issued	2018-10-06
dc.identifier.isbn	9783030012458
dc.identifier.isbn	9783030012465
dc.identifier.issn	0302-9743
dc.identifier.issn	1611-3349
dc.identifier.uri	https://hdl.handle.net/1721.1/123480
dc.description.abstract	We introduce PixelPlayer, a system that, by leveraging large amounts of unlabeled videos, learns to locate image regions which produce sounds and separate the input sounds into a set of components that represents the sound from each pixel. Our approach capitalizes on the natural synchronization of the visual and audio modalities to learn models that jointly parse sounds and images, without requiring additional manual supervision. Experimental results on a newly collected MUSIC dataset show that our proposed Mix-and-Separate framework outperforms several baselines on source separation. Qualitative results suggest our model learns to ground sounds in vision, enabling applications such as independently adjusting the volume of sound sources. Keywords: Cross-modal learning; Sound separation and localization	en_US
dc.description.sponsorship	National Science Foundation (U.S.) (Grant IIS-1524817)	en_US
dc.language.iso	en
dc.publisher	Springer Nature	en_US
dc.relation.isversionof	http://dx.doi.org/10.1007/978-3-030-01246-5_35	en_US
dc.rights	Creative Commons Attribution-Noncommercial-Share Alike	en_US
dc.rights.uri	http://creativecommons.org/licenses/by-nc-sa/4.0/	en_US
dc.source	arXiv	en_US
dc.title	The Sound of Pixels	en_US
dc.type	Book	en_US
dc.identifier.citation	Zhao, Hang et al. "The Sound of Pixels." Computer Vision – European Conference on Computer Vision (ECCV 2018), September 8-14, 2018, Munich, Germany, edited by V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss. Lecture Notes in Computer Science, vol 11205, pages 587-604. Springer, Cham, 2018. © 2018 Springer Nature	en_US
dc.contributor.department	MIT-IBM Watson AI Lab
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
dc.eprint.version	Author's final manuscript	en_US
dc.type.uri	http://purl.org/eprint/type/ConferencePaper	en_US
eprint.status	http://purl.org/eprint/status/NonPeerReviewed	en_US
dc.date.updated	2019-07-11T17:31:29Z
dspace.date.submission	2019-07-11T17:31:31Z
mit.metadata.status	Complete

Files in this item

Name: 1804.03160.pdf
Size: 5.574Mb
Format: PDF
Description: Accepted version

This item appears in the following Collection(s)

MIT Open Access Articles