
dc.contributor.author: Gan, Chuang
dc.contributor.author: Zhao, Hang
dc.contributor.author: Chen, Peihao
dc.contributor.author: Cox, David
dc.contributor.author: Torralba, Antonio
dc.date.accessioned: 2021-11-03T12:58:38Z
dc.date.available: 2021-11-03T12:58:38Z
dc.date.issued: 2019-10
dc.identifier.uri: https://hdl.handle.net/1721.1/137172
dc.description.abstract: © 2019 IEEE. Humans are able to localize objects in the environment using both visual and auditory cues, integrating information from multiple modalities into a common reference frame. We introduce a system that can leverage unlabeled audiovisual data to learn to localize objects (moving vehicles) in a visual reference frame, purely using stereo sound at inference time. Since it is labor-intensive to manually annotate the correspondences between audio and object bounding boxes, we achieve this goal by using the co-occurrence of visual and audio streams in unlabeled videos as a form of self-supervision, without resorting to the collection of ground truth annotations. In particular, we propose a framework that consists of a vision "teacher" network and a stereo-sound "student" network. During training, knowledge embodied in a well-established visual vehicle detection model is transferred to the audio domain using unlabeled videos as a bridge. At test time, the stereo-sound student network can work independently to perform object localization using just stereo audio and camera meta-data, without any visual input. Experimental results on a newly collected Auditory Vehicles Tracking dataset verify that our proposed approach outperforms several baseline approaches. We also demonstrate that our cross-modal auditory localization approach can assist in the visual localization of moving vehicles under poor lighting conditions.
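
The abstract describes a cross-modal teacher-student setup: a pretrained visual vehicle detector supplies localization targets on unlabeled video, and a stereo-sound network is trained to reproduce them from audio alone. The sketch below illustrates that general distillation pattern only; it is not the authors' implementation, and all module names, the spectrogram input format, the 7x7 output grid, and the loss choice are assumptions made for illustration (PyTorch assumed).

# Illustrative sketch (not the paper's code): a stereo-sound "student"
# trained to match localization maps produced offline by a frozen visual
# "teacher" detector on synchronized frames. Names are hypothetical.
import torch
import torch.nn as nn

class StereoSoundStudent(nn.Module):
    """Maps a stereo spectrogram to coarse localization heatmap logits."""
    def __init__(self, grid=(7, 7)):
        super().__init__()
        self.grid = grid
        self.encoder = nn.Sequential(
            nn.Conv2d(2, 32, 3, stride=2, padding=1), nn.ReLU(),  # 2 = L/R audio channels
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, grid[0] * grid[1])

    def forward(self, spec):  # spec: (B, 2, freq_bins, time_frames)
        return self.head(self.encoder(spec)).view(-1, *self.grid)

def distillation_step(student, optimizer, spectrogram, teacher_heatmap):
    """One training step: regress the visual teacher's soft localization map.

    teacher_heatmap: (B, 7, 7) occupancy map in [0, 1], derived beforehand
    by running a pretrained visual vehicle detector on the video frames.
    """
    optimizer.zero_grad()
    pred = student(spectrogram)
    loss = nn.functional.binary_cross_entropy_with_logits(pred, teacher_heatmap)
    loss.backward()
    optimizer.step()
    return loss.item()

At test time such a student would be run on audio alone, matching the abstract's claim that no visual input is needed at inference.
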
dc.language.iso: en
dc.publisher: IEEE
dc.relation.isversionof: 10.1109/iccv.2019.00715
dc.rights: Creative Commons Attribution-Noncommercial-Share Alike
dc.rights.uri: http://creativecommons.org/licenses/by-nc-sa/4.0/
dc.source: arXiv
dc.title: Self-Supervised Moving Vehicle Tracking With Stereo Sound
dc.type: Article
dc.identifier.citation: Gan, Chuang, Zhao, Hang, Chen, Peihao, Cox, David and Torralba, Antonio. 2019. "Self-Supervised Moving Vehicle Tracking With Stereo Sound." Proceedings of the IEEE International Conference on Computer Vision, 2019-October.
dc.contributor.department: Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
dc.contributor.department: MIT-IBM Watson AI Lab
dc.relation.journal: Proceedings of the IEEE International Conference on Computer Vision
dc.eprint.version: Author's final manuscript
dc.type.uri: http://purl.org/eprint/type/ConferencePaper
eprint.status: http://purl.org/eprint/status/NonPeerReviewed
dc.date.updated: 2021-04-15T17:27:29Z
dspace.orderedauthors: Gan, C; Zhao, H; Chen, P; Cox, D; Torralba, A
dspace.date.submission: 2021-04-15T17:27:30Z
mit.journal.volume: 2019-October
mit.license: OPEN_ACCESS_POLICY
mit.metadata.status: Authority Work and Publication Information Needed

