dc.contributor.author | Gan, Chuang | |
dc.contributor.author | Zhao, Hang | |
dc.contributor.author | Chen, Peihao | |
dc.contributor.author | Cox, David | |
dc.contributor.author | Torralba, Antonio | |
dc.date.accessioned | 2021-11-03T12:58:38Z | |
dc.date.available | 2021-11-03T12:58:38Z | |
dc.date.issued | 2019-10 | |
dc.identifier.uri | https://hdl.handle.net/1721.1/137172 | |
dc.description.abstract | © 2019 IEEE. Humans are able to localize objects in the environment using both visual and auditory cues, integrating information from multiple modalities into a common reference frame. We introduce a system that can leverage unlabeled audiovisual data to learn to localize objects (moving vehicles) in a visual reference frame, purely using stereo sound at inference time. Since it is labor-intensive to manually annotate the correspondences between audio and object bounding boxes, we achieve this goal by using the co-occurrence of visual and audio streams in unlabeled videos as a form of self-supervision, without resorting to the collection of ground truth annotations. In particular, we propose a framework that consists of a vision "teacher" network and a stereo-sound "student" network. During training, knowledge embodied in a well-established visual vehicle detection model is transferred to the audio domain using unlabeled videos as a bridge. At test time, the stereo-sound student network can work independently to perform object localization using just stereo audio and camera metadata, without any visual input. Experimental results on a newly collected Auditory Vehicles Tracking dataset verify that our proposed approach outperforms several baseline approaches. We also demonstrate that our cross-modal auditory localization approach can assist in the visual localization of moving vehicles under poor lighting conditions. | en_US |
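The teacher-student transfer described in the abstract can be illustrated with a minimal sketch, assuming PyTorch. The module name StereoSoundStudent, the network shapes, and the MSE loss on teacher-predicted box coordinates are illustrative assumptions, not the paper's exact architecture; the point is that a frozen vision detector labels unlabeled video frames, and the audio network learns to reproduce those labels from stereo sound alone.

```python
# Minimal sketch of cross-modal teacher-student training, assuming PyTorch.
# The vision "teacher" is a frozen, pretrained vehicle detector whose box
# predictions on unlabeled video frames serve as pseudo-labels for the
# stereo-sound "student". Shapes, module names, and the MSE objective are
# illustrative assumptions, not the authors' exact implementation.
import torch
import torch.nn as nn

class StereoSoundStudent(nn.Module):
    """Maps a 2-channel (stereo) spectrogram to bounding-box coordinates."""
    def __init__(self, num_boxes=1):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(2, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, 4 * num_boxes)  # (x1, y1, x2, y2) per box

    def forward(self, spectrogram):
        feats = self.encoder(spectrogram).flatten(1)
        return self.head(feats)

def train_step(teacher, student, optimizer, frames, spectrograms):
    """One self-supervised step: the frozen teacher labels the video frames,
    and the student learns to predict the same boxes from stereo sound."""
    with torch.no_grad():
        pseudo_boxes = teacher(frames)      # teacher sees the video frame
    pred_boxes = student(spectrograms)      # student hears only stereo audio
    loss = nn.functional.mse_loss(pred_boxes, pseudo_boxes)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference time only the student and the stereo audio stream are needed, which is what allows localization without any visual input, as the abstract states.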
dc.language.iso | en | |
dc.publisher | IEEE | en_US |
dc.relation.isversionof | 10.1109/iccv.2019.00715 | en_US |
dc.rights | Creative Commons Attribution-Noncommercial-Share Alike | en_US |
dc.rights.uri | http://creativecommons.org/licenses/by-nc-sa/4.0/ | en_US |
dc.source | arXiv | en_US |
dc.title | Self-Supervised Moving Vehicle Tracking With Stereo Sound | en_US |
dc.type | Article | en_US |
dc.identifier.citation | Gan, Chuang, Zhao, Hang, Chen, Peihao, Cox, David and Torralba, Antonio. 2019. "Self-Supervised Moving Vehicle Tracking With Stereo Sound." Proceedings of the IEEE International Conference on Computer Vision, 2019-October. | |
dc.contributor.department | Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory | |
dc.contributor.department | MIT-IBM Watson AI Lab | |
dc.relation.journal | Proceedings of the IEEE International Conference on Computer Vision | en_US |
dc.eprint.version | Author's final manuscript | en_US |
dc.type.uri | http://purl.org/eprint/type/ConferencePaper | en_US |
eprint.status | http://purl.org/eprint/status/NonPeerReviewed | en_US |
dc.date.updated | 2021-04-15T17:27:29Z | |
dspace.orderedauthors | Gan, C; Zhao, H; Chen, P; Cox, D; Torralba, A | en_US |
dspace.date.submission | 2021-04-15T17:27:30Z | |
mit.journal.volume | 2019-October | en_US |
mit.license | OPEN_ACCESS_POLICY | |
mit.metadata.status | Authority Work and Publication Information Needed | en_US |