Self-Supervised Moving Vehicle Tracking With Stereo Sound
Author(s)
Gan, Chuang; Zhao, Hang; Chen, Peihao; Cox, David; Torralba, Antonio
Open Access Policy
Creative Commons Attribution-Noncommercial-Share Alike
Terms of use
Abstract
© 2019 IEEE. Humans are able to localize objects in the environment using both visual and auditory cues, integrating information from multiple modalities into a common reference frame. We introduce a system that can leverage unlabeled audiovisual data to learn to localize objects (moving vehicles) in a visual reference frame, purely using stereo sound at inference time. Since it is labor-intensive to manually annotate the correspondences between audio and object bounding boxes, we achieve this goal by using the co-occurrence of visual and audio streams in unlabeled videos as a form of self-supervision, without resorting to the collection of ground truth annotations. In particular, we propose a framework that consists of a vision "teacher" network and a stereo-sound "student" network. During training, knowledge embodied in a well-established visual vehicle detection model is transferred to the audio domain using unlabeled videos as a bridge. At test time, the stereo-sound student network can work independently to perform object localization using just stereo audio and camera meta-data, without any visual input. Experimental results on a newly collected Auditory Vehicles Tracking dataset verify that our proposed approach outperforms several baseline approaches. We also demonstrate that our cross-modal auditory localization approach can assist in the visual localization of moving vehicles under poor lighting conditions.
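The teacher-student data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the data are synthetic, the "teacher" is a placeholder function standing in for a pretrained visual detector, and the "student" is a linear least-squares fit swapped in for a trained neural network. Camera meta-data is not modeled. All names (`make_toy_clips`, `teacher_detect`, `fit_student`) and dimensions are hypothetical. The point is only that the student is trained on the teacher's outputs rather than on human annotations, and then predicts boxes from audio alone.

```python
import numpy as np

rng = np.random.default_rng(0)
AUDIO_DIM, BOX_DIM = 8, 4  # assumed sizes; a box is (x, y, w, h)

# Assumption: a fixed linear "acoustic" map ties vehicle position to audio features.
mix = rng.normal(size=(BOX_DIM, AUDIO_DIM))

def make_toy_clips(n):
    """Synthetic stand-in for unlabeled videos: paired frame and audio features."""
    boxes = rng.uniform(0.0, 1.0, size=(n, BOX_DIM))  # latent vehicle boxes
    frames = boxes.copy()     # toy "visual" stream that reveals the box directly
    audio = boxes @ mix       # toy "stereo audio" features that depend on the box
    return frames, audio

def teacher_detect(frames):
    """Placeholder for a frozen, pretrained visual vehicle detector."""
    return frames

def fit_student(audio, pseudo_boxes):
    """Fit a linear audio-to-box map by least squares (stand-in for network training)."""
    weights, *_ = np.linalg.lstsq(audio, pseudo_boxes, rcond=None)
    return weights

# Training: the teacher labels the frames; the student only ever sees audio.
train_frames, train_audio = make_toy_clips(200)
pseudo_boxes = teacher_detect(train_frames)
student_weights = fit_student(train_audio, pseudo_boxes)

# Inference: boxes are predicted from audio alone; frames are used only to evaluate.
test_frames, test_audio = make_toy_clips(50)
audio_only_boxes = test_audio @ student_weights
max_gap = float(np.max(np.abs(audio_only_boxes - teacher_detect(test_frames))))
print(f"max student/teacher gap on held-out clips: {max_gap:.2e}")
```

Because the toy audio features are an exact linear function of the boxes, the student matches the teacher almost perfectly here; real stereo audio would require a learned nonlinear model and would not be expected to agree this closely.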
Date issued
2019-10
Department
Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory; MIT-IBM Watson AI Lab
Journal
Proceedings of the IEEE International Conference on Computer Vision
Publisher
IEEE
Citation
Gan, Chuang, Zhao, Hang, Chen, Peihao, Cox, David and Torralba, Antonio. 2019. "Self-Supervised Moving Vehicle Tracking With Stereo Sound." Proceedings of the IEEE International Conference on Computer Vision, 2019-October.
Version: Author's final manuscript