Inferring Shape and Material from Sound
Author(s)
Zhang, Zhoutong
Advisor
Freeman, William T.
Abstract
Humans infer rich knowledge of objects from both auditory and visual cues. Building a machine with such competency, however, is very challenging. One possible solution is supervised learning, which requires a large-scale dataset containing the sounds of various objects with clean labels for their appearance, shape, and material. Capturing such a dataset, however, is difficult and expensive. Another approach is to tackle the problem in an analysis-by-synthesis framework, where we iteratively update the current estimates given a generative model. This, however, requires sophisticated generative models, which are typically too computationally expensive to support iterative inference. Finally, despite the popularity of deep learning methods in auditory perception tasks, most of them are derived from visual recognition tasks and may not be well suited to processing audio.
To address these difficulties, we first present a novel, open-source pipeline that generates audio-visual data purely from 3D object shapes and their physical properties. Using this generative model, we construct a synthetic audio-visual dataset, Sound-20K, for object perception tasks. We further demonstrate that representations learned on synthetic audio-visual data can transfer to real-world scenarios. In addition, the generative model can be made efficient enough to support iterative inference, allowing us to construct an analysis-by-synthesis framework that infers an object's shape and material from the sound of it falling on the ground.
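The analysis-by-synthesis loop mentioned above can be sketched as follows. This is a minimal, hypothetical illustration, not the thesis's implementation: `synthesize` is a toy stand-in for the fast generative audio model, the latent parameters `(shape, material)` are simplified to two scalars, and the update rule is plain random-search rather than whatever inference procedure the thesis actually uses.

```python
import random

def synthesize(params):
    # Toy stand-in for a generative audio model: maps latent
    # (shape, material) parameters to a synthetic feature vector.
    shape, material = params
    return [shape * 2.0, material * 3.0, shape + material]

def loss(observed, synthesized):
    # Discrepancy between the observed sound features and the
    # features synthesized from the current estimate.
    return sum((o - s) ** 2 for o, s in zip(observed, synthesized))

def infer(observed, steps=2000, seed=0):
    # Analysis-by-synthesis via random search: propose a perturbed
    # estimate, re-synthesize, and keep the proposal if it explains
    # the observation better.
    rng = random.Random(seed)
    best = (rng.uniform(0, 1), rng.uniform(0, 1))
    best_loss = loss(observed, synthesize(best))
    for _ in range(steps):
        candidate = tuple(p + rng.gauss(0, 0.05) for p in best)
        cand_loss = loss(observed, synthesize(candidate))
        if cand_loss < best_loss:
            best, best_loss = candidate, cand_loss
    return best, best_loss

# "Observe" a sound produced by an object with known latent parameters,
# then recover those parameters by iterative inference.
true_params = (0.4, 0.7)
observed = synthesize(true_params)
est, err = infer(observed)
```

The loop only works because `synthesize` is cheap enough to call thousands of times, which is exactly why the abstract stresses making the generative model efficient.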
Date issued
2021-06
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology