Spoken ObjectNet: Creating a Bias-Controlled Spoken Caption Dataset
Author(s)
Palmer, Ian A.
Advisor
Glass, James R.
Abstract
Visually-grounded spoken language datasets can enable models to learn cross-modal correspondences with very weak supervision. However, modern audio-visual datasets contain biases that undermine the real-world performance of models trained on that data. We introduce Spoken ObjectNet, which is designed to remove some of these biases and provide a way to better evaluate how effectively models will perform in real-world scenarios. This dataset expands upon ObjectNet, which is a large-scale image dataset that features controls for biases encoded into many other common image datasets.
We detail our data collection pipeline, which features several methods to improve caption quality, including automated language model checks. We also present an analysis of the vocabulary of our collected captions. Lastly, we show baseline results on several audio-visual machine learning tasks, including retrieval and machine captioning. These results show that models trained on other datasets and then evaluated on Spoken ObjectNet tend to perform poorly because they have learned the biases present in those datasets. We also show evidence that the performance decrease is due to the dataset controls, and not the transfer setting. We intend to make our dataset openly available to the general public to encourage new lines of work in training models that are better equipped to operate in the real world.
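The abstract does not specify how the automated language model checks in the collection pipeline are implemented. One plausible approach, sketched below purely as an illustration, is to score each caption transcript with a pretrained language model and flag high-perplexity captions for manual review; the model name ("gpt2"), the helper functions, and the perplexity threshold are all assumptions, not details drawn from the thesis.

```python
# Hypothetical sketch of a language-model check on caption transcripts.
# Assumes plain-text transcripts of the spoken captions and a pretrained GPT-2;
# the actual checks used for Spoken ObjectNet may differ.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def caption_perplexity(text: str) -> float:
    """Return the language-model perplexity of one caption transcript."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token cross-entropy
    return float(torch.exp(loss))

def flag_low_quality(captions, threshold=500.0):
    """Flag captions whose perplexity exceeds a (hypothetical) threshold,
    e.g. disfluent, off-topic, or badly transcribed recordings."""
    return [c for c in captions if caption_perplexity(c) > threshold]

# Example: the second caption would likely be flagged for review.
print(flag_low_quality([
    "a red coffee mug sitting upside down on a wooden table",
    "uh um mug table red red red",
]))
```

A simple perplexity threshold like this only catches fluency problems; in practice it would be combined with other pipeline checks (length limits, vocabulary coverage, human spot checks) before a caption is accepted.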
Date issued
2021-06
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology