Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input

Harwath, David F.; Recasens, Adria; Suris Coll-Vinent, Didac; Chuang, Galen; Torralba, Antonio; Glass, James R

dc.contributor.author	Harwath, David F.
dc.contributor.author	Recasens, Adria
dc.contributor.author	Suris Coll-Vinent, Didac
dc.contributor.author	Chuang, Galen
dc.contributor.author	Torralba, Antonio
dc.contributor.author	Glass, James R
dc.date.accessioned	2020-01-20T17:03:22Z
dc.date.available	2020-01-20T17:03:22Z
dc.date.issued	2018-10-06
dc.date.submitted	2018-04-04
dc.identifier.isbn	9783030012304
dc.identifier.isbn	9783030012311
dc.identifier.issn	0302-9743
dc.identifier.issn	1611-3349
dc.identifier.uri	https://hdl.handle.net/1721.1/123476
dc.description.abstract	In this paper, we explore neural network models that learn to associate segments of spoken audio captions with the semantically relevant portions of natural images that they refer to. We demonstrate that these audio-visual associative localizations emerge from network-internal representations learned as a by-product of training to perform an image-audio retrieval task. Our models operate directly on the image pixels and speech waveform, and do not rely on any conventional supervision in the form of labels, segmentations, or alignments between the modalities during training. We perform analysis using the Places 205 and ADE20k datasets demonstrating that our models implicitly learn semantically-coupled object and word detectors. Keywords: vision and language; sound; speech; convolutional networks; multimodal learning; unsupervised learning	en_US
dc.language.iso	en
dc.publisher	Springer International Publishing	en_US
dc.relation.isversionof	http://dx.doi.org/10.1007/978-3-030-01231-1_40	en_US
dc.rights	Creative Commons Attribution-Noncommercial-Share Alike	en_US
dc.rights.uri	http://creativecommons.org/licenses/by-nc-sa/4.0/	en_US
dc.source	arXiv	en_US
dc.title	Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input	en_US
dc.type	Book	en_US
dc.identifier.citation	Harwath, David et al. "Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input." Computer Vision – ECCV 2018, September 8–14, 2018, Munich, Germany, edited by V. Ferrari et al., Springer, 2018	en_US
dc.contributor.department	Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory	en_US
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science	en_US
dc.relation.journal	Computer Vision – ECCV 2018	en_US
dc.eprint.version	Original manuscript	en_US
dc.type.uri	http://purl.org/eprint/type/ConferencePaper	en_US
eprint.status	http://purl.org/eprint/status/NonPeerReviewed	en_US
dc.date.updated	2019-07-11T17:10:06Z
dspace.date.submission	2019-07-11T17:10:08Z
mit.metadata.status	Complete

Files in this item

Name: 1804.01452.pdf
Size: 4.662Mb
Format: PDF
Description: Submitted version

This item appears in the following Collection(s)

MIT Open Access Articles
