Show simple item record

dc.contributor.advisor    James Glass and David Harwath.    en_US
dc.contributor.author    Boggust, Angie Wynne.    en_US
dc.contributor.other    Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.    en_US
dc.date.accessioned    2020-09-15T21:54:53Z
dc.date.available    2020-09-15T21:54:53Z
dc.date.copyright    2020    en_US
dc.date.issued    2020    en_US
dc.identifier.uri    https://hdl.handle.net/1721.1/127377
dc.description    Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, May, 2020    en_US
dc.description    Cataloged from the official PDF of thesis.    en_US
dc.description    Includes bibliographical references (pages 71-77).    en_US
dc.description.abstract    Human babies possess the innate ability to learn language using correspondences between what they see and what they hear. Yet, current unsupervised methods for learning visually grounded language often rely on time-consuming and expensive data annotation, such as spoken language captions of images or textual summaries of video. In this thesis, we eliminate the need for annotation by learning an audio-visual grounding between instructional videos and their audio waveforms. We present two methods capable of learning a joint audio-visual embedding space from video input. In the first method, we apply the DAVEnet model architecture (Harwath et al., 2016) to visual frames and audio segments extracted from over 3000 instructional cooking videos. In the second method, we introduce the Video DAVEnet architecture -- an unsupervised network that learns a joint audio-visual embedding space from raw video -- and apply it to 1.2 million publicly available instructional YouTube videos. While the methods we compare to learn from video and human-generated textual summaries, our methods achieve state-of-the-art performance on downstream audio and visual recall tasks using only raw video data. Finally, we analyze the learned audio-visual embedding space and show that our models learn salient audio-visual concepts, such as "oil", "onion", and "fry", when applied to cooking videos from the YouCook2 dataset (Zhou et al., 2018a).    en_US
dc.description.statementofresponsibility    by Angie Wynne Boggust.    en_US
dc.format.extent    77 pages    en_US
dc.language.iso    eng    en_US
dc.publisher    Massachusetts Institute of Technology    en_US
dc.rights    MIT theses may be protected by copyright. Please reuse MIT thesis content according to the MIT Libraries Permissions Policy, which is available through the URL provided.    en_US
dc.rights.uri    http://dspace.mit.edu/handle/1721.1/7582    en_US
dc.subject    Electrical Engineering and Computer Science.    en_US
dc.title    Unsupervised audio-visual learning in the wild    en_US
dc.type    Thesis    en_US
dc.description.degree    M. Eng.    en_US
dc.contributor.department    Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science    en_US
dc.identifier.oclc    1192539279    en_US
dc.description.collection    M.Eng. Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science    en_US
dspace.imported    2020-09-15T21:54:53Z    en_US
mit.thesis.degree    Master    en_US
mit.thesis.department    EECS    en_US
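
The abstract above describes training a joint embedding space in which co-occurring video frames and audio segments score higher than mismatched pairs. The listing below is a minimal PyTorch sketch of that idea, assuming a DAVEnet-style dual encoder (one image branch, one audio-spectrogram branch) trained with an in-batch margin ranking loss; the layer sizes, spectrogram features, and loss details are illustrative assumptions, not the exact configuration used in the thesis.

    # Minimal sketch of a DAVEnet-style joint audio-visual embedding.
    # Illustrative assumptions only: layer sizes, pooling, and the margin
    # ranking loss are placeholders, not the thesis's exact configuration.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class ImageEncoder(nn.Module):
        """Maps an RGB video frame to a d-dimensional embedding."""
        def __init__(self, dim=512):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(128, dim, 3, stride=2, padding=1), nn.ReLU(),
            )

        def forward(self, frames):                 # (B, 3, H, W)
            feats = self.conv(frames)              # (B, dim, H', W')
            return F.adaptive_avg_pool2d(feats, 1).flatten(1)  # (B, dim)


    class AudioEncoder(nn.Module):
        """Maps a log-mel spectrogram segment into the same embedding space."""
        def __init__(self, n_mels=40, dim=512):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv1d(n_mels, 128, 11, stride=2, padding=5), nn.ReLU(),
                nn.Conv1d(128, 256, 11, stride=2, padding=5), nn.ReLU(),
                nn.Conv1d(256, dim, 11, stride=2, padding=5), nn.ReLU(),
            )

        def forward(self, spec):                   # (B, n_mels, T)
            feats = self.conv(spec)                # (B, dim, T')
            return feats.mean(dim=2)               # (B, dim)


    def ranking_loss(img_emb, aud_emb, margin=1.0):
        """Margin ranking loss over in-batch impostors: matched frame/audio
        pairs should score higher than mismatched pairs in both directions."""
        img_emb = F.normalize(img_emb, dim=1)
        aud_emb = F.normalize(aud_emb, dim=1)
        scores = img_emb @ aud_emb.t()             # (B, B) similarity matrix
        pos = scores.diag().unsqueeze(1)           # matched-pair scores
        cost_a = (margin + scores - pos).clamp(min=0)      # image -> audio
        cost_i = (margin + scores - pos.t()).clamp(min=0)  # audio -> image
        mask = 1.0 - torch.eye(scores.size(0), device=scores.device)
        return ((cost_a + cost_i) * mask).mean()


    if __name__ == "__main__":
        frames = torch.randn(8, 3, 224, 224)       # a toy batch of video frames
        specs = torch.randn(8, 40, 1024)           # their co-occurring audio segments
        loss = ranking_loss(ImageEncoder()(frames), AudioEncoder()(specs))
        loss.backward()
        print(f"toy loss: {loss.item():.3f}")

At evaluation time, the same similarity matrix supports the recall tasks mentioned in the abstract: rank all audio segments against a query frame (and vice versa) and report how often the matching segment appears in the top results.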

