| dc.contributor.advisor | James R. Glass and David Harwath. | en_US |
| dc.contributor.author | Azuh, Emmanuel Mensah | en_US |
| dc.contributor.other | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science. | en_US |
| dc.date.accessioned | 2020-03-24T15:35:27Z | |
| dc.date.available | 2020-03-24T15:35:27Z | |
| dc.date.copyright | 2019 | en_US |
| dc.date.issued | 2019 | en_US |
| dc.identifier.uri | https://hdl.handle.net/1721.1/124231 | |
| dc.description | This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. | en_US |
| dc.description | Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2019 | en_US |
| dc.description | Cataloged from student-submitted PDF version of thesis. | en_US |
| dc.description | Includes bibliographical references (pages 99-103). | en_US |
| dc.description.abstract | In this thesis, we present a method for the discovery of word-like units and their approximate translations from visually grounded speech across multiple languages. We first train a neural network model to map images and their spoken audio captions in both English and Hindi to a shared, multimodal embedding space. Next, we use this model to segment and cluster regions of the spoken captions which approximately correspond to words. Then, we exploit between-cluster similarities in the embedding space to associate English pseudo-word clusters with Hindi pseudo-word clusters, and show that many of these cluster pairings capture semantic translations between English and Hindi words. We present quantitative cross-lingual clustering results, as well as qualitative results in the form of a bilingual picture dictionary. Finally, we show the same analysis for a joint training using three languages at the same time, with Japanese as the third language. | en_US |
| dc.description.statementofresponsibility | by Emmanuel Azuh Mensah. | en_US |
| dc.format.extent | 103 pages | en_US |
| dc.language.iso | eng | en_US |
| dc.publisher | Massachusetts Institute of Technology | en_US |
| dc.rights | MIT theses are protected by copyright. They may be viewed, downloaded, or printed from this source but further reproduction or distribution in any format is prohibited without written permission. | en_US |
| dc.rights.uri | http://dspace.mit.edu/handle/1721.1/7582 | en_US |
| dc.subject | Electrical Engineering and Computer Science. | en_US |
| dc.title | Towards multilingual lexicon discovery from visually grounded speech | en_US |
| dc.type | Thesis | en_US |
| dc.description.degree | M. Eng. | en_US |
| dc.contributor.department | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science | en_US |
| dc.identifier.oclc | 1144932743 | en_US |
| dc.description.collection | M.Eng. Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science | en_US |
| dspace.imported | 2020-03-24T15:35:27Z | en_US |
| mit.thesis.degree | Master | en_US |
| mit.thesis.department | EECS | en_US |