Unsupervised Phonetic Category Learning from
Audio and Visual Input

Zhi, Sophia

dc.contributor.advisor	Levy, Roger
dc.contributor.author	Zhi, Sophia
dc.date.accessioned	2023-07-31T19:57:03Z
dc.date.available	2023-07-31T19:57:03Z
dc.date.issued	2023-06
dc.date.submitted	2023-06-06T16:35:03.470Z
dc.identifier.uri	https://hdl.handle.net/1721.1/151659
dc.description.abstract	Understanding how children learn the phonetic categories of their native language is an open area of research in cognitive science and child language development. However, despite experimental evidence that phonetic processing is very often a multimodal phenomenon (involving both auditory and visual cues), computational research has primarily modeled phonetic category learning as a function of only auditory input. In this thesis, I investigate whether multimodal information benefits phonetic category learning under a clustering model. Due to the lack of an appropriate dataset, I also introduce a method for creating a high-quality dataset of synthetic videos of speakers’ faces for an existing audio corpus. This model trained and tested on audiovisual data achieves up to a 9.1% improvement on a phoneme discrimination battery over the random baseline compared to a model trained and tested on only audio data. The audiovisual model also outperforms the audio model by up to 4.7% over the baseline when both are tested on audio-only data, suggesting that visual information guides the learner towards better clusters. Further analysis indicates that visual information benefits most, but not all, phonemic contrasts. In follow-up analyses, I investigate the learned audiovisual clusters and their relationship to auditory gestures and phones, finding that the clusters capture a unit of speech smaller than phonemes. This work demonstrates the benefit of visual information to a computational model of phonetic category learning, suggesting that children may benefit substantively by using visual cues while learning phonetic categories.
dc.publisher	Massachusetts Institute of Technology
dc.rights	In Copyright - Educational Use Permitted
dc.rights	Copyright retained by author(s)
dc.rights.uri	https://rightsstatements.org/page/InC-EDU/1.0/
dc.title	Unsupervised Phonetic Category Learning from Audio and Visual Input
dc.type	Thesis
dc.description.degree	M.Eng.
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree	Master
thesis.degree.name	Master of Engineering in Electrical Engineering and Computer Science

Files in this item

Name:: zhi-szhi-meng-eecs-2023-thesis.pdf
Size:: 2.938Mb
Format:: PDF
Description:: Thesis PDF

View/Open

This item appears in the following Collection(s)

Graduate Theses

Show simple item record

Unsupervised Phonetic Category Learning from Audio and Visual Input

Files in this item

This item appears in the following Collection(s)