Show simple item record

dc.contributor.advisor: James Glass. (en_US)
dc.contributor.author: Drexler, Jennifer Fox. (en_US)
dc.contributor.other: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science. (en_US)
dc.date.accessioned: 2021-01-06T19:35:21Z
dc.date.available: 2021-01-06T19:35:21Z
dc.date.copyright: 2020 (en_US)
dc.date.issued: 2020 (en_US)
dc.identifier.uri: https://hdl.handle.net/1721.1/129248
dc.description: Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, September 2020 (en_US)
dc.description: Cataloged from student-submitted PDF of thesis. (en_US)
dc.description: Includes bibliographical references (pages 131-140). (en_US)
dc.description.abstract: In this thesis, we explore the problem of training end-to-end neural network models for automatic speech recognition (ASR) when limited training data are available. End-to-end models are theoretically well-suited to low-resource languages because they do not rely on expert linguistic resources, but they are difficult to train without large amounts of transcribed speech. This amount of training data is prohibitively expensive to acquire in most of the world's languages. We present several methods for improving end-to-end neural network-based ASR in low-resource scenarios. First, we explore two methods for creating a shared embedding space for speech and text. In doing so, we learn representations of speech that contain only linguistic content and not, for example, the speaker or noise characteristics in the speech signal. These linguistic-only representations allow the ASR model to generalize better to unseen speech by discouraging the model from learning spurious correlations between the text transcripts and extra-linguistic factors in speech. This shared embedding space also enables semi-supervised training of some parameters of the ASR model with additional text. Next, we experiment with two techniques for probabilistically segmenting text into subword units during training. We introduce the n-gram maximum likelihood loss, which allows the ASR model to learn an inventory of acoustically-inspired subword units as part of the training process. We show that this technique combines well with the embedding space alignment techniques described above, leading to a 44% relative improvement in word error rate in the lowest resource condition tested. (en_US) (An illustrative code sketch of the segmentation idea follows this record.)
dc.description.statementofresponsibility: by Jennifer Fox Drexler. (en_US)
dc.format.extent: 140 pages (en_US)
dc.language.iso: eng (en_US)
dc.publisher: Massachusetts Institute of Technology (en_US)
dc.rights: MIT theses may be protected by copyright. Please reuse MIT thesis content according to the MIT Libraries Permissions Policy, which is available through the URL provided. (en_US)
dc.rights.uri: http://dspace.mit.edu/handle/1721.1/7582 (en_US)
dc.subject: Electrical Engineering and Computer Science. (en_US)
dc.title: Improving end-to-end neural network models for low-resource automatic speech recognition (en_US)
dc.type: Thesis (en_US)
dc.description.degree: Ph. D. (en_US)
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science (en_US)
dc.identifier.oclc: 1227518442 (en_US)
dc.description.collection: Ph.D. Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science (en_US)
dspace.imported: 2021-01-06T19:35:19Z (en_US)
mit.thesis.degree: Doctoral (en_US)
mit.thesis.department: EECS (en_US)
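
The abstract above sketches two ideas in prose: aligning speech and text in a shared embedding space, and probabilistically segmenting transcripts into subword units via an n-gram maximum likelihood loss. As a rough, non-authoritative illustration of the second idea (not code from the thesis), the snippet below computes the likelihood of a transcript marginalized over every possible segmentation into a fixed subword inventory, using a unigram model and the standard forward dynamic program; the thesis's loss additionally conditions each unit on its n-gram context and learns the inventory jointly with the ASR model. All identifiers and the toy inventory here are illustrative assumptions.

```python
import math
from typing import Dict

def log_add(a: float, b: float) -> float:
    """Numerically stable log(exp(a) + exp(b))."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def marginal_log_likelihood(text: str,
                            subword_logprobs: Dict[str, float],
                            max_piece_len: int = 8) -> float:
    """Log P(text), summed over all segmentations into inventory units.

    alpha[i] accumulates the log-probability of every segmentation of
    the prefix text[:i]; each step sums over the last subword piece.
    """
    n = len(text)
    alpha = [-math.inf] * (n + 1)
    alpha[0] = 0.0  # the empty prefix has probability 1
    for i in range(1, n + 1):
        for j in range(max(0, i - max_piece_len), i):
            lp = subword_logprobs.get(text[j:i])
            if lp is not None:
                alpha[i] = log_add(alpha[i], alpha[j] + lp)
    return alpha[n]

# Toy inventory of subword units with unigram log-probabilities.
inventory = {"th": math.log(0.3), "e": math.log(0.3),
             "t": math.log(0.2), "h": math.log(0.2)}

# Sums over both segmentations of "the": [th, e] and [t, h, e].
print(marginal_log_likelihood("the", inventory))  # log(0.09 + 0.012)
```

Maximizing this quantity with respect to the unit probabilities (or backpropagating through a differentiable version of the same sum in a neural model) is what lets the unit inventory itself be learned during training, which is the role the abstract ascribes to the n-gram maximum likelihood loss.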

