Show simple item record

dc.contributor.advisor: James Glass. (en_US)
dc.contributor.author: Drexler, Jennifer Fox. (en_US)
dc.contributor.other: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science. (en_US)
dc.date.accessioned: 2021-01-06T19:35:21Z
dc.date.available: 2021-01-06T19:35:21Z
dc.date.copyright: 2020 (en_US)
dc.date.issued: 2020 (en_US)
dc.identifier.uri: https://hdl.handle.net/1721.1/129248
dc.description: Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, September 2020 (en_US)
dc.description: Cataloged from student-submitted PDF of thesis. (en_US)
dc.description: Includes bibliographical references (pages 131-140). (en_US)
dc.description.abstract: In this thesis, we explore the problem of training end-to-end neural network models for automatic speech recognition (ASR) when limited training data are available. End-to-end models are theoretically well-suited to low-resource languages because they do not rely on expert linguistic resources, but they are difficult to train without large amounts of transcribed speech. This amount of training data is prohibitively expensive to acquire in most of the world's languages. We present several methods for improving end-to-end neural network-based ASR in low-resource scenarios. First, we explore two methods for creating a shared embedding space for speech and text. In doing so, we learn representations of speech that contain only linguistic content and not, for example, the speaker or noise characteristics in the speech signal. These linguistic-only representations allow the ASR model to generalize better to unseen speech by discouraging the model from learning spurious correlations between the text transcripts and extra-linguistic factors in speech. This shared embedding space also enables semi-supervised training of some parameters of the ASR model with additional text. Next, we experiment with two techniques for probabilistically segmenting text into subword units during training. We introduce the n-gram maximum likelihood loss, which allows the ASR model to learn an inventory of acoustically-inspired subword units as part of the training process. We show that this technique combines well with the embedding space alignment techniques described above, leading to a 44% relative improvement in word error rate in the lowest resource condition tested. (en_US) (An illustrative code sketch of the segmentation idea follows this record.)
dc.description.statementofresponsibility: by Jennifer Fox Drexler. (en_US)
dc.format.extent: 140 pages (en_US)
dc.language.iso: eng (en_US)
dc.publisher: Massachusetts Institute of Technology (en_US)
dc.rights: MIT theses may be protected by copyright. Please reuse MIT thesis content according to the MIT Libraries Permissions Policy, which is available through the URL provided. (en_US)
dc.rights.uri: http://dspace.mit.edu/handle/1721.1/7582 (en_US)
dc.subject: Electrical Engineering and Computer Science. (en_US)
dc.title: Improving end-to-end neural network models for low-resource automatic speech recognition (en_US)
dc.type: Thesis (en_US)
dc.description.degree: Ph. D. (en_US)
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science (en_US)
dc.identifier.oclc: 1227518442 (en_US)
dc.description.collection: Ph.D. Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science (en_US)
dspace.imported: 2021-01-06T19:35:19Z (en_US)
mit.thesis.degree: Doctoral (en_US)
mit.thesis.department: EECS (en_US)
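
The abstract above sketches two ideas in prose: aligning speech and text in a shared embedding space, and probabilistically segmenting transcripts into subword units via an n-gram maximum likelihood loss. As a rough, non-authoritative illustration of the second idea (not code from the thesis), the snippet below computes the likelihood of a transcript marginalized over every possible segmentation into a fixed subword inventory, using a unigram model and the standard forward dynamic program; the thesis's loss additionally conditions each unit on its n-gram context and learns the inventory jointly with the ASR model. All identifiers and the toy inventory here are illustrative assumptions.

```python
import math
from typing import Dict

def log_add(a: float, b: float) -> float:
    """Numerically stable log(exp(a) + exp(b))."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def marginal_log_likelihood(text: str,
                            subword_logprobs: Dict[str, float],
                            max_piece_len: int = 8) -> float:
    """Log P(text), summed over all segmentations into inventory units.

    alpha[i] accumulates the log-probability of every segmentation of
    the prefix text[:i]; each step sums over the last subword piece.
    """
    n = len(text)
    alpha = [-math.inf] * (n + 1)
    alpha[0] = 0.0  # the empty prefix has probability 1
    for i in range(1, n + 1):
        for j in range(max(0, i - max_piece_len), i):
            lp = subword_logprobs.get(text[j:i])
            if lp is not None:
                alpha[i] = log_add(alpha[i], alpha[j] + lp)
    return alpha[n]

# Toy inventory of subword units with unigram log-probabilities.
inventory = {"th": math.log(0.3), "e": math.log(0.3),
             "t": math.log(0.2), "h": math.log(0.2)}

# Sums over both segmentations of "the": [th, e] and [t, h, e].
print(marginal_log_likelihood("the", inventory))  # log(0.09 + 0.012)
```

Maximizing this quantity with respect to the unit probabilities (or backpropagating through a differentiable version of the same sum in a neural model) is what lets the unit inventory itself be learned during training, which is the role the abstract ascribes to the n-gram maximum likelihood loss.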

