Speech processing with less supervision : learning from weak labels and multiple modalities

Author(s)

Hsu, Wei-Ning, Ph. D., Massachusetts Institute of Technology.

Other Contributors

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.

Advisor

James R. Glass.

Terms of use

MIT theses may be protected by copyright. Please reuse MIT thesis content according to the MIT Libraries Permissions Policy, which is available through the URL provided. http://dspace.mit.edu/handle/1721.1/7582

Abstract

In recent years, supervised learning has achieved great success in speech processing with powerful neural network models and vast quantities of in-domain labeled data. However, collecting a labeled dataset covering all domains can be either expensive due to the diversity of speech or almost impossible for some tasks such as speech-to-speech translation. Such a paradigm limits the applicability of speech technologies to high-resource settings. In sharp contrast, humans are good at reading the training signals from indirect supervision, such as from a small amount of explicit labels and from different modalities. This capability enables humans to learn from a wider variety of resources, including better domain coverage. In light of this observation, this thesis focuses on learning algorithms for speech processing that can utilize weak and indirect supervision to overcome the restrictions imposed by the supervised paradigm and make the most out of the data at hand for learning.

In the first part of the thesis, we devise a self-training algorithm for speech recognition that distills knowledge from a trained language model, a compact form of external non-speech prior knowledge. The algorithm is inspired by how humans use contextual and prior information to bias speech recognition and produce confident predictions. To distill knowledge within the language model, we implement a beam-search based objective to align the prediction probability with the likelihood of the language model among candidate hypotheses. Experimental results demonstrate state-of-the-art performance that recovers word error rates by up to 90% relative to using the same data with ground-truth transcripts. Moreover, we show that the proposed algorithm can scale to 60,000 hours of unlabeled speech and yield further reduction in word error rates.
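The idea of aligning the recognizer's prediction probability with the language model likelihood over a beam of candidate hypotheses can be illustrated with a minimal sketch. It assumes, purely for illustration, that both sets of scores are renormalized over the beam and compared with a cross-entropy; the function names `log_softmax` and `beam_matching_loss` are hypothetical and this is not presented as the exact objective used in the thesis.

```python
import math


def log_softmax(scores):
    """Normalize log-scores so that they form a log-distribution over the beam."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def beam_matching_loss(model_logprobs, lm_logprobs):
    """Cross-entropy from the LM distribution to the model distribution.

    Both inputs hold one log-score per candidate hypothesis in the beam.
    Assumption: each set of scores is renormalized over the beam only.
    """
    if not model_logprobs or len(model_logprobs) != len(lm_logprobs):
        raise ValueError("expected two non-empty lists of equal length")
    log_p = log_softmax(model_logprobs)  # recognizer's view of the beam
    log_q = log_softmax(lm_logprobs)     # language model's view of the beam
    return -sum(math.exp(lq) * lp for lq, lp in zip(log_q, log_p))


# The loss is smaller when the recognizer ranks hypotheses like the LM does.
agree = beam_matching_loss([-1.0, -3.0], [-1.0, -3.0])
disagree = beam_matching_loss([-3.0, -1.0], [-1.0, -3.0])
```

In a real system the model scores would come from a differentiable recognizer so that the loss could be minimized by gradient descent; plain floats are used here only to keep the sketch self-contained.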

In the second part of the thesis, we present several text-to-speech synthesis models that enable fine-grained control of unlabeled non-textual attributes, including voice, prosody, acoustic environment properties and microphone channel effects. We achieve controllability of unlabeled attributes by formulating a text-to-speech system as a generative model with structured latent variables, and learn this generative process along with an efficient approximate inference model by adopting the variational autoencoder framework. We demonstrate that those latent variables can then be used to control the unlabeled variations in speech, making it possible to build a high-quality speech synthesis model using weakly-labeled mixed-quality speech data as the model learns to control the hidden factors.

In the last part of the thesis, we extend a cross-modal semantic embedding learning framework proposed in Harwath et al. (2019) to learn hierarchical discrete linguistic units from visually grounded speech, a form of multimodal sensory data. By utilizing a discriminative, multimodal grounding objective, the proposed framework forces the learned units to be useful for semantic image retrieval. In contrast, most of the previous work on linguistic unit discovery does not use multimodal data; it considers a reconstruction objective that encourages the learned units to be useful for reconstructing the speech, and hence those units may also encode non-linguistic factors. Experimental results show that the proposed framework outperforms state-of-the-art phonetic unit discovery frameworks by almost 50% on the ZeroSpeech 2019 ABX phone discriminative task, and learns word detectors that discover over 270 words with an F1 score greater than 0.5. In addition, the learned units from the proposed framework are also more robust to nuisance variation compared to frameworks that learn from only speech.
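A discriminative grounding objective of the kind mentioned above is often written as a margin ranking loss over paired speech and image embeddings. The following is a minimal sketch under that assumption; the function names, the dot-product similarity, the use of in-batch impostors, and the margin value are all illustrative choices, not details taken from the thesis.

```python
def dot(u, v):
    """Dot-product similarity between two embedding vectors."""
    return sum(a * b for a, b in zip(u, v))


def margin_ranking_loss(speech_embs, image_embs, margin=1.0):
    """Hinge loss that pushes matched speech/image pairs above mismatched ones.

    speech_embs[i] and image_embs[i] are assumed to describe the same scene.
    Every other item in the batch serves as an impostor (an assumption).
    """
    n = len(speech_embs)
    if n != len(image_embs) or n < 2:
        raise ValueError("need at least two aligned speech/image pairs")
    total = 0.0
    for i in range(n):
        pos = dot(speech_embs[i], image_embs[i])
        for j in range(n):
            if j == i:
                continue
            # Impostor image j for speech i.
            total += max(0.0, margin - pos + dot(speech_embs[i], image_embs[j]))
            # Impostor speech j for image i.
            total += max(0.0, margin - pos + dot(speech_embs[j], image_embs[i]))
    return total / (2 * n * (n - 1))


speech = [[2.0, 0.0], [0.0, 2.0]]
aligned_loss = margin_ranking_loss(speech, [[2.0, 0.0], [0.0, 2.0]])
swapped_loss = margin_ranking_loss(speech, [[0.0, 2.0], [2.0, 0.0]])
```

Because the loss depends only on whether the right image is retrievable from the speech (and vice versa), nothing in it rewards encoding speaker or channel detail, which is the contrast with reconstruction objectives drawn in the abstract.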

Description

Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, May, 2020

Cataloged from the official PDF of thesis.

Includes bibliographical references (pages 191-217).

Date issued

2020

URI

https://hdl.handle.net/1721.1/127021

Department

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science

Publisher

Massachusetts Institute of Technology

Keywords

Electrical Engineering and Computer Science.

Collections

Doctoral Theses