dc.contributor.advisor | James R. Glass. | en_US |
dc.contributor.author | Hsu, Wei-Ning, Ph. D. Massachusetts Institute of Technology | en_US |
dc.contributor.other | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science. | en_US |
dc.date.accessioned | 2018-09-17T15:55:42Z | |
dc.date.available | 2018-09-17T15:55:42Z | |
dc.date.copyright | 2018 | en_US |
dc.date.issued | 2018 | en_US |
dc.identifier.uri | http://hdl.handle.net/1721.1/118059 | |
dc.description | Thesis: S.M., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2018. | en_US |
dc.description | Cataloged from PDF version of thesis. | en_US |
dc.description | Includes bibliographical references (pages 121-128). | en_US |
dc.description.abstract | Despite recent successes in machine learning, artificial intelligence is still far from matching human intelligence in many ways. Two important aspects are transferability and the amount of supervision required. Take speech recognition for example: while humans can easily adapt to a new accent without explicit supervision (i.e., ground-truth transcripts for speech in a new accent), current machine learning techniques still struggle with such a scenario. We argue that an essential component of human learning is unsupervised or weakly supervised representation learning, which transforms input signals into low-dimensional representations that facilitate subsequent structured learning and knowledge acquisition. In this thesis, we develop unsupervised representation learning frameworks for speech data. We start by investigating an existing variational autoencoder (VAE) model for learning latent representations, and derive novel latent space operations for speech transformation. The transformation method is applied to unsupervised domain adaptation problems, which addresses the transferability issues of supervised machine learning frameworks. We then extend the VAE models and propose a novel factorized hierarchical variational autoencoder (FHVAE), which better models the generative process of sequential data and learns latent representations that are not only disentangled but also interpretable, without any supervision. By leveraging this interpretability, we demonstrate that such representations can be applied to a wide range of tasks, including but not limited to: voice conversion, denoising, speaker verification, speaker-invariant phonetic feature extraction, and noise-invariant phonetic feature extraction. In the last part of this thesis, we examine scalability issues in the original FHVAE training algorithm in terms of runtime, memory, and optimization stability. Based on our analysis, we propose a hierarchical sampling algorithm for training, which enables training of FHVAE models on arbitrarily large datasets. | en_US |
dc.description.statementofresponsibility | by Wei-Ning Hsu. | en_US |
dc.format.extent | 128 pages | en_US |
dc.language.iso | eng | en_US |
dc.publisher | Massachusetts Institute of Technology | en_US |
dc.rights | MIT theses are protected by copyright. They may be viewed, downloaded, or printed from this source but further reproduction or distribution in any format is prohibited without written permission. | en_US |
dc.rights.uri | http://dspace.mit.edu/handle/1721.1/7582 | en_US |
dc.subject | Electrical Engineering and Computer Science. | en_US |
dc.title | Unsupervised learning of disentangled representations for speech with neural variational inference models | en_US |
dc.type | Thesis | en_US |
dc.description.degree | S.M. | en_US |
dc.contributor.department | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science | en_US |
dc.identifier.oclc | 1051460462 | en_US |