Speaker Anonymization using End-to-End Zero-Shot Voice Conversion
Author(s)
Kang, Wonjune
Advisor
Roy, Deb
Abstract
Spoken language is a rich medium of communication that combines words with information about emotions, feelings, and excitation conveyed through modulations in tone and pitch. In discourse, this preserves a human element that is lacking in many other channels, such as writing or social media. However, a person's voice is a distinct biomarker, and there are many settings in which it may need to be anonymized to protect the speaker's identity.
This thesis presents a framework for performing speaker anonymization using voice conversion (VC) methods. We first introduce a model for performing end-to-end zero-shot voice conversion by modifying the architecture of a neural vocoder. To the best of our knowledge, this is one of the first end-to-end approaches proposed for zero-shot VC. Our model maintains the clarity and intelligibility of the transformed speech while also achieving good voice style transfer performance, an improvement over current state-of-the-art VC models, which exhibit a trade-off between audio quality and accurate voice style transfer.
Next, we present a method for extending targeted voice conversion to untargeted voice anonymization. This is done by fitting a Gaussian mixture model (GMM) to the latent space of speaker embeddings that are fed into the VC model, and then sampling from the GMM to select the target voice for anonymization. This removes the need to explicitly specify a target speaker when performing VC-based anonymization.
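The GMM-based target selection described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the diagonal covariances, the number of components, the 4-dimensional toy "embeddings", and the helper names `fit_diag_gmm` and `sample_pseudo_speaker` are all assumptions made for illustration. In practice the embeddings would come from the VC model's speaker encoder, and the sampled vector would be passed to the VC model in place of a real target speaker's embedding.

```python
import math
import random


def fit_diag_gmm(X, k, n_iter=50, seed=0, eps=1e-6):
    """Fit a diagonal-covariance GMM to the rows of X using plain EM.

    Hypothetical helper: k, n_iter and the diagonal covariance are
    illustrative choices, not values taken from the thesis.
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    means = [list(x) for x in rng.sample(X, k)]  # initialise means from data points
    variances = [[1.0] * d for _ in range(k)]
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: per-point responsibilities, computed in log space for stability
        resp = []
        for x in X:
            logp = []
            for j in range(k):
                ll = math.log(weights[j])
                for t in range(d):
                    v = variances[j][t]
                    ll -= 0.5 * (math.log(2 * math.pi * v) + (x[t] - means[j][t]) ** 2 / v)
                logp.append(ll)
            m = max(logp)
            p = [math.exp(l - m) for l in logp]
            s = sum(p)
            resp.append([pj / s for pj in p])
        # M-step: re-estimate weights, means and variances from responsibilities
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12  # guard against empty components
            weights[j] = nj / n
            for t in range(d):
                mu = sum(r[j] * x[t] for r, x in zip(resp, X)) / nj
                var = sum(r[j] * (x[t] - mu) ** 2 for r, x in zip(resp, X)) / nj
                means[j][t] = mu
                variances[j][t] = var + eps  # variance floor keeps the density finite
    return weights, means, variances


def sample_pseudo_speaker(gmm, rng):
    """Draw one synthetic speaker embedding: pick a component, then sample from it."""
    weights, means, variances = gmm
    j = rng.choices(range(len(weights)), weights=weights)[0]
    return [rng.gauss(mu, math.sqrt(v)) for mu, v in zip(means[j], variances[j])]


# Toy stand-in for real speaker embeddings: two clusters in 4 dimensions.
rng = random.Random(0)
toy_embeddings = [
    [rng.gauss(center, 0.3) for _ in range(4)]
    for center in (-2.0, 2.0)
    for _ in range(100)
]
gmm = fit_diag_gmm(toy_embeddings, k=2)
target_embedding = sample_pseudo_speaker(gmm, rng)
```

Because the target is drawn from the fitted distribution rather than copied from an enrolled speaker, each call to `sample_pseudo_speaker` yields an embedding that need not correspond to any real person in the training data.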
We evaluate both our voice conversion and anonymization methods on publicly available data as well as real-world audio from conversations on the Local Voices Network (LVN) platform, demonstrating their applicability to "in-the-wild" settings. Finally, we provide a discussion of this work's potential applications and the ethical considerations of using voice conversion technologies in society.
Date issued
2022-05
Department
Program in Media Arts and Sciences (Massachusetts Institute of Technology)
Publisher
Massachusetts Institute of Technology