Speaker Anonymization using End-to-End Zero-Shot Voice Conversion

Author(s)

Kang, Wonjune

Advisor

Roy, Deb

Terms of use

In Copyright - Educational Use Permitted. Copyright MIT. http://rightsstatements.org/page/InC-EDU/1.0/

Abstract

Spoken language is a rich medium of communication that combines words with information about emotion and excitation, conveyed through modulations in tone and pitch. In discourse, this preserves a human element that is lacking in many other channels, such as writing or social media. However, a person's voice is a distinct biomarker, and there are many settings in which it may need to be anonymized to protect the speaker's identity. This thesis presents a framework for performing speaker anonymization using voice conversion (VC) methods. We first introduce a model for end-to-end zero-shot voice conversion, obtained by modifying the architecture of a neural vocoder. To the best of our knowledge, this is one of the first end-to-end approaches to zero-shot VC. Our model maintains the clarity and intelligibility of transformed speech while also achieving good voice style transfer, an improvement over current state-of-the-art VC models, which exhibit a trade-off between audio quality and accurate voice style transfer. Next, we present a method for extending targeted voice conversion to un-targeted voice anonymization. We fit a Gaussian mixture model (GMM) to the latent space of speaker embeddings that are fed into the VC model, and then sample from the GMM to select the target voice for anonymization. This obviates the need to explicitly specify a target speaker when performing VC-based anonymization. We evaluate both our voice conversion and anonymization methods on publicly available data as well as real-world audio from conversations on the Local Voices Network (LVN) platform, demonstrating their applicability to "in-the-wild" settings. Finally, we discuss this work's potential applications and the ethical considerations of using voice conversion technologies in society.
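
The un-targeted anonymization step described in the abstract (fit a GMM to speaker embeddings, then sample a target voice from it) can be illustrated with a minimal sketch. This is not the thesis's implementation. The function names `fit_diag_gmm` and `sample_pseudo_speaker`, the diagonal covariance, the 16-dimensional embeddings, the choice of 4 mixture components, and the random stand-in data are all assumptions made for illustration; the abstract does not specify any of them.

```python
import numpy as np


def fit_diag_gmm(x, n_components, n_iter=50, seed=0, eps=1e-6):
    """Fit a diagonal-covariance GMM to the rows of x with plain EM.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    n, _ = x.shape
    # Initialise means from random data points, variances from the data.
    means = x[rng.choice(n, size=n_components, replace=False)].copy()
    variances = np.tile(x.var(axis=0) + eps, (n_components, 1))
    weights = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        diff = x[:, None, :] - means[None, :, :]  # shape (n, k, d)
        log_prob = -0.5 * (
            np.sum(diff ** 2 / variances[None], axis=2)
            + np.sum(np.log(2.0 * np.pi * variances), axis=1)[None]
        ) + np.log(weights)[None]
        log_prob -= log_prob.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(log_prob)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances.
        nk = resp.sum(axis=0) + eps
        weights = nk / nk.sum()
        means = (resp.T @ x) / nk[:, None]
        second_moment = (resp.T @ (x ** 2)) / nk[:, None]
        variances = np.maximum(second_moment - means ** 2, eps)
    return weights, means, variances


def sample_pseudo_speaker(weights, means, variances, rng):
    """Draw one embedding: pick a component, then sample its Gaussian."""
    k = rng.choice(len(weights), p=weights)
    noise = rng.standard_normal(means.shape[1])
    return means[k] + np.sqrt(variances[k]) * noise


# Random vectors standing in for real speaker embeddings (assumption).
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((200, 16))

weights, means, variances = fit_diag_gmm(embeddings, n_components=4)
target = sample_pseudo_speaker(weights, means, variances, rng)
```

In a pipeline of this kind, `target` would take the place of a real target speaker's embedding as input to the VC model, so no specific target speaker has to be chosen by hand.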

Date issued

2022-05

URI

https://hdl.handle.net/1721.1/144662

Department

Program in Media Arts and Sciences (Massachusetts Institute of Technology)

Publisher

Massachusetts Institute of Technology

Collections

Graduate Theses