DSpace@MIT

Listening by Synthesizing

Author(s)
Cherep, Manuel
Download: Thesis PDF (8.684 MB)
Advisor
Machover, Tod
Terms of use
In Copyright - Educational Use Permitted. Copyright retained by author(s). https://rightsstatements.org/page/InC-EDU/1.0/
Abstract
Generative audio models offer a scalable solution for producing a rich variety of sounds, which is useful for practical tasks like sound design in music, film, and other media. However, these models overwhelmingly rely on deep neural networks, whose massive complexity hinders our ability to fully leverage them in many scenarios, as they are not easily controllable or interpretable. In this thesis, I propose an alternative approach built on a virtual modular synthesizer: a computational model with modules for controlling, generating, and processing sound that connect together to produce diverse sounds. This approach has the advantage of using only a small number of physically motivated parameters, each of which is intuitively controllable and causally interpretable in terms of its influence on the output sound. The design takes inspiration from devices long used in sound design and combines them with state-of-the-art machine learning techniques. I present three projects that use this formulation. The first is SynthAX, a virtual modular synthesizer that implements the core computational elements in an accelerated framework. The second, CTAG, combines the synthesizer with an audio-language model into a novel method for text-to-audio synthesis via parameter inference; it produces more abstract, sketch-like sounds that are distinctive and perceived as artistic, yet about as identifiable as the output of recent neural audio synthesis models. The third is audio doppelgängers: sounds generated by randomly perturbing the parameters of the synthesizer to create positive pairs for contrastive learning, encompassing more of the variety found in real-world recordings, with controlled variations in timbre, pitch, and temporal envelope. This method offers an efficient alternative to collecting real-world data, producing robust audio representations that compete with those learned from real data on established audio classification benchmarks.
This thesis contributes tools for generating rich and diverse sounds in an interpretable way, and for using these sounds and their parameters for sound design and sound understanding at scale.
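The parameter-inference idea behind CTAG can be illustrated with a toy stand-in. In the sketch below, a spectral-centroid target plays the role of the text-audio similarity objective, a single sine oscillator stands in for the modular synthesizer, and plain random search stands in for the optimizer; every function name and parameter here is an illustrative assumption, not the thesis code or the CTAG API.

```python
import numpy as np

def render_tone(params, sr=16000, dur=0.25):
    """Toy one-oscillator 'synthesizer' (stand-in for a modular synth)."""
    freq, amp = params
    t = np.arange(int(sr * dur)) / sr
    return amp * np.sin(2 * np.pi * freq * t)

def spectral_centroid(audio, sr=16000):
    """Magnitude-weighted mean frequency of the signal's spectrum."""
    mag = np.abs(np.fft.rfft(audio))
    freqs = np.fft.rfftfreq(len(audio), 1 / sr)
    return float((freqs * mag).sum() / (mag.sum() + 1e-9))

def infer_params(target_centroid, iters=200, seed=0):
    """Search synth parameter space for the patch whose rendered audio
    best matches the target -- the role an audio-language similarity
    score plays in text-to-audio synthesis via parameter inference."""
    rng = np.random.default_rng(seed)
    best, best_err = None, np.inf
    for _ in range(iters):
        params = np.array([rng.uniform(100, 4000), 0.8])
        err = abs(spectral_centroid(render_tone(params)) - target_centroid)
        if err < best_err:
            best, best_err = params, err
    return best

params = infer_params(target_centroid=1000.0)
```

Because the search returns parameters rather than raw audio, the result stays editable: a user can nudge the recovered frequency or amplitude afterward, which is the controllability argument the abstract makes.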
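The audio-doppelgängers construction can likewise be sketched in a few lines: render a patch, render a slightly perturbed copy of the same parameters, and treat the two as a positive pair for contrastive learning. The three-parameter "synthesizer" and all names below are illustrative assumptions under a minimal setup, not the thesis implementation.

```python
import numpy as np

def render_patch(params, sr=16000, dur=0.5):
    """Toy patch: sine oscillator with exponential decay, driven by three
    interpretable parameters (frequency, amplitude, decay rate)."""
    freq, amp, decay = params
    t = np.arange(int(sr * dur)) / sr
    return amp * np.exp(-decay * t) * np.sin(2 * np.pi * freq * t)

def doppelganger_pair(params, scale=0.05, seed=0):
    """Render the same parameters and a randomly perturbed copy,
    yielding a positive pair with controlled timbral variation."""
    rng = np.random.default_rng(seed)
    perturbed = params * (1.0 + scale * rng.standard_normal(params.shape))
    return render_patch(params), render_patch(perturbed)

params = np.array([440.0, 0.8, 3.0])  # freq (Hz), amplitude, decay rate
a, b = doppelganger_pair(params)
# a and b are close in parameter space, hence perceptually related,
# but not identical -- the variation a contrastive loss pulls together.
```

Because the pairs come from a synthesizer rather than recordings, the amount and kind of variation (pitch, timbre, envelope) is set directly by which parameters are perturbed and by how much, which is the control the abstract highlights.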
Date issued
2024-09
URI
https://hdl.handle.net/1721.1/157728
Department
Program in Media Arts and Sciences (Massachusetts Institute of Technology)
Publisher
Massachusetts Institute of Technology

Collections
  • Graduate Theses
