DSpace@MIT

Listening by Synthesizing

Author(s)
Cherep, Manuel
Download: Thesis PDF (8.684 MB)
Advisor
Machover, Tod
Terms of use
In Copyright - Educational Use Permitted. Copyright retained by author(s). https://rightsstatements.org/page/InC-EDU/1.0/
Abstract
Generative audio models offer a scalable solution for producing a rich variety of sounds, which is useful for practical tasks like sound design in music, film, and other media. However, these models overwhelmingly rely on deep neural networks, whose massive complexity hinders our ability to fully leverage them in many scenarios, as they are not easily controllable or interpretable. In this thesis, I propose an alternative approach built on a virtual modular synthesizer: a computational model with modules for controlling, generating, and processing sound that connect together to produce diverse sounds. This approach has the advantage of using only a small number of physically motivated parameters, each of which is intuitively controllable and causally interpretable in terms of its influence on the output sound. The design takes inspiration from devices long used in sound design and combines them with state-of-the-art machine learning techniques. I present three projects that use this formulation. The first is SynthAX, a virtual modular synthesizer that implements the core computational elements in an accelerated framework. The second, CTAG, combines the synthesizer with an audio-language model into a novel method for text-to-audio synthesis via parameter inference; it produces more abstract, sketch-like sounds that are distinctive and perceived as artistic, yet about as identifiable as the output of recent neural audio synthesis models. The third is audio doppelgängers: sounds generated by randomly perturbing the parameters of the synthesizer to create positive pairs for contrastive learning, encompassing more of the variety found in real-world recordings, with controlled variations in timbre, pitch, and temporal envelope. This method offers an efficient alternative to collecting real-world data, producing robust audio representations that compete with those learned from real data on established audio classification benchmarks.
This thesis contributes tools for generating rich and diverse sounds in an interpretable way, and for using these sounds and their parameters for sound design and sound understanding at scale.
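The parameter-inference idea behind CTAG can be illustrated with a toy stand-in. In the sketch below, a spectral-centroid target plays the role of the text-audio similarity objective, a single sine oscillator stands in for the modular synthesizer, and plain random search stands in for the optimizer; every function name and parameter here is an illustrative assumption, not the thesis code or the CTAG API.

```python
import numpy as np

def render_tone(params, sr=16000, dur=0.25):
    """Toy one-oscillator 'synthesizer' (stand-in for a modular synth)."""
    freq, amp = params
    t = np.arange(int(sr * dur)) / sr
    return amp * np.sin(2 * np.pi * freq * t)

def spectral_centroid(audio, sr=16000):
    """Magnitude-weighted mean frequency of the signal's spectrum."""
    mag = np.abs(np.fft.rfft(audio))
    freqs = np.fft.rfftfreq(len(audio), 1 / sr)
    return float((freqs * mag).sum() / (mag.sum() + 1e-9))

def infer_params(target_centroid, iters=200, seed=0):
    """Search synth parameter space for the patch whose rendered audio
    best matches the target -- the role an audio-language similarity
    score plays in text-to-audio synthesis via parameter inference."""
    rng = np.random.default_rng(seed)
    best, best_err = None, np.inf
    for _ in range(iters):
        params = np.array([rng.uniform(100, 4000), 0.8])
        err = abs(spectral_centroid(render_tone(params)) - target_centroid)
        if err < best_err:
            best, best_err = params, err
    return best

params = infer_params(target_centroid=1000.0)
```

Because the search returns parameters rather than raw audio, the result stays editable: a user can nudge the recovered frequency or amplitude afterward, which is the controllability argument the abstract makes.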
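The audio-doppelgängers construction can likewise be sketched in a few lines: render a patch, render a slightly perturbed copy of the same parameters, and treat the two as a positive pair for contrastive learning. The three-parameter "synthesizer" and all names below are illustrative assumptions under a minimal setup, not the thesis implementation.

```python
import numpy as np

def render_patch(params, sr=16000, dur=0.5):
    """Toy patch: sine oscillator with exponential decay, driven by three
    interpretable parameters (frequency, amplitude, decay rate)."""
    freq, amp, decay = params
    t = np.arange(int(sr * dur)) / sr
    return amp * np.exp(-decay * t) * np.sin(2 * np.pi * freq * t)

def doppelganger_pair(params, scale=0.05, seed=0):
    """Render the same parameters and a randomly perturbed copy,
    yielding a positive pair with controlled timbral variation."""
    rng = np.random.default_rng(seed)
    perturbed = params * (1.0 + scale * rng.standard_normal(params.shape))
    return render_patch(params), render_patch(perturbed)

params = np.array([440.0, 0.8, 3.0])  # freq (Hz), amplitude, decay rate
a, b = doppelganger_pair(params)
# a and b are close in parameter space, hence perceptually related,
# but not identical -- the variation a contrastive loss pulls together.
```

Because the pairs come from a synthesizer rather than recordings, the amount and kind of variation (pitch, timbre, envelope) is set directly by which parameters are perturbed and by how much, which is the control the abstract highlights.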
Date issued
2024-09
URI
https://hdl.handle.net/1721.1/157728
Department
Program in Media Arts and Sciences (Massachusetts Institute of Technology)
Publisher
Massachusetts Institute of Technology

Collections
  • Graduate Theses
