Simple item record

dc.contributor.author: Ananthabhotla, Ishwarya
dc.contributor.author: Ewert, Sebastian
dc.contributor.author: Paradiso, Joseph A
dc.date.accessioned: 2021-12-15T14:32:25Z
dc.date.available: 2021-11-02T17:08:23Z
dc.date.available: 2021-12-15T14:32:25Z
dc.date.issued: 2019
dc.identifier.uri: https://hdl.handle.net/1721.1/137115.2
dc.description.abstract: © 2019 Association for Computing Machinery. Generative audio models based on neural networks have led to considerable improvements across fields including speech enhancement, source separation, and text-to-speech synthesis. These systems are typically trained in a supervised fashion using simple element-wise ℓ1 or ℓ2 losses. However, because they do not capture properties of the human auditory system, such losses encourage modelling perceptually meaningless aspects of the output, wasting capacity and limiting performance. Additionally, while adversarial models have been employed to encourage outputs that are statistically indistinguishable from ground truth and have resulted in improvements in this regard, such losses do not need to explicitly model perception as their task; furthermore, training adversarial networks remains an unstable and slow process. In this work, we investigate an idea fundamentally rooted in psychoacoustics. We train a neural network to emulate an MP3 codec as a differentiable function. Feeding the output of a generative model through this MP3 function, we remove signal components that are perceptually irrelevant before computing a loss. To further stabilize gradient propagation, we employ intermediate layer outputs to define our loss, as found useful in image domain methods. Our experiments using an autoencoding task show an improvement over standard losses in listening tests, indicating the potential of psychoacoustically motivated models for audio generation. [en_US]
dc.language.iso: en
dc.publisher: Association for Computing Machinery (ACM) [en_US]
dc.relation.isversionof: 10.1145/3343031.3351148 [en_US]
dc.rights: Creative Commons Attribution-Noncommercial-Share Alike [en_US]
dc.rights.uri: http://creativecommons.org/licenses/by-nc-sa/4.0/ [en_US]
dc.source: MIT web domain [en_US]
dc.title: Towards a Perceptual Loss: Using a Neural Network Codec Approximation as a Loss for Generative Audio Models [en_US]
dc.type: Article [en_US]
dc.identifier.citation: Ananthabhotla, Ishwarya, Ewert, Sebastian and Paradiso, Joseph A. 2019. "Towards a Perceptual Loss: Using a Neural Network Codec Approximation as a Loss for Generative Audio Models." MM 2019 - Proceedings of the 27th ACM International Conference on Multimedia. [en_US]
dc.contributor.department: Massachusetts Institute of Technology. Media Laboratory [en_US]
dc.contributor.department: Program in Media Arts and Sciences (Massachusetts Institute of Technology) [en_US]
dc.relation.journal: MM 2019 - Proceedings of the 27th ACM International Conference on Multimedia [en_US]
dc.eprint.version: Author's final manuscript [en_US]
dc.type.uri: http://purl.org/eprint/type/ConferencePaper [en_US]
eprint.status: http://purl.org/eprint/status/NonPeerReviewed [en_US]
dc.date.updated: 2021-06-25T17:49:20Z
dspace.orderedauthors: Ananthabhotla, I; Ewert, S; Paradiso, JA [en_US]
dspace.date.submission: 2021-06-25T17:49:21Z
mit.license: OPEN_ACCESS_POLICY
mit.metadata.status: Publication Information Needed [en_US]
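
As a rough illustration of the loss described in the abstract above, the following PyTorch sketch applies a frozen neural approximation of an MP3 codec to both generated and reference audio and compares the codec network's intermediate layer activations with an ℓ1 distance, in the spirit of image-domain perceptual losses. The CodecApprox architecture, the layer choices, and the equal weighting of layers are illustrative assumptions made for this sketch, not the configuration used in the paper.

import torch
import torch.nn as nn

class CodecApprox(nn.Module):
    """Stand-in for a network pre-trained to reproduce MP3-coded audio."""
    def __init__(self, channels=32):
        super().__init__()
        # Hypothetical 1-D convolutional stack operating on raw waveforms.
        self.layers = nn.ModuleList([
            nn.Sequential(nn.Conv1d(1, channels, 9, padding=4), nn.ReLU()),
            nn.Sequential(nn.Conv1d(channels, channels, 9, padding=4), nn.ReLU()),
            nn.Sequential(nn.Conv1d(channels, 1, 9, padding=4)),
        ])

    def forward(self, x):
        # Return every intermediate activation so the loss can use them.
        feats = []
        for layer in self.layers:
            x = layer(x)
            feats.append(x)
        return feats

class CodecPerceptualLoss(nn.Module):
    def __init__(self, codec: CodecApprox):
        super().__init__()
        self.codec = codec
        # The codec approximation stays fixed while the generative model trains.
        for p in self.codec.parameters():
            p.requires_grad_(False)

    def forward(self, generated, target):
        gen_feats = self.codec(generated)
        tgt_feats = self.codec(target)
        # Sum of L1 distances over intermediate layer outputs.
        return sum(torch.mean(torch.abs(g - t))
                   for g, t in zip(gen_feats, tgt_feats))

# Usage with waveform tensors shaped (batch, 1, samples).
loss_fn = CodecPerceptualLoss(CodecApprox())
generated = torch.randn(4, 1, 16000, requires_grad=True)
reference = torch.randn(4, 1, 16000)
loss = loss_fn(generated, reference)
loss.backward()  # gradients flow through the frozen codec into `generated`

Because the codec network's parameters are frozen, gradients pass through it only into the generative model's output, mirroring the role of the fixed differentiable "MP3 function" described in the abstract.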

