Improving Impulse Audio Source Separation using Generative Adversarial Networks for Phase Generation

Piercy, Phoebe K.

Advisor

Lang, Jeffrey H.

Terms of use

In Copyright - Educational Use Permitted. Copyright MIT. http://rightsstatements.org/page/InC-EDU/1.0/

Abstract

This thesis explored separating impulse noise from a desired signal for the purpose of hearing protection for soldiers and musicians. An evaluation of current source separation techniques, including matrix demixing methods (Independent Component Analysis, Independent Vector Analysis) and masking methods (Ideal Ratio Mask, Ideal Binary Mask), concluded that Time-Frequency masking of the noisy signal spectrogram was the best candidate audio separation method for dynamic soundscapes such as tactical fields and music. We followed with an experimental investigation of the role of phase in Time-Frequency masking and found it to be of paramount importance to speech intelligibility. In particular, the construction of a Complex Ideal Ratio Mask (cIRM), which alters both magnitude and phase information in the spectrogram, was identified as the most promising method of impulse source separation, with separated speech intelligibility comparable to that of clean speech. This motivated us to develop a method to generate an approximation of the cIRM without prior source information. To that end, the growing use of neural networks as a tool in source separation and phase estimation was presented and evaluated. Experiments were conducted to evaluate the potential of Generative Adversarial Networks (GANs), often used in image transformation, for generating the phase of the cIRM, with human test subjects evaluating whether the intelligibility of separated speech was improved. The GAN showed promise in generating phase-like results, although imperfect transformation resulted in an audible quality decrease, suggesting that the approach was unlikely to produce the natural sound required by musicians. However, for the tactical case, where intelligibility is valued over quality, consonant reconstruction and improved impulse attenuation were observed using our GAN-estimated cIRM.
This improvement was reflected in an increase in the signal-to-noise ratio measured against clean speech, and a decrease in the same metric measured against the impulse noise, demonstrating improved clean speech contributions and reduced impulse noise contributions in the separated output. These results show the potential, with better resources, for GAN-generated phase to improve intelligibility during audio source separation of impulse noise from speech, and motivate further exploration of this topic.
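
For readers unfamiliar with the masks named in the abstract, the following is a minimal sketch of how a magnitude-only Ideal Ratio Mask and a Complex Ideal Ratio Mask differ. It is not taken from the thesis. It assumes the commonly used definitions (cIRM as the per-bin complex ratio of the clean spectrogram to the noisy one, IRM as a square-root power ratio), uses random complex arrays in place of real STFT spectrograms, and all function names are hypothetical.

```python
import numpy as np

def complex_ideal_ratio_mask(clean_spec, noisy_spec, eps=1e-12):
    """Assumed cIRM definition: complex mask M such that M * noisy ~= clean."""
    # Complex division clean / noisy, written with the conjugate for stability.
    return clean_spec * np.conj(noisy_spec) / (np.abs(noisy_spec) ** 2 + eps)

def ideal_ratio_mask(clean_spec, noise_spec, eps=1e-12):
    """Assumed IRM definition: real-valued mask that scales magnitude only."""
    clean_pow = np.abs(clean_spec) ** 2
    noise_pow = np.abs(noise_spec) ** 2
    return np.sqrt(clean_pow / (clean_pow + noise_pow + eps))

def snr_db(reference, estimate, eps=1e-12):
    """Signal-to-noise ratio of an estimate against a reference, in dB."""
    error_pow = np.sum(np.abs(reference - estimate) ** 2)
    ref_pow = np.sum(np.abs(reference) ** 2)
    return 10.0 * np.log10((ref_pow + eps) / (error_pow + eps))

# Synthetic stand-ins for STFT spectrograms (frequency bins x time frames).
rng = np.random.default_rng(0)
shape = (257, 50)
clean = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
noise = 3.0 * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape))
noisy = clean + noise

# The cIRM corrects magnitude and phase; the IRM keeps the noisy phase.
cirm_estimate = complex_ideal_ratio_mask(clean, noisy) * noisy
irm_estimate = ideal_ratio_mask(clean, noise) * noisy

snr_cirm = snr_db(clean, cirm_estimate)
snr_irm = snr_db(clean, irm_estimate)
```

Both masks here are "ideal" in that they are computed from the known clean signal. The thesis concerns estimating the cIRM phase with a GAN when the clean source is not available, which this sketch does not attempt.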

Date issued

2021-06

URI

https://hdl.handle.net/1721.1/138956

Department

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science

Publisher

Massachusetts Institute of Technology

Collections

Graduate Theses