Acoustic landmark detection and segmentation using the McAulay-Quatieri Sinusoidal Model
Author(s)
Sainath, Tara N
Other Contributors
Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science.
Advisor
Timothy J. Hazen.
Abstract
Phonetic landmark detection in the Spoken Language Systems Group at MIT is currently performed by SUMMIT, a segment-based speech recognition system. Under noisy conditions, the system's segmentation algorithm has difficulty distinguishing between noise and speech components and often produces a poor alignment of sounds. Noise robustness in SUMMIT can be improved using a full segmentation method, which allows landmarks at regularly spaced intervals. While this approach is computationally more expensive than the original segmentation method, it is more robust in noisy environments. In this thesis, we explore a landmark detection and segmentation algorithm using the McAulay-Quatieri Sinusoidal Model, with the aim of improving the performance of the recognizer in noisy conditions. We first discuss the sinusoidal model representation, in which rapid changes in spectral components are tracked using the concept of "birth" and "death" of underlying sinewaves. Next, we describe our method of landmark detection with respect to the behavior of sinewave tracks generated from this model. These landmarks are interconnected to form a graph of hypothetical segments. Finally, we experiment with different segmentation algorithms to reduce the size of the segment graph. We compare the performance of our approach with the full and original segmentation methods under different noise environments. The word error rate of the original segmentation model degrades rapidly in the presence of noise, while the sinusoidal and full segmentation models degrade more gracefully. However, the full segmentation method has the longest computation time of the three methods. We find that our algorithm provides the best tradeoff between word accuracy and computation time of the three methods. Furthermore, we find that our model is robust when speech is contaminated by white noise, speech babble noise, and destroyer operations room noise.
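As an illustration of the "birth" and "death" idea mentioned in the abstract, the following is a minimal Python sketch, not the thesis's implementation. It assumes spectral peaks have already been extracted per frame and uses a simplified greedy nearest-frequency match in place of the full McAulay-Quatieri matching rule. The function names `track_peaks` and `landmark_frames`, the `max_delta` threshold, and the rule of flagging frames with many births and deaths as landmark candidates are all hypothetical choices made for this example.

```python
def track_peaks(frames, max_delta=50.0):
    """Greedy frame-to-frame peak matching (simplified, MQ-style).

    frames: list of lists of peak frequencies in Hz, one list per frame.
    Returns (births, deaths): per-frame counts of tracks that start
    at that frame and tracks that fail to continue into that frame.
    """
    births = [0] * len(frames)
    deaths = [0] * len(frames)
    active = []  # frequencies of tracks alive in the previous frame
    for k, peaks in enumerate(frames):
        unmatched = sorted(peaks)
        continued = []
        for f_prev in sorted(active):
            # Nearest still-unmatched peak to this track, if any.
            best = min(unmatched, key=lambda f: abs(f - f_prev), default=None)
            if best is not None and abs(best - f_prev) <= max_delta:
                unmatched.remove(best)
                continued.append(best)
            else:
                deaths[k] += 1  # no peak within max_delta: track "dies"
        births[k] = len(unmatched)  # leftover peaks: new tracks are "born"
        active = continued + unmatched
    return births, deaths


def landmark_frames(births, deaths, min_events=2):
    """Hypothetical landmark rule: frames (after the first) where the
    number of births plus deaths reaches min_events."""
    return [k for k in range(1, len(births))
            if births[k] + deaths[k] >= min_events]


# Toy input: two stable tracks, then an abrupt spectral change at frame 2.
frames = [[100.0, 200.0], [105.0, 198.0], [500.0, 600.0], [505.0, 598.0]]
births, deaths = track_peaks(frames)
candidates = landmark_frames(births, deaths)
```

In this toy input, both tracks end and two new ones begin at frame 2, so that frame is the only landmark candidate. A real system would work from measured peak amplitudes and phases as well as frequencies, which this sketch omits.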
Description
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2005. Includes bibliographical references (leaves 95-98).
Date issued
2005
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology
Keywords
Electrical Engineering and Computer Science.