dc.contributor.advisor | James R. Glass and Karen Livescu. | en_US |
dc.contributor.author | Zhu, Bo, Ph. D. Massachusetts Institute of Technology | en_US |
dc.contributor.other | Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science. | en_US |
dc.date.accessioned | 2009-08-26T16:43:31Z | |
dc.date.available | 2009-08-26T16:43:31Z | |
dc.date.copyright | 2008 | en_US |
dc.date.issued | 2008 | en_US |
dc.identifier.uri | http://hdl.handle.net/1721.1/46530 | |
dc.description | Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008. | en_US |
dc.description | Includes bibliographical references (p. 95-96). | en_US |
dc.description.abstract | Ultrasonic sensing of articulator movement is an area of multimodal speech recognition that has not been researched extensively. The widely-researched audio-visual speech recognition (AVSR), which relies upon video data, is awkwardly high-maintenance in its setup and data collection process, as well as computationally expensive because of image processing. In this thesis we explore the effectiveness of ultrasound as a more lightweight secondary source of information in speech recognition. We first describe our hardware systems that made simultaneous audio and ultrasound capture possible. We then discuss the new types of features that needed to be extracted; traditional Mel-Frequency Cepstral Coefficients (MFCCs) were not effective in this narrowband domain. Spectral analysis pointed to frequency-band energy averages, energy-band frequency midpoints, and spectrogram peak location vs. acoustic event timing as convenient features. Next, we devised ultrasonic-only phonetic classification experiments to investigate ultrasound's abilities and weaknesses in classifying phones. We found that several acoustically-similar phone pairs were distinguishable through ultrasonic classification. Additionally, several same-place consonants were also distinguishable. We also compared classification metrics across phonetic contexts and speakers. Finally, we performed multimodal continuous digit recognition in the presence of acoustic noise. We found that the addition of ultrasonic information reduced word error rates by 24-29% over a wide range of acoustic signal-to-noise ratios (SNRs), from clean to 0 dB. This research indicates that ultrasound has the potential to be a financially and computationally cheap noise-robust modality for speech recognition systems. | en_US |
dc.description.statementofresponsibility | by Bo Zhu. | en_US |
dc.format.extent | 96 p. | en_US |
dc.language.iso | eng | en_US |
dc.publisher | Massachusetts Institute of Technology | en_US |
dc.rights | M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. | en_US |
dc.rights.uri | http://dspace.mit.edu/handle/1721.1/7582 | en_US |
dc.subject | Electrical Engineering and Computer Science. | en_US |
dc.title | Multimodal speech recognition with ultrasonic sensors | en_US |
dc.type | Thesis | en_US |
dc.description.degree | M.Eng. | en_US |
dc.contributor.department | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science | |
dc.identifier.oclc | 416526597 | en_US |
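The abstract names two of the ultrasound features: frequency-band energy averages and energy-band frequency midpoints. The thesis itself defines the exact band edges and framing; the sketch below is only a plausible reading of those two feature types from a magnitude spectrogram, with all parameter values (band edges, frame length, the 50% energy midpoint) chosen as illustrative assumptions rather than taken from the thesis.

```python
# Minimal sketch (not the thesis's exact recipe): per-frame ultrasound
# features from an STFT power spectrogram. Band edges and frame settings
# below are illustrative assumptions; the abstract does not specify them.
import numpy as np
from scipy.signal import stft

def ultrasound_features(x, fs, band_edges_hz, nperseg=512):
    """Return frequency-band energy averages and energy-band frequency
    midpoints, stacked as a (2 * n_bands, n_frames) feature matrix."""
    f, t, Z = stft(x, fs=fs, nperseg=nperseg)
    power = np.abs(Z) ** 2                       # (freq_bins, frames)

    band_energy = []    # mean power inside each band, per frame
    band_midpoint = []  # frequency where cumulative in-band energy hits 50%
    for lo, hi in band_edges_hz:
        idx = (f >= lo) & (f < hi)
        p = power[idx]                           # (bins_in_band, frames)
        band_energy.append(p.mean(axis=0))

        cdf = np.cumsum(p, axis=0)
        cdf /= np.maximum(cdf[-1], 1e-12)        # normalize per frame
        mid_bin = np.argmax(cdf >= 0.5, axis=0)  # first bin past half energy
        band_midpoint.append(f[idx][mid_bin])

    return np.vstack(band_energy + band_midpoint)

# Example with a synthetic narrowband signal; band edges are hypothetical.
if __name__ == "__main__":
    fs = 96000
    x = np.sin(2 * np.pi * 40000 * np.arange(fs) / fs)
    feats = ultrasound_features(x, fs, band_edges_hz=[(30000, 38000),
                                                      (38000, 46000)])
    print(feats.shape)  # (4, n_frames): 2 energy rows, 2 midpoint rows
```

These per-frame vectors would then play the role MFCCs play in the audio stream: concatenated or modeled alongside the acoustic features for the classification and digit-recognition experiments the abstract describes.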