DSpace@MIT (MIT Libraries)


Modelling out-of-vocabulary words for robust speech recognition

Author(s)
Bazzi, Issam
Alternative title
Modelling OOV words for robust speech recognition
Other Contributors
Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science.
Advisor
James Glass.
Terms of use
M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. http://dspace.mit.edu/handle/1721.1/7582
Abstract
This thesis concerns the problem of unknown or out-of-vocabulary (OOV) words in continuous speech recognition. Most of today's state-of-the-art speech recognition systems can recognize only words that belong to some predefined finite word vocabulary. When encountering an OOV word, a speech recognizer erroneously substitutes a similar-sounding word from its vocabulary for the OOV word. Furthermore, a recognition error due to an OOV word tends to spread errors into neighboring words, dramatically degrading overall recognition performance. In this thesis we propose a novel approach for handling OOV words within a single-stage recognition framework. To achieve this goal, an explicit and detailed model of OOV words is constructed and then used to augment the closed-vocabulary search space of a standard speech recognizer. This OOV model achieves open-vocabulary recognition through the use of more flexible subword units that can be concatenated during recognition to form new phone sequences corresponding to potential new words. Examples of such subword units are phones, syllables, or automatically learned multi-phone sequences. Subword units have the attractive property of being a closed set, and thus are able to cover any new words, and can conceivably cover most utterances with partially spoken words as well.

The main challenge with such an approach is ensuring that the OOV model does not absorb portions of the speech signal corresponding to in-vocabulary (IV) words. In dealing with this challenge, we explore several research issues related to designing the subword lexicon, language model, and topology of the OOV model. We present a dictionary-based approach for estimating subword language models. Such language models are utilized within the subword search space to help recognize the underlying phonetic transcription of OOV words. We also propose a data-driven iterative bottom-up procedure for automatically creating a multi-phone subword inventory. Starting with individual phones, this procedure uses the maximum mutual information principle to successively merge phones to obtain longer subword units. The thesis also extends this OOV approach to modelling multiple classes of OOV words. Instead of augmenting the word search space with a single model, we add several models, one for each class of words. We present two approaches for designing the OOV word classes. The first approach relies on using common part-of-speech tags. The second approach is a data-driven two-step clustering procedure, where the first step uses agglomerative clustering to derive an initial class assignment, while the second step uses iterative clustering to move words from one class to another in order to reduce the model perplexity.

We present experiments on two recognition tasks: the medium-vocabulary spontaneous speech JUPITER weather information domain and the large-vocabulary broadcast news HUB4 domain. On the JUPITER task, the proposed approach can detect 70% of the OOV words with a false alarm rate of less than 3%. At this operating point, the word error rate (WER) on the IV utterances degrades slightly (from 10.9% to 11.2%) while the overall WER decreases from 17.1% to 16.4% ...
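The bottom-up subword merging described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: the abstract states only that phones are merged successively under a maximum mutual information principle, so the scoring function used here (a weighted pointwise mutual information over adjacent units), the fixed number of merges as a stopping rule, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def merge_pair(seq, pair):
    """Replace each left-to-right occurrence of `pair` in `seq` with one merged unit."""
    merged = pair[0] + "_" + pair[1]
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def best_pair(pronunciations):
    """Return the adjacent unit pair with the highest score, or None if no pairs exist.

    Assumed score: p(x, y) * log(p(x, y) / (p(x) * p(y))), with probabilities
    estimated from unit and adjacent-pair counts over all pronunciations.
    """
    unigrams, bigrams = Counter(), Counter()
    for seq in pronunciations:
        unigrams.update(seq)
        bigrams.update(zip(seq, seq[1:]))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    if n_bi == 0:
        return None

    def score(pair):
        p_xy = bigrams[pair] / n_bi
        p_x = unigrams[pair[0]] / n_uni
        p_y = unigrams[pair[1]] / n_uni
        return p_xy * math.log(p_xy / (p_x * p_y))

    # Sorting first makes tie-breaking deterministic.
    return max(sorted(bigrams), key=score)


def learn_subword_units(pronunciations, num_merges):
    """Start from phone sequences and apply up to `num_merges` greedy merges."""
    seqs = [list(p) for p in pronunciations]
    for _ in range(num_merges):
        pair = best_pair(seqs)
        if pair is None:
            break
        seqs = [merge_pair(s, pair) for s in seqs]
    return seqs
```

Calling `learn_subword_units` on a list of dictionary pronunciations (each a list of phone labels) returns the same pronunciations rewritten in terms of the merged units; the set of distinct units in the result is the learned subword inventory under these assumptions.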
 
Description
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2002.
 
Includes bibliographical references (p. 147-153).
 
Date issued
2002
URI
http://hdl.handle.net/1721.1/29241
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology
Keywords
Electrical Engineering and Computer Science.

Collections
  • Doctoral Theses
