Simple item record

dc.contributor.advisor: James R. Glass. [en_US]
dc.contributor.author: Livescu, Karen, 1975- [en_US]
dc.contributor.other: Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science. [en_US]
dc.date.accessioned: 2008-03-26T20:36:55Z
dc.date.available: 2008-03-26T20:36:55Z
dc.date.copyright: 2005 [en_US]
dc.date.issued: 2005 [en_US]
dc.identifier.uri: http://dspace.mit.edu/handle/1721.1/34488 [en_US]
dc.identifier.uri: http://hdl.handle.net/1721.1/34488
dc.description: Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2005. [en_US]
dc.description: Includes bibliographical references (p. 131-140). [en_US]
dc.description.abstract: Spoken language, especially conversational speech, is characterized by great variability in word pronunciation, including many variants that differ grossly from dictionary prototypes. This is one factor in the poor performance of automatic speech recognizers on conversational speech. One approach to handling this variation consists of expanding the dictionary with phonetic substitution, insertion, and deletion rules. Common rule sets, however, typically leave many pronunciation variants unaccounted for and increase word confusability due to the coarse granularity of phone units. We present an alternative approach, in which many types of variation are explained by representing a pronunciation as multiple streams of linguistic features rather than a single stream of phones. Features may correspond to the positions of the speech articulators, such as the lips and tongue, or to acoustic or perceptual categories. By allowing for asynchrony between features and per-feature substitutions, many pronunciation changes that are difficult to account for with phone-based models become quite natural. Although it is well-known that many phenomena can be attributed to this "semi-independent evolution" of features, previous models of pronunciation variation have typically not taken advantage of this. In particular, we propose a class of feature-based pronunciation models represented as dynamic Bayesian networks (DBNs). [en_US]
dc.description.abstract: (cont.) The DBN framework allows us to naturally represent the factorization of the state space of feature combinations into feature-specific factors, as well as providing standard algorithms for inference and parameter learning. We investigate the behavior of such a model in isolation using manually transcribed words. Compared to a phone-based baseline, the feature-based model has both higher coverage of observed pronunciations and higher recognition rate for isolated words. We also discuss the ways in which such a model can be incorporated into various types of end-to-end speech recognizers and present several examples of implemented systems, for both acoustic speech recognition and lipreading tasks. [en_US]
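The factored-state idea in the abstract can be made concrete with a toy example. The sketch below is not the thesis's actual DBN: it is a minimal two-stream model in plain Python, with hypothetical feature names, target sequences, and probabilities. Each feature stream advances through its own per-word target sequence semi-independently, coupled only by an asynchrony constraint, with per-feature substitution noise on the surface values; the forward algorithm over the joint per-stream indices scores an observed surface realization.

```python
# Toy two-stream feature-based pronunciation model: a minimal sketch,
# not the thesis's actual DBN. Feature names, target sequences, and all
# probabilities below are hypothetical placeholders.
from collections import defaultdict

# Per-stream target sequences for one word, e.g. lip and tongue-body
# feature targets derived from a dictionary pronunciation.
LIP_TARGETS = ["closed", "open", "round"]
TONGUE_TARGETS = ["alveolar", "low", "back"]

MAX_ASYNC = 1        # streams may drift apart by at most one target index
P_ADVANCE = 0.4      # per-frame chance a stream moves to its next target
P_SUBSTITUTE = 0.1   # chance a surface feature value differs from its target

def emission(target, observed):
    """P(observed surface feature value | underlying target)."""
    return 1.0 - P_SUBSTITUTE if observed == target else P_SUBSTITUTE

def step_options(idx, length):
    """Per-stream transition: stay on the current target or advance.
    A stream that has reached its last target must stay."""
    if idx == length - 1:
        return [(idx, 1.0)]
    return [(idx, 1.0 - P_ADVANCE), (idx + 1, P_ADVANCE)]

def forward(observations):
    """Forward algorithm over the factored state (i, j), the two
    streams' target indices. Returns P(observations, both streams
    finished | word)."""
    alpha = {(0, 0): 1.0}
    for lip_obs, tongue_obs in observations:
        new_alpha = defaultdict(float)
        for (i, j), p in alpha.items():
            # Emit this frame's surface feature values...
            p *= emission(LIP_TARGETS[i], lip_obs)
            p *= emission(TONGUE_TARGETS[j], tongue_obs)
            # ...then let each stream evolve semi-independently,
            # coupled only by the asynchrony constraint.
            for ni, pi in step_options(i, len(LIP_TARGETS)):
                for nj, pj in step_options(j, len(TONGUE_TARGETS)):
                    if abs(ni - nj) <= MAX_ASYNC:
                        new_alpha[(ni, nj)] += p * pi * pj
        alpha = new_alpha
    return alpha.get((len(LIP_TARGETS) - 1, len(TONGUE_TARGETS) - 1), 0.0)

# A surface realization in which the tongue stream lags the lips by one
# frame -- a pronunciation variant that a single phone stream cannot
# express without ad hoc substitution rules.
observations = [("closed", "alveolar"), ("open", "alveolar"),
                ("open", "low"), ("round", "back")]
print(forward(observations))
```

Setting MAX_ASYNC = 0 and P_SUBSTITUTE = 0 collapses the two streams into a single synchronous sequence, roughly the phone-based view the abstract contrasts with; estimating the transition and substitution parameters from data is what the DBN framework's standard parameter-learning algorithms provide.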
dc.description.statementofresponsibility: by Karen Livescu. [en_US]
dc.format.extent: 140 p. [en_US]
dc.language.iso: eng [en_US]
dc.publisher: Massachusetts Institute of Technology [en_US]
dc.rights: M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. [en_US]
dc.rights.uri: http://dspace.mit.edu/handle/1721.1/34488 [en_US]
dc.rights.uri: http://dspace.mit.edu/handle/1721.1/7582 [en_US]
dc.subject: Electrical Engineering and Computer Science. [en_US]
dc.title: Feature-based pronunciation modeling for automatic speech recognition [en_US]
dc.type: Thesis [en_US]
dc.description.degree: Ph.D. [en_US]
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
dc.identifier.oclc: 70847032 [en_US]

