A comparison-based approach to mispronunciation detection
Author(s)
Lee, Ann, Ph. D. Massachusetts Institute of Technology
DownloadFull printable version (10.88Mb)
Other Contributors
Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science.
Advisor
James Glass.
Terms of use
Metadata
Show full item recordAbstract
This thesis focuses on the problem of detecting word-level mispronunciations in nonnative speech. Conventional automatic speech recognition-based mispronunciation detection systems have the disadvantage of requiring a large amount of language-specific, annotated training data. Some systems even require a speech recognizer in the target language and another one in the students' native language. To reduce human labeling effort and for generalization across all languages, we propose a comparison-based framework which only requires word-level timing information from the native training data. With the assumption that the student is trying to enunciate the given script, dynamic time warping (DTW) is carried out between a student's utterance (nonnative speech) and a teacher's utterance (native speech), and we focus on detecting mis-alignment in the warping path and the distance matrix. The first stage of the system locates word boundaries in the nonnative utterance. To handle the problem that nonnative speech often contains intra-word pauses, we run DTW with a silence model which can align the two utterances, detect and remove silences at the same time. In order to segment each word into smaller, acoustically similar, units for a finer-grained analysis, we develop a phoneme-like unit segmentor which works by segmenting the selfsimilarity matrix into low-distance regions along the diagonal. Both phone-level and wordlevel features that describe the degree of mis-alignment between the two utterances are extracted, and the problem is formulated as a classification task. SVM classifiers are trained, and three voting schemes are considered for the cases where there are more than one matching reference utterance. The system is evaluated on the Chinese University Chinese Learners of English (CUCHLOE) corpus, and the TIMIT corpus is used as the native corpus. Experimental results have shown 1) the effectiveness of the silence model in guiding DTW to capture the word boundaries in nonnative speech more accurately, 2) the complimentary performance of the word-level and the phone-level features, and 3) the stable performance of the system with or without phonetic units labeling.
Description
Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012. Cataloged from PDF version of thesis. Includes bibliographical references (p. 89-92).
Date issued
2012Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer SciencePublisher
Massachusetts Institute of Technology
Keywords
Electrical Engineering and Computer Science.