Consonant recognition by humans and machines
Author(s)
Sroka, Jason (Jason Jonathan), 1970-
Other Contributors
Harvard University--MIT Division of Health Sciences and Technology.
Advisor
Louis D. Braida.
Abstract
The goal of this research is to determine how aspects of human speech processing can be used to improve the performance of Automatic Speech Recognition (ASR) systems. Three traditional ASR parameterizations paired with Hidden Markov Models (HMMs) are compared to humans on a consonant recognition task using Consonant-Vowel-Consonant (CVC) nonsense syllables degraded by highpass filtering, lowpass filtering, or additive noise. Confusion matrices were determined by recognizing the syllables using different ASR front ends, including Mel-Filter Bank (MFB) energies, Mel-Frequency Cepstral Coefficients (MFCCs), and the Ensemble Interval Histogram (EIH).

For syllables degraded by lowpass and highpass filtering, automated systems trained on the degraded condition recognized the consonants roughly as well as humans. Moreover, all the ASR systems produced similar patterns of recognition errors for a given filtering condition. These patterns differ significantly from those characteristic of humans under the same filtering conditions. For syllables degraded by additive speech-shaped noise, none of the automated systems recognized consonants as well as humans. As with the filtered conditions, confusion matrices revealed similar error patterns across all the ASR systems. While the error patterns of humans and machines were more similar for noise conditions than for filtered conditions, the similarities were not as great as those among the ASR systems. The greatest difference between human and machine performance was in determining the correct voiced/unvoiced classification of consonants.

Given these results, work focused on recognizing the correct voicing classification in additive noise (0 dB SNR). The approach attempted to automatically extract attributes of the speech signal, termed subphonetic features, which are useful in determining the distinctive feature voicing. Two subphonetic features, intervocal period (the length of time between the onset of the vowel and any preceding vocalization) and delta fundamental (the average first difference of fundamental frequency over the first 90 msec of the vowel), proved particularly useful. When these two features were appended to traditional ASR parameters, the deficit exhibited by automated systems was reduced substantially, though not eliminated.
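The two subphonetic features are defined concretely enough in the abstract to admit a brief illustration. The Python sketch below computes intervocal period and delta fundamental from a frame-based F0 track; the function names, the 10 msec frame rate, and the assumption that F0 estimation and vowel-onset segmentation are already available are illustrative assumptions, not details taken from the thesis.

```python
import numpy as np

# Hedged sketch of the two subphonetic features described in the abstract.
# Assumes a precomputed F0 track (one value per frame, unvoiced frames
# marked 0) and segmentation points from a separate step not shown here.

def intervocal_period_ms(vowel_onset_frame, prev_voicing_end_frame, frame_ms=10.0):
    """Time (msec) between the end of any preceding vocalization and vowel onset."""
    return (vowel_onset_frame - prev_voicing_end_frame) * frame_ms

def delta_fundamental(f0_hz, vowel_onset_frame, frame_ms=10.0, window_ms=90.0):
    """Average first difference of F0 over roughly the first 90 msec of the vowel."""
    n_frames = int(round(window_ms / frame_ms))
    segment = np.asarray(f0_hz[vowel_onset_frame : vowel_onset_frame + n_frames + 1])
    voiced = segment[segment > 0]              # discard unvoiced (F0 = 0) frames
    if voiced.size < 2:
        return 0.0                             # too few voiced frames to difference
    return float(np.mean(np.diff(voiced)))     # mean Hz change per frame

# Appending the two features to a conventional parameter vector, e.g. MFCCs:
# augmented = np.concatenate([mfcc_vector, [ivp_ms, d_f0]])
```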
Description
Thesis (Ph.D.)--Harvard University--Massachusetts Institute of Technology Division of Health Sciences and Technology, 1998. Includes bibliographical references (p. 113-117).
Date issued
1998
Department
Harvard University--MIT Division of Health Sciences and Technology
Publisher
Massachusetts Institute of Technology
Keywords
Harvard University--MIT Division of Health Sciences and Technology.