
Consonant recognition by humans and machines

Author(s)
Sroka, Jason (Jason Jonathan), 1970-
Download: Full printable version (8.388 MB)
Other Contributors
Harvard University--MIT Division of Health Sciences and Technology.
Advisor
Louis D. Braida.
Terms of use
M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. http://dspace.mit.edu/handle/1721.1/7582
Abstract
The goal of this research is to determine how aspects of human speech processing can be utilized to improve the performance of Automatic Speech Recognition (ASR) systems. Three traditional ASR parameterizations matched with Hidden Markov Models (HMMs) are compared to humans on a consonant recognition task using Consonant-Vowel-Consonant (CVC) nonsense syllables degraded by highpass filtering, lowpass filtering, or additive noise. Confusion matrices were determined by recognizing the syllables using different ASR front ends, including Mel-Filter Bank (MFB) energies, Mel-Frequency Cepstral Coefficients (MFCCs), and the Ensemble Interval Histogram (EIH). For syllables degraded by lowpass and highpass filtering, automated systems trained on the degraded condition recognized the consonants roughly as well as humans. Moreover, all the ASR systems produced similar patterns of recognition errors for a given filtering condition. These patterns differed significantly from those characteristic of humans under the same filtering conditions. For syllables degraded by additive speech-shaped noise, none of the automated systems recognized consonants as well as humans. As with the filtered conditions, confusion matrices revealed similar error patterns for all the ASR systems. While the error patterns of humans and machines were more similar for noise conditions than for filtered conditions, the similarities were not as great as those between the ASR systems. The greatest difference between human and machine performance was in determining the correct voiced/unvoiced classification of consonants. Given these results, work was focused on recognition of the correct voicing classification in additive noise (0 dB SNR). The approach taken attempted to automatically extract attributes of the speech signal, termed subphonetic features, which are useful in determining the distinctive feature voicing. Two subphonetic features, intervocal period (the length of time between the onset of the vowel and any preceding vocalization) and delta fundamental (the average first difference of fundamental frequency over the first 90 msec of the vowel), proved particularly useful. When these two features were appended to traditional ASR parameters, the deficit exhibited by automated systems was reduced substantially, though not eliminated.
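The abstract defines the two subphonetic features precisely enough to sketch how they might be computed. The following is a minimal illustration only, assuming a pre-computed fundamental-frequency (F0) track and known segment boundaries; the function names, signatures, and frame-based representation are hypothetical and are not taken from the thesis.

```python
import numpy as np

def delta_fundamental(f0_track, frame_rate_hz, vowel_onset_s, window_s=0.090):
    """Average first difference of F0 over the first 90 msec of the vowel.

    f0_track      : array of F0 estimates (Hz), one per analysis frame
    frame_rate_hz : number of F0 frames per second
    vowel_onset_s : time (s) at which the vowel begins
    window_s      : analysis window after vowel onset (90 msec per the abstract)
    """
    start = int(round(vowel_onset_s * frame_rate_hz))
    stop = start + int(round(window_s * frame_rate_hz))
    segment = f0_track[start:stop]
    # Mean of successive frame-to-frame F0 differences (Hz per frame)
    return np.mean(np.diff(segment))

def intervocal_period(preceding_voicing_offset_s, vowel_onset_s):
    """Time (s) between the end of any preceding vocalization and vowel onset."""
    return vowel_onset_s - preceding_voicing_offset_s

# Example: a rising F0 over the vowel yields a positive delta fundamental
f0 = np.array([110.0, 112.0, 115.0, 117.0, 120.0, 122.0, 125.0, 128.0, 130.0])
print(delta_fundamental(f0, frame_rate_hz=100, vowel_onset_s=0.0))  # 2.5 Hz/frame
```

In this sketch the two values would simply be appended to each frame's (or each token's) traditional ASR feature vector before HMM training, mirroring the abstract's statement that the features were "appended to traditional ASR parameters."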
Description
Thesis (Ph.D.)--Harvard--Massachusetts Institute of Technology Division of Health Sciences and Technology, 1998.
 
Includes bibliographical references (p. 113-117).
 
Date issued
1998
URI
http://hdl.handle.net/1721.1/9312
Department
Harvard University--MIT Division of Health Sciences and Technology
Publisher
Massachusetts Institute of Technology
Keywords
Harvard University--MIT Division of Health Sciences and Technology.

Collections
  • Doctoral Theses
