Multi-tape finite-state transducer for asynchronous multi-stream pattern recognition with application to speech

Author(s)

Shu, Han, Ph. D. Massachusetts Institute of Technology

Other Contributors

Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science.

Advisor

James R. Glass.

Terms of use

M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. http://dspace.mit.edu/handle/1721.1/7582

Abstract

In this thesis, we have focused on improving the acoustic modeling of speech recognition systems to increase the overall recognition performance. We formulate a novel multi-stream speech recognition framework using multi-tape finite-state transducers (FSTs). The multi-dimensional input labels of the multi-tape FST transitions specify the acoustic models to be used for the individual feature streams. An additional auxiliary field is used to model the degree of asynchrony among the feature streams. The individual feature streams can be linear sequences such as fixed-frame-rate features in traditional hidden Markov model (HMM) systems, and the feature streams can also be directed acyclic graphs such as segment features in segment-based systems. In a single-tape mode, this multi-stream framework also unifies the frame-based HMM and the segment-based approach. Systems using the multi-stream speech recognition framework were evaluated on an audio-only and an audio-visual speech recognition task. On the Wall Street Journal speech recognition task, the multi-stream framework combined a traditional frame-based HMM with segment-based landmark features.

The system achieved a word error rate (WER) of 8.0%, improving on both the 8.8% WER of the baseline HMM-only system and the 10.4% WER of the landmark-only system. On the AV-TIMIT audio-visual speech recognition task, the multi-stream framework combined a landmark model, a segment model, and a visual HMM. The system achieved a WER of 0.9%, which also improved on the baseline systems. These results demonstrate the feasibility and versatility of the multi-stream speech recognition framework.

Description

Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2006.

Includes bibliographical references (p. 119-127).

Date issued

2006

URI

http://hdl.handle.net/1721.1/37880

Department

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science

Publisher

Massachusetts Institute of Technology

Keywords

Electrical Engineering and Computer Science.

Collections

Doctoral Theses