Show simple item record

dc.contributor.advisor    Deb K. Roy.    en_US
dc.contributor.author    Mukherjee, Niloy, 1978-    en_US
dc.contributor.other    Massachusetts Institute of Technology. Dept. of Architecture. Program In Media Arts and Sciences.    en_US
dc.date.accessioned    2011-04-25T15:49:45Z
dc.date.available    2011-04-25T15:49:45Z
dc.date.copyright    2003    en_US
dc.date.issued    2003    en_US
dc.identifier.uri    http://hdl.handle.net/1721.1/62380
dc.description    Thesis (S.M.)--Massachusetts Institute of Technology, School of Architecture and Planning, Program in Media Arts and Sciences, 2003.    en_US
dc.description    Includes bibliographical references (p. 83-88).    en_US
dc.description.abstract    This thesis presents a novel situationally-aware multimodal spoken language system called Fuse that performs speech understanding for visual object selection. An experimental task was created in which people were asked to refer, using speech alone, to objects arranged on a table top. During training, Fuse acquires a grammar and vocabulary from a "show-and-tell" procedure in which visual scenes are paired with verbal descriptions of individual objects. Fuse determines a set of visually salient words and phrases and associates them with a set of visual features. Given a new scene, Fuse uses the acquired knowledge to generate class-based language models conditioned on the objects present in the scene, as well as a spatial language model that predicts the occurrence of spatial terms conditioned on target and landmark objects. The speech recognizer in Fuse uses a weighted mixture of these language models to search for more likely interpretations of user speech in the context of the current scene. During decoding, the weights are updated using a visual attention model that redistributes attention over objects based on partially decoded utterances. These dynamic situationally-aware language models enable Fuse to jointly infer the spoken utterances underlying speech signals as well as the identities of the target objects they refer to. In an evaluation of the system, visual situationally-aware language modeling yields a significant (more than 30%) decrease in speech recognition and understanding error rates. The underlying ideas of situation-aware speech understanding developed in Fuse may be applied in numerous areas, including assistive and mobile human-machine interfaces.    en_US
dc.description.statementofresponsibility    by Niloy Mukherjee.    en_US
dc.format.extent    88 p.    en_US
dc.language.iso    eng    en_US
dc.publisher    Massachusetts Institute of Technology    en_US
dc.rights    M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission.    en_US
dc.rights.uri    http://dspace.mit.edu/handle/1721.1/7582    en_US
dc.subject    Architecture. Program In Media Arts and Sciences.    en_US
dc.title    Spontaneous speech recognition using visual context-aware language models    en_US
dc.type    Thesis    en_US
dc.description.degree    S.M.    en_US
dc.contributor.department    Program in Media Arts and Sciences (Massachusetts Institute of Technology)
dc.identifier.oclc    54698754    en_US
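
As a rough illustration of the situationally-aware language modeling described in the abstract, the sketch below is not code from the thesis: names such as ObjectLM, mixture_prob, and update_attention are hypothetical, and toy unigram models stand in for Fuse's class-based language models. It only shows, under those assumptions, how per-object language models acquired from "show-and-tell" descriptions might be mixed under visual-attention weights, and how those weights could be redistributed as an utterance is partially decoded.

    # Illustrative sketch only, NOT the thesis implementation: toy unigram
    # language models per object, mixed under visual-attention weights.
    from collections import Counter

    class ObjectLM:
        """Toy unigram language model built from one object's verbal description."""
        def __init__(self, description_words):
            counts = Counter(description_words)
            total = sum(counts.values())
            self.probs = {w: c / total for w, c in counts.items()}

        def prob(self, word, floor=1e-6):
            # Small floor probability for unseen words keeps scores nonzero.
            return self.probs.get(word, floor)

    def mixture_prob(word, object_lms, attention):
        """P(word | scene) as an attention-weighted mixture of per-object LMs."""
        return sum(attention[obj] * lm.prob(word) for obj, lm in object_lms.items())

    def update_attention(partial_words, object_lms, attention):
        """Redistribute attention over objects given partially decoded words."""
        scores = {}
        for obj, lm in object_lms.items():
            score = attention[obj]
            for w in partial_words:
                score *= lm.prob(w)
            scores[obj] = score
        z = sum(scores.values()) or 1.0
        return {obj: s / z for obj, s in scores.items()}

    # Usage: two objects in a scene, each paired with a verbal description.
    object_lms = {
        "obj1": ObjectLM("the large green cone".split()),
        "obj2": ObjectLM("the small red cube".split()),
    }
    attention = {"obj1": 0.5, "obj2": 0.5}  # uniform prior over objects
    attention = update_attention(["green"], object_lms, attention)
    print(attention)                                  # attention shifts toward obj1
    print(mixture_prob("cone", object_lms, attention))

In this toy form, hearing "green" shifts attention toward the object whose description contains that word, which in turn raises the mixture probability of that object's other descriptive words. This is the same coupling between partial decoding and visual attention that the abstract describes, reduced to its simplest possible shape.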

