Show simple item record

dc.contributor.author: Vosoughi, Soroush
dc.date.accessioned: 2014-05-14T16:17:37Z
dc.date.available: 2014-05-14T16:17:37Z
dc.date.issued: 2014
dc.identifier.isbn: 9781450324731
dc.identifier.uri: http://hdl.handle.net/1721.1/86943
dc.description.abstract: In this paper, we present a multimodal speech recognition system for real-world scene description tasks. Given a visual scene, the system dynamically biases its language model based on the content of the scene and the visual attention of the speaker. Visual attention is used to focus on likely objects within the scene. Given a spoken description, the system then uses the visually biased language model to process the speech. The system uses head pose as a proxy for the visual attention of the speaker. Readily available standard computer vision algorithms are used to recognize the objects in the scene, and real-time head pose estimation is performed automatically using depth data captured via a Microsoft Kinect. The system was evaluated on multiple participants. Overall, incorporating visual information into the speech recognizer greatly improved speech recognition accuracy. The rapidly decreasing cost of 3D sensing technologies such as the Kinect allows systems with similar underlying principles to be used for many speech recognition tasks where visual information is available. [en_US]
dc.language.iso: en_US
dc.publisher: Association for Computing Machinery [en_US]
dc.relation.isversionof: http://dx.doi.org/10.1145/2556288.2556957 [en_US]
dc.rights: Article is made available in accordance with the publisher's policy and may be subject to US copyright law. Please refer to the publisher's site for terms of use. [en_US]
dc.source: Soroush Vosoughi [en_US]
dc.title: Improving automatic speech recognition through head pose driven visual grounding [en_US]
dc.type: Article [en_US]
dc.identifier.citation: Vosoughi, Soroush. “Improving Automatic Speech Recognition through Head Pose Driven Visual Grounding.” Proceedings of the 32nd Annual ACM Conference on Human Factors in Computing Systems - CHI ’14 (2014), April 26–May 01, 2014, Toronto, ON, Canada. [en_US]
dc.contributor.department: Massachusetts Institute of Technology. Media Laboratory [en_US]
dc.contributor.department: Program in Media Arts and Sciences (Massachusetts Institute of Technology) [en_US]
dc.contributor.approver: Vosoughi, Soroush [en_US]
dc.contributor.mitauthor: Vosoughi, Soroush [en_US]
dc.relation.journal: Proceedings of the 32nd annual ACM conference on Human factors in computing systems - CHI '14 [en_US]
dc.eprint.version: Final published version [en_US]
dc.type.uri: http://purl.org/eprint/type/JournalArticle [en_US]
eprint.status: http://purl.org/eprint/status/PeerReviewed [en_US]
dspace.orderedauthors: Vosoughi, Soroush [en_US]
dc.identifier.orcid: https://orcid.org/0000-0002-2564-8909
mit.license: PUBLISHER_POLICY [en_US]
mit.metadata.status: Complete
