Show simple item record

dc.contributor.author: Vosoughi, Soroush
dc.date.accessioned: 2014-05-14T16:17:37Z
dc.date.available: 2014-05-14T16:17:37Z
dc.date.issued: 2014
dc.identifier.isbn: 9781450324731
dc.identifier.uri: http://hdl.handle.net/1721.1/86943
dc.description.abstract: In this paper, we present a multimodal speech recognition system for real-world scene description tasks. Given a visual scene, the system dynamically biases its language model based on the content of the scene and the visual attention of the speaker. Visual attention is used to focus on likely objects within the scene. Given a spoken description, the system then uses the visually biased language model to process the speech. The system uses head pose as a proxy for the visual attention of the speaker. Readily available standard computer vision algorithms are used to recognize the objects in the scene, and real-time head pose estimation is performed automatically using depth data captured via a Microsoft Kinect. The system was evaluated on multiple participants. Overall, incorporating visual information into the speech recognizer greatly improved speech recognition accuracy. The rapidly decreasing cost of 3D sensing technologies such as the Kinect allows systems with similar underlying principles to be used for many speech recognition tasks where visual information is available. [en_US]
dc.language.iso: en_US
dc.publisher: Association for Computing Machinery [en_US]
dc.relation.isversionof: http://dx.doi.org/10.1145/2556288.2556957 [en_US]
dc.rights: Article is made available in accordance with the publisher's policy and may be subject to US copyright law. Please refer to the publisher's site for terms of use. [en_US]
dc.source: Soroush Vosoughi [en_US]
dc.title: Improving automatic speech recognition through head pose driven visual grounding [en_US]
dc.type: Article [en_US]
dc.identifier.citation: Vosoughi, Soroush. “Improving Automatic Speech Recognition through Head Pose Driven Visual Grounding.” Proceedings of the 32nd Annual ACM Conference on Human Factors in Computing Systems - CHI ’14 (2014), April 26–May 01, 2014, Toronto, ON, Canada. [en_US]
dc.contributor.department: Massachusetts Institute of Technology. Media Laboratory [en_US]
dc.contributor.department: Program in Media Arts and Sciences (Massachusetts Institute of Technology) [en_US]
dc.contributor.approver: Vosoughi, Soroush [en_US]
dc.contributor.mitauthor: Vosoughi, Soroush [en_US]
dc.relation.journal: Proceedings of the 32nd annual ACM conference on Human factors in computing systems - CHI '14 [en_US]
dc.eprint.version: Final published version [en_US]
dc.type.uri: http://purl.org/eprint/type/JournalArticle [en_US]
eprint.status: http://purl.org/eprint/status/PeerReviewed [en_US]
dspace.orderedauthors: Vosoughi, Soroush [en_US]
dc.identifier.orcid: https://orcid.org/0000-0002-2564-8909
mit.license: PUBLISHER_POLICY [en_US]
mit.metadata.status: Complete
