Structured video content analysis : learning spatio-temporal and multimodal structures

Song, Yale

dc.contributor.advisor	Randall Davis.	en_US
dc.contributor.author	Song, Yale	en_US
dc.contributor.other	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.	en_US
dc.date.accessioned	2014-09-19T21:33:47Z
dc.date.available	2014-09-19T21:33:47Z
dc.date.copyright	2014	en_US
dc.date.issued	2014	en_US
dc.identifier.uri	http://hdl.handle.net/1721.1/90003
dc.description	Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2014.	en_US
dc.description	Cataloged from PDF version of thesis.	en_US
dc.description	Includes bibliographical references (pages 141-154).	en_US
dc.description.abstract	Video data exhibits a variety of structures: pixels exhibit spatial structure, e.g., objects of the same class share certain shapes and/or colors in images; sequences of frames exhibit temporal structure, e.g., dynamic events such as jumping and running have a certain chronological order of frame occurrence; and when combined with audio and text, there is multimodal structure, e.g., human behavioral data shows correlation between audio (speech) and visual information (gesture). Identifying, formulating, and learning these structured patterns is a fundamental task in video content analysis. This thesis tackles two challenging problems in video content analysis - human action recognition and behavior understanding - and presents novel algorithms to solve each: one algorithm performs sequence classification by learning spatio-temporal structure of human action; another performs data fusion by learning multimodal structure of human behavior. The first algorithm, hierarchical sequence summarization, is a probabilistic graphical model that learns spatio-temporal structure of human action in a fine-to-coarse manner. It constructs a hierarchical representation of video by iteratively summarizing the video sequence, and uses the representation to learn spatio-temporal structure of human action, classifying sequences into action categories. We develop an efficient learning method to train our model and show that its complexity grows only sublinearly with the depth of the hierarchy. The second algorithm focuses on data fusion - the task of combining information from multiple modalities in an effective way. Our approach is motivated by the observation that human behavioral data is modality-wise sparse, i.e., information from just a few modalities contains most of the information needed at any given time. We perform data fusion using structured sparsity, representing a multimodal signal as a sparse combination of multimodal basis vectors embedded in a hierarchical tree structure, learned directly from the data. The key novelty is in a mixed-norm formulation of regularized matrix factorization via structured sparsity. We show the effectiveness of our algorithms on two real-world application scenarios: recognizing aircraft handling signals used by the US Navy, and predicting people's impressions of the personality of public figures from their multimodal behavior. We describe the whole procedure of the recognition pipeline, from signal acquisition, to processing, to the interpretation of the processed signals using our algorithms. Experimental results show that our algorithms outperform state-of-the-art methods on human action recognition and behavior understanding.	en_US
dc.description.statementofresponsibility	by Yale Song.	en_US
dc.format.extent	154 pages	en_US
dc.language.iso	eng	en_US
dc.publisher	Massachusetts Institute of Technology	en_US
dc.rights	M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission.	en_US
dc.rights.uri	http://dspace.mit.edu/handle/1721.1/7582	en_US
dc.subject	Electrical Engineering and Computer Science.	en_US
dc.title	Structured video content analysis : learning spatio-temporal and multimodal structures	en_US
dc.type	Thesis	en_US
dc.description.degree	Ph. D.	en_US
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
dc.identifier.oclc	890133028	en_US

Files in this item

Name: 890133028-MIT.pdf
Size: 13.34Mb
Format: PDF
Description: Full printable version


This item appears in the following Collection(s)

Doctoral Theses
