Structured video content analysis: learning spatio-temporal and multimodal structures
Author(s)
Song, Yale
Other Contributors
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.
Advisor
Randall Davis.
Abstract
Video data exhibits a variety of structures: pixels exhibit spatial structure, e.g., objects of the same class share certain shapes and/or colors in an image; sequences of frames exhibit temporal structure, e.g., dynamic events such as jumping and running unfold in a characteristic chronological order of frames; and when video is combined with audio and text, there is multimodal structure, e.g., human behavioral data shows correlation between audio (speech) and visual information (gesture). Identifying, formulating, and learning these structured patterns is a fundamental task in video content analysis. This thesis tackles two challenging problems in video content analysis, human action recognition and behavior understanding, and presents a novel algorithm for each: one performs sequence classification by learning the spatio-temporal structure of human action; the other performs data fusion by learning the multimodal structure of human behavior.

The first algorithm, hierarchical sequence summarization, is a probabilistic graphical model that learns the spatio-temporal structure of human action in a fine-to-coarse manner. It constructs a hierarchical representation of video by iteratively summarizing the video sequence, and uses that representation to learn the spatio-temporal structure of human action, classifying sequences into action categories. We developed an efficient learning method to train our model, and show that its complexity grows only sublinearly with the depth of the hierarchy.

The second algorithm focuses on data fusion, the task of combining information from multiple modalities in an effective way. Our approach is motivated by the observation that human behavioral data is modality-wise sparse, i.e., just a few modalities carry most of the information needed at any given time. We perform data fusion using structured sparsity, representing a multimodal signal as a sparse combination of multimodal basis vectors embedded in a hierarchical tree structure learned directly from the data. The key novelty is a mixed-norm formulation of regularized matrix factorization via structured sparsity.

We show the effectiveness of our algorithms in two real-world application scenarios: recognizing aircraft handling signals used by the US Navy, and predicting people's impressions of the personality of public figures from their multimodal behavior. We describe the complete recognition pipeline, from signal acquisition through signal processing to the interpretation of the processed signals using our algorithms. Experimental results show that our algorithms outperform state-of-the-art methods on human action recognition and behavior understanding.
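To make the fine-to-coarse construction concrete, here is a minimal Python sketch of building a hierarchy by repeatedly summarizing a feature sequence. It is an illustration, not the thesis's model: the merge rule (a hypothetical feature-distance threshold) and the function names are invented for this example, standing in for the latent-variable summarization that hierarchical sequence summarization actually learns.

```python
import numpy as np

def summarize(seq, threshold=0.5):
    """Merge runs of adjacent, similar frames into one summary frame each.

    Hypothetical grouping rule: an incoming frame joins the current group
    when its feature distance to the group's last frame is below
    `threshold`; each finished group is replaced by its mean feature
    vector. The thesis instead groups frames via latent variables learned
    by the model; this rule is only a stand-in to illustrate summarization.
    """
    groups = [[seq[0]]]
    for x in seq[1:]:
        if np.linalg.norm(x - groups[-1][-1]) < threshold:
            groups[-1].append(x)      # similar to predecessor: same group
        else:
            groups.append([x])        # dissimilar: start a new group
    return np.array([np.mean(g, axis=0) for g in groups])

def build_hierarchy(seq, depth=3, threshold=0.5):
    """Fine-to-coarse hierarchy: level 0 is the raw sequence; every
    subsequent level summarizes the level below it."""
    levels = [seq]
    for _ in range(depth - 1):
        levels.append(summarize(levels[-1], threshold))
    return levels

if __name__ == "__main__":
    # A toy feature sequence: 100 frames of 16-dim features on a slow walk.
    rng = np.random.default_rng(0)
    seq = 0.1 * rng.normal(size=(100, 16)).cumsum(axis=0)
    for d, level in enumerate(build_hierarchy(seq, threshold=1.0)):
        print(f"level {d}: {level.shape[0]} summary frames")
```

Each level of the resulting hierarchy is shorter than the one below it, so a model trained per level sees progressively coarser temporal structure, which is the intuition behind the sublinear growth in training cost with hierarchy depth.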
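The data-fusion formulation the abstract describes can be written generically as tree-structured sparse coding. The objective below is a standard mixed-norm sparse matrix factorization of that kind, given as a sketch; the thesis's exact norm, group weights, and tree construction are specified in the thesis itself and may differ.

```latex
\min_{\mathbf{D},\,\mathbf{A}} \;
  \frac{1}{2}\,\lVert \mathbf{X} - \mathbf{D}\mathbf{A} \rVert_F^2
  \;+\; \lambda \sum_{i=1}^{N} \sum_{g \in \mathcal{G}}
        w_g \,\lVert \boldsymbol{\alpha}_{i,g} \rVert_2
```

Here \(\mathbf{X}\) stacks the multimodal feature vectors, \(\mathbf{D}\) is the dictionary of multimodal basis vectors, \(\mathbf{A} = [\boldsymbol{\alpha}_1, \ldots, \boldsymbol{\alpha}_N]\) holds the sparse codes, and \(\mathcal{G}\) is the set of coefficient groups induced by the hierarchical tree, each group covering a subtree of dictionary atoms. The sum of group \(\ell_2\) norms is a mixed \(\ell_1/\ell_2\) norm: it zeroes out entire subtrees of coefficients at once, which is how the model expresses modality-wise sparsity.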
Description
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2014. Cataloged from PDF version of thesis. Includes bibliographical references (pages 141-154).
Date issued
2014
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.
Publisher
Massachusetts Institute of Technology
Keywords
Electrical Engineering and Computer Science.