VICTORIOUS : video indexing with combined tracking and object recognition for improved object understanding in scenes

Xu, Yuetian

Author(s)

Xu, Yuetian

DownloadFull printable version (11.81Mb)

Alternative title

Video indexing with combined tracking and object recognition for improved object understanding in scenes

Other Contributors

Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science.

Advisor

Richard W. Madison and Tomaso A. Poggio.

Terms of use

M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. http://dspace.mit.edu/handle/1721.1/7582

Metadata

Show full item record

Abstract

Automatic understanding of video content is a problem which grows in importance every day. Video understanding algorithms require accuracy, robustness, speed, and scalability. Accuracy generates user confidence in usage. Robustness enables greater autonomy and reduced human intervention. Applications such as navigation and mapping demand real-time performance. Scalability is also important for maintaining high speed while expanding capacity to multiple users and sensors. In this thesis, I propose a "bag-of-phrases" model to improve the accuracy and robustness of the popular "bag-of-words" models. This model applies a "geometric grammar" to add structural constraints to the unordered "bag-of-words." I incorporate this model into an architecture which combines an object recognizer, a tracker, and a geolocation module. This architecture has the ability to use the complementarity of its components to compensate for its weaknesses. This allows for improvements in accuracy, robustness, and speed. Subsequently, I introduce VICTORIOUS, a fast implementation of the proposed architecture. Evaluation on computer-generated data as well as Caltech-101 indicate that this implementation is accurate, robust, and capable of performing in real time on current generation hardware. This implementation, together with the "bag-of-phrases" model and integrated architecture, forms a step towards meeting the requirements for an accurate, robust, real-time vision system.

Description

Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009.

Cataloged from PDF version of thesis.

Includes bibliographical references (p. ).

Date issued

2009

URI

http://hdl.handle.net/1721.1/61290

Department

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science

Publisher

Massachusetts Institute of Technology

Keywords

Electrical Engineering and Computer Science.

Collections

Graduate Theses