Labeling and modeling large databases of videos

by Jenny Yuen

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology, February 2012

© 2012 Massachusetts Institute of Technology. All Rights Reserved.

Author: Department of Electrical Engineering and Computer Science, December 2, 2011
Certified by: Antonio Torralba, Associate Professor of Electrical Engineering and Computer Science, Thesis Supervisor
Accepted by: Leslie A. Kolodziejski, Chair, Department Committee on Graduate Students

Dedicated to my family

Labeling and modeling large databases of videos
by Jenny Yuen

Submitted to the Department of Electrical Engineering and Computer Science on December 2, 2011, in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical Engineering and Computer Science

Abstract

As humans, we can say many things about the scenes surrounding us. For instance, we can tell what type of scene and location an image depicts, describe what objects live in it, their material properties, or their spatial arrangement. These comprise descriptions of a scene and are major areas of study in computer vision. This thesis, however, hypothesizes that observers have an inherent prior knowledge that can be applied to the scene at hand. This prior knowledge can be translated into the cognizance of which objects move, or of the trajectories and velocities to expect. Conversely, when faced with unusual events such as car accidents, humans are very well tuned to identify them regardless of having observed the scene a priori. This is, in part, due to prior observations that we have for scenes with similar configurations to the current one. This thesis emulates the prior knowledge base of humans by creating a large and heterogeneous database and annotation tool for videos depicting real-world scenes.

The first application of this thesis is in the area of unusual event detection. Given a short clip, the task is to identify the moving portions of the scene that depict abnormal events. We adopt a data-driven framework powered by scene matching techniques to retrieve the videos nearest to the query clip and integrate the motion information in the nearest videos. The result is a final clip with localized annotations for unusual activity. The second application lies in the area of event prediction. Given a static image, we adapt our framework to compile a prediction of motions to expect in the image. This result is crafted by integrating the knowledge of videos depicting scenes similar to the query image. With the help of scene matching, only scenes relevant to the queries are considered, resulting in reliable predictions. Our dataset, experimentation, and proposed model introduce and explore a new facet of scene understanding in images and videos.

Thesis Supervisor: Antonio Torralba
Title: Associate Professor of Electrical Engineering and Computer Science

Acknowledgments

First and foremost, I thank Antonio Torralba, my advisor, for welcoming me into his newly formed group more than four years ago and teaching me how to do research: in particular, to think outside the box and to not be afraid of breaking new ground, formulating new problems, and crafting novel solutions for them.
I thank him for the countless hours spent not only on the big picture of problems, but also at the implementation level. For his perseverance and passion for research, the 3 am e-mail brainstorming sessions, and for being the largest, most reliable data contributor and best beta tester for the LabelMe video project. Many thanks also go to my thesis committee: Fredo Durand, Alyosha Efros, Bill Freeman, and Antonio, for their support, valuable comments, and feedback for this thesis.

Over the last five and a half years, I have also been very fortunate to collaborate with brilliant researchers: Daniel Goldman, Ce Liu, Yasuyuki Matsushita, Bryan Russell, Josef Sivic, and Larry Zitnick. Each one of them played a key role in developing raw ideas into interesting results, and for that I give many thanks.

The computer vision group at MIT has been a second home to me. Eric Grimson and the APP group welcomed me to MIT in my first year. I have been fortunate to share offices with amazing people: Gerald Dalley, Biz Bose, Xiaogang Wang, Wanmei Ou, Thomas Yeo, Joseph Lim, Jianxiong Xiao, and Aditya Khosla. Thank you for the wonderful times, the great conversations, and for being such an integral part of my graduate school experience. Special thanks to Joseph Lim for helping me decipher object detector libraries, to Tomasz Malisiewicz for proposing the Exemplar SVM model and helping me adapt it to my work, and to Sylvain Paris for always being available to discuss ideas and help polish submissions. To Biliana Kaneva, Michael Bernstein, and Alvin Raj for their friendship and the late nights solving problem sets during our first year at MIT.

Last but certainly not least, I thank my family for their enormous support over the last years. This thesis is dedicated to my father, who seeded in me the thirst for learning and thorough understanding; to my mother, whose dedication and stamina taught me to never give up; to my brother Hector, who every day shows me how to stay calm and in control with humor; to my brother Joel, who has taught me math and science ever since I can remember; and finally to Justin, who listened to every single practice talk I have given throughout graduate school numerous times, walked with me through cities capturing data, and was always there for me. Thanks for everything.

Contents

Acknowledgments
List of Figures
1 Introduction
  1.1 Overview of techniques and contributions
    1.1.1 Chapter 2: LabelMe video: Building a video database with human annotations
    1.1.2 Chapter 3: A data-driven approach for unusual event detection
    1.1.3 Chapter 4: Car trajectory prediction from a single image
  1.2 Other work not included in this thesis
  1.3 Notes
2 LabelMe video: Building a video database with human annotations
  2.1 Introduction
  2.2 Related Work
  2.3 Online video annotation tool
    2.3.1 Object Annotation
    2.3.2 Event Annotation
    2.3.3 Stabilizing Annotations
    2.3.4 Annotation interpolation
  2.4 Data set statistics
  2.5 Beyond User Annotations
    2.5.1 Occlusion handling and depth ordering
    2.5.2 Cause-effect relations within moving objects
  2.6 Discussion
  2.7 Conclusion
3 A data-driven approach for unusual event detection
  3.1 Introduction
  3.2 Related Work
  3.3 Scene-based video retrieval
  3.4 Video event representation
    3.4.1 Recovering trajectories
    3.4.2 Clustering trajectories
    3.4.3 Comparing track clusters
  3.5 Video database and ground truth
  3.6 Experiments and Applications
    3.6.1 Localized motion prediction
    3.6.2 Event prediction from a single image
    3.6.3 Anomaly detection
  3.7 Discussion and concluding remarks
4 Car trajectory prediction from a single image
  4.1 Introduction
  4.2 Related Work
  4.3 Object and trajectory model
    4.3.1 From 2D to 3D
  4.4 From trajectories to action discovery
    4.4.1 Car trajectory prediction from a single image
  4.5 Experimental evaluation
  4.6 Discussion and concluding remarks
5 Conclusion
  5.1 Contributions
Bibliography

List of Figures

1.1 What is occurring in this image? Can you identify the objects that move? What actions are they performing? While humans are surprisingly good at this ill-posed problem, it is not the case for computers. (Image from Nikoartwork.com)
1.2 Sample video frames with ground truth annotations overlaid. LabelMe video provides a way to create ground truth annotations for objects in a wide variety of scenes.
1.3 What do these images have in common? They depict objects moving towards the right. These images do not contain motion cues such as temporal information or motion blur. The implied motion is known because we can recognize the image content and make reliable predictions of what would occur if these were movies playing, based on prior experiences. This, at the same time, allows us to be very finely tuned at identifying events that do not align with our prior information.
1.4 Object trajectory prediction from a static image. Based solely on the appearance of a detected object (in yellow) and the horizon line in the scene, our algorithm can determine a plausible trajectory for the selected object (red) by leveraging the information from a database of annotated moving objects.
2.1 Object annotation. Users annotate moving or static objects in a video by outlining their shape with a polygon and describing their actions.
2.2 Event annotation. Simple and complex events can be annotated by entering free-form sentences and linking them to existing labeled objects in the video.
2.3 Interpolation comparison between constant 2D motion (red) and constant 3D motion (green). a) Two polygons from different frames and their vanishing point. b) Interpolation of an intermediate frame, and c) interpolation of the polygon centers for multiple frames between the two reference frames.
2.4 Examples of annotations. Our interpolation framework is based on the heuristic that objects often move with constant velocity and follow straight trajectories. Our system can propagate annotations of rigid (or semi-rigid) objects such as cars, motorbikes, fish, cups, etc. across different frames in a video automatically, aiming for minimal user intervention. Annotation of non-rigid objects (e.g., humans), while possible with the tool (but requiring more editing), remains a more challenging task than for rigid objects. Presently, users can opt instead to draw bounding boxes around non-rigid entities like people.
2.5 Distribution of labels in the data set. The vertical axis indicates the log frequency of the object/action instances in the database while the horizontal axis indicates the rank of the class (the classes are sorted by frequency). As we aimed to capture videos from a variety of common scenes and events in the real world, these distributions are similar to natural word frequencies described by Zipf's law [47].
2.6 Occlusion relationships and depth estimation. A sample video frame (a), the propagated polygons created with our annotation tool (b), the polygons ordered using the LabelMe heuristic for occlusion relationships (c) (notice how, in the top figure, the group of people standing far away from the camera is mistakenly ordered as closer than the man pushing the stroller, and in the bottom figure there is a mistake in the ordering of the cars), and the ordered polygons inferred using 3D relationship heuristics (d) (notice how the mistakes in (c) are fixed).
3.1 Track clustering. Sample frames from the video sequence (a). The ground truth annotations denoted by polygons surrounding moving objects (b) can be used to generate ground truth labels for the tracked points in the video (c). Our track distance affinity function is used to automatically cluster tracks into groups and generates fairly reasonable clusters, where each roughly corresponds to an independent object in the scene (d). The track cluster visualizations in (c) and (d) show the first frame of each video and the spatial location of all tracked points for the duration of the clip, color-coded by the track cluster that each point corresponds to.
3.2 Unusual videos. We define an unusual or anomalous event as one that is not likely to happen in our training data set. However, we ensured that they belong to scene classes present in our video corpus.
3.3 Localized motion prediction (a) and unusual event detection (b). The algorithm was compared against two scene matching methods (GIST and dense SIFT) as well as a baseline supported by random nearest neighbors. Retrieving videos similar to the query image improves the classification rate.
3.4 Event prediction. Each row shows a static image with its corresponding event predictions. For each query image, we retrieve its nearest video clips using scene matching. The events belonging to the nearest neighbors are resized to match the dimensions of the query image and are further clustered to create different event predictions. For example, in a hallway scene, the system predicts motions of different people; in street scenes, it predicts cars moving along the road, etc.
3.5 Track cluster retrieval for common events. A frame from a query video (a), the tracks corresponding to one event in the video (b), the localized motion prediction map (c) generated after integrating the track information of the nearest neighbors (some examples shown in d), and the average image of the retrieved nearest neighbors (e). Notice the definition of high-probability motion regions in (c) and how its shape roughly matches the scene geometry in (a). The maps in (c) were generated with no motion information originating from the query videos.
3.6 Track cluster retrieval for unusual events (left) and scenes with fewer samples in our data set. When presented with unusual events such as a car crashing into the camera or a person jumping over a car while in motion (left and middle columns; key frames can be seen in fig. 3.7), our system is able to flag these as unusual events (b) due to their disparity with respect to the events taking place in the nearest neighbor videos. Notice the supporting neighbors belong to the same scene class as the query and the motion map predicts movements mostly in the car regions. However, our system fails when an image does not have enough representation in the database (right).
3.7 Unusual event detection. Videos of a person jumping over a car and running across it (left) and a car crashing into the camera (right). Our system outputs anomaly scores for individual events. Common events are shown in yellow and unusual ones in red. The thickness and saturation of the red tracks is proportional to the degree of anomaly.
4.1 Estimated average height and speed from annotations in the LabelMe video dataset. With the approach first introduced by Hoiem et al., we can estimate the average height of objects in the data set. From video, we additionally estimate average velocities.
4.2 Visualization of the trajectory feature. The ground plane is divided radially into 8 equally sized regions. Each trajectory is translated to the world center and described by the normalized count of bounding boxes landing in each region.
4.3 Discovered motion clusters for each object class. Object trajectories for each class are normalized and transferred to a common point in world coordinates. The trajectories are further clustered and each cluster is visualized as an energy map summarizing all of the trajectories belonging to the cluster. The trajectories have been translated, re-projected, and resized to fit the displayed image crops.
4.4 Image containing the top detection using the LDPM car detector (in blue, right) and the top exemplar SVM detection trained on a single exemplar (left). The LDPM detector is trained on many instances of cars varying in shape and pose. The non-maximum-suppression phase rules out overlapping detections and scores the blue detection as the highest. The eSVM, trained on the single positive instance (left), identifies the window that best matches the hatchback template from the query (in this case the side of the cab, excluding its hood). Our approach aims at detecting complete cars using the LDPM detector and filtering these detections using an eSVM detector trained on the query crop. We compare the bounding box intersection between the eSVM detection and the DPM one and discard detections that do not overlap by more than 70%.
4.5 Top candidate detections for trajectory transfer. We detect cars in the 200 nearest video frames. The naive approach of considering only the gist distance between scenes results in very few reliable detections amongst the top scenes (a). Ordering detections by the LDPM detection score gives a higher ratio of reliable car detections (b); however, there is no guarantee that all detections will contain the same pose as that of the query. An exemplar SVM approach focuses the search on windows similar to the query (c); however, this approach sometimes fires on only portions of entire cars (see yellow boxes). Finally, our approach integrates scene (gist) similarity, bounding box intersection with the query detection, and the LDPM and eSVM scores (d).
4.6 Predictions from a single image. (a) For each object, we can predict different trajectories even from the same action/trajectory group. (b) Other example predictions; note the diversity in locations and sizes of objects and how their predictions match the motion implied by their appearance. (c) Failure cases can take place when the appearance of the object is not correctly matched to the implied family of actions, when the horizon line is not correctly estimated (or is not horizontal), or if there are obstacles in the scene that interfere with the predicted trajectory.
4.7 User study scenarios. The prediction evaluation is presented in a synthetic world (1) and in the original image where the object resides (2). In the synthetic scenario, the user is asked to determine the quality of the prediction based solely on the pose, without considering the semantics of the scene, whereas in the original scene the user is asked to judge the 3D trajectory taking the scene elements into consideration (e.g., cars should move on the road and not on sidewalks or through obstacles).
4.8 User study results. A set of 30 reliable car detections comprises our test set. Our algorithm was configured to output 5 predictions per example. Subjects were asked to score the predictions as very likely, unlikely but possible, impossible, or cannot tell. Each bar represents the percentage of objects where at least x (out of the total 5) predictions are likely. Our algorithm is evaluated under a synthetic background and a real one (blue and red bars).

Chapter 1
Introduction
Consider the image in figure 1.1. What can you say about it? What objects are in it? What actions were the objects in the scene performing before and after the picture was taken? These are questions that humans can easily answer but are extremely difficult for an artificial system. As humans, we can quickly reason about the scene without having been to the depicted location or interacted with the object instances depicted in the scene. More interestingly, we are able to make reliable predictions for the objects in the scene. In other words, we can animate the scene in our minds and picture the person walking on the sidewalk, the biker following their lane on the road, and the car stopping at the intersection.

Imagine the situation where we want to create a robot capable of navigating real-world environments, with crowds of people walking and vehicles on the road moving at high speeds. The robot requires foresight capabilities to navigate these highly dynamic environments given only prior information, which might consist of a few seconds, or even just a few snapshots, of the scene.

How is it that humans can make reliable predictions given (in the extreme) only one snapshot of our surroundings, or even just an abstract drawing? How are we so certain that the car in the picture is moving on the road, most likely to the left of the scene? Or that the person is going to cross the street even though they have not yet set foot on the crosswalk? One hypothesis lies in the large volumes of training data that we feed to our memory throughout the years of repeated interactions in this world. This information is clearly transferable in that we can innately walk or drive in a different road or city for the first time and adapt prior knowledge acquired and reinforced by experiences in many other roads, vehicles, office spaces, etc. We will refer to this capacity for foresight as event prediction.

Figure 1.1: What is occurring in this image? Can you identify the objects that move? What actions are they performing? While humans are surprisingly good at this ill-posed problem, it is not the case for computers. (Image from Nikoartwork.com)

Event prediction is challenging in various aspects. The first aspect involves the intricacies of efficient data acquisition and ground truth annotation. The second challenge lies in the high variability of the data to learn from. Consider, even in the simplest case, learning for only a single street scene. Since images are 2D projections of the 3D world, we would get different images for the same scene depending on the location, viewpoint, and, in general, the setup of the camera. Compound this variability with different objects, their configurations, and different scene locations. Clearly, we need a compact, flexible, and comprehensive representation to cover as many environments and configurations as possible. The last challenge lies in the inherently high dimensionality of videos. With a large video corpus to train from, it is very important to represent data compactly and flexibly.

In summary, this thesis will leverage the information in video databases to power methods for event prediction and unusual event detection. We introduce (1) a real-world video database and a tool to annotate objects and events in it, (2) a method that integrates the raw information in this video corpus and helps identify unusual events in previously unseen videos, and (3) a framework for event prediction from a single image, powered by user-generated annotations in the training video corpus.
1.1 Overview of techniques and contributions

This section provides a preview of techniques and contributions.

1.1.1 Chapter 2: LabelMe video: Building a video database with human annotations

With the wide availability of consumer cameras, larger volumes of video are captured every day through amateurs, professionals, surveillance systems, etc. As reported by Youtube.com, users are uploading hundreds of thousands of videos daily; every minute, 24 hours of video is uploaded to Youtube. However, current video analysis algorithms suffer from a lack of information regarding the objects present and their interactions, as well as from missing comprehensive annotated video databases for benchmarking. We designed an online and openly accessible video annotation system that allows anyone with a browser and internet access to efficiently annotate object category, shape, motion, and activity information in real-world videos. The annotations are also complemented with knowledge from static image databases to infer occlusion and depth information. Using this system, we have built a scalable video database composed of diverse video samples and paired with human-guided annotations.

Figure 1.2: Sample video frames with ground truth annotations overlaid. LabelMe video provides a way to create ground truth annotations for objects in a wide variety of scenes.

1.1.2 Chapter 3: A data-driven approach for unusual event detection

When a human observes a short video clip, it is easy to decide if the event taking place is normal or unexpected, even if the video depicts a new place not familiar to the viewer. This is in contrast with work in surveillance and outlier event detection, where models rely on thousands of hours of video recorded at a single place in order to identify what constitutes an unusual event. In this work we present a simple method to identify videos with unusual events in a large collection of short video clips. The algorithm is inspired by recent approaches in computer vision that rely on large databases. We show how, relying on large collections of videos, we can retrieve other videos similar to the query to build a simple model of the distribution of expected motions for the query. Then, the model can evaluate how unusual the video is, as well as make predictions. We show how a very simple retrieval model is able to provide reliable results.

Figure 1.3: What do these images have in common? They depict objects moving towards the right. These images do not contain motion cues such as temporal information or motion blur. The implied motion is known because we can recognize the image content and make reliable predictions of what would occur if these were movies playing, based on prior experiences. This, at the same time, allows us to be very finely tuned at identifying events that do not align with our prior information.

1.1.3 Chapter 4: Car trajectory prediction from a single image

Given a single static picture, humans can interpret not just the instantaneous content captured by the image, but also infer the chain of dynamic events that are currently happening or that are likely to happen in the near future. Image understanding not only consists of parsing what is in our surroundings, but also of determining what is likely to happen in the future.
In this chapter, we propose a system that, given a static outdoor urban image, predicts potential trajectories for cars for the next few seconds. This work leverages the information in a database of annotated videos captured at different locations by different users. The core component lies in the video data, which is modeled as dynamic projections of 3D objects into a 2D plane. Our experiments show how this method is more descriptive and reliable at generating plausible object trajectory predictions.

Figure 1.4: Object trajectory prediction from a static image. Based solely on the appearance of a detected object (in yellow) and the horizon line in the scene, our algorithm can determine a plausible trajectory for the selected object (red) by leveraging the information from a database of annotated moving objects.

1.2 Other work not included in this thesis

During my PhD studies, I also had the privilege to work on and contribute to other disciplines. I worked with Dr. Ce Liu, Dr. Josef Sivic, and Professors Antonio Torralba and William Freeman on SIFT flow, an algorithm for generating pixel-wise dense correspondences across scenes. Additionally, I worked with Prof. Antonio Torralba and Dr. Ce Liu on using SIFT flow for object recognition via label transfer. Finally, through an internship with Microsoft Research, I had the opportunity to work with Dr. Lawrence Zitnick, Dr. Ce Liu, and Prof. Antonio Torralba on a paper titled "Maximum entropy framework for encoding object-level image priors".

1.3 Notes

Parts of this thesis have been published at the International Conference on Computer Vision (ICCV 2009) and the European Conference on Computer Vision (ECCV 2010). This work was supported by a National Defense Science and Engineering Graduate Fellowship and a National Science Foundation Graduate Fellowship.

Chapter 2
LabelMe video: Building a video database with human annotations

2.1 Introduction

Video processing and understanding are very important problems in computer vision. Researchers have studied motion estimation and object tracking to analyze temporal correspondences of pixels or objects across frames. The motion information of a static scene with a moving camera can further help to infer the 3D geometry of the scene. In some video segmentation approaches, pixels that move together are grouped into layers or objects. Higher-level information, such as object identities, events, and activities, has also been widely used for video retrieval, surveillance, and advanced video editing.

Despite the advancements achieved with these various approaches, we observe that their commonality lies in that they are built in a bottom-up direction: image features and pixel-wise flow vectors are typically the first things analyzed to build these video processing systems. Little account has been taken of prior knowledge about motion, location, and appearance at the object and object-interaction levels in real-world videos. Moreover, video analysis algorithms are often designed and tested on different sets of data and sometimes suffer from having a limited number of samples. Consequently, it is hard to evaluate and further improve these algorithms on a common ground.

We believe in the importance of creating a large and comprehensive video database with rich human annotations.
We can utilize this annotated database to obtain video priors at the object level to facilitate advanced video analysis algorithms. This database can also provide a platform to benchmark object tracking, video segmentation, object recognition, and activity analysis algorithms. Although there have been several annotated video databases in the literature [25, 29, 45], our objective is to build one that will scale in quantity, variety, and quality like the currently available benchmark databases for static images [39, 52]. Therefore, our criteria for designing this annotated video database are diversity, accuracy, and openness. We want to collect a large and diverse database of videos that span many different scene, object, and action categories, and to accurately label the identity and location of objects and actions. Furthermore, we wish to allow open and easy access to the data without copyright restrictions. Note that this last point differentiates us from the Lotus Hill database [58], which has similar goals but is not freely available.

However, it is not easy to obtain such an annotated video database. In particular, challenges arise when collecting a large amount of video data free of copyright; it is also difficult to make temporally consistent annotations across frames with little human interaction. Accurately annotating objects using layers and their associated motions [29] can also be tedious. Finally, including advanced tracking and motion analysis algorithms may prevent users from interacting with videos in real time.

Inspired by the recent success of online image annotation applications such as LabelMe [39] and Mechanical Turk [46], and labeling games such as the ESP game [52] and Peekaboom [53], we developed an online video annotation system enabling internet users to upload and label video data with minimal effort. Since tracking algorithms are too expensive for efficient use in client-side software, we use a homography-preserving shape interpolation to propagate annotations temporally, with the aid of global motion estimation. Using our online video annotation tool, we have annotated 238 object classes and 70 action classes in 1903 video sequences. As this online video annotation system allows internet users to interact with videos, we expect the database to grow rapidly after the tool is released to the public.

Using the annotated video database, we are able to obtain statistics of moving objects and information regarding their interactions. In particular, we explored motion statistics for each object class and cause-effect relationships between moving objects. We also generated coarse depth information and video pop-ups by combining our database with a thoroughly labeled image database [39]. These preliminary results suggest potential in a wide variety of applications for the computer vision community.

2.2 Related Work

There has been a variety of recent work and considerable progress on scene understanding and object recognition. One component critical to the success of this task is the collection and use of large, high-quality image databases with ground truth annotations spanning many different scene and object classes [8, 10, 37, 39, 41, 50]. Annotations may provide information about the depicted scene and objects, along with their spatial extent. Such databases are useful for training and validating recognition algorithms, in addition to being useful for a variety of tasks [15].
Similar databases would be useful for recognition of scenes, objects, and actions in videos, although it is nontrivial to collect such databases. A number of prior works have looked at collecting such data. For example, surveillance videos have offered an abundance of data, resulting in a wide body of interesting work in tracking and activity recognition. However, these videos primarily depict a single static scene with a limited number of object and action semantic classes. Furthermore, there is little ground truth annotation indicating objects and actions, and their extent.

Efforts have taken place to collect annotated video databases with a more diverse set of action classes. The KTH database, which depicts close-up views of a number of human action classes performed at different viewpoints, has been widely used as a benchmark [40]. A similar database was collected containing various sports actions [5]. While these databases offer a richer vocabulary of actions, the number of object and action classes and examples is still small.

There has also been recent work to scale up video databases to contain a larger number of examples. The TRECVID [45] project contains many hours of television programs and is a widely used benchmark in the information retrieval community. This database provides tags of scenes, objects, and actions, which are used for training and validation of retrieval tasks. Another example is the database in [24], later extended in [25], which was collected from Hollywood movies. This database contains up to hundreds of examples per action class, with some actions being quite subtle (e.g., drinking and smoking). However, there is little annotation of objects and their spatial extent, and the distribution of the data is troublesome due to copyright issues.

In summary, current video databases do not meet the requirements for exploring the priors of objects and activities in video at a large scale or for benchmarking video processing algorithms on a common ground. In this chapter we introduce a tool to create a video database composed of a diverse collection of real-world scenes, containing accurately labeled objects and events, open to download and growth.

2.3 Online video annotation tool

We aim to create an open database of videos where users can upload, annotate, and download content efficiently. Some desired features include speed, responsiveness, and intuitiveness. In addition, we wish to handle system failures such as those related to camera tracking, interpolation, etc., so as not to dramatically hinder the user experience. The consideration of these features is vital to the development of our system as they constrain the computer vision techniques that can be feasibly used.

Figure 2.1: Object annotation. Users annotate moving or static objects in a video by outlining their shape with a polygon and describing their actions.

We wish to allow multi-platform accessibility and easy access from virtually any computer. Therefore, we have chosen to deploy an online service in the spirit of image annotation tools such as LabelMe [39], the ESP game [52], and Mechanical Turk-based applications [46]. This section will describe the design and implementation choices, as well as challenges, involved in developing a workflow for annotating objects and events in videos.
2.3.1 Object Annotation

We designed a drawing tool similar to the one for annotating static images in LabelMe [39]. In our case, an annotation consists of a segmentation (represented by a polygon), information about the object, and its motion. The user begins the annotation process by clicking control points along the boundary of an object to form a polygon. When the polygon is closed, the user is prompted for the name of the object and information about its motion. The user may indicate whether the object is static or moving and describe the action it is performing, if any. The entered information is recorded on the server and the polygon is propagated across all frames in the video as if it were static and present at all times throughout the sequence. The user can further navigate across the video using the video controls to inspect and edit the polygons propagated across the different frames.

To correctly annotate moving objects, our tool allows the user to edit key frames in the sequence. Specifically, the tool allows selection, translation, resizing, and editing of polygons at any frame to adjust the annotation based on the new location and form of the object. Upon finishing, the web client uses the manually edited keyframes to interpolate or extrapolate the position and shape of the object at the missing locations (Section 2.3.4 describes how annotations are interpolated). Figure 2.1 shows a screenshot of our tool and illustrates some of the key features of the described system.

2.3.2 Event Annotation

The second feature is designed for annotating more complex events where one or more nouns interact with each other. To enter an event, the user clicks on the Add Event button, which prompts a panel where the user is asked for a sentence description of the event (e.g., the dog is chewing a bone). The event annotation tool renders a button for each token in the sentence, which the user can click on and link with one or more polygons in the video. Finally, the user is asked to specify the time when the described event occurs using a time slider. Once the event is annotated, the user can browse through objects and events to visualize the annotation details. Figure 2.2 illustrates this feature.

Figure 2.2: Event annotation. Simple and complex events can be annotated by entering free-form sentences and linking them to existing labeled objects in the video.

2.3.3 Stabilizing Annotations

As video cameras become ubiquitous, we expect most of the content uploaded to our system to be captured from handheld recorders. Handheld-captured videos contain some degree of ego-motion, even with shake correction features enabled in cameras. Due to the camera motion, the annotation of static objects can become tedious, as a simple cloning of polygon locations across time might produce misaligned polygons. One way to correct this problem is to compute the global motion between two consecutive frames to stabilize the sequence. Some drawbacks of this approach include the introduction of missing pixel patches due to image warping (especially visible with large camera movements), as well as potential camera tracking errors that result in visually unpleasant artifacts in the sequences.
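Estimating this frame-to-frame global motion amounts to fitting a homography to tracked feature points. The following Python sketch, using OpenCV, is only an illustration of that step under assumed parameter values; it is not the pre-processing code used by the system, and the function names are hypothetical.

import cv2
import numpy as np

def estimate_homography(prev_gray, curr_gray):
    """Estimate the global motion between two consecutive grayscale frames
    as a 3x3 homography, from tracked corners and RANSAC."""
    # Detect corners in the previous frame.
    prev_pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                                       qualityLevel=0.01, minDistance=8)
    # Track those corners into the current frame with pyramidal Lucas-Kanade.
    curr_pts, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray,
                                                      prev_pts, None)
    good = status.ravel() == 1
    # Robustly fit a homography to the surviving correspondences.
    H, _mask = cv2.findHomography(prev_pts[good], curr_pts[good],
                                  cv2.RANSAC, 3.0)
    return H

def propagate_polygon(polygon, H):
    """Warp an (N, 2) array of polygon control points with the estimated
    homography, e.g., to carry a static annotation into the next frame."""
    pts = polygon.reshape(-1, 1, 2).astype(np.float32)
    return cv2.perspectiveTransform(pts, H).reshape(-1, 2)

Because each homography is just a handful of numbers per frame, such estimates can be precomputed offline and shipped to a browser client cheaply, which is the design choice described next.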
Our approach consists of estimating camera motion as a homographic transformation between each pair of consecutive frames during an offline pre-processing stage. The camera motion parameters are encoded, saved on our servers, and downloaded by the web client when the user loads a video to annotate. When the user finishes outlining an object, the web client software propagates the location of the polygon across the video by taking into account the camera parameters. Therefore, if the object is static, the annotation will move together with the camera and not require further correction from the user. In this setup, even with failures in the camera tracking, we observe that the user can correct the annotation of the polygon and continue annotating without generating uncorrectable artifacts in the video or in the final annotation.

2.3.4 Annotation interpolation

To fill in the missing polygons between keyframes, we have chosen to use interpolation techniques. These methods are computationally lightweight for our web client, easy to implement, and still produce very compelling results.

An initial interpolation algorithm assumes that the control points outlining an object are transformed by a 2D homography plus a residual term:

p_1 = SR p_0 + T + r    (2.1)

where p_0 and p_1 are vectors containing the 2D coordinates of the control points for the annotation of some object at two user-annotated key frames, say t = 0 and t = 1 respectively; S, R, and T are scaling, rotation, and translation matrices encoding the homographic projection from p_0 to p_1 that minimizes the residual term r. A polygon at frame t ∈ [0, 1] can then be linearly interpolated in 2D as:

p_t = [SR]^t p_0 + t [T + r]    (2.2)

Once a user starts creating key frames, our tool interpolates the location of the control points for the frames in between two key frames, or linearly extrapolates the control points in the case of a missing key frame at either temporal extreme of the sequence. Figure 2.4 shows interpolation examples for some object annotations and illustrates how, with relatively few user edits, our tool can annotate several objects common in the real world such as cars, boats, and pedestrians.

3D Linear Motion Prior

So far, we have assumed that the motion between two frames is linear in 2D (equation 2.2). However, in many real videos, objects do not always move parallel to the image plane, but move in a 3D space. As a result, the user must make corrections in the annotation to compensate for foreshortening effects during motion. We can further assist the user by making simple assumptions about the most likely motions between two annotated frames [1].

Figure 2.3(a) shows an example of two polygons corresponding to the annotation of a car at two distinct times within the video. The car has moved from one frame to another and has changed in location and size. The scaling is a cue to 3D motion. Therefore, instead of assuming a constant velocity in 2D, it would be more accurate to assume constant velocity in 3D in order to interpolate intermediate frames. Interestingly, this interpolation can be done without knowing the camera parameters.

We start by assuming that a given point on the object moves in a straight line in the 3D world. The motion of point X(t) at time t in 3D can be written as X(t) = X_0 + λ(t)D, where X_0 is an initial point, D is the 3D direction, and λ(t) is the displacement along the direction vector. Here, we assume that the points X = (X, Y, Z, 1) live in projective space. For the camera, we assume perspective projection and that the camera is stationary. Therefore, the intrinsic and extrinsic parameters of the camera can be expressed as a 3 × 4 matrix P.
Here, we assume that the points X = (X,Y,Z, 1) live in projective space. For the camera, we assume perspective projection and that the camera is sta- tionary. Therefore, the intrinsic and extrinsic parameters of the camera can be ex- pressed as a 3 x 4 matrix P. Points on the line are projected to the image plane 38 CHAPTER 2. LABELME VIDEO: BUILDING A VIDEO DATABASE WITH HUMAN ANNOTATIONS as x(t) = PX(t) = xo + X (t)xy, where xo = PXo and x, = PD. Using the fact that x, = lim,()-oxo + A(t)xy, we see that xy is the vanishing point of the line. More explicitly, the image coordinates for points on the object can be written as: (x)+.(t),t =O , X(2.3) Furthermore, we assume that the point moves with constant velocity. Therefore, X (t) = vt, where v is a scalar denoting the velocity of the point along the line. Given a corresponding second point x(i) (iJ, 1) along the path projected into another frame, we can recover the velocity as v = In summary, to find the image coordinates for points on the object at any time, we simply need to know the coordinates of a point at two different times. To recover the vanishing points for the control points belonging to a polygon, we assume that all of the points move in parallel lines (in 3D space) toward the same vanishing point. With this assumption, we estimate the vanishing point from polygons in two key frames by intersecting lines passing through two point correspondences, and taking the median of these intersections points as illustrated in Figure 2.3(a). Note that not all points need to move at the same velocity. Figure 2.3(b) compares the result of interpolating a frame using constant 2D velocity interpolation versus using constant 3D velocity interpolation. The validity of the interpolation depends on the statistics of typical 3D motions that objects undergo. We evaluated the error the interpolation method by using two annotated frames to predict intermediate annotated frames (users have introduced the true location of the intermediate frame). We compare against our baseline approach that uses a 2D linear interpolation. Table I shows that, for several of the tested objects, the pixel error is reduced by more than 50% when using the constant 3D velocity assumption. t=o a) - Annotated ' - Constant 2D velocity interpolation - - - - - Constant 3D velocity interpolation b) )t= C) t=0.41 t=0.66 t=0.73 Figure 2.3: Interpolation comparison between constant 2D motion (red) and constant 3D motion (green). a) Two polygons from different frames and their vanishing point. b) Interpo- lation of an intermediate frame, and c) interpolation of the polygon centers for multiple frames between the two reference frames. t=1 t=0.90 t=1 Figure 2.4: Examples of annotations. Our interpolation framework is based on the heuristic that objects often move with constant velocity and follow straight trajectories. Our system can propagate annotations of rigid (or semni-rigid) objects such as cars, motorbikes, fish, cups, etc. across different frames in a video automatically aiming for minimal user intervention. Annotation of non-rigid objects (e.g.humans), while possible by the tool (but requiring more editing), remains a more challenging task than the one for rigid objects. Presently, users can opt to, instead, draw bounding boxes around non-rigid entities like people. Sec. 2.4. Data set statistics 41 Table 2.1: Interpolation evaluation. Pixel error per object class. 
Table 2.1: Interpolation evaluation. Pixel error per object class.

Object       Linear in 2D   Linear in 3D   # test samples
car              36.1           18.6             21
motorbike        34.6           14.7             11
person           15.5            8.6             35

2.4 Data set statistics

We intend to grow the video annotation database with contributions from Internet users. As an initial contribution, we have provided and annotated a first set of videos. These videos were captured at a diverse set of geographical locations, which includes both indoor and outdoor scenes. Currently, the database contains a total of 1903 annotations, 238 object classes, and 70 action classes. The statistics of the annotations for each object category are listed in Table 2.2. We found that the frequency of any category is inversely proportional to its rank in the frequency table (following Zipf's law [47]), as illustrated in Figure 2.5. This figure describes the frequency distribution of the objects in our video database by plotting the number of annotated instances for each class against the object rank (object names are sorted by their frequency in the database). The graph includes plots for static and moving objects, and action descriptions. For comparison, we also show the curve of the annotation of static images [39]. The most frequently annotated static objects in the video database are buildings (13%), windows (6%), and doors (6%). In the case of moving objects the order is persons (33%), cars (17%), and hands (7%). The most common actions are moving forward (31%), walking (8%), and swimming (3%).

Table 2.2: Object and action statistics. Number of instances per object/action class in our current database.

Static object   #     Moving object   #     Action             #
building       183    person         187    moving forward    124
window          86    car            100    walking           136
door            83    hand            40    swimming           13
sidewalk        77    motorbike       16    waving             13
tree            76    water           14    riding motorbike    8
road            71    bag             13    flowing             7
sky             68    knife           12    opening             5
car             65    purse           11    floating            4
street lamp     34    tree            11    eating              3
wall            31    door             9    flying              3
motorbike       25    blue fish        7    holding knife       3
pole            24    bycicle          7    riding bike         3
column          20    carrot           7    running             3
person          20    flag             7    standing            3
balcony         18    stroller         7    stopping            3
sign            18    dog              6    turning             3
floor           13    faucet           6    being peeled        2

Figure 2.5: Distribution of labels in the data set. The vertical axis indicates the log frequency of the object/action instances in the database while the horizontal axis indicates the rank of the class (the classes are sorted by frequency). As we aimed to capture videos from a variety of common scenes and events in the real world, these distributions are similar to natural word frequencies described by Zipf's law [47].

2.5 Beyond User Annotations

Once a set of videos is annotated, we can infer other properties not explicitly provided by the users in the annotation. For example: How do objects occlude each other? Which objects in our surroundings move and what are their common motions like? Which objects move autonomously and make others move? In this section, we will demonstrate how to use the user annotations to infer extra information as well as relationships between moving objects.

Large databases of annotated static images are currently available, and labeling single images appears easier since no temporal consistency needs to be taken into account. We expect that there would be more annotated images than annotated videos at the same level of annotation accuracy.
Therefore, it can be a useful strategy to propagate information from static image databases onto our video database.

Table 2.3: Motion statistics. We can compute the probability that an object of a certain class moves based on the observations in our annotated dataset. Due to the state of the database, there are some cases where there were no annotated static instances, resulting in probabilities equaling 1. We expect this to change as the database grows.

Object       Motion probability     Object          Motion probability
hand              1.00              building              0
purse             1.00              column                0
bag               0.92              floor                 0
person            0.90              grass                 0
water             0.74              plant                 0
knife             0.67              pole                  0
car               0.61              road                  0
bycicle           0.50              sidewalk              0
boat              0.45              sign                  0
motorbike         0.39              sky                   0
tree              0.13              street lamp           0
door              0.10              traffic light         0
awning            0                 wall                  0
balcony           0                 window                0

2.5.1 Occlusion handling and depth ordering

Occlusion and depth ordering are important relationships between objects in a scene. We need to sort objects with respect to their depth to infer the visibility of each polygon at each frame. Depth ordering can be provided by the user [29], but doing so makes the annotation tool tedious to use for naive users because the depth ordering of the objects may change during the video sequence. In fact, occlusion information can be recovered by post-processing the annotated data.

One possibility is to model the appearance of the object and, wherever there is overlap with another object, infer which object owns the visible part based on matching appearance. Although this would work in general, it can be unreliable when the appearance of the object is changing (e.g., a walking person), when the occlusion is small (e.g., a person walking behind a lamp-post), or when the resolution is low (for far and small objects).
The support relationship between two objects can be inferred by counting how many times the bottom part of a polygon overlaps with the supporting object (e.g., the boundary defining a person will overlap with the boundary defining a sidewalk, whenever these two objects co-occur in an image and they are nearby each other). Once support relations are estimated, we can use the contact point of an object with the ground to recover its 3D position. Both techniques (recovering camera parameters with 3D object sizes and in- ferring the support graph) benefit from a large collection of annotated images. This 40 CHAPTER 2. LABELME VIDEO: BUILDING A VID EO DATABASE WITH HUMAN ANNOTATIONS 1km 1Om/fj Example 3D-popup generated 1m from annotations a) video frame b) propagated c) occlusion d) occlusion handling e) depth map annotations handling with with 3D information LabelMe heuristic Figure 2.6: Occlusion relationships and depth estimation. A sample video frame (a), the propagated polygons created with our annotation tool (b), the polygons o rdered using the LabelMe-heuristic-based inference for occlusion relationships (c) polygon orde ring using La- belMe heuristic (notice how in the top figure, the group of people standing far away from the camera are mistakenly ordered as closer than the man pushi ng the stroller and in the bottom figure there is a mistake in the ordering of the cars), and (d) ordered polygon s inferred using 3D relationship heuristics (notice how the mistakes in (c) are fixed). information, learned from still images, is used to recover a 3D model of the scene. As our video scenes share similar objects with LabelMe, we are able to estimate 3D in- formation for each video frame in our database (even when there is no camera mot ion for inferring 3D using structure from motion techniques). We found that this techniq ue works well in most outdoor street scenes, but fails in many indoor scenes due to the lack of a clear ground plane. Fig. 2.6 shows some results with succe ssful depth order inference in a video annotation. * 2.5.2 Cause-effect relations within moving objects In the previous section we showed how a collection of static im ages can be used to extract additional information from annotated videos. Here, we discuss how to extract information from videos that might not be inferred solely from static images. There- fore, the two collections (images and video) can complement each other. As described in Section 2.3.1, our annotation tool allows users to rec ord whether an object is moving or static. Using this coarse motion information, we can infer cause- r -7IIN Sec. 2.6. Discussion 47 effect motion relationships for common objects. We define a measure of causality, which is the degree to which an object class C causes the motion in an object of class E: de) p(E moves C moves and contacts E) causality(C,E) =pEueIC vs omeoonat) (2.4) p(E mocesjCdocs not mote or contact E) Table 2.4 shows the inferred cause-effect motion relationships from the objects annotated in our database. It accurately learns that people cause the motions of most objects in our surroundings and distinguishes inert objects, such as strollers, bags, doors, etc., as being the ones moved by living objects. * 2.6 Discussion Most of the existing video analysis systems (e.g.motion estimation, object tracking, video segmentation, object recognition, and activity analysis) use a bottom-up ap- proach for inference. 
Table 2.4: Inferred cause-effect motion relationships. The cause-effect relationships are ranked by causality score. The threshold line separates correct relationships from the incorrect ones (with lower scores). Notice that many pairs of relationships appear in the list in their two possible forms (e.g., knife -> bag and bag -> knife), but that in all cases the correct one has a higher score than the incorrect one.

    Cause (C)    Effect (E)    causality(C, E)
    hand         carrot        11.208333
    hand         knife         10.579545
    person       purse         10.053191
    person       stroller       8.966667
    knife        hand           8.042553
    carrot       hand           7.388889
    person       door           5.026596
    knife        carrot         4.691489
    person       bycicle        4.339286
    carrot       knife          4.015152
    person       bag            3.800000
    person       motorbike      2.994286
    hand         water          2.453704
    bag          purse          2.409091
    purse        bag            2.345930
    water        hand           2.317308
    motorbike    bag            2.297297
    bag          motorbike      2.216667
    stroller     person         2.037879
    car          tree
    door         person
    purse        person
    bycicle      person

2.6 Discussion

Most of the existing video analysis systems (e.g., motion estimation, object tracking, video segmentation, object recognition, and activity analysis) use a bottom-up approach for inference. Despite the high correlation between these topics, solutions are often sought independently for each problem. We believe that the next step in developing video analysis techniques involves integrating top-down approaches by incorporating prior information at the object and action levels. For example, motion estimation can be performed in a completely different way by first recognizing the identities of the objects, accessing motion priors for each object category, and possibly integrating occlusion relationships of the objects in the scene to finally estimate the motion of the whole scene. As it is inherently easier to annotate a database of static images, propagating the annotations of static images to label a video database can be crucial to grow the video database in both scale and dimension. We showed, for example, how depth information can be propagated from static images to video sequences, but there is a lot more to explore. Recent advances in object recognition and scene parsing already allow us to segment and recognize objects in each frame. Object recognition, together with temporal smoothing to impose consistency across frames, could significantly reduce the human annotation labor necessary for labeling and tracking objects.

2.7 Conclusion

We designed an open, easily accessible, and scalable annotation system to allow online users to label a database of real-world videos. Using our labeling tool, we created a video database that is diverse in samples and accurate, with human-guided annotations. Based on this database, we studied motion statistics and cause-effect relationships between moving objects to demonstrate examples of the wide array of applications for our database. Furthermore, we enriched our annotations by propagating depth information from a static and densely annotated image database. We believe that this annotation tool and database can greatly benefit the computer vision community by contributing to the creation of ground-truth benchmarks for a variety of video processing algorithms and by serving as a means to explore information about moving objects.

Chapter 3

A data-driven approach for unusual event detection

3.1 Introduction

If we are told to visualize a street scene, we can imagine some composition with basic elements in it.
Moreover, if we are asked to imagine what can happen in it, we might say there is a car moving along a road, in contact with the ground and preserving some velocity and size relationships with respect to other elements in the scene (say a person or a building). Even when constrained by its composition (e.g., when shown a picture of it), we can predict things like an approximate speed for the car, and maybe even its direction (see Fig. 1.3). The human capacity for mental imagery and storytelling is driven by the years of prior knowledge we have about our surroundings. Moreover, it has been found that static images implying motion are also important in visual perception: they are able to produce motion after-effects [56] and even activate motion-sensitive areas in the human brain [21]. As a consequence, the human visual system is capable of accurately predicting plausible events in a static scene (or future events in a video sequence) and is finely tuned to flag unusual configurations or events. Event and action detection are well-studied topics in computer vision. Several works have proposed models to study, characterize, and classify human actions, ranging from constrained environments [33, 40] to actions in the "wild" such as TV shows, sporting events, and cluttered backgrounds [24, 34]. In this scenario, the objective is to identify the action class of a previously unknown query video given a training dataset of action exemplars (captured at different locations). A different line of work is that of event detection for video surveillance applications. In this case, the algorithm is given a large corpus of training video captured at a particular location as input, and the objective is to identify abnormal events taking place in the future in that same scene [20, 54, 55, 62]. Consequently, deploying a surveillance system requires days of data acquisition from the target location and hours of training for each new location. In this chapter we look into the problem of generic event prediction for scene instances different from the ones in some large training corpus. In other words, given an image (or a short video clip), we want to identify the possible events that may occur as well as the abnormal ones. We motivate our problem with a parallel to object recognition. Current event prediction and anomaly detection technologies for surveillance are analogous to object instance recognition. Many works in object recognition are moving towards the more generic problem of object category recognition [3, 4]. We aim to push the envelope in the video domain by introducing a framework that can easily adapt to new scene instances without requiring a model to be retrained for each new location. Moreover, other potential applications lie in the area of video collection retrieval in online services such as YouTube and Vimeo, where video clips are captured in different locations and differ greatly from controlled video sources such as surveillance feeds and TV programming, as pointed out by Zanetti et al. [61]. Given a query image, our purpose is to identify the events that are likely to take place in it. We have a rich video corpus with 2401 real-world videos acting as our prior knowledge of the world. In an offline stage, we generate and cluster motion tracks for each video in the corpus. Using scene matching, our system retrieves videos with similar image content.
Track information from the retrieved videos is integrated to make a prediction of where in the image motion is likely to take place. Alternatively, if the input is a video, we track and cluster salient features in the query and compare each cluster to the ones in the retrieved neighbor set. A track cluster can then be flagged as unusual if it does not match any in the retrieved set.

3.2 Related Work

Human action recognition is a popular problem in the video domain. The work by Efros et al. [5] learns optical flow correlations of human actions in low-resolution video. Schechtman and Irani exploit self-similarity correlations in space-time volumes to find similar actions given an exemplar query. Niebles et al. [34] characterize and detect human actions in complex video sequences by learning probability distributions of sparse space-time interest points. Laptev et al. densely extract spatio-temporal features in a grid and use a bag-of-features approach to detect actions in movies. Messing et al. model human activities as mixtures of bags of velocity trajectories extracted from track data. None of these works study the task of event prediction, and they are constrained to human actions. Similar in concept to our vision is the work by Li et al. [28], where the objective is action classification given an object and a scene. Our work is geared towards localized prediction, including trajectory generation, rather than classification. Extensive work has also taken place in event and anomaly detection for surveillance applications. One family of works relies on detecting, tracking, and classifying objects of interest and learning features to distinguish events. Dalley et al. detect loitering and bag-dropping events using a blob tracker to extract moving objects and detect humans and bags. The system identifies a loitering event if a person blob does not move for a period of time. Bag-dropping events are detected by checking the distance between a bag and a person; if the distance becomes larger than some threshold, it is identified as a dropped bag. A second family of works clusters motion features and learns distributions over motion vectors across time. Wang et al. [55] use a non-parametric Bayesian model for trajectory clustering and analysis. A marginal likelihood is computed for each video clip, and low-likelihood events are flagged as abnormal. One common assumption of these methods is that training data is available for each scene instance where the system will be deployed. Therefore, the knowledge built is not transferable to new locations, as the algorithm needs to be retrained with video feeds from each new location. Numerous works have demonstrated success in using rich databases for retrieving and/or transferring information to queries in both images [14, 15, 31, 51] and video [30, 44]. In video applications, Sivic et al. [44] proposed a video representation for exemplar-based retrieval within the same movie. Moving objects are tracked and their trajectories grouped. Upon selection of an image crop in some video frame, the system searches across video key frames for similar image regions and retrieves portions of the movie containing the queried object instance. The work proposed by Liu et al. [30] is the closest one to our system.
It introduces a method for motion synthesis from static images by matching a query image to a database of video clip frames and transferring the moving regions from the nearest-neighbor videos (identified as regions where the optical flow magnitude is nonzero) to the static query image. That work constructs independent interpretations per nearest neighbor. Instead, our work builds localized motion maps as probability distributions after merging votes from several nearest neighbors. Moreover, we aim for a higher-level representation where each moving object is modeled as a track blob, while [30] generates hypotheses as one motion region per frame. In summary, these works demonstrate the strong potential of data-driven techniques, which to our knowledge no prior work has extended to anomaly detection.

3.3 Scene-based video retrieval

The objective of this project is to use event knowledge from a training database of videos to construct an event prediction for a given static query image. To achieve some semantic coherence, we want to transfer event information only from similar images. Therefore, we need a good retrieval system that will return matches with similar scene structures (e.g., a picture of an alley will be matched with another alley photo shot from a similar viewpoint) even if the scene instances are different. In this chapter we explore the use of two scene matching techniques: GIST [36] and spatial pyramid dense SIFT [26] matching. The GIST descriptor encodes perceptual dimensions that characterize the dominant spatial structure of a scene. The spatial pyramid SIFT matching technique works by partitioning an image into subregions and computing histograms of local features in each subregion. As a result, images with similar global geometric correspondence can be easily retrieved. The advantage of both the GIST and dense SIFT retrieval methods is their speed and efficiency at projecting images into a space where semantically similar scenes are close together. This idea has proven robust in many non-parametric data-driven techniques such as label transfer [31] and scene completion [15], amongst many others. To retrieve the nearest videos from a database, we perform matching between the first frame of the video query and the first frame of each of the videos in the database.

3.4 Video event representation

We introduce a system that models a video as a set of trajectories of keypoints throughout time. Individual tracks are further clustered into groups with similar motion. These clusters will be used to represent events in the video.

3.4.1 Recovering trajectories

For each video, we extract trajectories of points in the sequence using an implementation of the KLT tracker [48] by Birchfield [2]. The KLT tracking equation seeks the displacement d = [d_x, d_y]^T that minimizes the dissimilarity between two windows, given a point p = [x, y]^T and two consecutive frames I and J:

    ε(d) = ∫∫_W [ J(p + d/2) − I(p − d/2) ]² w(p) dp    (3.1)

where W is the window neighborhood and w(p) is the weighting function (set to 1). Using a Taylor series expansion of J and I, the displacement that minimizes ε satisfies:

    ∫∫_W [ J(p) − I(p) + g^T(p) d ] g(p) w(p) dp = 0    (3.2)

where g = (1/2) [ ∂(I + J)/∂x, ∂(I + J)/∂y ]^T. The tracker finds salient points by examining the minimum eigenvalue of each 2-by-2 gradient matrix. We initialize the tracker by extracting 2000 salient points in the first video frame. The tracker then finds the correspondences of the points sequentially throughout the frames of the video. Whenever a track is broken (a point is lost due to high error or occlusions), new salient points are detected to maintain a consistent number of tracks throughout the video. As a result, the algorithm produces tracks, which are sequences of location tuples T = {(x(t), y(t))}_{t ∈ D} within a duration D for each tracked point. For more details on the implementation, we refer to the original KLT tracker paper.
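The system relies on Birchfield's KLT code; purely as a hedged illustration of the tracking step, the Python sketch below substitutes OpenCV's Shi-Tomasi corner detector and pyramidal Lucas-Kanade tracker, which follow the same principle. Lost points are simply dropped here, whereas the actual system re-detects new salient points to keep the number of tracks roughly constant.

    # A rough stand-in for the tracking stage (Sec. 3.4.1), using OpenCV's
    # pyramidal Lucas-Kanade tracker rather than the Birchfield KLT code used
    # in the thesis. frames: list of 8-bit grayscale images.
    import cv2

    def track_video(frames, max_points=2000):
        prev = frames[0]
        pts = cv2.goodFeaturesToTrack(prev, maxCorners=max_points,
                                      qualityLevel=0.01, minDistance=5)
        tracks = [[(0, float(p[0][0]), float(p[0][1]))] for p in pts]
        alive = list(range(len(tracks)))          # tracks still being followed
        for t in range(1, len(frames)):
            nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev, frames[t], pts, None)
            keep = []
            for i, (ok, p) in enumerate(zip(status.ravel(), nxt)):
                if ok:                            # point successfully followed
                    tracks[alive[i]].append((t, float(p[0][0]), float(p[0][1])))
                    keep.append(i)
            if not keep:                          # all points lost; stop early
                break
            alive = [alive[i] for i in keep]
            pts = nxt[keep]
            prev = frames[t]
        return tracks                             # each track: list of (frame, x, y)

The (frame, x, y) track form produced here is the representation assumed by the later sketches in this chapter.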
3.4.2 Clustering trajectories

Now that we have a set of trajectories for salient points in an image, we proceed to group them at a higher level. Ideally, tracks from the same object should be clustered together. We define the following distance function between two tracks:

    d_track(T_i, T_j) = (1 / |D_i ∩ D_j|) Σ_{t ∈ D_i ∩ D_j} sqrt( (x_i(t) − x_j(t))² + (y_i(t) − y_j(t))² )    (3.3)

We use this distance function to create an affinity matrix between tracks and use normalized cuts [43] to cluster them. Each entry of the affinity matrix is defined as W_ij = exp(−d_track(T_i, T_j) / σ²). The clustering output is thus a group label assignment for each track. See Fig. 3.1 for a visualization of the data. Since we do not know the number of clusters for each video in advance, we set a value of 10. In some cases this will cause an over-segmentation of the tracks and will generate more than one cluster for some objects.

Figure 3.1: Track clustering. Sample frames from the video sequence (a). The ground truth annotations, denoted by polygons surrounding moving objects (b), can be used to generate ground truth labels for the tracked points in the video (c). Our track distance affinity function is used to automatically cluster tracks into groups and generates fairly reasonable clusters, each of which roughly corresponds to an independent object in the scene (d). The track cluster visualizations in (c) and (d) show the first frame of each video and the spatial locations of all tracked points for the duration of the clip, color-coded by the track cluster each point belongs to.

3.4.3 Comparing track clusters

For each track cluster C = {T_i}, we quantize the instantaneous velocity of each track point into 8 orientations. To ensure rough spatial coherency between clusters, we superimpose a regular grid with a cell spacing of 10 pixels on top of the image frame to create a spatial histogram containing 8 sub-bins at each cell of the grid. Let H_1 and H_2 denote the histograms formed by track clusters C_1 and C_2, such that H_1(i, b) and H_2(i, b) denote the number of velocity points from the first and second track clusters, respectively, that fall into the b-th sub-bin of the i-th bin of the histogram, where i ∈ G and G denotes the bins of the grid. We define the similarity between two track clusters as the intersection of their velocity histograms:

    S_clust(C_1, C_2) = I(H_1, H_2) = Σ_{i ∈ G} Σ_{b=1}^{8} min( H_1(i, b), H_2(i, b) )    (3.4)

This metric was designed in the same spirit as the bottom level of the spatial pyramid matching method by Lazebnik et al. We aim for matches that approximately preserve global spatial correspondences. Since our video neighbor knowledge base is assumed to be spatially aligned to our query, a good match should also preserve an approximately similar spatial coherence.
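To illustrate Eqs. (3.3) and (3.4), the Python sketch below computes the track distance, the quantized-velocity spatial histogram of a track cluster, and the histogram intersection; the helper names are ours, and tracks are assumed to be in the (frame, x, y) form of the earlier sketch.

    # Minimal sketches of the track distance of Eq. (3.3) and the track-cluster
    # descriptor and similarity of Eq. (3.4). A track is a list of (frame, x, y)
    # tuples; a cluster is a list of tracks.
    import numpy as np

    def track_distance(track_a, track_b):
        a = {t: (x, y) for t, x, y in track_a}
        b = {t: (x, y) for t, x, y in track_b}
        common = sorted(set(a) & set(b))          # frames where both tracks exist
        if not common:
            return np.inf
        d = [np.hypot(a[t][0] - b[t][0], a[t][1] - b[t][1]) for t in common]
        return float(np.mean(d))                  # mean point-to-point distance

    def cluster_histogram(cluster, frame_shape, cell=10, n_orient=8):
        h, w = frame_shape
        hist = np.zeros((h // cell + 1, w // cell + 1, n_orient))
        for track in cluster:
            for (t0, x0, y0), (t1, x1, y1) in zip(track[:-1], track[1:]):
                dx, dy = x1 - x0, y1 - y0
                if dx == 0 and dy == 0:
                    continue                      # skip stationary points
                angle = np.arctan2(dy, dx) % (2 * np.pi)
                o = int(angle / (2 * np.pi) * n_orient) % n_orient
                gy, gx = int(y0) // cell, int(x0) // cell
                if 0 <= gy < hist.shape[0] and 0 <= gx < hist.shape[1]:
                    hist[gy, gx, o] += 1          # vote in the cell's orientation bin
        return hist

    def cluster_similarity(hist1, hist2):
        return float(np.minimum(hist1, hist2).sum())   # histogram intersection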
3.5 Video database and ground truth

Our database consists of 2277 videos belonging to 100 scene categories. The categories with the most videos are: street (809), plaza (135), interior of a church (103), crosswalk (82), and aquarium (75). Additionally, 14 videos containing unusual events were downloaded from the web (see Fig. 3.2 for some sample frames). 500 of the videos originate from the LabelMe video dataset [59]. As these videos were collected using consumer cameras without a tripod, there is slight camera shake. Using the LabelMe video system, the videos were stabilized. The object-level ground truth labeling in the LabelMe video database allows us to easily visualize the ground truth clustering of tracks and compare it with our automated results (see Fig. 3.1). We split the database into 2301 training videos and selected 134 videos from outdoor urban scenes plus the 14 unusual videos to create a test set with 148 videos.

3.6 Experiments and Applications

We present two applications of our framework. Given the information from nearest-neighbor videos, what can we say about an image if we were to see it in action? As an example, we can make good predictions of where motion is bound to happen in an image. We also present a method for determining the degree of anomaly of an event in a video clip using our training data.

Figure 3.2: Unusual videos. We define an unusual or anomalous event as one that is not likely to happen in our training data set. However, we ensured that these videos belong to scene classes present in our video corpus.

Figure 3.3: Localized motion prediction ROC (a) and unusual event detection ROC (b). The algorithm was compared against two scene matching methods (GIST and dense SIFT) as well as a baseline supported by random nearest neighbors. Retrieving videos similar to the query image improves the classification rate.

3.6.1 Localized motion prediction

Given a static image, we can generate a probabilistic map determining the spatial extent of motion. In order to estimate p(motion | x, y, scene), we use a Parzen window estimator and the trajectories of the N = 50 nearest-neighbor videos retrieved with scene matching methods (GIST or dense SIFT-based):

    p(motion | x, y, scene) = (1/N) Σ_{i=1}^{N} (1/M_i) Σ_{j=1}^{M_i} Σ_t K( x − x_ij(t), y − y_ij(t); σ )    (3.5)

where N is the number of videos, M_i is the number of tracks in the i-th video, and K(x, y; σ) is a Gaussian kernel of width σ². Fig. 3.3(a) shows the per-pixel prediction ROC curve comparing GIST nearest neighbors, dense SIFT matching, and, as a baseline, a random set of nearest neighbors. The evaluation set is composed of the first frame of each test video. We use the locations of the tracked points in the test set as ground truth. Notice that scenes can have multiple plausible motions occurring in them, but our current ground truth provides only one explanation. Despite this limited capacity for evaluation, notice the improvement when using SIFT and GIST matching to retrieve nearest neighbors. This graph suggests that (1) different sets of motions happen in different scenes, and (2) scene matching techniques do help filter out distracting scenes to make more reliable predictions (for example, a person climbing the wall of a building in a street scene would be considered unusual, but a person climbing a wall in a rock-climbing scene is normal).
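As a concrete illustration of Eq. (3.5), the Python sketch below accumulates the tracked points of the retrieved neighbor videos into a per-pixel vote map and smooths it with a Gaussian kernel; the per-video weighting and the final normalization are assumptions about details not spelled out in the text.

    # Minimal sketch of the localized motion map of Eq. (3.5). neighbor_tracks
    # is a list with one entry per retrieved video, each entry being a list of
    # tracks in (frame, x, y) form. sigma plays the role of the kernel width.
    import numpy as np
    from scipy.ndimage import gaussian_filter

    def motion_map(neighbor_tracks, frame_shape, sigma=15.0):
        h, w = frame_shape
        acc = np.zeros((h, w))
        for tracks in neighbor_tracks:                # one vote map per video
            votes = np.zeros((h, w))
            for track in tracks:
                for _, x, y in track:
                    xi, yi = int(round(x)), int(round(y))
                    if 0 <= xi < w and 0 <= yi < h:
                        votes[yi, xi] += 1
            if tracks:
                acc += votes / len(tracks)            # 1/M_i weighting per video
        density = gaussian_filter(acc, sigma)         # Gaussian (Parzen) smoothing
        return density / (density.sum() + 1e-12)      # normalize to a probability map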
Fig. 3.5(c) and 3.6(c) contain the probability motion maps constructed after integrating the track information from the nearest neighbors of each query video depicted in column (a). Notice that the location of high-probability regions varies depending on the type of scene. Moreover, the reliability of the motion maps depends on (1) how accurately the scene retrieval system returns nearest neighbors from the same scene category and (2) whether the video corpus contains similar scenes. The reader can get an intuition of this by looking at column (e), which contains the average nearest-neighbor image.

3.6.2 Event prediction from a single image

Given a static image, we demonstrated that we can generate a probabilistic function per pixel. However, we are not constrained to per-pixel information. We can use the track clusters of videos retrieved from the database and generate coherent track cluster predictions. One method is to directly transfer track clusters from nearest neighbors into the query image. However, this might generate too many similar predictions. Another way lies in clustering the retrieved track clusters. We use normalized cuts clustering for this step at the track cluster level, using the similarity function described in Equation 3.4 to compare pairs of track clusters. Fig. 3.4 shows example track clusters overlaid on top of the static query image. A required input to the normalized cuts algorithm is the number of clusters. We try a series of values from 1 to 10 and choose the clustering result that maximizes the distance between clusters. Notice how, for different query scenes, different predictions are generated that take the image structure into account.

Figure 3.4: Event prediction. Each row shows a static image with its corresponding event predictions. For each query image, we retrieve its nearest video clips using scene matching. The events belonging to the nearest neighbors are resized to match the dimensions of the query image and are further clustered to create different event predictions. For example, in a hallway scene, the system predicts motions of different people; in street scenes, it predicts cars moving along the road, etc.

3.6.3 Anomaly detection

Given a video clip, we can also determine whether an unusual event is taking place. First, we break down the video clip into query track clusters (which roughly represent object events) using the method described in Section 3.4. We also retrieve the top 200 nearest videos using scene matching. We negatively correlate the degree of anomaly of a query track cluster with the maximum track cluster similarity between the query track cluster and each of the track clusters from the nearest neighbors:

    anomaly(H_query) = − max_{H_neigh} I( H_query, H_neigh )    (3.6)

where H_query is the spatial histogram of the velocity histories of the query track cluster and H_neigh denotes the histogram of a track cluster originating from a nearest neighbor. Intuitively, if we find a similar track cluster in a similar video clip, we consider the event normal. Conversely, a poor similarity score implies that such an event (track cluster) does not usually happen in similar video clips. Fig. 3.5 shows examples of events that our system identified as common by finding a nearest neighbor that minimized their anomaly score.
Notice how the nearest track clusters are fairly similar to the query ones and also how the spatial layout of the nearest-neighbor scenes matches that of the query video. As a sanity check, notice the similarity of the nearest neighbors' average image to the query scene, suggesting that the scene retrieval system is picking the right scenes to make accurate predictions. Fig. 3.6 shows events with higher anomaly scores. Notice how the nearest neighbors differ from the queries. Also, the average images are indicators of noisy and random retrievals. By definition, unusual events will be less likely to appear in our database. However, if the database does not have enough examples of particular scenes, their events will be flagged as unusual. Fig. 3.3(b) shows a quantitative evaluation of this test. Our automatic clustering generates 685 normal and 106 unusual track clusters from our test set. Each of these clusters was scored, achieving similar classification rates when the system is powered by either SIFT or GIST matching and reaching a 70% detection rate at a 22% false alarm rate. We use the scenario of a random set of nearest neighbors as a baseline. Due to our track cluster distance function, if a cluster similar to the query cluster appears in the random set, our algorithm will be able to identify it and classify the event as common. However, notice that the scene matching methods demonstrate great utility in cleaning up the retrieval set and narrowing the videos down to fewer, more relevant ones. Fig. 3.7 shows some examples of our system in action.

Figure 3.5: Track cluster retrieval for common events. A frame from a query video (a), the tracks corresponding to one event in the video (b), the localized motion prediction map (c) generated after integrating the track information of the nearest neighbors (some examples shown in (d)), and the average image of the retrieved nearest neighbors (e). Notice the definition of high-probability motion regions in (c) and how their shape roughly matches the scene geometry in (a). The maps in (c) were generated with no motion information originating from the query videos.

Figure 3.6: Track cluster retrieval for unusual events (left and middle columns) and for scenes with few samples in our data set (right column). When presented with unusual events such as a car crashing into the camera or a person jumping over a car while in motion (left and middle columns; key frames can be seen in Fig. 3.7), our system is able to flag these as unusual events (b) due to their disparity with respect to the events taking place in the nearest-neighbor videos. Notice that the supporting neighbors belong to the same scene class as the query and that the motion map predicts movements mostly in the car regions. However, our system fails when an image does not have enough representation in the database (right).

Figure 3.7: Unusual event detection. Videos of a person jumping over a car and running across it (left) and a car crashing into the camera (right). Our system outputs anomaly scores for individual events. Common events are shown in yellow and unusual ones in red. The thickness and saturation of the red tracks are proportional to the degree of anomaly.
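Putting the retrieval and matching pieces together, the Python sketch below scores the track clusters of a query clip against all clusters gathered from the retrieved neighbor videos, following Eq. (3.6); it reuses the cluster_histogram and cluster_similarity helpers from the earlier sketch.

    # Minimal sketch of the anomaly score of Eq. (3.6). query_clusters and
    # neighbor_clusters are lists of track clusters (each a list of tracks).
    def anomaly_scores(query_clusters, neighbor_clusters, frame_shape):
        neighbor_hists = [cluster_histogram(c, frame_shape) for c in neighbor_clusters]
        scores = []
        for cluster in query_clusters:
            h_q = cluster_histogram(cluster, frame_shape)
            best = max((cluster_similarity(h_q, h_n) for h_n in neighbor_hists),
                       default=0.0)
            scores.append(-best)              # higher (less negative) = more unusual
        return scores

A query cluster that finds no well-matching cluster among the retrieved neighbors keeps a high anomaly score, which is what produces the red tracks in Fig. 3.7.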
3.7 Discussion and concluding remarks

We have presented a flexible and robust system for unsupervised localized motion prediction and anomaly detection powered by two phases: (1) scene matching to retrieve similar videos given a query video or image, and (2) motion matching via a scene-inspired and spatially aware histogram matching technique for velocity information. We emphasize that most of the work in the literature focuses on action recognition and detection and requires training models for each different action category. Our method has no training phase, is quick, and naturally extends into applications that are not available under other supervised learning scenarios. Experiments demonstrate the validity of our approach when given enough video samples of real-world scenes. We envision its applicability in areas such as finding unique content in video sharing websites and future extensions in surveillance applications.

Chapter 4

Car trajectory prediction from a single image

4.1 Introduction

Object, scene, action, and event recognition/detection are active areas of research in image understanding. They focus on understanding the present content of images and video. The concept of event prediction focuses instead on what is likely to happen in the future given a present configuration. If we are at an intersection and see a car approaching, we can foresee where it will be in the next few seconds. The problem of event prediction from a single image was originally posed by Liu et al. in [31] and further studied by Yuen and Torralba in [60]. The setting involves a static image, and the objective is to determine the future of its elements as if it were a frame in a video sequence. The motivation for this problem lies at the core of dynamic scene understanding. This problem, seemingly impossible at first glance, has been approached using data-driven approaches. While we have not seen hours of video from the particular intersection we might be at, we have seen many examples of streets and intersections in general. In [31], the single image can be matched against video frames in a large database and, given a good match, new dynamic objects can be introduced into the image, or a localized motion map can be computed; if an optical flow field or a group of motion tracks is transferred to the image, they can also provide cues about the types of motions to expect in such an image. This chapter approaches event prediction in a different way. Our key contribution lies in modeling dynamic scenes in 3D and at the object level. Given a static image of a street scene and a car detection in it, our system matches the car and its scene against all cars in similar scenes in a subset of the LabelMe video database [59]. It ranks the detected cars as a function of scene, pose, and location compatibility. As the selected cars originate from videos, we can track their trajectories and estimate the 3D structure of the scenes they belong to. Having trajectories and annotations in world coordinates allows placing all objects in the database in a common reference frame and then re-projecting them onto the query image for a better fit in the new scene. Our model implicitly encapsulates object semantics (what the object is and what motion it is likely to perform) as well as real-world dimensions and velocities of objects.

4.2 Related Work

Human action recognition is an important problem in computer vision.
In this setup, the objective is to detect, understand, and make sense of what an object is doing across time. Many of these works leverage the information content of large databases of videos to discover and build action models [5, 25]. Schechtman and Irani [42] use a self-similarity descriptor to match video queries to actions in a database. In [34], Niebles et al. use sparse space-time interest point distributions to describe human actions in video sequences. This work was later expanded to detect more complicated actions in Olympic sports by representing activities as compositions of various atomic motion segments [35]. The work by Laptev et al. [25] learns human actions in unconstrained videos such as professional movies. It uses spatio-temporal interest points to characterize action classes and addresses action annotation automatically via movie scripts and captions. In the work by Messing et al. [33], human actions are modeled as mixtures of bags of velocity trajectories extracted from track data. While this is an active and extensive research area, it differs from the problem of object trajectory prediction, as the latter is constrained to much less information at query time and the algorithm is expected to synthesize the motion of an object instance it has not observed previously. In a similar flavor to event prediction lie several works classifying images based on the events they imply [9, 28, 57]. Amongst them, Li et al. integrate object- and scene-level information to determine human-centric actions present in static images. Yao et al. use a structured representation to model human actions (such as a person playing the violin), also from static images. In [19], Jie et al. find image-caption correspondences amongst news images and text; they also learn visual models linking poses and faces to action verbs. This problem differs from event prediction in that its objective is to output the name of an action class and not to provide future configurations of the objects in question. In the video surveillance realm, large amounts of video data are available to learn patterns of activity for a particular scene. Unsupervised models such as [22, 55] learn spatio-temporal dependencies of moving agents in these complex and dynamic scenes. As a consequence, complex temporal rules, such as the right of way at an intersection, are discovered from the long video feeds. However, the knowledge built into these models for surveillance is specific to one scene instance and cannot be transferred to a previously unseen location. Many works have demonstrated success using 3D representations for image understanding and tracking. Images are a 2D projection of our 3D world; this projection depends on camera parameters. When reasoning about objects across images, it has proven useful to map a scene and its objects into a 3D coordinate system. For instance, in the mobile vision system by Ess et al. [7], pedestrians in crowded scenes are tracked by performing pose estimation and prediction of the next frame's motion. It makes use of cues such as stereo, depth, scene geometry, and feedback from tracking. The work by Leibe et al. [27] integrates the detection and frame-level prediction of cars and pedestrians in an online mobile system. Its trajectory estimation module analyzes the 3D observations to find physically plausible space-time trajectories.
Even when the camera parameters are unknown (e.g., web images), some assumptions can be made regarding the camera position and parameters by using the scene layout. For instance, Hoiem et al. use the appearance of image regions to learn and classify geometric classes, which in turn describe the 3D orientation of a region with respect to the camera. The 3D information of a scene can also be used to build priors regarding the location and scale of objects such as cars and pedestrians, as well as for single-view 3D reconstructions from static images [16, 17, 18, 38]. These applications demonstrate the power that 3D representations can have in object recognition, detection, and tracking tasks. To the extent of our knowledge, no other work models objects in a diverse video database captured with consumer point-and-shoot cameras where the camera information is unknown.

4.3 Object and trajectory model

Images and videos in LabelMe are captured at different locations and by different subjects. To make the best use of the data regardless of the scene or camera setup, we make some assumptions and use the moving objects as a cue to estimate the scene parameters for each video. Consequently, the object trajectories are mapped to world coordinates and placed into a global reference frame.

Camera and scene layout. This work uses the same single-view image representation as [38], where a scene is composed of objects represented as piecewise planes. No camera parameters are available for the videos in our data set. To make estimates of real-world dimensions, we make the following assumptions on the data: (1) the location of the camera is at the mean human height level (e.g., 160 cm), (2) the ground plane is flat, (3) the only camera rotation is due to pitch (no yaw or roll), and (4) people and cars are always in contact with the ground plane and do not change in size. Assumptions (2) and (3) result in a horizontal horizon line.

Figure 4.1: Estimated average height and speed from annotations in the LabelMe video dataset. With the approach first introduced by Hoiem et al., we can estimate the average height of objects in the data set. From video, we additionally estimate average velocities.

4.3.1 From 2D to 3D

In a LabelMe video annotation, each object is represented by a list of polygons, one at each time frame. We simplify the annotations by using their bounding boxes. Since cars and people stand on the ground plane, we consider the two lower points of each bounding box as the contact points with the ground. Let x_i^t be a vector containing the image coordinates of the bounding box belonging to the i-th object at frame t, where t ∈ D_i, the duration of the trajectory. We assume that each object stays constant in size in 3D and thus has a real-world width w_i and height h_i associated with it. Let v_y be the image coordinate of the y component of the horizon line. The vector of parameters to estimate is then θ = [w_1, h_1, ..., w_n, h_n, v_y]. We call the base point of a bounding box the midpoint between its two contact points with the ground. The base points of all bounding boxes are converted to world coordinates using the current parameters in the vector θ. Bounding boxes in the 3D coordinate system are then reconstructed using the base point information and the estimated world dimensions of the object. Finally, let x̂_i^t denote the projection of these bounding boxes from world coordinates back to the image plane. This results in an optimization problem with the objective function:

    E(θ) = Σ_i Σ_{t ∈ D_i} || x_i^t − x̂_i^t ||² + Σ_i ( || w_i − w_can(c_i) ||² + || h_i − h_can(c_i) ||² )

The constants w_can and h_can denote the canonical width and height for an object of class c_i. These constants can be taken from real-world statistics. For details on the processes to convert from image to world coordinates and vice versa, we refer to the literature on multi-view geometry [13].
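To make the image-to-world conversion concrete, the Python sketch below back-projects a bounding box onto the ground plane under the assumptions above; the focal length (in pixels) is an assumed constant, since the videos carry no calibration, and the lateral term ignores the effect of camera pitch, a common approximation.

    # Minimal sketch of the 2D-to-3D conversion under the assumptions of
    # Sec. 4.3: flat ground, camera height cam_h, horizontal horizon at image
    # row v_y (rows grow downward). f (focal length in pixels) and image_width
    # are assumptions, since the videos carry no calibration data.
    def box_to_world(box, v_y, image_width, cam_h=1.60, f=800.0):
        """box: (left, top, right, bottom) in pixels -> (X, Z, height) in meters."""
        left, top, right, bottom = box
        db = bottom - v_y                     # pixels below the horizon line
        if db <= 0:
            raise ValueError("object bottom must lie below the horizon")
        Z = f * cam_h / db                    # depth of the ground contact point
        u = 0.5 * (left + right)              # base point: midpoint of the bottom edge
        X = Z * (u - 0.5 * image_width) / f   # lateral offset on the ground plane
        height = cam_h * (bottom - top) / db  # world height from similar triangles
        return X, Z, height

The same relations, run in reverse, re-project a world-coordinate trajectory back into a new query image once its horizon line is known.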
We estimated the 3D measurements of all the car and person polygons in the annotated videos. As shown by Hoiem et al. [18], having world measurements for objects, we can generate statistics on the size and velocity of the objects (see Figure 4.1). For instance, based solely on annotations, we find that the mean height of a person is 1.73 m and their mean velocity is 5.45 km/h, while the mean height of a car is 1.60 m and its mean velocity in the data set is 36.06 km/h.

4.4 From trajectories to action discovery

Having object trajectories in world coordinates gives us the freedom to place them in a common frame of reference despite their original locations in the images or the specifics of the scene. To compare trajectories, each is translated such that the base point of its first box lies at the world origin.

Figure 4.2: Visualization of the trajectory feature. The ground plane is divided radially into 8 equally sized regions. Each trajectory is translated to the world center and described by the normalized count of bounding boxes landing in each region.

Once all the trajectories are translated to the origin, we compact them into feature descriptors. The ground plane is divided into 8 equally sized angular regions centered at the origin (see Figure 4.2 for a visualization). A trajectory is described by the normalized count of boxes landing in each region. Finally, the trajectories for each object class are clustered in this feature space, resulting in the clusters visualized in Figure 4.3; notice how each cluster represents motions in different directions.

Figure 4.3: Discovered motion clusters for each object class. Object trajectories for each class are normalized and transferred to a common point in world coordinates. The trajectories are further clustered, and each cluster is visualized as an energy map summarizing all of the trajectories belonging to the cluster. The trajectories have been translated, re-projected, and resized to fit the displayed image crops.

4.4.1 Car trajectory prediction from a single image

The previous section described how we convert the location of objects from a 2D representation to a 3D one. This section describes the general pipeline for trajectory prediction. Given a car detection in a static image (we assume the images depict urban environments similar to the ones in the video database), the task is to predict a plausible trajectory for it. Figure 1.4 shows a visualization of such a trajectory drawn over the input image. Our prediction method relies on scene and object matching in a non-parametric manner. We begin by retrieving the nearest videos at the scene level using the gist descriptor.
We do this by computing gist features for each frame in the video dataset and comparing the gist descriptor of the query to that of each video frame in our dataset. Every frame in each video is ranked, and the frame closest to the query image becomes the representative of the video it belongs to. Finally, we gather the top 200 frames (each representing a separate video) closest to our query image to work with. This first matching phase selects videos close to the query at the scene level. However, we aim for a prediction at the object level; we need to reason about the data contained in the nearest neighbors at the object (in this case, car) level. Therefore, we proceed by detecting the cars contained in the selected video frames using the Latent Deformable Part Model (LDPM) detector [11]. In selecting the appropriate car from which to transfer information to the query, we re-rank the detections using the following criteria:

* How confident the detection is (the score).
* How close the originating scene is to the query (the gist distance).
* Whether the detection is of a similar size to the one in the query (bounding box intersection).
* How similar the pose of the detected car is to that of the query.

To ensure that detections with a pose similar to the query are scored higher, we utilize the concept of an Exemplar SVM (eSVM) detector introduced by Malisiewicz et al. [32]. For each query object, we train a separate exemplar SVM with only the crop of the query car as a positive example and a conventional set of negative instances generated from 300 images that do not contain cars. This yields an instance-specific classifier that detects windows similar to our instance-level template.

Figure 4.4: Image containing the top detection using the LDPM car detector (in blue) and the top exemplar SVM detection (right), together with the single positive exemplar used to train the eSVM (left). The LDPM detector is trained on many instances of cars varying in shape and pose. The non-maximum-suppression phase rules out overlapping detections and scores the blue detection as the highest. The eSVM, trained on the single positive instance (left), identifies the window that best matches the hatchback template from the query (in this case the side of the cab, excluding its hood). Our approach aims at detecting complete cars using the LDPM detector and filtering these detections using an eSVM detector trained on the query crop. We compare the bounding box intersection between the eSVM detection and the DPM one and discard detections that do not overlap more than 70%.

In practice, however, what we observe is the scenario depicted in Figure 4.4. Having only one positive example, the eSVM will fire on parts of the car that satisfy the template. Figure 4.4 shows how the eSVM fires on a portion of the cab, excluding its hood, to satisfy the template of a hatchback model. While this result is reasonable, our application benefits from discarding partial detections. The LDPM detector is trained on a large variety of cars and, after non-maximum suppression, is able to filter out most partial car detections. Therefore, we exploit the comprehensive knowledge base of the LDPM detector and use the instance-tuned capabilities of the eSVM detector to discard LDPM detections that differ from our exemplar query. We begin by defining a preliminary score E for some LDPM detection b. Let B_esvm denote the set of eSVM detections for the image that b originates from.
E(b) returns the maximum bounding box intersection between b and each of the found eSVM detections:

    E(b) = max_{b_e ∈ B_esvm} int(b, b_e)

The LDPM detector is executed on each gist neighbor image I_neigh, yielding the corresponding detections. The new score for each LDPM detection is defined as:

    S(b_neigh) = LDPM(b_neigh) + int(b_query, b_neigh) + G(I_neigh, I_query)    if E(b_neigh) > 0.7
    S(b_neigh) = 0                                                              otherwise

where b_neigh denotes an LDPM detection on I_neigh, LDPM(b_neigh) is a value between 0 and 1 for the detection score (the original scores were mapped through a logistic function), and G(I_neigh, I_query) denotes the gist compatibility (also between 0 and 1) between the query image and the neighbor image I_neigh that b_neigh originates from. The final score thus depends on the bounding box intersection with the query, the gist distance between scenes, and the LDPM score. Any detection that does not have a supporting eSVM detection overlapping more than 70% is discarded. Figure 4.5 shows the top detections ranked using different metrics. Once a detection is selected as a source for trajectory transfer, the trajectory of the originating object is transferred to the location of the static detection by computing the location of the detection in world coordinates and translating the retrieved object trajectory to the desired location. The trajectory can be generated by tracking the object. Finally, the size of the bounding boxes is adjusted to fit the actual size of the detection in the static image. In order to calculate the real-world dimensions of a bounding box, we require the location of the horizon line. Making the same assumptions as described in Section 4.3, we need to find the location of the horizon line, v_y. In a static image, the horizon line can be estimated automatically using the gist descriptor [36, 49] or by reasoning about object detections [18], or it can be selected manually.

4.5 Experimental evaluation

Ground truth data for event prediction is not well defined. Even when extracting trajectories from real-world video, only one outcome is observed, while in actuality there exists an entire family of realistic predictions per detection. In prior work [30], an evaluation of motion predictions compared a set of generated predictions against the trajectory observed in video amongst a set of distractors. This approach works well for identifying whether, within a number of attempts, the algorithm recovers the one observed motion, but it does not take into account the quality of the remaining predictions. We take a different approach and conduct a user study. The objective of these experiments is to evaluate the quality of object-level event predictions. Therefore, only examples with reliably detected objects were used as queries in our test set. To create this set, we randomly selected 500 videos and a random frame in each. We ran a car detector [11] on this subset, sorted the detections in descending order of confidence, and selected the top 30 detections. For each detection, 5 predictions are generated in two background modes. The first mode is a synthetic background in which there are no other objects except for the one in question and the horizon line matches the one in the originating scene (see an example in Figure 4.7). The subjects were asked to rate each prediction, based solely on its appearance, as very likely, unlikely but possible, impossible, or cannot tell.
Figure 4.8 (left) shows that for 84% of the test cases under the original scene, at least one of the 5 predictions was classified as very likely. This value drops to 12% when requiring all 5 predictions to be very likely. The figure also plots the same graph when considering both very likely and unlikely but possible answers, showing an 8% increase in performance for the 1-prediction case and up to a 30% increase when evaluating all 5 predictions. A second mode of the test uses the same predictions but situates the objects in their original scene. In this test, the users are asked to consider both the appearance of the object and the surrounding scene. They are asked to treat the trajectory as a line on the floor in 3D and to consider predictions wrong if they intersect obstacles in the world (e.g., buildings). The test subjects are told to use their prior knowledge of where these objects are likely to be in a real scene when answering the questions; for example, even if a car is moving along a trajectory that matches its appearance, it is unlikely to be moving on the sidewalk. See Figure 4.7 for an example of both test settings. Figure 4.8 summarizes the results of the user study, where each test was given to 6 different subjects. We evaluate the data inspired by the evaluation criterion in [31]: for each object, we count the number of predictions that were marked likely or very likely. The bar chart shows the percentage of object samples for which at least n predictions (out of 5 total) were deemed likely or very likely. The error bars indicate the variance amongst subjects. Since the method does not integrate semantics regarding the world the object lives in, we also tested the predictions in a synthetic environment (see an example in Figure 4.7), where subjects are asked to rate a prediction based solely on the pose of the object. Interestingly, the same predictions placed in a virtual world performed slightly worse compared to the same experiment in the original scene for some objects where only one of the five predictions was correct. This may mean that humans are to some extent unsure of the pose in some examples and that context helps to disambiguate. The variance amongst subjects when judging the objects in their original scene is lower than when judging isolated objects. This might suggest that the contextual information of the surrounding scene helps in the perception of predictions in a scene and complements the pose.

4.6 Discussion and concluding remarks

This chapter presented a framework for object-level event prediction. Our framework produces plausible car trajectories. Our key contribution lies in the 3D representation of scenes at the object level. We maximize the amount of transferable information in our training set by estimating 3D trajectories from 2D video annotations and tracks. In order to generate suitable predictions, we employ a non-parametric framework. Given a query image with a bounding box of the selected object, we retrieve the closest scenes in the video database and detect the cars in the nearest scenes. The detections are further re-ranked to account for their pose, location, detection confidence, and scene similarity. Finally, we transfer trajectories to the detections in the static image. Experimental results show that our system is able to make more reliable and consistent predictions compared to prior work.
Given the modularity of our system, we can easily substitute different object detection engines, making this framework reusable as newer and better detectors, feature descriptors, or specialized pose descriptors [6, 12] become available. In this work we have focused on cars given the nature of the data available; however, adding new objects (such as boats or bikes) is possible given sufficient detections and/or video annotations. We envision prediction technologies as important for the future development of devices such as autonomous navigation systems and artificial vision systems for the blind.

Figure 4.5: Top candidate detections for trajectory transfer. We detect cars in the 200 nearest video frames. The naive approach of considering only the gist distance between scenes results in very few reliable detections amongst the top scenes (a). Ordering detections by the LDPM detection score gives a higher ratio of reliable car detections (b); however, there is no guarantee that the detections will have the same pose as the query. An exemplar SVM approach focuses the search on windows similar to the query (c); however, this approach sometimes fires on only portions of entire cars (see yellow boxes). Finally, our approach integrates scene (gist) similarity, bounding box intersection with the query detection, and the LDPM and eSVM scores (d).

Figure 4.6: Predictions from a single image. (a) For each object, we can predict different trajectories, even from the same action/trajectory group. (b) Other example predictions; note the diversity in locations and sizes of objects and how their predictions match the motion implied by their appearance. (c) Failure cases can take place when the appearance of the object is not correctly matched to the implied family of actions, when the horizon line is not correctly estimated (or is not horizontal), or when there are obstacles in the scene that interfere with the predicted trajectory.

Figure 4.7: User study scenarios. The prediction evaluation is presented in a synthetic world (1) and in the original image where the object resides (2). In the synthetic scenario, the user is asked to determine the quality of the prediction based solely on the pose, without considering the semantics of the scene, whereas in the original scene the user is asked to judge the 3D trajectory taking the scene elements into consideration (e.g., cars should move on the road and not on sidewalks or through obstacles).

Figure 4.8: User study results. A set of 30 reliable car detections comprises our test set. Our algorithm was configured to output 5 predictions per example. Subjects were asked to score the predictions as very likely, unlikely but possible, impossible, or cannot tell. Each bar represents the percentage of objects for which at least x (out of the total 5) predictions are deemed likely. Our algorithm is evaluated under a synthetic background and a real one (blue and red bars).
Chapter 5

Conclusion

The previous chapters have presented the different components of a pipeline for data-driven event prediction and unusual event detection. They began with the design and implementation of a tool for generating ground truth information in video. As a result, we generated a diverse video database and a framework for annotating objects and events. This database was then used as the building block for several applications, including object motion statistics, unusual event detection, and event prediction, amongst others.

5.1 Contributions

Training data is crucial for most computer vision systems. Nowadays, video acquisition at a large scale is feasible thanks to the wide availability of consumer cameras, mobile devices, and video sharing websites. In Chapter 2, we introduced a video sharing resource for the research community. Videos are copyright-free and open for anyone to use. To date, we have collected 7166 videos spanning different scene categories such as streets, parking lots, offices, museums, etc. The current capabilities of the system include: (1) the creation of ground truth annotations at the object level, delineating objects throughout their lifetime in the videos, and (2) the annotation of events, encapsulating the interaction of annotated objects at a higher level. The content in Chapter 2 was the building block for the applications in the rest of the thesis. The second key contribution of this thesis (Chapter 3) uses the LabelMe video database to build a model of common events. It builds on the premise that humans are capable of transferring personal experiences in the real world to new scenes. In particular, humans are finely tuned to identify events out of the ordinary. The work in that chapter made use of scene matching to retrieve videos similar to a query and integrated the motion information in the results to build a model of "normal" events. Finally, it compares the observed motion information against the motion information deemed "normal" and assigns scores to the observed motions. The experimental results prove the validity of the approach when given enough video samples of real-world scenes. The work in Chapter 4 explores a different direction and looks more deeply into event prediction. It is based on the scene matching framework used in Chapter 3 but integrates the information from the object annotations produced in Chapter 2. We introduced a framework for object-level prediction, in this case for cars. In Chapter 3, we looked into an unsupervised method exploiting the inherent data in the LabelMe video dataset. In Chapter 4, we take a supervised approach. We use the object annotations of cars to infer real-world measurements of the scene and the objects in it. By translating objects in different videos to a common frame of reference, we maximize the amount of transferable information in our training set. Finally, given a static image, we predict car trajectories expressed in either real-world or image units. This thesis has introduced groundwork towards the study of diverse video databases. It has also re-framed traditional problems in computer vision. Traditional unusual event detection systems were framed in a monolithic manner, where large amounts of video were recorded for a single scene, and such data could not be transferred to new scenes. We use scene matching to find the subset of videos that is closest to the query and integrate the data in the nearest neighbors to learn the most common motion patterns.
Finally, this thesis posed a new problem in computer vision: event prediction from a single image. We envision prediction technologies as important for the future development of devices such as autonomous navigation systems and artificial vision systems for the blind.

Bibliography

[1] Marc Alexa, Daniel Cohen-Or, and David Levin. As-rigid-as-possible shape interpolation. In SIGGRAPH '00: Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, pages 157-164, 2000. doi: http://doi.acm.org/10.1145/344779.344859.

[2] S. Birchfield. Derivation of Kanade-Lucas-Tomasi tracking equation. Technical report, 1997. URL http://www.ces.clemson.edu/~stb/klt/.

[3] Navneet Dalal and William Triggs. Histograms of oriented gradients for human detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2005.

[4] C. Desai, D. Ramanan, and C. Fowlkes. Discriminative models for multi-class object layout. In International Conference on Computer Vision, 2009.

[5] Alexei A. Efros, Alexander C. Berg, Greg Mori, and Jitendra Malik. Recognizing action at a distance. In International Conference on Computer Vision, 2003.

[6] M. Eichner, M. Marin-Jimenez, A. Zisserman, and V. Ferrari. Articulated human pose estimation and search in (almost) unconstrained still images. ETH Technical Report, 2010.

[7] A. Ess, B. Leibe, K. Schindler, and L. V. Gool. A mobile vision system for robust multi-person tracking. In IEEE Conference on Computer Vision and Pattern Recognition, 2008.

[8] M. Everingham, A. Zisserman, C. K. I. Williams, and L. Van Gool. The PASCAL visual object classes challenge 2006 (VOC 2006) results. Technical report, September 2006.

[9] L. Fei-Fei and L.-J. Li. What, where and who? Telling the story of an image by activity classification, scene recognition and object categorization. In Studies in Computational Intelligence - Computer Vision, 2010.

[10] L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. IEEE Trans. on Pattern Analysis and Machine Intelligence, 28(4):594-611, 2006.

[11] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In IEEE Conference on Computer Vision and Pattern Recognition, 2008.

[12] V. Ferrari, M. Marin-Jimenez, and A. Zisserman. Pose search: retrieving people using their pose. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.

[13] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2000.

[14] J. Hays and A. A. Efros. IM2GPS: estimating geographic information from a single image. In IEEE Conference on Computer Vision and Pattern Recognition, 2008.

[15] James Hays and Alexei Efros. Scene completion using millions of photographs. In SIGGRAPH, 2007.

[16] D. Hoiem, A. A. Efros, and M. Hebert. Automatic photo pop-up. In SIGGRAPH, 2005. URL http://www-2.cs.cmu.edu/~dhoiem/projects/popup/.

[17] D. Hoiem, A. A. Efros, and M. Hebert. Geometric context from a single image. In International Conference on Computer Vision, 2005. URL http://www.cs.cmu.edu/~dhoiem/projects/context/index.html.

[18] D. Hoiem, A. Efros, and M. Hebert. Putting objects in perspective. In IEEE Conference on Computer Vision and Pattern Recognition, 2006.

[19] L. Jie, B. Caputo, and V. Ferrari.
Who's doing what: Joint modeling of names and verbs for simultaneous face and pose annotation. In Advances in Neural Info. Proc. Systems, 2009.

[20] I. N. Junejo, O. Javed, and M. Shah. Multi feature path modeling for video surveillance. In International Conference on Pattern Recognition, volume 2, 2004.

[21] B. Krekelberg, S. Dannenberg, K. P. Hoffmann, F. Bremmer, and J. Ross. Neural correlates of implied motion. Nature, 424:674-677, 2003.

[22] D. Kuettel, M. D. Breitenstein, L. V. Gool, and V. Ferrari. Discovering spatio-temporal dependencies in dynamic scenes. In IEEE Conference on Computer Vision and Pattern Recognition, 2010.

[23] Jean-François Lalonde, Derek Hoiem, Alexei A. Efros, Carsten Rother, John Winn, and Antonio Criminisi. Photo clip art. ACM Transactions on Graphics (SIGGRAPH 2007), 26(3), August 2007.

[24] I. Laptev and P. Perez. Retrieving actions in movies. In International Conference on Computer Vision, 2007.

[25] Ivan Laptev, Marcin Marszalek, Cordelia Schmid, and Benjamin Rozenfeld. Learning realistic human actions from movies. In IEEE Conference on Computer Vision and Pattern Recognition, 2008.

[26] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2169-2178, 2006.

[27] B. Leibe, K. Schindler, N. Cornelis, and L. Van Gool. Coupled object detection and tracking from static cameras and moving vehicles. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2008.

[28] L.-J. Li and L. Fei-Fei. What, where and who? Classifying events by scene and object recognition. In International Conference on Computer Vision, 2007.

[29] C. Liu, W. T. Freeman, E. H. Adelson, and Y. Weiss. Human-assisted motion annotation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1-8, 2008.

[30] C. Liu, J. Yuen, A. Torralba, J. Sivic, and W. T. Freeman. SIFT flow: dense correspondence across different scenes. In European Conference on Computer Vision, 2008.

[31] C. Liu, J. Yuen, and A. Torralba. Nonparametric scene parsing: Label transfer via dense scene alignment. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.

[32] Tomasz Malisiewicz, Abhinav Gupta, and Alexei A. Efros. Ensemble of exemplar-SVMs for object detection and beyond. In International Conference on Computer Vision, 2011.

[33] Ross Messing, Chris Pal, and Henry Kautz. Activity recognition using the velocity histories of tracked keypoints. In International Conference on Computer Vision, 2009.

[34] J. C. Niebles, H. Wang, and L. Fei-Fei. Unsupervised learning of human action categories using spatial-temporal words. Int. J. Comput. Vision, 79(3):299-318, 2008. doi: http://dx.doi.org/10.1007/s11263-007-0122-4.

[35] J. C. Niebles, C. W. Chen, and L. Fei-Fei. Modeling temporal structure of decomposable motion segments for activity classification. In European Conference on Computer Vision, 2010.

[36] A. Oliva and A. Torralba. Modeling the shape of the scene: a holistic representation of the spatial envelope. IJCV, 42(3):145-175, 2001.

[37] J. Ponce, T. L. Berg, M. Everingham, D. A. Forsyth, M. Hebert, S. Lazebnik, M. Marszalek, C. Schmid, B. C. Russell, A. Torralba, C. K. I. Williams, J. Zhang, and A. Zisserman. Dataset issues in object recognition. In Toward Category-Level Object Recognition, Springer-Verlag Lecture Notes in Computer Science, J. Ponce, M. Hebert, C. Schmid, and A. Zisserman (eds.), 2006.
[38] B. C. Russell and A. Torralba. Building a database of 3D scenes from user annotations. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.

[39] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. LabelMe: a database and web-based tool for image annotation. IJCV, 77(1-3):157-173, 2008.

[40] Christian Schuldt, Ivan Laptev, and Barbara Caputo. Recognizing human actions: A local SVM approach. In International Conference on Pattern Recognition, 2004.

[41] Flickr photo sharing service. http://www.flickr.com.

[42] E. Shechtman and M. Irani. Matching local self-similarities across images and videos. In IEEE Conference on Computer Vision and Pattern Recognition, 2007.

[43] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2000.

[44] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In International Conference on Computer Vision, 2003.

[45] Alan F. Smeaton, Paul Over, and Wessel Kraaij. Evaluation campaigns and TRECVid. In MIR '06: Proceedings of the 8th ACM International Workshop on Multimedia Information Retrieval, 2006.

[46] A. Sorokin and D. Forsyth. Utility data annotation with Amazon Mechanical Turk. In IEEE Workshop on Internet Vision, associated with CVPR, 2008.

[47] M. Spain and P. Perona. Measuring and predicting importance of objects in our visual world. Technical Report 9139, California Institute of Technology, 2007.

[48] Carlo Tomasi and Takeo Kanade. Detection and tracking of point features. IJCV, 1991.

[49] A. Torralba and P. Sinha. Statistical context priming for object detection. In International Conference on Computer Vision, pages 763-770, 2001.

[50] A. Torralba, R. Fergus, and W. T. Freeman. 80 million tiny images: a large dataset for non-parametric object and scene recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence, 30(11):1958-1970, 2008.

[51] A. Torralba, R. Fergus, and W. T. Freeman. Tiny images. Technical Report AIM-2005-025, MIT AI Lab Memo, September 2005.

[52] L. von Ahn and L. Dabbish. Labeling images with a computer game. In SIGCHI, 2004. URL http://www.espgame.org/.

[53] L. von Ahn, R. Liu, and M. Blum. Peekaboom: A game for locating objects in images. In ACM CHI, 2006. URL http://peekaboom.org.

[54] X. Wang, K. Tieu, and E. Grimson. Learning semantic scene models by trajectory analysis. In European Conference on Computer Vision, 2006.

[55] X. Wang, K. T. Ma, G. Ng, and E. Grimson. Trajectory analysis and semantic region modeling using a nonparametric Bayesian model. In IEEE Conference on Computer Vision and Pattern Recognition, 2008.

[56] J. Winawer, A. C. Huk, and L. Boroditsky. A motion aftereffect from still photographs depicting motion. Psychological Science, 19:276-283, 2008.

[57] B. Yao and L. Fei-Fei. Modeling mutual context of object and human pose in human-object interaction activities. In IEEE Conference on Computer Vision and Pattern Recognition, 2010.

[58] B. Yao, X. Yang, and S.-C. Zhu. Introduction to a large scale general purpose ground truth dataset: methodology, annotation tool, and benchmarks. In EMMCVPR, 2007.

[59] J. Yuen, B. C. Russell, C. Liu, and A. Torralba. LabelMe video: Building a video database with human annotations. In International Conference on Computer Vision, 2009.

[60] J. Yuen and A. Torralba. A data-driven approach for event prediction.
In European Conference on Computer Vision, 2010.

[61] S. Zanetti, L. Zelnik-Manor, and P. Perona. A walk through the web's video clips. In IEEE Workshop on Internet Vision, associated with CVPR, 2008.

[62] H. Zhong, J. Shi, and M. Visontai. Detecting unusual activity in video. In IEEE Conference on Computer Vision and Pattern Recognition, 2004.