Labeling and modeling large databases of videos

by Jenny Yuen

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology, February 2012

© 2012 Massachusetts Institute of Technology. All Rights Reserved.

Author: Department of Electrical Engineering and Computer Science, December 2, 2011
Certified by: Antonio Torralba, Associate Professor of Electrical Engineering and Computer Science, Thesis Supervisor
Accepted by: Leslie A. Kolodziejski, Chair, Department Committee on Graduate Students

Dedicated to my family

Labeling and modeling large databases of videos
by Jenny Yuen

Submitted to the Department of Electrical Engineering and Computer Science on December 2, 2011, in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical Engineering and Computer Science

Abstract

As humans, we can say many things about the scenes surrounding us. For instance, we can tell what type of scene and location an image depicts, describe what objects live in it, their material properties, or their spatial arrangement. These comprise descriptions of a scene and are major areas of study in computer vision. This thesis, however, hypothesizes that observers have an inherent prior knowledge that can be applied to the scene at hand. This prior knowledge can be translated into the cognizance of which objects move, or of the trajectories and velocities to expect. Conversely, when faced with unusual events such as car accidents, humans are very well tuned to identify them regardless of having observed the scene a priori. This is, in part, due to prior observations that we have for scenes with similar configurations to the current one. This thesis emulates the prior knowledge base of humans by creating a large and heterogeneous database and annotation tool for videos depicting real-world scenes.

The first application of this thesis is in the area of unusual event detection. Given a short clip, the task is to identify the moving portions of the scene that depict abnormal events. We adopt a data-driven framework powered by scene matching techniques to retrieve the videos nearest to the query clip and integrate the motion information in the nearest videos. The result is a final clip with localized annotations for unusual activity. The second application lies in the area of event prediction. Given a static image, we adapt our framework to compile a prediction of motions to expect in the image. This result is crafted by integrating the knowledge of videos depicting scenes similar to the query image. With the help of scene matching, only scenes relevant to the queries are considered, resulting in reliable predictions. Our dataset, experimentation, and proposed model introduce and explore a new facet of scene understanding in images and videos.

Thesis Supervisor: Antonio Torralba
Title: Associate Professor of Electrical Engineering and Computer Science

Acknowledgments

First and foremost, I thank Antonio Torralba, my advisor, for welcoming me into his newly formed group more than four years ago and teaching me how to do research: in particular, to think outside the box and to not be afraid of breaking new ground, formulating new problems, and crafting novel solutions for them.
I thank him for the countless hours spent not only on the big picture of problems, but also at the implementation level. For his perseverance and passion for research, the 3 am e-mail brainstorming sessions, and for being the largest, most reliable data contributor and best beta tester for the LabelMe video project. Many thanks also go to my thesis committee: Fredo Durand, Alyosha Efros, Bill Freeman, and Antonio, for their support, valuable comments, and feedback for this thesis.

Over the last five and a half years, I have also been very fortunate to collaborate with brilliant researchers: Daniel Goldman, Ce Liu, Yasuyuki Matsushita, Bryan Russell, Josef Sivic, and Larry Zitnick. Each one of them played a key role in developing raw ideas into interesting results, and for that I give many thanks.

The computer vision group at MIT has been a second home to me. Eric Grimson and the APP group welcomed me to MIT in my first year. I have been fortunate to share offices with amazing people: Gerald Dalley, Biz Bose, Xiaogang Wang, Wanmei Ou, Thomas Yeo, Joseph Lim, Jianxiong Xiao, and Aditya Khosla. Thank you for the wonderful times, the great conversations, and for being such an integral part of my graduate school experience. Special thanks to Joseph Lim for helping me decipher object detector libraries, to Tomasz Malisiewicz for proposing the Exemplar SVM model and helping me adapt it to my work, and to Sylvain Paris for always being available to discuss ideas and help polish submissions. To Biliana Kaneva, Michael Bernstein, and Alvin Raj for their friendship and the late nights solving problem sets during our first year at MIT.

Last but certainly not least, I thank my family for their enormous support over the last years. This thesis is dedicated to my father, who seeded in me the thirst for learning and thorough understanding; to my mother, whose dedication and stamina taught me to never give up; to my brother Hector, who every day shows me how to stay calm and in control with humor; to my brother Joel, who has taught me math and science ever since I can remember; and finally to Justin, who listened to every single practice talk I have given throughout graduate school numerous times, walked with me through cities capturing data, and was always there for me. Thanks for everything.

Contents

Acknowledgments
List of Figures
1 Introduction
  1.1 Overview of techniques and contributions
    1.1.1 Chapter 2: LabelMe video: Building a video database with human annotations
    1.1.2 Chapter 3: A data-driven approach for unusual event detection
    1.1.3 Chapter 4: Car trajectory prediction from a single image
  1.2 Other work not included in this thesis
  1.3 Notes
2 LabelMe video: Building a video database with human annotations
  2.1 Introduction
  2.2 Related Work
  2.3 Online video annotation tool
    2.3.1 Object Annotation
    2.3.2 Event Annotation
    2.3.3 Stabilizing Annotations
    2.3.4 Annotation interpolation
  2.4 Data set statistics
  2.5 Beyond User Annotations
    2.5.1 Occlusion handling and depth ordering
    2.5.2 Cause-effect relations within moving objects
  2.6 Discussion
  2.7 Conclusion
3 A data-driven approach for unusual event detection
  3.1 Introduction
  3.2 Related Work
  3.3 Scene-based video retrieval
  3.4 Video event representation
    3.4.1 Recovering trajectories
    3.4.2 Clustering trajectories
    3.4.3 Comparing track clusters
  3.5 Video database and ground truth
  3.6 Experiments and Applications
    3.6.1 Localized motion prediction
    3.6.2 Event prediction from a single image
    3.6.3 Anomaly detection
  3.7 Discussion and concluding remarks
4 Car trajectory prediction from a single image
  4.1 Introduction
  4.2 Related Work
  4.3 Object and trajectory model
    4.3.1 From 2D to 3D
  4.4 From trajectories to action discovery
    4.4.1 Car trajectory prediction from a single image
  4.5 Experimental evaluation
  4.6 Discussion and concluding remarks
5 Conclusion
  5.1 Contributions
Bibliography

List of Figures

1.1 What is occurring in this image? Can you identify the objects that move? What actions are they performing? While humans are surprisingly good at this ill-posed problem, it is not the case for computers. (Image from Nikoartwork.com)
1.2 Sample video frames with ground truth annotations overlaid. LabelMe video provides a way to create ground truth annotations for objects in a wide variety of scenes.
1.3 What do these images have in common? They depict objects moving towards the right. These images do not contain motion cues such as temporal information or motion blur. The implied motion is known because we can recognize the image content and make reliable predictions of what would occur if these were movies playing, based on prior experiences. This, at the same time, allows us to be very finely tuned at identifying events that do not align with our prior information.
1.4 Object trajectory prediction from a static image. Based solely on the appearance of a detected object (in yellow) and the horizon line in the scene, our algorithm can determine a plausible trajectory for the selected object (red) by leveraging the information from a database of annotated moving objects.
2.1 Object annotation. Users annotate moving or static objects in a video by outlining their shape with a polygon and describing their actions.
2.2 Event annotation. Simple and complex events can be annotated by entering free-form sentences and linking them to existing labeled objects in the video.
2.3 Interpolation comparison between constant 2D motion (red) and constant 3D motion (green). a) Two polygons from different frames and their vanishing point. b) Interpolation of an intermediate frame, and c) interpolation of the polygon centers for multiple frames between the two reference frames.
2.4 Examples of annotations. Our interpolation framework is based on the heuristic that objects often move with constant velocity and follow straight trajectories. Our system can propagate annotations of rigid (or semi-rigid) objects such as cars, motorbikes, fish, cups, etc. across different frames in a video automatically, aiming for minimal user intervention. Annotation of non-rigid objects (e.g., humans), while possible with the tool (but requiring more editing), remains a more challenging task than for rigid objects. Presently, users can opt instead to draw bounding boxes around non-rigid entities like people.
2.5 Distribution of labels in the data set. The vertical axis indicates the log frequency of the object/action instances in the database while the horizontal axis indicates the rank of the class (the classes are sorted by frequency). As we aimed to capture videos from a variety of common scenes and events in the real world, these distributions are similar to natural word frequencies described by Zipf's law [47].
2.6 Occlusion relationships and depth estimation. A sample video frame (a), the propagated polygons created with our annotation tool (b), the polygons ordered using the LabelMe heuristic for occlusion relationships (c) (notice how, in the top figure, the group of people standing far away from the camera is mistakenly ordered as closer than the man pushing the stroller, and in the bottom figure there is a mistake in the ordering of the cars), and the ordered polygons inferred using 3D relationship heuristics (d) (notice how the mistakes in (c) are fixed).
3.1 Track clustering. Sample frames from the video sequence (a). The ground truth annotations denoted by polygons surrounding moving objects (b) can be used to generate ground truth labels for the tracked points in the video (c). Our track distance affinity function is used to automatically cluster tracks into groups and generates fairly reasonable clusters, where each roughly corresponds to an independent object in the scene (d). The track cluster visualizations in (c) and (d) show the first frame of each video and the spatial location of all tracked points for the duration of the clip, color-coded by the track cluster that each point corresponds to.
3.2 Unusual videos. We define an unusual or anomalous event as one that is not likely to happen in our training data set. However, we ensured that they belong to scene classes present in our video corpus.
3.3 Localized motion prediction (a) and unusual event detection (b). The algorithm was compared against two scene matching methods (GIST and dense SIFT) as well as a baseline supported by random nearest neighbors. Retrieving videos similar to the query image improves the classification rate.
3.4 Event prediction. Each row shows a static image with its corresponding event predictions. For each query image, we retrieve its nearest video clips using scene matching. The events belonging to the nearest neighbors are resized to match the dimensions of the query image and are further clustered to create different event predictions. For example, in a hallway scene, the system predicts motions of different people; in street scenes, it predicts cars moving along the road, etc.
3.5 Track cluster retrieval for common events. A frame from a query video (a), the tracks corresponding to one event in the video (b), the localized motion prediction map (c) generated after integrating the track information of the nearest neighbors (some examples shown in d), and the average image of the retrieved nearest neighbors (e). Notice the definition of high-probability motion regions in (c) and how its shape roughly matches the scene geometry in (a). The maps in (c) were generated with no motion information originating from the query videos.
3.6 Track cluster retrieval for unusual events (left) and scenes with fewer samples in our data set. When presented with unusual events such as a car crashing into the camera or a person jumping over a car while in motion (left and middle columns; key frames can be seen in fig. 3.7), our system is able to flag these as unusual events (b) due to their disparity with respect to the events taking place in the nearest neighbor videos. Notice the supporting neighbors belong to the same scene class as the query and the motion map predicts movements mostly in the car regions. However, our system fails when an image does not have enough representation in the database (right).
3.7 Unusual event detection. Videos of a person jumping over a car and running across it (left) and a car crashing into the camera (right). Our system outputs anomaly scores for individual events. Common events are shown in yellow and unusual ones in red. The thickness and saturation of the red tracks is proportional to the degree of anomaly.
4.1 Estimated average height and speed from annotations in the LabelMe video dataset. With the approach first introduced by Hoiem et al., we can estimate the average height of objects in the data set. From video, we additionally estimate average velocities.
4.2 Visualization of the trajectory feature. The ground plane is divided radially into 8 equally sized regions. Each trajectory is translated to the world center and described by the normalized count of bounding boxes landing in each region.
4.3 Discovered motion clusters for each object class. Object trajectories for each class are normalized and transferred to a common point in world coordinates. The trajectories are further clustered and each cluster is visualized as an energy map summarizing all of the trajectories belonging to the cluster. The trajectories have been translated, re-projected, and resized to fit the displayed image crops.
4.4 Image containing the top detection using the LDPM car detector (in blue, right) and the top exemplar SVM detection trained on a single exemplar (left). The LDPM detector is trained on many instances of cars varying in shape and pose. The non-maximum-suppression phase rules out overlapping detections and scores the blue detection as the highest. The eSVM, trained on the single positive instance (left), identifies the window that best matches the hatchback template from the query (in this case the side of the cab, excluding its hood). Our approach aims at detecting complete cars using the LDPM detector and filtering these detections using an eSVM detector trained on the query crop. We compare the bounding box intersection between the eSVM detection and the DPM one and discard detections that do not overlap by more than 70%.
4.5 Top candidate detections for trajectory transfer. We detect cars in the 200 nearest video frames. The naive approach of considering only the gist distance between scenes results in very few reliable detections amongst the top scenes (a). Ordering detections by the LDPM detection score gives a higher ratio of reliable car detections (b); however, there is no guarantee that all detections will contain the same pose as that of the query. An exemplar SVM approach focuses the search on windows similar to the query (c); however, this approach sometimes fires on only portions of entire cars (see yellow boxes). Finally, our approach integrates scene (gist) similarity, bounding box intersection with the query detection, and the LDPM and eSVM scores (d).
4.6 Predictions from a single image. (a) For each object, we can predict different trajectories even from the same action/trajectory group. (b) Other example predictions; note the diversity in locations and sizes of objects and how their predictions match the motion implied by their appearance. (c) Failure cases can take place when the appearance of the object is not correctly matched to the implied family of actions, when the horizon line is not correctly estimated (or is not horizontal), or if there are obstacles in the scene that interfere with the predicted trajectory.
4.7 User study scenarios. The prediction evaluation is presented in a synthetic world (1) and in the original image where the object resides (2). In the synthetic scenario, the user is asked to determine the quality of the prediction based solely on the pose, without considering the semantics of the scene, whereas in the original scene the user is asked to judge the 3D trajectory taking the scene elements into consideration (e.g., cars should move on the road and not on sidewalks or through obstacles).
4.8 User study results. A set of 30 reliable car detections comprises our test set. Our algorithm was configured to output 5 predictions per example. Subjects were asked to score the predictions as very likely, unlikely but possible, impossible, or cannot tell. Each bar represents the percentage of objects where at least x (out of the total 5) predictions are likely. Our algorithm is evaluated under a synthetic background and a real one (blue and red bars).

Chapter 1
Introduction
Consider the image in figure 1.1. What can you say about it? What objects are in it? What actions were the objects in the scene performing before and after the picture was taken? These are questions that humans can easily answer but are extremely difficult for an artificial system. As humans, we can quickly reason about the scene without having been to the depicted location or interacted with the object instances depicted in the scene. More interestingly, we are able to make reliable predictions for the objects in the scene. In other words, we can animate the scene in our minds and picture the person walking on the sidewalk, the biker following their lane on the road, and the car stopping at the intersection.

Imagine the situation where we want to create a robot capable of navigating real-world environments, with crowds of people walking and vehicles on the road moving at high speeds. The robot requires foresight capabilities to navigate these highly dynamic environments given only prior information, which might consist of a few seconds, or even just a few snapshots, of the scene.

How is it that humans can make reliable predictions given (in the extreme) only one snapshot of our surroundings, or even just an abstract drawing? How are we so certain that the car in the picture is moving on the road, most likely to the left of the scene? Or that the person is going to cross the street even though they have not yet set foot on the crosswalk? One hypothesis lies in the large volumes of training data that we feed to our memory throughout the years of repeated interactions in this world. This information is clearly transferable in that we can innately walk or drive in a different road or city for the first time and adapt prior knowledge acquired and reinforced by experiences in many other roads, vehicles, office spaces, etc. We will refer to this capacity for foresight as event prediction.

Figure 1.1: What is occurring in this image? Can you identify the objects that move? What actions are they performing? While humans are surprisingly good at this ill-posed problem, it is not the case for computers. (Image from Nikoartwork.com)

Event prediction is challenging in various aspects. The first aspect involves the intricacies of efficient data acquisition and ground truth annotation. The second challenge lies in the high variability of the data to learn from. Consider, even in the simplest case, learning for only a single street scene. Since images are 2D projections of the 3D world, we would get different images for the same scene depending on the location, viewpoint, and, in general, the setup of the camera. Compound this variability with different objects, their configurations, and different scene locations. Clearly, we need a compact, flexible, and comprehensive representation to cover as many environments and configurations as possible. The last challenge lies in the inherently high dimensionality of videos. With a large video corpus to train from, it is very important to represent data compactly and flexibly.

In summary, this thesis will leverage the information in video databases to power methods for event prediction and unusual event detection. We introduce (1) a real-world video database and a tool to annotate objects and events in it, (2) a method that integrates the raw information in this video corpus and helps identify unusual events in previously unseen videos, and (3) a framework for event prediction from a single image, powered by user-generated annotations in the training video corpus.
1.1 Overview of techniques and contributions

This section provides a preview of techniques and contributions.

1.1.1 Chapter 2: LabelMe video: Building a video database with human annotations

With the wide availability of consumer cameras, larger volumes of video are captured every day through amateurs, professionals, surveillance systems, etc. As reported by Youtube.com, users are uploading hundreds of thousands of videos daily; every minute, 24 hours of video is uploaded to Youtube. However, current video analysis algorithms suffer from a lack of information regarding the objects present and their interactions, as well as from missing comprehensive annotated video databases for benchmarking. We designed an online and openly accessible video annotation system that allows anyone with a browser and internet access to efficiently annotate object category, shape, motion, and activity information in real-world videos. The annotations are also complemented with knowledge from static image databases to infer occlusion and depth information. Using this system, we have built a scalable video database composed of diverse video samples and paired with human-guided annotations.

Figure 1.2: Sample video frames with ground truth annotations overlaid. LabelMe video provides a way to create ground truth annotations for objects in a wide variety of scenes.

1.1.2 Chapter 3: A data-driven approach for unusual event detection

When a human observes a short video clip, it is easy to decide if the event taking place is normal or unexpected, even if the video depicts a new place not familiar to the viewer. This is in contrast with work in surveillance and outlier event detection, where models rely on thousands of hours of video recorded at a single place in order to identify what constitutes an unusual event. In this work we present a simple method to identify videos with unusual events in a large collection of short video clips. The algorithm is inspired by recent approaches in computer vision that rely on large databases. We show how, relying on large collections of videos, we can retrieve other videos similar to the query to build a simple model of the distribution of expected motions for the query. Then, the model can evaluate how unusual the video is, as well as make predictions. We show how a very simple retrieval model is able to provide reliable results.

Figure 1.3: What do these images have in common? They depict objects moving towards the right. These images do not contain motion cues such as temporal information or motion blur. The implied motion is known because we can recognize the image content and make reliable predictions of what would occur if these were movies playing, based on prior experiences. This, at the same time, allows us to be very finely tuned at identifying events that do not align with our prior information.

1.1.3 Chapter 4: Car trajectory prediction from a single image

Given a single static picture, humans can interpret not just the instantaneous content captured by the image, but also infer the chain of dynamic events that are currently happening or that are likely to happen in the near future. Image understanding not only consists of parsing what is in our surroundings, but also of determining what is likely to happen in the future.
In this chapter, we propose a system that, given a static outdoor urban image, predicts potential trajectories for cars for the next few seconds. This work leverages the information in a database of annotated videos captured at different locations by different users. The core component lies in the video data, which is modeled as dynamic projections of 3D objects into a 2D plane. Our experiments show how this method is more descriptive and reliable at generating plausible object trajectory predictions.

Figure 1.4: Object trajectory prediction from a static image. Based solely on the appearance of a detected object (in yellow) and the horizon line in the scene, our algorithm can determine a plausible trajectory for the selected object (red) by leveraging the information from a database of annotated moving objects.

1.2 Other work not included in this thesis

During my PhD studies, I also had the privilege to work on and contribute to other disciplines. I worked with Dr. Ce Liu, Dr. Josef Sivic, and Professors Antonio Torralba and William Freeman on SIFT flow, an algorithm for generating pixel-wise dense correspondences across scenes. Additionally, I worked with Prof. Antonio Torralba and Dr. Ce Liu on using SIFT flow for object recognition via label transfer. Finally, through an internship with Microsoft Research, I had the opportunity to work with Dr. Lawrence Zitnick, Dr. Ce Liu, and Prof. Antonio Torralba on a paper titled "Maximum entropy framework for encoding object-level image priors".

1.3 Notes

Parts of this thesis have been published at the International Conference on Computer Vision (ICCV 2009) and the European Conference on Computer Vision (ECCV 2010). This work was supported by a National Defense Science and Engineering Graduate Fellowship and a National Science Foundation Graduate Fellowship.

Chapter 2
LabelMe video: Building a video database with human annotations

2.1 Introduction

Video processing and understanding are very important problems in computer vision. Researchers have studied motion estimation and object tracking to analyze temporal correspondences of pixels or objects across frames. The motion information of a static scene with a moving camera can further help to infer the 3D geometry of the scene. In some video segmentation approaches, pixels that move together are grouped into layers or objects. Higher-level information, such as object identities, events, and activities, has also been widely used for video retrieval, surveillance, and advanced video editing.

Despite the advancements achieved with these various approaches, we observe that their commonality lies in that they are built in a bottom-up direction: image features and pixel-wise flow vectors are typically the first things analyzed to build these video processing systems. Little account has been taken of prior knowledge about motion, location, and appearance at the object and object-interaction levels in real-world videos. Moreover, video analysis algorithms are often designed and tested on different sets of data and sometimes suffer from having a limited number of samples. Consequently, it is hard to evaluate and further improve these algorithms on a common ground.

We believe in the importance of creating a large and comprehensive video database with rich human annotations.
We can utilize this annotated database to obtain video priors at the object level to facilitate advanced video analysis algorithms. This database can also provide a platform to benchmark object tracking, video segmentation, object recognition, and activity analysis algorithms. Although there have been several annotated video databases in the literature [25, 29, 45], our objective is to build one that will scale in quantity, variety, and quality like the currently available benchmark databases for static images [39, 52]. Therefore, our criteria for designing this annotated video database are diversity, accuracy, and openness. We want to collect a large and diverse database of videos that span many different scene, object, and action categories, and to accurately label the identity and location of objects and actions. Furthermore, we wish to allow open and easy access to the data without copyright restrictions. Note that this last point differentiates us from the Lotus Hill database [58], which has similar goals but is not freely available.

However, it is not easy to obtain such an annotated video database. In particular, challenges arise when collecting a large amount of video data free of copyright; it is also difficult to make temporally consistent annotations across frames with little human interaction. Accurately annotating objects using layers and their associated motions [29] can also be tedious. Finally, including advanced tracking and motion analysis algorithms may prevent users from interacting with videos in real time.

Inspired by the recent success of online image annotation applications such as LabelMe [39] and Mechanical Turk [46], and labeling games such as the ESP game [52] and Peekaboom [53], we developed an online video annotation system enabling internet users to upload and label video data with minimal effort. Since tracking algorithms are too expensive for efficient use in client-side software, we use a homography-preserving shape interpolation to propagate annotations temporally, with the aid of global motion estimation. Using our online video annotation tool, we have annotated 238 object classes and 70 action classes in 1903 video sequences. As this online video annotation system allows internet users to interact with videos, we expect the database to grow rapidly after the tool is released to the public.

Using the annotated video database, we are able to obtain statistics of moving objects and information regarding their interactions. In particular, we explored motion statistics for each object class and cause-effect relationships between moving objects. We also generated coarse depth information and video pop-ups by combining our database with a thoroughly labeled image database [39]. These preliminary results suggest potential in a wide variety of applications for the computer vision community.

2.2 Related Work

There has been a variety of recent work and considerable progress on scene understanding and object recognition. One component critical to the success of this task is the collection and use of large, high-quality image databases with ground truth annotations spanning many different scene and object classes [8, 10, 37, 39, 41, 50]. Annotations may provide information about the depicted scene and objects, along with their spatial extent. Such databases are useful for training and validating recognition algorithms, in addition to being useful for a variety of tasks [15].
Similar databases would be useful for recognition of scenes, objects, and actions in videos, although it is nontrivial to collect such databases. A number of prior works have looked at collecting such data. For example, surveillance videos have offered an abundance of data, resulting in a wide body of interesting work in tracking and activity recognition. However, these videos primarily depict a single static scene with a limited number of object and action semantic classes. Furthermore, there is little ground truth annotation indicating objects and actions, and their extent.

Efforts have taken place to collect annotated video databases with a more diverse set of action classes. The KTH database, which depicts close-up views of a number of human action classes performed at different viewpoints, has been widely used as a benchmark [40]. A similar database was collected containing various sports actions [5]. While these databases offer a richer vocabulary of actions, the number of object and action classes and examples is still small.

There has also been recent work to scale up video databases to contain a larger number of examples. The TRECVID [45] project contains many hours of television programs and is a widely used benchmark in the information retrieval community. This database provides tags of scenes, objects, and actions, which are used for training and validation of retrieval tasks. Another example is the database in [24], later extended in [25], which was collected from Hollywood movies. This database contains up to hundreds of examples per action class, with some actions being quite subtle (e.g., drinking and smoking). However, there is little annotation of objects and their spatial extent, and the distribution of the data is troublesome due to copyright issues.

In summary, current video databases do not meet the requirements for exploring the priors of objects and activities in video at a large scale or for benchmarking video processing algorithms on a common ground. In this chapter we introduce a tool to create a video database composed of a diverse collection of real-world scenes, containing accurately labeled objects and events, open to download and growth.

2.3 Online video annotation tool

We aim to create an open database of videos where users can upload, annotate, and download content efficiently. Some desired features include speed, responsiveness, and intuitiveness. In addition, we wish to handle system failures such as those related to camera tracking, interpolation, etc., so as not to dramatically hinder the user experience. The consideration of these features is vital to the development of our system as they constrain the computer vision techniques that can be feasibly used.

Figure 2.1: Object annotation. Users annotate moving or static objects in a video by outlining their shape with a polygon and describing their actions.

We wish to allow multi-platform accessibility and easy access from virtually any computer. Therefore, we have chosen to deploy an online service in the spirit of image annotation tools such as LabelMe [39], the ESP game [52], and Mechanical Turk-based applications [46]. This section will describe the design and implementation choices, as well as challenges, involved in developing a workflow for annotating objects and events in videos.
2.3.1 Object Annotation

We designed a drawing tool similar to the one for annotating static images in LabelMe [39]. In our case, an annotation consists of a segmentation (represented by a polygon), information about the object, and its motion. The user begins the annotation process by clicking control points along the boundary of an object to form a polygon. When the polygon is closed, the user is prompted for the name of the object and information about its motion. The user may indicate whether the object is static or moving and describe the action it is performing, if any. The entered information is recorded on the server and the polygon is propagated across all frames in the video as if it were static and present at all times throughout the sequence. The user can further navigate across the video using the video controls to inspect and edit the polygons propagated across the different frames.

To correctly annotate moving objects, our tool allows the user to edit key frames in the sequence. Specifically, the tool allows selection, translation, resizing, and editing of polygons at any frame to adjust the annotation based on the new location and form of the object. Upon finishing, the web client uses the manually edited keyframes to interpolate or extrapolate the position and shape of the object at the missing locations (Section 2.3.4 describes how annotations are interpolated). Figure 2.1 shows a screenshot of our tool and illustrates some of the key features of the described system.

2.3.2 Event Annotation

The second feature is designed for annotating more complex events where one or more nouns interact with each other. To enter an event, the user clicks on the Add Event button, which prompts a panel where the user is asked for a sentence description of the event (e.g., the dog is chewing a bone). The event annotation tool renders a button for each token in the sentence, which the user can click on and link with one or more polygons in the video. Finally, the user is asked to specify the time when the described event occurs using a time slider. Once the event is annotated, the user can browse through objects and events to visualize the annotation details. Figure 2.2 illustrates this feature.

Figure 2.2: Event annotation. Simple and complex events can be annotated by entering free-form sentences and linking them to existing labeled objects in the video.

2.3.3 Stabilizing Annotations

As video cameras become ubiquitous, we expect most of the content uploaded to our system to be captured from handheld recorders. Handheld-captured videos contain some degree of ego-motion, even with shake correction features enabled in cameras. Due to the camera motion, the annotation of static objects can become tedious, as a simple cloning of polygon locations across time might produce misaligned polygons. One way to correct this problem is to compute the global motion between two consecutive frames to stabilize the sequence. Some drawbacks of this approach include the introduction of missing pixel patches due to image warping (especially visible with large camera movements), as well as potential camera tracking errors that result in visually unpleasant artifacts in the sequences.
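Estimating this frame-to-frame global motion amounts to fitting a homography to tracked feature points. The following Python sketch, using OpenCV, is only an illustration of that step under assumed parameter values; it is not the pre-processing code used by the system, and the function names are hypothetical.

import cv2
import numpy as np

def estimate_homography(prev_gray, curr_gray):
    """Estimate the global motion between two consecutive grayscale frames
    as a 3x3 homography, from tracked corners and RANSAC."""
    # Detect corners in the previous frame.
    prev_pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                                       qualityLevel=0.01, minDistance=8)
    # Track those corners into the current frame with pyramidal Lucas-Kanade.
    curr_pts, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray,
                                                      prev_pts, None)
    good = status.ravel() == 1
    # Robustly fit a homography to the surviving correspondences.
    H, _mask = cv2.findHomography(prev_pts[good], curr_pts[good],
                                  cv2.RANSAC, 3.0)
    return H

def propagate_polygon(polygon, H):
    """Warp an (N, 2) array of polygon control points with the estimated
    homography, e.g., to carry a static annotation into the next frame."""
    pts = polygon.reshape(-1, 1, 2).astype(np.float32)
    return cv2.perspectiveTransform(pts, H).reshape(-1, 2)

Because each homography is just a handful of numbers per frame, such estimates can be precomputed offline and shipped to a browser client cheaply, which is the design choice described next.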
Our approach consists of estimating camera motion as a homographic transformation between each pair of consecutive frames during an offline pre-processing stage. The camera motion parameters are encoded, saved on our servers, and downloaded by the web client when the user loads a video to annotate. When the user finishes outlining an object, the web client software propagates the location of the polygon across the video by taking into account the camera parameters. Therefore, if the object is static, the annotation will move together with the camera and not require further correction from the user. In this setup, even with failures in the camera tracking, we observe that the user can correct the annotation of the polygon and continue annotating without generating uncorrectable artifacts in the video or in the final annotation.

2.3.4 Annotation interpolation

To fill in the missing polygons between keyframes, we have chosen to use interpolation techniques. These methods are computationally lightweight for our web client, easy to implement, and still produce very compelling results.

An initial interpolation algorithm assumes that the control points outlining an object are transformed by a 2D homography plus a residual term:

p_1 = SR p_0 + T + r    (2.1)

where p_0 and p_1 are vectors containing the 2D coordinates of the control points for the annotation of some object at two user-annotated key frames, say t = 0 and t = 1 respectively; S, R, and T are scaling, rotation, and translation matrices encoding the homographic projection from p_0 to p_1 that minimizes the residual term r. A polygon at frame t ∈ [0, 1] can then be linearly interpolated in 2D as:

p_t = [SR]^t p_0 + t [T + r]    (2.2)

Once a user starts creating key frames, our tool interpolates the location of the control points for the frames in between two key frames, or linearly extrapolates the control points in the case of a missing key frame at either temporal extreme of the sequence. Figure 2.4 shows interpolation examples for some object annotations and illustrates how, with relatively few user edits, our tool can annotate several objects common in the real world such as cars, boats, and pedestrians.

3D Linear Motion Prior

So far, we have assumed that the motion between two frames is linear in 2D (equation 2.2). However, in many real videos, objects do not always move parallel to the image plane, but move in a 3D space. As a result, the user must make corrections in the annotation to compensate for foreshortening effects during motion. We can further assist the user by making simple assumptions about the most likely motions between two annotated frames [1].

Figure 2.3(a) shows an example of two polygons corresponding to the annotation of a car at two distinct times within the video. The car has moved from one frame to another and has changed in location and size. The scaling is a cue to 3D motion. Therefore, instead of assuming a constant velocity in 2D, it would be more accurate to assume constant velocity in 3D in order to interpolate intermediate frames. Interestingly, this interpolation can be done without knowing the camera parameters.

We start by assuming that a given point on the object moves in a straight line in the 3D world. The motion of point X(t) at time t in 3D can be written as X(t) = X_0 + λ(t)D, where X_0 is an initial point, D is the 3D direction, and λ(t) is the displacement along the direction vector. Here, we assume that the points X = (X, Y, Z, 1) live in projective space. For the camera, we assume perspective projection and that the camera is stationary. Therefore, the intrinsic and extrinsic parameters of the camera can be expressed as a 3 × 4 matrix P.
Here, we assume that the points X = (X,Y,Z, 1) live in projective space. For the camera, we assume perspective projection and that the camera is sta- tionary. Therefore, the intrinsic and extrinsic parameters of the camera can be ex- pressed as a 3 x 4 matrix P. Points on the line are projected to the image plane 38 CHAPTER 2. LABELME VIDEO: BUILDING A VIDEO DATABASE WITH HUMAN ANNOTATIONS as x(t) = PX(t) = xo + X (t)xy, where xo = PXo and x, = PD. Using the fact that x, = lim,()-oxo + A(t)xy, we see that xy is the vanishing point of the line. More explicitly, the image coordinates for points on the object can be written as: (x)+.(t),t =O , X(2.3) Furthermore, we assume that the point moves with constant velocity. Therefore, X (t) = vt, where v is a scalar denoting the velocity of the point along the line. Given a corresponding second point x(i) (iJ, 1) along the path projected into another frame, we can recover the velocity as v = In summary, to find the image coordinates for points on the object at any time, we simply need to know the coordinates of a point at two different times. To recover the vanishing points for the control points belonging to a polygon, we assume that all of the points move in parallel lines (in 3D space) toward the same vanishing point. With this assumption, we estimate the vanishing point from polygons in two key frames by intersecting lines passing through two point correspondences, and taking the median of these intersections points as illustrated in Figure 2.3(a). Note that not all points need to move at the same velocity. Figure 2.3(b) compares the result of interpolating a frame using constant 2D velocity interpolation versus using constant 3D velocity interpolation. The validity of the interpolation depends on the statistics of typical 3D motions that objects undergo. We evaluated the error the interpolation method by using two annotated frames to predict intermediate annotated frames (users have introduced the true location of the intermediate frame). We compare against our baseline approach that uses a 2D linear interpolation. Table I shows that, for several of the tested objects, the pixel error is reduced by more than 50% when using the constant 3D velocity assumption. t=o a) - Annotated ' - Constant 2D velocity interpolation - - - - - Constant 3D velocity interpolation b) )t= C) t=0.41 t=0.66 t=0.73 Figure 2.3: Interpolation comparison between constant 2D motion (red) and constant 3D motion (green). a) Two polygons from different frames and their vanishing point. b) Interpo- lation of an intermediate frame, and c) interpolation of the polygon centers for multiple frames between the two reference frames. t=1 t=0.90 t=1 Figure 2.4: Examples of annotations. Our interpolation framework is based on the heuristic that objects often move with constant velocity and follow straight trajectories. Our system can propagate annotations of rigid (or semni-rigid) objects such as cars, motorbikes, fish, cups, etc. across different frames in a video automatically aiming for minimal user intervention. Annotation of non-rigid objects (e.g.humans), while possible by the tool (but requiring more editing), remains a more challenging task than the one for rigid objects. Presently, users can opt to, instead, draw bounding boxes around non-rigid entities like people. Sec. 2.4. Data set statistics 41 Table 2.1: Interpolation evaluation. Pixel error per object class. 
Table 2.1: Interpolation evaluation. Pixel error per object class.

Object       Linear in 2D   Linear in 3D   # test samples
car              36.1           18.6             21
motorbike        34.6           14.7             11
person           15.5            8.6             35

2.4 Data set statistics

We intend to grow the video annotation database with contributions from Internet users. As an initial contribution, we have provided and annotated a first set of videos. These videos were captured at a diverse set of geographical locations, which includes both indoor and outdoor scenes. Currently, the database contains a total of 1903 annotations, 238 object classes, and 70 action classes. The statistics of the annotations for each object category are listed in Table 2.2. We found that the frequency of any category is inversely proportional to its rank in the frequency table (following Zipf's law [47]), as illustrated in Figure 2.5. This figure describes the frequency distribution of the objects in our video database by plotting the number of annotated instances for each class against the object rank (object names are sorted by their frequency in the database). The graph includes plots for static and moving objects, and action descriptions. For comparison, we also show the curve of the annotation of static images [39]. The most frequently annotated static objects in the video database are buildings (13%), windows (6%), and doors (6%). In the case of moving objects the order is persons (33%), cars (17%), and hands (7%). The most common actions are moving forward (31%), walking (8%), and swimming (3%).

Table 2.2: Object and action statistics. Number of instances per object/action class in our current database.

Static object   #     Moving object   #     Action             #
building       183    person         187    moving forward    124
window          86    car            100    walking           136
door            83    hand            40    swimming           13
sidewalk        77    motorbike       16    waving             13
tree            76    water           14    riding motorbike    8
road            71    bag             13    flowing             7
sky             68    knife           12    opening             5
car             65    purse           11    floating            4
street lamp     34    tree            11    eating              3
wall            31    door             9    flying              3
motorbike       25    blue fish        7    holding knife       3
pole            24    bycicle          7    riding bike         3
column          20    carrot           7    running             3
person          20    flag             7    standing            3
balcony         18    stroller         7    stopping            3
sign            18    dog              6    turning             3
floor           13    faucet           6    being peeled        2

Figure 2.5: Distribution of labels in the data set. The vertical axis indicates the log frequency of the object/action instances in the database while the horizontal axis indicates the rank of the class (the classes are sorted by frequency). As we aimed to capture videos from a variety of common scenes and events in the real world, these distributions are similar to natural word frequencies described by Zipf's law [47].

2.5 Beyond User Annotations

Once a set of videos is annotated, we can infer other properties not explicitly provided by the users in the annotation. For example: How do objects occlude each other? Which objects in our surroundings move and what are their common motions like? Which objects move autonomously and make others move? In this section, we will demonstrate how to use the user annotations to infer extra information as well as relationships between moving objects.

Large databases of annotated static images are currently available, and labeling single images appears easier since no temporal consistency needs to be taken into account. We expect that there would be more annotated images than annotated videos at the same level of annotation accuracy.
Therefore, it can be a useful strategy to propagate information from static image databases onto our video database.

Table 2.3: Motion statistics. We can compute the probability that an object of a certain class moves based on the observations in our annotated dataset. Due to the state of the database, there are some cases where there were no annotated static instances, resulting in probabilities equaling 1. We expect this to change as the database grows.

Object       Motion probability     Object          Motion probability
hand              1.00              building              0
purse             1.00              column                0
bag               0.92              floor                 0
person            0.90              grass                 0
water             0.74              plant                 0
knife             0.67              pole                  0
car               0.61              road                  0
bycicle           0.50              sidewalk              0
boat              0.45              sign                  0
motorbike         0.39              sky                   0
tree              0.13              street lamp           0
door              0.10              traffic light         0
awning            0                 wall                  0
balcony           0                 window                0

2.5.1 Occlusion handling and depth ordering

Occlusion and depth ordering are important relationships between objects in a scene. We need to sort objects with respect to their depth to infer the visibility of each polygon at each frame. Depth ordering can be provided by the user [29], but doing so makes the annotation tool tedious to use for naive users because the depth ordering of the objects may change during the video sequence. In fact, occlusion information can be recovered by post-processing the annotated data.

One possibility is to model the appearance of the object and, wherever there is overlap with another object, infer which object owns the visible part based on matching appearance. Although this would work in general, it can be unreliable when the appearance of the object is changing (e.g., a walking person), when the occlusion is small (e.g., a person walking behind a lamp-post), or when the resolution is low (for far and small objects).
The support relationship between two objects can be inferred by counting how many times the bottom part of a polygon overlaps with the supporting object (e.g., the boundary defining a person will overlap with the boundary defining a sidewalk, whenever these two objects co-occur in an image and they are nearby each other). Once support relations are estimated, we can use the contact point of an object with the ground to recover its 3D position. Both techniques (recovering camera parameters with 3D object sizes and in- ferring the support graph) benefit from a large collection of annotated images. This 40 CHAPTER 2. LABELME VIDEO: BUILDING A VID EO DATABASE WITH HUMAN ANNOTATIONS 1km 1Om/fj Example 3D-popup generated 1m from annotations a) video frame b) propagated c) occlusion d) occlusion handling e) depth map annotations handling with with 3D information LabelMe heuristic Figure 2.6: Occlusion relationships and depth estimation. A sample video frame (a), the propagated polygons created with our annotation tool (b), the polygons o rdered using the LabelMe-heuristic-based inference for occlusion relationships (c) polygon orde ring using La- belMe heuristic (notice how in the top figure, the group of people standing far away from the camera are mistakenly ordered as closer than the man pushi ng the stroller and in the bottom figure there is a mistake in the ordering of the cars), and (d) ordered polygon s inferred using 3D relationship heuristics (notice how the mistakes in (c) are fixed). information, learned from still images, is used to recover a 3D model of the scene. As our video scenes share similar objects with LabelMe, we are able to estimate 3D in- formation for each video frame in our database (even when there is no camera mot ion for inferring 3D using structure from motion techniques). We found that this techniq ue works well in most outdoor street scenes, but fails in many indoor scenes due to the lack of a clear ground plane. Fig. 2.6 shows some results with succe ssful depth order inference in a video annotation. * 2.5.2 Cause-effect relations within moving objects In the previous section we showed how a collection of static im ages can be used to extract additional information from annotated videos. Here, we discuss how to extract information from videos that might not be inferred solely from static images. There- fore, the two collections (images and video) can complement each other. As described in Section 2.3.1, our annotation tool allows users to rec ord whether an object is moving or static. Using this coarse motion information, we can infer cause- r -7IIN Sec. 2.6. Discussion 47 effect motion relationships for common objects. We define a measure of causality, which is the degree to which an object class C causes the motion in an object of class E: de) p(E moves C moves and contacts E) causality(C,E) =pEueIC vs omeoonat) (2.4) p(E mocesjCdocs not mote or contact E) Table 2.4 shows the inferred cause-effect motion relationships from the objects annotated in our database. It accurately learns that people cause the motions of most objects in our surroundings and distinguishes inert objects, such as strollers, bags, doors, etc., as being the ones moved by living objects. * 2.6 Discussion Most of the existing video analysis systems (e.g.motion estimation, object tracking, video segmentation, object recognition, and activity analysis) use a bottom-up ap- proach for inference. 
Table 2.4: Inferred cause-effect motion relationships. The cause-effect relationships are ranked by causality score. The threshold line separates correct relationships from the incorrect ones (with lower scores). Notice that many pairs of relationships appear in the list in their two possible forms (e.g., knife -> bag and bag -> knife), but that in all cases the correct one has a higher score than the incorrect one.

    Cause (C)    Effect (E)    causality(C, E)
    hand         carrot        11.208333
    hand         knife         10.579545
    person       purse         10.053191
    person       stroller       8.966667
    knife        hand           8.042553
    carrot       hand           7.388889
    person       door           5.026596
    knife        carrot         4.691489
    person       bycicle        4.339286
    carrot       knife          4.015152
    person       bag            3.800000
    person       motorbike      2.994286
    hand         water          2.453704
    bag          purse          2.409091
    purse        bag            2.345930
    water        hand           2.317308
    motorbike    bag            2.297297
    bag          motorbike      2.216667
    stroller     person         2.037879
    car          tree
    door         person
    purse        person
    bycicle      person

2.6 Discussion

Most of the existing video analysis systems (e.g., motion estimation, object tracking, video segmentation, object recognition, and activity analysis) use a bottom-up approach for inference. Despite the high correlation between these topics, solutions are often sought independently for each problem. We believe that the next step in developing video analysis techniques involves integrating top-down approaches by incorporating prior information at the object and action levels. For example, motion estimation can be performed in a completely different way by first recognizing the identities of the objects, accessing motion priors for each object category, and possibly integrating occlusion relationships of the objects in the scene to finally estimate the motion of the whole scene. As it is inherently easier to annotate a database of static images, propagating the annotations of static images to label a video database can be crucial to grow the video database in both scale and dimension. We showed, for example, how depth information can be propagated from static images to video sequences, but there is a lot more to explore. Recent advances in object recognition and scene parsing already allow us to segment and recognize objects in each frame. Object recognition, together with temporal smoothing to impose consistency across frames, could significantly reduce the human annotation labor necessary for labeling and tracking objects.

2.7 Conclusion

We designed an open, easily accessible, and scalable annotation system to allow online users to label a database of real-world videos. Using our labeling tool, we created a video database that is diverse in samples and accurate, with human-guided annotations. Based on this database, we studied motion statistics and cause-effect relationships between moving objects to demonstrate examples of the wide array of applications for our database. Furthermore, we enriched our annotations by propagating depth information from a static and densely annotated image database. We believe that this annotation tool and database can greatly benefit the computer vision community by contributing to the creation of ground-truth benchmarks for a variety of video processing algorithms and by serving as a means to explore information about moving objects.

Chapter 3

A data-driven approach for unusual event detection

3.1 Introduction

If we are told to visualize a street scene, we can imagine some composition with basic elements in it.
Moreover, if we are asked to imagine what can happen in it, we might say there is a car moving along a road, in contact with the ground and preserving some velocity and size relationships with respect to other elements in the scene (say a person or a building). Even when constrained by its composition (e.g., when shown a picture of it), we can predict things like an approximate speed for the car, and maybe even its direction (see Fig. 1.3). The human capacity for mental imagery and storytelling is driven by the years of prior knowledge we have about our surroundings. Moreover, it has been found that static images implying motion are also important in visual perception: they are able to produce motion after-effects [56] and even activate motion-sensitive areas in the human brain [21]. As a consequence, the human visual system is capable of accurately predicting plausible events in a static scene (or future events in a video sequence) and is finely tuned to flag unusual configurations or events. Event and action detection are well-studied topics in computer vision. Several works have proposed models to study, characterize, and classify human actions, ranging from constrained environments [33, 40] to actions in the "wild" such as TV shows, sporting events, and cluttered backgrounds [24, 34]. In this scenario, the objective is to identify the action class of a previously unknown query video given a training dataset of action exemplars (captured at different locations). A different line of work is that of event detection for video surveillance applications. In this case, the algorithm is given a large corpus of training video captured at a particular location as input, and the objective is to identify abnormal events taking place in the future in that same scene [20, 54, 55, 62]. Consequently, deploying a surveillance system requires days of data acquisition from the target location and hours of training for each new location. In this chapter we look into the problem of generic event prediction for scene instances different from the ones in some large training corpus. In other words, given an image (or a short video clip), we want to identify the possible events that may occur as well as the abnormal ones. We motivate our problem with a parallel to object recognition. Current event prediction and anomaly detection technologies for surveillance are analogous to object instance recognition. Many works in object recognition are moving towards the more generic problem of object category recognition [3, 4]. We aim to push the envelope in the video domain by introducing a framework that can easily adapt to new scene instances without requiring a model to be retrained for each new location. Moreover, other potential applications lie in the area of video collection retrieval in online services such as YouTube and Vimeo, where video clips are captured in different locations and differ greatly from controlled video sources such as surveillance feeds and TV programming, as pointed out by Zanetti et al. [61]. Given a query image, our purpose is to identify the events that are likely to take place in it. We have a rich video corpus with 2401 real-world videos acting as our prior knowledge of the world. In an offline stage, we generate and cluster motion tracks for each video in the corpus. Using scene matching, our system retrieves videos with similar image content.
Track information from the retrieved videos is integrated to make a prediction of where in the image motion is likely to take place. Alternatively, if the input is a video, we track and cluster salient features in the query and compare each cluster to the ones in the retrieved neighbor set. A track cluster can then be flagged as unusual if it does not match any in the retrieved set.

3.2 Related Work

Human action recognition is a popular problem in the video domain. The work by Efros et al. [5] learns optical flow correlations of human actions in low-resolution video. Schechtman and Irani exploit self-similarity correlations in space-time volumes to find similar actions given an exemplar query. Niebles et al. [34] characterize and detect human actions in complex video sequences by learning probability distributions of sparse space-time interest points. Laptev et al. densely extract spatio-temporal features in a grid and use a bag-of-features approach to detect actions in movies. Messing et al. model human activities as mixtures of bags of velocity trajectories extracted from track data. None of these works study the task of event prediction, and they are constrained to human actions. Similar in concept to our vision is the work by Li et al. [28], where the objective is action classification given an object and a scene. Our work is geared towards localized prediction, including trajectory generation, rather than classification. Extensive work has also taken place in event and anomaly detection for surveillance applications. One family of works relies on detecting, tracking, and classifying objects of interest and learning features to distinguish events. Dalley et al. detect loitering and bag-dropping events using a blob tracker to extract moving objects and detect humans and bags. The system identifies a loitering event if a person blob does not move for a period of time. Bag-dropping events are detected by checking the distance between a bag and a person; if the distance becomes larger than some threshold, it is identified as a dropped bag. A second family of works clusters motion features and learns distributions over motion vectors across time. Wang et al. [55] use a non-parametric Bayesian model for trajectory clustering and analysis. A marginal likelihood is computed for each video clip, and low-likelihood events are flagged as abnormal. One common assumption of these methods is that training data is available for each scene instance where the system will be deployed. Therefore, the knowledge built is not transferable to new locations, as the algorithm needs to be retrained with video feeds from each new location. Numerous works have demonstrated success in using rich databases for retrieving and/or transferring information to queries in both images [14, 15, 31, 51] and video [30, 44]. In video applications, Sivic et al. [44] proposed a video representation for exemplar-based retrieval within the same movie. Moving objects are tracked and their trajectories grouped. Upon selection of an image crop in some video frame, the system searches across video key frames for similar image regions and retrieves portions of the movie containing the queried object instance. The work proposed by Liu et al. [30] is the closest one to our system.
It introduces a method for motion synthesis from static images by matching a query image to a database of video clip frames and transferring the moving regions from the nearest-neighbor videos (identified as regions where the optical flow magnitude is nonzero) to the static query image. That work constructs independent interpretations per nearest neighbor. Instead, our work builds localized motion maps as probability distributions after merging votes from several nearest neighbors. Moreover, we aim for a higher-level representation where each moving object is modeled as a track blob, while [30] generates hypotheses as one motion region per frame. In summary, these works demonstrate the strong potential of data-driven techniques, which to our knowledge no prior work has extended to anomaly detection.

3.3 Scene-based video retrieval

The objective of this project is to use event knowledge from a training database of videos to construct an event prediction for a given static query image. To achieve some semantic coherence, we want to transfer event information only from similar images. Therefore, we need a good retrieval system that will return matches with similar scene structures (e.g., a picture of an alley will be matched with another alley photo shot from a similar viewpoint) even if the scene instances are different. In this chapter we explore the use of two scene matching techniques: GIST [36] and spatial pyramid dense SIFT [26] matching. The GIST descriptor encodes perceptual dimensions that characterize the dominant spatial structure of a scene. The spatial pyramid SIFT matching technique works by partitioning an image into subregions and computing histograms of local features in each subregion. As a result, images with similar global geometric correspondence can be easily retrieved. The advantage of both the GIST and dense SIFT retrieval methods is their speed and efficiency at projecting images into a space where semantically similar scenes are close together. This idea has proven robust in many non-parametric data-driven techniques such as label transfer [31] and scene completion [15], amongst many others. To retrieve the nearest videos from a database, we perform matching between the first frame of the video query and the first frame of each of the videos in the database.

3.4 Video event representation

We introduce a system that models a video as a set of trajectories of keypoints throughout time. Individual tracks are further clustered into groups with similar motion. These clusters will be used to represent events in the video.

3.4.1 Recovering trajectories

For each video, we extract trajectories of points in the sequence using an implementation of the KLT tracker [48] by Birchfield [2]. The KLT tracking equation seeks the displacement d = [d_x, d_y]^T that minimizes the dissimilarity between two windows, given a point p = [x, y]^T and two consecutive frames I and J:

    ε(d) = ∫∫_W [ J(p + d/2) − I(p − d/2) ]² w(p) dp    (3.1)

where W is the window neighborhood and w(p) is the weighting function (set to 1). Using a Taylor series expansion of J and I, the displacement that minimizes ε satisfies:

    ∫∫_W [ J(p) − I(p) + g^T(p) d ] g(p) w(p) dp = 0    (3.2)

where g = (1/2) [ ∂(I + J)/∂x, ∂(I + J)/∂y ]^T. The tracker finds salient points by examining the minimum eigenvalue of each 2-by-2 gradient matrix. We initialize the tracker by extracting 2000 salient points in the first video frame. The tracker then finds the correspondences of the points sequentially throughout the frames of the video. Whenever a track is broken (a point is lost due to high error or occlusions), new salient points are detected to maintain a consistent number of tracks throughout the video. As a result, the algorithm produces tracks, which are sequences of location tuples T = {(x(t), y(t))}_{t ∈ D} within a duration D for each tracked point. For more details on the implementation, we refer to the original KLT tracker paper.
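The system relies on Birchfield's KLT code; purely as a hedged illustration of the tracking step, the Python sketch below substitutes OpenCV's Shi-Tomasi corner detector and pyramidal Lucas-Kanade tracker, which follow the same principle. Lost points are simply dropped here, whereas the actual system re-detects new salient points to keep the number of tracks roughly constant.

    # A rough stand-in for the tracking stage (Sec. 3.4.1), using OpenCV's
    # pyramidal Lucas-Kanade tracker rather than the Birchfield KLT code used
    # in the thesis. frames: list of 8-bit grayscale images.
    import cv2

    def track_video(frames, max_points=2000):
        prev = frames[0]
        pts = cv2.goodFeaturesToTrack(prev, maxCorners=max_points,
                                      qualityLevel=0.01, minDistance=5)
        tracks = [[(0, float(p[0][0]), float(p[0][1]))] for p in pts]
        alive = list(range(len(tracks)))          # tracks still being followed
        for t in range(1, len(frames)):
            nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev, frames[t], pts, None)
            keep = []
            for i, (ok, p) in enumerate(zip(status.ravel(), nxt)):
                if ok:                            # point successfully followed
                    tracks[alive[i]].append((t, float(p[0][0]), float(p[0][1])))
                    keep.append(i)
            if not keep:                          # all points lost; stop early
                break
            alive = [alive[i] for i in keep]
            pts = nxt[keep]
            prev = frames[t]
        return tracks                             # each track: list of (frame, x, y)

The (frame, x, y) track form produced here is the representation assumed by the later sketches in this chapter.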
3.4.2 Clustering trajectories

Now that we have a set of trajectories for salient points in an image, we proceed to group them at a higher level. Ideally, tracks from the same object should be clustered together. We define the following distance function between two tracks:

    d_track(T_i, T_j) = (1 / |D_i ∩ D_j|) Σ_{t ∈ D_i ∩ D_j} sqrt( (x_i(t) − x_j(t))² + (y_i(t) − y_j(t))² )    (3.3)

We use this distance function to create an affinity matrix between tracks and use normalized cuts [43] to cluster them. Each entry of the affinity matrix is defined as W_ij = exp(−d_track(T_i, T_j) / σ²). The clustering output is thus a group label assignment for each track. See Fig. 3.1 for a visualization of the data. Since we do not know the number of clusters for each video in advance, we set a value of 10. In some cases this will cause an over-segmentation of the tracks and will generate more than one cluster for some objects.

Figure 3.1: Track clustering. Sample frames from the video sequence (a). The ground truth annotations, denoted by polygons surrounding moving objects (b), can be used to generate ground truth labels for the tracked points in the video (c). Our track distance affinity function is used to automatically cluster tracks into groups and generates fairly reasonable clusters, each of which roughly corresponds to an independent object in the scene (d). The track cluster visualizations in (c) and (d) show the first frame of each video and the spatial locations of all tracked points for the duration of the clip, color-coded by the track cluster each point belongs to.

3.4.3 Comparing track clusters

For each track cluster C = {T_i}, we quantize the instantaneous velocity of each track point into 8 orientations. To ensure rough spatial coherency between clusters, we superimpose a regular grid with a cell spacing of 10 pixels on top of the image frame to create a spatial histogram containing 8 sub-bins at each cell of the grid. Let H_1 and H_2 denote the histograms formed by track clusters C_1 and C_2, such that H_1(i, b) and H_2(i, b) denote the number of velocity points from the first and second track clusters, respectively, that fall into the b-th sub-bin of the i-th bin of the histogram, where i ∈ G and G denotes the bins of the grid. We define the similarity between two track clusters as the intersection of their velocity histograms:

    S_clust(C_1, C_2) = I(H_1, H_2) = Σ_{i ∈ G} Σ_{b=1}^{8} min( H_1(i, b), H_2(i, b) )    (3.4)

This metric was designed in the same spirit as the bottom level of the spatial pyramid matching method by Lazebnik et al. We aim for matches that approximately preserve global spatial correspondences. Since our video neighbor knowledge base is assumed to be spatially aligned to our query, a good match should also preserve an approximately similar spatial coherence.
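To illustrate Eqs. (3.3) and (3.4), the Python sketch below computes the track distance, the quantized-velocity spatial histogram of a track cluster, and the histogram intersection; the helper names are ours, and tracks are assumed to be in the (frame, x, y) form of the earlier sketch.

    # Minimal sketches of the track distance of Eq. (3.3) and the track-cluster
    # descriptor and similarity of Eq. (3.4). A track is a list of (frame, x, y)
    # tuples; a cluster is a list of tracks.
    import numpy as np

    def track_distance(track_a, track_b):
        a = {t: (x, y) for t, x, y in track_a}
        b = {t: (x, y) for t, x, y in track_b}
        common = sorted(set(a) & set(b))          # frames where both tracks exist
        if not common:
            return np.inf
        d = [np.hypot(a[t][0] - b[t][0], a[t][1] - b[t][1]) for t in common]
        return float(np.mean(d))                  # mean point-to-point distance

    def cluster_histogram(cluster, frame_shape, cell=10, n_orient=8):
        h, w = frame_shape
        hist = np.zeros((h // cell + 1, w // cell + 1, n_orient))
        for track in cluster:
            for (t0, x0, y0), (t1, x1, y1) in zip(track[:-1], track[1:]):
                dx, dy = x1 - x0, y1 - y0
                if dx == 0 and dy == 0:
                    continue                      # skip stationary points
                angle = np.arctan2(dy, dx) % (2 * np.pi)
                o = int(angle / (2 * np.pi) * n_orient) % n_orient
                gy, gx = int(y0) // cell, int(x0) // cell
                if 0 <= gy < hist.shape[0] and 0 <= gx < hist.shape[1]:
                    hist[gy, gx, o] += 1          # vote in the cell's orientation bin
        return hist

    def cluster_similarity(hist1, hist2):
        return float(np.minimum(hist1, hist2).sum())   # histogram intersection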
3.5 Video database and ground truth

Our database consists of 2277 videos belonging to 100 scene categories. The categories with the most videos are: street (809), plaza (135), interior of a church (103), crosswalk (82), and aquarium (75). Additionally, 14 videos containing unusual events were downloaded from the web (see Fig. 3.2 for some sample frames). 500 of the videos originate from the LabelMe video dataset [59]. As these videos were collected using consumer cameras without a tripod, there is slight camera shake. Using the LabelMe video system, the videos were stabilized. The object-level ground truth labeling in the LabelMe video database allows us to easily visualize the ground truth clustering of tracks and compare it with our automated results (see Fig. 3.1). We split the database into 2301 training videos and selected 134 videos from outdoor urban scenes plus the 14 unusual videos to create a test set with 148 videos.

3.6 Experiments and Applications

We present two applications of our framework. Given the information from nearest-neighbor videos, what can we say about an image if we were to see it in action? As an example, we can make good predictions of where motion is bound to happen in an image. We also present a method for determining the degree of anomaly of an event in a video clip using our training data.

Figure 3.2: Unusual videos. We define an unusual or anomalous event as one that is not likely to happen in our training data set. However, we ensured that these videos belong to scene classes present in our video corpus.

Figure 3.3: Localized motion prediction ROC (a) and unusual event detection ROC (b). The algorithm was compared against two scene matching methods (GIST and dense SIFT) as well as a baseline supported by random nearest neighbors. Retrieving videos similar to the query image improves the classification rate.

3.6.1 Localized motion prediction

Given a static image, we can generate a probabilistic map determining the spatial extent of motion. In order to estimate p(motion | x, y, scene), we use a Parzen window estimator and the trajectories of the N = 50 nearest-neighbor videos retrieved with scene matching methods (GIST or dense SIFT-based):

    p(motion | x, y, scene) = (1/N) Σ_{i=1}^{N} (1/M_i) Σ_{j=1}^{M_i} Σ_t K( x − x_ij(t), y − y_ij(t); σ )    (3.5)

where N is the number of videos, M_i is the number of tracks in the i-th video, and K(x, y; σ) is a Gaussian kernel of width σ². Fig. 3.3(a) shows the per-pixel prediction ROC curve comparing GIST nearest neighbors, dense SIFT matching, and, as a baseline, a random set of nearest neighbors. The evaluation set is composed of the first frame of each test video. We use the locations of the tracked points in the test set as ground truth. Notice that scenes can have multiple plausible motions occurring in them, but our current ground truth provides only one explanation. Despite this limited capacity for evaluation, notice the improvement when using SIFT and GIST matching to retrieve nearest neighbors. This graph suggests that (1) different sets of motions happen in different scenes, and (2) scene matching techniques do help filter out distracting scenes to make more reliable predictions (for example, a person climbing the wall of a building in a street scene would be considered unusual, but a person climbing a wall in a rock-climbing scene is normal).
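As a concrete illustration of Eq. (3.5), the Python sketch below accumulates the tracked points of the retrieved neighbor videos into a per-pixel vote map and smooths it with a Gaussian kernel; the per-video weighting and the final normalization are assumptions about details not spelled out in the text.

    # Minimal sketch of the localized motion map of Eq. (3.5). neighbor_tracks
    # is a list with one entry per retrieved video, each entry being a list of
    # tracks in (frame, x, y) form. sigma plays the role of the kernel width.
    import numpy as np
    from scipy.ndimage import gaussian_filter

    def motion_map(neighbor_tracks, frame_shape, sigma=15.0):
        h, w = frame_shape
        acc = np.zeros((h, w))
        for tracks in neighbor_tracks:                # one vote map per video
            votes = np.zeros((h, w))
            for track in tracks:
                for _, x, y in track:
                    xi, yi = int(round(x)), int(round(y))
                    if 0 <= xi < w and 0 <= yi < h:
                        votes[yi, xi] += 1
            if tracks:
                acc += votes / len(tracks)            # 1/M_i weighting per video
        density = gaussian_filter(acc, sigma)         # Gaussian (Parzen) smoothing
        return density / (density.sum() + 1e-12)      # normalize to a probability map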
Fig. 3.5(c) and 3.6(c) contain the probability motion maps constructed after integrating the track information from the nearest neighbors of each query video depicted in column (a). Notice that the location of high-probability regions varies depending on the type of scene. Moreover, the reliability of the motion maps depends on (1) how accurately the scene retrieval system returns nearest neighbors from the same scene category and (2) whether the video corpus contains similar scenes. The reader can get an intuition of this by looking at column (e), which contains the average nearest-neighbor image.

3.6.2 Event prediction from a single image

Given a static image, we demonstrated that we can generate a probabilistic function per pixel. However, we are not constrained to per-pixel information. We can use the track clusters of videos retrieved from the database and generate coherent track cluster predictions. One method is to directly transfer track clusters from nearest neighbors into the query image. However, this might generate too many similar predictions. Another way lies in clustering the retrieved track clusters. We use normalized cuts clustering for this step at the track cluster level, using the similarity function described in Equation 3.4 to compare pairs of track clusters. Fig. 3.4 shows example track clusters overlaid on top of the static query image. A required input to the normalized cuts algorithm is the number of clusters. We try a series of values from 1 to 10 and choose the clustering result that maximizes the distance between clusters. Notice how, for different query scenes, different predictions are generated that take the image structure into account.

Figure 3.4: Event prediction. Each row shows a static image with its corresponding event predictions. For each query image, we retrieve its nearest video clips using scene matching. The events belonging to the nearest neighbors are resized to match the dimensions of the query image and are further clustered to create different event predictions. For example, in a hallway scene, the system predicts motions of different people; in street scenes, it predicts cars moving along the road, etc.

3.6.3 Anomaly detection

Given a video clip, we can also determine whether an unusual event is taking place. First, we break down the video clip into query track clusters (which roughly represent object events) using the method described in Section 3.4. We also retrieve the top 200 nearest videos using scene matching. We negatively correlate the degree of anomaly of a query track cluster with the maximum track cluster similarity between the query track cluster and each of the track clusters from the nearest neighbors:

    anomaly(H_query) = − max_{H_neigh} I( H_query, H_neigh )    (3.6)

where H_query is the spatial histogram of the velocity histories of the query track cluster and H_neigh denotes the histogram of a track cluster originating from a nearest neighbor. Intuitively, if we find a similar track cluster in a similar video clip, we consider the event normal. Conversely, a poor similarity score implies that such an event (track cluster) does not usually happen in similar video clips. Fig. 3.5 shows examples of events that our system identified as common by finding a nearest neighbor that minimized their anomaly score.
Notice how the nearest track clusters are fairly similar to the query ones and also how the spatial layout of the nearest-neighbor scenes matches that of the query video. As a sanity check, notice the similarity of the nearest neighbors' average image to the query scene, suggesting that the scene retrieval system is picking the right scenes to make accurate predictions. Fig. 3.6 shows events with higher anomaly scores. Notice how the nearest neighbors differ from the queries. Also, the average images are indicators of noisy and random retrievals. By definition, unusual events will be less likely to appear in our database. However, if the database does not have enough examples of particular scenes, their events will be flagged as unusual. Fig. 3.3(b) shows a quantitative evaluation of this test. Our automatic clustering generates 685 normal and 106 unusual track clusters from our test set. Each of these clusters was scored, achieving similar classification rates when the system is powered by either SIFT or GIST matching and reaching a 70% detection rate at a 22% false alarm rate. We use the scenario of a random set of nearest neighbors as a baseline. Due to our track cluster distance function, if a cluster similar to the query cluster appears in the random set, our algorithm will be able to identify it and classify the event as common. However, notice that the scene matching methods demonstrate great utility in cleaning up the retrieval set and narrowing the videos down to fewer, more relevant ones. Fig. 3.7 shows some examples of our system in action.

Figure 3.5: Track cluster retrieval for common events. A frame from a query video (a), the tracks corresponding to one event in the video (b), the localized motion prediction map (c) generated after integrating the track information of the nearest neighbors (some examples shown in (d)), and the average image of the retrieved nearest neighbors (e). Notice the definition of high-probability motion regions in (c) and how their shape roughly matches the scene geometry in (a). The maps in (c) were generated with no motion information originating from the query videos.

Figure 3.6: Track cluster retrieval for unusual events (left and middle columns) and for scenes with few samples in our data set (right column). When presented with unusual events such as a car crashing into the camera or a person jumping over a car while in motion (left and middle columns; key frames can be seen in Fig. 3.7), our system is able to flag these as unusual events (b) due to their disparity with respect to the events taking place in the nearest-neighbor videos. Notice that the supporting neighbors belong to the same scene class as the query and that the motion map predicts movements mostly in the car regions. However, our system fails when an image does not have enough representation in the database (right).

Figure 3.7: Unusual event detection. Videos of a person jumping over a car and running across it (left) and a car crashing into the camera (right). Our system outputs anomaly scores for individual events. Common events are shown in yellow and unusual ones in red. The thickness and saturation of the red tracks are proportional to the degree of anomaly.
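Putting the retrieval and matching pieces together, the Python sketch below scores the track clusters of a query clip against all clusters gathered from the retrieved neighbor videos, following Eq. (3.6); it reuses the cluster_histogram and cluster_similarity helpers from the earlier sketch.

    # Minimal sketch of the anomaly score of Eq. (3.6). query_clusters and
    # neighbor_clusters are lists of track clusters (each a list of tracks).
    def anomaly_scores(query_clusters, neighbor_clusters, frame_shape):
        neighbor_hists = [cluster_histogram(c, frame_shape) for c in neighbor_clusters]
        scores = []
        for cluster in query_clusters:
            h_q = cluster_histogram(cluster, frame_shape)
            best = max((cluster_similarity(h_q, h_n) for h_n in neighbor_hists),
                       default=0.0)
            scores.append(-best)              # higher (less negative) = more unusual
        return scores

A query cluster that finds no well-matching cluster among the retrieved neighbors keeps a high anomaly score, which is what produces the red tracks in Fig. 3.7.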
3.7 Discussion and concluding remarks

We have presented a flexible and robust system for unsupervised localized motion prediction and anomaly detection powered by two phases: (1) scene matching to retrieve similar videos given a query video or image, and (2) motion matching via a scene-inspired and spatially aware histogram matching technique for velocity information. We emphasize that most of the work in the literature focuses on action recognition and detection and requires training models for each different action category. Our method has no training phase, is quick, and naturally extends into applications that are not available under other supervised learning scenarios. Experiments demonstrate the validity of our approach when given enough video samples of real-world scenes. We envision its applicability in areas such as finding unique content in video sharing websites and future extensions in surveillance applications.

Chapter 4

Car trajectory prediction from a single image

4.1 Introduction

Object, scene, action, and event recognition/detection are active areas of research in image understanding. They focus on understanding the present content of images and video. The concept of event prediction focuses instead on what is likely to happen in the future given a present configuration. If we are at an intersection and see a car approaching, we can foresee where it will be in the next few seconds. The problem of event prediction from a single image was originally posed by Liu et al. in [31] and further studied by Yuen and Torralba in [60]. The setting involves a static image, and the objective is to determine the future of its elements as if it were a frame in a video sequence. The motivation for this problem lies at the core of dynamic scene understanding. This problem, seemingly impossible at first glance, has been approached using data-driven approaches. While we have not seen hours of video from the particular intersection we might be at, we have seen many examples of streets and intersections in general. In [31], the single image can be matched against video frames in a large database and, given a good match, new dynamic objects can be introduced into the image, or a localized motion map can be computed; if an optical flow field or a group of motion tracks is transferred to the image, they can also provide cues about the types of motions to expect in such an image. This chapter approaches event prediction in a different way. Our key contribution lies in modeling dynamic scenes in 3D and at the object level. Given a static image of a street scene and a car detection in it, our system matches the car and its scene against all cars in similar scenes in a subset of the LabelMe video database [59]. It ranks the detected cars as a function of scene, pose, and location compatibility. As the selected cars originate from videos, we can track their trajectories and estimate the 3D structure of the scenes they belong to. Having trajectories and annotations in world coordinates allows placing all objects in the database in a common reference frame and then re-projecting them onto the query image for a better fit in the new scene. Our model implicitly encapsulates object semantics (what the object is and what motion it is likely to perform) as well as real-world dimensions and velocities of objects.

4.2 Related Work

Human action recognition is an important problem in computer vision.
In this setup, the objective is to detect, understand, and make sense of what an object is doing across time. Many of these works leverage the information content of large databases of videos to discover and build action models [5, 25]. Schechtman and Irani [42] use a self-similarity descriptor to match video queries to actions in a database. In [34], Niebles et al. use sparse space-time interest point distributions to describe human actions in video sequences. This work was later expanded to detect more complicated actions in Olympic sports by representing activities as compositions of various atomic motion segments [35]. The work by Laptev et al. [25] learns human actions in unconstrained videos such as professional movies. It uses spatio-temporal interest points to characterize action classes and addresses action annotation automatically via movie scripts and captions. In the work by Messing et al. [33], human actions are modeled as mixtures of bags of velocity trajectories extracted from track data. While this is an active and extensive research area, it differs from the problem of object trajectory prediction, as the latter is constrained to much less information at query time and the algorithm is expected to synthesize the motion of an object instance it has not observed previously. In a similar flavor to event prediction lie several works classifying images based on the events they imply [9, 28, 57]. Amongst them, Li et al. integrate object- and scene-level information to determine human-centric actions present in static images. Yao et al. use a structured representation to model human actions (such as a person playing the violin), also from static images. In [19], Jie et al. find image-caption correspondences amongst news images and text; they also learn visual models linking poses and faces to action verbs. This problem differs from event prediction in that its objective is to output the name of an action class and not to provide future configurations of the objects in question. In the video surveillance realm, large amounts of video data are available to learn patterns of activity for a particular scene. Unsupervised models such as [22, 55] learn spatio-temporal dependencies of moving agents in these complex and dynamic scenes. As a consequence, complex temporal rules, such as the right of way at an intersection, are discovered from the long video feeds. However, the knowledge built into these models for surveillance is specific to one scene instance and cannot be transferred to a previously unseen location. Many works have demonstrated success using 3D representations for image understanding and tracking. Images are a 2D projection of our 3D world; this projection depends on camera parameters. When reasoning about objects across images, it has proven useful to map a scene and its objects into a 3D coordinate system. For instance, in the mobile vision system by Ess et al. [7], pedestrians in crowded scenes are tracked by performing pose estimation and prediction of the next frame's motion. It makes use of cues such as stereo, depth, scene geometry, and feedback from tracking. The work by Leibe et al. [27] integrates the detection and frame-level prediction of cars and pedestrians in an online mobile system. Its trajectory estimation module analyzes the 3D observations to find physically plausible space-time trajectories.
Even when the camera parameters are unknown (e.g., web images), some assumptions can be made regarding the camera position and parameters by using the scene layout. For instance, Hoiem et al. use the appearance of image regions to learn and classify geometric classes, which in turn describe the 3D orientation of a region with respect to the camera. The 3D information of a scene can also be used to build priors regarding the location and scale of objects such as cars and pedestrians, as well as for single-view 3D reconstructions from static images [16, 17, 18, 38]. These applications demonstrate the power that 3D representations can have in object recognition, detection, and tracking tasks. To the extent of our knowledge, no other work models objects in a diverse video database captured with consumer point-and-shoot cameras where the camera information is unknown.

4.3 Object and trajectory model

Images and videos in LabelMe are captured at different locations and by different subjects. To make the best use of the data regardless of the scene or camera setup, we make some assumptions and use the moving objects as a cue to estimate the scene parameters for each video. Consequently, the object trajectories are mapped to world coordinates and placed into a global reference frame.

Camera and scene layout. This work uses the same single-view image representation as [38], where a scene is composed of objects represented as piecewise planes. No camera parameters are available for the videos in our data set. To make estimates of real-world dimensions, we make the following assumptions on the data: (1) the location of the camera is at the mean human height level (e.g., 160 cm), (2) the ground plane is flat, (3) the only camera rotation is due to pitch (no yaw or roll), and (4) people and cars are always in contact with the ground plane and do not change in size. Assumptions (2) and (3) result in a horizontal horizon line.

Figure 4.1: Estimated average height and speed from annotations in the LabelMe video dataset. With the approach first introduced by Hoiem et al., we can estimate the average height of objects in the data set. From video, we additionally estimate average velocities.

4.3.1 From 2D to 3D

In a LabelMe video annotation, each object is represented by a list of polygons, one at each time frame. We simplify the annotations by using their bounding boxes. Since cars and people stand on the ground plane, we consider the two lower points of each bounding box as the contact points with the ground. Let x_i^t be a vector containing the image coordinates of the bounding box belonging to the i-th object at frame t, where t ∈ D_i, the duration of the trajectory. We assume that each object stays constant in size in 3D and thus has a real-world width w_i and height h_i associated with it. Let v_y be the image coordinate of the y component of the horizon line. The vector of parameters to estimate is then θ = [w_1, h_1, ..., w_n, h_n, v_y]. We call the base point of a bounding box the midpoint between its two contact points with the ground. The base points of all bounding boxes are converted to world coordinates using the current parameters in the vector θ. Bounding boxes in the 3D coordinate system are then reconstructed using the base point information and the estimated world dimensions of the object. Finally, let x̂_i^t denote the projection of these bounding boxes from world coordinates back to the image plane. This results in an optimization problem with the objective function:

    E(θ) = Σ_i Σ_{t ∈ D_i} || x_i^t − x̂_i^t ||² + Σ_i ( || w_i − w_can(c_i) ||² + || h_i − h_can(c_i) ||² )

The constants w_can and h_can denote the canonical width and height for an object of class c_i. These constants can be taken from real-world statistics. For details on the processes to convert from image to world coordinates and vice versa, we refer to the literature on multi-view geometry [13].
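To make the image-to-world conversion concrete, the Python sketch below back-projects a bounding box onto the ground plane under the assumptions above; the focal length (in pixels) is an assumed constant, since the videos carry no calibration, and the lateral term ignores the effect of camera pitch, a common approximation.

    # Minimal sketch of the 2D-to-3D conversion under the assumptions of
    # Sec. 4.3: flat ground, camera height cam_h, horizontal horizon at image
    # row v_y (rows grow downward). f (focal length in pixels) and image_width
    # are assumptions, since the videos carry no calibration data.
    def box_to_world(box, v_y, image_width, cam_h=1.60, f=800.0):
        """box: (left, top, right, bottom) in pixels -> (X, Z, height) in meters."""
        left, top, right, bottom = box
        db = bottom - v_y                     # pixels below the horizon line
        if db <= 0:
            raise ValueError("object bottom must lie below the horizon")
        Z = f * cam_h / db                    # depth of the ground contact point
        u = 0.5 * (left + right)              # base point: midpoint of the bottom edge
        X = Z * (u - 0.5 * image_width) / f   # lateral offset on the ground plane
        height = cam_h * (bottom - top) / db  # world height from similar triangles
        return X, Z, height

The same relations, run in reverse, re-project a world-coordinate trajectory back into a new query image once its horizon line is known.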
We estimated the 3D measurements of all the car and person polygons in the annotated videos. As shown by Hoiem et al. [18], having world measurements for objects, we can generate statistics on the size and velocity of the objects (see Figure 4.1). For instance, based solely on annotations, we find that the mean height of a person is 1.73 m and their mean velocity is 5.45 km/h, while the mean height of a car is 1.60 m and its mean velocity in the data set is 36.06 km/h.

4.4 From trajectories to action discovery

Having object trajectories in world coordinates gives us the freedom to place them in a common frame of reference despite their original locations in the images or the specifics of the scene. To compare trajectories, each is translated such that the base point of its first box lies at the world origin.

Figure 4.2: Visualization of the trajectory feature. The ground plane is divided radially into 8 equally sized regions. Each trajectory is translated to the world center and described by the normalized count of bounding boxes landing in each region.

Once all the trajectories are translated to the origin, we compact them into feature descriptors. The ground plane is divided into 8 equally sized angular regions centered at the origin (see Figure 4.2 for a visualization). A trajectory is described by the normalized count of boxes landing in each region. Finally, the trajectories for each object class are clustered in this feature space, resulting in the clusters visualized in Figure 4.3; notice how each cluster represents motions in different directions.

Figure 4.3: Discovered motion clusters for each object class. Object trajectories for each class are normalized and transferred to a common point in world coordinates. The trajectories are further clustered, and each cluster is visualized as an energy map summarizing all of the trajectories belonging to the cluster. The trajectories have been translated, re-projected, and resized to fit the displayed image crops.

4.4.1 Car trajectory prediction from a single image

The previous section described how we convert the location of objects from a 2D representation to a 3D one. This section describes the general pipeline for trajectory prediction. Given a car detection in a static image (we assume the images depict urban environments similar to the ones in the video database), the task is to predict a plausible trajectory for it. Figure 1.4 shows a visualization of such a trajectory drawn over the input image. Our prediction method relies on scene and object matching in a non-parametric manner. We begin by retrieving the nearest videos at the scene level using the gist descriptor.
We do this by computing gist features for each frame in the video dataset and comparing the gist descriptor of the query to that of each video frame in our dataset. Every frame in each video is ranked, and the frame closest to the query image becomes the representative of the video it belongs to. Finally, we gather the top 200 frames (each representing a separate video) closest to our query image to work with. This first matching phase selects videos close to the query at the scene level. However, we aim for a prediction at the object level; we need to reason about the data contained in the nearest neighbors at the object (in this case, car) level. Therefore, we proceed by detecting the cars contained in the selected video frames using the Latent Deformable Part Model (LDPM) detector [11]. In selecting the appropriate car from which to transfer information to the query, we re-rank the detections using the following criteria:

* How confident the detection is (the score).
* How close the originating scene is to the query (the gist distance).
* Whether the detection is of a similar size to the one in the query (bounding box intersection).
* How similar the pose of the detected car is to that of the query.

To ensure that detections with a pose similar to the query are scored higher, we utilize the concept of an Exemplar SVM (eSVM) detector introduced by Malisiewicz et al. [32]. For each query object, we train a separate exemplar SVM with only the crop of the query car as a positive example and a conventional set of negative instances generated from 300 images that do not contain cars. This yields an instance-specific classifier that detects windows similar to our instance-level template.

Figure 4.4: Image containing the top detection using the LDPM car detector (in blue) and the top exemplar SVM detection (right), together with the single positive exemplar used to train the eSVM (left). The LDPM detector is trained on many instances of cars varying in shape and pose. The non-maximum-suppression phase rules out overlapping detections and scores the blue detection as the highest. The eSVM, trained on the single positive instance (left), identifies the window that best matches the hatchback template from the query (in this case the side of the cab, excluding its hood). Our approach aims at detecting complete cars using the LDPM detector and filtering these detections using an eSVM detector trained on the query crop. We compare the bounding box intersection between the eSVM detection and the DPM one and discard detections that do not overlap more than 70%.

In practice, however, what we observe is the scenario depicted in Figure 4.4. Having only one positive example, the eSVM will fire on parts of the car that satisfy the template. Figure 4.4 shows how the eSVM fires on a portion of the cab, excluding its hood, to satisfy the template of a hatchback model. While this result is reasonable, our application benefits from discarding partial detections. The LDPM detector is trained on a large variety of cars and, after non-maximum suppression, is able to filter out most partial car detections. Therefore, we exploit the comprehensive knowledge base of the LDPM detector and use the instance-tuned capabilities of the eSVM detector to discard LDPM detections that differ from our exemplar query. We begin by defining a preliminary score E for some LDPM detection b. Let B_esvm denote the set of eSVM detections for the image that b originates from.
E(b) returns the maximum bounding box intersection between b and each of the found eSVM detections:

    E(b) = max_{b_e ∈ B_esvm} int(b, b_e)

The LDPM detector is executed on each gist neighbor image I_neigh, yielding the corresponding detections. The new score for each LDPM detection is defined as:

    S(b_neigh) = LDPM(b_neigh) + int(b_query, b_neigh) + G(I_neigh, I_query)    if E(b_neigh) > 0.7
    S(b_neigh) = 0                                                              otherwise

where b_neigh denotes an LDPM detection on I_neigh, LDPM(b_neigh) is a value between 0 and 1 for the detection score (the original scores were mapped through a logistic function), and G(I_neigh, I_query) denotes the gist compatibility (also between 0 and 1) between the query image and the neighbor image I_neigh that b_neigh originates from. The final score thus depends on the bounding box intersection with the query, the gist distance between scenes, and the LDPM score. Any detection that does not have a supporting eSVM detection overlapping more than 70% is discarded. Figure 4.5 shows the top detections ranked using different metrics. Once a detection is selected as a source for trajectory transfer, the trajectory of the originating object is transferred to the location of the static detection by computing the location of the detection in world coordinates and translating the retrieved object trajectory to the desired location. The trajectory can be generated by tracking the object. Finally, the size of the bounding boxes is adjusted to fit the actual size of the detection in the static image. In order to calculate the real-world dimensions of a bounding box, we require the location of the horizon line. Making the same assumptions as described in Section 4.3, we need to find the location of the horizon line, v_y. In a static image, the horizon line can be estimated automatically using the gist descriptor [36, 49] or by reasoning about object detections [18], or it can be selected manually.

4.5 Experimental evaluation

Ground truth data for event prediction is not well defined. Even when extracting trajectories from real-world video, only one outcome is observed, while in actuality there exists an entire family of realistic predictions per detection. In prior work [30], an evaluation of motion predictions compared a set of generated predictions against the trajectory observed in video amongst a set of distractors. This approach works well for identifying whether, within a number of attempts, the algorithm recovers the one observed motion, but it does not take into account the quality of the remaining predictions. We take a different approach and conduct a user study. The objective of these experiments is to evaluate the quality of object-level event predictions. Therefore, only examples with reliably detected objects were used as queries in our test set. To create this set, we randomly selected 500 videos and a random frame in each. We ran a car detector [11] on this subset, sorted the detections in descending order of confidence, and selected the top 30 detections. For each detection, 5 predictions are generated in two background modes. The first mode is a synthetic background in which there are no other objects except for the one in question and the horizon line matches the one in the originating scene (see an example in Figure 4.7). The subjects were asked to rate each prediction, based solely on its appearance, as very likely, unlikely but possible, impossible, or cannot tell.
Figure 4.8 (left) shows that for 84% of the test cases under the original scene, at least one of the 5 predictions was classified as very likely. This value drops to 12% when requiring all 5 predictions to be very likely. The figure also plots the same graph when considering both very likely and unlikely but possible answers, showing an 8% increase in performance for the 1-prediction case and up to a 30% increase when evaluating all 5 predictions. A second mode of the test uses the same predictions but situates the objects in their original scene. In this test, the users are asked to consider both the appearance of the object and the surrounding scene. They are asked to treat the trajectory as a line on the floor in 3D and to consider predictions wrong if they intersect obstacles in the world (e.g., buildings). The test subjects are told to use their prior knowledge of where these objects are likely to be in a real scene when answering the questions; for example, even if a car is moving along a trajectory that matches its appearance, it is unlikely to be moving on the sidewalk. See Figure 4.7 for an example of both test settings. Figure 4.8 summarizes the results of the user study, where each test was given to 6 different subjects. We evaluate the data inspired by the evaluation criterion in [31]: for each object, we count the number of predictions that were marked likely or very likely. The bar chart shows the percentage of object samples for which at least n predictions (out of 5 total) were deemed likely or very likely. The error bars indicate the variance amongst subjects. Since the method does not integrate semantics regarding the world the object lives in, we also tested the predictions in a synthetic environment (see an example in Figure 4.7), where subjects are asked to rate a prediction based solely on the pose of the object. Interestingly, the same predictions placed in a virtual world performed slightly worse compared to the same experiment in the original scene for some objects where only one of the five predictions was correct. This may mean that humans are to some extent unsure of the pose in some examples and that context helps to disambiguate. The variance amongst subjects when judging the objects in their original scene is lower than when judging isolated objects. This might suggest that the contextual information of the surrounding scene helps in the perception of predictions in a scene and complements the pose.

4.6 Discussion and concluding remarks

This chapter presented a framework for object-level event prediction. Our framework produces plausible car trajectories. Our key contribution lies in the 3D representation of scenes at the object level. We maximize the amount of transferable information in our training set by estimating 3D trajectories from 2D video annotations and tracks. In order to generate suitable predictions, we employ a non-parametric framework. Given a query image with a bounding box of the selected object, we retrieve the closest scenes in the video database and detect the cars in the nearest scenes. The detections are further re-ranked to account for their pose, location, detection confidence, and scene similarity. Finally, we transfer trajectories to the detections in the static image. Experimental results show that our system is able to make more reliable and consistent predictions compared to prior work.
Given the modularity of our system, we can easily substitute different object detection engines, making this framework reusable as newer and better detectors, feature descriptors, or specialized pose descriptors [6, 12] become available. In this work we have focused on cars given the nature of the data available; however, adding new objects (such as boats or bikes) is possible given sufficient detections and/or video annotations. We envision prediction technologies as important for the future development of devices such as autonomous navigation systems and artificial vision systems for the blind.

Figure 4.5: Top candidate detections for trajectory transfer. We detect cars in the 200 nearest video frames. The naive approach of considering only the gist distance between scenes results in very few reliable detections amongst the top scenes (a). Ordering detections by the LDPM detection score gives a higher ratio of reliable car detections (b); however, there is no guarantee that the detections will have the same pose as the query. An exemplar SVM approach focuses the search on windows similar to the query (c); however, this approach sometimes fires on only portions of entire cars (see yellow boxes). Finally, our approach integrates scene (gist) similarity, bounding box intersection with the query detection, and the LDPM and eSVM scores (d).

Figure 4.6: Predictions from a single image. (a) For each object, we can predict different trajectories, even from the same action/trajectory group. (b) Other example predictions; note the diversity in locations and sizes of objects and how their predictions match the motion implied by their appearance. (c) Failure cases can take place when the appearance of the object is not correctly matched to the implied family of actions, when the horizon line is not correctly estimated (or is not horizontal), or when there are obstacles in the scene that interfere with the predicted trajectory.

Figure 4.7: User study scenarios. The prediction evaluation is presented in a synthetic world (1) and in the original image where the object resides (2). In the synthetic scenario, the user is asked to determine the quality of the prediction based solely on the pose, without considering the semantics of the scene, whereas in the original scene the user is asked to judge the 3D trajectory taking the scene elements into consideration (e.g., cars should move on the road and not on sidewalks or through obstacles).

Figure 4.8: User study results. A set of 30 reliable car detections comprises our test set. Our algorithm was configured to output 5 predictions per example. Subjects were asked to score the predictions as very likely, unlikely but possible, impossible, or cannot tell. Each bar represents the percentage of objects for which at least x (out of the total 5) predictions are deemed likely. Our algorithm is evaluated under a synthetic background and a real one (blue and red bars).
Chapter 5

Conclusion

The previous chapters have presented the different components of a pipeline for data-driven event prediction and unusual event detection. They began with the design and implementation of a tool for generating ground truth information in video. As a result, we generated a diverse video database and a framework for annotating objects and events. This database was then used as the building block for several applications, including object motion statistics, unusual event detection, and event prediction, amongst others.

5.1 Contributions

Training data is crucial for most computer vision systems. Nowadays, video acquisition at a large scale is feasible thanks to the wide availability of consumer cameras, mobile devices, and video sharing websites. In Chapter 2, we introduced a video sharing resource for the research community. Videos are copyright-free and open for anyone to use. To date, we have collected 7166 videos spanning different scene categories such as streets, parking lots, offices, museums, etc. The current capabilities of the system include: (1) the creation of ground truth annotations at the object level, delineating objects throughout their lifetime in the videos, and (2) the annotation of events, encapsulating the interaction of annotated objects at a higher level. The content in Chapter 2 was the building block for the applications in the rest of the thesis. The second key contribution of this thesis (Chapter 3) uses the LabelMe video database to build a model of common events. It builds on the premise that humans are capable of transferring personal experiences in the real world to new scenes. In particular, humans are finely tuned to identify events out of the ordinary. The work in that chapter made use of scene matching to retrieve videos similar to a query and integrated the motion information in the results to build a model of "normal" events. Finally, it compares the observed motion information against the motion information deemed "normal" and assigns scores to the observed motions. The experimental results prove the validity of the approach when given enough video samples of real-world scenes. The work in Chapter 4 explores a different direction and looks more deeply into event prediction. It is based on the scene matching framework used in Chapter 3 but integrates the information from the object annotations produced in Chapter 2. We introduced a framework for object-level prediction, in this case for cars. In Chapter 3, we looked into an unsupervised method exploiting the inherent data in the LabelMe video dataset. In Chapter 4, we take a supervised approach. We use the object annotations of cars to infer real-world measurements of the scene and the objects in it. By translating objects in different videos to a common frame of reference, we maximize the amount of transferable information in our training set. Finally, given a static image, we predict car trajectories expressed in either real-world or image units. This thesis has introduced groundwork towards the study of diverse video databases. It has also re-framed traditional problems in computer vision. Traditional unusual event detection systems were framed in a monolithic manner, where large amounts of video were recorded for a single scene, and such data could not be transferred to new scenes. We use scene matching to find the subset of videos that is closest to the query and integrate the data in the nearest neighbors to learn the most common motion patterns.
Finally, this thesis posed a new problem in computer vision: event prediction from a single image. We envision prediction technologies as important for the future development of devices such as autonomous navigation systems and artificial vision systems for the blind.

Bibliography

[1] Marc Alexa, Daniel Cohen-Or, and David Levin. As-rigid-as-possible shape interpolation. In SIGGRAPH '00: Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, pages 157-164, 2000. doi: http://doi.acm.org/10.1145/344779.344859.

[2] S. Birchfield. Derivation of Kanade-Lucas-Tomasi tracking equation. Technical report, 1997. URL http://www.ces.clemson.edu/~stb/klt/.

[3] Navneet Dalal and William Triggs. Histograms of oriented gradients for human detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2005.

[4] C. Desai, D. Ramanan, and C. Fowlkes. Discriminative models for multi-class object layout. In International Conference on Computer Vision, 2009.

[5] Alexei A. Efros, Alexander C. Berg, Greg Mori, and Jitendra Malik. Recognizing action at a distance. In International Conference on Computer Vision, 2003.

[6] M. Eichner, M. Marin-Jimenez, A. Zisserman, and V. Ferrari. Articulated human pose estimation and search in (almost) unconstrained still images. ETH Technical Report, 2010.

[7] A. Ess, B. Leibe, K. Schindler, and L. V. Gool. A mobile vision system for robust multi-person tracking. In IEEE Conference on Computer Vision and Pattern Recognition, 2008.

[8] M. Everingham, A. Zisserman, C. K. I. Williams, and L. Van Gool. The PASCAL visual object classes challenge 2006 (VOC 2006) results. Technical report, September 2006.

[9] L. Fei-Fei and L.-J. Li. What, where and who? Telling the story of an image by activity classification, scene recognition and object categorization. In Studies in Computational Intelligence - Computer Vision, 2010.

[10] L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. IEEE Trans. on Pattern Analysis and Machine Intelligence, 28(4):594-611, 2006.

[11] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In IEEE Conference on Computer Vision and Pattern Recognition, 2008.

[12] V. Ferrari, M. Marin-Jimenez, and A. Zisserman. Pose search: retrieving people using their pose. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.

[13] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2000.

[14] J. Hays and A. A. Efros. IM2GPS: estimating geographic information from a single image. In IEEE Conference on Computer Vision and Pattern Recognition, 2008.

[15] James Hays and Alexei Efros. Scene completion using millions of photographs. In SIGGRAPH, 2007.

[16] D. Hoiem, A. A. Efros, and M. Hebert. Automatic photo pop-up. In SIGGRAPH, 2005. URL http://www-2.cs.cmu.edu/~dhoiem/projects/popup/.

[17] D. Hoiem, A. A. Efros, and M. Hebert. Geometric context from a single image. In International Conference on Computer Vision, 2005. URL http://www.cs.cmu.edu/~dhoiem/projects/context/index.html.

[18] D. Hoiem, A. Efros, and M. Hebert. Putting objects in perspective. In IEEE Conference on Computer Vision and Pattern Recognition, 2006.

[19] L. Jie, B. Caputo, and V. Ferrari.
Who's doing what: Joint modeling of names and verbs for simultaneous face and pose annotation. In Advances in Neural Info. Proc. Systems, 2009.

[20] I. N. Junejo, O. Javed, and M. Shah. Multi feature path modeling for video surveillance. In International Conference on Pattern Recognition, volume 2, 2004.

[21] B. Krekelberg, S. Dannenberg, K. P. Hoffmann, F. Bremmer, and J. Ross. Neural correlates of implied motion. Nature, 424:674-677, 2003.

[22] D. Kuettel, M. D. Breitenstein, L. V. Gool, and V. Ferrari. Discovering spatio-temporal dependencies in dynamic scenes. In IEEE Conference on Computer Vision and Pattern Recognition, 2010.

[23] Jean-François Lalonde, Derek Hoiem, Alexei A. Efros, Carsten Rother, John Winn, and Antonio Criminisi. Photo clip art. ACM Transactions on Graphics (SIGGRAPH 2007), 26(3), August 2007.

[24] I. Laptev and P. Perez. Retrieving actions in movies. In International Conference on Computer Vision, 2007.

[25] Ivan Laptev, Marcin Marszalek, Cordelia Schmid, and Benjamin Rozenfeld. Learning realistic human actions from movies. In IEEE Conference on Computer Vision and Pattern Recognition, 2008.

[26] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2169-2178, 2006.

[27] B. Leibe, K. Schindler, N. Cornelis, and L. Van Gool. Coupled object detection and tracking from static cameras and moving vehicles. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2008.

[28] L.-J. Li and L. Fei-Fei. What, where and who? Classifying events by scene and object recognition. In International Conference on Computer Vision, 2007.

[29] C. Liu, W. T. Freeman, E. H. Adelson, and Y. Weiss. Human-assisted motion annotation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1-8, 2008.

[30] C. Liu, J. Yuen, A. Torralba, J. Sivic, and W. T. Freeman. SIFT flow: dense correspondence across different scenes. In European Conference on Computer Vision, 2008.

[31] C. Liu, J. Yuen, and A. Torralba. Nonparametric scene parsing: Label transfer via dense scene alignment. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.

[32] Tomasz Malisiewicz, Abhinav Gupta, and Alexei A. Efros. Ensemble of exemplar-SVMs for object detection and beyond. In International Conference on Computer Vision, 2011.

[33] Ross Messing, Chris Pal, and Henry Kautz. Activity recognition using the velocity histories of tracked keypoints. In International Conference on Computer Vision, 2009.

[34] J. C. Niebles, H. Wang, and L. Fei-Fei. Unsupervised learning of human action categories using spatial-temporal words. Int. J. Comput. Vision, 79(3):299-318, 2008. doi: http://dx.doi.org/10.1007/s11263-007-0122-4.

[35] J. C. Niebles, C. W. Chen, and L. Fei-Fei. Modeling temporal structure of decomposable motion segments for activity classification. In European Conference on Computer Vision, 2010.

[36] A. Oliva and A. Torralba. Modeling the shape of the scene: a holistic representation of the spatial envelope. IJCV, 42(3):145-175, 2001.

[37] J. Ponce, T. L. Berg, M. Everingham, D. A. Forsyth, M. Hebert, S. Lazebnik, M. Marszalek, C. Schmid, B. C. Russell, A. Torralba, C. K. I. Williams, J. Zhang, and A. Zisserman. Dataset issues in object recognition. In Toward Category-Level Object Recognition, Springer-Verlag Lecture Notes in Computer Science, J. Ponce, M. Hebert, C. Schmid, and A. Zisserman (eds.), 2006.
[38] B. C. Russell and A. Torralba. Building a database of 3D scenes from user annotations. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.

[39] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. LabelMe: a database and web-based tool for image annotation. IJCV, 77(1-3):157-173, 2008.

[40] Christian Schuldt, Ivan Laptev, and Barbara Caputo. Recognizing human actions: A local SVM approach. In International Conference on Pattern Recognition, 2004.

[41] Flickr photo sharing service. http://www.flickr.com.

[42] E. Shechtman and M. Irani. Matching local self-similarities across images and videos. In IEEE Conference on Computer Vision and Pattern Recognition, 2007.

[43] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2000.

[44] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In International Conference on Computer Vision, 2003.

[45] Alan F. Smeaton, Paul Over, and Wessel Kraaij. Evaluation campaigns and TRECVid. In MIR '06: Proceedings of the 8th ACM International Workshop on Multimedia Information Retrieval, 2006.

[46] A. Sorokin and D. Forsyth. Utility data annotation with Amazon Mechanical Turk. In IEEE Workshop on Internet Vision, associated with CVPR, 2008.

[47] M. Spain and P. Perona. Measuring and predicting importance of objects in our visual world. Technical Report 9139, California Institute of Technology, 2007.

[48] Carlo Tomasi and Takeo Kanade. Detection and tracking of point features. IJCV, 1991.

[49] A. Torralba and P. Sinha. Statistical context priming for object detection. In International Conference on Computer Vision, pages 763-770, 2001.

[50] A. Torralba, R. Fergus, and W. T. Freeman. 80 million tiny images: a large dataset for non-parametric object and scene recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence, 30(11):1958-1970, 2008.

[51] A. Torralba, R. Fergus, and W. T. Freeman. Tiny images. Technical Report AIM-2005-025, MIT AI Lab Memo, September 2005.

[52] L. von Ahn and L. Dabbish. Labeling images with a computer game. In SIGCHI, 2004. URL http://www.espgame.org/.

[53] L. von Ahn, R. Liu, and M. Blum. Peekaboom: A game for locating objects in images. In ACM CHI, 2006. URL http://peekaboom.org.

[54] X. Wang, K. Tieu, and E. Grimson. Learning semantic scene models by trajectory analysis. In European Conference on Computer Vision, 2006.

[55] X. Wang, K. T. Ma, G. Ng, and E. Grimson. Trajectory analysis and semantic region modeling using a nonparametric Bayesian model. In IEEE Conference on Computer Vision and Pattern Recognition, 2008.

[56] J. Winawer, A. C. Huk, and L. Boroditsky. A motion aftereffect from still photographs depicting motion. Psychological Science, 19:276-283, 2008.

[57] B. Yao and L. Fei-Fei. Modeling mutual context of object and human pose in human-object interaction activities. In IEEE Conference on Computer Vision and Pattern Recognition, 2010.

[58] B. Yao, X. Yang, and S.-C. Zhu. Introduction to a large scale general purpose ground truth dataset: methodology, annotation tool, and benchmarks. In EMMCVPR, 2007.

[59] J. Yuen, B. C. Russell, C. Liu, and A. Torralba. LabelMe video: Building a video database with human annotations. In International Conference on Computer Vision, 2009.

[60] J. Yuen and A. Torralba. A data-driven approach for event prediction.
In European Conference on Computer Vision, 2010.

[61] S. Zanetti, L. Zelnik-Manor, and P. Perona. A walk through the web's video clips. In IEEE Workshop on Internet Vision, associated with CVPR, 2008.

[62] H. Zhong, J. Shi, and M. Visontai. Detecting unusual activity in video. In IEEE Conference on Computer Vision and Pattern Recognition, 2004.