
dc.contributor.advisor	Madden, Samuel R.
dc.contributor.author	Moll, Oscar R.
dc.date.accessioned	2024-04-17T21:09:45Z
dc.date.available	2024-04-17T21:09:45Z
dc.date.issued	2023-06
dc.date.submitted	2023-07-13T14:26:01.254Z
dc.identifier.uri	https://hdl.handle.net/1721.1/154184
dc.description.abstract	Images and videos are now ubiquitously captured and collected. Cameras in our phones let us capture our day-to-day lives, and camera drones help engineers monitor structures and map terrain. Video conferencing is a key enabler of remote work and distributed teams. Video and images are also a key part of robotic perception, including self-driving cars, which use cameras as key sensors. Social media is increasingly image and video based. Video from these applications often makes its way to cloud storage, either for archival or for more complex uses, such as building datasets for the machine learning algorithms used by the application.

These large image and video datasets are enabling a new generation of applications. For example, video data from vehicle dashboard-mounted cameras (dashcams) is used to train object detection and tracking models for autonomous driving systems [82]; to annotate map datasets such as OpenStreetMap with locations of traffic lights, stop signs, and other infrastructure [59]; and to automate insurance claims processing by analyzing collision scene footage [61].

In parallel with this trend, deep-learning-based computer vision models have flourished in the last decade. High-quality semantic embedding models are widely available today for images, and it is reasonable to assume a basic level of automated semantic understanding of individual images and video frames in all of the application scenarios above.

Despite these enabling advances, carrying out even simple high-level tasks on image and video collections is difficult. For example, searching your own data for some ad-hoc object of interest is not always easy: publicly available models are not always accurate enough when used on proprietary datasets, and they are not equally accurate for all searches. Ad-hoc object searches are useful in their own right for data exploration, and they are also a key step in other processes, such as labeling data for supervised machine learning. Data labeling is an expensive process involving human input; in practice, this means labeling only a subset of the data. Labeling presumes we can locate images or videos with the desired label distribution for the training and validation sets, and this desired distribution may differ greatly from the natural distribution of the collected data. For example, Tesla runs an extensive suite of checks on its autopilot perception models [44]; these checks correspond to important scenarios, more akin to corner cases discovered over time than to a random sample.

The following are two important bottlenecks for any work with image and video collections:

• Adapting existing models to our own data and tasks is laborious, requiring some human annotation and testing.
• The computational demands of the models used are high, so processing large data collections is slow and expensive.

The goal of this thesis is to describe two systems, SeeSaw and ExSample, designed to tackle instances of these problems on large image and video collections, respectively.

SeeSaw [57] helps users find results for queries where the visual semantic embedding used to index the data performs poorly. The high-level approach in SeeSaw is to take user feedback during the search process in the form of bounding boxes around regions of relevance within previously shown results. Behind the scenes, SeeSaw integrates this feedback into future results. This is challenging because SeeSaw must integrate feedback from small samples of high-dimensional data in a way that increases rather than decreases result quality.
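As one way to picture this kind of embedding-space feedback loop, the sketch below applies a classic Rocchio-style update to a query vector using the embeddings of regions the user marked as relevant or irrelevant. This is an illustrative assumption for exposition, not SeeSaw's actual algorithm; the function names, weights, and update rule are hypothetical.

```python
# Minimal sketch of relevance feedback over image/region embeddings.
# Assumptions: embeddings are L2-normalized vectors; the Rocchio-style
# update and the weights below are illustrative, not SeeSaw's method.
import numpy as np

def refine_query(query_vec, positive_vecs, negative_vecs,
                 alpha=1.0, beta=0.75, gamma=0.25):
    """Nudge the query embedding toward marked-relevant regions and away
    from marked-irrelevant ones, then re-normalize for cosine search."""
    q = alpha * np.asarray(query_vec, dtype=float)
    if len(positive_vecs) > 0:
        q = q + beta * np.mean(positive_vecs, axis=0)
    if len(negative_vecs) > 0:
        q = q - gamma * np.mean(negative_vecs, axis=0)
    return q / np.linalg.norm(q)

def top_k(index_vecs, query_vec, k=10):
    """Rank pre-normalized image (or region) embeddings by cosine similarity."""
    scores = index_vecs @ query_vec
    return np.argsort(-scores)[:k]
```

With only a handful of feedback boxes, keeping alpha large relative to beta and gamma is one simple guard against the small-sample instability mentioned above.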
Searching for examples in video presents extra challenges. Video data is much larger in volume than images, since one second of typical video commonly maps to 30 image frames. This multiplier makes video datasets harder to manage, as costs are higher across the board: storage, data IO, the compute needed for decompression, and finally any processing itself. At the same time, video data is also more redundant, since consecutive frames are usually very similar in content; this redundancy reduces the utility of each frame individually.

We designed ExSample [55] to tackle these trade-offs between cost and utility that are specific to video. ExSample assumes an accurate object detector is available that can determine whether a given frame contains an object of interest and where the object is. ExSample uses this information to decide whether results are being found and, importantly, whether they are redundant. It then decides which areas of the data (either single files or segments of long video files) are likely to yield the most new examples in a future sample, and picks frames from those areas. The statistical methodology used to estimate the probability of new results rests on the same intuition as estimating the chance of finding new gene variants in a sampled population, new words in a text, or new species in an expedition, a problem known as missing mass estimation. ExSample adapts these ideas to improve sampling from videos in an online setting (an illustrative sketch of this missing-mass intuition follows the abstract below).

Beyond their common application goal of helping users with image and video searches, SeeSaw and ExSample also share some design goals:

1. Accessibility: we want to lower the barriers to getting started with these systems. Users should be able to use SeeSaw and ExSample without an extensive prior labeling effort and without models that already work well on their data. One target application of the systems is helping users find data they can label in order to build models for their own data.

2. Scalability: the exponential growth of data means that for these systems to remain useful, costs and latencies should scale favorably as the data grows. In particular, expensive processing at query time, or repeated wait times and costs that grow linearly with the data, are undesirable.

3. Adaptability: SeeSaw and ExSample adapt to users' unique data and queries. SeeSaw is meant to help users with queries on which pre-trained embedding models fall short, adapting to the users' queries and datasets. ExSample is designed to adapt to different video files and scenarios, including moving-camera and static-camera videos, and both a few large videos and many small ones.

We describe SeeSaw in Chapter 2 and ExSample in Chapter 3, followed by a discussion of limitations, areas for future work, and ways in which both systems could work together in Chapter 4.
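The sketch referenced above illustrates the missing-mass intuition behind ExSample's chunk selection with a Good-Turing-style singleton count per chunk and a Thompson-sampling draw to balance exploration and exploitation. The class, its state, and the exact scoring rule are assumptions for exposition rather than ExSample's published formulation.

```python
# Minimal sketch of missing-mass-guided sampling across video chunks.
# Assumptions: the detector yields stable per-object instance ids, and the
# Beta-draw scoring below stands in for ExSample's actual estimator.
import random
from collections import Counter

class ChunkSampler:
    """Pick which video chunk to sample next, favoring chunks whose next
    frame is most likely to reveal a not-yet-seen object instance."""

    def __init__(self, num_chunks):
        self.frames_sampled = [0] * num_chunks                        # frames sampled per chunk
        self.instance_hits = [Counter() for _ in range(num_chunks)]   # times each instance was seen

    def choose_chunk(self):
        # Good-Turing intuition: the number of instances seen exactly once
        # (singletons) relative to the frames sampled estimates the chance
        # that the next frame shows something new. A Beta draw adds
        # exploration across chunks, Thompson-sampling style.
        best, best_score = 0, -1.0
        for j, n in enumerate(self.frames_sampled):
            n1 = sum(1 for c in self.instance_hits[j].values() if c == 1)
            score = random.betavariate(n1 + 1, max(n - n1, 0) + 1)
            if score > best_score:
                best, best_score = j, score
        return best

    def record(self, chunk, detected_instance_ids):
        # Update statistics after running the object detector on a sampled frame.
        self.frames_sampled[chunk] += 1
        self.instance_hits[chunk].update(detected_instance_ids)
```

In this sketch, chunks whose recent samples keep surfacing new instances retain high scores, while chunks that mostly re-detect already-seen objects are sampled less often.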
dc.publisher	Massachusetts Institute of Technology
dc.rights	Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights	Copyright retained by author(s)
dc.rights.uri	https://creativecommons.org/licenses/by-sa/4.0/
dc.title	Efficiently Searching for Objects Within Large Collections of Images and Video
dc.type	Thesis
dc.description.degree	Ph.D.
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
dc.identifier.orcid	0000-0002-5888-4318
mit.thesis.degree	Doctoral
thesis.degree.name	Doctor of Philosophy

