Machine learning and coresets for automated real-time data segmentation and summarization
Author(s)
Volkov, Mikhail, Ph. D. Massachusetts Institute of Technology
Other Contributors
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.
Advisor
Daniela Rus.
Abstract
In this thesis, we develop a family of real-time data reduction algorithms for large data streams, by computing a compact and meaningful representation of the data called a coreset. This representation can then be used to enable efficient analysis such as segmentation, summarization, classification, and prediction. Our proposed algorithms support large streams and datasets that are too large to store in memory, allow easy parallelization, and generalize to different data types and analyses. We discuss some of the challenges that arise when dealing with real Big Data systems. Such systems are designed to routinely process unseen, possibly unbounded, data streams; are expected to perform reliably, online, in real-time, in the presence of noise, and under many performance and bandwidth limitations; and are required to produce results that are provably close to optimal. We will motivate the need for new data reduction techniques, in the form of theoretical and practical open problems in computer science, robotics, and medicine, and show how coresets can help to overcome these challenges and enable us to build several practical systems that meet these specifications. We propose a theoretical framework for constructing several coreset algorithms that efficiently compress the data while preserving its semantic content. We provide an efficient construction of our algorithms and present several systems that are capable of handling unbounded, real-time data streams, and are easily scalable and parallelizable. Finally, we demonstrate the performance of our systems with numerous experimental results on a variety of data sources, from financial price data to laparoscopic surgery video.
Description
Thesis: Ph. D. in Computer Science and Engineering, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2016. Cataloged from PDF version of thesis. Includes bibliographical references (pages 160-174).
Date issued
2016
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology
Keywords
Electrical Engineering and Computer Science.