Sublinear algorithms for massive data problems

Author(s)
Mahabadi, Sepideh
Other Contributors
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.
Advisor
Piotr Indyk.
Terms of use
MIT theses are protected by copyright. They may be viewed, downloaded, or printed from this source but further reproduction or distribution in any format is prohibited without written permission. http://dspace.mit.edu/handle/1721.1/7582
Abstract
In this thesis, we present algorithms and prove lower bounds for fundamental computational problems in models that address massive data sets. The models include streaming algorithms, sublinear time algorithms, property testing algorithms, sublinear query time algorithms with preprocessing, and algorithms that compute small summaries of large data. More precisely, we study the following problems.

The (Approximate) Nearest Neighbor problem models the task of searching among a large data set of objects. Given a data set of n points in a high dimensional space, the goal is to preprocess the data so that the point in the data set closest to a given query point can be found in sublinear time. This problem has numerous applications in image and video databases, information retrieval, clustering, and many others. In these applications, the points model the objects in a large data set, and their closeness measures the similarity between the objects. However, for the purpose of many applications, the basic formulation of Nearest Neighbor as described encounters several challenges, which we address in this thesis: we show how to deal with the case where the data is corrupted or incomplete, how to handle multiple related queries, and how to handle a data set of more complex objects rather than simple points.

Next, we show a general approach for solving massive data problems. We introduce the notion of Composable Coresets, defined as small summaries of multiple data sets that can be aggregated together to summarize the whole data. We show how to compute such summaries for several clustering problems, and at the same time, demonstrate that no such summaries are possible for other natural problems such as maximum coverage.

Finally, we study the Set Cover problem in alternate sublinear models: streaming algorithms (where one makes a small number of passes over the data using small storage), and sublinear time algorithms (where one computes the answer without reading the whole input). We present tight approximation algorithms for the Set Cover problem in both of these models.

In this thesis, we introduce theoretical problems and concepts that model computational issues arising in databases, computer vision and other areas. Most of the presented algorithms are simple and practical to implement.
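To make the idea of summaries that "can be aggregated together" concrete, here is a minimal Python sketch. It is not taken from the thesis. It assumes points are numeric tuples under Euclidean distance and uses greedy farthest-point selection as the per-partition summary; the function names `farthest_point_summary` and `summarize_partitioned` are hypothetical, and the thesis's own constructions and guarantees may differ.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def farthest_point_summary(points: Sequence[Point], k: int) -> List[Point]:
    """Greedy farthest-point selection (illustrative assumption).

    Starts from the first point and repeatedly adds the point that is
    farthest from everything chosen so far, until k points are picked.
    """
    if not points or k <= 0:
        return []
    chosen = [points[0]]
    # dist_to_chosen[i] = distance from points[i] to its nearest chosen point
    dist_to_chosen = [math.dist(p, chosen[0]) for p in points]
    while len(chosen) < min(k, len(points)):
        i = max(range(len(points)), key=dist_to_chosen.__getitem__)
        if dist_to_chosen[i] == 0.0:
            break  # every remaining point duplicates a chosen one
        chosen.append(points[i])
        dist_to_chosen = [
            min(d, math.dist(p, points[i]))
            for d, p in zip(dist_to_chosen, points)
        ]
    return chosen


def summarize_partitioned(parts: Sequence[Sequence[Point]], k: int) -> List[Point]:
    """Summarize each part independently, then summarize the union.

    Only the small per-part summaries are combined, so no single step
    needs to see the whole data set at once.
    """
    union = [p for part in parts for p in farthest_point_summary(part, k)]
    return farthest_point_summary(union, k)
```

In a distributed or streaming setting, each call to `farthest_point_summary` on a part could run on a separate machine or data chunk, with only k points per part sent onward for the final aggregation.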
Description
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2017.
 
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
 
Cataloged from student-submitted PDF version of thesis.
 
Includes bibliographical references (pages 227-244).
 
Date issued
2017
URI
http://hdl.handle.net/1721.1/113933
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology
Keywords
Electrical Engineering and Computer Science.

Collections
  • Doctoral Theses
