Sublinear algorithms for massive data problems

Mahabadi, Sepideh

dc.contributor.advisor	Piotr Indyk.	en_US
dc.contributor.author	Mahabadi, Sepideh	en_US
dc.contributor.other	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.	en_US
dc.date.accessioned	2018-03-02T21:39:50Z
dc.date.available	2018-03-02T21:39:50Z
dc.date.copyright	2017	en_US
dc.date.issued	2017	en_US
dc.identifier.uri	http://hdl.handle.net/1721.1/113933
dc.description	Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2017.	en_US
dc.description	This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.	en_US
dc.description	Cataloged from student-submitted PDF version of thesis.	en_US
dc.description	Includes bibliographical references (pages 227-244).	en_US
dc.description.abstract	In this thesis, we present algorithms and prove lower bounds for fundamental computational problems in models that address massive data sets. The models include streaming algorithms, sublinear time algorithms, property testing algorithms, sublinear query time algorithms with preprocessing, and algorithms that compute small summaries of large data. More precisely, we study the following problems. The (Approximate) Nearest Neighbor problem models the task of searching among a large data set of objects. Given a data set of n points in a high dimensional space, the goal is to find the point in the data set closest to a given query point in sublinear time, after suitably preprocessing the data. This problem has numerous applications in image and video databases, information retrieval, clustering, and many others. In these applications, the points model the objects in a large data set, and their closeness measures similarity between the objects. However, for the purpose of many applications, the basic formulation of Nearest Neighbor as described encounters several challenges, which we address in this thesis: we show how to deal with the case where the data is corrupted or incomplete, how to handle multiple related queries, and how to handle a data set of more complex objects rather than simple points. Next, we show a general approach for solving massive data problems. We introduce the notion of Composable Coresets, defined as small summaries of multiple data sets that can be aggregated together to summarize the whole data. We show how to compute such summaries for several clustering problems, and at the same time, demonstrate that no such summaries are possible for other natural problems such as maximum coverage.
Finally, we study the Set Cover problem in alternate sublinear models: streaming algorithms (where one makes a small number of passes over the data using small storage), and sublinear time algorithms (where one computes the answer without reading the whole input). We present tight approximation algorithms for the Set Cover problem in both of these models. In this thesis, we introduce theoretical problems and concepts that model computational issues arising in databases, computer vision and other areas. Most of the presented algorithms are simple and practical to implement.	en_US
dc.description.statementofresponsibility	by Sepideh Mahabadi.	en_US
dc.format.extent	244 pages	en_US
dc.language.iso	eng	en_US
dc.publisher	Massachusetts Institute of Technology	en_US
dc.rights	MIT theses are protected by copyright. They may be viewed, downloaded, or printed from this source but further reproduction or distribution in any format is prohibited without written permission.	en_US
dc.rights.uri	http://dspace.mit.edu/handle/1721.1/7582	en_US
dc.subject	Electrical Engineering and Computer Science.	en_US
dc.title	Sublinear algorithms for massive data problems	en_US
dc.type	Thesis	en_US
dc.description.degree	Ph. D.	en_US
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
dc.identifier.oclc	1023861405	en_US

Files in this item

Name: 1023861405-MIT.pdf
Size: 3.483Mb
Format: PDF
Description: Full printable version

This item appears in the following Collection(s)

Doctoral Theses
