
dc.contributor.advisor: Piotr Indyk. [en_US]
dc.contributor.author: Woodruff, David Paul, 1980- [en_US]
dc.contributor.other: Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science. [en_US]
dc.date.accessioned: 2008-09-03T15:03:26Z
dc.date.available: 2008-09-03T15:03:26Z
dc.date.copyright: 2007 [en_US]
dc.date.issued: 2007 [en_US]
dc.identifier.uri: http://hdl.handle.net/1721.1/42243
dc.description: Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007. [en_US]
dc.description: Includes bibliographical references (p. 109-114). [en_US]
dc.description.abstract: This thesis studies distance approximation in two closely related models - the streaming model and the two-party communication model. In the streaming model, a massive data stream is presented in an arbitrary order to a randomized algorithm that tries to approximate certain statistics of the data with only a few (usually one) passes over the data. For instance, the data may be a flow of packets on the internet or a set of records in a large database. The size of the data necessitates the use of extremely efficient randomized approximation algorithms. Problems of interest include approximating the number of distinct elements, approximating the surprise index of a stream, or more generally, approximating the norm of a dynamically-changing vector in which coordinates are updated multiple times in an arbitrary order. In the two-party communication model, there are two parties who wish to efficiently compute a relation of their inputs. We consider the problem of approximating $L_p$ distances for any $p > 0$. It turns out that lower bounds on the communication complexity of these relations yield lower bounds on the memory required of streaming algorithms for the problems listed above. Moreover, upper bounds in the streaming model translate to constant-round protocols in the communication model with communication proportional to the memory required of the streaming algorithm. The communication model also has its own applications, such as secure datamining, where in addition to low communication, the goal is to prevent either party from learning more about the other's input than what follows from the output and his/her private input. [en_US]
dc.description.abstract: (cont.) We develop new algorithms and lower bounds that resolve key open questions in both of these models. The highlights of the results are as follows. 1. We give an $\Omega(1/\epsilon^2)$ lower bound for approximating the number of distinct elements of a data stream in one pass to within a $(1 \pm \epsilon)$ factor with constant probability, as well as the p-th frequency moment $F_p$ for any $p \geq 0$. This is tight up to very small factors, and greatly improves upon the earlier $\Omega(1/\epsilon)$ lower bound for these problems. It also gives the same quadratic improvement for the communication complexity of 1-round protocols for approximating the $L_p$ distance for any $p \geq 0$. 2. We give a 1-pass $O(m^{1-2/p})$-space streaming algorithm for $(1 \pm \epsilon)$-approximating the $L_p$ norm of an m-dimensional vector presented as a data stream for any $p \geq 2$. This algorithm improves the previous $O(m^{1-1/(p-1)})$ bound, and is optimal up to polylogarithmic factors. As a special case our algorithm can be used to approximate the frequency moments $F_p$ of a data stream with the same optimal amount of space. This resolves the main open question of the 1996 paper by Alon, Matias, and Szegedy. 3. In the two-party communication model, we give a protocol for privately approximating the Euclidean distance ($L_2$) between two m-dimensional vectors, held by different parties, with only polylog m communication and $O(1)$ rounds. This tremendously improves upon the earlier protocol of Feigenbaum, Ishai, Malkin, Nissim, Strauss, and Wright, which achieved $O(\sqrt{m})$ communication for privately approximating the Hamming distance only. This thesis also contains several previously unpublished results concerning the first item above, including new lower bounds for the communication complexity of approximating the $L_p$ distances when the vectors are uniformly distributed and the protocol is only correct for most inputs, as well as tight lower bounds for the multiround complexity for a restricted class of protocols that we call linear. [en_US]
dc.description.statementofresponsibility: by David P. Woodruff. [en_US]
dc.format.extent: 114 p. [en_US]
dc.language.iso: eng [en_US]
dc.publisher: Massachusetts Institute of Technology [en_US]
dc.rights: M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. [en_US]
dc.rights.uri: http://dspace.mit.edu/handle/1721.1/7582 [en_US]
dc.subject: Electrical Engineering and Computer Science. [en_US]
dc.title: Efficient and private distance approximation in the communication and streaming models [en_US]
dc.type: Thesis [en_US]
dc.description.degree: Ph.D. [en_US]
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
dc.identifier.oclc: 231629033 [en_US]

