Abstract:
This thesis studies distance approximation in two closely related models: the streaming model and the two-party communication model. In the streaming model, a massive data stream is presented in an arbitrary order to a randomized algorithm that tries to approximate certain statistics of the data with only a few (usually one) passes over the data. For instance, the data may be a flow of packets on the internet or a set of records in a large database. The size of the data necessitates the use of extremely efficient randomized approximation algorithms. Problems of interest include approximating the number of distinct elements, approximating the surprise index of a stream, or more generally, approximating the norm of a dynamically changing vector in which coordinates are updated multiple times in an arbitrary order. In the two-party communication model, there are two parties who wish to efficiently compute a relation of their inputs. We consider the problem of approximating Lp distances for any p > 0. It turns out that lower bounds on the communication complexity of these relations yield lower bounds on the memory required of streaming algorithms for the problems listed above. Moreover, upper bounds in the streaming model translate to constant-round protocols in the communication model with communication proportional to the memory required of the streaming algorithm. The communication model also has its own applications, such as secure data mining, where in addition to low communication, the goal is to prevent either party from learning anything about the other's input beyond what follows from the output and their own private input. We develop new algorithms and lower bounds that resolve key open questions in both of these models. The highlights of the results are as follows.
1. We give an Ω(1/ε²) lower bound for approximating the number of distinct elements of a data stream in one pass to within a (1 ± ε) factor with constant probability, as well as the p-th frequency moment Fp for any p ≥ 0. This is tight up to very small factors, and greatly improves upon the earlier Ω(1/ε) lower bound for these problems. It also gives the same quadratic improvement for the communication complexity of 1-round protocols for approximating the Lp distance for any p ≥ 0.
2. We give a 1-pass O(m^(1-2/p))-space streaming algorithm for (1 ± ε)-approximating the Lp norm of an m-dimensional vector presented as a data stream for any p ≥ 2. This algorithm improves the previous O(m^(1-1/(p-1))) bound, and is optimal up to polylogarithmic factors. As a special case, our algorithm can be used to approximate the frequency moments Fp of a data stream with the same optimal amount of space. This resolves the main open question of the 1996 paper by Alon, Matias, and Szegedy.
3. In the two-party communication model, we give a protocol for privately approximating the Euclidean distance (L2) between two m-dimensional vectors, held by different parties, with only polylog m communication and O(1) rounds. This tremendously improves upon the earlier protocol of Feigenbaum, Ishai, Malkin, Nissim, Strauss, and Wright, which achieved O(√m) communication for privately approximating the Hamming distance only.
This thesis also contains several previously unpublished results concerning the first item above, including new lower bounds for the communication complexity of approximating the Lp distances when the vectors are uniformly distributed and the protocol is only correct for most inputs, as well as tight lower bounds for the multiround complexity for a restricted class of protocols that we call linear.
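The link between the two models can be illustrated with a small sketch. The Python below is not the thesis's algorithm: it swaps in a textbook random-sign linear sketch for the L2 norm, in the style of Alon, Matias, and Szegedy, and shows how a one-pass streaming algorithm might double as a one-round protocol in which the only message is the algorithm's memory. The names `SignSketch` and `one_round_distance`, the default counter count, the shared seed, and the use of a hash function in place of a 4-wise independent sign family are all assumptions made for illustration.

```python
import hashlib
import math


class SignSketch:
    """Linear sketch of a vector that is given as a stream of (index, delta) updates."""

    def __init__(self, num_counters, seed):
        self.k = num_counters
        self.seed = seed          # shared randomness; both parties must use the same seed
        self.counters = [0] * num_counters

    def _sign(self, row, i):
        # Hash-derived +1/-1 value. Assumption: a cryptographic hash stands in
        # for the 4-wise independent sign family used in the standard analysis.
        key = f"{self.seed}:{row}:{i}".encode()
        byte = hashlib.blake2b(key, digest_size=1).digest()[0]
        return 1 if byte & 1 else -1

    def update(self, i, delta):
        # Coordinates may be updated many times, in any order.
        for row in range(self.k):
            self.counters[row] += self._sign(row, i) * delta

    def estimate_l2(self):
        # Each squared counter has expectation equal to the squared L2 norm;
        # averaging k of them reduces the variance.
        mean_sq = sum(c * c for c in self.counters) / self.k
        return math.sqrt(mean_sq)


def one_round_distance(x_updates, y_updates, num_counters=512, seed=7):
    """Estimate the L2 distance between x and y with a single message from Alice to Bob."""
    # Alice runs the streaming algorithm on her own updates.
    alice = SignSketch(num_counters, seed)
    for i, delta in x_updates:
        alice.update(i, delta)
    message = list(alice.counters)   # the memory contents are the whole message

    # Bob resumes from Alice's state and streams the negation of his vector,
    # so by linearity the counters now sketch x - y.
    bob = SignSketch(num_counters, seed)
    bob.counters = message
    for i, delta in y_updates:
        bob.update(i, -delta)
    return bob.estimate_l2()
```

In this sketch the communication equals the number of counters, which mirrors the statement that protocol cost is proportional to the streaming algorithm's memory. It makes no attempt at privacy and does not cover p other than 2.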

Description:
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007.; Includes bibliographical references (p. 109-114).