Abstract:
Over the last decade, an immense amount of data has become available. From photo collections to genetic data to network traffic statistics, modern technologies and cheap storage have made it possible to accumulate huge datasets. But how can we use all this data effectively? The ever-growing size of these datasets makes it imperative to design new algorithms capable of sifting through the data with extreme efficiency.
A fundamental computational primitive for dealing with massive datasets is the Nearest Neighbor (NN) problem. In the NN problem, the goal is to preprocess a set of objects so that later, given a query object, one can efficiently find the data object most similar to the query. This problem has a broad set of applications in data processing and analysis. For instance, it forms the basis of a widely used classification method in machine learning: to label a new object, find the most similar labeled object and copy its label. Other applications include information retrieval, searching image databases, finding duplicate files and web pages, vector quantization, and many others.
To represent the objects and the similarity measures, one often uses geometric notions. For example, a black-and-white image may be modeled by a high-dimensional vector with one coordinate per pixel, and the similarity measure may be the standard Euclidean distance between the resulting vectors. Many other, more elaborate ways of representing objects by high-dimensional feature vectors have been studied.
In this thesis, we study the NN problem, as well as other related problems that occur frequently when dealing with massive datasets. Our contribution is two-fold: we significantly improve the algorithms within the classical approaches to NN, and we propose new approaches where the classical ones fail.
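As a point of reference for the problem statement above, the following is a minimal sketch of the brute-force baseline: a linear scan under the Euclidean distance, used for the copy-the-label classification rule. It is not an algorithm from this thesis; the function names and the toy data are assumptions made for illustration only.

```python
import math


def euclidean(a, b):
    """Standard Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbor(points, query):
    """Return the index of the point closest to the query.

    Linear scan: O(n * d) time per query, with no preprocessing.
    """
    return min(range(len(points)), key=lambda i: euclidean(points[i], query))


def classify(points, labels, query):
    """Label the query by copying the label of its nearest neighbor."""
    return labels[nearest_neighbor(points, query)]


# Hypothetical toy data: three 2-dimensional "images" with labels.
points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
labels = ["dark", "dark", "bright"]
```

Each query here costs time proportional to n times d, which is the cost that sublinear-time NN data structures aim to avoid.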
We focus on several key distances and similarity measures, including the Euclidean distance, the string edit distance, and the Earth-Mover Distance (a popular method for comparing images). We also give a number of impossibility results, pointing out the limits of NN algorithms. The high-level structure of our thesis is summarized as follows.
New algorithms via the classical approaches. We give a new algorithm for the approximate NN problem in the d-dimensional Euclidean space. For an approximation factor $c > 1$, our algorithm achieves $dn^{\rho}$ query time and $dn^{1+\rho}$ space for $\rho = 1/c^2 + o(1)$. This greatly improves on the previous algorithms, which achieved $\rho$ only slightly smaller than $1/c$. The same technique also yields an algorithm with $dn^{o(\rho)}$ query time and space near-linear in n. Furthermore, our algorithm is near-optimal in the class of "hashing" algorithms.
Failure of the classical approaches for some hard distances. We give evidence that the classical approaches to NN under certain hard distances, such as the string edit distance, meet a concrete barrier at a nearly logarithmic approximation. Specifically, we show that for all classical approaches to NN under the edit distance that involve embeddings into a general class of spaces (such as $\ell_1$, powers of $\ell_2$, etc.), the resulting approximation has to be at least near-logarithmic in the strings' length.
A new approach to NN under hard distances. Motivated by the above impossibility results, we develop a new approach to the NN problem for settings where the classical approaches fail. Using this approach, we give a new efficient NN algorithm for a variant of the edit distance, the Ulam distance, which achieves a double-logarithmic approximation. This is an exponential improvement over the lower bound on the approximation achievable via the previous classical approaches to this problem.
Data structure lower bounds.
To complement our algorithms, we prove lower bounds on NN data structures for the Euclidean distance and for the mysterious but important case of the ... distance. In both cases, our lower bounds are the first ones to hold in the same computational model as the respective upper bounds. Furthermore, for both problems, our lower bounds are optimal in the considered models.
External applications. Although our main focus is on the NN problem, our techniques naturally extend to related problems. We give such applications for each of our algorithmic tools. For example, we give an algorithm for computing the edit distance between two strings of length d in near-linear time. Our algorithm achieves approximation 20 ..., improving over the previous bound of ... . We note that this problem has a classical exact algorithm based on dynamic programming, running in quadratic time.
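For comparison, the classical exact dynamic program mentioned above can be sketched as follows. This is the standard textbook quadratic-time algorithm for the edit distance with unit-cost insertions, deletions, and substitutions, not the near-linear-time approximation algorithm of this thesis; the function name is an assumption made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Exact edit distance via dynamic programming.

    Runs in O(len(a) * len(b)) time and keeps only two rows of the
    table, so it uses O(len(b)) space.
    """
    # prev[j] = distance between the empty prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        # curr[0] = distance between a[:i] and the empty string.
        curr = [i]
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ca == cb else 1)
            deletion = prev[j] + 1       # drop ca from a
            insertion = curr[j - 1] + 1  # insert cb into a
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]
```

For two strings of length d, the double loop fills a table with roughly d squared entries, which is the quadratic running time referred to above.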