Understanding Indexing Efficiency for Approximate Nearest Neighbor Search in High-dimensional Vector Databases
Author(s)
Qin, Yuting
Advisor
Chen, Xuhao
Arvind
Abstract
Deep learning has transformed almost all types of data (e.g., images, videos, documents) into high-dimensional vectors, which in turn has made vector databases the data engines of various applications. As a result, queries on vector databases have become the cornerstone of many important online services, including search, eCommerce, and recommendation systems. In a vector database, the major operation is to search for the 𝑘 closest vectors to a given query vector, known as 𝑘-Nearest-Neighbor (𝑘-NN) search. Due to the massive data scale in practice, Approximate Nearest-Neighbor (ANN) search, which builds a search index offline to accelerate search online, is often used instead. One of the most promising ANN indexing approaches is the graph-based approach, which first constructs a proximity graph on the dataset, connecting pairs of vectors that are close to each other, and then traverses the proximity graph for each query to find the vectors closest to the query vector. The search performance, in terms of the scope of traversal that leads to convergence, is highly dependent on the quality of the graph. There is a large body of prior work on improving graph quality with various heuristics. However, no analysis or modeling work has been done to quantitatively evaluate the heuristics and their impact on performance. Hence, it is unclear how to pick or combine the right heuristics to build a high-quality graph. This thesis aims to establish this connection and fill the gap. The key challenge in quantifying the heuristics is the complex tradeoff between search accuracy and search speed, which makes it almost impossible to establish an analytical model. To this end, we propose to leverage machine learning as the modeling tool. We first build a unified framework to characterize various graph-building heuristics by decoupling the graph construction and search phases.
We then extract graph attributes (e.g., diameter) and collect ground-truth performance data (e.g., search speed and accuracy) within our framework, across multiple datasets and graph configurations. Based on the collected data, we train a linear regression model to predict the search performance. We show experimental results on our model's performance and discuss the implications for selecting heuristics that improve the quality of the indexing graphs.
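As an illustration of the graph-based approach the abstract describes, the following is a minimal Python sketch, not the thesis's framework. It assumes a brute-force 𝑘-NN proximity graph in place of the construction heuristics the thesis studies, and a simple best-first traversal with a fixed-size result list. All function names and parameters (`build_knn_graph`, `graph_search`, `degree`, `beam`, `entry`) are hypothetical. The number of visited nodes stands in for the "scope of traversal", and recall against an exact scan stands in for search accuracy.

```python
import heapq
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_knn_graph(vectors, degree):
    """Brute-force proximity graph (an assumption, not a heuristic from the thesis):
    link each vector to its `degree` nearest other vectors."""
    graph = []
    for i, v in enumerate(vectors):
        others = [j for j in range(len(vectors)) if j != i]
        others.sort(key=lambda j: dist(v, vectors[j]))
        graph.append(others[:degree])
    return graph


def graph_search(vectors, graph, query, k, entry=0, beam=16):
    """Best-first traversal from `entry`.
    Returns (ids of k approximate neighbors, number of nodes visited)."""
    beam = max(beam, k)
    d0 = dist(query, vectors[entry])
    visited = {entry}
    frontier = [(d0, entry)]   # min-heap of nodes still to expand
    best = [(-d0, entry)]      # max-heap (negated distances) of closest nodes so far
    while frontier:
        d, node = heapq.heappop(frontier)
        if len(best) >= beam and d > -best[0][0]:
            break  # nearest unexpanded node is farther than the worst kept result
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist(query, vectors[nb])
            if len(best) < beam or dn < -best[0][0]:
                heapq.heappush(frontier, (dn, nb))
                heapq.heappush(best, (-dn, nb))
                if len(best) > beam:
                    heapq.heappop(best)  # drop the current farthest result
    ranked = sorted((-neg_d, n) for neg_d, n in best)[:k]
    return [n for _, n in ranked], len(visited)


def exact_knn(vectors, query, k):
    """Exact k-NN by scanning every vector, used as ground truth."""
    order = sorted(range(len(vectors)), key=lambda j: dist(query, vectors[j]))
    return order[:k]


def recall_at_k(approx, exact):
    """Fraction of the exact neighbors that the approximate search returned."""
    return len(set(approx) & set(exact)) / len(exact)


if __name__ == "__main__":
    rng = random.Random(0)  # synthetic data, for illustration only
    data = [[rng.random() for _ in range(8)] for _ in range(300)]
    graph = build_knn_graph(data, degree=8)
    query = [rng.random() for _ in range(8)]
    approx, visited = graph_search(data, graph, query, k=5)
    print("recall@5:", recall_at_k(approx, exact_knn(data, query, 5)))
    print("nodes visited:", visited, "of", len(data))
```

Varying `degree` or `beam` in this sketch changes both the recall and the number of visited nodes, which is a toy version of the accuracy versus speed tradeoff that the abstract says makes an analytical model hard to establish.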
Date issued
2024-05
Department
Massachusetts Institute of Technology. Department of Brain and Cognitive Sciences
Publisher
Massachusetts Institute of Technology