Learned String Index Structures for In-Memory Databases
Author(s)
Spector, Benjamin
Advisor
Kraska, Tim
Abstract
Within the field of machine learning for systems, learning-based methods have brought new perspective to indexing by reframing it as a cumulative distribution function (CDF) modeling problem. The burgeoning field, despite its nascence, has brought with it many opportunities and efficiencies. However, most work in this area has focused on efficiently indexing numerical keys, as the additional challenges posed by indexing strings have prevented the effective application of these techniques to string domains. We hypothesize that the machine learning approaches which have, in recent years, made significant strides in scalar indexing applications can also be effectively adapted to string applications. First, we introduce the RadixStringSpline (RSS) learned index structure for efficiently indexing strings. RSS is a tree of learned radix splines each indexing a fixed number of bytes. RSS achieves better performance than other structures by first using the minimal string prefix to sufficiently distinguish the data, followed by a contextual learned model to predict its location. Additionally, the bounded-error nature of RSS accelerates the last mile search and also enables a memory-efficient hash-table lookup accelerator. Second, we benchmark RSS against existing algorithms on several real-world string datasets and study its performance in-depth. RSS approaches or exceeds the performance of traditional string indexes while using up to 300× less memory, suggesting this line of research may be promising for future memory-intensive database applications.
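The abstract's two central ideas, indexing as CDF modeling and a bounded-error "last mile" search, can be illustrated with a minimal sketch. This is not the RadixStringSpline structure from the thesis: it assumes sorted, unique numeric keys and replaces the learned radix spline with a single straight line fitted through the first and last key. The names build_linear_index and lookup are hypothetical and chosen only for this illustration.

```python
import bisect


def build_linear_index(keys):
    """Fit position ~ slope * key + intercept through the two endpoints
    of a sorted, unique key list and record the worst prediction error.

    Assumption: one linear model stands in for a learned CDF model.
    """
    n = len(keys)
    first, last = keys[0], keys[-1]
    slope = (n - 1) / (last - first) if last != first else 0.0
    intercept = -slope * first
    max_err = 0
    for pos, key in enumerate(keys):
        pred = int(slope * key + intercept)
        max_err = max(max_err, abs(pred - pos))
    return slope, intercept, max_err


def lookup(keys, model, key):
    """Predict a position, then binary-search only inside the error bound.

    Returns the index of key in keys, or -1 if the key is absent.
    """
    slope, intercept, max_err = model
    pred = int(slope * key + intercept)
    # Any stored key lies within max_err positions of its prediction,
    # so the "last mile" search is confined to this window.
    lo = max(0, pred - max_err)
    hi = max(lo, min(len(keys), pred + max_err + 1))
    i = bisect.bisect_left(keys, key, lo, hi)
    return i if i < len(keys) and keys[i] == key else -1
```

Because the maximum error is measured at build time, every stored key is guaranteed to fall inside the search window, and the cost of the final search depends on the error bound rather than on the size of the data. Applying the same idea to strings is the harder problem the thesis addresses, since string keys do not map directly onto a single numeric axis.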
Date issued
2022-05
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology