The case for a Learned sorting algorithm
Author(s)
Vaidya, Kapil Eknath.
Other Contributors
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.
Advisor
Tim Kraska.
Abstract
Sorting is one of the most fundamental algorithms in Computer Science and a common operation in databases, not just for sorting query results but also as part of joins (i.e., sort-merge-join) or indexing. In this work, we introduce a new type of distribution sort that leverages a learned model of the empirical CDF of the data. Our algorithm uses a model to efficiently approximate the scaled empirical CDF for each record key and map it to the corresponding position in the output array. We then apply a deterministic sorting algorithm that works well on nearly-sorted arrays (e.g., Insertion Sort) to establish a totally sorted order. We compared this algorithm against common sorting approaches and measured its performance for up to 1 billion normally-distributed double-precision keys. Over several real-world datasets, the results show that our approach yields up to a 3.38x performance improvement over C++ STL sort, which is an optimized Quicksort hybrid; 1.49x over sequential Radix Sort; 1.31x over IS⁴o [2]; and 5.54x over a C++ implementation of Timsort, which is the default sorting function for Java and Python. While these results are very encouraging, duplicates have a particularly negative impact on the sorting performance of Learned Sort, as we show in our experiments.
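The two phases described in the abstract (predict a position from an approximate CDF, then repair the nearly-sorted result) can be illustrated with a minimal Python sketch. This is not the thesis implementation: the "model" here is assumed to be a simple step CDF built from a sorted random sample, standing in for the learned model, and the names `fit_cdf_model`, `learned_sort_sketch`, and `sample_size` are hypothetical. Colliding predictions are kept in per-slot lists, which is an illustrative choice and not necessarily how the thesis handles them.

```python
import bisect
import random


def fit_cdf_model(keys, sample_size=1000, seed=0):
    """Return a monotone approximation of the empirical CDF.

    Assumption: a sorted random sample stands in for the learned model.
    """
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))

    def cdf(key):
        # Fraction of sampled keys that are <= key, in [0, 1].
        return bisect.bisect_right(sample, key) / len(sample)

    return cdf


def insertion_sort(a):
    """In-place insertion sort; cheap when the input is nearly sorted."""
    for i in range(1, len(a)):
        x = a[i]
        j = i - 1
        while j >= 0 and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x
    return a


def learned_sort_sketch(keys, sample_size=1000):
    keys = list(keys)
    n = len(keys)
    if n < 2:
        return keys
    cdf = fit_cdf_model(keys, sample_size)
    # Scale the CDF to [0, n) and treat it as a predicted output position.
    # Keys whose predictions collide share a slot instead of overwriting.
    slots = [[] for _ in range(n)]
    for k in keys:
        pos = min(n - 1, int(cdf(k) * n))
        slots[pos].append(k)
    # Slots are ordered relative to each other because the CDF is monotone;
    # only keys within a slot may still be out of order.
    nearly_sorted = [k for slot in slots for k in slot]
    return insertion_sort(nearly_sorted)
```

In this sketch, all copies of a duplicated key receive the same predicted position and pile into one slot, which gives some intuition for why duplicates can be a difficult case for a CDF-based placement step.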
Description
Thesis: S.M., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, February, 2021. Cataloged from the official PDF version of thesis. Includes bibliographical references (pages 55-59).
Date issued
2021
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology
Keywords
Electrical Engineering and Computer Science.