Streaming Similarity Search over One Billion Tweets Using Parallel Locality-Sensitive Hashing

Sundaram, Narayanan; Turmukhametova, Aizana Z.; Satish, Nadathur; Mostak, Todd; Indyk, Piotr; Madden, Samuel R.; Dubey, Pradeep

Terms of use

Creative Commons Attribution-Noncommercial-Share Alike http://creativecommons.org/licenses/by-nc-sa/4.0/

Abstract

Finding nearest neighbors has become an important operation on databases, with applications to text search, multimedia indexing, and many other areas. One popular algorithm for similarity search, especially for high-dimensional data (where spatial indexes like kd-trees do not perform well), is Locality-Sensitive Hashing (LSH), an approximation algorithm for finding similar objects. In this paper, we describe a new variant of LSH, called Parallel LSH (PLSH), designed to be extremely efficient, to scale out across multiple nodes and multiple cores, and to support high-throughput streaming of new data. Our approach employs several novel ideas, including: a cache-conscious hash table layout that uses a 2-level merge algorithm for hash table construction; an efficient algorithm for duplicate elimination during hash table querying; an insert-optimized hash table structure and an efficient data expiration algorithm for streaming data; and a performance model that accurately estimates the performance of the algorithm and can be used to optimize parameter settings. We show that on a workload where we perform similarity search on a dataset of more than 1 billion tweets, with hundreds of millions of new tweets per day, we can achieve query times of 1 to 2.5 ms. We show that this is an order of magnitude faster than existing indexing schemes, such as inverted indexes. To the best of our knowledge, this is the fastest implementation of LSH, with table construction times up to 3.7× faster and query times that are 8.3× faster than a basic implementation.
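
For readers unfamiliar with the underlying technique, the following is a minimal sketch of basic LSH, not the paper's PLSH implementation. It assumes random-hyperplane hashing for cosine similarity, which the abstract does not specify, and the names `SimpleLSH`, `num_tables`, and `bits_per_table` are hypothetical choices made for illustration. It shows only the general idea of hashing each item into several tables and using a set to eliminate duplicate candidates at query time.

```python
import random
from collections import defaultdict


def make_hyperplanes(num_tables, bits_per_table, dim, seed=0):
    """Draw one group of random Gaussian hyperplanes per hash table."""
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 1.0) for _ in range(dim)]
             for _ in range(bits_per_table)]
            for _ in range(num_tables)]


def signature(planes, vec):
    """One bit per hyperplane: which side of the plane the vector lies on."""
    return tuple(sum(p * v for p, v in zip(plane, vec)) >= 0.0
                 for plane in planes)


class SimpleLSH:
    """Illustrative multi-table LSH index (hypothetical, not PLSH)."""

    def __init__(self, dim, num_tables=8, bits_per_table=6, seed=0):
        self.planes = make_hyperplanes(num_tables, bits_per_table, dim, seed)
        self.tables = [defaultdict(list) for _ in range(num_tables)]
        self.vectors = []

    def insert(self, vec):
        """Store the vector and add its index to one bucket in every table."""
        idx = len(self.vectors)
        self.vectors.append(vec)
        for planes, table in zip(self.planes, self.tables):
            table[signature(planes, vec)].append(idx)
        return idx

    def candidates(self, vec):
        """Union of matching buckets; the set removes duplicate candidates."""
        seen = set()
        for planes, table in zip(self.planes, self.tables):
            seen.update(table.get(signature(planes, vec), ()))
        return seen
```

In a sketch like this, a query would typically compute the exact distance to each returned candidate and keep only those within a chosen threshold. Using more tables raises the chance of finding true neighbors, while more bits per table makes each bucket smaller and more selective.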

Date issued

2013-09

URI

http://hdl.handle.net/1721.1/86923

Department

Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory; Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science

Journal

Proceedings of the VLDB Endowment

Publisher

Association for Computing Machinery (ACM)

Citation

Narayanan Sundaram, Aizana Turmukhametova, Nadathur Satish, Todd Mostak, Piotr Indyk, Samuel Madden, and Pradeep Dubey. 2013. Streaming similarity search over one billion tweets using parallel locality-sensitive hashing. Proc. VLDB Endow. 6, 14 (September 2013), 1930-1941.

Version: Author's final manuscript

ISSN

2150-8097

Collections

MIT Open Access Articles