Streaming Similarity Search over One Billion Tweets Using Parallel Locality-Sensitive Hashing

Sundaram, Narayanan; Turmukhametova, Aizana Z.; Satish, Nadathur; Mostak, Todd; Indyk, Piotr; Madden, Samuel R.; Dubey, Pradeep

dc.contributor.author	Sundaram, Narayanan
dc.contributor.author	Turmukhametova, Aizana Z.
dc.contributor.author	Satish, Nadathur
dc.contributor.author	Mostak, Todd
dc.contributor.author	Indyk, Piotr
dc.contributor.author	Madden, Samuel R.
dc.contributor.author	Dubey, Pradeep
dc.date.accessioned	2014-05-09T18:21:09Z
dc.date.available	2014-05-09T18:21:09Z
dc.date.issued	2013-09
dc.date.submitted	2013-08
dc.identifier.issn	2150-8097
dc.identifier.uri	http://hdl.handle.net/1721.1/86923
dc.description.abstract	Finding nearest neighbors has become an important operation on databases, with applications to text search, multimedia indexing, and many other areas. One popular algorithm for similarity search, especially for high-dimensional data (where spatial indexes like kd-trees do not perform well), is Locality Sensitive Hashing (LSH), an approximation algorithm for finding similar objects. In this paper, we describe a new variant of LSH, called Parallel LSH (PLSH), designed to be extremely efficient, to scale out across multiple nodes and multiple cores, and to support high-throughput streaming of new data. Our approach employs several novel ideas, including: cache-conscious hash table layout, using a 2-level merge algorithm for hash table construction; an efficient algorithm for duplicate elimination during hash table querying; an insert-optimized hash table structure and efficient data expiration algorithm for streaming data; and a performance model that accurately estimates performance of the algorithm and can be used to optimize parameter settings. We show that on a workload where we perform similarity search on a dataset of more than 1 billion tweets, with hundreds of millions of new tweets per day, we can achieve query times of 1-2.5 ms. We show that this is an order of magnitude faster than existing indexing schemes, such as inverted indexes. To the best of our knowledge, this is the fastest implementation of LSH, with table construction times up to 3.7× faster and query times that are 8.3× faster than a basic implementation.	en_US
dc.description.sponsorship	Intel Corporation (Intel Science and Technology Center in Big Data)	en_US
dc.language.iso	en_US
dc.publisher	Association for Computing Machinery (ACM)	en_US
dc.relation.isversionof	http://dl.acm.org/citation.cfm?id=2556574	en_US
dc.rights	Creative Commons Attribution-Noncommercial-Share Alike	en_US
dc.rights.uri	http://creativecommons.org/licenses/by-nc-sa/4.0/	en_US
dc.source	Other repository	en_US
dc.title	Streaming Similarity Search over One Billion Tweets Using Parallel Locality-Sensitive Hashing	en_US
dc.type	Article	en_US
dc.identifier.citation	Narayanan Sundaram, Aizana Turmukhametova, Nadathur Satish, Todd Mostak, Piotr Indyk, Samuel Madden, and Pradeep Dubey. 2013. Streaming similarity search over one billion tweets using parallel locality-sensitive hashing. Proc. VLDB Endow. 6, 14 (September 2013), 1930-1941.	en_US
dc.contributor.department	Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory	en_US
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science	en_US
dc.contributor.mitauthor	Turmukhametova, Aizana Z.	en_US
dc.contributor.mitauthor	Mostak, Todd	en_US
dc.contributor.mitauthor	Indyk, Piotr	en_US
dc.contributor.mitauthor	Madden, Samuel R.	en_US
dc.relation.journal	Proceedings of the VLDB Endowment	en_US
dc.eprint.version	Author's final manuscript	en_US
dc.type.uri	http://purl.org/eprint/type/ConferencePaper	en_US
eprint.status	http://purl.org/eprint/status/NonPeerReviewed	en_US
dspace.orderedauthors	Sundaram, Narayanan; Turmukhametova, Aizana Z.; Satish, Nadathur; Mostak, Todd; Indyk, Piotr; Madden, Samuel R.; Dubey, Pradeep
dc.identifier.orcid	https://orcid.org/0000-0002-7470-3265
dc.identifier.orcid	https://orcid.org/0000-0002-7983-9524
mit.license	OPEN_ACCESS_POLICY	en_US
mit.metadata.status	Complete

Files in this item

Name: Indyk_Streaming similarity.pdf
Size: 1.475 MB
Format: PDF


This item appears in the following Collection(s)

MIT Open Access Articles
