
dc.contributor.author: Sundaram, Narayanan
dc.contributor.author: Turmukhametova, Aizana Z.
dc.contributor.author: Satish, Nadathur
dc.contributor.author: Mostak, Todd
dc.contributor.author: Indyk, Piotr
dc.contributor.author: Madden, Samuel R.
dc.contributor.author: Dubey, Pradeep
dc.date.accessioned: 2014-05-09T18:21:09Z
dc.date.available: 2014-05-09T18:21:09Z
dc.date.issued: 2013-09
dc.date.submitted: 2013-08
dc.identifier.issn: 2150-8097
dc.identifier.uri: http://hdl.handle.net/1721.1/86923
dc.description.abstract: Finding nearest neighbors has become an important operation on databases, with applications to text search, multimedia indexing, and many other areas. One popular algorithm for similarity search, especially for high dimensional data (where spatial indexes like kd-trees do not perform well) is Locality Sensitive Hashing (LSH), an approximation algorithm for finding similar objects. In this paper, we describe a new variant of LSH, called Parallel LSH (PLSH) designed to be extremely efficient, capable of scaling out on multiple nodes and multiple cores, and which supports high-throughput streaming of new data. Our approach employs several novel ideas, including: cache-conscious hash table layout, using a 2-level merge algorithm for hash table construction; an efficient algorithm for duplicate elimination during hash-table querying; an insert-optimized hash table structure and efficient data expiration algorithm for streaming data; and a performance model that accurately estimates performance of the algorithm and can be used to optimize parameter settings. We show that on a workload where we perform similarity search on a dataset of > 1 Billion tweets, with hundreds of millions of new tweets per day, we can achieve query times of 1-2.5 ms. We show that this is an order of magnitude faster than existing indexing schemes, such as inverted indexes. To the best of our knowledge, this is the fastest implementation of LSH, with table construction times up to 3.7× faster and query times that are 8.3× faster than a basic implementation. [en_US]
dc.description.sponsorship: Intel Corporation (Intel Science and Technology Center in Big Data) [en_US]
dc.language.iso: en_US
dc.publisher: Association for Computing Machinery (ACM) [en_US]
dc.relation.isversionof: http://dl.acm.org/citation.cfm?id=2556574 [en_US]
dc.rights: Creative Commons Attribution-Noncommercial-Share Alike [en_US]
dc.rights.uri: http://creativecommons.org/licenses/by-nc-sa/4.0/ [en_US]
dc.source: Other repository [en_US]
dc.title: Streaming Similarity Search over One Billion Tweets Using Parallel Locality-Sensitive Hashing [en_US]
dc.type: Article [en_US]
dc.identifier.citation: Narayanan Sundaram, Aizana Turmukhametova, Nadathur Satish, Todd Mostak, Piotr Indyk, Samuel Madden, and Pradeep Dubey. 2013. Streaming similarity search over one billion tweets using parallel locality-sensitive hashing. Proc. VLDB Endow. 6, 14 (September 2013), 1930-1941. [en_US]
dc.contributor.department: Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory [en_US]
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science [en_US]
dc.contributor.mitauthor: Turmukhametova, Aizana Z. [en_US]
dc.contributor.mitauthor: Mostak, Todd [en_US]
dc.contributor.mitauthor: Indyk, Piotr [en_US]
dc.contributor.mitauthor: Madden, Samuel R. [en_US]
dc.relation.journal: Proceedings of the VLDB Endowment [en_US]
dc.eprint.version: Author's final manuscript [en_US]
dc.type.uri: http://purl.org/eprint/type/ConferencePaper [en_US]
eprint.status: http://purl.org/eprint/status/NonPeerReviewed [en_US]
dspace.orderedauthors: Sundaram, Narayanan; Turmukhametova, Aizana Z.; Satish, Nadathur; Mostak, Todd; Indyk, Piotr; Madden, Samuel R.; Dubey, Pradeep
dc.identifier.orcid: https://orcid.org/0000-0002-7470-3265
dc.identifier.orcid: https://orcid.org/0000-0002-7983-9524
mit.license: OPEN_ACCESS_POLICY [en_US]
mit.metadata.status: Complete

