
dc.contributor.author: Metwally, Ahmed
dc.contributor.author: Shum, Michael
dc.date.accessioned: 2024-07-23T20:09:12Z
dc.date.available: 2024-07-23T20:09:12Z
dc.date.issued: 2024-06-09
dc.identifier.isbn: 979-8-4007-0422-2
dc.identifier.uri: https://hdl.handle.net/1721.1/155773
dc.description.abstract: Identifying all pairs of records from two datasets whose similarity exceeds a given threshold is crucial for data cleaning and clustering. Our work on similarity-joins is motivated by detecting fraud and abuse. We focus on similarity-joins of sparse features, where records represent sparse sets, multisets, or vectors. Most state-of-the-art techniques are distributed versions of sequential algorithms. This is the reason they fail to scale when the alphabet is large, the records are large, or some elements are shared by numerous records. In this paper, we propose FastScalableSparseJoiner (FSSJ) that introduces quasi-prefix filtering, a novel flavor of prefix filtering [7] that exploits the skew in element popularity to avoid processing the most-popular elements without broadcasting the sorted elements to all executors. FastScalableSparseJoiner effectively prunes candidate pairs, and efficiently distributes the computation of partial results for load-balancing across executors. FSSJ can be adopted in any shared-nothing architecture. Based on our evaluation on synthetic and real datasets using a Spark-based implementation, FSSJ is very competitive on small datasets, and is the only algorithm that can join industry-scale datasets even with limited resources. [en_US]
dc.publisher: ACM|Companion of the 2024 International Conference on Management of Data [en_US]
dc.relation.isversionof: 10.1145/3626246.3653370 [en_US]
dc.rights: Article is made available in accordance with the publisher's policy and may be subject to US copyright law. Please refer to the publisher's site for terms of use. [en_US]
dc.source: Association for Computing Machinery [en_US]
dc.title: Similarity Joins of Sparse Features [en_US]
dc.type: Article [en_US]
dc.identifier.citation: Metwally, Ahmed and Shum, Michael. 2024. "Similarity Joins of Sparse Features."
dc.contributor.department: Massachusetts Institute of Technology. Department of Mechanical Engineering
dc.identifier.mitlicense: PUBLISHER_POLICY
dc.eprint.version: Final published version [en_US]
dc.type.uri: http://purl.org/eprint/type/ConferencePaper [en_US]
eprint.status: http://purl.org/eprint/status/NonPeerReviewed [en_US]
dc.date.updated: 2024-07-01T07:54:14Z
dc.language.rfc3066: en
dc.rights.holder: The author(s)
dspace.date.submission: 2024-07-01T07:54:15Z
mit.license: PUBLISHER_POLICY
mit.metadata.status: Authority Work and Publication Information Needed [en_US]

