Similarity Joins of Sparse Features
Author(s)
Metwally, Ahmed; Shum, Michael
Publisher Policy
Article is made available in accordance with the publisher's policy and may be subject to US copyright law. Please refer to the publisher's site for terms of use.
Abstract
Identifying all pairs of records from two datasets whose similarity exceeds a given threshold is crucial for data cleaning and clustering. Our work on similarity-joins is motivated by detecting fraud and abuse. We focus on similarity-joins of sparse features, where records represent sparse sets, multisets, or vectors. Most state-of-the-art techniques are distributed versions of sequential algorithms, which is why they fail to scale when the alphabet is large, the records are large, or some elements are shared by numerous records. In this paper, we propose FastScalableSparseJoiner (FSSJ), which introduces quasi-prefix filtering, a novel flavor of prefix filtering [7] that exploits the skew in element popularity to avoid processing the most popular elements without broadcasting the sorted elements to all executors. FSSJ effectively prunes candidate pairs and efficiently distributes the computation of partial results to balance the load across executors. FSSJ can be adopted in any shared-nothing architecture. In our evaluation on synthetic and real datasets using a Spark-based implementation, FSSJ is very competitive on small datasets and is the only algorithm that can join industry-scale datasets even with limited resources.
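For readers unfamiliar with the prefix filtering that the abstract builds on, the following is a minimal single-machine sketch of the classic technique for Jaccard similarity over sets. It is not the quasi-prefix filtering of FSSJ and not the authors' code. The function names, the rarest-first element order, the use of `>=` for the threshold test, and the toy records are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def prefix(record, rank, threshold):
    """Return the record's prefix under the global element order.

    For a record of size n, any partner with Jaccard >= threshold must
    share at least ceil(threshold * n) elements, so the first
    n - ceil(threshold * n) + 1 elements are enough to detect it.
    """
    ordered = sorted(record, key=lambda e: rank[e])
    # The small epsilon guards against floating-point overshoot in ceil.
    keep = len(ordered) - math.ceil(threshold * len(ordered) - 1e-9) + 1
    return ordered[:keep]


def prefix_filter_join(left, right, threshold):
    """Join two dicts of id -> set; threshold is assumed to be in (0, 1]."""
    # Assumed global order: rarest elements first, so prefixes avoid
    # popular elements; ties are broken by repr for determinism.
    freq = Counter(e for rec in list(left.values()) + list(right.values())
                   for e in rec)
    ordered_elems = sorted(freq, key=lambda e: (freq[e], repr(e)))
    rank = {e: i for i, e in enumerate(ordered_elems)}

    # Index only the prefix elements of the right-hand records.
    index = defaultdict(set)
    for rid, rec in right.items():
        for e in prefix(rec, rank, threshold):
            index[e].add(rid)

    results = []
    for lid, rec in left.items():
        # Candidates are right-hand records sharing a prefix element.
        candidates = set()
        for e in prefix(rec, rank, threshold):
            candidates |= index.get(e, set())
        # Verify each surviving candidate with the exact similarity.
        for rid in candidates:
            sim = jaccard(rec, right[rid])
            if sim >= threshold:
                results.append((lid, rid, sim))
    return sorted(results)


if __name__ == "__main__":
    left = {"a": {1, 2, 3, 4}, "b": {7, 8}}
    right = {"x": {1, 2, 3, 5}, "y": {7, 8, 9}, "z": {10, 11}}
    for lid, rid, sim in prefix_filter_join(left, right, 0.6):
        print(lid, rid, round(sim, 3))
```

In this sketch the element frequencies and the resulting order are computed in one process. The abstract notes that FSSJ avoids broadcasting the sorted elements to all executors, which is precisely the step this single-machine version glosses over.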
Date issued
2024-06-09
Department
Massachusetts Institute of Technology. Department of Mechanical Engineering
Publisher
ACM|Companion of the 2024 International Conference on Management of Data
Citation
Metwally, Ahmed and Shum, Michael. 2024. "Similarity Joins of Sparse Features."
Version: Final published version
ISBN
979-8-4007-0422-2