Similarity Joins of Sparse Features

Metwally, Ahmed; Shum, Michael

Download: 3626246.3653370.pdf (2.013Mb)

Publisher Policy

Terms of use

Article is made available in accordance with the publisher's policy and may be subject to US copyright law. Please refer to the publisher's site for terms of use.

Abstract

Identifying all pairs of records from two datasets whose similarity exceeds a given threshold is crucial for data cleaning and clustering. Our work on similarity joins is motivated by detecting fraud and abuse. We focus on similarity joins of sparse features, where records represent sparse sets, multisets, or vectors. Most state-of-the-art techniques are distributed versions of sequential algorithms, which is why they fail to scale when the alphabet is large, the records are large, or some elements are shared by numerous records. In this paper, we propose FastScalableSparseJoiner (FSSJ), which introduces quasi-prefix filtering, a novel flavor of prefix filtering [7] that exploits the skew in element popularity to avoid processing the most popular elements without broadcasting the sorted elements to all executors. FSSJ effectively prunes candidate pairs and efficiently distributes the computation of partial results to balance the load across executors. FSSJ can be adopted in any shared-nothing architecture. In our evaluation on synthetic and real datasets using a Spark-based implementation, FSSJ is very competitive on small datasets and is the only algorithm that can join industry-scale datasets even with limited resources.
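
For readers unfamiliar with the baseline the abstract builds on, the following is a minimal single-machine sketch of classic prefix filtering for a set similarity join. It is not the quasi-prefix filtering of FSSJ and not the authors' implementation. The choice of Jaccard similarity, the inclusive threshold, the rarest-first element order, and all function names (`jaccard`, `prefix`, `prefix_filter_join`) are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import ceil


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def prefix(record, rank, threshold):
    """Rarest-first prefix that any sufficiently similar partner must intersect."""
    ordered = sorted(record, key=lambda e: rank[e])
    # Small epsilon guards against float error, e.g. 0.7 * 10 > 7.
    required_overlap = ceil(threshold * len(ordered) - 1e-9)
    return ordered[: len(ordered) - required_overlap + 1]


def prefix_filter_join(left, right, threshold):
    """Return {(i, j): sim} for left[i], right[j] with Jaccard >= threshold."""
    # Global element order: rarest first, ties broken by element value.
    freq = Counter(e for rec in list(left) + list(right) for e in rec)
    ordered_elems = sorted(freq, key=lambda e: (freq[e], e))
    rank = {e: r for r, e in enumerate(ordered_elems)}

    # Index right-hand records by their prefix elements only, so the
    # most popular elements rarely appear in the index.
    index = defaultdict(set)
    for j, rec in enumerate(right):
        for e in prefix(rec, rank, threshold):
            index[e].add(j)

    results = {}
    for i, rec in enumerate(left):
        # Candidates share at least one prefix element with left[i].
        candidates = set()
        for e in prefix(rec, rank, threshold):
            candidates |= index.get(e, set())
        # Verify each surviving candidate with the exact similarity.
        for j in candidates:
            sim = jaccard(rec, right[j])
            if sim >= threshold:
                results[(i, j)] = sim
    return results


if __name__ == "__main__":
    left = [{"a", "b", "c", "d"}, {"x", "y"}]
    right = [{"a", "b", "c", "e"}, {"x", "y", "z"}, {"p", "q"}]
    print(sorted(prefix_filter_join(left, right, 0.5)))
```

Note that this sketch computes one global element order over both datasets. In a distributed setting that order would have to be shared with every worker, which is the broadcasting step the abstract says quasi-prefix filtering avoids.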

Date issued

2024-06-09

URI

https://hdl.handle.net/1721.1/155773

Department

Massachusetts Institute of Technology. Department of Mechanical Engineering

Publisher

ACM|Companion of the 2024 International Conference on Management of Data

Citation

Metwally, Ahmed and Shum, Michael. 2024. "Similarity Joins of Sparse Features."

Version: Final published version

ISBN

979-8-4007-0422-2

Collections

MIT Open Access Articles