
Similarity Joins of Sparse Features

Author(s)
Metwally, Ahmed; Shum, Michael
Download: 3626246.3653370.pdf (2.013Mb)
Publisher Policy

Article is made available in accordance with the publisher's policy and may be subject to US copyright law. Please refer to the publisher's site for terms of use.

Abstract
Identifying all pairs of records from two datasets whose similarity exceeds a given threshold is crucial for data cleaning and clustering. Our work on similarity joins is motivated by detecting fraud and abuse. We focus on similarity joins of sparse features, where records represent sparse sets, multisets, or vectors. Most state-of-the-art techniques are distributed versions of sequential algorithms, which is why they fail to scale when the alphabet is large, the records are large, or some elements are shared by numerous records. In this paper, we propose FastScalableSparseJoiner (FSSJ), which introduces quasi-prefix filtering, a novel flavor of prefix filtering [7] that exploits the skew in element popularity to avoid processing the most popular elements without broadcasting the sorted elements to all executors. FSSJ effectively prunes candidate pairs and efficiently distributes the computation of partial results to balance load across executors. FSSJ can be adopted in any shared-nothing architecture. In our evaluation on synthetic and real datasets using a Spark-based implementation, FSSJ is very competitive on small datasets and is the only algorithm that can join industry-scale datasets even with limited resources.
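
For readers unfamiliar with the underlying idea, the following is a minimal single-machine sketch of classic prefix filtering for a Jaccard similarity join between two collections of sets. It is not the quasi-prefix filtering or the distributed FSSJ algorithm from the paper, whose details are not given in this abstract. The function names (`jaccard`, `prefix_filter_join`), the choice of Jaccard similarity, the rarest-first element order, and the use of "at least the threshold" as the match condition are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def prefix_filter_join(left, right, threshold):
    """Return (i, j, sim) for every pair with jaccard(left[i], right[j]) >= threshold.

    Illustrative sketch of classic prefix filtering, not the paper's method.
    """
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must be in (0, 1]")

    # Global element order: rarest elements first, so prefixes avoid popular
    # elements. Ties are broken by repr() to keep the order deterministic.
    freq = Counter(e for rec in list(left) + list(right) for e in set(rec))

    def order(e):
        return (freq[e], repr(e))

    def prefix(rec):
        tokens = sorted(set(rec), key=order)
        # A matching pair must overlap in at least ceil(t * |rec|) elements,
        # so the first |rec| - ceil(t * |rec|) + 1 elements must contain a
        # shared one. The small epsilon guards against float round-up.
        need = math.ceil(threshold * len(tokens) - 1e-9)
        return tokens[: len(tokens) - need + 1]

    # Inverted index over the prefixes of the right-hand records only.
    index = defaultdict(list)
    for j, rec in enumerate(right):
        for e in prefix(rec):
            index[e].append(j)

    results = []
    for i, rec in enumerate(left):
        # Candidates are right-hand records sharing at least one prefix element.
        candidates = set()
        for e in prefix(rec):
            candidates.update(index.get(e, ()))
        # Verify each candidate with the exact similarity.
        for j in sorted(candidates):
            sim = jaccard(set(rec), set(right[j]))
            if sim >= threshold:
                results.append((i, j, sim))
    return results


left = [{"a", "b", "c", "d"}, {"x", "y"}]
right = [{"a", "b", "c", "e"}, {"x", "y"}, {"p", "q"}]
pairs = [(i, j) for i, j, _ in prefix_filter_join(left, right, 0.6)]
print(pairs)
```

In this sketch only records that share an element within their short prefixes are ever compared, which is how prefix filtering prunes candidate pairs. Ordering elements rarest-first keeps popular elements out of the prefixes where possible; the abstract indicates that FSSJ obtains a comparable benefit in a distributed setting without broadcasting the sorted elements to all executors.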
Date issued
2024-06-09
URI
https://hdl.handle.net/1721.1/155773
Department
Massachusetts Institute of Technology. Department of Mechanical Engineering
Publisher
ACM | Companion of the 2024 International Conference on Management of Data
Citation
Metwally, Ahmed and Shum, Michael. 2024. "Similarity Joins of Sparse Features."
Version: Final published version
ISBN
979-8-4007-0422-2

Collections
  • MIT Open Access Articles

Content created by the MIT Libraries, CC BY-NC unless otherwise noted. Notify us about copyright concerns.