| dc.contributor.author | Metwally, Ahmed | |
| dc.contributor.author | Shum, Michael | |
| dc.date.accessioned | 2024-07-23T20:09:12Z | |
| dc.date.available | 2024-07-23T20:09:12Z | |
| dc.date.issued | 2024-06-09 | |
| dc.identifier.isbn | 979-8-4007-0422-2 | |
| dc.identifier.uri | https://hdl.handle.net/1721.1/155773 | |
| dc.description.abstract | Identifying all pairs of records from two datasets whose similarity exceeds a given threshold is crucial for data cleaning and clustering. Our work on similarity-joins is motivated by detecting fraud and abuse. We focus on similarity-joins of sparse features, where records represent sparse sets, multisets, or vectors. Most state-of-the-art techniques are distributed versions of sequential algorithms, which is why they fail to scale when the alphabet is large, the records are large, or some elements are shared by numerous records. In this paper, we propose FastScalableSparseJoiner (FSSJ), which introduces quasi-prefix filtering, a novel flavor of prefix filtering [7] that exploits the skew in element popularity to avoid processing the most popular elements without broadcasting the sorted elements to all executors. FSSJ effectively prunes candidate pairs and efficiently distributes the computation of partial results for load-balancing across executors. FSSJ can be adopted in any shared-nothing architecture. Based on our evaluation on synthetic and real datasets using a Spark-based implementation, FSSJ is very competitive on small datasets and is the only algorithm that can join industry-scale datasets even with limited resources. | en_US |
| dc.publisher | ACM\|Companion of the 2024 International Conference on Management of Data | en_US |
| dc.relation.isversionof | 10.1145/3626246.3653370 | en_US |
| dc.rights | Article is made available in accordance with the publisher's policy and may be subject to US copyright law. Please refer to the publisher's site for terms of use. | en_US |
| dc.source | Association for Computing Machinery | en_US |
| dc.title | Similarity Joins of Sparse Features | en_US |
| dc.type | Article | en_US |
| dc.identifier.citation | Metwally, Ahmed and Shum, Michael. 2024. "Similarity Joins of Sparse Features." | |
| dc.contributor.department | Massachusetts Institute of Technology. Department of Mechanical Engineering | |
| dc.identifier.mitlicense | PUBLISHER_POLICY | |
| dc.eprint.version | Final published version | en_US |
| dc.type.uri | http://purl.org/eprint/type/ConferencePaper | en_US |
| eprint.status | http://purl.org/eprint/status/NonPeerReviewed | en_US |
| dc.date.updated | 2024-07-01T07:54:14Z | |
| dc.language.rfc3066 | en | |
| dc.rights.holder | The author(s) | |
| dspace.date.submission | 2024-07-01T07:54:15Z | |
| mit.license | PUBLISHER_POLICY | |
| mit.metadata.status | Authority Work and Publication Information Needed | en_US |