dc.contributor.authorDeng, Dong
dc.contributor.authorKim, Albert
dc.contributor.authorMadden, Samuel R
dc.contributor.authorStonebraker, Michael
dc.date.accessioned2019-06-18T15:41:13Z
dc.date.available2019-06-18T15:41:13Z
dc.date.issued2017
dc.date.submitted2017-04-16
dc.identifier.issn1570-8667
dc.identifier.urihttps://hdl.handle.net/1721.1/121341
dc.description.abstractDetermining if two sets are related - that is, if they have similar values or if one set contains the other - is an important problem with many applications in data cleaning, data integration, and information retrieval. For example, set relatedness can be a useful tool to discover whether columns from two different databases are joinable; if enough of the values in the columns match, it may make sense to join them. A common metric is to measure the relatedness of two sets by treating the elements as vertices of a bipartite graph and calculating the score of the maximum matching pairing between elements. Compared to other metrics which require exact matchings between elements, this metric uses a similarity function to compare elements between the two sets, making it robust to small dissimilarities in elements and more useful for real-world, dirty data. Unfortunately, the metric suffers from expensive computational cost, taking O(n^3) time, where n is the number of elements in the sets, for each set-to-set comparison. Thus for applications that try to search for all pairings of related sets in a brute-force manner, the runtime becomes unacceptably large. To address this challenge, we developed SILKMOTH, a system capable of rapidly discovering related set pairs in collections of sets. Internally, SILKMOTH creates a signature for each set, with the property that any other set which is related must match the signature. SILKMOTH then uses these signatures to prune the search space, so only sets that match the signatures are left as candidates. Finally, SILKMOTH applies the maximum matching metric on remaining candidates to verify which of these candidates are truly related sets. An important property of SILKMOTH is that it is guaranteed to output exactly the same related set pairings as the brute-force method, unlike approximate techniques. 
Thus, a contribution of this paper is the characterization of the space of signatures which enable this property. We show that selecting the optimal signature in this space is NP-complete, and based on insights from the characterization of the space, we propose two novel filters which help to prune the candidates further before verification. In addition, we introduce a simple optimization to the calculation of the maximum matching metric itself based on the triangle inequality. Compared to related approaches, SILKMOTH is much more general, handling a larger space of similarity functions and relatedness metrics, and is an order of magnitude more efficient on real datasets.en_US
dc.language.isoen
dc.publisherVLDB Endowmenten_US
dc.relation.isversionof10.14778/3115404.3115413en_US
dc.rightsCreative Commons Attribution-NonCommercial-NoDerivs Licenseen_US
dc.rights.urihttp://creativecommons.org/licenses/by-nc-nd/4.0/en_US
dc.sourceThe Proceedings of the VLDB Endowmenten_US
dc.titleSilkMoth: an efficient method for finding related sets with maximum matching constraintsen_US
dc.typeArticleen_US
dc.identifier.citationDeng, Dong, Albert Kim, Samuel Madden and Michael Stonebraker. "SilkMoth: an efficient method for finding related sets with maximum matching constraints." Proceedings of the VLDB Endowment, Vol. 10 No. 10 April 2017.en_US
dc.contributor.departmentMassachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratoryen_US
dc.contributor.departmentMassachusetts Institute of Technology. Department of Electrical Engineering and Computer Scienceen_US
dc.relation.journalProceedings of the VLDB Endowmenten_US
dc.eprint.versionFinal published versionen_US
dc.type.urihttp://purl.org/eprint/type/ConferencePaperen_US
eprint.statushttp://purl.org/eprint/status/NonPeerRevieweden_US
dc.date.updated2019-06-18T14:58:41Z
dspace.date.submission2019-06-18T14:58:42Z
mit.journal.volumeVol. 10en_US
mit.journal.issueNo. 10en_US

