dc.contributor.author: Khayyat, Zuhair
dc.contributor.author: Ilyas, Ihab F.
dc.contributor.author: Ouzzani, Mourad
dc.contributor.author: Papotti, Paolo
dc.contributor.author: Quiané-Ruiz, Jorge-Arnulfo
dc.contributor.author: Tang, Nan
dc.contributor.author: Yin, Si
dc.contributor.author: Madden, Samuel R
dc.contributor.author: Jindal, Alekh
dc.date.accessioned: 2017-12-29T18:56:50Z
dc.date.available: 2017-12-29T18:56:50Z
dc.date.issued: 2013-05
dc.identifier.isbn: 978-1-4503-2758-9
dc.identifier.uri: http://hdl.handle.net/1721.1/112981
dc.description.abstract: Data cleansing approaches have usually focused on detecting and fixing errors with little attention to scaling to big datasets. This presents a serious impediment since data cleansing often involves costly computations such as enumerating pairs of tuples, handling inequality joins, and dealing with user-defined functions. In this paper, we present BigDansing, a Big Data Cleansing system to tackle efficiency, scalability, and ease-of-use issues in data cleansing. The system can run on top of most common general purpose data processing platforms, ranging from DBMSs to MapReduce-like frameworks. A user-friendly programming interface allows users to express data quality rules both declaratively and procedurally, with no requirement of being aware of the underlying distributed platform. BigDansing takes these rules into a series of transformations that enable distributed computations and several optimizations, such as shared scans and specialized joins operators. Experimental results on both synthetic and real datasets show that BigDansing outperforms existing baseline systems up to more than two orders of magnitude without sacrificing the quality provided by the repair algorithms. [en_US]
dc.language.iso: en_US
dc.publisher: Association for Computing Machinery [en_US]
dc.relation.isversionof: http://dx.doi.org/10.1145/2723372.2747646 [en_US]
dc.rights: Creative Commons Attribution-Noncommercial-Share Alike [en_US]
dc.rights.uri: http://creativecommons.org/licenses/by-nc-sa/4.0/ [en_US]
dc.source: Other univ. web domain [en_US]
dc.title: BigDansing [en_US]
dc.type: Article [en_US]
dc.identifier.citation: Khayyat, Zuhair, et al. "BigDansing: A System for Big Data Cleansing." Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, May 31 - June 4, 2015, Melbourne, Australia, ACM Press, 2015, pp. 1215–30. [en_US]
dc.contributor.department: Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory [en_US]
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science [en_US]
dc.contributor.mitauthor: Madden, Samuel R
dc.contributor.mitauthor: Jindal, Alekh
dc.relation.journal: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data - SIGMOD '15 [en_US]
dc.eprint.version: Author's final manuscript [en_US]
dc.type.uri: http://purl.org/eprint/type/ConferencePaper [en_US]
eprint.status: http://purl.org/eprint/status/NonPeerReviewed [en_US]
dspace.orderedauthors: Khayyat, Zuhair; Ilyas, Ihab F.; Jindal, Alekh; Madden, Samuel; Ouzzani, Mourad; Papotti, Paolo; Quiané-Ruiz, Jorge-Arnulfo; Tang, Nan; Yin, Si [en_US]
dspace.embargo.terms: N [en_US]
dc.identifier.orcid: https://orcid.org/0000-0002-7470-3265
mit.license: OPEN_ACCESS_POLICY [en_US]