MIT Libraries logoDSpace@MIT

MIT
View Item 
  • DSpace@MIT Home
  • MIT Open Access Articles
  • MIT Open Access Articles
  • View Item
  • DSpace@MIT Home
  • MIT Open Access Articles
  • MIT Open Access Articles
  • View Item
JavaScript is disabled for your browser. Some features of this site may not work without it.

BigDansing

Author(s)
Khayyat, Zuhair; Ilyas, Ihab F.; Ouzzani, Mourad; Papotti, Paolo; Quiané-Ruiz, Jorge-Arnulfo; Tang, Nan; Yin, Si; Madden, Samuel R; Jindal, Alekh; ... Show more Show less
Thumbnail
DownloadMadden_BigDansing.pdf (946.8Kb)
OPEN_ACCESS_POLICY

Open Access Policy

Creative Commons Attribution-Noncommercial-Share Alike

Terms of use
Creative Commons Attribution-Noncommercial-Share Alike http://creativecommons.org/licenses/by-nc-sa/4.0/
Metadata
Show full item record
Abstract
Data cleansing approaches have usually focused on detecting and fixing errors with little attention to scaling to big datasets. This presents a serious impediment since data cleansing often involves costly computations such as enumerating pairs of tuples, handling inequality joins, and dealing with user-defined functions. In this paper, we present BigDansing, a Big Data Cleansing system to tackle efficiency, scalability, and ease-of-use issues in data cleansing. The system can run on top of most common general purpose data processing platforms, ranging from DBMSs to MapReduce-like frameworks. A user-friendly programming interface allows users to express data quality rules both declaratively and procedurally, with no requirement of being aware of the underlying distributed platform. BigDansing takes these rules into a series of transformations that enable distributed computations and several optimizations, such as shared scans and specialized joins operators. Experimental results on both synthetic and real datasets show that BigDansing outperforms existing baseline systems up to more than two orders of magnitude without sacrificing the quality provided by the repair algorithms.
Date issued
2013-05
URI
http://hdl.handle.net/1721.1/112981
Department
Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory; Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Journal
Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data - SIGMOD '15
Publisher
Association for Computing Machinery
Citation
Khayyat, Zuhair, et al. "BigDansing: A System for Big Data Cleansing. Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, 31 May - June 4, 2105, Melbourne, Australia, ACM Press, 2015, pp. 1215–30.
Version: Author's final manuscript
ISBN
978-1-4503-2758-9

Collections
  • MIT Open Access Articles

Browse

All of DSpaceCommunities & CollectionsBy Issue DateAuthorsTitlesSubjectsThis CollectionBy Issue DateAuthorsTitlesSubjects

My Account

Login

Statistics

OA StatisticsStatistics by CountryStatistics by Department
MIT Libraries
PrivacyPermissionsAccessibilityContact us
MIT
Content created by the MIT Libraries, CC BY-NC unless otherwise noted. Notify us about copyright concerns.