Building Data Civilizer Pipelines with an Advanced Workflow Engine
Author(s)
Mansour, Essam; Deng, Dong; Castro Fernandez, Raul; Qahtan, Abdulhakim A.; Tao, Wenbo; Abedjan, Ziawasch; Elmagarmid, Ahmed; Ilyas, Ihab F.; Madden, Samuel R; Ouzzani, Mourad; Stonebraker, Michael; Tang, Nan; ...
Open Access Policy
Creative Commons Attribution-Noncommercial-Share Alike
Abstract
© 2018 IEEE. In order for an enterprise to gain insight into its internal business and the changing outside environment, it is essential to provide the relevant data for in-depth analysis. Enterprise data is usually scattered across departments and geographic regions and is often inconsistent. Data scientists spend the majority of their time finding, preparing, integrating, and cleaning relevant data sets. Data Civilizer is an end-to-end data preparation system. In this paper, we present the complete system, focusing on our new workflow engine, a superior system for entity matching and consolidation, and new cleaning tools. Our workflow engine allows data scientists to author, execute, and retrofit data preparation pipelines of different data discovery and cleaning services. Our end-to-end demo scenario is based on data from the MIT data warehouse and e-commerce data sets.
Date issued
2018-04
Department
Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
Publisher
IEEE
Citation
Mansour, Essam, Deng, Dong, Fernandez, Raul Castro, Qahtan, Abdulhakim A., Tao, Wenbo et al. 2018. "Building Data Civilizer Pipelines with an Advanced Workflow Engine."
Version: Author's final manuscript