DataHub: Collaborative Data Science & Dataset Version Management at Scale
Author(s)
Bhardwaj, Anant P.; Bhattacherjee, Souvik; Chavan, Amit; Deshpande, Amol; Elmore, Aaron J.; Madden, Samuel R.; Parameswaran, Aditya; ...
Download: Madden_DataHub.pdf (480.2Kb)
Terms of use
Publisher with Creative Commons License: Creative Commons Attribution
Abstract
Relational databases have limited support for data collaboration, where teams collaboratively curate and analyze large datasets. Inspired by software version control systems like Git, we propose (a) a dataset version control system, giving users the ability to create, branch, merge, difference and search large, divergent collections of datasets, and (b) a platform, DataHub, that gives users the ability to perform collaborative data analysis building on this version control system. We outline the challenges in providing dataset version control at scale.
Date issued
2015-01
Department
Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory; Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Journal
Proceedings of the 7th Biennial Conference on Innovative Data Systems Research (CIDR ’15)
Citation
Bhardwaj, Anant, Souvik Bhattacherjee, Amit Chavan, Amol Deshpande, Aaron J. Elmore, Samuel Madden, and Aditya Parameswaran. "DataHub: Collaborative Data Science & Dataset Version Management at Scale." 7th Biennial Conference on Innovative Data Systems Research (CIDR ’15) (January 2015).
Version: Author's final manuscript