DataHub: Collaborative Data Science & Dataset Version Management at Scale
Author(s): Bhardwaj, Anant P.; Bhattacherjee, Souvik; Chavan, Amit; Deshpande, Amol; Elmore, Aaron J.; Madden, Samuel R.; Parameswaran, Aditya; ...
Relational databases have limited support for data collaboration, where teams jointly curate and analyze large datasets. Inspired by software version control systems like Git, we propose (a) a dataset version control system that lets users create, branch, merge, difference, and search large, divergent collections of datasets, and (b) a platform, DataHub, that lets users perform collaborative data analysis on top of this version control system. We outline the challenges in providing dataset version control at scale.
Department: Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory; Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Proceedings of the 7th Biennial Conference on Innovative Data Systems Research (CIDR ’15)
Bhardwaj, Anant, Souvik Bhattacherjee, Amit Chavan, Amol Deshpande, Aaron J. Elmore, Samuel Madden, and Aditya Parameswaran. "DataHub: Collaborative Data Science & Dataset Version Management at Scale." 7th Biennial Conference on Innovative Data Systems Research (CIDR ’15) (January 2015).
Author's final manuscript