MIT Libraries, DSpace@MIT

Collaborative, open, and automated data science

Author(s)
Smith, Micah J.
Advisor
Veeramachaneni, Kalyan
Terms of use
In Copyright - Educational Use Permitted. Copyright retained by author(s). https://rightsstatements.org/page/InC-EDU/1.0/
Abstract
Data science and machine learning have already revolutionized many industries and organizations and are increasingly being used in an open-source setting to address important societal problems. However, there remain many challenges to developing predictive machine learning models in practice, such as the complexity of the steps in the modern data science development process, the involvement of many different people with varying skills and roles, and the necessity of, yet difficulty in, collaborating across steps and people. In this thesis, I describe progress in two directions in supporting the development of predictive models. First, I propose to focus the effort of data scientists and support structured collaboration on the most challenging steps in a data science project. In Ballet, we create a new approach to collaborative data science development, based on adapting and extending the open-source software development model for the collaborative development of feature engineering pipelines. Ballet is the first collaborative feature engineering framework. Using Ballet as a probe, we conduct a detailed case study analysis of an open-source personal income prediction project in order to better understand data science collaborations. Second, I propose to supplement human collaborators with advanced automated machine learning within end-to-end data science and machine learning pipelines. In the Machine Learning Bazaar, we create a flexible and powerful framework for developing machine learning and automated machine learning systems. In our approach, experts annotate and curate components from different machine learning libraries, which can be seamlessly composed into end-to-end pipelines using a unified interface. We build into these pipelines support for automated model selection and hyperparameter tuning. We use these components to create an open-source, general-purpose, automated machine learning system, and describe several other applications.
Date issued
2021-09
URI
https://hdl.handle.net/1721.1/140016
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Doctoral Theses
