Show simple item record

dc.contributor.advisorVeeramachaneni, Kalyan
dc.contributor.authorSmith, Micah J.
dc.date.accessioned2022-02-07T15:19:10Z
dc.date.available2022-02-07T15:19:10Z
dc.date.issued2021-09
dc.date.submitted2021-09-21T19:30:44.833Z
dc.identifier.urihttps://hdl.handle.net/1721.1/140016
dc.description.abstractData science and machine learning have already revolutionized many industries and organizations and are increasingly being used in an open-source setting to address important societal problems. However, there remain many challenges to developing predictive machine learning models in practice, such as the complexity of the steps in the modern data science development process, the involvement of many different people with varying skills and roles, and the necessity of, yet difficulty in, collaborating across steps and people. In this thesis, I describe progress in two directions in supporting the development of predictive models. First, I propose to focus the effort of data scientists and support structured collaboration on the most challenging steps in a data science project. In Ballet, we create a new approach to collaborative data science development, based on adapting and extending the open-source software development model for the collaborative development of feature engineering pipelines, and is the first collaborative feature engineering framework. Using Ballet as a probe, we conduct a detailed case study analysis of an open-source personal income prediction project in order to better understand data science collaborations. Second, I propose to supplement human collaborators with advanced automated machine learning within end-to-end data science and machine learning pipelines. In the Machine Learning Bazaar, we create a flexible and powerful framework for developing machine learning and automated machine learning systems. In our approach, experts annotate and curate components from different machine learning libraries, which can be seamlessly composed into end-to-end pipelines using a unified interface. We build into these pipelines support for automated model selection and hyperparameter tuning. We use these components to create an open-source, general-purpose, automated machine learning system, and describe several other applications.
dc.publisherMassachusetts Institute of Technology
dc.rightsIn Copyright - Educational Use Permitted
dc.rightsCopyright retained by author(s)
dc.rights.urihttps://rightsstatements.org/page/InC-EDU/1.0/
dc.titleCollaborative, open, and automated data science
dc.typeThesis
dc.description.degreePh.D.
dc.contributor.departmentMassachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degreeDoctoral
thesis.degree.nameDoctor of Philosophy


Files in this item

Thumbnail

This item appears in the following Collection(s)

Show simple item record