Towards Rigorously Tested & Reliable Machine Learning for Health
Author(s)
Oberst, Michael Karl
Advisor
Sontag, David
Abstract
When can we rely on machine learning in high-risk domains like healthcare? In the long term, we want machine learning systems to be as reliable as any FDA-approved medication or diagnostic test. Building reliable models is complicated by two needs: causal reasoning and robust performance. To support decision-making, we want to draw causal conclusions about the impact of model recommendations (e.g., will recommending a particular drug lead to better patient outcomes?). Moreover, we want our models to perform well across different hospitals and patient populations, including those that differ from the hospitals and populations seen during model development.
These objectives run into limitations of what our data can tell us without further assumptions. For instance, we only observe outcomes for the treatments that were actually prescribed to patients, not all possible treatments. Similarly, we do not observe performance on every conceivable hospital where a model might be deployed, but only on the (typically much more limited) data we have access to.
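The first limitation above is the classic missing-counterfactual problem: each patient's record contains the outcome under the prescribed treatment only. A minimal synthetic illustration (all names and data here are hypothetical, not from the thesis):

```python
import random

random.seed(0)

# Hypothetical illustration: each patient has two potential outcomes,
# one under drug A and one under drug B, but the observed dataset only
# records the outcome for the treatment actually prescribed.
patients = []
for i in range(5):
    y_a = random.random()  # outcome had the patient received drug A
    y_b = random.random()  # outcome had the patient received drug B
    treatment = random.choice(["A", "B"])
    observed = y_a if treatment == "A" else y_b
    patients.append({"id": i, "treatment": treatment, "outcome": observed})

# No record contains both y_a and y_b, so the individual effect
# y_a - y_b cannot be computed from the data without further assumptions.
for p in patients:
    print(p)
```

The point of the sketch is only that the stored records never pair both potential outcomes for the same patient, which is why external knowledge or experimental data is needed to draw causal conclusions.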
In this thesis, I approach these challenges using tools from causality and statistics, incorporating external knowledge into both model validation and model design. External knowledge can come from a variety of sources, including human experts (e.g., clinicians) or gold-standard data (e.g., from randomized trials). First, I introduce methods for assessing and improving the credibility of causal inference, including methods to help domain experts "sanity check" the causal reasoning of models for decision-making, identify under-represented populations in causal analyses, and incorporate limited experimental data to improve the credibility of causal conclusions. Second, I introduce tools for building robust predictive models by incorporating domain knowledge of plausible variation across environments: both estimating the worst-case predictive performance (e.g., accuracy) of models under domain-specific changes in the data-generating process, and optimizing models for the best worst-case performance.
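As a toy sketch of the worst-case idea (not the thesis's actual estimator): if we assume deployment environments can only reweight known subgroups, such as different hospitals, then worst-case accuracy over all subgroup mixtures reduces to the minimum per-subgroup accuracy. The subgroup names and data below are hypothetical:

```python
def worst_case_accuracy(per_group_correct):
    """per_group_correct: dict mapping subgroup -> list of 0/1 correctness flags.

    Under a subgroup-shift assumption, an adversarial environment puts
    all weight on the worst subgroup, so worst-case accuracy is the
    minimum per-subgroup accuracy.
    """
    group_acc = {g: sum(flags) / len(flags) for g, flags in per_group_correct.items()}
    return min(group_acc.values()), group_acc

# Hypothetical evaluation data: per-patient correctness flags by hospital.
correct = {
    "hospital_1": [1, 1, 1, 0],  # 75% accurate
    "hospital_2": [1, 0, 1, 0],  # 50% accurate
}
wc, by_group = worst_case_accuracy(correct)
print(wc, by_group)  # worst case is driven by hospital_2
```

Richer shift models (e.g., shifts in specific causal mechanisms of the data-generating process, as the abstract describes) lead to harder optimization problems, but the same principle applies: performance is reported under the least favorable environment permitted by the domain knowledge.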
Date issued
2023-09
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology