Causal Inference with Survival Outcomes via Orthogonal Statistical Learning
Author(s)
Xu, Shenbo
Advisor
Welsch, Roy E.
Finkelstein, Stan N.
Abstract
The field of causal inference has recently made great strides in incorporating machine learning into confounding adjustment and estimation of heterogeneous treatment effects (HTE). However, several gaps remained for survival outcomes.
First, overlap-weighted effect estimators based on machine learning nuisance models were not available for such outcomes. Thus, researchers wishing to mitigate bias and variance from poor overlap had to accept potential bias from nuisance model misspecification in its place. In Chapter 2, we fill this gap by proposing a class of one-step cross-fitted double/debiased machine learning estimators for cumulative weighted average treatment effects for both survival outcomes and competing risk outcomes. Our approach combines importance sampling, semiparametric theory, and Neyman orthogonality to resolve both model misspecification and lack of covariate overlap between treatment arms in observational studies with censored outcomes. We give regularity conditions for the consistency, asymptotic linearity, and semiparametric efficiency bounds of the proposed estimators. Simulations show that the proposed estimators do not require oracle parametric nuisance models. We apply the proposed estimators to compare the effects of two first-line anti-diabetic drugs on cancer outcomes.
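For orientation, one common way to write an overlap-weighted survival estimand is sketched below. The notation is an assumption made for illustration and is not taken from the thesis: A is a binary treatment, X the covariates, e(x) the propensity score, T^a the potential event time under treatment a, and h(x) the tilting function that defines the target population. The thesis may define its cumulative weighted average treatment effects differently, for example in terms of cumulative incidence rather than survival.

```latex
% Illustrative notation only (assumed, not the thesis's own definitions).
\begin{align*}
  e(x) &= P(A = 1 \mid X = x), \qquad
  h(x) = e(x)\,\{1 - e(x)\}, \\
  S^{a}(t \mid x) &= P(T^{a} > t \mid X = x), \quad a \in \{0, 1\}, \\
  \theta_{h}(t) &= \frac{E\big[\, h(X)\,\{ S^{1}(t \mid X) - S^{0}(t \mid X) \} \,\big]}{E[\, h(X) \,]}.
\end{align*}
```

With h(x) = 1 this reduces to the ordinary average treatment effect on survival at time t. The overlap choice h(x) = e(x){1 - e(x)} down-weights units whose propensity scores are close to 0 or 1, which is where poor overlap inflates bias and variance.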
Second, a wide range of machine learning methods (or "learners") for estimating heterogeneous treatment effects were not applicable to estimating effects on survival outcomes, particularly in the presence of competing risks. In Chapter 3, we fill this gap by developing several once-for-all (orthogonal) censoring unbiased transformations which convert time-to-event data into continuous outcomes, such that all HTE learners and oracle rates for continuous outcomes can be borrowed. Our approach not only reduces the pressing need to develop various HTE learners for censored outcomes, especially with competing risks, but also fully leverages the state of the art of existing schemes. Through direct application of HTE learners to these transformed continuous outcomes, we obtain consistent estimates of heterogeneous cumulative incidence effects, total effects, and separable direct effects. We provide generic model-free learner-specific oracle inequalities bounding the finite-sample excess risk. The oracle efficiency results depend on the oracle selector and estimated nuisance functions from all steps involved in the transformation. We demonstrate the empirical performance of the proposed methods in simulation studies.
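To illustrate the general idea of a censoring unbiased transformation, here is a minimal sketch of the simplest variant, an inverse-probability-of-censoring-weighted (IPCW) pseudo-outcome for the survival indicator 1{T > t}. It is not the orthogonal transformation developed in the thesis. It assumes censoring is independent of the event time and of covariates, so that a single Kaplan-Meier curve for the censoring distribution suffices, and it assumes no tied times. The function names and the simulated data are hypothetical.

```python
import numpy as np

def km_censoring_survival(times, events, t):
    """Kaplan-Meier estimate of G(t) = P(C > t).

    Censoring (events == 0) is treated as the 'event' of interest.
    Assumes continuous times with no ties.
    """
    order = np.argsort(times)
    times_sorted = times[order]
    censored = 1 - events[order]
    n = len(times)
    at_risk = n - np.arange(n)            # number still at risk at each sorted time
    factors = 1.0 - censored / at_risk    # KM factor at each observed time
    return float(np.prod(factors[times_sorted <= t]))

def ipcw_survival_pseudo_outcome(times, events, t):
    """IPCW pseudo-outcome Y* = 1{U > t} / G_hat(t), where U = min(T, C).

    Under independent censoring, E[Y*] is approximately P(T > t), so Y* can be
    passed to any learner built for fully observed continuous outcomes.
    """
    g_t = km_censoring_survival(times, events, t)
    return (times > t).astype(float) / g_t

# Hypothetical simulated data: T ~ Exp(rate 1), C ~ Exp(rate 0.5), independent.
rng = np.random.default_rng(0)
n = 20000
T = rng.exponential(scale=1.0, size=n)
C = rng.exponential(scale=2.0, size=n)
U = np.minimum(T, C)                      # observed follow-up time
delta = (T <= C).astype(int)              # 1 = event observed, 0 = censored

t0 = 1.0
g_hat = km_censoring_survival(U, delta, t0)
y_star = ipcw_survival_pseudo_outcome(U, delta, t0)
true_survival = np.exp(-t0)               # P(T > 1) for a rate-1 exponential

print("mean pseudo-outcome:", y_star.mean())
print("true survival at t0:", true_survival)
```

The pseudo-outcome is fully observed for every unit, which is what allows learners designed for uncensored continuous outcomes to be reused. The price of this simple version is sensitivity to misspecification of the censoring model, which is the kind of weakness that orthogonal (doubly robust) transformations are designed to reduce.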
An important application area for causal inference methods, and one which originally motivated my interest in the field, is drug repurposing. In Chapter 4, we apply the methods of Chapter 2 to investigate whether metformin, a diabetes medication, might also have unexpected beneficial effects on cancer. The analysis encountered three major challenges: poor overlap between treatment groups, model misspecification, and pre-cancer death as a competing risk for cancer incidence. To resolve these issues simultaneously, we take balancing-weighted total cause-specific effects, controlled direct effects, and separable effects as causal estimands and develop balancing-weighted double/debiased machine learning estimators for both cumulative incidence functions and restricted mean time lost, with all estimators satisfying Neyman orthogonality. Using the Clinical Practice Research Datalink (CPRD) data, we find that metformin shows a preventive direct effect on cancer incidence relative to sulfonylureas. The results also demonstrate the advantage of choosing the average treatment effect for the overlap population as the target quantity.
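For readers less familiar with the two outcome summaries named above, their standard definitions under competing risks are sketched here. The notation is assumed for illustration and is not taken from the thesis: T is the time to the first event of any cause, J is the cause of that event, and τ is a fixed time horizon.

```latex
% Standard competing-risks definitions; symbols are illustrative assumptions.
\begin{align*}
  F_{j}(t) &= P(T \le t,\; J = j)
    && \text{(cumulative incidence function for cause } j\text{)}, \\
  \mathrm{RMTL}_{j}(\tau) &= \int_{0}^{\tau} F_{j}(t)\, dt
    = E\big[\, \{\tau - \min(T, \tau)\}\, \mathbf{1}\{J = j\} \,\big]
    && \text{(restricted mean time lost to cause } j\text{)}.
\end{align*}
```

The second equality follows because τ - min(T, τ) equals the integral of 1{T ≤ t} over t in [0, τ]. A treatment contrast in either quantity for the cancer cause, with pre-cancer death as the competing cause, is one way to read the effects described in this chapter.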
Finally, just as machine learning helps to automate nuisance model estimation for confounding adjustment and modeling effect heterogeneity, causally informed artificial intelligence (AI) and large language models (LLMs) might help to automate hypothesis generation for drug repurposing and surveillance opportunities. In Chapter 5, we explore this potential by developing a high-throughput screening approach to evaluate available drugs across multiple diseases. The screening methodology aims to identify drug-disease pairs with significant positive signals that could represent promising repurposing candidates, while also detecting pairs with negative signals that might indicate potential safety concerns; both are critical aspects of pharmacoepidemiology research. This systematic approach leverages the convergence of expanding healthcare data sources and modern data science advances to establish a data-driven framework for drug repurposing discovery and pharmacovigilance.
To conclude, we discuss the limitations of the proposed methods and provide possible future research directions.
Date issued
2025-02
Department
Massachusetts Institute of Technology. Department of Mechanical Engineering
Publisher
Massachusetts Institute of Technology