Causal Inference with Survival Outcomes via Orthogonal Statistical Learning

Xu, Shenbo

Author(s)

Xu, Shenbo

Advisor

Welsch, Roy E.

Finkelstein, Stan N.

Terms of use

In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/

Abstract

The field of causal inference has recently made great strides in incorporating machine learning into confounding adjustment and estimation of heterogeneous treatment effects (HTE). However, gaps remained for survival outcomes. First, overlap-weighted effect estimators based on machine learning nuisance models were not available for such outcomes. Thus, researchers wishing to mitigate bias and variance from poor overlap had to accept potential bias from nuisance model misspecification in its place. In Chapter 2, we fill this gap by proposing a class of one-step cross-fitted double/debiased machine learning estimators for cumulative weighted average treatment effects for both survival outcomes and competing risk outcomes. Our approach combines importance sampling, semiparametric theory, and Neyman orthogonality to address both model misspecification and lack of covariate overlap between treatment arms in observational studies with censored outcomes. We give regularity conditions for the consistency, asymptotic linearity, and semiparametric efficiency bounds of the proposed estimators. Simulations show that the proposed estimators do not require oracle parametric nuisance models. We apply the proposed estimators to compare the effects of two first-line anti-diabetic drugs on cancer outcomes. Second, a wide range of machine learning methods (or "learners") for estimating heterogeneous treatment effects were not applicable to estimating effects on survival outcomes, particularly in the presence of competing risks. In Chapter 3, we fill this gap by developing several once-for-all (orthogonal) censoring unbiased transformations which convert time-to-event data into continuous outcomes, such that all HTE learners and oracle rates for continuous outcomes can be borrowed.
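To make the idea of a censoring unbiased transformation concrete, the following is a minimal sketch of the simplest such transformation, inverse probability of censoring weighting (IPCW), for the survival indicator 1{T > t}. It is not the orthogonal transformation developed in the thesis. It assumes censoring is independent of the event time and of covariates, assumes no tied times, and uses a Kaplan-Meier estimate of the censoring distribution. The function names `km_censoring_survival` and `ipcw_transform` are hypothetical.

```python
import numpy as np


def km_censoring_survival(obs_time, event, t):
    """Kaplan-Meier estimate of G(t) = P(C > t), treating censoring as the 'event'.

    obs_time: observed times min(T, C); event: 1 if the event was observed, 0 if censored.
    Assumes no ties between event and censoring times.
    """
    obs_time = np.asarray(obs_time, dtype=float)
    censored = 1 - np.asarray(event, dtype=int)
    surv = 1.0
    # Multiply the conditional "not censored" probabilities over censoring times up to t.
    for u in np.unique(obs_time[(censored == 1) & (obs_time <= t)]):
        at_risk = np.sum(obs_time >= u)
        n_cens = np.sum((obs_time == u) & (censored == 1))
        surv *= 1.0 - n_cens / at_risk
    return surv


def ipcw_transform(obs_time, event, t):
    """Pseudo-outcome 1{min(T, C) > t} / G(t), whose mean targets S(t) = P(T > t)."""
    g_t = km_censoring_survival(obs_time, event, t)
    return (np.asarray(obs_time, dtype=float) > t).astype(float) / g_t


# Toy data: the subject with observed time 2 is censored.
pseudo = ipcw_transform([1, 2, 3, 4], [1, 0, 1, 1], t=2.5)
print(pseudo)  # approximately [0, 0, 1.5, 1.5]
```

The transformed values can then be passed as a continuous outcome to any regression-based learner. The orthogonal variants described in the abstract add augmentation terms so that errors in the estimated nuisance functions have only a second-order effect, which this sketch does not attempt.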
Our approach not only reduces the pressing need to develop various HTE learners for censored outcomes and especially competing risks, but also fully leverages the state of the art of existing schemes. Through direct application of HTE learners to these transformed continuous outcomes, we obtain consistent estimates of heterogeneous cumulative incidence effects, total effects, and separable direct effects. We provide generic model-free learner-specific oracle inequalities bounding the finite-sample excess risk. The oracle efficiency results depend on the oracle selector and estimated nuisance functions from all steps involved in the transformation. We demonstrate the empirical performance of the proposed methods in simulation studies.
An important application area for causal inference methods, and one which originally motivated my interest in the field, is drug repurposing. In Chapter 4, we apply the methods of Chapter 2 to investigate whether metformin, a diabetes medication, might also have unexpected beneficial effects on cancer. The analysis encountered three major challenges: poor overlap between treatment groups, model misspecification, and pre-cancer death as a competing risk for cancer incidence. To resolve these issues simultaneously, we take balancing-weighted total cause-specific effects, controlled direct effect, and separable effects as causal estimands and develop balancing-weighted double/debiased machine learning estimators for both cumulative incidence functions and restricted mean time lost, with all estimators satisfying Neyman orthogonality. Using data from the Clinical Practice Research Datalink (CPRD), we find that metformin shows a preventive direct effect on cancer incidence relative to sulfonylureas. The results also demonstrate the advantage of choosing the average treatment effect for the overlap population as the target quantity.
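For readers unfamiliar with the overlap population, the sketch below shows a plain overlap-weighted difference in means for an uncensored outcome. Treated units are weighted by 1 - e(x) and control units by e(x), where e(x) is the propensity score. This is only an illustration of the target quantity, not the cross-fitted double/debiased estimators of the thesis. It assumes the propensity scores are already available, and the function name `overlap_weighted_effect` is hypothetical.

```python
import numpy as np


def overlap_weighted_effect(y, a, e):
    """Normalized overlap-weighted difference in mean outcomes.

    y: outcomes; a: treatment indicator (1 = treated, 0 = control);
    e: propensity scores P(A = 1 | X), assumed given or estimated elsewhere.
    """
    y, a, e = (np.asarray(v, dtype=float) for v in (y, a, e))
    # Overlap weights: 1 - e for treated units, e for control units.
    w = np.where(a == 1, 1.0 - e, e)
    treated_mean = np.sum(w * a * y) / np.sum(w * a)
    control_mean = np.sum(w * (1 - a) * y) / np.sum(w * (1 - a))
    return treated_mean - control_mean


# Simulated example with a constant treatment effect of 1 and strong confounding.
rng = np.random.default_rng(1)
x = rng.normal(size=20000)
e_true = 1.0 / (1.0 + np.exp(-2.0 * x))
a_sim = rng.binomial(1, e_true)
y_sim = 1.0 * a_sim + x + rng.normal(size=x.size)
print(overlap_weighted_effect(y_sim, a_sim, e_true))
```

Because the weights shrink toward zero for units with propensity scores near 0 or 1, the estimate is dominated by units that could plausibly have received either treatment, which is why this target is less sensitive to poor overlap than the average effect over the whole population.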
Finally, just as machine learning helps to automate nuisance model estimation for confounding adjustment and modeling effect heterogeneity, causally informed artificial intelligence (AI) and large language models (LLMs) might help to automate hypothesis generation for drug repurposing and surveillance opportunities. In Chapter 5, we explore this potential by developing a high-throughput screening approach to evaluate available drugs across multiple diseases. The screening methodology aims to identify drug-disease pairs with significant positive signals that could represent promising repurposing candidates, while also detecting pairs with negative signals that might indicate potential safety concerns; both are critical aspects of pharmacoepidemiology research. This systematic approach leverages the convergence of expanding healthcare data sources and modern data science advances to establish a data-driven framework for drug repurposing discovery and pharmacovigilance. To conclude, we discuss the limitations of the proposed methods and provide possible future research directions.

Date issued

2025-02

URI

https://hdl.handle.net/1721.1/158798

Department

Massachusetts Institute of Technology. Department of Mechanical Engineering

Publisher

Massachusetts Institute of Technology

Collections

Doctoral Theses