A Predictive Model for Pancreatic Cancer Diagnosis

Xiong, Thomas

dc.contributor.advisor	Rinard, Martin
dc.contributor.author	Xiong, Thomas
dc.date.accessioned	2022-11-30T19:39:52Z
dc.date.available	2022-11-30T19:39:52Z
dc.date.issued	2021-06
dc.date.submitted	2021-06-17T20:14:59.711Z
dc.identifier.uri	https://hdl.handle.net/1721.1/146664
dc.description.abstract	Pancreatic ductal adenocarcinoma (PDAC), a specific type of pancreatic cancer, has a five-year survival rate of 8.5% and is the third-deadliest cancer in the United States. However, earlier detection can raise survival rates dramatically. In this thesis, we investigate the hypothesis that predictive models from a variety of model classes can use different indicators from electronic health record (EHR) data to predict PDAC diagnosis. We find that logistic regression, random forest, and XGBoost models perform the best when using patients’ unique diagnoses, lab test frequencies, medication frequencies, and race and ethnicity as data, with our best logistic regression model achieving an AUROC of 0.801 on a held-out test set. To better approximate these models’ use case in practice, we construct a time-dependent regime for model evaluation. Overall, we found that model performance decreased in the time-dependent regime as compared to the time-independent regime, suggesting the possibility of concept drift in our dataset. Moreover, through ℓ₀ regularization, we found that lab test frequencies tended to be the most important features in the best logistic regression model. The intended use for our deployed model is to serve as a prescreening tool to deliver an enriched population for further targeted PDAC screening. Our best model for this purpose delivers a sensitivity of 0.46 at a specificity of 0.9. According to our medical collaborators, this combination of sensitivity and specificity qualifies this model as suitable for our intended prescreening use. In this context, the ability of our model to work only with information derived from electronic health records, collected as part of routine medical care, is a significant advantage. We describe the steps taken to begin model deployment into an existing federated EHR database.
In this scenario, we envision that our model would be integrated into hospital EHR systems and run routinely and automatically over broad patient populations, producing a history of risk scores for each patient as new EHR data is collected over time. Patient selection for further targeted PDAC screening can then consider both absolute scores and their evolution.
dc.publisher	Massachusetts Institute of Technology
dc.rights	In Copyright - Educational Use Permitted
dc.rights	Copyright MIT
dc.rights.uri	http://rightsstatements.org/page/InC-EDU/1.0/
dc.title	A Predictive Model for Pancreatic Cancer Diagnosis
dc.type	Thesis
dc.description.degree	M.Eng.
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree	Master
thesis.degree.name	Master of Engineering in Computer Science and Molecular Biology

Files in this item

Name: Xiong-txiong-meng-eecs-2021-th ...
Size: 8.994Mb
Format: PDF
Description: Thesis PDF

This item appears in the following Collection(s)

Graduate Theses
