dc.contributor.advisor: Rinard, Martin
dc.contributor.author: Xiong, Thomas
dc.date.accessioned: 2022-11-30T19:39:52Z
dc.date.available: 2022-11-30T19:39:52Z
dc.date.issued: 2021-06
dc.date.submitted: 2021-06-17T20:14:59.711Z
dc.identifier.uri: https://hdl.handle.net/1721.1/146664
dc.description.abstract: Pancreatic ductal adenocarcinoma (PDAC), a specific type of pancreatic cancer, has a five-year survival rate of 8.5% and is the third-deadliest cancer in the United States. However, earlier detection can raise survival rates dramatically. In this thesis, we investigate the hypothesis that predictive models from a variety of model classes can use different indicators from electronic health record (EHR) data to predict PDAC diagnosis. We find that logistic regression, random forest, and XGBoost models perform best when using patients’ unique diagnoses, lab test frequencies, medication frequencies, and race and ethnicity as data, with our best logistic regression model achieving an AUROC of 0.801 on a held-out test set. To better approximate these models’ use case in practice, we construct a time-dependent regime for model evaluation. Overall, we find that model performance decreases in the time-dependent regime compared to the time-independent regime, suggesting the possibility of concept drift in our dataset. Moreover, through ℓ₀ regularization, we find that lab test frequencies tend to be the most important features in the best logistic regression model. The intended use for our deployed model is to serve as a prescreening tool to deliver an enriched population for further targeted PDAC screening. Our best model for this purpose delivers a sensitivity of 0.46 at a specificity of 0.9. According to our medical collaborators, this combination of sensitivity and specificity qualifies this model as suitable for our intended prescreening use. In this context, the ability of our model to work only with information derived from electronic health records, collected as part of routine medical care, is a significant advantage. We describe the steps taken to begin model deployment into an existing federated EHR database. In this scenario, we envision that our model would be integrated into hospital EHR systems and run routinely and automatically over broad patient populations as EHR data is collected, producing a history of risk scores for each patient. Patient selection for further targeted PDAC screening can then consider both absolute scores and their evolution over time.
dc.publisher: Massachusetts Institute of Technology
dc.rights: In Copyright - Educational Use Permitted
dc.rights: Copyright MIT
dc.rights.uri: http://rightsstatements.org/page/InC-EDU/1.0/
dc.title: A Predictive Model for Pancreatic Cancer Diagnosis
dc.type: Thesis
dc.description.degree: M.Eng.
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree: Master
thesis.degree.name: Master of Engineering in Computer Science and Molecular Biology