Extracting Coronary Lesion Information from Angiogram Reports for Patient Screening Applications
Author(s)
Gaffney, Leah Paige
Advisor
Jónasson, Jónas Oddur
Gray, Martha
Heldt, Thomas
Abstract
Dramatic improvements in the management of heart disease over the past 60 years (>70% reduction in mortality) may be plateauing, and there are challenges ahead for achieving cardiovascular health objectives, with heart disease still the leading cause of death in the US. One group of heart disease patients, coronary artery disease (CAD) patients, are now presenting with increased clinical complexity and higher risk profiles due to increases in lifespan and comorbid disease states. Percutaneous coronary interventions (PCI) are the best treatment options for a subset of CAD patients, but patients are increasingly deemed ineligible due to their higher risk of procedural complications. A new option, protected PCI, makes PCI safer for those patients.
Abiomed developed and manufactures the Impella pump, a temporary support option for the heart that provides the “protection” in protected PCI. Our work aims to ensure that the protected PCI option is available to these patients. This work supports the development of patient screener tools that identify patients with high-risk CAD who have not been offered PCI but should be eligible for protected PCI.
Specifically, we tackle one of the eligibility requirements by extracting coronary lesion location and severity information from clinical records. Natural language processing (NLP) tools are enabling more advanced electronic health record (EHR) based patient research. We collected and curated a dataset of 72 diagnostic coronary angiogram reports from health systems which contributed data to the Abiomed cVAD registry. Of these, 39 reports from 6 sites were used as a training set and 13 reports from the same 6 sites as a development set for a data processing pipeline to extract coronary lesion information. This work expands on the existing solutions for extracting ejection fraction information from echocardiogram reports. The ejection fraction extraction task has been solved with regular expressions, a simple and somewhat inflexible pattern-matching approach.
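As an illustration of the regular-expression approach mentioned for ejection fraction, here is a minimal sketch. The pattern, the function name `extract_ejection_fraction`, and the sample sentences are assumptions made for this example and are not taken from the thesis or from any existing tool; real echocardiogram reports would need a broader pattern.

```python
import re

# Hypothetical pattern: "EF", "LVEF", or "ejection fraction", followed within
# a few non-digit characters by a percentage or a percentage range.
EF_PATTERN = re.compile(
    r"\b(?:LV\s*)?(?:EF|ejection\s+fraction)\b"   # the term itself
    r"[^\d%]{0,30}?"                               # short filler such as " is estimated at "
    r"(\d{1,2})(?:\s*(?:-|to)\s*(\d{1,2}))?\s*%",  # a value or a low-high range
    re.IGNORECASE,
)


def extract_ejection_fraction(text):
    """Return a (low, high) percent tuple for each ejection fraction mention."""
    results = []
    for match in EF_PATTERN.finditer(text):
        low = int(match.group(1))
        # A single value is reported as a degenerate range (low == high).
        high = int(match.group(2)) if match.group(2) else low
        results.append((low, high))
    return results


print(extract_ejection_fraction("The LVEF is estimated at 35-40%."))
print(extract_ejection_fraction("Ejection fraction: 55%."))
```

The sketch also shows why such patterns are somewhat inflexible: any phrasing the pattern does not anticipate is silently missed.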
Our coronary lesion extraction followed a two-step architecture that is common in NLP: named entity recognition (NER) followed by relation extraction (REL). We compare a machine learning (ML) based NER approach with a dictionary and regular expression ("matching") NER approach. Our REL implementation is rules-based. On entities alone, an intermediate outcome of the first stage (NER), we achieve 92.1% recall and 93.9% precision with the ML-based model and 95.1% recall with 52.6% precision with the matching-based model (on 370 total entities of types location, vessel, and severity in the development set). The ML approach overcomes the low precision of matching. This difference may not affect the final prediction performance, depending on the second-stage implementation. We achieve 89.7% recall and 84.5% precision on the second stage independently. (This is a conservative representation, as 7 of 103 relations in the development set come from types of sentences that are explicitly not yet handled by our second-stage model. Recall increases to 92.6% and precision to 90.6% when those cases are ignored.)
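The two-stage idea can be sketched as follows: a matching-based NER step, then a rules-based REL step. This is not the thesis implementation. The tiny dictionaries, the `Entity` class, the function names, and the "link each severity to the nearest vessel in the same sentence" rule are all assumptions chosen to keep the example short.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    label: str  # "location", "vessel", or "severity"
    text: str
    start: int
    end: int


# Hypothetical, deliberately tiny dictionaries; a real lexicon would be far larger.
PATTERNS = {
    "location": re.compile(r"\b(?:proximal|mid|distal|ostial)\b", re.IGNORECASE),
    "vessel": re.compile(r"\b(?:LAD|RCA|LCx|left main|circumflex)\b", re.IGNORECASE),
    "severity": re.compile(
        r"\b\d{1,3}\s*%|\b(?:mild|moderate|severe|occluded)\b", re.IGNORECASE
    ),
}


def find_entities(sentence):
    """Stage 1 (NER by matching): tag every dictionary or regex hit."""
    entities = [
        Entity(label, m.group(0), m.start(), m.end())
        for label, pattern in PATTERNS.items()
        for m in pattern.finditer(sentence)
    ]
    return sorted(entities, key=lambda e: e.start)


def nearest(target, candidates):
    """Return the candidate closest to target by character offset, or None."""
    return min(candidates, key=lambda c: abs(c.start - target.start), default=None)


def extract_lesions(report):
    """Stage 2 (REL by rule): link each severity to the nearest vessel and location."""
    lesions = []
    for sentence in re.split(r"(?<=[.;])\s+", report):
        entities = find_entities(sentence)
        vessels = [e for e in entities if e.label == "vessel"]
        locations = [e for e in entities if e.label == "location"]
        for severity in (e for e in entities if e.label == "severity"):
            vessel = nearest(severity, vessels)
            if vessel is None:
                continue  # a severity with no vessel in the sentence is dropped
            location = nearest(vessel, locations)
            lesions.append(
                (location.text if location else None, vessel.text, severity.text)
            )
    return lesions


print(extract_lesions("The proximal LAD has a 70% stenosis. The RCA is occluded."))
```

A proximity rule like this one breaks down on sentences that list several vessels and severities together, which is the kind of case where grammatical dependency rules may do better.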
Each stage independently achieves reasonable performance. We analyze errors to recommend the next steps of development for both stages. With the two stages together, we achieve 79.6% recall and 71.8% precision with the ML-based NER model and 76.2% recall and 77.7% precision (F1 76.9%) with the matching-based NER model (without correcting for expected future improvements in performance). Non-ML approaches can solve at least three-quarters of this text extraction problem. We recommend advanced methods, including grammatical dependency rules for relations and improving ML-based entity prediction with more training examples from specific contexts.
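For readers unfamiliar with the metrics, the following sketch shows one way end-to-end precision and recall could be computed over extracted relations. It assumes exact matching of (location, vessel, severity) tuples; the thesis may use different matching criteria. The function name `precision_recall` and the toy data are made up for illustration.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall over sets of exactly matching relations."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy example (not thesis data): three of four predictions match the annotations.
gold = {
    ("proximal", "LAD", "70%"),
    (None, "RCA", "occluded"),
    ("mid", "LCx", "50%"),
    ("distal", "RCA", "mild"),
}
predicted = {
    ("proximal", "LAD", "70%"),
    (None, "RCA", "occluded"),
    ("mid", "LCx", "50%"),
    ("ostial", "LAD", "severe"),  # a false positive
}
print(precision_recall(predicted, gold))
```

Under exact matching, an error in either stage (a missed entity or a wrong link) costs a full relation, so combined scores can be expected to fall below the per-stage scores.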
This work provided a roadmap and the first pipeline to leverage data from the cVAD registry for algorithm development for patient screening applications. We developed structured data models and an annotated dataset for coronary lesion description extraction from coronary angiograms. We present the results of the entire algorithm and its component parts and propose advanced methods to refine the approach for implementation in future patient screening tools.
Date issued
2024-05
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science; Sloan School of Management
Publisher
Massachusetts Institute of Technology