Clinical Question-Answering over Distributed EHR Data

Jiang, Emily

dc.contributor.advisor	Kagal, Lalana
dc.contributor.author	Jiang, Emily
dc.date.accessioned	2024-09-24T18:26:39Z
dc.date.available	2024-09-24T18:26:39Z
dc.date.issued	2024-05
dc.date.submitted	2024-07-11T14:37:27.255Z
dc.identifier.uri	https://hdl.handle.net/1721.1/157011
dc.description.abstract	Electronic health records (EHRs) have become standard in US clinical practice. However, the distributed, dynamic, private, and jargon-dense nature of medical data is a barrier to harnessing Large Language Models (LLMs) for the domain. Retrieval-augmented generation (RAG), in which an LLM is provided with both the question and context returned by an external retriever, is a promising technique for addressing the unique qualities of clinical text. LLMs using RAG can answer questions about patient records without training on privacy-sensitive data; updated records can also be queried immediately without fine-tuning. By exposing the source documents that inform the model response, RAG enables greater physician interpretability as well as reduced hallucination, both of which are crucial for safe deployment in healthcare. This thesis presents FedRAG, a retrieval-augmented clinical question-answering (QA) system for clinicians to explore trends in patient data across distributed storage. We introduce a novel hierarchical design for federated document retrieval, in which leaf nodes perform local similarity search while non-leaf nodes route queries based on access policies and aggregate documents returned by their children. We also create a dataset on clinical trends over the MIMIC-IV database for the evaluation of QA systems on EHR data. FedRAG is implemented in Python as a federation of Flask servers using LangChain, the Qdrant vector database for retrieval, and GPT-3.5 Turbo for generation. We present a case study of three medical organizations, and find that the federation scheme results in no loss of quality against a centralized baseline. We explore the impact of resource accessibility among users with varying access permissions, observing that retrieval and generation quality degrade reasonably as document access is restricted. Finally, we evaluate performance in the key abilities required of RAG systems. We conclude that despite remaining challenges in achieving high retrieval quality and noise robustness, FedRAG is effective at synthesizing clinical trends through information integration across EHR documents.
dc.publisher	Massachusetts Institute of Technology
dc.rights	In Copyright - Educational Use Permitted
dc.rights	Copyright retained by author(s)
dc.rights.uri	https://rightsstatements.org/page/InC-EDU/1.0/
dc.title	Clinical Question-Answering over Distributed EHR Data
dc.type	Thesis
dc.description.degree	M.Eng.
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree	Master
thesis.degree.name	Master of Engineering in Electrical Engineering and Computer Science

Files in this item

Name: jiang-emji-meng-eecs-2024-thes ...
Size: 1.497Mb
Format: PDF
Description: Thesis PDF

This item appears in the following Collection(s)

Graduate Theses
