
dc.contributor.advisor: Kagal, Lalana
dc.contributor.author: Jiang, Emily
dc.date.accessioned: 2024-09-24T18:26:39Z
dc.date.available: 2024-09-24T18:26:39Z
dc.date.issued: 2024-05
dc.date.submitted: 2024-07-11T14:37:27.255Z
dc.identifier.uri: https://hdl.handle.net/1721.1/157011
dc.description.abstract: Electronic health records (EHRs) have become standard in US clinical practice. However, the distributed, dynamic, private, and jargon-dense nature of medical data is a barrier to harnessing Large Language Models (LLMs) for the domain. Retrieval-augmented generation (RAG), in which an LLM is provided with both the question and context returned by an external retriever, is a promising technique for addressing the unique qualities of clinical text. LLMs using RAG can answer questions about patient records without training on privacy-sensitive data; updated records can also be queried immediately without fine-tuning. By exposing the source documents that inform the model response, RAG enables greater physician interpretability as well as reduced hallucination, both of which are crucial for safe deployment in healthcare. This thesis presents FedRAG, a retrieval-augmented clinical question-answering (QA) system for clinicians to explore trends in patient data across distributed storage. We introduce a novel hierarchical design for federated document retrieval, in which leaf nodes perform local similarity search while non-leaf nodes route queries based on access policies and aggregate documents returned by their children. We also create a dataset on clinical trends over the MIMIC-IV database for the evaluation of QA systems on EHR data. FedRAG is implemented in Python as a federation of Flask servers using LangChain, the Qdrant vector database for retrieval, and GPT-3.5 Turbo for generation. We present a case study of three medical organizations, and find that the federation scheme results in no loss of quality against a centralized baseline. We explore the impact of resource accessibility among users with varying access permissions, observing that retrieval and generation quality degrade reasonably as document access is restricted. Finally, we evaluate performance in the key abilities required of RAG systems. We conclude that despite remaining challenges in achieving high retrieval quality and noise robustness, FedRAG is effective at synthesizing clinical trends through information integration across EHR documents.
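The hierarchical retrieval design described in the abstract can be illustrated with a minimal in-process sketch. This is not the thesis implementation: the class names `LeafNode` and `RouterNode`, the role-based `policy` mapping, and the toy term-count `embed` function are all assumptions made for illustration, and plain Python objects stand in for the Flask servers and the Qdrant vector database.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: lowercase term counts.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class LeafNode:
    """Hypothetical leaf: holds documents and runs local similarity search."""
    name: str
    documents: list

    def retrieve(self, query: str, user_role: str, k: int) -> list:
        # Access control is enforced by the parent router in this sketch,
        # so the leaf only ranks its own documents.
        q = embed(query)
        scored = [(cosine(q, embed(d)), self.name, d) for d in self.documents]
        return sorted(scored, reverse=True)[:k]


@dataclass
class RouterNode:
    """Hypothetical non-leaf: routes by access policy, aggregates children."""
    name: str
    children: list = field(default_factory=list)
    # Assumed policy shape: child name -> set of roles allowed to query it.
    policy: dict = field(default_factory=dict)

    def retrieve(self, query: str, user_role: str, k: int) -> list:
        results = []
        for child in self.children:
            if user_role in self.policy.get(child.name, set()):
                results.extend(child.retrieve(query, user_role, k))
        # Aggregate: keep the k best documents across all permitted children.
        return sorted(results, reverse=True)[:k]


hospital_a = LeafNode("hospital_a", [
    "patient admitted with sepsis and elevated lactate",
    "routine follow-up for hypertension",
])
hospital_b = LeafNode("hospital_b", [
    "sepsis treated with broad-spectrum antibiotics",
    "knee replacement discharge summary",
])
root = RouterNode(
    "root",
    children=[hospital_a, hospital_b],
    policy={"hospital_a": {"clinician", "researcher"},
            "hospital_b": {"clinician"}},
)

clinician_hits = root.retrieve("sepsis antibiotics", "clinician", k=2)
researcher_hits = root.retrieve("sepsis antibiotics", "researcher", k=2)
```

In this sketch the hypothetical `researcher` role is routed only to `hospital_a`, so its results are drawn from a smaller document pool. In a full RAG pipeline, the aggregated documents would then be passed to the LLM as context alongside the question.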
dc.publisher: Massachusetts Institute of Technology
dc.rights: In Copyright - Educational Use Permitted
dc.rights: Copyright retained by author(s)
dc.rights.uri: https://rightsstatements.org/page/InC-EDU/1.0/
dc.title: Clinical Question-Answering over Distributed EHR Data
dc.type: Thesis
dc.description.degree: M.Eng.
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree: Master
thesis.degree.name: Master of Engineering in Electrical Engineering and Computer Science

