Systematic Development of Healthcare AI: From Data Curation,
Algorithm Optimization, Benchmark Design and Clinical Applications

Gao, Mingye

Author(s)

Gao, Mingye

Advisor

Anthony, Brian

Terms of use

In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/

Abstract

Artificial intelligence (AI) has brought transformative changes to the healthcare industry in recent years in areas such as patient care, disease diagnosis, and medical research. As healthcare systems worldwide face increasing pressure from aging populations and rising chronic disease rates, there is an urgent need for systematic approaches to developing reliable and safe AI solutions. This thesis advances the systematic development of healthcare AI through four interconnected components: data curation, algorithm optimization, benchmark design, and clinical applications. The primary contribution of this thesis is a comprehensive pipeline for healthcare large language models (LLMs), spanning from data curation to clinical deployment. At the data level, a rule-based filtering framework was developed to select high-quality subsets from large pre-training corpora, significantly improving both the continued pre-training and fine-tuning performance of LLMs. For safety alignment, an automated pipeline was developed for preference learning that includes preference dataset synthesis, rule-based and data-adaptive annotation, and reward model training. Additionally, two novel benchmarks were created to ensure the reliability and safety of LLMs in healthcare tasks: one assesses demographic biases of LLMs across common diseases, and the other assesses models’ ability to reject illogical requests from users in drug-related scenarios. Finally, LLMs were used to generate patient-friendly educational content for clinical trials, demonstrating their role in improving patient education and engagement in clinical trials. This systematic progression from data to deployment establishes a blueprint for developing safe and effective language models in healthcare settings.
Beyond language models, machine learning techniques were applied to an additional healthcare task. In this project, a novel approach combining normalized cross-correlation and attention graph convolutional recurrent networks was developed to achieve contactless, continuous, and reliable radar-based vital signs monitoring in dynamic home environments. Through systematic data collection and algorithm optimization, heart rate can be measured accurately across varying radar-subject distances (2 to 2.5 m) and subject orientations; validation in four test houses with six subjects demonstrated robust performance in real-world conditions. Collectively, these contributions advance healthcare AI development on two fronts: establishing frameworks for the safe and effective deployment of language models in healthcare settings, and enabling reliable, continuous health monitoring at home without wearable devices.

Date issued

2025-05

URI

https://hdl.handle.net/1721.1/164060

Department

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science

Publisher

Massachusetts Institute of Technology

Collections

Doctoral Theses