dc.contributor.advisor: Pierrot, Thomas
dc.contributor.advisor: Tidor, Bruce
dc.contributor.author: Lin, Vincent
dc.date.accessioned: 2025-10-06T17:38:27Z
dc.date.available: 2025-10-06T17:38:27Z
dc.date.issued: 2025-05
dc.date.submitted: 2025-06-23T14:02:52.087Z
dc.identifier.uri: https://hdl.handle.net/1721.1/162990
dc.description.abstract: As single-cell transcriptomics datasets continue to grow in size and biological complexity, current models for cell type annotation remain limited in their generalizability and are often evaluated on only a small fraction of the standardized cell types defined in modern ontologies. Current state-of-the-art models for transcriptomic representation demonstrate that deep learning models can extract rich features from single-cell data but are evaluated on very few cell types and perform poorly on broader datasets. This work introduces a multimodal model architecture that integrates large language models (LLMs) with gene expression encoders to address this scalability gap in cell type annotation. Inspired by vision-language frameworks, our architecture combines a pretrained scRNA encoder with a Perceiver Resampler that maps gene expression profiles into the latent space of a large language model. We construct structured, ontology-grounded datasets of up to 197 cell types and evaluate our model's performance using instruction fine-tuning. Our experiments analyze the impact of integrating language modeling components with scRNA encoders and their benefit on cell type annotation performance for large, diverse datasets. Our results show that while an scRNA encoder may be sufficient for small datasets, our single-cell model leveraging LLMs consistently outperforms the scRNA encoder baseline on larger datasets, with a widening gap in classification performance as data complexity increases, demonstrating the scalability and improved generalizability of our multimodal architecture. We also provide further analysis of the tradeoffs associated with using the natural language domain for biological analysis.
dc.publisher: Massachusetts Institute of Technology
dc.rights: In Copyright - Educational Use Permitted
dc.rights: Copyright retained by author(s)
dc.rights.uri: https://rightsstatements.org/page/InC-EDU/1.0/
dc.title: Single-Cell Language Model for Transcriptomics & Cell Type Annotation
dc.type: Thesis
dc.description.degree: M.Eng.
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree: Master
thesis.degree.name: Master of Engineering in Electrical Engineering and Computer Science

