Single-Cell Language Model for Transcriptomics & Cell Type Annotation

Author(s)

Lin, Vincent

Advisor

Pierrot, Thomas

Tidor, Bruce

Terms of use

In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/

Abstract

As single-cell transcriptomics datasets grow in size and biological complexity, current models for cell type annotation remain limited in their generalizability and are often evaluated on only a small fraction of the standardized cell types defined in modern ontologies. State-of-the-art models for transcriptomic representation show that deep learning can extract rich features from single-cell data, but they are evaluated on very few cell types and perform poorly on broader datasets. This work introduces a multimodal model architecture that integrates large language models (LLMs) with gene expression encoders to address this scalability gap in cell type annotation. Inspired by vision-language frameworks, our architecture combines a pretrained scRNA encoder with a Perceiver Resampler that maps gene expression profiles into the latent space of an LLM. We construct structured, ontology-grounded datasets of up to 197 cell types and evaluate our model's performance using instruction fine-tuning. Our experiments analyze how integrating language modeling components with scRNA encoders affects cell type annotation performance on large, diverse datasets. Our results show that while an scRNA encoder may be sufficient for small datasets, our single-cell model leveraging LLMs consistently outperforms the scRNA encoder baseline on larger datasets, and the gap in classification performance widens as data complexity increases. This demonstrates the scalability and improved generalizability of our multimodal architecture. We also analyze the tradeoffs associated with using the natural language domain for biological analysis.
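
The resampling step named in the abstract can be illustrated with a small sketch. This is not the thesis implementation: it is a single-layer, single-head cross-attention resampler written in NumPy with random (untrained) weights, and the class name ResamplerSketch, the dimensions (64, 128), and the number of latent tokens (8) are all assumptions chosen for illustration. It shows only the general idea that a fixed set of latent queries attends over a variable number of per-gene embeddings and yields a fixed number of tokens at an assumed LLM embedding width.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class ResamplerSketch:
    """Illustrative single-head cross-attention resampler (hypothetical)."""

    def __init__(self, enc_dim, llm_dim, num_latents, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(enc_dim)
        # Latent queries: one row per output token. Learned in a real
        # model; random here because this sketch is untrained.
        self.latents = rng.normal(size=(num_latents, enc_dim)) * scale
        self.w_q = rng.normal(size=(enc_dim, enc_dim)) * scale
        self.w_k = rng.normal(size=(enc_dim, enc_dim)) * scale
        self.w_v = rng.normal(size=(enc_dim, enc_dim)) * scale
        # Projection into the (assumed) LLM embedding width.
        self.w_out = rng.normal(size=(enc_dim, llm_dim)) * scale

    def __call__(self, gene_embeddings):
        # gene_embeddings: (num_genes, enc_dim), e.g. scRNA encoder output.
        q = self.latents @ self.w_q              # (num_latents, enc_dim)
        k = gene_embeddings @ self.w_k           # (num_genes, enc_dim)
        v = gene_embeddings @ self.w_v           # (num_genes, enc_dim)
        scores = q @ k.T / np.sqrt(q.shape[-1])  # (num_latents, num_genes)
        attn = softmax(scores, axis=-1)
        latents = self.latents + attn @ v        # residual update
        return latents @ self.w_out              # (num_latents, llm_dim)


# Two cells with different numbers of gene embeddings (synthetic data).
rng = np.random.default_rng(1)
cell_a = rng.normal(size=(500, 64))
cell_b = rng.normal(size=(1200, 64))

resampler = ResamplerSketch(enc_dim=64, llm_dim=128, num_latents=8)
tokens_a = resampler(cell_a)
tokens_b = resampler(cell_b)
print(tokens_a.shape, tokens_b.shape)
```

Both cells produce the same number of output tokens regardless of how many gene embeddings go in, which is what lets such tokens be placed alongside text token embeddings in a language model's input sequence.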

Date issued

2025-05

URI

https://hdl.handle.net/1721.1/162990

Department

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science

Publisher

Massachusetts Institute of Technology

Collections

Graduate Theses