Single-Cell Language Model for Transcriptomics & Cell Type Annotation
Author(s)
Lin, Vincent
DownloadThesis PDF (1.878Mb)
Advisor
Pierrot, Thomas
Tidor, Bruce
Terms of use
Metadata
Show full item recordAbstract
As single-cell transcriptomics datasets continue to grow in size and biological complexity, current models for cell type annotation remain limited in their generalizability and are often evaluated on only a small fraction of the standardized cell types defined in modern ontologies. Current state-of-the-art models for transcriptomic representation demonstrate that deep learning models can extract rich features on single-cell data but are evaluated on very few cell types and perform poorly on broader datasets. This work introduces a multimodal model architecture that integrates large language models (LLMs) with gene expression encoders to address this scalability gap in cell type annotation. Inspired by vision-language frameworks, our architecture combines a pretrained scRNA encoder with a Perceiver Resampler that maps gene expression profiles into the latent space of a large language model. We construct structured, ontology-grounded datasets of up to 197 cell types and evaluate our model's performance using instruction fine-tuning. Our experiments analyze the impact of integrating language modeling components with scRNA encoders and their benefit on cell type annotation performance for large, diverse datasets. Our results show that while a scRNA encoder may be sufficient for small datasets, our single-cell model leveraging LLMs consistently outperforms the scRNA encoder baseline on larger datasets, with a widening gap in classification performance as data complexity increases, demonstrating the scalability and improved generalizability of our multimodal architecture. We also provide further analysis of the tradeoffs associated with using the natural language domain for biological analysis.
Date issued
2025-05Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer SciencePublisher
Massachusetts Institute of Technology