MIT Libraries logoDSpace@MIT

MIT
View Item 
  • DSpace@MIT Home
  • MIT Libraries
  • MIT Theses
  • Graduate Theses
  • View Item
  • DSpace@MIT Home
  • MIT Libraries
  • MIT Theses
  • Graduate Theses
  • View Item
JavaScript is disabled for your browser. Some features of this site may not work without it.

Single-Cell Language Model for Transcriptomics & Cell Type Annotation

Author(s)
Lin, Vincent
Thumbnail
DownloadThesis PDF (1.878Mb)
Advisor
Pierrot, Thomas
Tidor, Bruce
Terms of use
In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/
Metadata
Show full item record
Abstract
As single-cell transcriptomics datasets continue to grow in size and biological complexity, current models for cell type annotation remain limited in their generalizability and are often evaluated on only a small fraction of the standardized cell types defined in modern ontologies. Current state-of-the-art models for transcriptomic representation demonstrate that deep learning models can extract rich features on single-cell data but are evaluated on very few cell types and perform poorly on broader datasets. This work introduces a multimodal model architecture that integrates large language models (LLMs) with gene expression encoders to address this scalability gap in cell type annotation. Inspired by vision-language frameworks, our architecture combines a pretrained scRNA encoder with a Perceiver Resampler that maps gene expression profiles into the latent space of a large language model. We construct structured, ontology-grounded datasets of up to 197 cell types and evaluate our model's performance using instruction fine-tuning. Our experiments analyze the impact of integrating language modeling components with scRNA encoders and their benefit on cell type annotation performance for large, diverse datasets. Our results show that while a scRNA encoder may be sufficient for small datasets, our single-cell model leveraging LLMs consistently outperforms the scRNA encoder baseline on larger datasets, with a widening gap in classification performance as data complexity increases, demonstrating the scalability and improved generalizability of our multimodal architecture. We also provide further analysis of the tradeoffs associated with using the natural language domain for biological analysis.
Date issued
2025-05
URI
https://hdl.handle.net/1721.1/162990
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Graduate Theses

Browse

All of DSpaceCommunities & CollectionsBy Issue DateAuthorsTitlesSubjectsThis CollectionBy Issue DateAuthorsTitlesSubjects

My Account

Login

Statistics

OA StatisticsStatistics by CountryStatistics by Department
MIT Libraries
PrivacyPermissionsAccessibilityContact us
MIT
Content created by the MIT Libraries, CC BY-NC unless otherwise noted. Notify us about copyright concerns.