Genomic Language Models for Protein Function and Property Prediction

Boshar, Sam T.

Author(s)

Boshar, Sam T.

Advisor

Kellis, Manolis

Trop, Evan

Terms of use

Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) Copyright retained by author(s) https://creativecommons.org/licenses/by-nc-nd/4.0/

Abstract

In the field of natural language processing (NLP), large language models (LLMs) trained on enormous corpora of unlabeled sequence data have demonstrated state-of-the-art performance on a variety of downstream tasks. This approach is appealing because one model can be easily adapted to do well in many modalities, rather than requiring many specialized models. This same architecture has found great success modeling biological data, including protein, mRNA, and genomic sequences. Representations from biological language models have also outperformed highly specialized models, especially in data-scarce scenarios. However, since the genome contains all of the information encoding proteins, genomic language models (gLMs) have the potential to model DNA, RNA, and proteins. In spite of this, the performance of gLMs on proteins is largely unknown due to the lack of datasets pairing proteins with their true coding sequences. In this work, we curate five such coding sequence datasets and use them to study gLM and protein language model (pLM) performance on protein function and property prediction. We show that gLMs are competitive with, and on some tasks even outperform, their pLM counterparts, and that they perform best using the curated true coding sequences over alternative codon sampling strategies. We perform a series of experiments to find interpretable explanations for gLM performance, and investigate architecture changes to address their shortcomings and improve the ability of gLMs to represent proteins. We find that a joint genomic-proteomic architecture outperforms each individual approach, showing that they capture different but complementary sequence representations. We identify examples of such distinct representations in a detailed analysis of their respective embedding spaces. In studying the application of gLMs to proteomics, we look to encourage further research into a unified and synergistic approach to many biological modalities.

Date issued

2024-05

URI

https://hdl.handle.net/1721.1/156816

Department

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science

Publisher

Massachusetts Institute of Technology

Collections

Graduate Theses