
dc.contributor.advisor	Kellis, Manolis
dc.contributor.advisor	Trop, Evan
dc.contributor.author	Boshar, Sam T.
dc.date.accessioned	2024-09-16T13:50:56Z
dc.date.available	2024-09-16T13:50:56Z
dc.date.issued	2024-05
dc.date.submitted	2024-07-11T14:36:28.606Z
dc.identifier.uri	https://hdl.handle.net/1721.1/156816
dc.description.abstract	In the field of natural language processing (NLP), large language models (LLMs) trained on enormous corpora of unlabeled sequence data have demonstrated state-of-the-art performance on a variety of downstream tasks. This approach is appealing because one model can be easily adapted to perform well across many modalities, rather than requiring many specialized models. This same architecture has found great success modeling biological data, including protein, mRNA and genomic sequences. Representations from biological language models have also outperformed highly specialized models, especially in data-scarce scenarios. However, since the genome contains all of the information encoding proteins, genomic language models (gLMs) have the potential to model DNA, RNA and proteins. In spite of this, the performance of gLMs on proteins is largely unknown due to the lack of datasets pairing proteins with their true coding sequences. In this work, we curate five such coding sequence datasets and use them to study gLM and protein language model (pLM) performance on protein function and property prediction. We show that gLMs are competitive with, and on some tasks even outperform, their pLM counterparts, and that they perform best using the curated true coding sequences rather than alternative codon sampling strategies. We perform a series of experiments to find interpretable explanations for gLM performance, and investigate architecture changes to address their shortcomings and improve the ability of gLMs to represent proteins. We find that a joint genomic-proteomic architecture outperforms each individual approach, showing that the two capture different but complementary sequence representations. We identify examples of such distinct representations in a detailed analysis of their respective embedding spaces. In studying the application of gLMs to proteomics, we aim to encourage further research into a unified and synergistic approach to many biological modalities.
dc.publisher	Massachusetts Institute of Technology
dc.rights	Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
dc.rights	Copyright retained by author(s)
dc.rights.uri	https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.title	Genomic Language Models for Protein Function and Property Prediction
dc.type	Thesis
dc.description.degree	M.Eng.
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
dc.identifier.orcid	0009-0007-5694-6080
mit.thesis.degree	Master
thesis.degree.name	Master of Engineering in Electrical Engineering and Computer Science

