DSpace@MIT

Genomic Language Models for Protein Function and Property Prediction

Author(s)
Boshar, Sam T.
Download: Thesis PDF (5.174 MB)
Advisor
Kellis, Manolis
Trop, Evan
Terms of use
Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) Copyright retained by author(s) https://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract
In the field of natural language processing (NLP), large language models (LLMs) trained on enormous corpora of unlabeled sequence data have demonstrated state-of-the-art performance on a variety of downstream tasks. This approach is appealing because one model can be easily adapted to perform well across many modalities, rather than requiring many specialized models. The same architecture has found great success modeling biological data, including protein, mRNA and genomic sequences. Representations from biological language models have also outperformed highly specialized models, especially in data-scarce scenarios. However, since the genome contains all of the information encoding proteins, genomic language models (gLMs) have the potential to model DNA, RNA and proteins. Despite this, the performance of gLMs on proteins is largely unknown due to the lack of datasets pairing proteins with their true coding sequences. In this work, we curate five such coding sequence datasets and use them to study gLM and protein language model (pLM) performance on protein function and property prediction. We show that gLMs are competitive with, and on some tasks even outperform, their pLM counterparts, and that they perform best when using the curated true coding sequences rather than alternative codon sampling strategies. We perform a series of experiments to find interpretable explanations for gLM performance, and we investigate architecture changes to address their shortcomings and improve the ability of gLMs to represent proteins. We find that a joint genomic-proteomic architecture outperforms each individual approach, showing that the two model types capture distinct but complementary sequence representations. We identify examples of such distinct representations in a detailed analysis of their respective embedding spaces. By studying the application of gLMs to proteomics, we hope to encourage further research into a unified and synergistic approach to modeling many biological modalities.
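The joint genomic-proteomic idea described in the abstract can be illustrated with a minimal sketch: embed each protein with a pLM, embed its true coding sequence with a gLM, concatenate the two pooled embeddings, and fit a simple downstream property predictor. This is not the thesis's actual implementation; the `embed_protein`/`embed_cds` stubs, embedding dimensions, toy data, and ridge regressor below are placeholder assumptions standing in for real model calls and curated datasets.

```python
# Hypothetical sketch of a joint genomic-proteomic predictor:
# concatenate pLM and gLM embeddings and fit a simple regressor.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

PLM_DIM, GLM_DIM = 320, 512  # placeholder embedding sizes

def embed_protein(aa_seq: str) -> np.ndarray:
    """Placeholder for a mean-pooled pLM embedding of an amino-acid sequence."""
    rng = np.random.default_rng(abs(hash(aa_seq)) % (2**32))
    return rng.standard_normal(PLM_DIM)

def embed_cds(dna_seq: str) -> np.ndarray:
    """Placeholder for a mean-pooled gLM embedding of the true coding sequence."""
    rng = np.random.default_rng(abs(hash(dna_seq)) % (2**32))
    return rng.standard_normal(GLM_DIM)

# Toy paired data: (protein, coding sequence) pairs with a synthetic label.
rng = np.random.default_rng(0)
def random_pair(length: int = 60):
    """Generate a toy (protein, CDS) pair; not biologically meaningful."""
    dna = "".join(rng.choice(list("ACGT"), size=length))
    aa = "".join(rng.choice(list("ACDEFGHIKLMNPQRSTVWY"), size=length // 3))
    return aa, dna

pairs = [random_pair() for _ in range(200)]
y = rng.standard_normal(len(pairs))  # stand-in for a real property label

# Joint representation: concatenate the protein and coding-sequence embeddings.
X = np.stack([np.concatenate([embed_protein(a), embed_cds(d)]) for a, d in pairs])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = RidgeCV().fit(X_tr, y_tr)
print("held-out R^2:", model.score(X_te, y_te))
```

In practice the stub embedders would be replaced by frozen pretrained language models, and the claim in the abstract corresponds to the concatenated features outperforming either the pLM or gLM features alone on the curated benchmarks.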
Date issued
2024-05
URI
https://hdl.handle.net/1721.1/156816
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Graduate Theses
