
dc.contributor.advisor	Kellis, Manolis
dc.contributor.advisor	Trop, Evan
dc.contributor.author	Boshar, Sam T.
dc.date.accessioned	2024-09-16T13:50:56Z
dc.date.available	2024-09-16T13:50:56Z
dc.date.issued	2024-05
dc.date.submitted	2024-07-11T14:36:28.606Z
dc.identifier.uri	https://hdl.handle.net/1721.1/156816
dc.description.abstract	In the field of natural language processing (NLP), large language models (LLMs) trained on enormous corpora of unlabeled sequence data have demonstrated state-of-the-art performance on a variety of downstream tasks. This approach is appealing because one model can be easily adapted to perform well across many modalities, rather than requiring many specialized models. This same architecture has found great success modeling biological data, including protein, mRNA and genomic sequences. Representations from biological language models have also outperformed highly specialized models, especially in data-scarce scenarios. However, since the genome contains all of the information encoding proteins, genomic language models (gLMs) have the potential to model DNA, RNA and proteins. In spite of this, the performance of gLMs on proteins is largely unknown due to the lack of datasets pairing proteins with their true coding sequences. In this work, we curate five such coding sequence datasets and use them to study gLM and protein language model (pLM) performance on protein function and property prediction. We show that gLMs are competitive with, and on some tasks even outperform, their pLM counterparts, and that they perform best using the curated true coding sequences rather than alternative codon sampling strategies. We perform a series of experiments to find interpretable explanations for gLM performance, and investigate architecture changes to address their shortcomings and improve the ability of gLMs to represent proteins. We find that a joint genomic-proteomic architecture outperforms each individual approach, showing that the two capture different but complementary sequence representations. We identify examples of such distinct representations in a detailed analysis of their respective embedding spaces. In studying the application of gLMs to proteomics, we aim to encourage further research into a unified and synergistic approach to many biological modalities.
dc.publisher	Massachusetts Institute of Technology
dc.rights	Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
dc.rights	Copyright retained by author(s)
dc.rights.uri	https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.title	Genomic Language Models for Protein Function and Property Prediction
dc.type	Thesis
dc.description.degree	M.Eng.
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
dc.identifier.orcid	0009-0007-5694-6080
mit.thesis.degree	Master
thesis.degree.name	Master of Engineering in Electrical Engineering and Computer Science

