ChaperoNet: Distillation of Language Model Semantics to Folded Three-Dimensional Protein Structures
Author(s)
dos Santos Costa, Allan
Advisor
Jacobson, Joseph M.
Abstract
Determining the structure of proteins has been a long-standing goal in biology. Language models have been recently deployed to capture the evolutionary semantics of protein sequences, and as an emergent property, were found to be structural learners. Enriched with multiple sequence alignments (MSA), these transformer models were able to capture significant information about a protein’s tertiary structure. In this work, we show how such structural information can be recovered by processing language model embeddings, and introduce a two-stage folding pipeline to directly estimate three-dimensional folded structures from protein sequences. We envision that this pipeline will provide a basis for efficient, end-to-end protein structure prediction through protein language modeling.
Date issued
2021-09
Department
Program in Media Arts and Sciences (Massachusetts Institute of Technology)
Publisher
Massachusetts Institute of Technology