MIT Libraries DSpace@MIT

Engineering TEV Protease Specificity: An Exploration of Machine Learning and High-Throughput Experimentation for Protein Design

Author(s)
Sundar, Vikram
Advisor
Esvelt, Kevin M.
Terms of use
Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0). Copyright retained by author(s). https://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract
Engineering sequence-specific proteases would enable a wide variety of therapeutic applications in diseases ranging from cancer to Parkinson’s disease. However, many previous experimental and physics-based attempts at protease engineering have failed to engineer specificity in cleaving alternative substrates, rendering them useless. In this thesis, we aim to engineer TEV (tobacco etch virus) protease, a highly sequence-specific protease, to cleave alternative substrates. We incorporate novel high-throughput assays and powerful machine learning (ML) methods for highly effective protein engineering.
The first portion of this thesis focuses on generating fitness landscapes from high-throughput experiments. Most machine learning models do not account for experimental noise, harming model performance and changing model rankings in benchmarking studies. Here we develop FLIGHTED, a Bayesian method of accounting for uncertainty by generating probabilistic fitness landscapes from noisy high-throughput experiments. We demonstrate how FLIGHTED can improve model performance on two categories of experiments: single-step selection assays, such as phage display, and a novel high-throughput assay called DHARMA that ties activity to base editing. FLIGHTED can be used to generate robust, well-calibrated fitness landscapes, and when combined with DHARMA, our methods enable us to generate fitness landscapes of millions of variants.
We then evaluate how to model protein fitness given a fitness dataset of millions of variants. Accounting for noise via FLIGHTED significantly improves model performance, especially of high-performing models. Data size, not model scale, is the most important factor in improving model performance. Furthermore, the choice of top model architecture matters more than the protein language model embedding. The best way to generate sufficient data scale is via error-prone PCR libraries; models trained on these landscapes achieve high accuracy.
Using these methods, we successfully engineer both activity on an alternative substrate and specificity when compared to the wild-type. The ML-designed variants outperform anything found in the training set, demonstrating the value of machine learning even with experimental libraries of millions of variants. However, our results are limited to relatively close substrates. How best to improve model performance on distant substrates remains an open question.
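The abstract's point about experimental noise in single-step selection assays can be illustrated with a minimal sketch. This is not FLIGHTED and is not taken from the thesis. It assumes a variant is summarized by read counts before and after selection, treats those counts as Poisson, and uses the standard delta-method approximation Var(log n) ≈ 1/n. The function name, the pseudocount, and the read counts are all hypothetical.

```python
import math


def log_enrichment(pre_count, post_count, pre_total, post_total, pseudocount=0.5):
    """Return (mean, variance) of the log enrichment ratio for one variant.

    Illustrative assumption: read counts are Poisson, so by the delta
    method Var(log n) is roughly 1/n. Variants with few reads therefore
    get a large variance, even if their observed ratio looks the same.
    """
    # The pseudocount (an assumed value) avoids log(0) for unobserved variants.
    pre = pre_count + pseudocount
    post = post_count + pseudocount
    mean = math.log(post / post_total) - math.log(pre / pre_total)
    variance = 1.0 / pre + 1.0 / post
    return mean, variance


# Hypothetical variants with the same observed 2x enrichment at different depths.
shallow = log_enrichment(2, 4, 1_000_000, 1_000_000)
deep = log_enrichment(2000, 4000, 1_000_000, 1_000_000)

for name, (mean, variance) in (("shallow", shallow), ("deep", deep)):
    print(f"{name}: log enrichment = {mean:.3f}, std = {math.sqrt(variance):.3f}")
```

Both variants show the same raw ratio, but the shallow one carries far more uncertainty. A model trained on point estimates alone would treat the two measurements as equally reliable, which is the kind of problem a probabilistic fitness landscape is meant to address.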
Date issued
2025-05
URI
https://hdl.handle.net/1721.1/163577
Department
Massachusetts Institute of Technology. Computational and Systems Biology Program
Publisher
Massachusetts Institute of Technology

Collections
  • Doctoral Theses
