Engineering TEV Protease Specificity: An Exploration of Machine Learning and High-Throughput Experimentation for Protein Design

Sundar, Vikram

Author(s)

Sundar, Vikram

DownloadThesis PDF (18.15Mb)

Advisor

Esvelt, Kevin M.

Terms of use

Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) Copyright retained by author(s) https://creativecommons.org/licenses/by-nc-nd/4.0/

Metadata

Show full item record

Abstract

Engineering sequence-specific proteases would enable a wide variety of therapeutic applications in diseases ranging from cancer to Parkinson’s disease. However, many previous experimental and physics-based attempts at protease engineering have failed to engineer specificity in cleaving alternative substrates, rendering them useless. In this thesis, we aim to engineer TEV (tobacco etch virus) protease, a highly sequence-specific protease, to cleave alternative substrates. We incorporate novel high-throughput assays and powerful machine learning (ML) methods for highly effective protein engineering. The first portion of this thesis focuses on generating fitness landscapes from high-throughput experiments. Most machine learning models do not account for experimental noise, harming model performance and changing model rankings in benchmarking studies. Here we develop FLIGHTED, a Bayesian method of accounting for uncertainty by generating probabilistic fitness landscapes from noisy high-throughput experiments. We demonstrate how FLIGHTED can improve model performance on two categories of experiments: single-step selection assays, such as phage display, and a novel high-throughput assay called DHARMA that ties activity to base editing. FLIGHTED can be used to generate robust, well-calibrated fitness landscapes, and when combined with DHARMA, our methods enable us to generate fitness landscapes of millions of variants. We then evaluate how to model protein fitness given a fitness dataset of millions of variants. Accounting for noise via FLIGHTED significantly improves model performance, especially of high-performing models. Data size, not model scale, is the most important factor in improving model performance. Furthermore, the choice of top model architecture matters more than the protein language model embedding. The best way to generate sufficient data scale is via error-prone PCR libraries; models trained on these landscapes achieve high accuracy. Using these methods, we successfully engineer both activity on an alternative substrate and specificity when compared to the wild-type. The ML-designed variants outperform anything found in the training set, demonstrating the value of machine learning even with experimental libraries of millions of variants. However, our results are limited to relatively close substrates. How best to improve model performance on distant substrates remains an open question.

Date issued

2025-05

URI

https://hdl.handle.net/1721.1/163577

Department

Massachusetts Institute of Technology. Computational and Systems Biology Program

Publisher

Massachusetts Institute of Technology

Collections

Doctoral Theses