Modeling structured biological processes with machine learning

Shen, Max Walt

Author(s)

Shen, Max Walt

Advisor

Liu, David R.

Terms of use

In Copyright - Educational Use Permitted Copyright MIT http://rightsstatements.org/page/InC-EDU/1.0/

Abstract

Models of natural phenomena have played a fundamental role in scientific progress. In modern biology, we seek to model ever more complex phenomena, driven by advances in high-throughput measurement technology and machine learning. These advances motivate a top-down data-driven modeling approach, but directly applying such methods to model complex biological processes can fail to yield models with causal understanding. It would be desirable to build models that combine the rich bodies of causal knowledge built over decades of research with modern flexible machine learning methods that scale to large and rich datasets. Here, I present deep data-driven models that incorporate biological and causal prior knowledge to model fundamental biological processes in genome editing and directed evolution. I first consider a model of DNA repair following CRISPR/Cas9 cleavage, which was generally thought to be unpredictable. In a large-scale dataset, I find signatures implicating an alternative and more predictable DNA repair pathway. I describe a model that accurately predicts genome editing outcomes by representing these competing but mechanistically independent repair pathways while flexibly learning unknown relationships from data. I use the model to discover a new genome editing strategy for efficiently and precisely correcting a class of disease-causing genetic mutations. Next, I consider a model for base editing, where I decompose a complex prediction problem into simpler subproblems and solve one with an autoregressive sequence-to-distribution of sequences model. The models enable designing genome editing strategies with optimized outcomes for disease-causing mutations and enabled the first demonstration of transversion base editing by cytosine base editors, broadening the scope of base editing to potentially correcting new classes of mutations. These models also broaden the scope of C to G base editors with restrictive sequence preferences.
Finally, I propose a method for reconstructing sequence-to-function datasets from directed evolution that can help increase the availability of datasets for machine learning for protein engineering. This method exploits the structure of a differential equation governing natural selection for efficient inference and is capable of proposing variants with higher activity than conventional methods. Incorporating prior knowledge and structure into models of natural phenomena can support scientific discovery.

Date issued

2021-06

URI

https://hdl.handle.net/1721.1/139524

Department

Massachusetts Institute of Technology. Computational and Systems Biology Program

Publisher

Massachusetts Institute of Technology

Collections

Doctoral Theses