Modeling structured biological processes with machine learning

Author(s)
Shen, Max Walt
Advisor
Liu, David R.
Terms of use
In Copyright - Educational Use Permitted Copyright MIT http://rightsstatements.org/page/InC-EDU/1.0/
Abstract
Models of natural phenomena have played a fundamental role in scientific progress. In modern biology, we seek to model ever more complex phenomena, driven by advances in high-throughput measurement technology and machine learning. These advances motivate a top-down, data-driven modeling approach, but directly applying such methods to model complex biological processes can fail to yield models with causal understanding. It would be desirable to build models that combine the rich bodies of causal knowledge built over decades of research with modern flexible machine learning methods that scale to large and rich datasets. Here, I present deep data-driven models that incorporate biological and causal prior knowledge to model fundamental biological processes in genome editing and directed evolution. I first consider a model of DNA repair following CRISPR/Cas9 cleavage, which was generally thought to be unpredictable. In a large-scale dataset, I find signatures implicating an alternative and more predictable DNA repair pathway. I describe a model that accurately predicts genome editing outcomes by representing these competing but mechanistically independent repair pathways while flexibly learning unknown relationships from data. I use the model to discover a new genome editing strategy for efficiently and precisely correcting a class of disease-causing genetic mutations. Next, I consider a model for base editing, where I decompose a complex prediction problem into simpler subproblems and solve one with an autoregressive sequence-to-distribution of sequences model. The models enable the design of genome editing strategies with optimized outcomes for disease-causing mutations, and they enabled the first demonstration of transversion base editing by cytosine base editors, broadening the scope of base editing to potentially correct new classes of mutations. These models also broaden the scope of C to G base editors with restrictive sequence preferences.
Finally, I propose a method for reconstructing sequence-to-function datasets from directed evolution that can help increase the availability of datasets for machine learning for protein engineering. This method exploits the structure of a differential equation governing natural selection for efficient inference and is capable of proposing variants with higher activity than conventional methods. Incorporating prior knowledge and structure into models of natural phenomena can support scientific discovery.
Date issued
2021-06
URI
https://hdl.handle.net/1721.1/139524
Department
Massachusetts Institute of Technology. Computational and Systems Biology Program
Publisher
Massachusetts Institute of Technology

Collections
  • Doctoral Theses
