Machine learning for understanding protein sequence and structure

Bepler, Tristan(Tristan Wendland)

dc.contributor.advisor	Bonnie Berger.	en_US
dc.contributor.author	Bepler, Tristan(Tristan Wendland)	en_US
dc.contributor.other	Massachusetts Institute of Technology. Computational and Systems Biology Program.	en_US
dc.date.accessioned	2021-02-19T20:40:23Z
dc.date.available	2021-02-19T20:40:23Z
dc.date.copyright	2020	en_US
dc.date.issued	2020	en_US
dc.identifier.uri	https://hdl.handle.net/1721.1/129888
dc.description	Thesis: Ph. D., Massachusetts Institute of Technology, Computational and Systems Biology Program, February, 2020	en_US
dc.description	Cataloged from student-submitted PDF of thesis.	en_US
dc.description	Includes bibliographical references (pages 183-200).	en_US
dc.description.abstract	Proteins are the fundamental building blocks of life, carrying out a vast array of functions at the molecular level. Understanding these molecular machines has been a core problem in biology for decades. Recent advances in cryo-electron microscopy (cryoEM) have enabled high-resolution experimental measurement of proteins in their native states. However, this technology remains expensive and low throughput. At the same time, ever-growing protein databases offer new opportunities for understanding the diversity of natural proteins and for linking sequence to structure and function. This thesis introduces a variety of machine learning methods for accelerating protein structure determination by cryoEM and for learning from large protein databases. We first consider the problem of protein identification in the large images collected in cryoEM. We propose a positive-unlabeled learning framework that enables high-accuracy particle detection with few labeled data points, improving both data quality and analysis speed. Next, we develop a deep denoising model for cryo-electron micrographs. By learning the denoising model from large amounts of real cryoEM data, we are able to capture the noise generation process and accurately denoise micrographs, improving the ability of experimentalists to examine and interpret their data. We then introduce a neural network model for understanding continuous variability in proteins in cryoEM data by explicitly disentangling variation of interest (structure) from nuisance variation due to rotation and translation. Finally, we move beyond cryoEM and propose a method for learning vector embeddings of proteins using information from structure and sequence. Many of the machine learning methods developed here are general purpose and can be applied to other data domains.	en_US
dc.description.statementofresponsibility	by Tristan Bepler.	en_US
dc.format.extent	200 pages	en_US
dc.language.iso	eng	en_US
dc.publisher	Massachusetts Institute of Technology	en_US
dc.rights	MIT theses may be protected by copyright. Please reuse MIT thesis content according to the MIT Libraries Permissions Policy, which is available through the URL provided.	en_US
dc.rights.uri	http://dspace.mit.edu/handle/1721.1/7582	en_US
dc.subject	Computational and Systems Biology Program.	en_US
dc.title	Machine learning for understanding protein sequence and structure	en_US
dc.type	Thesis	en_US
dc.description.degree	Ph. D.	en_US
dc.contributor.department	Massachusetts Institute of Technology. Computational and Systems Biology Program	en_US
dc.identifier.oclc	1237266130	en_US
dc.description.collection	Ph.D. Massachusetts Institute of Technology, Computational and Systems Biology Program	en_US
dspace.imported	2021-02-19T20:39:53Z	en_US
mit.thesis.degree	Doctoral	en_US
mit.thesis.department	CSB	en_US

Files in this item

Name: 1237266130-MIT.pdf
Size: 22.13Mb
Format: PDF


This item appears in the following Collection(s)

Doctoral Theses
