Computational regulatory genomics : motifs, networks, and dynamics
Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science.
Gene regulation, the process responsible for taking a static genome and producing the diversity and complexity of life, is largely mediated through the sequence-specific binding of regulators. The short, degenerate nature of the recognized elements and the unknown rules through which they interact make deciphering gene regulation a significant challenge. In this thesis, we utilize comparative genomics and other approaches to exploit large-scale experimental datasets and better understand the sequence elements and regulators responsible for regulatory programs. In particular, we develop new computational approaches to (1) predict the binding sites of regulators using the genomes of many closely related species; (2) understand the sequence motifs associated with transcription factors; (3) discover and characterize microRNAs, an important class of regulators; (4) use static predictions for binding sites in conjunction with chromatin modifications to better understand the dynamics of regulation; and (5) systematically validate the predicted motif instances using a massively parallel reporter assay. We find that the predictions made by our algorithms are of high quality and are comparable to those made by leading experimental approaches. Moreover, we find that experimental and computational approaches are often complementary. Regions experimentally identified to be bound by a factor can be species- and cell-line-specific, but they lack the resolution and unbiased nature of our predictions. Experimentally identified miRNAs have unmistakable signs of being processed, but cannot provide the same insights our machine learning framework does. Further emphasizing the importance of integration, combining chromatin mark annotations and gene expression from multiple cell types with our static motif instances increases our power and yields additional biologically relevant insights.
We successfully apply the algorithms in this thesis to 29 mammals and 12 flies and expect them to be applicable to other clades of eukaryotic species. Moreover, we find that our performance has not yet plateaued and believe these methods will continue to be relevant as sequencing becomes increasingly commonplace and thousands of genomes become available.
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Cataloged from student-submitted PDF version of thesis. Includes bibliographical references (p. 147-169).
Department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Massachusetts Institute of Technology
Electrical Engineering and Computer Science.