A computational framework for the identification, cataloging, and classification of evolutionary conserved genomic DNA
Author(s)
Saluja, Sunil K. (Sunil Kumar), 1968-
Other Contributors
Harvard University--MIT Division of Health Sciences and Technology.
Advisor
Isaac S. Kohane.
Abstract
Evolutionarily conserved genomic regions (ecores) are understudied, yet they make up a very large percentage of the Human Genome. Highly conserved human-mouse non-coding ecores, for example, are more abundant within the Human Genome than the regions currently estimated to encode proteins. Subsets of these ecores also exhibit conservation that extends across several species. These genomic regions have survived millions of years of evolution even though they do not appear to encode proteins directly. The survival of these regions compels us to investigate their potential function. Development of a computational framework for the classification and clustering of these regions may be the first step in understanding their function. The need for a standardized framework is underscored by the explosive growth in the number of publicly available, fully sequenced genomes and by the diverse set of methodologies used to generate cross-species alignments. This project describes the design and implementation of a system for the identification, classification, and cataloguing of ecores across multiple species. A key feature of this system is its ability to quickly incorporate new genomes and assemblies as they become available. Additionally, the system provides investigators with a feature-rich user interface that facilitates the retrieval of ecores based on a wide range of parameters. The system returns a dynamically annotated list of evolutionarily conserved regions, which is used as input to several classification schemes aimed at identifying families of ecores that share similar features, including depth of evolutionary conservation, position relative to known genes, sequence similarity, and content of transcription factor binding sites. Families of ecores have already been retrieved by the system and clustered using this feature space, and are currently awaiting biological validation.
Description
Thesis (S.M.)--Harvard-MIT Division of Health Sciences and Technology, 2004. Includes bibliographical references (leaves 27-29).
Date issued
2004
Department
Harvard University--MIT Division of Health Sciences and Technology
Publisher
Massachusetts Institute of Technology
Keywords
Harvard University--MIT Division of Health Sciences and Technology.