Extracting regulatory signals from DNA sequences using syntactic pattern discovery
Author(s)
Gupta, Vipin, 1978-
Alternative title
Capstone report : commercialization of bioinformatics tools
Other Contributors
Massachusetts Institute of Technology. Dept. of Chemical Engineering.
Advisor
Gregory Stephanopoulos.
Abstract
One of the major challenges facing biologists is to understand the mechanisms governing the regulation of gene expression. Completely sequenced genomes, together with emerging DNA microarray technologies, have enabled the measurement of gene expression levels in cell cultures and opened new possibilities for studying gene regulation. A fundamental sub-problem in unraveling regulatory interactions in both prokaryotes and eukaryotes is to identify common binding sites or promoters in the regulatory regions of genes. For a gene's mRNA to be expressed, a class of proteins called transcription factors must bind to the cis-regulatory elements on the DNA sequence upstream of the gene, to enhance RNA polymerase binding and hence initiate transcription. These binding sites are believed to be located within several hundred base pairs upstream of the respective ORFs. Biological methods for discovering regulatory binding sites are slow and time-consuming. To address this problem, several heuristic-based computational methods have been developed, each following one of two approaches: sequence-driven or pattern-driven.
In this dissertation, we propose a novel approach for finding shared motifs in DNA sequences based on an exhaustive pattern enumeration algorithm that combines the benefits of the pattern-driven and sequence-driven approaches. We developed TABS, a method that identifies local regions of high similarity by clustering statistically significant patterns to obtain putative binding sites. The method assumes minimal a priori information about the sites and can detect signals in a subset of the input sequences, making it amenable to motif discovery in gene clusters obtained from microarray experiments. The performance of the algorithm was validated on synthetic as well as real datasets. When tested on a set of 30 well-studied regulons in Escherichia coli, with known instances of regulatory motifs collected from the biological literature, the algorithm showed, in 14 cases, a high sensitivity and specificity of 70% and 80%, respectively. TABS was shown to perform better than two other popular state-of-the-art motif-finding algorithms. In addition, its applicability to synthetic microarray-like data was demonstrated. Several significant novel motifs detected by the algorithm were reported; these form good targets for investigating regulatory function through biological experiments.
Description
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Chemical Engineering, 2004. Includes bibliographical references.
Date issued
2004
Department
Massachusetts Institute of Technology. Department of Chemical Engineering
Publisher
Massachusetts Institute of Technology
Keywords
Chemical Engineering.