dc.contributor.advisor: Gregory Stephanopoulos. [en_US]
dc.contributor.author: Styczynski, Mark Philip-Walter [en_US]
dc.contributor.other: Massachusetts Institute of Technology. Dept. of Chemical Engineering. [en_US]
dc.date.accessioned: 2007-09-28T13:24:00Z
dc.date.available: 2007-09-28T13:24:00Z
dc.date.copyright: 2007 [en_US]
dc.date.issued: 2007 [en_US]
dc.identifier.uri: http://hdl.handle.net/1721.1/38976
dc.description: Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Chemical Engineering, 2007. [en_US]
dc.description: Includes bibliographical references (p. 437-458). [en_US]
dc.description.abstract: Sequential motif discovery, the ability to identify conserved patterns in ordered datasets without a priori knowledge of exactly what those patterns will be, is a frequently encountered and difficult problem in computational biology and biochemical engineering. The most prevalent example of such a problem is finding conserved DNA sequences in the upstream regions of genes that are believed to be coregulated. Other examples are as diverse as identifying conserved secondary structure in proteins and interpreting time-series data. This thesis creates a unified, generic approach to addressing these (and other) problems in sequential motif discovery and demonstrates the utility of that approach on a number of applications. A generic motif discovery algorithm was created for the purpose of finding conserved patterns in arbitrary data types. This approach and implementation, named Gemoda, decouples three key steps in the motif discovery process: comparison, clustering, and convolution. Because it decouples these steps, Gemoda is a modular algorithm; that is, any comparison metric can be used with any clustering algorithm and any convolution scheme. The comparison metric is a data-specific function that transforms the motif discovery problem into a solvable graph-theoretic problem that still adequately represents the important similarities in the data. [en_US]
dc.description.abstract: (cont.) This thesis presents the development of Gemoda as well as applications of this approach in a number of different contexts. One application is an exhaustive solution of an abstraction of the transcription factor binding site discovery problem in DNA. A similar application is the analysis of upstream regions of regulons in microbial DNA. Another application is the identification of protein sequence homologies in a set of related proteins in the presence of significant noise. A quite different application is the discovery of extended local secondary structure homology between a protein and a protein complex known to be in the same structural family. The final application is the analysis of metabolomic datasets. The diversity of these sample applications, which range from the analysis of strings (like DNA and amino acid sequences) to real-valued data (like protein structures and metabolomic datasets), demonstrates that our generic approach is successful and useful for solving established and novel problems alike. The last application, analyzing metabolomic datasets, is of particular interest. Using Gemoda, an appropriate comparison function, and appropriate data handling, a novel and useful approach to the interpretation of metabolite profiling datasets obtained from gas chromatography coupled to mass spectrometry is developed. [en_US]
dc.description.abstract: (cont.) The use of a motif discovery approach expands the scope of metabolites that can be tracked and analyzed in an untargeted metabolite profiling (or metabolomic) experiment. This new approach, named SpectConnect, is presented herein along with validation experiments that verify its efficacy and utility. The beginning of a broader application of SpectConnect's potential is presented as well. The success of SpectConnect, a novel application of Gemoda, validates the utility of a truly generic approach to motif discovery. By not getting bogged down in the specifics of a type of data and a problem unique to that type of data, a broader class of problems can be addressed that otherwise would have been extremely difficult to handle. [en_US]
dc.description.statementofresponsibility: by Mark Philip-Walter Styczynski. [en_US]
dc.format.extent: 458 p. [en_US]
dc.language.iso: eng [en_US]
dc.publisher: Massachusetts Institute of Technology [en_US]
dc.rights: M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. [en_US]
dc.rights.uri: http://dspace.mit.edu/handle/1721.1/7582
dc.subject: Chemical Engineering. [en_US]
dc.title: Applications of motif discovery in biological data [en_US]
dc.type: Thesis [en_US]
dc.description.degree: Ph.D. [en_US]
dc.contributor.department: Massachusetts Institute of Technology. Department of Chemical Engineering
dc.identifier.oclc: 166330592 [en_US]

