Show simple item record

dc.contributor.advisorBonnie Berger.en_US
dc.contributor.authorPalmer, Nathan Patricken_US
dc.contributor.otherMassachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science.en_US
dc.date.accessioned2012-01-12T19:32:04Z
dc.date.available2012-01-12T19:32:04Z
dc.date.copyright2011en_US
dc.date.issued2011en_US
dc.identifier.urihttp://hdl.handle.net/1721.1/68493
dc.descriptionThesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2011.en_US
dc.descriptionCataloged from PDF version of thesis.en_US
dc.descriptionIncludes bibliographical references (p. 238-256).en_US
dc.description.abstractModern computational biology is awash in large-scale data mining problems. Several high-throughput technologies have been developed that enable us, with relative ease and little expense, to evaluate the coordinated expression levels of tens of thousands of genes, evaluate hundreds of thousands of single-nucleotide polymorphisms, and sequence individual genomes. The data produced by these assays has provided the research and commercial communities with the opportunity to derive improved clinical prognostic indicators, as well as develop an understanding, at the molecular level, of the systemic underpinnings of a variety of diseases. Aside from the statistical methods used to evaluate these assays, another, more subtle challenge is emerging. Despite the explosive growth in the amount of data being generated and submitted to the various publicly available data repositories, very little attention has been paid to managing the phenotypic characterization of their samples (i.e., managing class labels in a controlled fashion). If sense is to be made of the underlying assay data, the samples' descriptive metadata must first be standardized in a machine-readable format. In this thesis, we explore these issues, specifically within the context of curating and analyzing a large DNA microarray database. We address three main challenges. First, we acquire a large subset of a publicly available microarray repository and develop a principled method for extracting phenotype information from freetext sample labels, then use that information to generate an index of the sample's medically-relevant annotation. The indexing method we develop, Concordia, incorporates pre-existing expert knowledge relating to the hierarchical relationships between medical terms, allowing queries of arbitrary specificity to be efficiently answered. Second, we describe a highly flexible approach to answering the question: "Given a previously unseen gene expression sample, how can we compute its similarity to all of the labeled samples in our database, and how can we utilize those similarity scores to predict the phenotype of the new sample?" Third, we describe a method for identifying phenotype-specific transcriptional profiles within the context of this database, and explore a method for measuring the relative strength of those signatures across the rest of the database, allowing us to identify molecular signatures that are shared across various tissues ad diseases. These shared fingerprints may form a quantitative basis for optimal therapy selection and drug repositioning for a variety of diseases.en_US
dc.description.statementofresponsibilityby Nathan Patrick Palmer.en_US
dc.format.extent256 p.en_US
dc.language.isoengen_US
dc.publisherMassachusetts Institute of Technologyen_US
dc.rightsM.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission.en_US
dc.rights.urihttp://dspace.mit.edu/handle/1721.1/7582en_US
dc.subjectElectrical Engineering and Computer Science.en_US
dc.titleData mining techniques for large-scale gene expression analysisen_US
dc.typeThesisen_US
dc.description.degreePh.D.en_US
dc.contributor.departmentMassachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
dc.identifier.oclc770409532en_US


Files in this item

Thumbnail

This item appears in the following Collection(s)

Show simple item record