dc.contributor.advisor | Bonnie Berger. | en_US |
dc.contributor.author | Yu, Yun William | en_US |
dc.contributor.other | Massachusetts Institute of Technology. Department of Mathematics. | en_US |
dc.date.accessioned | 2017-12-20T18:15:50Z | |
dc.date.available | 2017-12-20T18:15:50Z | |
dc.date.copyright | 2017 | en_US |
dc.date.issued | 2017 | en_US |
dc.identifier.uri | http://hdl.handle.net/1721.1/112879 | |
dc.description | Thesis: Ph. D., Massachusetts Institute of Technology, Department of Mathematics, 2017. | en_US |
dc.description | Cataloged from PDF version of thesis. | en_US |
dc.description | Includes bibliographical references (pages 187-197). | en_US |
dc.description.abstract | Disparate biological datasets often exhibit similar well-defined structure; efficient algorithms can be designed to exploit this structure. In this doctoral thesis, we present a framework for similarity search based on entropy and fractal dimension; here, we prove that a clustered search algorithm scales in time with metric entropy number of covering hyperspheres-if the fractal dimension is low. Using these ideas, entropy-scaling versions of standard bioinformatics search tools can be designed, including for small-molecule, metagenomics, and protein structure search. This 'compressive acceleration' approach taking advantage of redundancy and sparsity in biological data can be leveraged also for next-generation sequencing (NGS) read mapping. By pairing together a clustered grouping over similar reads and a homology table for similarities in the human genome, our CORA framework can accelerate all-mapping by several orders of magnitude. Additionally, we also present work on filtering empirical base-calling quality scores from Next Generation Sequencing data. By using the sparsity of k-mers of sufficient length in the human genome and imposing a human prior through the use of frequent k-mers in a large corpus of human DNA reads, we are able to quickly discard over 90% of the information found in those quality scores while retaining or even improving downstream variant-calling accuracy. This filtering step allows for fast lossy compression of quality scores. | en_US |
dc.description.statementofresponsibility | by Yun William Yu. | en_US |
dc.format.extent | 197 pages | en_US |
dc.language.iso | eng | en_US |
dc.publisher | Massachusetts Institute of Technology | en_US |
dc.rights | MIT theses are protected by copyright. They may be viewed, downloaded, or printed from this source but further reproduction or distribution in any format is prohibited without written permission. | en_US |
dc.rights.uri | http://dspace.mit.edu/handle/1721.1/7582 | en_US |
dc.subject | Mathematics. | en_US |
dc.title | Compressive algorithms for search and storage in biological data | en_US |
dc.type | Thesis | en_US |
dc.description.degree | Ph. D. | en_US |
dc.contributor.department | Massachusetts Institute of Technology. Department of Mathematics | |
dc.identifier.oclc | 1014342717 | en_US |