Compressive algorithms for search and storage in biological data
Author(s): Yu, Yun William
Massachusetts Institute of Technology. Department of Mathematics.
Disparate biological datasets often exhibit similar well-defined structure; efficient algorithms can be designed to exploit this structure. In this doctoral thesis, we present a framework for similarity search based on entropy and fractal dimension; here, we prove that a clustered search algorithm scales in time with the metric entropy (number of covering hyperspheres), if the fractal dimension is low. Using these ideas, entropy-scaling versions of standard bioinformatics search tools can be designed, including for small-molecule, metagenomics, and protein structure search. This 'compressive acceleration' approach, which takes advantage of redundancy and sparsity in biological data, can also be leveraged for next-generation sequencing (NGS) read mapping. By pairing a clustered grouping over similar reads with a homology table for similarities in the human genome, our CORA framework can accelerate all-mapping by several orders of magnitude. Additionally, we present work on filtering empirical base-calling quality scores from NGS data. By using the sparsity of k-mers of sufficient length in the human genome and imposing a human prior through the use of frequent k-mers in a large corpus of human DNA reads, we are able to quickly discard over 90% of the information found in those quality scores while retaining or even improving downstream variant-calling accuracy. This filtering step allows for fast lossy compression of quality scores.
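The clustered search idea in the abstract can be illustrated with a small sketch. This is not the thesis implementation: the function names (`build_clusters`, `clustered_search`, `brute_force_search`), the greedy clustering step, and the use of Euclidean distance on tuples are all assumptions made for illustration. The sketch groups points into balls of a fixed radius, compares the query against the ball centers first (the coarse step, whose cost grows with the number of covering balls), and then uses the triangle inequality to decide which balls need a full scan (the fine step).

```python
from math import dist  # Euclidean distance, Python 3.8+


def build_clusters(points, radius):
    """Greedily cover `points` with balls of the given radius.

    Returns a list of (center, members) pairs; every member lies
    within `radius` of its center. Hypothetical helper for illustration.
    """
    clusters = []
    for p in points:
        for center, members in clusters:
            if dist(p, center) <= radius:
                members.append(p)
                break
        else:
            # No existing ball covers p, so p becomes a new center.
            clusters.append((p, [p]))
    return clusters


def clustered_search(query, clusters, radius, r):
    """Return every point within distance r of `query`.

    Coarse step: compare the query with each cluster center.
    Fine step: scan only clusters whose center is within r + radius,
    since by the triangle inequality no other cluster can hold a hit.
    """
    hits = []
    for center, members in clusters:
        if dist(query, center) <= r + radius:
            hits.extend(m for m in members if dist(query, m) <= r)
    return hits


def brute_force_search(query, points, r):
    """Reference linear scan, used to check the clustered result."""
    return [p for p in points if dist(query, p) <= r]


if __name__ == "__main__":
    pts = [(x * 0.1, y * 0.1) for x in range(20) for y in range(20)]
    cl = build_clusters(pts, radius=0.5)
    found = clustered_search((0.55, 1.25), cl, radius=0.5, r=0.3)
    print(len(cl), "clusters;", len(found), "hits")
```

Because the pruning rule only skips clusters that cannot contain a match, the clustered search returns the same set of points as the linear scan; the saving comes from skipping whole clusters when the data are redundant.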
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Mathematics, 2017. Cataloged from PDF version of thesis. Includes bibliographical references (pages 187-197).
Department: Massachusetts Institute of Technology. Department of Mathematics.
Massachusetts Institute of Technology