Scalable sketching and indexing algorithms for large biological datasets
Author(s)
Ekim, Bariş C.
DownloadThesis PDF (1.784Mb)
Advisor
Berger, Bonnie A.
Terms of use
Metadata
Show full item recordAbstract
DNA sequencing data continues to progress towards longer sequencing reads with increasingly lower error rates. In order to efficiently process the ever-growing collections of sequencing data, there is a crucial need for more time- and memory-efficient algorithms and data structures. In this thesis, we propose several ways to represent DNA sequences in order to mitigate some of these challenges in practical biological tasks. Firstly, we expand upon an existing k-mer (a substring of length k) -based approach, a universal hitting set (UHS), to sample a subset of locations on a DNA sequence. We show that UHSs can be efficiently constructed using a randomized parallel algorithm, and propose ways in which UHSs can be used in sketching and indexing sequences for downstream analysis. Secondly, we introduce the concept of minimizer-space sequencing data analysis, where a set of minimizers, rather than DNA nucleotides, are the atomic tokens of the alphabet. We propose that minimizer-space representations can be seamlessly applied to the problem of genome assembly, the task of reconstructing a genome from a collection of DNA sequences. By projecting sequences into ordered lists of minimizers, we claim that we can achieve orders-of-magnitude improvement in runtime and memory usage over existing methods without much loss of accuracy. We expect these approaches to be essential for downstream bioinformatics applications, such as read mapping, metagenomics, and pangenomics, as well as to provide ways to better store, search, and compress large collections of sequencing data.
Date issued
2022-09Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer SciencePublisher
Massachusetts Institute of Technology