Scalable sketching and indexing algorithms for large biological datasets

Author(s)

Ekim, Bariş C.

Advisor

Berger, Bonnie A.

Terms of use

In Copyright - Educational Use Permitted. Copyright MIT. http://rightsstatements.org/page/InC-EDU/1.0/

Abstract

DNA sequencing continues to move toward longer reads with lower error rates. To process the ever-growing collections of sequencing data efficiently, there is a crucial need for more time- and memory-efficient algorithms and data structures. In this thesis, we propose several ways to represent DNA sequences that mitigate some of these challenges in practical biological tasks. First, we expand upon the universal hitting set (UHS), an existing approach based on k-mers (substrings of length k), to sample a subset of locations on a DNA sequence. We show that UHSs can be constructed efficiently using a randomized parallel algorithm, and propose ways in which UHSs can be used to sketch and index sequences for downstream analysis. Second, we introduce the concept of minimizer-space sequencing data analysis, in which minimizers, rather than DNA nucleotides, are the atomic tokens of the alphabet. We propose that minimizer-space representations can be applied directly to genome assembly, the task of reconstructing a genome from a collection of DNA sequences. By projecting sequences into ordered lists of minimizers, we claim that we can achieve orders-of-magnitude improvements in runtime and memory usage over existing methods without much loss of accuracy. We expect these approaches to be essential for downstream bioinformatics applications such as read mapping, metagenomics, and pangenomics, and to provide better ways to store, search, and compress large collections of sequencing data.
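
As a rough illustration of "projecting sequences into ordered lists of minimizers", the following Python sketch selects short substrings (l-mers) whose hash falls below a density threshold and then treats the selected substrings as tokens. This is a minimal sketch under stated assumptions, not the thesis implementation: the function names, the BLAKE2b hash, the selection rule, and the parameters `l`, `density`, and `k` are all hypothetical choices made for illustration.

```python
import hashlib
from typing import List, Tuple


def lmer_hash(lmer: str) -> int:
    """Deterministic 64-bit hash of an l-mer (the hash choice is an assumption)."""
    digest = hashlib.blake2b(lmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def to_minimizer_space(seq: str, l: int = 4, density: float = 0.25) -> List[str]:
    """Project a nucleotide sequence onto an ordered list of selected l-mers.

    An l-mer is kept when its hash is below density * 2**64, so whether it is
    selected depends only on the l-mer itself, not on its neighbours.
    """
    threshold = int(density * 2**64)
    return [
        seq[i:i + l]
        for i in range(len(seq) - l + 1)
        if lmer_hash(seq[i:i + l]) < threshold
    ]


def minimizer_space_kmers(tokens: List[str], k: int = 3) -> List[Tuple[str, ...]]:
    """Slide a window of k tokens over the list, treating each token as one 'letter'."""
    return [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]


if __name__ == "__main__":
    read = "ACGTTGCATGCCGATAGCTAGCTAGGATCCGATCGATTACG"
    tokens = to_minimizer_space(read, l=4, density=0.25)
    print(len(read), "nucleotides ->", len(tokens), "tokens")
    print(minimizer_space_kmers(tokens, k=3))
```

With a density well below 1, the token list is much shorter than the nucleotide sequence, which is the intuition behind the runtime and memory savings the abstract describes; the actual gains and accuracy depend on the scheme and parameters used in the thesis.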

Date issued

2022-09

URI

https://hdl.handle.net/1721.1/147392

Department

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science

Publisher

Massachusetts Institute of Technology

Collections

Graduate Theses