High-Performance Computational Genomics
Author(s)
Shajii, Ariya
Advisor
Berger, Bonnie
Amarasinghe, Saman
Abstract
Next-generation sequencing data is growing at an unprecedented rate, leading to new revelations in biology, healthcare, and medicine. Many researchers use high-level programming languages to navigate and analyze this data, but as gigabytes grow to terabytes or even petabytes, high-level languages become prohibitive and impractical for performance reasons. This thesis introduces Seq, a Python-based, domain-specific language for bioinformatics and genomics that combines the power and usability of high-level languages like Python with the performance of low-level languages like C or C++. Seq allows for shorter, simpler code, is readily usable by a novice programmer, and obtains significant performance improvements over existing languages and frameworks. Seq is showcased and evaluated by implementing a range of standard, widely-used applications from all stages of the genomics analysis pipeline, including genomic index construction, data pre- and post-processing, read mapping and alignment, and haplotype phasing. We show that the Seq implementations are up to an order of magnitude faster than existing hand-optimized implementations, with just a fraction of the code. Seq's substantial performance gains are made possible by a host of novel genomics-specific compiler optimizations that are out of reach for general-purpose compilers, coupled with a static type system that avoids all of Python's runtime overhead and object metadata. By enabling researchers of all backgrounds to easily implement high-performance analysis tools, Seq aims to act as a catalyst for scientific discovery and innovation. Finally, we also generalize many of the principles used by Seq to create a domain-configurable compiler called Codon, which can be applied to other domains with similar results.
Date issued
2021-09
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology