High-Performance Computational Genomics
Name
Shajii-arshajii-PhD-EECS-2021-thesis.pdf
Description
Thesis PDF
Size
1.67 MB
Format
Adobe PDF
Checksum (MD5)
f47d1800d4c77ecfde4ed0a62258e2cb
Author(s)
Shajii, Ariya
Advisor(s)
Berger, Bonnie
Amarasinghe, Saman
Date Issued
September 2021
Publisher
Massachusetts Institute of Technology
Abstract
Next-generation sequencing data is growing at an unprecedented rate, leading to new revelations in biology, healthcare, and medicine. Many researchers use high-level programming languages to navigate and analyze this data, but as gigabytes grow to terabytes or even petabytes, high-level languages become impractical for performance reasons. This thesis introduces Seq, a Python-based, domain-specific language for bioinformatics and genomics that combines the power and usability of high-level languages like Python with the performance of low-level languages like C or C++. Seq allows for shorter, simpler code, is readily usable by a novice programmer, and obtains significant performance improvements over existing languages and frameworks. Seq is showcased and evaluated by implementing a range of standard, widely used applications from all stages of the genomics analysis pipeline, including genomic index construction, data pre- and post-processing, read mapping and alignment, and haplotype phasing. We show that the Seq implementations are up to an order of magnitude faster than existing hand-optimized implementations, with just a fraction of the code. Seq's substantial performance gains are made possible by a host of novel genomics-specific compiler optimizations that are out of reach for general-purpose compilers, coupled with a static type system that avoids all of Python's runtime overhead and object metadata. By enabling researchers of all backgrounds to easily implement high-performance analysis tools, Seq aims to act as a catalyst for scientific discovery and innovation. Finally, we generalize many of the principles used by Seq to create a domain-configurable compiler called Codon, which can be applied to other domains with similar results.
MIT Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Terms of Use
In Copyright - Educational Use Permitted
Copyright MIT
Persistent DSpace Link