High-Performance Computational Genomics

Name
Shajii-arshajii-PhD-EECS-2021-thesis.pdf
Description
Thesis PDF
Size
1.67 MB
Format
Adobe PDF
Checksum (MD5)
f47d1800d4c77ecfde4ed0a62258e2cb

Author(s)
Shajii, Ariya
Advisor(s)
Berger, Bonnie
Amarasinghe, Saman
Date Issued
September 2021
Publisher
Massachusetts Institute of Technology
Abstract
Next-generation sequencing data is growing at an unprecedented rate, leading to new revelations in biology, healthcare, and medicine. Many researchers use high-level programming languages to navigate and analyze this data, but as gigabytes grow to terabytes or even petabytes, high-level languages become impractical for performance reasons. This thesis introduces Seq, a Python-based, domain-specific language for bioinformatics and genomics that combines the power and usability of high-level languages like Python with the performance of low-level languages like C or C++. Seq allows for shorter, simpler code, is readily usable by a novice programmer, and obtains significant performance improvements over existing languages and frameworks. Seq is showcased and evaluated by implementing a range of standard, widely used applications from all stages of the genomics analysis pipeline, including genomic index construction, data pre- and post-processing, read mapping and alignment, and haplotype phasing. We show that the Seq implementations are up to an order of magnitude faster than existing hand-optimized implementations, with just a fraction of the code. Seq's substantial performance gains are made possible by a host of novel genomics-specific compiler optimizations that are out of reach for general-purpose compilers, coupled with a static type system that avoids all of Python's runtime overhead and object metadata. By enabling researchers of all backgrounds to easily implement high-performance analysis tools, Seq aims to act as a catalyst for scientific discovery and innovation. Finally, we generalize many of the principles used by Seq to create a domain-configurable compiler called Codon, which can be applied to other domains with similar results.
MIT Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Terms of Use
In Copyright - Educational Use Permitted
http://rightsstatements.org/page/InC-EDU/1.0/
Copyright MIT
Persistent DSpace Link
https://hdl.handle.net/1721.1/140081