Modeling Sequence Uncertainty in Comparative Genomics
with a Probabilistic DNA Representation

Zhao, Sarah Ann

Author(s)

Zhao, Sarah Ann

DownloadThesis PDF (2.458Mb)

Advisor

Kellis, Manolis

Terms of use

In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/

Metadata

Show full item record

Abstract

Uncertainty in nucleotide sequences is widespread in bioinformatics, arising from somatic mutations, population-level variation, sequencing errors, and ancestral state inference. Yet, standard formats like FASTA encode DNA deterministically using ASCII string characters, omitting this uncertainty and contributing to pervasive reference biases in genomics. Graph pangenomes have recently emerged to address these limitations by representing genetic variation across populations as bidirected graphs. While promising, these approaches are still developing and are not yet fully integrated with widely used linearly-referenced genomic tools and databases. To bridge this gap, I introduce pDNA (probabilistic DNA), a linearly-referenced data structure that encodes nucleotide-level uncertainty in a vector format compatible with traditional genomics workflows. Each position in a pDNA sequence is represented as a 4-dimension probability vector over the four possible DNA nucleotides, inspired by position weight matrices and one-hot encodings. I also introduce pFASTA, a binary file format for efficient storage of pDNA sequences, along with an open-source software package for generating, manipulating, and analyzing these data. This framework enables uncertainty-aware sequence analysis while maintaining compatibility with existing genomics infrastructure. I apply this framework to ancestral sequence reconstruction.

Date issued

2025-05

URI

https://hdl.handle.net/1721.1/162955

Department

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science

Publisher

Massachusetts Institute of Technology

Collections

Graduate Theses