Modeling Sequence Uncertainty in Comparative Genomics with a Probabilistic DNA Representation
Author(s)
Zhao, Sarah Ann
Advisor
Kellis, Manolis
Abstract
Uncertainty in nucleotide sequences is widespread in bioinformatics, arising from somatic mutations, population-level variation, sequencing errors, and ancestral state inference. Yet standard formats like FASTA encode DNA deterministically as ASCII characters, omitting this uncertainty and contributing to pervasive reference biases in genomics. Graph pangenomes have recently emerged to address these limitations by representing genetic variation across populations as bidirected graphs. While promising, these approaches are still developing and are not yet fully integrated with widely used linearly-referenced genomic tools and databases. To bridge this gap, I introduce pDNA (probabilistic DNA), a linearly-referenced data structure that encodes nucleotide-level uncertainty in a vector format compatible with traditional genomics workflows. Each position in a pDNA sequence is represented as a four-dimensional probability vector over the four possible DNA nucleotides, inspired by position weight matrices and one-hot encodings. I also introduce pFASTA, a binary file format for efficient storage of pDNA sequences, along with an open-source software package for generating, manipulating, and analyzing these data. This framework enables uncertainty-aware sequence analysis while maintaining compatibility with existing genomics infrastructure. I apply this framework to ancestral sequence reconstruction.
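The per-position probability vector described in the abstract can be illustrated with a minimal sketch. This is not the thesis's software package or the pFASTA format: the function names (`one_hot`, `from_counts`, `most_likely`), the A, C, G, T column order, and the uniform fallback for unknown or unobserved positions are all assumptions made for illustration.

```python
# Hypothetical illustration only; not the thesis's pDNA or pFASTA implementation.
BASES = "ACGT"  # assumed column order for each probability vector


def one_hot(seq):
    """Encode a plain DNA string as per-position probability vectors.

    Characters outside A/C/G/T (e.g. N) become a uniform distribution,
    which is an assumption of this sketch.
    """
    vectors = []
    for ch in seq.upper():
        if ch in BASES:
            vectors.append([1.0 if ch == base else 0.0 for base in BASES])
        else:
            vectors.append([0.25] * 4)
    return vectors


def from_counts(counts):
    """Normalize per-position base counts into probability vectors."""
    vectors = []
    for row in counts:
        total = sum(row)
        # With no observations, fall back to a uniform distribution.
        vectors.append([c / total for c in row] if total else [0.25] * 4)
    return vectors


def most_likely(pdna):
    """Collapse to a deterministic string by taking the most probable base.

    Ties resolve to the first base in BASES order.
    """
    return "".join(BASES[max(range(4), key=row.__getitem__)] for row in pdna)


certain = one_hot("ACGT")
uncertain = from_counts([[9, 1, 0, 0], [0, 5, 5, 0], [0, 0, 0, 0]])
print(most_likely(certain))
print(uncertain[0])
```

A deterministic FASTA-style string is the special case in which every vector is one-hot; collapsing with `most_likely` discards exactly the uncertainty that the probabilistic representation is meant to retain.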
Date issued
2025-05
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology