DSpace@MIT (MIT Libraries)

Modeling Sequence Uncertainty in Comparative Genomics with a Probabilistic DNA Representation

Author(s)
Zhao, Sarah Ann
Download: Thesis PDF (2.458Mb)
Advisor
Kellis, Manolis
Terms of use
In Copyright - Educational Use Permitted. Copyright retained by author(s). https://rightsstatements.org/page/InC-EDU/1.0/
Abstract
Uncertainty in nucleotide sequences is widespread in bioinformatics, arising from somatic mutations, population-level variation, sequencing errors, and ancestral state inference. Yet standard formats such as FASTA encode DNA deterministically as strings of ASCII characters, discarding this uncertainty and contributing to pervasive reference bias in genomics. Graph pangenomes have recently emerged to address these limitations by representing genetic variation across populations as bidirected graphs. While promising, these approaches are still developing and are not yet fully integrated with widely used linearly referenced genomic tools and databases. To bridge this gap, I introduce pDNA (probabilistic DNA), a linearly referenced data structure that encodes nucleotide-level uncertainty in a vector format compatible with traditional genomics workflows. Each position in a pDNA sequence is represented as a 4-dimensional probability vector over the four DNA nucleotides, inspired by position weight matrices and one-hot encodings. I also introduce pFASTA, a binary file format for efficient storage of pDNA sequences, along with an open-source software package for generating, manipulating, and analyzing these data. This framework enables uncertainty-aware sequence analysis while maintaining compatibility with existing genomics infrastructure. I apply this framework to ancestral sequence reconstruction.
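The per-position probability vector described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis software or the pFASTA layout: the function names (from_string, consensus, entropy_bits), the A, C, G, T column order, and the uniform error_rate model are all assumptions made for illustration only.

```python
import math

BASES = "ACGT"  # column order is an assumption of this sketch


def one_hot(base):
    """Probability vector that puts all mass on a single base."""
    return [1.0 if b == base else 0.0 for b in BASES]


def from_string(seq, error_rate=0.0):
    """Encode a plain DNA string as a list of 4-element probability vectors.

    error_rate is a hypothetical per-base error probability, spread evenly
    over the three other bases. 'N' becomes the uniform vector.
    """
    vectors = []
    for base in seq.upper():
        if base == "N":
            vectors.append([0.25] * 4)
        elif base in BASES:
            vectors.append(
                [1.0 - error_rate if b == base else error_rate / 3.0 for b in BASES]
            )
        else:
            raise ValueError(f"unsupported character: {base!r}")
    return vectors


def consensus(pdna):
    """Collapse to a plain string by taking the most probable base.

    Ties (for example a uniform vector) resolve to the first base in BASES,
    which shows how a deterministic string loses information.
    """
    return "".join(BASES[max(range(4), key=lambda i: v[i])] for v in pdna)


def entropy_bits(vector):
    """Shannon entropy of one position in bits (0 = certain, 2 = uniform)."""
    return -sum(p * math.log2(p) for p in vector if p > 0.0)


pdna = from_string("ACGN", error_rate=0.03)
per_position_entropy = [entropy_bits(v) for v in pdna]
```

In this sketch the sequence stays linearly indexed, so position i of the string and position i of the vector list refer to the same coordinate, which is the property that keeps such a representation compatible with coordinate-based tools.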
Date issued
2025-05
URI
https://hdl.handle.net/1721.1/162955
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Graduate Theses
