Show simple item record

dc.contributor.advisor: Kellis, Manolis
dc.contributor.author: Zhao, Sarah Ann
dc.date.accessioned: 2025-10-06T17:36:41Z
dc.date.available: 2025-10-06T17:36:41Z
dc.date.issued: 2025-05
dc.date.submitted: 2025-06-23T14:04:50.921Z
dc.identifier.uri: https://hdl.handle.net/1721.1/162955
dc.description.abstract: Uncertainty in nucleotide sequences is widespread in bioinformatics, arising from somatic mutations, population-level variation, sequencing errors, and ancestral state inference. Yet standard formats like FASTA encode DNA deterministically as ASCII characters, omitting this uncertainty and contributing to pervasive reference biases in genomics. Graph pangenomes have recently emerged to address these limitations by representing genetic variation across populations as bidirected graphs. While promising, these approaches are still developing and are not yet fully integrated with widely used linearly-referenced genomic tools and databases. To bridge this gap, I introduce pDNA (probabilistic DNA), a linearly-referenced data structure that encodes nucleotide-level uncertainty in a vector format compatible with traditional genomics workflows. Each position in a pDNA sequence is represented as a 4-dimensional probability vector over the four possible DNA nucleotides, inspired by position weight matrices and one-hot encodings. I also introduce pFASTA, a binary file format for efficient storage of pDNA sequences, along with an open-source software package for generating, manipulating, and analyzing these data. This framework enables uncertainty-aware sequence analysis while maintaining compatibility with existing genomics infrastructure. I apply this framework to ancestral sequence reconstruction.
dc.publisher: Massachusetts Institute of Technology
dc.rights: In Copyright - Educational Use Permitted
dc.rights: Copyright retained by author(s)
dc.rights.uri: https://rightsstatements.org/page/InC-EDU/1.0/
dc.title: Modeling Sequence Uncertainty in Comparative Genomics with a Probabilistic DNA Representation
dc.type: Thesis
dc.description.degree: M.Eng.
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree: Master
thesis.degree.name: Master of Engineering in Electrical Engineering and Computer Science
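
As an illustration of the representation described in the abstract, the following minimal Python sketch stores each sequence position as a length-4 probability vector over the nucleotides. The function names, the A/C/G/T ordering, and the list-based layout are assumptions made for this example only; they are not taken from the thesis or its software package, and the pFASTA binary format is not modeled here.

```python
# Hypothetical sketch: all names and the A, C, G, T ordering are illustrative
# assumptions, not the API of the thesis's software package.
NUCLEOTIDES = "ACGT"


def one_hot(base):
    """Encode a base known with certainty as a probability vector."""
    return [1.0 if base == n else 0.0 for n in NUCLEOTIDES]


def encode(seq):
    """Encode a deterministic DNA string as a list of one-hot vectors."""
    return [one_hot(b) for b in seq]


def is_valid(pdna, tol=1e-9):
    """Check that every position is a probability distribution over 4 bases."""
    return all(
        len(p) == 4 and all(x >= 0.0 for x in p) and abs(sum(p) - 1.0) <= tol
        for p in pdna
    )


def most_likely(pdna):
    """Collapse to a plain string by taking the most probable base per position."""
    return "".join(
        NUCLEOTIDES[max(range(4), key=lambda i: p[i])] for p in pdna
    )


# Start from a deterministic sequence, then mark position 1 as uncertain.
seq = encode("ACG")
seq[1] = [0.1, 0.7, 0.1, 0.1]  # mostly C, with some probability on A, G, T
```

Collapsing with `most_likely` recovers an ordinary FASTA-style string, which is one way such a representation could remain usable with linearly-referenced tools; the per-position probabilities are lost in that step.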

