Improved prediction and optimal sequencing strategies for genomic variant discovery via Bayesian nonparametrics

Author(s)
Masoero, Lorenzo
Advisor
Broderick, Tamara
Terms of use
In Copyright - Educational Use Permitted. Copyright MIT. http://rightsstatements.org/page/InC-EDU/1.0/
Abstract
Despite the advent of Big Data, data gathering in many domains remains an expensive process that requires careful planning under a fixed, limited budget. For instance, sequencing new genomic data is a complex procedure that requires careful tuning: researchers can spend resources to sequence a greater number of genomes (quantity) or to sequence genomes with increased accuracy (quality). In this thesis, I consider the common setting in which scientists have already conducted a pilot study to reveal variants in a genome and are contemplating a follow-up study. Spending additional resources has the potential to reveal new variation in the genome, and thereby new genetic insights. Practitioners are therefore interested in (i) predicting how many new discoveries they will make under different experimental design choices. In turn, they can use these predictions to allocate available resources optimally in the design of a future experiment, for example (ii) to maximize the number of future discoveries or (iii) to optimize the usefulness of the experiment for the task at hand, such as the power of an associated statistical test. I introduce novel methodologies to solve these three problems. My approach relies on a Bayesian nonparametric formulation that facilitates (i) prediction of the number of new variants in the follow-up study based on the pilot study. I show empirically that, when experimental conditions are kept constant between the pilot and the follow-up, my method's predictions are competitive with the best existing methods. Unlike existing methods, though, my method allows practitioners to change experimental conditions between the pilot and the follow-up. I demonstrate how this distinction allows my method to be used for more realistic predictions and for optimal allocation of a fixed budget between quality and quantity.
In particular, I first show how, under a fixed budget, my predictions can be used to maximize (ii) the number of new genomic variants discovered in a follow-up study. Finally, I show how my framework can guide practitioners in other experimental design problems, specifically how to achieve (iii) the highest possible power in statistical tests in the context of rare-variant association studies.
Date issued
2021-09
URI
https://hdl.handle.net/1721.1/140066
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Doctoral Theses
