Improved prediction and optimal sequencing strategies for genomic variant discovery via Bayesian nonparametrics

Author(s)

Masoero, Lorenzo

Advisor

Broderick, Tamara

Terms of use

In Copyright - Educational Use Permitted Copyright MIT http://rightsstatements.org/page/InC-EDU/1.0/

Abstract

Despite the advent of Big Data, gathering data in many domains is still expensive and requires careful planning under a fixed, limited budget. For instance, sequencing new genomic data is a complex procedure that requires careful tuning: researchers can spend resources to sequence a greater number of genomes (quantity) or to sequence genomes with increased accuracy (quality). In this thesis, I consider the common setting in which scientists have already conducted a pilot study to reveal variants in a genome and are contemplating a follow-up study. Spending additional resources has the potential to reveal new variations in the genome, and thereby new genetic insights. Practitioners are therefore interested in (i) predicting how many new discoveries they will make under different experimental design choices. In turn, they can use these predictions to allocate available resources optimally in the design of a future experiment, for example (ii) to maximize the number of future discoveries or (iii) to optimize the usefulness of a future experiment for the task at hand, such as the power of an associated statistical test.

In this thesis, I introduce novel methodologies to solve these three problems. My approach relies on a Bayesian nonparametric formulation that facilitates (i) prediction of the number of new variants in the follow-up study based on the pilot study. I show empirically that, when experimental conditions are kept constant between the pilot and the follow-up, my method's predictions are competitive with the best existing methods. Unlike existing methods, though, my method allows practitioners to change experimental conditions between the pilot and the follow-up. I demonstrate how this distinction allows my method to be used for more realistic predictions and for optimal allocation of a fixed budget between quality and quantity.

In particular, I first show how, under a fixed budget, my predictions can be used to maximize (ii) the number of new genomic variants discovered in a follow-up study. Finally, I show how my framework can guide practitioners in other experimental design problems, and specifically how to achieve (iii) the highest possible power in statistical tests in the context of rare-variant association studies.

Date issued

2021-09

URI

https://hdl.handle.net/1721.1/140066

Department

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science

Publisher

Massachusetts Institute of Technology

Collections

Doctoral Theses