Protecting Genomic Data Privacy with Probabilistic Modeling

Simmons, Sean; Berger Leighton, Bonnie; Sahinalp, Cenk

dc.contributor.author	Simmons, Sean
dc.contributor.author	Berger Leighton, Bonnie
dc.contributor.author	Sahinalp, Cenk
dc.date.accessioned	2019-11-26T20:52:22Z
dc.date.available	2019-11-26T20:52:22Z
dc.date.issued	2019
dc.identifier.isbn	978-981-3279-81-0
dc.identifier.uri	https://hdl.handle.net/1721.1/123095
dc.description.abstract	As genetic sequencing becomes less expensive and data sets linking genetic data and medical records (e.g., Biobanks) become larger and more common, issues of data privacy and computational challenges become more necessary to address in order to realize the benefits of these datasets. One possibility for alleviating these issues is through the use of already-computed summary statistics (e.g., slopes and standard errors from a regression model of a phenotype on a genotype). If groups share summary statistics from their analyses of biobanks, many of the privacy issues and computational challenges concerning the access of these data could be bypassed. In this paper we explore the possibility of using summary statistics from simple linear models of phenotype on genotype in order to make inferences about more complex phenotypes (those that are derived from two or more simple phenotypes). We provide exact formulas for the slope, intercept, and standard error of the slope for linear regressions when combining phenotypes. Derived equations are validated via simulation and tested on a real data set exploring the genetics of fatty acids. Keywords: privacy; biobank; genetics; genome-wide association study; single nucleotide variant; computational challenges; data security; phenotypes	en_US
dc.language.iso	en
dc.publisher	World Scientific	en_US
dc.relation.isversionof	http://dx.doi.org/10.1142/9789813279827_0037	en_US
dc.rights	Creative Commons Attribution NonCommercial License 4.0	en_US
dc.rights.uri	https://creativecommons.org/licenses/by-nc/4.0/	en_US
dc.source	World Scientific	en_US
dc.title	Protecting Genomic Data Privacy with Probabilistic Modeling	en_US
dc.type	Article	en_US
dc.identifier.citation	Gasdaska, Angela et al. "Leveraging summary statistics to make inferences about complex phenotypes in large biobanks." Biocomputing 2019, January 2019, Kohala Coast, Hawaii, USA, World Scientific, 2018 © 2018 The Authors	en_US
dc.contributor.department	Massachusetts Institute of Technology. Department of Mathematics	en_US
dc.contributor.department	Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory	en_US
dc.relation.journal	Biocomputing 2019	en_US
dc.eprint.version	Final published version	en_US
dc.type.uri	http://purl.org/eprint/type/ConferencePaper	en_US
eprint.status	http://purl.org/eprint/status/NonPeerReviewed	en_US
dc.date.updated	2019-11-07T18:58:59Z
dspace.date.submission	2019-11-07T18:59:04Z

Files in this item

Name: 9789813279827_0037.pdf
Size: 1.178Mb
Format: PDF
Description: Published version

This item appears in the following Collection(s)

MIT Open Access Articles
