Large-Scale Quality Analysis of Published ChIP-seq Data

Marinov, G. K.; Kundaje, A.; Park, P. J.; Wold, B. J.

dc.contributor.author	Kundaje, Anshul
dc.contributor.author	Marinov, Georgi K.
dc.contributor.author	Park, Peter J.
dc.contributor.author	Wold, Barbara J.
dc.date.accessioned	2014-05-30T14:50:19Z
dc.date.available	2014-05-30T14:50:19Z
dc.date.issued	2013-12
dc.date.submitted	2013-09
dc.identifier.issn	2160-1836
dc.identifier.uri	http://hdl.handle.net/1721.1/87581
dc.description.abstract	ChIP-seq has become the primary method for identifying in vivo protein–DNA interactions on a genome-wide scale, with nearly 800 publications involving the technique appearing in PubMed as of December 2012. Individually and in aggregate, these data are an important and information-rich resource. However, uncertainties about data quality confound their use by the wider research community. Recently, the Encyclopedia of DNA Elements (ENCODE) project developed and applied metrics to objectively measure ChIP-seq data quality. The ENCODE quality analysis was useful for flagging datasets for closer inspection, eliminating or replacing poor data, and for driving changes in experimental pipelines. There had been no similarly systematic quality analysis of the large and disparate body of published ChIP-seq profiles. Here, we report a uniform analysis of vertebrate transcription factor ChIP-seq datasets in the Gene Expression Omnibus (GEO) repository as of April 1, 2012. The majority (55%) of datasets scored as being highly successful, but a substantial minority (20%) were of apparently poor quality, and another ∼25% were of intermediate quality. We discuss how different uses of ChIP-seq data are affected by specific aspects of data quality, and we highlight exceptional instances for which the metric values should not be taken at face value. Unexpectedly, we discovered that a significant subset of control datasets (i.e., no immunoprecipitation and mock immunoprecipitation samples) display an enrichment structure similar to successful ChIP-seq data. This can, in turn, affect peak calling and data interpretation. Published datasets identified here as high-quality comprise a large group that users can draw on for large-scale integrated analysis. In the future, ChIP-seq quality assessment similar to that used here could guide experimentalists at early stages in a study, provide useful input in the publication process, and be used to stratify ChIP-seq data for different community-wide uses.	en_US
dc.language.iso	en_US
dc.publisher	Genetics Society of America	en_US
dc.relation.isversionof	http://dx.doi.org/10.1534/g3.113.008680	en_US
dc.rights	Creative Commons Attribution	en_US
dc.rights.uri	http://creativecommons.org/licenses/by/3.0/	en_US
dc.source	Genetics Society of America	en_US
dc.title	Large-Scale Quality Analysis of Published ChIP-seq Data	en_US
dc.type	Article	en_US
dc.identifier.citation	Marinov, G. K., A. Kundaje, P. J. Park, and B. J. Wold. “Large-Scale Quality Analysis of Published ChIP-Seq Data.” G3: Genes-Genomes-Genetics 4, no. 2 (March 13, 2014): 209–223.	en_US
dc.contributor.department	Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory	en_US
dc.contributor.mitauthor	Kundaje, Anshul	en_US
dc.relation.journal	G3: Genes-Genomes-Genetics	en_US
dc.eprint.version	Final published version	en_US
dc.type.uri	http://purl.org/eprint/type/JournalArticle	en_US
eprint.status	http://purl.org/eprint/status/PeerReviewed	en_US
dspace.orderedauthors	Marinov, G. K.; Kundaje, A.; Park, P. J.; Wold, B. J.	en_US
mit.license	PUBLISHER_CC	en_US
mit.metadata.status	Complete

Files in this item

Name: Marinov-2014-Large-scale quali ...
Size: 1.642Mb
Format: PDF


This item appears in the following Collection(s)

MIT Open Access Articles
