Error and Error Mitigation in Low-Coverage Genome Assemblies

Hubisz, Melissa J.; Lin, Michael F.; Kellis, Manolis; Siepel, Adam

dc.contributor.author	Hubisz, Melissa J.
dc.contributor.author	Lin, Michael F.
dc.contributor.author	Kellis, Manolis
dc.contributor.author	Siepel, Adam
dc.date.accessioned	2011-08-26T15:59:48Z
dc.date.available	2011-08-26T15:59:48Z
dc.date.issued	2011-02
dc.date.submitted	2010-11
dc.identifier.issn	1932-6203
dc.identifier.uri	http://hdl.handle.net/1721.1/65407
dc.description.abstract	The recent release of twenty-two new genome sequences has dramatically increased the data available for mammalian comparative genomics, but twenty of these new sequences are currently limited to ~2× coverage. Here we examine the extent of sequencing error in these 2× assemblies, and its potential impact in downstream analyses. By comparing 2× assemblies with high-quality sequences from the ENCODE regions, we estimate the rate of sequencing error to be 1–4 errors per kilobase. While this error rate is fairly modest, sequencing error can still have surprising effects. For example, an apparent lineage-specific insertion in a coding region is more likely to reflect sequencing error than a true biological event, and the length distribution of coding indels is strongly distorted by error. We find that most errors are contributed by a small fraction of bases with low quality scores, in particular, by the ends of reads in regions of single-read coverage in the assembly. We explore several approaches for automatic sequencing error mitigation (SEM), making use of the localized nature of sequencing error, the fact that it is well predicted by quality scores, and information about errors that comes from comparisons across species. Our automatic methods for error mitigation cannot replace the need for additional sequencing, but they do allow substantial fractions of errors to be masked or eliminated at the cost of modest amounts of over-correction, and they can reduce the impact of error in downstream phylogenomic analyses. Our error-mitigated alignments are available for download.	en_US
dc.description.sponsorship	National Science Foundation (U.S.) (Faculty Early Career Development grant DBI-0644111)	en_US
dc.description.sponsorship	National Science Foundation (U.S.) (Faculty Early Career Development grant DBI-0644282)	en_US
dc.description.sponsorship	National Institutes of Health (U.S.) (grant U54 HG004555-01)	en_US
dc.description.sponsorship	David & Lucile Packard Foundation	en_US
dc.description.sponsorship	David & Lucile Packard Foundation (Fellowship for Science and Engineering)	en_US
dc.language.iso	en_US
dc.publisher	Public Library of Science	en_US
dc.relation.isversionof	http://dx.doi.org/10.1371/journal.pone.0017034	en_US
dc.rights	Creative Commons Attribution	en_US
dc.rights.uri	http://creativecommons.org/licenses/by/2.5/	en_US
dc.source	PLoS	en_US
dc.title	Error and Error Mitigation in Low-Coverage Genome Assemblies	en_US
dc.type	Article	en_US
dc.identifier.citation	Hubisz, Melissa J. et al. “Error and Error Mitigation in Low-Coverage Genome Assemblies.” Ed. Thomas Mailund. PLoS ONE 6.2 (2011): e17034.	en_US
dc.contributor.department	Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory	en_US
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science	en_US
dc.contributor.approver	Kellis, Manolis
dc.contributor.mitauthor	Kellis, Manolis
dc.relation.journal	PLoS ONE	en_US
dc.eprint.version	Final published version	en_US
dc.type.uri	http://purl.org/eprint/type/JournalArticle	en_US
eprint.status	http://purl.org/eprint/status/PeerReviewed	en_US
dspace.orderedauthors	Hubisz, Melissa J.; Lin, Michael F.; Kellis, Manolis; Siepel, Adam	en
mit.license	PUBLISHER_CC	en_US
mit.metadata.status	Complete

Files in this item

Name: Hubisz-2011-Error and Error ...
Size: 565.7Kb
Format: PDF

This item appears in the following Collection(s)

MIT Open Access Articles