Traversing the k-mer Landscape of NGS Read Datasets for Quality Score Sparsification
Author(s)
Berger Leighton, Bonnie; Yu, Yun William; Yorukoglu, Deniz
Download: nihms892487.pdf (562.2Kb)
Open Access Policy
Creative Commons Attribution-Noncommercial-Share Alike
Terms of use
Abstract
It is becoming increasingly impractical to indefinitely store raw sequencing data for later processing in an uncompressed state. In this paper, we describe a scalable compressive framework, Read-Quality-Sparsifier (RQS), which substantially outperforms the compression ratio and speed of other de novo quality score compression methods while maintaining SNP-calling accuracy. Surprisingly, RQS also improves the SNP-calling accuracy on a gold-standard, real-life sequencing dataset (NA12878) using a k-mer density profile constructed from 77 other individuals from the 1000 Genomes Project. This improvement in downstream accuracy emerges from the observation that quality score values within NGS datasets are inherently encoded in the k-mer landscape of the genomic sequences. To our knowledge, RQS is the first scalable sequence-based quality compression method that can efficiently compress quality scores of terabyte-sized and larger sequencing datasets.
Availability: An implementation of our method, RQS, is available for download at: http://rqs.csail.mit.edu/.
© 2014 Springer International Publishing Switzerland.
Keywords: RQS; quality score; sparsification; compression; accuracy; variant calling
Date issued
2014-04
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science; Massachusetts Institute of Technology. Department of Mathematics
Journal
Research in Computational Molecular Biology
Publisher
Springer Nature
Citation
Yu, Y. William, et al. “Traversing the K-Mer Landscape of NGS Read Datasets for Quality Score Sparsification.” Research in Computational Molecular Biology, edited by Roded Sharan, vol. 8394, Springer International Publishing, 2014, pp. 385–99.
Version: Author's final manuscript
ISBN
978-3-319-05268-7
978-3-319-05269-4
ISSN
0302-9743
1611-3349