Statistical Binning for Barcoded Reads Improves Downstream Analyses

Shajii, Ariya; Numanagic, Ibrahim; Whelan, Christopher; Berger Leighton, Bonnie

dc.contributor.author	Shajii, Ariya
dc.contributor.author	Numanagic, Ibrahim
dc.contributor.author	Whelan, Christopher
dc.contributor.author	Berger Leighton, Bonnie
dc.date.accessioned	2019-11-14T19:12:22Z
dc.date.available	2019-11-14T19:12:22Z
dc.date.issued	2018-08
dc.identifier.issn	2405-4712
dc.identifier.uri	https://hdl.handle.net/1721.1/122938
dc.description.abstract	Sequencing technologies are capturing longer-range genomic information at lower error rates, enabling alignment to genomic regions that are inaccessible with short reads. However, many methods are unable to align reads to much of the genome, recognized as important in disease, and thus report erroneous results in downstream analyses. We introduce EMA, a novel two-tiered statistical binning model for barcoded read alignment that first probabilistically maps reads to potentially multiple “read clouds” and then maps them within clouds by newly exploiting the non-uniform read densities characteristic of barcoded read sequencing. EMA substantially improves downstream accuracy over existing methods, including phasing and genotyping on 10x data, with fewer false variant calls in nearly half the time. EMA effectively resolves particularly challenging alignments in genomic regions that contain nearby homologous elements, uncovering variants in the pharmacogenomically important CYP2D region and in the clinically important genes C4 (schizophrenia) and AMY1A (obesity), which go undetected by existing methods. Our work provides a framework for future generation sequencing. Researchers are applying barcoded read sequencing to capture longer-range information in the genome at low error rates. We introduce a two-tiered statistical binning model, named EMA, which probabilistically assigns reads to “clouds” and then optimizes read assignments within clouds based on read densities. Unlike previous approaches, our efficient method enables alignment to highly homologous regions of the genome important in disease and substantially improves downstream genotyping and haplotyping. Our method also uncovers rare variants in clinically important genes. Keywords: third-generation sequencing; read mapping; barcoded short-reads; linked-reads	en_US
dc.description.sponsorship	National Institutes of Health (U.S.) (Grant GM108348)	en_US
dc.language.iso	en
dc.publisher	Elsevier BV	en_US
dc.relation.isversionof	http://dx.doi.org/10.1016/j.cels.2018.07.005	en_US
dc.rights	Creative Commons Attribution-NonCommercial-NoDerivs License	en_US
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/4.0/	en_US
dc.source	Elsevier	en_US
dc.title	Statistical Binning for Barcoded Reads Improves Downstream Analyses	en_US
dc.type	Article	en_US
dc.identifier.citation	Shajii, Ariya et al. "Statistical Binning for Barcoded Reads Improves Downstream Analyses." Cell Systems 7, 2 (2018): 219-226 © 2018 The Author(s)	en_US
dc.contributor.department	Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory	en_US
dc.contributor.department	Massachusetts Institute of Technology. Department of Mathematics	en_US
dc.relation.journal	Cell Systems	en_US
dc.eprint.version	Final published version	en_US
dc.type.uri	http://purl.org/eprint/type/JournalArticle	en_US
eprint.status	http://purl.org/eprint/status/PeerReviewed	en_US
dc.date.updated	2019-11-07T19:02:19Z
dspace.date.submission	2019-11-07T19:02:23Z
mit.journal.volume	7	en_US
mit.journal.issue	2	en_US

Files in this item

Name: 1-s2.0-S2405471218302849-main.pdf
Size: 2.623Mb
Format: PDF
Description: Published version

This item appears in the following Collection(s)

MIT Open Access Articles
