FASTQSim: platform-independent data characterization and in silico read generation for NGS datasets

Shcherbina, Anna

dc.contributor.author	Shcherbina, Anna
dc.date.accessioned	2014-10-07T20:17:21Z
dc.date.available	2014-10-07T20:17:21Z
dc.date.issued	2014-08
dc.date.submitted	2014-04
dc.identifier.issn	1756-0500
dc.identifier.uri	http://hdl.handle.net/1721.1/90619
dc.description.abstract	Background: High-throughput next-generation sequencing technologies have enabled rapid characterization of clinical and environmental samples. Consequently, the largest bottleneck to actionable data has become sample processing and bioinformatics analysis, creating a need for accurate and rapid algorithms to process genetic data. Perfectly characterized in silico datasets are a useful tool for evaluating the performance of such algorithms. Background contaminating organisms are observed in sequenced mixtures of organisms. In silico samples provide exact truth. To create the best value for evaluating algorithms, in silico data should mimic actual sequencer data as closely as possible. Results: FASTQSim is a tool that provides the dual functionality of NGS dataset characterization and metagenomic data generation. FASTQSim is sequencing platform-independent, and computes distributions of read length, quality scores, indel rates, single point mutation rates, indel size, and similar statistics for any sequencing platform. To create training or testing datasets, FASTQSim can convert target sequences into in silico reads with specific error profiles obtained in the characterization step. Conclusions: FASTQSim enables users to assess the quality of NGS datasets. The tool provides information about read length, read quality, repetitive and non-repetitive indel profiles, and single base pair substitutions. FASTQSim allows the user to simulate individual read datasets that can be used as standardized test scenarios for planning sequencing projects or for benchmarking metagenomic software. In this regard, in silico datasets generated with the FASTQSim tool hold several advantages over natural datasets: they are sequencing platform independent, extremely well characterized, and less expensive to generate. Such datasets are valuable in a number of applications, including the training of assemblers for multiple platforms, benchmarking bioinformatics algorithm performance, and creating challenge datasets for detecting genetic engineering toolmarks.	en_US
dc.description.sponsorship	United States. Defense Threat Reduction Agency (Air Force Contract FA8721-05-C-0002)	en_US
dc.publisher	BioMed Central Ltd	en_US
dc.relation.isversionof	http://dx.doi.org/10.1186/1756-0500-7-533	en_US
dc.rights	Creative Commons Attribution	en_US
dc.rights.uri	http://creativecommons.org/licenses/by/2.0	en_US
dc.source	BioMed Central Ltd	en_US
dc.title	FASTQSim: platform-independent data characterization and in silico read generation for NGS datasets	en_US
dc.type	Article	en_US
dc.identifier.citation	Shcherbina, Anna. "FASTQSim: platform-independent data characterization and in silico read generation for NGS datasets." BMC Research Notes 2014, 7:533.	en_US
dc.contributor.department	Lincoln Laboratory	en_US
dc.contributor.mitauthor	Shcherbina, Anna	en_US
dc.relation.journal	BMC Research Notes	en_US
dc.eprint.version	Final published version	en_US
dc.type.uri	http://purl.org/eprint/type/JournalArticle	en_US
eprint.status	http://purl.org/eprint/status/PeerReviewed	en_US
dc.date.updated	2014-10-02T19:09:23Z
dc.language.rfc3066	en
dc.rights.holder	Anna Shcherbina et al.; licensee BioMed Central Ltd.
dspace.orderedauthors	Shcherbina, Anna	en_US
mit.license	PUBLISHER_CC	en_US
mit.metadata.status	Complete

Files in this item

Name: 1756-0500-7-533.pdf
Size: 3.676Mb
Format: PDF


This item appears in the following Collection(s)

MIT Open Access Articles
