Optimization of de novo transcriptome assembly from high-throughput short read sequencing data improves functional annotation for non-model organisms

Haznedaroglu, Berat Z; Reeves, Darryl; Rismani-Yazdi, Hamid; Peccia, Jordan

dc.contributor.author	Rismani-Yazdi, Hamid
dc.contributor.author	Haznedaroglu, Berat Z.
dc.contributor.author	Reeves, Darryl
dc.contributor.author	Peccia, Jordan
dc.date.accessioned	2013-01-30T19:03:37Z
dc.date.available	2013-01-30T19:03:37Z
dc.date.issued	2012-07
dc.date.submitted	2012-02
dc.identifier.issn	1471-2105
dc.identifier.uri	http://hdl.handle.net/1721.1/76672
dc.description.abstract	Background: The k-mer hash length is a key factor affecting the output of de novo transcriptome assembly packages using de Bruijn graph algorithms. Assemblies constructed with varying single k-mer choices might result in the loss of unique contiguous sequences (contigs) and relevant biological information. A common solution to this problem is the clustering of single k-mer assemblies. Even though annotation is one of the primary goals of a transcriptome assembly, the evaluation of assembly strategies does not consider the impact of k-mer selection on the annotation output. This study provides an in-depth k-mer selection analysis that is focused on the degree of functional annotation achieved for a non-model organism where no reference genome information is available. Individual k-mers and clustered assemblies (CA) were considered using three representative software packages. Pair-wise comparison analyses (between individual k-mers and CAs) were produced to reveal missing Kyoto Encyclopedia of Genes and Genomes (KEGG) ortholog identifiers (KOIs), and to determine a strategy that maximizes the recovery of biological information in a de novo transcriptome assembly.
Results: Analyses of single k-mer assemblies resulted in the generation of various quantities of contigs and functional annotations within the selection window of k-mers (k-19 to k-63). For each k-mer in this window, generated assemblies contained certain unique contigs and KOIs that were not present in the other k-mer assemblies. Producing a non-redundant CA of k-mers 19 to 63 resulted in a more complete functional annotation than any single k-mer assembly. However, a fraction of unique annotations remained (~0.19 to 0.27% of total KOIs) in the assemblies of individual k-mers (k-19 to k-63) that were not present in the non-redundant CA. A workflow to recover these unique annotations is presented.
Conclusions: This study demonstrated that different k-mer choices result in various quantities of unique contigs per single k-mer assembly, which affects the biological information that is retrievable from the transcriptome. This undesirable effect can be minimized, but not eliminated, by clustering multi-k assemblies with redundancy removal. The complete extraction of biological information in de novo transcriptomics studies requires both the production of a CA and efforts to identify unique contigs that are present in individual k-mer assemblies but not in the CA.	en_US
dc.language.iso	en_US
dc.publisher	Biomed Central Ltd.	en_US
dc.relation.isversionof	http://dx.doi.org/10.1186/1471-2105-13-170	en_US
dc.rights	Creative Commons Attribution	en_US
dc.rights.uri	http://creativecommons.org/licenses/by/2.0	en_US
dc.source	BioMed Central	en_US
dc.title	Optimization of de novo transcriptome assembly from high-throughput short read sequencing data improves functional annotation for non-model organisms	en_US
dc.type	Article	en_US
dc.identifier.citation	Haznedaroglu, Berat Z et al. “Optimization of De Novo Transcriptome Assembly from High-throughput Short Read Sequencing Data Improves Functional Annotation for Non-model Organisms.” BMC Bioinformatics 13.1 (2012): 170.	en_US
dc.contributor.department	Massachusetts Institute of Technology. Department of Chemical Engineering	en_US
dc.contributor.mitauthor	Rismani-Yazdi, Hamid
dc.relation.journal	BMC Bioinformatics	en_US
dc.eprint.version	Final published version	en_US
dc.type.uri	http://purl.org/eprint/type/JournalArticle	en_US
eprint.status	http://purl.org/eprint/status/PeerReviewed	en_US
dspace.orderedauthors	Haznedaroglu, Berat Z; Reeves, Darryl; Rismani-Yazdi, Hamid; Peccia, Jordan	en
mit.license	PUBLISHER_CC	en_US
mit.metadata.status	Complete
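
The pair-wise comparison described in the abstract (individual k-mer assemblies versus the clustered assembly) can be illustrated as plain set arithmetic on KOI lists. The following is a minimal sketch, not the authors' workflow: the function names, the input layout (a dict mapping each k to a set of KOIs), and the `KOI_a`-style identifiers are hypothetical toy values chosen for illustration and are not data from the study.

```python
from collections import Counter
from typing import Dict, Set


def kois_unique_to_one_k(per_k_kois: Dict[int, Set[str]]) -> Dict[int, Set[str]]:
    """For each k, return the KOIs that occur in that assembly and in no other."""
    counts = Counter(koi for kois in per_k_kois.values() for koi in kois)
    return {
        k: {koi for koi in kois if counts[koi] == 1}
        for k, kois in per_k_kois.items()
    }


def kois_missing_from_ca(
    per_k_kois: Dict[int, Set[str]], ca_kois: Set[str]
) -> Dict[int, Set[str]]:
    """For each k, return the KOIs annotated in that assembly but absent from the CA."""
    return {k: kois - ca_kois for k, kois in per_k_kois.items()}


def recovered_annotation(
    per_k_kois: Dict[int, Set[str]], ca_kois: Set[str]
) -> Set[str]:
    """Union of the CA annotation and every single k-mer annotation."""
    recovered = set(ca_kois)
    for kois in per_k_kois.values():
        recovered |= kois
    return recovered


# Toy input (hypothetical identifiers, not results from the paper).
per_k_kois = {
    19: {"KOI_a", "KOI_b", "KOI_c"},
    31: {"KOI_a", "KOI_b", "KOI_d"},
    63: {"KOI_a", "KOI_e"},
}
ca_kois = {"KOI_a", "KOI_b", "KOI_c", "KOI_d"}

unique = kois_unique_to_one_k(per_k_kois)
missing = kois_missing_from_ca(per_k_kois, ca_kois)
full = recovered_annotation(per_k_kois, ca_kois)

for k in sorted(per_k_kois):
    print(k, sorted(unique[k]), sorted(missing[k]))
print(len(ca_kois), len(full))
```

In this toy case the clustered assembly lacks one identifier that only the k = 63 assembly carries, which mirrors the abstract's point that a CA reduces but does not eliminate the loss of unique annotations.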

Files in this item

Name: Haznedaroglu-2012-Optimization ...
Size: 1.643Mb
Format: PDF

This item appears in the following Collection(s)

MIT Open Access Articles
