Since version 2.3, newbler has a -cdna option for de novo transcriptome assembly. In this post, I’ll explain the principles and setting up the transcriptome assembly. The next post will discuss the output of a transcriptome assembly.
1) Principles of transcriptome assembly
As with other assembly projects, the first steps for transcriptome assembly are identical, and newbler builds a contig graph, see this post. Ideally, the reads coming from the transcript of a certain gene should result in a single contig. However, because of splice-variants (and other sequence particularities), there may be several contigs for each transcript, which themselves form a small contig graph. Splice-variants will result in reads that , relative to other reads have an insert (representing an additional exon in the transcript), thereby breaking the contig graph, see the figure.
So, for transcriptome projects, there will be numerous subgraphs each potentially representing one gene. Each of these subgraphs are called an isogroup. Next, newbler will traverse the contigs in the subgraphs of each isogroup to generate transcript variants, which are called isotigs, again, see the figure. There are certain rules for this traversing step, for example, for starting the path and for ending it. Another rule, for complex graphs, is a cutoff such that no more than a maximum number of isotigs are generated per isogroup (by default set to 100 isotigs). If fully traversing the graph will result in more isotigs than this maximum, the contigs of this isogroup are reported in the output instead of the isotigs. Read the rest of this entry »