What is new in newbler 2.6
Posted by lexnederbragt on July 12, 2011
The latest version of newbler, version 2.6, has some welcome additions for input and output. As I have so far only treated de novo assembly, I will skip the updates on the gsMapper (except for mentioning that it is now able to provide a bam file using the -bam option).
FASTQ files support
Newbler could already use sff files (including those from the IonTorrent, by the way!) and fasta/qual files (e.g. Sanger reads). Now, fastq files (a widely used format for Next Generation Sequencing, see http://en.wikipedia.org/wiki/Fastq) are also supported. In principle, one can now take any fastq file, including those downloaded from the NCBI Short Read Archive, and import it directly into newbler. Newbler should recognize the quality scoring version, and which reads are paired up (read1 and read2), based on the header text. I did a quick test on one such fastq file, and it seems to work.
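To give an idea of what 'recognizing the quality scoring version' and 'pairing based on the header text' can involve, here is a minimal Python sketch. It is not newbler's actual method, which I have not seen documented; the function names are my own, and the pairing check assumes the common convention of read names ending in /1 and /2.

```python
import re


def guess_phred_offset(quality_lines):
    """Guess the ASCII offset (33 or 64) from the lowest quality character seen.

    Heuristic only: characters below ';' (ASCII 59) cannot occur in the
    offset-64 encodings, so seeing one implies offset 33. If none is seen,
    offset 64 is assumed, which may be wrong for uniformly high-quality reads.
    """
    lowest = min(ord(ch) for line in quality_lines for ch in line)
    return 33 if lowest < 59 else 64


def split_pair_suffix(header):
    """Return (template name, mate number) for names ending in /1 or /2.

    Returns (name, None) when no such suffix is present. Assumes the
    '/1' and '/2' naming convention; other conventions are not handled.
    """
    name = header.lstrip("@").split()[0]
    match = re.fullmatch(r"(.+)/([12])", name)
    if match:
        return match.group(1), int(match.group(2))
    return name, None


if __name__ == "__main__":
    print(guess_phred_offset(["IIII#5"]))
    print(split_pair_suffix("@SRR001666.1/1"))
```

Two reads would then be treated as a pair when they share the same template name and carry mate numbers 1 and 2.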
Gap filling with repeat contigs
Some contigs, usually with high depth, represent collapsed repeats. These account for many of the gaps in scaffolds. With the new -scaffold flag, you can now ask newbler to place a copy of the repeat in the gap it forms, effectively closing the gap. This potentially leads to much more complete assemblies. Note, however, that a contig from collapsed repeats is the consensus sequence of all occurrences of the repeat. Newbler places an instance of this contig as it is, so if the actual repeat instances in the original genome have sequence variation, it introduces errors in the scaffolds. You will have to take this into account. Two new output files are generated when the -scaffold flag is set, 454ContigScaffolds.txt and 454ScaffoldContigs.fna/qual, which I will describe in the next blog post.
New sections have been added to the 454NewblerMetrics.txt file for assemblies that produce scaffolds, called ‘largeContigEndMetrics’, ‘scaffoldGapMetrics’ and ‘scaffoldEndMetrics’. An ‘edge’ represents reads that exit a contig or scaffold and enter another one. For contigs and scaffolds, these metrics report the number and percentage that have ‘NoEdges’, ‘OneEdge’, ‘TwoEdges’, or ‘ManyEdges’. For gaps in scaffolds, those that have ‘BothNoEdges’, ‘OneNoEdges’, ‘BothOneEdge’ and ‘MultiEdges’ are reported. Although there is little documentation about this, I understand that if there are many contigs (or scaffolds or gaps) with no edges, this could indicate that there are too few reads (too low coverage) to bridge them.
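As an illustration of how such a 'number and percentage' summary could be tallied, here is a small Python sketch. It is based on my reading of the category names only: I assume that 'ManyEdges' means more than two edges, and the input (a list of edge counts, one per contig or scaffold) is hypothetical, not something newbler writes out in this form.

```python
from collections import Counter

CATEGORIES = ("NoEdges", "OneEdge", "TwoEdges", "ManyEdges")


def edge_category(n_edges):
    """Map an edge count to a category name (assumed: more than 2 is 'ManyEdges')."""
    if n_edges == 0:
        return "NoEdges"
    if n_edges == 1:
        return "OneEdge"
    if n_edges == 2:
        return "TwoEdges"
    return "ManyEdges"


def summarize(edge_counts):
    """Return {category: (count, percentage)} for a list of edge counts."""
    tally = Counter(edge_category(n) for n in edge_counts)
    total = len(edge_counts)
    return {
        cat: (tally[cat], 100.0 * tally[cat] / total if total else 0.0)
        for cat in CATEGORIES
    }


if __name__ == "__main__":
    for cat, (count, pct) in summarize([0, 0, 1, 2, 5]).items():
        print(f"{cat}\t{count}\t{pct:.1f}%")
```

A high 'NoEdges' percentage in such a summary is what would hint at too low coverage.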
Increased assembly of splice variants in cDNA assembly
The new -isplit option will look for depth spikes in the read alignment for transcriptome datasets. Such a spike could be the result of a specific splice variant. Setting this flag results in using these spikes for starting the generation of isotigs, potentially resulting in more isotigs.
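To make the idea of a 'depth spike' a bit more concrete, here is a minimal Python sketch of one way to find them. How newbler defines a spike is not documented as far as I know, so the rule used here (depth above a multiple of the median depth) and the factor of 3 are purely my own assumptions for illustration.

```python
from statistics import median


def find_depth_spikes(depths, factor=3.0):
    """Return (start, end) index ranges, end exclusive, where depth > factor * median.

    'depths' is a list of per-position read depths along an alignment.
    The threshold rule is an illustrative assumption, not newbler's.
    """
    if not depths:
        return []
    threshold = factor * median(depths)
    spikes = []
    start = None
    for i, depth in enumerate(depths):
        if depth > threshold:
            if start is None:
                start = i  # a new spike begins here
        elif start is not None:
            spikes.append((start, i))  # the spike ended at the previous position
            start = None
    if start is not None:
        spikes.append((start, len(depths)))  # spike runs to the end
    return spikes


if __name__ == "__main__":
    print(find_depth_spikes([10, 10, 11, 50, 55, 12, 10, 9, 40]))
```

Each reported range would then be a candidate starting point for an additional isotig.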