In the post on what is new in newbler version 2.6, I introduced the -scaffold option. Briefly, with this option instances (i.e. the consensus sequence) of repeats are placed in gaps. As I mentioned, setting -scaffold results in two extra files. With this post, I will explain these in detail.
Archive for the ‘How it works’ Category
Posted by lexnederbragt on July 12, 2011
Posted by lexnederbragt on February 9, 2010
I thought to start by explaining briefly how newbler works. I’ll do this by following the output newbler generates during the assembly process. This information is displayed during assembly, and can also be found in the 454NewblerProgress.txt file. It is a good thing anyways to have a look at this file, as it sometimes displays certain warnings (see below).
This example assembly is based on a read dataset consisting of both shotgun reads, and paired end reads (for more on 454 paired end reads, have a look here).
The first thing you’ll see is a message stating that the assembly computation started, and which version of newbler you used.
Then, you’ll see messages for each input file saying Indexing XXXXXXX.sff…, and a counter. During indexing, newbler scans the input file, performs some checks and trims the reads (sometimes more than the base-calling software already did). One of the checks is for possible 3′ and 5′ primers: if a certain percentage of reads contains the same sequence on either the 3′ or 5′ end, this is mentioned. I’ve had some surprises here, such as finding out that reads I got from another group contained an adaptor sequence, which caused problems during the assembly. More on primer removal later…
If an input sff file contains paired end reads, this will be mentioned, as well as the number of reads that contained the paired end linker sequence, for example:
224024 reads, 58599257 bases, 112080 paired reads.
Setting up long overlap detection…
XXXXX reads to align
Building a tree for YYYYYY seeds…
Computing long overlap alignments…
The first phase of assembly is finding overlap between reads. Newbler splits this phase into one for long reads (this goes very fast) and shorter reads (can take quite some time). As aligning all reads against each other would take too long time, newbler (and many other programs) actually make seeds, 16-mers of each read, where each seed starts 12 bases upstream of the previous one. These seed length and step sizes can be changed if you want (I’ve never tried this, though). When two different reads have identical seeds the program tries to extend the overlap between the reads until the minimum overlap (default 40 bp) with the minimum alignment percentage default 90%) has been reached. These settings can also be changed and influence the alignment stringency, this I will come back to in a later post. Read the rest of this entry »