How newbler works

Posted by lexnederbragt on February 9, 2010

I thought I would start by briefly explaining how newbler works. I’ll do this by following the output newbler generates during the assembly process. This information is displayed during assembly and can also be found in the 454NewblerProgress.txt file. It is worth looking at this file anyway, as it sometimes contains warnings (see below).

This example assembly is based on a read dataset consisting of both shotgun reads, and paired end reads (for more on 454 paired end reads, have a look here).

The first thing you’ll see is a message stating that the assembly computation started, and which version of newbler you used.

Then, you’ll see messages for each input file saying Indexing XXXXXXX.sff…, and a counter. During indexing, newbler scans the input file, performs some checks and trims the reads (sometimes more than the base-calling software already did). One of the checks is for possible 3′ and 5′ primers: if a certain percentage of reads contains the same sequence on either the 3′ or 5′ end, this is mentioned. I’ve had some surprises here, such as finding out that reads I got from another group contained an adaptor sequence, which caused problems during the assembly. More on primer removal later…

If an input sff file contains paired end reads, this will be mentioned, as well as the number of reads that contained the paired end linker sequence, for example:

224024 reads, 58599257 bases, 112080 paired reads.

Setting up long overlap detection…
XXXXX reads to align
Building a tree for YYYYYY seeds…
Computing long overlap alignments…

The first phase of assembly is finding overlaps between reads. Newbler splits this phase into one for long reads (this goes very fast) and one for shorter reads (this can take quite some time). As aligning all reads against each other would take too long, newbler (like many other programs) makes seeds: 16-mers of each read, where each seed starts 12 bases upstream of the previous one. The seed length and step size can be changed if you want (I’ve never tried this, though). When two different reads have identical seeds, the program tries to extend the overlap between the reads until the minimum overlap (default 40 bp) with the minimum alignment percentage (default 90%) has been reached. These settings can also be changed, and they influence the alignment stringency; I will come back to this in a later post.
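To make the seeding idea concrete, here is a minimal Python sketch. It is not newbler's code: the function names are made up for illustration, and the choice to scan the second read at every position when looking up seeds is an assumption (two reads whose offset is not a multiple of the step would otherwise never share a seed). Only the seed length of 16 and the step of 12 come from the description above.

```python
SEED_LENGTH = 16  # seed size mentioned in the post
SEED_STEP = 12    # distance between consecutive seed starts mentioned in the post


def make_seeds(read, length=SEED_LENGTH, step=SEED_STEP):
    """Return (offset, seed) pairs: one seed of `length` bases every `step` bases."""
    return [(i, read[i:i + length])
            for i in range(0, len(read) - length + 1, step)]


def seed_hits(read_a, read_b, length=SEED_LENGTH, step=SEED_STEP):
    """Return (offset_in_a, offset_in_b) for every seed of read_a found in read_b.

    Assumption for this sketch: read_b is scanned one base at a time.
    Each hit is only a candidate; a real assembler would now try to extend
    it into a full overlap and check minimum length and identity.
    """
    seeds = {}
    for offset, seed in make_seeds(read_a, length, step):
        seeds.setdefault(seed, []).append(offset)
    hits = []
    for j in range(len(read_b) - length + 1):
        for offset in seeds.get(read_b[j:j + length], []):
            hits.append((offset, j))
    return hits
```

A hit such as (24, 4) would mean that the seed starting at base 24 of the first read also occurs at base 4 of the second read, which suggests the two reads overlap with the second read starting 20 bases into the first.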

When reads overlap they can be used to generate consensus contigs. The illustration at the top of the post shows an ideal situation with reads that all are the same length and have no variation in sequence in the overlap. In real assemblies, reads will actually show differences, and the contig sequence is based on a consensus estimate.

After long overlap follows short overlap:

Setting up overlap detection…
XXXXXXX reads to align
Building a tree for YYYYYYYY seeds…
Computing alignments…

Then the curious message appears:

Checkpointing…

Basically, checkpointing means writing the intermediate results to disc, so that in the case of a crash, you could continue the assembly from the last ‘checkpoint’.

At this point, newbler, like many other assemblers, has created a contig graph. Aligned reads form the ‘nodes’ (the contigs), and reads going from one contig to another form the ‘edges’. For example, a small part of the graph could look like this:

Why would reads go from contig 1 to contig 3, as well as from contig 2 to contig 3? If the sequence from contig 3 is repeated in the genome, the reads coming from these repeats are very similar and collapse into a single contig. At the beginning and end of this contig there will be reads extending into the respective neighboring, single copy genomic regions. Essentially, here is the problem with assembly: repeats cause a complicated contig graph structure.

The graph in the above picture can be simplified to this:

After aligning all the reads, the contig graph potentially has many nodes and edges. The size and complexity of the graph depend on the size of the genome and the repeat structure. The ‘real’ genome is a path through the graph visiting all nodes.
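As a rough illustration of such a graph, the sketch below stores the contigs as nodes and the read-supported connections as edges, then flags contigs with more than one neighbour on either side as possible collapsed repeats. This is a toy example and not how newbler represents or detangles its graph: contig4, contig5 and the read counts are invented for the illustration.

```python
from collections import defaultdict

# Edges of a toy contig graph: (from_contig, to_contig) -> number of reads
# that start in the first contig and continue into the second.
# contig4, contig5 and all counts are hypothetical.
EDGES = {
    ("contig1", "contig3"): 5,
    ("contig2", "contig3"): 4,
    ("contig3", "contig4"): 6,
    ("contig3", "contig5"): 3,
}


def repeat_candidates(edges):
    """Return contigs that have more than one predecessor or successor."""
    incoming = defaultdict(set)
    outgoing = defaultdict(set)
    for source, target in edges:
        outgoing[source].add(target)
        incoming[target].add(source)
    nodes = set(incoming) | set(outgoing)
    return sorted(
        node for node in nodes
        if len(incoming.get(node, ())) > 1 or len(outgoing.get(node, ())) > 1
    )


print(repeat_candidates(EDGES))  # prints ['contig3']
```

Here contig3 is entered from two different contigs and left towards two others, which is the pattern a repeat collapsed into a single contig would produce.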

Newbler continues:

Detangling alignments…
-> Phase 1…
-> Phase 2…
-> Level 1, Phase 3, Round 1…
…
-> Level 4, Phase 9, Round 2…
Checkpointing…

At this stage, the graph is simplified; I’ll leave out the details.

Setting up short read alignment detection…
-> 39593 of 39593…
Mapping short reads to consensi…

Paired end read halves shorter than a certain size are not assembled during alignment, but mapped to the contigs afterwards. The longer halves are assembled with the shotgun reads. See below.

Adding mapped short reads to alignments…
Checkpointing…
Building contigs/scaffolds…

Using all uniquely mapped paired end halves, the contigs can be ordered and oriented. At least two paired reads must link two contigs for them to be scaffolded. To estimate the paired end library insert size, newbler uses the paired end reads where both halves map to the same contig (the estimate is reported in the 454NewblerMetrics.txt file). This insert size is then used to estimate the gap size between contigs. So, a scaffold is a series of contig-gap-contig-gap-contig and so on.
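The arithmetic behind this can be sketched as follows. This is only an illustration of the general idea and an assumption about the calculation, not newbler's actual method: the function names and the coordinate conventions are made up, and a plain mean is used where a real program would probably also consider the spread of the insert sizes.

```python
from statistics import mean

MIN_LINKING_PAIRS = 2  # from the post: at least two pairs must link two contigs


def estimate_insert_size(same_contig_pairs):
    """Mean outer distance of pairs whose halves map to the same contig.

    same_contig_pairs: list of (left_start, right_end) positions on one contig.
    """
    return mean(right_end - left_start
                for left_start, right_end in same_contig_pairs)


def estimate_gap(insert_size, left_contig_length, linking_pairs):
    """Estimate the gap between two contigs from pairs that span it.

    linking_pairs: list of (left_start, right_end), where left_start is the
    start of one half on the left contig and right_end is the end of the
    other half, counted from the start of the right contig.
    Returns None when there are too few linking pairs to scaffold.
    """
    if len(linking_pairs) < MIN_LINKING_PAIRS:
        return None
    gaps = [
        # insert size minus the part covered on each of the two contigs
        insert_size - (left_contig_length - left_start) - right_end
        for left_start, right_end in linking_pairs
    ]
    return mean(gaps)
```

For example, with an estimated insert size of 3000 bp, a pair that covers the last 2000 bp of the left contig and the first 500 bp of the right contig suggests a gap of about 500 bp.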

Next, newbler reports on how many scaffolds and contigs it found

-> XX scaffolds, YYY large contigs

(‘Large’ refers to contigs of at least 500 bp), and starts producing output:

Computing signals…
Reading flowgrams…
Checkpointing…
Generating output…
Reading flowgrams…

Here, newbler reads in all information from the flowgrams (from the sff files; these include the light intensity during sequencing for each read) and calculates the consensus signal for each position in the contig. A consensus quality score is also calculated for each base. Several output files are written during this phase.

Finally, newbler tells you it finished successfully.

There is a lot more that can be said about the process of assembly, but this post is long enough already. Next time: what do the different output files actually contain?

Feel free to post questions below!

This entry was posted on February 9, 2010 at 17:04 and is filed under How it works. Tagged: 454NewblerProgress.txt, alignment, assembly, contig, contig graph, newbler, scaffold.

61 Responses to “How newbler works”

Vladimir said

May 19, 2010 at 07:39
Thanks for the nice explanation. I still have one question, though. How long does it usually take Newbler to read flowgrams? My assembly got to that point pretty fast, but reading flowgrams has already taken quite a while. Thanks.
- contig said
  
  May 19, 2010 at 08:22
  Yes, for a large assembly (many reads) this phase can be quite time consuming. One thing you could do is use the -qo flag. From the manual:
  
  “Flag to generate quick output for mapping and assembly. Disables signal distribution computation for calling consensus sequences and can decrease accuracy.”
  Using this flag saves time during the output phase. However, your output will be slightly different than the normal, full output. There will be base changes, and slightly different contig/scaffold metrics (number, N50 length).
  
  Using -qo is good for getting a first impression of an assembly, but always use the full output for the final analysis!
  
  Hope this helps!
Brian said

June 28, 2010 at 18:31
flxlex, that is a very good summary, thank you. I’m doing some metagenomics work and running newbler with -large. What I’m finding is that newbler is not including singletons in its 454AllContigs files, but I really want to have them included. Should I reduce the value given to -a? Or is more required?
- flxlex said
  
  June 29, 2010 at 09:46
  You’re correct, newbler does not make a file with singletons. Unfortunately, there is no setting to make it do that. The only option you have is by extracting the singleton read IDs from the 454ReadStatus.txt file, and using the sfffile and sffinfo commands to generate a fasta file of them. From the manual:
  
  grep Singleton 454ReadStatus.txt > singles.txt
  sfffile -o singles.sff -i singles.txt sff/*
  sffinfo -s singles.sff > singles.fna
  
  Here, sff/* stands for the folder containing all the sff files used for the assembly.
  - Brian said
    
    June 29, 2010 at 14:06
    flxlex, thanks once again, this is a straightforward workaround. I couldn’t find this one myself, no manual!
    
    I have to add that I think this is something of a “scientific bug”. In metagenomics you really want all reads, not just contigs. Perhaps the creators of newbler think that a singleton is somehow unvalidated, but it’s not the job of the assembler to make that determination, there are many good ways of determining whether some singleton can be trusted (e.g. match to a database entry, count of matches to the sequence or species that matches the singleton).
  - flxlex said
    
    June 29, 2010 at 14:16
    Brian, I agree. Other assemblers do output singletons. Where do we send feature requests for newbler? ;-)
Josh said

October 20, 2010 at 17:42
I was reading that Newbler requires 454 software for output, I was wondering if you could verify that. Also, is there a known upper limit to the amount of reads Newbler can successfully run on a given machine?
- flxlex said
  
  October 21, 2010 at 10:36
  What do you mean by ‘Newbler requires 454 software for output’? Newbler is a piece of 454 software. It uses 454 reads as input, but can also take in fasta files of other read types. If you mean to visually inspect the newbler output, you might be able to use the GUI version (gsAssembler/gsMapper instead of runAssembly/runMapping).
  
  On your second question: the only upper limit I am aware of is the amount of memory your machine has. Also, for large projects, more cpus (on a shared memory machine) help, as well as the -large option.
Lee said

October 28, 2010 at 04:32
I would love it if you made an article on how each individual consensus base gets its quality score. How is each quality derived? Is it different in each assembler? Can you show an example?
- flxlex said
  
  October 28, 2010 at 19:21
  I would love to know this myself, but I’m afraid I don’t…
marble said

December 8, 2010 at 08:31
Flxlex, thank you for the great introduction, you are a great help!
There is a question, though. Since there is an obligatory field that sets the ‘minimum read length’ (the default is 20) and a minimum overlap of 40, how is it possible to generate contigs of length e.g. 2 or 3 nucleotides?
- flxlex said
  
  December 17, 2010 at 14:44
  I am considering writing a blog post about the ultra-short contigs, so just be patient a little more…
  - Delphine said
    
    January 27, 2011 at 17:54
    Be patient again? I’m waiting for this post!! Come on Flxlex ;-)
  - flxlex said
    
    April 5, 2011 at 16:14
    Thanks for your patience, here is the post:
    
    Newbler output IV: on ultra-short and single-read contigs
Praveen said

March 2, 2011 at 20:21
I see many contigs with only 1 or 2 bases, and I cannot understand why. I am doing cDNA assembly. I just used the default parameters and I did not provide fasta qual values but only the fasta sequences.
- flxlex said
  
  March 15, 2011 at 14:19
  A blog entry on these ultra-short contigs is forthcoming…
Steven Sullivan said

March 25, 2011 at 20:01
I have what is probably a basic question: in the contig graph graphic https://contig.files.wordpress.com/2010/02/contig-graph.jpg, why are the first three reads at the top of contig 3 — which appear to be the 3′ ends of the bottom three reads in contig 1 – not assembled into contig 1 instead? (At a more basic level I guess I am asking, what makes an assembler split a read into different contigs?)
- flxlex said
  
  March 28, 2011 at 15:16
  You are pointing to a fundamental difference between assemblers: some do what you say and never split a read (i.e. never place one read into more than one contig), while others, such as newbler, do split reads, usually at the borders of repeats. In this way, newbler chooses the alignment over read ‘integrity’. The advantage is that it is easier to see which contigs are neighbors (e.g. flanking a repeat) from where the reads are aligned.
Nyine said

August 14, 2011 at 00:27
Can I access 454 Newbler free online or I have to purchase the software?
- flxlex said
  
  August 16, 2011 at 09:38
  You can request a free download here: http://454.com/contact-us%5Csoftware-request.asp
  - jeri said
    
    May 22, 2012 at 19:22
    When you submit the form from the link above it throws an error. Try this link instead. It worked for me.
    http://454.com/contact-us/software-request.asp
  - flxlex said
    
    May 23, 2012 at 21:06
    Thanks for letting us know!
Praveen said

November 15, 2011 at 21:45
I used 6 million 454 reads as input. Since I had only 24 GB RAM newbler eventually failed to assemble but I was shocked to see that the newbler trashed 4 million of my reads keeping only 2 million in the trimmedreads.fna. Could you please explain this behavior? What may be the potential causes of this?
- flxlex said
  
  November 16, 2011 at 12:57
  Are your reads very short? Low quality? Do you ask newbler to filter for contaminants (‘-vs’ option)?
Victoria said

December 14, 2011 at 23:21
I am doing a denovo transcriptome assembly and wanted to get the singletons out of the output, and was wondering if the sfffile and sffinfo commands still work when you use fasta files as input for assembly rather than the sff files?
- flxlex said
  
  December 15, 2011 at 09:21
  No, these commands will not work, they only work on sff files. You can still use the information in the 454ReadStatus.txt file to find the IDs of the reads that are the singletons, and use a script/BioPerl/Biopython/… to filter them out of your input file(s).
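A script along these lines could look like the minimal Python sketch below, which needs no extra libraries. It rests on assumptions that should be checked against your own files: that 454ReadStatus.txt is tab-separated with the read ID in the first column and the status (e.g. Singleton) in the second, and that the ID is the first word of each fasta header. The function names are made up for this example.

```python
def read_singleton_ids(status_path):
    """Collect IDs of reads whose status is 'Singleton'.

    Assumed layout: tab-separated, read ID in column 1, status in column 2.
    """
    ids = set()
    with open(status_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2 and fields[1] == "Singleton":
                ids.add(fields[0])
    return ids


def write_singletons(fasta_in, fasta_out, ids):
    """Copy the fasta records whose ID is in `ids`; return how many were written."""
    keep = False
    written = 0
    with open(fasta_in) as source, open(fasta_out, "w") as target:
        for line in source:
            if line.startswith(">"):
                words = line[1:].split()
                # The record ID is taken to be the first word of the header.
                keep = bool(words) and words[0] in ids
                if keep:
                    written += 1
            if keep:
                target.write(line)
    return written
```

Calling read_singleton_ids("454ReadStatus.txt") and passing the result to write_singletons together with your input fasta file would then produce a fasta file of the singletons, comparable to singles.fna in the sff-based recipe.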
Maggie said

January 19, 2012 at 16:22
I have done some de novo transcriptome assemblies, and they have been working fine with default parameters (with -cpu 6). Recently I have been redoing one assembly with about 4M reads as input, and it is taking forever, especially in the computing alignments step (for about 4 million reads, it did not complete within two days). Could you explain what might be the cause of this?
- flxlex said
  
  January 24, 2012 at 16:25
  What is different this time, another version of newbler? More reads? Also check out the reply to this comment (point 2).
Steven Sullivan said

January 27, 2012 at 17:53
Is Newbler properly considered an OCC (overlap-contig-consensus) assembler like Celera or CABOG , or a de Bruijn assembler like Velvet and ALLPATHS, or a hybrid (maybe OCC for the long reads, de Bruijn for short)?
- flxlex said
  
  January 30, 2012 at 13:12
  Newbler is definitely using the Overlap Layout Consensus (OLC) approach.
JJ said

March 19, 2012 at 15:48
Could you please tell me what the difference is between contig – isotig and contig – singleton, because I am a little bit confused. Do you know why so many authors write only about contigs in their articles but say nothing about isotigs? (I have read about EST sequencing of cDNA.) Are isotigs only produced by newbler? Does contig in this context usually mean an individual transcript, as isotig does?
- flxlex said
  
  March 20, 2012 at 16:18
  What do you mean by ‘contig – isotig and contig – singleton’? Isotigs is a newbler-specific term, if people didn’t use newbler they will not have isotigs. What ‘contig’ means (transcript, part of it or something else) for transcriptome assemblies done with other programs, you would have to look up in their documentation.
  - JJ said
    
    March 21, 2012 at 09:56
    Ok, Good to know that isotigs is a newbler-specific term!
    
    I thought that when people use different programs they use the word contig when they mean transcript (newbler: isotig), and that singleton is the same thing as contig in the 454Isotigs.fna file produced by newbler. But I am not sure. Typically these are not so well explained in articles.
    
    I found that a singleton is typically defined as having a coverage depth of 1. Does this mean the same thing as the sequences named contig0… in the 454Isotigs.fna file? Or does it mean the isotigs that are formed from one contig?
  - flxlex said
    
    March 23, 2012 at 14:25
    Please check out my post on the output of newbler transcriptome assemblies: some isogroups contain only contigs, which are not singletons! In newbler, singletons are always unassembled reads, and they are (unfortunately) not included in the output files.
michiemulan said

April 4, 2012 at 10:33
Hi Flxlex,

I’m learning how to use Newbler and I must say the introduction has helped me tremendously. Thank you so much!

I have a question. Should the assembly fail, how shall I continue? Is there any command-line information I could get hold of, and what should be done at the respective steps should the assembly fail at any of the steps, so that the assembly can be continued from the very process it halted?

Many thanks for the reply.
- flxlex said
  
  April 4, 2012 at 10:42
  Newbler never fails :-) Sometimes, there is not much you can do when newbler ends with an error and asks you to report it to customer service; it could be a plain bug. Other times, trying with fewer reads, or a slight change in parameters, will help.
  - Imrose said
    
    October 22, 2015 at 22:45
    Hi Flxlex,
    
    I am getting an error (see below) towards the end of my assembly. I am working with multiple datasets and using the same exact pipeline to prepare my files, but I am seeing this error for half of my datasets. I checked to see if the sizes of the files vary, but newbler (v2.0.01.14) seemed to work fine with datasets of various size. Is this the bug you were referring to? Do you have any suggestions what I could do in this case?
    
    Error: An internal error has occurred in the computation.
    Assertion: chord->getLength() – chordEndChop + nextChord->getLength() – nextChordStartChop < 256
    Location: 'void ChordMatrix::concatChords(Chord*, int, Chord*, int)' [ChordMatrix.cpp:239]
    Please report this error to your customer support representative.
    
    Any help would be great, many thanks :-).
  - lexnederbragt said
    
    October 23, 2015 at 08:39
    I am afraid I can’t help here, this is a problem with the source code of Newbler that gets triggered for some of your data, but not all. The only suggestion I have is to look more into the difference(s) between the data that cause this error, and the data that don’t…
Frederick said

April 16, 2012 at 20:47
This might be a silly request, but could you direct me to any literature about Newbler?
I am struggling to find published articles on Newbler.

Regards
- flxlex said
  
  April 18, 2012 at 11:02
  Assuming you mean papers describing newbler, not papers using it for assembly, the main reference is the original Margulies paper. A somewhat more recent paper on trying out GS FLX on a set of Salmon BACs described improvements to the program.
zxybl said

May 24, 2012 at 15:12
what is the unknown gap size? is it the default gap size(20bp)?
- flxlex said
  
  May 24, 2012 at 15:29
  If newbler can’t estimate the gap size, it puts it at 20 bp (I think this is what you mean).
Nori said

October 26, 2012 at 19:43
Hi Flxlex,

Great site! My question: is there a way to get the intermediate files from Newbler, such as the number of seeds or a frequency distribution of the seeds?
- flxlex said
  
  November 1, 2012 at 20:45
  If you run with the ‘-nrm’ flag, or set up your project ‘manually’ (using newAssembly, addRun, runProject) then many intermediate files will not be thrown away. What you want to do may be possible with these files, but I am not entirely certain. Let me know if you find out!
Risham Kaur said

February 12, 2013 at 10:05
Hi Flxlex!

Thanks a ton for explaining how Newbler works. I am trying to assemble low coverage 454 data of a plant using Newbler. I have two raw sff files from two different genotypes of my experimental plant. Newbler completes the assembly step without any considerable error for the individual sff files. But when I try to assemble the sff files of both genotypes together (using incremental de novo assembly), it just adds up the total contigs and the singletons, neglecting the possible common contigs between the two genotypes. To my understanding, newbler is treating every read in both sff files as unique, which is very unlikely to happen. My basic aim is to find the SNPs and repeats in the genome, and if newbler is assembling every read into a unique contig then this could be a matter of concern to me. Please provide an explanation for this behaviour.
- flxlex said
  
  February 14, 2013 at 10:49
  If your coverage is very low, the sff file from the second genotype may contain – by chance – sequences that come from genomic regions that do not overlap with the ones from the first. If this cannot be the explanation, and there are a lot of heterozygosities (SNPs, indels, or larger regions) between the genotypes, then lowering the overlap stringency may help newbler to assemble one contig per region. Perhaps not doing incremental assembly, but giving all reads in one go, may help too (although it shouldn’t make a difference). Finally, if you succeed in a ‘consensus’ assembly, you could try mapping back the individual sff files to the contigs, and ask newbler (gsMapper, or runMapping) to report the variants.
LeFeesh said

February 20, 2013 at 18:32
Hi,
I was looking for some information regarding the directionality of Newbler assemblies. Are the RNAseq libraries for FLX directional (5′-3′)? If so, then I assume all Newbler contigs are in 5′-3′ orientation. If not, then I assume the Newbler contigs are randomly plus or minus strand contigs, unless Newbler somehow orients each contig into the proper direction. Can you shed some light on this? I have a Newbler transcriptome that was published by another lab that I would like to use as an index for my Illumina HiSeq data from directional libraries. My libraries should have produced nearly exclusively 5′-3′ sequences, so if the transcriptome is also 5′-3′ oriented then I can ignore reads that align to the reverse complement as errors. However, if Newbler produces both plus and minus strand contigs then I must accept all alignments as valid. Thoughts?
- flxlex said
  
  February 22, 2013 at 09:48
  Whether RNASeq libraries for the 454 are directional I guess depends on the lab that made them. That said, I am not aware of any directional protocol for 454. When the reads are from both strands, the contigs newbler produces are indeed random with respect to orientation. However, since your reads are directional, shouldn’t they map to the same strand within a contig? After alignment, you should then determine for each contig what the prevalent strand for alignment is, and discard all reads mapped to the other strand, right?
  - LeFeesh said
    
    February 22, 2013 at 17:30
    Thanks Flxlex,
    You’ve confirmed my suspicions. Unfortunately that makes the analysis a tiny bit more complicated but not too bad. I’ll just have to write a script to collect only the data from the most prevalent strand.
    
    Yes, my reads do largely align to either the plus or minus strand of a contig and not both, but there is some cross talk. Some libraries have more cross talk than others and I’m not sure of the reason for that. Perhaps the directionality in some of the library constructs isn’t as robust? In the few samples I’ve looked at, it seems like I get about 98% of reads belonging to either the plus or minus strand and a fairly consistent 2% to the opposite strand.
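The kind of script discussed in this thread could be sketched in Python as follows. It is an illustration only: the input is assumed to be a list of (read ID, contig, strand) tuples that you would first have to extract from your own alignment output, and the function name is made up.

```python
from collections import Counter, defaultdict


def keep_prevalent_strand(alignments):
    """Keep only the reads mapped to the most common strand of each contig.

    alignments: list of (read_id, contig, strand) with strand '+' or '-'.
    Returns (prevalent, kept): the chosen strand per contig and the
    alignments that agree with it. On a tie, the strand seen first wins.
    """
    counts = defaultdict(Counter)
    for _, contig, strand in alignments:
        counts[contig][strand] += 1
    prevalent = {
        contig: strand_counts.most_common(1)[0][0]
        for contig, strand_counts in counts.items()
    }
    kept = [
        alignment for alignment in alignments
        if alignment[2] == prevalent[alignment[1]]
    ]
    return prevalent, kept
```

For contigs with very few mapped reads, the prevalent strand is not very reliable, so it may be worth requiring a minimum number of reads before trusting the call.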
starstar said

March 28, 2013 at 00:39
Hi Flxlex,
I have about 250 million Illumina paired end reads (500 million in total) and have been running the assembly with 256 GB RAM. I know that the computing alignments stage is the slowest part, and since I have so many reads, I always reach the time limit set for me. Is there a checkpoint or something similar so that I can continue the assembly where I left off, for example if the job is killed while computing the alignments? I am using version 2.6 of Newbler.
- flxlex said
  
  April 2, 2013 at 14:11
  There is no such checkpoint that I am aware of, AFAIK checkpoints are used after the alignment phase is finished. If your genome is more than 40Mbp, are you using the ‘-large’ option? You could also try the latest newbler. A kmer frequency analysis may tell you whether there is a set of kmers at very high frequency, in which case digital normalisation (google ‘diginorm khmer’) may help.
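To show what a k-mer frequency analysis means, here is a small Python sketch. It is a toy version for illustration only: it holds all counts in memory, does not merge a k-mer with its reverse complement, and would be far too slow for hundreds of millions of reads, where a dedicated k-mer counting tool is the practical choice. The function names are made up.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def frequency_histogram(counts):
    """Map each occurrence count to the number of distinct k-mers with that count."""
    return Counter(counts.values())
```

In the resulting histogram, a small set of k-mers with counts far above the bulk of the distribution is the sort of signal that points to highly repeated or over-represented sequence.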
Stray Bird said

June 25, 2013 at 17:03
Hi, Flxlex, do you know how the consensus sequence is computed by Newbler? I bet it is a majority vote, but I can’t find any documents to support this. If one position is covered with “GGGAAA”, would the consensus be a G or an A, or chosen at random? Thanks for your great blog on Newbler!

Stray Bird
- flxlex said
  
  June 26, 2013 at 09:13
  As far as I understand – I don’t know of any reference – it takes the flow signals into consideration when calculating the consensus. That is why newbler says ‘Reading flowgrams…’ at the end. So it appears to be a bit more sophisticated than just majority voting.
Shas said

July 8, 2013 at 19:01
Hi Flxlex, can you tell me if Newbler makes any assumptions when assembling circular or linear genomes, particularly about where the genome assembly begins?
- flxlex said
  
  August 6, 2013 at 14:28
  Newbler doesn’t know, nor can be told, whether the genome is linear or circular. Assembly can basically start anywhere.
Sinna said

November 5, 2013 at 12:02
Hi, I’m using gsAssembler for the first time, and I have a problem importing .sff files: I can select them, but the information does not seem to be imported. So I’m working from the command line, and thanks to your tips I have started my first assembly. I have three bacterial genomes; do you have any suggestions about the parameters?

(Helicobacter pylori, has a circular genome of 1,667,867 base pairs and 1,590 predicted coding sequences.)
- lexnederbragt said
  
  November 5, 2013 at 14:52
  No particular suggestions, you should be good to go with default settings.
Rocío Sasmay said

October 10, 2017 at 20:39
Please, I need the correct citation for Newbler v2.5 for a scientific paper.
- lexnederbragt said
  
  October 11, 2017 at 16:06
  Unfortunately, there was never an official publication describing the software. People usually reference this paper: https://www.ncbi.nlm.nih.gov/pubmed/16056220
Help Please? said

October 14, 2017 at 05:00
Hello, I could use some help please? I keep getting this error:

“Error: More than one argument found after options”

I am running the following command:

newbler runAssembler -mi -o outputfilename .collapsed.fastq .pair1.truncated.fastq .pair2.truncated.fastq

I do not understand what I am doing wrong and would appreciate some guidance.

Thank you!
- lexnederbragt said
  
  October 16, 2017 at 13:21
  You are missing a number after the -mi option, see https://contig.wordpress.com/2010/06/10/running-newbler-de-novo-assembly/