An assembly of reads, contigs and scaffolds

A blog on all things newbler and beyond

How newbler works

Posted by lexnederbragt on February 9, 2010

I thought to start by explaining briefly how newbler works. I’ll do this by following the output newbler generates during the assembly process. This information is displayed during assembly, and can also be found in the 454NewblerProgress.txt file. It is a good thing anyways to have a look at this file, as it sometimes displays certain warnings (see below).

This example assembly is based on a read dataset consisting of both shotgun reads, and paired end reads (for more on 454 paired end reads, have a look here).

The first thing you’ll see is a message stating that the assembly computation started, and which version of newbler you used.

Then, you’ll see messages for each input file saying Indexing XXXXXXX.sff…, and a counter. During indexing, newbler scans the input file, performs some checks and trims the reads (sometimes more than the base-calling software already did). One of the checks is for possible 3′ and 5′ primers: if a certain percentage of reads contains the same sequence on either the 3′ or 5′ end, this is mentioned. I’ve had some surprises here, such as finding out that reads I got from another group contained an adaptor sequence, which caused problems during the assembly. More on primer removal later…

If an input sff file contains paired end reads, this will be mentioned, as well as the number of reads that contained the paired end linker sequence, for example:

224024 reads, 58599257 bases, 112080 paired reads.

Next:

Setting up long overlap detection…
XXXXX reads to align
Building a tree for YYYYYY seeds…
Computing long overlap alignments…

The first phase of assembly is finding overlap between reads. Newbler splits this phase into one for long reads (this goes very fast) and shorter reads (can take quite some time). As aligning all reads against each other would take too long time, newbler (and many other programs) actually make seeds, 16-mers of each read, where each seed starts 12 bases upstream of the previous one. These seed length and step sizes can be changed if you want (I’ve never tried this, though). When two different reads have identical seeds the program tries to extend the overlap between the reads until the minimum overlap (default 40 bp) with the minimum alignment percentage default 90%) has been reached. These settings can also be changed and influence the alignment stringency, this I will come back to in a later post.

When reads overlap they can be used to generate consensus contigs. The illustration at the top of the post shows an ideal situation with reads that all are the same length and have no variation in sequence in the overlap. In real assemblies, reads will actually show differences, and the contig sequence is based on a consensus estimate.

After long overlap follows short overlap:

Setting up overlap detection…
XXXXXXX reads to align
Building a tree for YYYYYYYY seeds…
Computing alignments…

Then the curious message appears:

Checkpointing…

Basically, checkpointing means writing the intermediate results to disc, so that in the case of a crash, you could continue the assembly from the last ‘checkpoint’.

At this point, newbler, as many other assemblers, has created a contig graph. Aligned reads form the ‘nodes’, reads going from one contig to another form the ‘edges’. For example, a small part of the graph could like like this:

Why would reads go from contig 1 to contig 3, as well as from contig 2 to contig 3? If the sequence from contig 3 is repeated in the genome, the reads coming from these repeats are very similar and collapse into a single contig. At the beginning and end of this contig there will be reads extending into the respective neighboring, single copy genomic regions. Essentially, here is the problem with assembly: repeats cause a complicated contig graph structure.

The graph in the above picture can be simplified to this:

After aligning all the reads, the contig graph potentially has many nodes and edges. The size and complexity of the graph depend on the size of the genome and the repeat structure. The ‘real’ genome is a path through the graph visiting all nodes.

Newbler continues:

Detangling alignments…
-> Phase 1…
-> Phase 2…
-> Level 1, Phase 3, Round 1…

-> Level 4, Phase 9, Round 2…
Checkpointing…

At this stage, the graph is simplified, I’ll leave out the details.

Setting up short read alignment detection…
-> 39593 of 39593…
Mapping short reads to consensi…

Paired end read halves shorter than a certain size are not assembled during alignment, but mapped to the contigs afterwards. The longer halves are assembled with the shotgun reads. See below.

Adding mapped short reads to alignments…
Checkpointing…
Building contigs/scaffolds…

Using all uniquely mapped paired end halves, the contigs can be ordered and oriented. At least two paired reads must link two contigs for them to be scaffolded. In order to estimate gap sizes between the contigs, newbler uses the paired end reads where both halves map to the same contig for estimating the paired end library insert size (this is reported in the 454NewblerMetrics.txt file). This insert size is than used to estimate the gap size between contigs. So, a scaffold is a series of contig-gap-contig-gap-contig and so on.

Next, newbler reports on how many scaffolds and contigs it found

-> XX scaffolds, YYY large contigs

(‘Large’ refers to contigs of at least 500 bp) , and starts producing output:

Computing signals…
Reading flowgrams…
Checkpointing…
Generating output…
Reading flowgrams…

Newbler here reads in all information from the flowgrams (sff file, these include the light intensity during sequencing for each read) and calculates for each position in the contig the consensus signal. Also, a consensus quality score is calculated for each base. Several output files are written during this phase.

Finally, newbler tells you it finished successfully.

There is a lot more that can be said about the process of assembly, but this post is long enough already. Next time: what do the different output files actually contain?

Feel free to post questions below!

61 Responses to “How newbler works”

  1. Vladimir said

    Thanks for the nice explanation. But still one question. How long does it usually take Newbler to read flowgrams? I’ve come with my assembly pretty fast up to that point, but reading flowgrams has been taken already quite a while. Thanks.

    • contig said

      Yes, for a large assembly (many reads) this phase can be quit time consuming. One thing you could do is use the -qo flag. From the manual:

      “Flag to generate quick output for mapping and assembly. Disables signal distribution computation for calling consensus sequences and can decrease accuracy.”
      Using this flag saves time during the output phase. However, your output will be slightly different than the normal, full output. There will be base changes, and slightly different contig/scaffold metrics (number, N50 length).

      Using -qo is good for getting a first impression of an assembly, but always use the full output for the final analysis!

      Hope this helps!

  2. Brian said

    fixlex, that is a very good summary, thank you. I’m doing some metagenomics work and running newbler with -large. What I’m finding is that newbler is not including singletons in its 454AllContigs files but I really want to have them included. Should I reduce the value given to -a? Or is more required?

    • flxlex said

      You’re correct, newbler does not make a file with singletons. Unfortunately, there is no setting to make it do that. The only option you have is by extracting the singleton read IDs from the 454ReadStatus.txt file, and using the sfffile and sffinfo commands to generate a fasta file of them. From the manual:

      grep Singleton 454ReadStatus.txt > singles.txt
      sfffile -o singles.sff -i singles.txt sff/*
      sffinfo -s singles.sff > singles.fna

      Here, sff/* stands for the folder containing all the sff files used for the assembly.

      • Brian said

        fixlex, thanks once again, this is a straightforward workaround. I couldn’t find this one myself, no manual!

        I have to add that I think this is something of a “scientific bug”. In metagenomics you really want all reads, not just contigs. Perhaps the creators of newbler think that a singleton is somehow unvalidated, but it’s not the job of the assembler to make that determination, there are many good ways of determining whether some singleton can be trusted (e.g. match to a database entry, count of matches to the sequence or species that matches the singleton).

      • flxlex said

        Brian, I agree. Other assemblers do output singletons. Where do we send feature requests for newbler? ;-)

  3. Josh said

    I was reading that Newbler requires 454 software for output, I was wondering if you could verify that. Also, is there a known upper limit to the amount of reads Newbler can successfully run on a given machine?

    • flxlex said

      What do you mean ‘Newbler requires 454 software for output’? Newbler is a piece of 454 software. It uses 454 reads as input, but can also take in fasta files of other read types. If you mean to visually inspect the newbler otput, you might be able to use the GUI version (gsAssembler/gsMapper instead of runAssembly/runMapping).

      On your second question: the only upper upper limit I am aware of is the amount of memory your machine has. Also, for large projects, more cpus (on a shared memory machine) help, as well as the large option.

  4. Lee said

    I would love it if you made an article on how each individual consensus base gets its quality score. How is each quality derived? Is it different in each assembler? Can you show an example?

  5. marble said

    Flxlex, thank you for the great introduction, you are a great help!
    There is a question, though. Since there is an obligatory field that sets the number of ‘minimum read lenght’ to a number, the default is 20, and a minimum overlap of 40, how is it possible to generate contigs of length e.g. 2 or 3 nucleotides?

  6. Praveen said

    I see a many cotigs with only 1 base and 2 bases. I cannot understand why? I am doing cDNA assembly. I just used the default parameters and I did not provide fasta qual values but only the fasta sequences.

  7. Steven Sullivan said

    I have what is probably a basic question: in the contig graph graphic https://contig.files.wordpress.com/2010/02/contig-graph.jpg, why are the first three reads at the top of contig 3 — which appear to be the 3′ ends of the bottom three reads in contig 1 – not assembled into contig 1 instead? (At a more basic level I guess I am asking, what makes an assembler split a read into different contigs?)

    • flxlex said

      You are pointing to a fundamental difference between assemblers: some do what you say and never split a read (i.e. never place one read into more than one contig), while others, such as newbler, do split reads, usually at the borders of repeats. Newbler in this way chooses the alignment over read ‘integrety’. The advantage is that it is easier to see which contigs are neighbors (e.g. flanking a repeat) from where the reads are aligned.

  8. Nyine said

    Can I access 454 Newbler free online or I have to purchase the software?

  9. Praveen said

    I used 6 million 454 reads as input. Since I had only 24 GB RAM newbler eventually failed to assemble but I was shocked to see that the newbler trashed 4 million of my reads keeping only 2 million in the trimmedreads.fna. Could you please explain this behavior? What may be the potential causes of this?

  10. Victoria said

    I am doing a denovo transcriptome assembly and wanted to get the singletons out of the output, and was wondering if the sfffile and sffinfo commands still work when you use fasta files as input for assembly rather than the sff files?

    • flxlex said

      No, these commands will not work, they only wotk on sff files. You can still use the information in the 454ReadStatus.txt file to find the IDs of the reads that are the singletons, and use a script/bioperl/biopyton/… to filter them out of your input file(s).

  11. Maggie said

    I have done some denovo transcriptome assembly and the assembly has been working fine and I am using default parameters(with -cpu 6). Recently I am redoing one assembly while the input is about 4M reads, it is taking forever. especially in the step of computing alignments.(for about 4Millions read, it did not complete within two days). Could you explain and what might be the cause of it?

  12. Steven Sullivan said

    Is Newbler properly considered an OCC (overlap-contig-consensus) assembler like Celera or CABOG , or a de Bruijn assembler like Velvet and ALLPATHS, or a hybrid (maybe OCC for the long reads, de Bruijn for short)?

  13. JJ said

    Could you please tell me what is the difference between contig – isotig and contig – singleton because I am a little bit confused. Do you know why so many write in their articles only about contigs but say nothing about isotigs? (I have read about EST sequencing of cDNA) Is isotigs only produced by newbler? Does contig in this context usually mean individual transcript as isotig does?

    • flxlex said

      What do you mean by ‘contig – isotig and contig – singleton’? Isotigs is a newbler-specific term, if people didn’t use newbler they will not have isotigs. What ‘contig’ means (transcript, part of it or something else) for transcriptome assemblies done with other programs, you would have to look up in their documentation.

      • JJ said

        Ok, Good to know that isotigs is a newbler-specific term!

        I thought that when people use different programs they use word contig when they mean transcript (newbler:isotig) and singleton is the same thing as contig in the 454Isotig.fna file produced by mewbler. But I am not sure. Typically these are not so well explaned in articles.

        I found that singleton is typically defined so that its coverage depth is 1. I was thinking that does this mean the same thing as in the file 454Isotigs.fna the sequences which are named as contig0…? Or does this mean the isotigs which are formed from one contig?

      • flxlex said

        Please check out my post on the output of newbler transcriptome assemblies: some isogroups contain only contigs, which are not singletons! In newbler, singletons are always unassembled reads, and they are (unfortunately) not included in the output files.

  14. Hi Flxlex,

    I’m learning how to use Newbler and I must say the introduction has helped me tremendously. Thank you so much!

    I have a question. Should the assembly fail, how shall I continue? Is there any command-line information I could get hold of, and what should be done at the respective steps should the assembly fail at any of the steps, so that the assembly can be continued from the very process it halted?

    Many thanks for the reply.

    • flxlex said

      Newbler never fails :-) Sometimes, there is not much you can do when newbler ends with an error and asks you to report it to customer service, it could be a plain bug. Other times, trying with less reads, or a slight change in parameters will help.

      • Imrose said

        Hi Flxlex,

        I am getting an error (see below) towards the end of my assembly. I am working with multiple datasets and using the same exact pipeline to prepare my files, but I am seeing this error for half of my datasets. I checked to see if the sizes of the files vary, but newbler (v2.0.01.14) seemed to work fine with datasets of various size. Is this the bug you were referring to? Do you have any suggestions what I could do in this case?

        Error: An internal error has occurred in the computation.
        Assertion: chord->getLength() – chordEndChop + nextChord->getLength() – nextChordStartChop < 256
        Location: 'void ChordMatrix::concatChords(Chord*, int, Chord*, int)' [ChordMatrix.cpp:239]
        Please report this error to your customer support representative.

        Any help would be great, many thanks :-).

      • I am afraid I cant help here, this is a problem with the source code of Newbler that gets triggered for some of your data, but not all. The only suggestion I have is to look more into the difference(s) between the data that cause this error, and the data that don’t…

  15. Frederick said

    This might be a silly request, but could you direct me to any literature about Newbler?
    I a struggling to find published articles on Newbler.

    Regards

  16. zxybl said

    what is the unknown gap size? is it the default gap size(20bp)?

  17. Nori said

    Hi Flxlex,

    Great site! My question: is there a way to get the intermediate files from Newbler, such as the number of seeds or a frequency distribution of the seeds?

    • flxlex said

      If you run with the ‘-nrm’ flag, or set up your project ‘manually’ (using newAssembly, addRun, runProject) then many intermediate files will not be thrown away. What you want to do may be possible with these files, but I am not entirely certain. Let me know if you find out!

  18. Hi Flxlex!

    Thanks a tonn for explaining the working protocol of Newbler. I am trying to assemble a low coverage 454 data of a plant using Newbler. I have two raw sff files from two different genotypes of my experimental plant. newbler completes the assembly step without a considerable error for the individual sffs. But when I try to assemble the sff files of both genotypes together(using incremental denovo assembly) it just adds up the total contigs and the singletons for that matter neglecting the possible common contigs between the two genotypes. To my understanding newbler is treating every read in both the sff files as unique which is very unlikely to happen. My basic aim is to find the SNPs and repeats in the genome and if newbler is assembling every read into a unique contig then this could be a matter of concern to me. Please provide the necessary explaination for this behaviour.

    • flxlex said

      If your coverage is very low, the sff file from the second genotype may contain – by chance – sequences that come from genomic regions that do not overlap with the ones from the first. If this cannot be the explanation, than, if you want newbler to assembly into one contig per region, and there are a lot of heterozygosities (SNPs, indels, or larger regions) between the genotypes, I suggest lowering the overlap stringency may help. Perhaps not doing incremental assembly, but giving all reads in one go may help too (although it shouldn’t make a difference). Finally, if you succeed in a ‘consensus’ assembly, you could try mapping back the individual sff files to the contigs, and ask newbler (gsMapper, or runMapping) to report the variants.

  19. LeFeesh said

    Hi,
    I was looking for some information regarding the directionality of Newbler assemblies. Are the RNAseq libraries for FLX directional (5′-3′). If so, then I assume all Newbler contigs are 5′-3′ orientation. If not, then I assume the Newbler contigs are randomly plus or minus strand contigs. Unless Newbler somehow orients each contig into the proper direction. Can you shed some light on this? I have a Newbler transcriptome that was published by another lab that I would like to use as an index for my Illumina HiSeq data from directional libraries. My libraries should have produced nearly exclusively 5′-3′ sequences so if the transcriptome is also 5′-3′ oriented then I can ignore reads that align to the reverse complement as errors. However, if Newbler produces both plus and minus strand contigs then I must accept all alignments as valid. Thoughts?

    • flxlex said

      Whether RNASeq libraries for the 454 are directional I guess depends on the lab that made them. That said, I am not aware of any directional protocol for 454. When the reads are from both strands, the contigs newbler produces are indeed random with respect to orientation. However, since your reads are directional, shouldn’t they map to the same strand within a contig? After alignment, you should then determine for each contig what the prevalent strand for alignment is, and discard all reads mapped to the other strand, right?

      • LeFeesh said

        Thanks Flxlex,
        You’ve confirmed my suspicions. Unfortunately that makes the analysis a tiny bit more complicated but not too bad. I’ll just have to write a script to collect only the data from the most prevalent strand.

        Yes, my reads do largely align to either the plus or minus strand of a contig and not both but there is some cross talk. Some libraries have more cross talk than others and I’m not sure the reason for that. Perhaps the directionality in some of the library constructs isn’t as robust? In the few samples I’ve looked at it seems like I get about 98% of belonging to either the plus or minus strand and a fairly consistant 2% to the opposite strand.

  20. starstar said

    Hi Flxlex,
    I have about ~250 million Illumina paired end reads (500Mil total) and has been running the assembly with 256 gb RAM. I know that the computing alignment stage is the slowest portion and since I have too many reads, I always reach the time limit set for me. Is there a checkpoint or something that I can still assemble where I left off for example if the job killed while computing the alignments? I am using version 2.6 for Newbler.

    • flxlex said

      There is no such checkpoint that I am aware of, AFAIK checkpoints are used after the alignment phase is finished. If your genome is more than 40Mbp, are you using the ‘-large’ option? You could also try the latest newbler. A kmer frequency analysis may tell you whether there is a set of kmers at very high frequency, in which case digital normalisation (google ‘diginorm khmer’) may help.

  21. Stray Bird said

    Hi, Flxex, do you know how the consensus sequence are computed by Newbler? I bet it is a majority vote, but I can’t find any documents to support this. If one position is covered with “GGGAAA”, would the consensus be a G or a A, or by random? Thanks for your great blog on Newbler!

    Stray Bird

    • flxlex said

      As far as I understand – I don’t know of any reference – it takes the flow signals into consideration when calculating the consensus. That is why newbler at the ends says ‘Reading flowgrams…’. So it appears to be a bit more sophisticated than just majority voting.

  22. Shas said

    Hi Flxex, can you tell me if Newbler makes any assumptions when assembling circular or linear genomes, particularly from where the genome assembly begins?

    • flxlex said

      Newbler doesn’t know, nor can be told, whether the genome is linear or circular. Assembly can basically start anywhere.

  23. Sinna said

    Hi, I’using for first time gsAssembler, I have a problem to import .sff file, i can select them but seems like don’t import the information. So I’m working from command line and thanks to your tips i have started my first assemble. I have three genomes of bacteria, do you have some suggestion about the parameters ?

    (Helicobacter pylori, has a circular genome of 1,667,867 base pairs and 1,590 predicted coding sequences.)

  24. Rocío Sasmay said

    Please, I need the correct cite for Newbler v2.5 for a scientific paper.

  25. Help Please? said

    Hello, I could use some help please? I keep getting this error:

    “Error: More than one argument found after options”

    I am running the following command:

    newbler runAssembler -mi -o outputfilename .collapsed.fastq .pair1.truncated.fastq .pair2.truncated.fastq

    I do not understand what I am doing wrong and would appreciate some guidance.

    Thank you!

Sorry, the comment form is closed at this time.