An assembly of reads, contigs and scaffolds

A blog on all things newbler and beyond

What is new in newbler version 2.5.3

Posted by lexnederbragt on March 22, 2011

Recently, newbler version 2.5.3 became available. In this post, I’ll describe the changes between this version and the previous one (2.3). As I have not yet described the gsMapper function of newbler, I only discuss changes relevant to assembly (gsAssembler, runAssembly) here.

Read order
Previously, the order in which reads were added using addRun could affect the final outcome of the assembly. Now, newbler uses a ‘canonical’ (fixed) order, regardless of the order of addition. This means that assemblies using the same parameters and the same set of reads will always give the same output (contig number, sequences etc.). Note, however, that this only holds when a single CPU is used; repeated multi-CPU assemblies will have slightly different outcomes.

New options
-tr
This option to output the trimmed reads was present already in version 2.3, but hidden, see my previous post.

-sio
This parameter is a great one for assemblies based on very large read datasets (over 4 million reads); ‘sio’ stands for Serial I/O. The option solves the problem of extremely long processing times at the end of assemblies, in the ‘Computing signals’ and ‘Generating output’ phases. Here, newbler has to go through all the sff (raw read) input files, so that it can use the exact basecalls and signal strengths for consensus base and signal calculations. Newbler searches for this information in the order in which bases are located in contigs and scaffolds, accessing the read files many times. With -sio, newbler first builds temporary files with the required information in a more efficient order. This speeds up these phases significantly.
Note, however, that up to three times the amount of disc space that the original sff files occupy is required for this process (I have had a long assembly crash at the very end because of lack of disc space…). Also, more memory is needed (the number of reads in your project times 8 kilobytes, with a maximum of 8 Gigabytes). Temporary files will be deleted upon completion of the assembly.
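To get a feel for these numbers before starting a long run, here is a minimal sketch that turns the figures above into an estimate. The function name is made up for illustration, and I am assuming binary units (1 kilobyte = 1024 bytes, 1 Gigabyte = 1024³ bytes); newbler's own accounting may differ.

```python
KB = 1024
GB = 1024 ** 3


def sio_resource_estimate(sff_bytes, n_reads):
    """Rough upper bounds for the extra resources -sio may need.

    Hypothetical helper based only on the figures quoted in this post:
    up to three times the sff size on disc, and 8 kilobytes of memory
    per read with a maximum of 8 Gigabytes.
    """
    extra_disc = 3 * sff_bytes
    extra_memory = min(n_reads * 8 * KB, 8 * GB)
    return extra_disc, extra_memory


# Example: 10 GB of sff files holding 4 million reads
disc, memory = sio_resource_estimate(10 * GB, 4_000_000)
print(f"disc: up to {disc / GB:.0f} GB, memory: up to {memory / GB:.0f} GB")
```

With 4 million reads the per-read estimate already exceeds the 8 Gigabyte maximum, so the cap applies.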

-siom
This option can be used to allow more memory consumption of the -sio option. Use -siom followed by the number of Gigabytes allowed (from 1-1000).

-siod
This option modifies the way -sio operates, in that it minimizes disc space consumption, presumably without a loss of speed.

-nosio
When rerunning a project (that was run using the -nrm option) for which any of the -sio options was set, by default, the previous -sio option is used again. Using -nosio cancels any other -sio options.

-force
Previous versions of newbler would overwrite an assembly project folder with the same name as specified with the -o option. Newbler 2.5.3 instead exits with an error message that the project folder already exists. Using the -force option allows overwriting the existing project folder.

-urt
This stands for ‘use read tips’ and can be helpful for low coverage assemblies, or low coverage regions in an otherwise high-coverage assembly. The unaligned parts of assembled reads at the ends of contigs can extend significantly beyond the actual contig (the region consisting of multiple aligned reads). With the -urt option, the contig is extended to the end of such reads. The description says ‘the assembler tries to extend the contig to the “tip” (end) of the read which extends unaligned’, but I don’t know what determines whether a try is successful or not.
In addition, very low coverage overlaps can result in contigs where they normally would not.
The primary use of the -urt option is for transcriptome assemblies, where it helps to obtain contigs for rare transcripts (because these have low coverage). However, genome assemblies might also benefit if they have regions represented by few reads.
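To show how the options above could be combined on the command line, here is a minimal sketch that only builds an argument list for runAssembly; it does not run anything. The function and parameter names are made up for illustration, the flags are the ones described in this post (plus -cdna for transcriptome assemblies, mentioned below), and I am assuming that options come before the sff files. Whether newbler accepts every combination is something to check against the manual.

```python
def build_run_assembly_args(out_dir, sff_files, force=False, cdna=False,
                            urt=False, tr=False, sio=False, siom_gb=None):
    """Build (but do not run) a runAssembly argument list.

    Hypothetical helper: only flags described in this post are used.
    """
    args = ["runAssembly", "-o", out_dir]
    for flag, enabled in (("-force", force), ("-cdna", cdna),
                          ("-urt", urt), ("-tr", tr), ("-sio", sio)):
        if enabled:
            args.append(flag)
    if siom_gb is not None:
        # The post gives 1-1000 Gigabytes as the allowed range for -siom
        if not 1 <= siom_gb <= 1000:
            raise ValueError("-siom takes a value from 1 to 1000 (Gigabytes)")
        args += ["-siom", str(siom_gb)]
    args += list(sff_files)
    return args


# Example: a transcriptome assembly that uses read tips
print(" ".join(build_run_assembly_args(
    "rare_transcripts", ["reads.sff"], cdna=True, urt=True)))
```

The resulting list could be handed to something like Python's subprocess.run, but printing it first makes it easy to check what would be executed.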

New output files for transcriptome assemblies
Assemblies using the -cdna option will have two new output files:

454Isotigs.faa
This file reports the protein sequences of any ORFs (open reading frames) detected in isotigs and contigs (of at least 10 bp). It is up to you to determine which ORF is correct (where the longest one is the most probable for long isotigs…). For each isotig, the ORFs are reported from longest to shortest, and each sequence is preceded by a header line that contains the following tab-separated information:

  • isotig/contig name
  • nucleotide start position (starts at 1)
  • nucleotide end position, inclusive
  • ORF frame: -3, -2, -1, +1, +2, +3
  • nucleotide sequence length, including the stop codon, if present
  • protein sequence length
  • number of Methionine (M) codons

Example (ORFs only partially shown):

>isotig00001    208     2448    +1      2241    746     23
MKAQDDPSRSSSPEGEEDAIMPVKDSPDSEFHRRGSVDTSCIRHNAHVNHRSTREVSPHRGSTMVSTLNSRNTAIMQDDS
EAAKLVDNRPSFVLRSLTGDLDDIVNDVARRAGRHKARKDKPPSPTLNRQITPKDGLKPVRVSRVFRVKKEGYDGGPKSP
...
RASMTGSTPQTVTITEGEELDMDDAR*
>isotig00001    3954    5036    -1      1083    360     5
MKRNLRQGIVLSIMNLRQGIVLSIMSADGVETQVPADEATAADRLLARLQSDKTKAFVVLFADFETGHLRFRQRRRGAAP
...
SSELGVFFFEAFFLIGESSSEVSSSSVNHCRCLLDLCWII*
>isotig00001    772     1410    -3      639     212     9
MIPQLSDGNHCTQELIEALLRSDFLSMFISKNGFYHAESTICMASLMRPADHLWGTCAGLALWGRRFVRLETAMEDLLLV
...
PTHVGAGFVVCYHPMVEVSETGEGLGFQHFDEAIATVDRGCAFQDGLLLCRQ*
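If you want to work with these header lines in a script, something like the following minimal sketch could be a starting point. The names OrfHeader and parse_orf_header are my own, not part of newbler, and the code splits on any whitespace (the file itself uses tabs) so that it also works on the example as shown above.

```python
from typing import NamedTuple


class OrfHeader(NamedTuple):
    name: str        # isotig/contig name
    start: int       # nucleotide start position (starts at 1)
    end: int         # nucleotide end position, inclusive
    frame: int       # -3, -2, -1, +1, +2 or +3
    nt_length: int   # nucleotide length, including the stop codon if present
    aa_length: int   # protein sequence length
    met_count: int   # number of Methionine (M) codons


def parse_orf_header(line):
    """Parse one '>' header line from 454Isotigs.faa into an OrfHeader."""
    fields = line.strip().lstrip(">").split()
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}: {line!r}")
    numbers = [int(value) for value in fields[1:]]  # int() accepts '+1' and '-3'
    return OrfHeader(fields[0], *numbers)


header = parse_orf_header(">isotig00001    208     2448    +1      2241    746     23")
print(header.frame, header.nt_length, header.end - header.start + 1)
```

For the three example headers, the nucleotide length equals end minus start plus one, which is a handy sanity check when filtering ORFs.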

454IsotigOrfAlign.txt
This file contains the predicted ORF amino acids aligned below the nucleotide sequences for the ORFs reported in the 454Isotigs.faa file. (The example below might look better if text size in your browser is set small…)

isotig00001        1 CAaTCCACAACGAGCCACAACAAATCCTCTTCCAAGTTATCATACAAACTCCATTCTGAGCACCCTGCAAATAGCGGCGC   80
+1:1..126          1 Q..S..T..T..S..H..N..K..S..S..S..K..L..S..Y..K..L..H..S..E..H..P..A..N..S..G..A.   27
+2:2..211          1  N..P..Q..R..A..T..T..N..P..L..P..S..Y..H..T..N..S..I..L..S..T..L..Q..I..A..A..P   27
---------------
isotig00001       81 CATACCAGTACCACTTGGTTCTTGGCTCTGTTGGTTTTGGCGTTGAAATTTATGGAAGAACCCGGTGGCTAGCCCAGGCA  160
+1:1..126         28 .I..P..V..P..L..G..S..W..L..C..W..F..W..R..*..                                     42
+2:2..211         28 ..Y..Q..Y..H..L..V..L..G..S..V..G..F..G..V..E..I..Y..G..R..T..R..W..L..A..Q..A..   53
---------------
isotig00001      161 ATCAACAGGCTACATACCCCAGGAAATCGCACCATACTATCATCATCATGAAAGCTCAGGATGATCCATCTCGATCGTCG  240
+2:2..211         54 I..N..R..L..H..T..P..G..N..R..T..I..L..S..S..S..*..                                70
+1:208..2448*      1                                                M..K..A..Q..D..D..P..S..R..S..S..   11
-1:216..332       39                                                        ..*..S..S..G..D..R..D..D.   32
---------------
isotig00001      241 TCGCCGGAGGGAGAGGAAGATGCGATCATGCCCGTGAAAGACAGCCCCGATTCAGAGTTCCATCGGAGAGGTTCTGTTGA  320
+1:208..2448*     12 S..P..E..G..E..E..D..A..I..M..P..V..K..D..S..P..D..S..E..F..H..R..R..G..S..V..D.   38
-1:216..332       31 .D..G..S..P..S..S..S..A..I..M..G..T..F..S..L..G..S..E..S..N..W..R..L..P..E..T..S    5
---------------
isotig00001      321 CACATCTTGCATTCGCCACAACGCTCATGTGAACCATCGAAGTACTCGCGAAGTCAGTCCTCATCGTGGCAGCACCATGG  400
+1:208..2448*     39 .T..S..C..I..R..H..N..A..H..V..N..H..R..S..T..R..E..V..S..P..H..R..G..S..T..M..V   65
-1:216..332        4 ..V..D..Q..M                                                                        1
-1:381..518       46                                                             ..*..R..P..L..V..M..   41
---------------

Lines showing the nucleotide sequence list the isotig/contig name, the start base position of the part shown, the sequence and end base of the part shown.
The other lines show the frame, a colon (‘:’), the start (nucleotide) base, two dots (‘..’) and the end base, followed by an optional asterisk (‘*’) indicating that this ORF is the longest one. This is followed by the start amino acid position of the part shown, the sequence and the end amino acid of the part shown.
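The ORF labels at the start of these lines (for example +1:208..2448*) are easy to pick apart with a regular expression. Below is a minimal sketch; the pattern is derived only from the example shown above, and the names ORF_LABEL and parse_orf_label are hypothetical, so treat it as a starting point rather than a full parser for the file.

```python
import re

# frame, colon, start base, two dots, end base, optional asterisk
ORF_LABEL = re.compile(r"^([+-][123]):(\d+)\.\.(\d+)(\*?)$")


def parse_orf_label(label):
    """Return the parts of an ORF label, or None if it is not one."""
    match = ORF_LABEL.match(label)
    if match is None:
        return None  # e.g. a nucleotide line starting with the isotig name
    frame, start, end, star = match.groups()
    return {
        "frame": int(frame),
        "start": int(start),
        "end": int(end),
        "longest": star == "*",  # asterisk marks the longest ORF
    }


line = "+1:208..2448*     12 S..P..E..G..E..E..D..A..I..M.."
print(parse_orf_label(line.split()[0]))
```

Lines that start with an isotig name, and the dashed separator lines, simply return None, so they can be skipped while looping over the file.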

14 Responses to “What is new in newbler version 2.5.3”

  1. Nitin said

    If we run gsAssembler (2.5), we get the result N50ScaffoldSize= 927736,2; in 454NewblerMetrics.txt. Can you tell me the meaning of 927736,2? I know the meaning of N50.

  2. Dimitra said

    Dear Flxex,
    Im currently trying to align metagenome sequences produced from environmental samples.
    I was thinking of using stringent conditions such as -mi 98 -ml 100 and also the -urt option.
    Do you think its a good idea or will the stringent conditions affect the urt options by preventing the ends from aligning with other potential reads?
    Thank you for your time, Dim.

    • flxlex said

      I think your strategy is good: more stringent alignment hopefully prevents somewhat similar regions from different genomes from collapsing, while the urt flag will yield longer contigs. Good luck!

  3. cram said

    Hi Flxlex,

    Thanks for the post, this blog is a great Newbler resource.

    I’m curious if you have any comments on the overall quality of de-novo transcriptome assemblies with version 2.5.3. I’ve been pretty disappointed with previous versions of newbler for transcriptome work, but suddenly in 2.5.3 with the -urt option, the results appear far better than anything else, at least at first glance. In particular it seems to do a great job of condensing the most abundant transcripts into just a small number of isotigs, where other assemblers always seem to choke on the really abundant ones and produce tons of redundant unigenes. Have you experienced similar results? Or have you found any issues with -urt causing mis-assemblies?

  4. Vega said

    Hi Flxlex,

    Thanks for this blog site! It’s been very helpful to me as I am doing EST assembly with 454 Junior data.

    One question is about the input parameters that can be changed, such as seed step, seed length, and seed count. I’m not sure how I would want to change those or why. Can you explain how different values might change an assembly?

    Thanks in advance.

    • flxlex said

      The defaults are optimized, as we also found out by testing, so I never change them. I am not too familiar with how exactly changing the way seeds are made will affect the assembly. I might read the manual very carefully and write a post about it at a later time. Until then, just give it a try (and write about it on your blog…)!

      • Vega said

        Thanks, flxlex. I’ll give that a try at some point, but I don’t have a blog so won’t be able to post about it there :)

  5. Steven Sullivan said

    Up to now I’ve used sffinfo -s to generate a fasta file of the reads used in an assembly, from the signal-processed .sff files used as input into the assembly. My understanding is that by default, sffinfo outputs trimmed reads (unless -notrim is specified, see p 153 of the Software v. 2.5.3, December 2010 manual , Part C). Would the output of tr via Newbler be different (e.g., additional trimming)?

    Also, is there an easy way to output the left and right ends of a paired-end read as separate FASTA sequences, rather than as a single sequence?

    • Steven Sullivan said

      (I’m guessing the answer to question 1 is ‘yes’, since the trimmed reads output by sffinfo are the same size as what’s reported in the ‘Original Trimmed Length’ column of 454TrimStatus.txt, rather than what’s in “Used Trim Length” column)

      • flxlex said

        You are correct on the first point, newbler does some additional trimming on some of the reads.

        Regarding the second point, using ‘-tr’ actually does exactly that, however, with newbler’s trimpoints.

  6. Youvika Singh said

    hi there!
    I have sequence in the form of contigs, from a rapid library preparation assembled in the gsDenovo assembler version 2.5.3. How can I get scaffolds using these contigs?

    • flxlex said

      You will need paired end or mate pair information in order to be able to scaffold. This is usually obtained by sequencing libraries especially prepared to yield such reads. Without it, all you have is contigs. Perhaps you can order the contigs using an already sequenced genome?
