An assembly of reads, contigs and scaffolds

A blog on all things newbler and beyond

What is new in newbler 2.6

Posted by lexnederbragt on July 12, 2011

The latest version of newbler, version 2.6, has some welcome additions for input and output. As I have so far only treated de novo assembly, I will skip the updates on the gsMapper (except for mentioning that it is now able to provide a bam file using the -bam option).

FASTQ files support
Newbler could already use sff files (including those from the IonTorrent, by the way!), and fasta/qual files (e.g. Sanger reads). Now, fastq files (a much used format for Next Generation Sequencing, see http://en.wikipedia.org/wiki/Fastq) are also supported. In principle, one can now use any fastq file, including those downloaded from the NCBI Short Read Archive, and import it directly into newbler. Newbler should recognize the quality scoring version, and which reads are paired up (read1 and read2), based on the header text. I did a quick test on one such fastq file, and it seems to work.
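To make the pairing idea concrete, here is a minimal Python sketch that counts how many read names occur with both a /1 and a /2 suffix (the naming convention discussed in the comments below). This is not newbler's own logic, which is not documented here; the function name `count_pairs` is made up for the illustration, and it assumes you pass it only the header lines of the fastq file.

```python
from collections import defaultdict

def count_pairs(header_lines):
    """Count read names that occur with both a /1 and a /2 suffix."""
    mates = defaultdict(set)
    for header in header_lines:
        fields = header[1:].split()      # drop the leading '@' and any description
        if not fields:
            continue
        name = fields[0]
        if name.endswith(("/1", "/2")):
            mates[name[:-2]].add(name[-1])
    # A pair is complete only when both mate numbers were seen
    return sum(1 for seen in mates.values() if seen == {"1", "2"})
```

For a file with four lines per record, the header lines are lines 1, 5, 9 and so on. If the count is zero for a file you believe is paired, the header format is the first thing to check.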

Gap filling with repeat contigs
Some contigs, usually with high depth, represent collapsed repeats. These account for many of the gaps in scaffolds. With the new -scaffold flag, you can now ask newbler to place a copy of the repeat in the gap it forms, effectively closing the gap. This potentially leads to much more complete assemblies. Note, however, that a contig from collapsed repeats is the consensus sequence from all occurrences of the repeat. Newbler places an instance of this contig as it is, so if the actual repeat instances in the original genome have sequence variation, it introduces errors in the scaffolds. You will have to take this into account. Two new output files are generated when the -scaffold flag is set, 454ContigScaffolds.txt and 454ScaffoldContigs.fna/qual, which I will describe in the next blog post.

Edge information
New sections have been added to the 454NewblerMetrics.txt file for assemblies which produce scaffolds, called ‘largeContigEndMetrics’, ‘scaffoldGapMetrics’ and ‘scaffoldEndMetrics’. An ‘edge’ represents reads that exit a contig or scaffold and enter another one. For contigs and scaffolds, these metrics report the number and percentage that have ‘NoEdges’, ‘OneEdge’, ‘TwoEdges’, or ‘ManyEdges’. For gaps in scaffolds, those that have ‘BothNoEdges’, ‘OneNoEdges’, ‘BothOneEdge’ and ‘MultiEdges’ are reported. Although there is little documentation about this, I understand that if there are many contigs (or scaffolds or gaps) with no edges, this could indicate that there are not enough reads (too low coverage) to bridge them.
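As an illustration of what such a count-and-percentage report boils down to, here is a small Python sketch that bins hypothetical per-contig edge counts into the four classes. How newbler actually counts edges is not documented, so both the input (a dict of contig name to number of edges) and the assumption that ‘ManyEdges’ means more than two are mine.

```python
from collections import Counter

LABELS = ("NoEdges", "OneEdge", "TwoEdges", "ManyEdges")

def edge_class(n_edges):
    """Map an edge count to a class; 'more than two' = ManyEdges is an assumption."""
    return LABELS[min(n_edges, 3)]

def summarize(edge_counts):
    """Return {class: (number, percentage)} for a dict of name -> edge count."""
    tally = Counter(edge_class(n) for n in edge_counts.values())
    total = len(edge_counts)
    return {
        label: (tally[label], 100.0 * tally[label] / total if total else 0.0)
        for label in LABELS
    }
```

A high ‘NoEdges’ percentage in such a summary would be the signal for too low coverage mentioned above.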

Increased assembly of splice variants in cDNA assembly
The new -isplit option will look for depth spikes in the read alignment for transcriptome datasets. Such a spike could be the result of a specific splice variant. Setting this flag results in using these spikes for starting the generation of isotigs, potentially resulting in more isotigs.


21 Responses to “What is new in newbler 2.6”

  1. I’m looking in the new manual for the 2.6 software and I don’t see a description of -scaffold doing what you report (i.e., closing gaps with repeats). How did you learn of this?

  2. Roxanne said

    I started using 2.6 recently, and have some problems. The first is that no ‘library’ section was created for a paired-end library. The second is that fields in the library section are named differently in 2.6: pairDistanceAvg -> computedPairDistanceAvg (once found, the code looking for it was modified to use the new name, so this is no longer an issue).

    Is there any way to force Newbler to create the library section for a paired library? We have a downstream program that needs it. Do you know?

    • flxlex said

      About your first problem: if the exact same file under the exact same setting was accepted by newbler as a paired-end reads file, then this is very strange. If you check the 454NewblerProgress.txt file, does it say for your file:

      -> XXXXX reads, YYYYYY bases, ZZZZZ paired reads.

      If not, then newbler did not recognize it as a paired end file. You can try to force it to do that using the -p flag.
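A quick way to check this for all input files at once is to pull those lines out of 454NewblerProgress.txt with a regular expression. The sketch below assumes the line layout is exactly as quoted above, with numbers in place of the X, Y and Z placeholders; the function name is hypothetical and the real file may differ in details.

```python
import re

# Layout taken from the line quoted above; the 'paired reads' part is optional
PROGRESS_LINE = re.compile(r"->\s*(\d+) reads, (\d+) bases(?:, (\d+) paired reads)?")

def paired_read_counts(progress_text):
    """Return (reads, bases, paired_reads) for every matching line."""
    counts = []
    for match in PROGRESS_LINE.finditer(progress_text):
        reads, bases, paired = match.groups()
        counts.append((int(reads), int(bases), int(paired) if paired else 0))
    return counts
```

An entry with zero paired reads for a file you expected to be paired end means newbler did not pair anything up in that file.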

      About your second problem, I noticed the same and had to adjust my newblermetrics script. Makes you wonder, why do they keep changing this kind of stuff?
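One way to make a metrics script survive this rename is to try the new field name first and fall back to the old one. This is only a sketch: it assumes the field appears as `name = value;` in 454NewblerMetrics.txt, and the function name is made up.

```python
import re

def pair_distance_avg(metrics_text):
    """Return the average pair distance, accepting the 2.6 and the older field name."""
    for name in ("computedPairDistanceAvg", "pairDistanceAvg"):
        match = re.search(r"\b" + name + r"\s*=\s*([0-9]+(?:\.[0-9]+)?)", metrics_text)
        if match:
            return float(match.group(1))
    return None  # no library section, or an unknown field name
```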

  3. Now that you can feed fastq files into newbler and it detects read pairs, is it necessary to reverse complement the reads from long insert/jumping/MP libraries, or do you know if newbler is not orientation aware?

  4. Markestine said

    Hi,

    I am trying to use Newbler 2.6 to reconstruct a eukaryote transcriptome from 454 + Illumina SE data (100 bp).
    I ran a few tests, but it is very slow: 1 week for 500,000 454 reads + 85 million Illumina reads.

    I ran this command line:
    /runAssembly -cdna -cpu 8 -mi 95 -ml 40 -urt fastq_file sff_file.

    I have several problems:
    1) Newbler recognizes my fastq_file as a paired-end fastq, which it is not. Does anybody have an explanation?
    2) Why does the job take so long?
    3) Is the parameter -ml 40 right, or should I try other values?

    Thanks in advance

    • flxlex said

      1) Newbler recognizes the /1 and /2 at the end of the read name in fastq files and pairs up reads that way. Are you sure newbler interprets your file as paired end? Does it give a number of paired reads after indexing? Does it give an estimate of the paired end distance in the 454NewblerMetrics file?

      2) cDNA assembly using newbler is slow when certain genes are present with very many reads, e.g. rRNA genes (16S or 18S, 23S etc). What 454 recommended to me was to do assemblies with small subsets of the data, map all reads to the contigs, identify contigs with extreme depth and take away almost all reads for those contigs. The goal would be to get an even coverage for all transcripts. I have yet to try this out…

      3) This I don’t know and you will have to try different settings. However, usually, default parameters should be OK. -ml 40 is in fact the default, while 90 is the default for -mi.

      Good luck!

      • Markestine said

        Thanks for your answer.

        1) Yes, I’m sure. In the 454NewblerMetrics file, I have the paired-end section:
        “pairedReadData
        {

        But in my fastq file, I have just /1 at the end of my read names. So it’s a mystery why Newbler recognizes the SE fastq as a PE fastq…

        2) It’s a good idea.
        I’ll try to remove reads in repeated regions to improve the assembly time…

      • flxlex said

        Regarding 1): I have seen similar behavior: whenever newbler recognizes a file as paired end, even if it can’t find pairs, it will automatically start reporting all the relevant metrics. Because of the ‘/1’ in the read identifier, the ‘paired read’ flag is on. If you want to turn it off, you could remove the ‘/1’ from your input file, e.g. using sed:
        cat in.fastq |sed s'/\/1//' >out.fastq
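One caveat with the sed one-liner: it removes the first ‘/1’ on every line, and ‘/’ and ‘1’ are both valid characters in Phred+33 quality strings, so a quality line can be shortened by accident. A more careful sketch in Python touches only the header lines. It assumes four lines per record, and the function name is made up.

```python
def strip_mate_suffix(lines, suffix="/1"):
    """Yield fastq lines with `suffix` removed from the read name on header lines only."""
    for i, line in enumerate(lines):
        if i % 4 == 0:                          # header line of a four-line record
            newline = "\n" if line.endswith("\n") else ""
            name, sep, rest = line.rstrip("\n").partition(" ")
            if name.endswith(suffix):
                name = name[: -len(suffix)]
            line = name + sep + rest + newline
        yield line
```

Usage would be along the lines of opening in.fastq and out.fastq and calling `out.writelines(strip_mate_suffix(infile))`.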

  5. Joanna said

    Question, because newbler assumes the orientation of paired-end reads from 454 data, is it necessary to reverse complement read2 of the Illumina data to use appropriately in newbler?

    • flxlex said

      If your reads are in native Illumina fastq (before the latest Casava 1.8 upgrade) or downloaded from the SRA, newbler should recognize them automatically and you don’t need to do any reverse complementing. See also my ‘quick fix for the new Illumina fastq header’ post here.
      Otherwise, newbler expects the pairs not coming from 454 (e.g. Sanger and Illumina) to point towards each other, so you need to do nothing for paired ends, but for mate pairs (longer insert distances created with a special library prep protocol), you need in fact to reverse complement both reads…
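For the mate pair case, reverse complementing a read is simple enough to do yourself. Here is a minimal Python sketch; note that the quality string has to be reversed along with the sequence. The function names are made up, and the sketch only handles A, C, G, T and N.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]

def reverse_complement_record(seq, qual):
    """Reverse complement the bases and reverse the qualities to match."""
    return reverse_complement(seq), qual[::-1]
```

You would apply this to both read1 and read2 of every mate pair before handing the file to newbler.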

  6. sachin said

    I am currently working on a project where I have a hybrid assembly (454 + Illumina + SOLiD)
    of 157439 contigs (genome of about 850Mb).

    Now I want to create scaffolds using 20kb 454 paired end reads.

    Which tool can I use? Newbler? (I would not use MIRA). With Which options?

    Any suggestions?

    Look forward to hearing from you.

    • flxlex said

      If you want to use newbler, you can only use the 454 and Illumina reads. Alternatively, you could preassemble some of the reads (e.g. SOLiD, or SOLiD + Illumina), and input the contigs together with the 454 reads into newbler (perhaps cutting up the contigs). For scaffolding existing contigs with mate pairs, there are tools such as SSPACE and Bambus (just google them).

  7. Sudeep said

    Hi Flxlex,
    Regarding the -e parameter mentioned in an earlier blog post: you mentioned that it should be tried for >50X coverage (so this means it is for a single genome?). I have metagenomic samples and have tried assembly with Newbler (random reads, not the whole data set) using default parameters. After testing this in a few ways, we have reason to believe that we have 2 predominant species. So for metagenomic assembly, is there a way to tell Newbler to assemble reads with ‘x’ coverage? I am not sure if -e is the option to use, or whether the new version has other parameters for this.
    Thanks

    • flxlex said

      The -e is intended for a single genome. You can’t tell newbler to assemble only for a certain coverage – which would not work as some repeats naturally will have a higher coverage. But, you could perhaps optimize the assembly, and then split the contigs based on depth, or GC, or both?
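As a starting point for such a split, here is a small Python sketch that separates contigs at a depth cutoff and records the GC fraction of each. The input (a dict of contig name to sequence and mean depth), the cutoff and the function names are all hypothetical; you would have to pick the cutoff by looking at the depth distribution of your own assembly.

```python
def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0

def split_by_depth(contigs, depth_cutoff):
    """Split {name: (sequence, mean_depth)} into low- and high-depth groups."""
    low, high = {}, {}
    for name, (seq, depth) in contigs.items():
        group = high if depth >= depth_cutoff else low
        group[name] = (seq, depth, gc_fraction(seq))
    return low, high
```

If the two predominant species differ clearly in depth or in GC content, the two groups (or a plot of depth against GC) should show it.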

  8. Herty said

    hi,

    Does anyone know how to solve this problem? I tried to convert the sff files and 454PairAlign.txt to a sam file, and it gave me this error message: ValueError: substring not found

    See the details below:
    Command line:
    glu seq.Newbler2SAM -o 454.RL2.sam 454PairAlign.txt 454Reads.RL2.01.sff 454Reads.RL2.02.sff

    Well, this is embarrassing.

    Traceback: Traceback (most recent call last):
      File "glu_launcher.py", line 221, in main
        progmain()
      File "/mnt/SeqCapPool/lianyh/glu-genetics-dir/lib/python2.7/site-packages/glu-1.0b3_dev-py2.7.egg/glu/modules/seq/Newbler2SAM.py", line 456, in main
        out.writerows(alignment)
      File "/mnt/SeqCapPool/lianyh/glu-genetics-dir/lib/python2.7/site-packages/glu-1.0b3_dev-py2.7.egg/glu/modules/seq/Newbler2SAM.py", line 280, in handle_unaligned
        for align in alignment:
      File "/mnt/SeqCapPool/lianyh/glu-genetics-dir/lib/python2.7/site-packages/glu-1.0b3_dev-py2.7.egg/glu/modules/seq/Newbler2SAM.py", line 219, in handle_maligned
        aligns = list(aligns)
      File "/mnt/SeqCapPool/lianyh/glu-genetics-dir/lib/python2.7/site-packages/glu-1.0b3_dev-py2.7.egg/glu/modules/seq/Newbler2SAM.py", line 156, in pair_align_records
        start = seq.index(qseq)
    ValueError: substring not found

    Many thanks for your help,

    Best,
    Herty

  9. Anne said

    Hello,
    Does anyone know how to solve this problem?
    Newbler did not recognize my paired-end reads, and did not build the scaffolds.
    This is the command I used:

    runAssembly -o newblerTrmkmer28 -mi 96 -ml 60 -large -cpu 20 -ace -p ./reads/TrmR1.fastq ./reads/TrmR2.fastq

    These are the output files that I got.

    454AllContigs.fna 454LargeContigs.fna 454ReadStatus.txt
    454AllContigs.qual 454LargeContigs.qual 454TrimStatus.txt
    454ContigGraph.txt 454NewblerMetrics.txt sff
    454Contigs.ace 454NewblerProgress.txt

    Also, after trimming, some of the reads lost their mate and are now in a separate .fastq file. How can I include those reads together with the paired-end reads?

    Thank you,
    Anne
