An assembly of reads, contigs and scaffolds

A blog on all things newbler and beyond

A script for converting the 454NewblerMetrics.txt file to a tab-separated file

Posted by lexnederbragt on May 9, 2011


One of you asked in the comments: “Is there an existing way of converting the 454NewblerMetrics.txt file to a tab-delimited file?”

I have in fact written a script for that. We use it all the time in our group for newbler assemblies, and I am hereby sharing it with you. The Perl script, called newblermetrics.pl, takes a 454NewblerMetrics.txt file from a newbler assembly as input. It works on shotgun assemblies, with or without paired-end data, and on cDNA assemblies (for which the output also includes the isogroup and isotig metrics). It will not work on mapping projects (gsMapper/runmapping commands).

The script produces output like this:

Input
Number of reads    975240
Number of bases    275262092
Number of reads trimmed    1195883    122.6%
Number of bases trimmed    256085747    93.0%

Consensus results
Number of reads assembled    1065078    89.1%
Number partial    14365    1.2%
Number singleton    105760    8.8%
Number repeat    7248    0.6%
Number outlier    3432    0.3%
Number too short    0    0.0%

Scaffold Metrics
Number of scaffolds    12
Number of bases    5799904
Average scaffold size    483325
N50 scaffold size    5479633
Largest scaffold size    5479633

Large Contig Metrics
Number of contigs    479
Number of bases    5694980
Average contig size    11889
N50 contig size    44505
Largest contig size    160534
Q40 plus bases    5686792    99.86%

All Contig Metrics
Number of contigs    1748
Number of bases    6114087
Average contig size    3498

Library    Pair distance average (bp)
lib_3kb.sff    2542.8
lib_8kb.sff    7601.6

The script is available for download here: http://sourceforge.net/projects/newblertools/files/newblermetrics. I’d appreciate any feedback!

UPDATE: Dag Ahren and Björn Canbäck made a web version of the script, accessible here: http://mbio-serv2.mbioekol.lu.se/apps/newblerMetrics.html
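For readers curious how such a conversion can work, here is a minimal sketch in Python. It is not the newblermetrics.pl script itself. It assumes the metrics file consists of nested `name { ... }` blocks containing `key = value;` lines, with `/* ... */` comment banners in between. The excerpt in `sample` is made up for illustration in that assumed layout: `scaffoldMetrics`, `numberOfScaffolds` and `pairDistanceAvg` are names that appear in the script, while `consensusResults`, `pairedReadStatus`, `library` and `libraryName` are assumptions. The values are borrowed from the example output above.

```python
import re


def parse_newbler_metrics(text):
    """Parse nested `name { key = value; }` blocks into nested dicts.

    Assumed layout: a block name on its own line, then "{" and "}" on
    their own lines, and `key = value[, value...];` lines inside.
    """
    # Drop /* ... */ comment banners before reading line by line.
    text = re.sub(r"/\*.*?\*/", "", text, flags=re.DOTALL)
    root = {}
    stack = [root]
    pending = None  # block name seen on the line before its "{"
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "{":
            child = {}
            # Block names can repeat (several libraries), so keep a list.
            stack[-1].setdefault(pending, []).append(child)
            stack.append(child)
        elif line == "}":
            stack.pop()
        elif "=" in line:
            key, value = line.split("=", 1)
            # A value may hold several comma-separated fields.
            fields = [f.strip().strip('"') for f in value.rstrip(";").split(",")]
            stack[-1][key.strip()] = fields
        else:
            pending = line
    return root


# Hypothetical excerpt in the assumed layout (not copied from a real file).
sample = """
/*
**  Consensus results.
*/
consensusResults
{
    pairedReadStatus
    {
        library
        {
            libraryName     = "lib_3kb.sff";
            pairDistanceAvg = 2542.8;
        }
        library
        {
            libraryName     = "lib_8kb.sff";
            pairDistanceAvg = 7601.6;
        }
    }
    scaffoldMetrics
    {
        numberOfScaffolds = 12;
    }
}
"""

metrics = parse_newbler_metrics(sample)
results = metrics["consensusResults"][0]

# Print a few values as tab-separated lines.
print("Number of scaffolds\t" + results["scaffoldMetrics"][0]["numberOfScaffolds"][0])
for lib in results["pairedReadStatus"][0]["library"]:
    print(lib["libraryName"][0] + "\t" + lib["pairDistanceAvg"][0])
```

Every value is kept as a list of strings, because some lines in the assumed layout carry more than one field (a count followed by a percentage, for instance).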

11 Responses to “A script for converting the 454NewblerMetrics.txt file to a tab-separated file”

  1. Germain Chevignon said

    Hi
    Your blog is very nice and very interesting.
    I also use the Newbler software, including GSMapper, and I am looking for a script that gives me the read composition of each contig produced by the assembly, because this information is not in the output files.
    Do you know if a script like this already exists?

    Thanks very much for your help.

    Germain Chevignon

  2. zaki said

    This is a very nice and useful script. Thanks a lot. But I’m missing the information about paired end statistics in the output.

  3. Katie said

    Hi

    Your blog is so helpful, I’m totally new to NGS data. The script works for me.

    Here is the summary of one of my runs. Does the data look good? What kind of interpretation can I make based on those numbers?

    Input
    Number of reads 520353
    Number of bases 49529016
    Number of reads trimmed 403982 77.6%
    Number of bases trimmed 18028700 36.4%

    Consensus results
    Number of reads assembled 144624 35.8%
    Number partial 5546 1.4%
    Number singleton 820 0.2%
    Number repeat 20 0.0%
    Number outlier 188 0.0%
    Number too short 252784 62.6%

    Large Contig Metrics
    Number of contigs 10
    Number of bases 6189
    Average contig size 618
    N50 contig size 577
    Largest contig size 1053
    Q40 plus bases 6153 99.42%

    All Contig Metrics
    Number of contigs 79
    Number of bases 28083
    Average contig size 355

    Thanks

    Katie

    • flxlex said

      It is very difficult to ‘judge’ your assembly without knowing more about the project. Most importantly, what is the expected genome size?

      The metrics show massive trimming of the bases, down to 36% of the input, leading to a very high % of ‘Too Short’ reads. Any idea why this is?

      Of the remaining 18 Mb of trimmed bases, about 36% (6.5 Mb) is assembled into 28 kb (AllContigs), which gives about 230x coverage. This is very high. You could try running with the ‘-e 230’ setting, telling newbler you have such a high coverage, or reducing the amount of input.
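The arithmetic behind the 230x estimate can be written out as a short Python sketch. It follows the reply above in using the percentage of assembled reads (35.8%) as a stand-in for the fraction of trimmed bases that ended up in contigs, which is an approximation.

```python
# Values taken from Katie's summary above.
trimmed_bases = 18_028_700    # "Number of bases trimmed"
fraction_assembled = 0.358    # "Number of reads assembled", 35.8%
all_contig_bases = 28_083     # "Number of bases" under All Contig Metrics

# Approximate number of bases that went into the contigs.
assembled_bases = trimmed_bases * fraction_assembled

# Coverage = assembled bases divided by the total contig length.
coverage = assembled_bases / all_contig_bases

print(f"assembled bases: {assembled_bases / 1e6:.1f} Mb")  # assembled bases: 6.5 Mb
print(f"estimated coverage: {coverage:.0f}x")              # estimated coverage: 230x
```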

  4. Dan said

    Can I get values in a row, then run on several files to get a table?

    kthxbi ;-)
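A minimal sketch of one way to get such a table, written in Python rather than Perl and not part of newblermetrics.pl. It assumes the script's output has the shape shown in the post: a section name on its own line, then `label<TAB>value` lines with an optional percentage as a third field, and blank lines between sections. The function names and the `run_a`/`run_b` labels are made up for illustration. The values come from the two summaries on this page.

```python
import csv
import io


def flatten_metrics(text):
    """Turn the two-column output into one {column name: value} dict."""
    row = {}
    section = ""
    for line in text.splitlines():
        fields = line.split("\t")
        if not line.strip():
            section = ""           # a blank line ends the current section
        elif len(fields) == 1:
            section = fields[0]    # e.g. "Large Contig Metrics"
        elif not section:
            section = fields[1]    # header line of the library table
        else:
            row[f"{section}: {fields[0]}"] = fields[1]
            if len(fields) > 2:
                row[f"{section}: {fields[0]} (%)"] = fields[2].rstrip("%")
    return row


def metrics_table(outputs):
    """Build a tab-separated table with one row per assembly.

    `outputs` maps a label (for example a file name) to the output text.
    """
    rows = {name: flatten_metrics(text) for name, text in outputs.items()}
    columns = []
    for row in rows.values():
        for col in row:
            if col not in columns:
                columns.append(col)
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["assembly"] + columns)
    for name, row in rows.items():
        # Metrics missing from an assembly are left empty.
        writer.writerow([name] + [row.get(col, "") for col in columns])
    return buf.getvalue()


run_a = ("Input\nNumber of reads\t975240\n\n"
         "Large Contig Metrics\nNumber of contigs\t479\n"
         "Q40 plus bases\t5686792\t99.86%\n")
run_b = ("Input\nNumber of reads\t520353\n\n"
         "Large Contig Metrics\nNumber of contigs\t10\n"
         "Q40 plus bases\t6153\t99.42%\n")

print(metrics_table({"run_a": run_a, "run_b": run_b}), end="")
```

Prefixing each column with its section name keeps labels such as "Number of bases", which occur in several sections, from overwriting each other.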

  5. Dan said

    This was messing with my downstream parser:

    [CODE]
    --- newblermetrics1.1.pl~ 2011-09-06 15:12:02.000000000 +0100
    +++ newblermetrics1.1.pl 2012-09-24 10:41:57.013336454 +0100
    @@ -209,7 +209,7 @@
      print "\n";

      if (exists $metrics{'scaffoldMetrics'}{'numberOfScaffolds'}){
    - print "Library\tPair distance average (bp)\n";
    + print "Library Pair distance average (bp)\n";
      foreach my $lib_name (sort @lib_names){
      print "$lib_name\t",$metrics{$lib_name}{'pairDistanceAvg'},"\n";
      }
    [/CODE]

  6. Dan said

    OK, I wrote a small Perl script to post ‘tidy’ your output, then an R script to plot the data (assuming a range of -mi and -ml)… Now I have to pick the assembly to use … Waa!
