An assembly of reads, contigs and scaffolds

A blog on all things newbler and beyond

Newbler output I: the 454NewblerMetrics.txt file

Posted by lexnederbragt on March 11, 2010

With this post, I’ll start going through the output files newbler generates. Some of these will be described in detail as they contain a lot of important information.

For today’s post, we’ll start with the 454NewblerMetrics.txt file. This file contains a lot of details on the reads used during the assembly, as well as the resulting contigs and, in the case of paired end reads, scaffolds.

The file starts with some metadata, such as the date of the assembly, where is is located, and what version of newbler was used. For this post, I used a file of as assembly generated with version 2.3 and both shotgun and paired end read files. Note that the output will be slightly different for a mapping project (to be described in a later post) than for an assembly project.

Section 1: runData

file
{
path = "/your/path/yourfile1.sff";

numberOfReads = ######, ######;
numberOfBases = #########, #########;
}

For each input file, the numbers mentioned are reads and bases in the file, reads and bases after trimming.

Section 2: pairedReadData
For files with paired end reads, there is some additional information (and something strange):

{
path = "/your/path/yourfile2.sff";

numberOfReads = ######, ######;
numberOfBases = #########, #########;
numWithPairedRead = ######;
}

Not all reads have the paired end linker, or have long enough sequences on both ends of the linker, so the number of reads that have been treated as paired end reads is listed. Note that the number of trimmed reads is higher than the number of actual reads. The reason for this is that each half of the paired end reads is counted as a separate trimmed read.

Section 3: runMetrics

{
totalNumberOfReads = #########;
totalNumberOfBases = ###########;

numberSearches   = ########;
seedHitsFound    = ##########, ##.##;
overlapsFound    = ##########, ###.##, ###.##%;
overlapsReported = ##########, ###.##, ##.##%;
overlapsUsed     = #########, #.##, #.##%;
}

The numbers of searches, seeds and overlap have to do with the alignment phase of the assembly (see previous post).

Section 4: readAlignmentResults

{
path = "/your/path/yourfile1.sff";

numAlignedReads     = ######, ##.##%;
numAlignedBases     = #########, ##.##%;
inferredReadError  = #.##%, #######;
}

Here, for each input file, the number of reads and bases that were aligned to other reads are reported. In addition, ‘inferredReadError’ indicates the number of errors (mainly indels) between the contigs and the reads in this file is reported. If this number is high, the read file might contain very low quality reads. I must admit that I don’t know that ‘high’ would be (if you have a bunch of sff files from different runs, one could compare the inferred error rate between them).

Section 5: pairedReadResults

{
path = "/your/path/yourfile2.sff";

numAlignedReads     = ######, ##.##%;
numAlignedBases     = ########, ##.##%;
inferredReadError  = #.##%, #######;

numberWithBothMapped  = #####;
numWithOneUnmapped    = #####;
numWithMultiplyMapped = #####;
numWithBothUnmapped   = ###;
}

For the paired end read files, the same information is displayed as in section 4. In addition, mapping results for the pair halves are mentioned: the number of reads for which i) both pair halves mapped (i.e aligned) uniquely (either to the same contig, or to two different contigs), ii) only one halve mapped,  iii) one or both of the halves mapped to multiple places, neither halve aligned to any contig. Of those reads in group i), reads that map to the same contig are used to calculate the average paired end distance for the reads, while only those reads which do map to different contigs will be used for scaffolding.

This concludes the part on the input. Next are the actual assembly results

Section 6: consensusDistribution
This section deals with basecalling of the consensus contigs. After alignment and contig building (which are done in ‘base space’, i.e. using nucleotide sequences), the flow signals of the reads obtained during sequencing (i.e the light intensities during incorporation of nucleotides) is used to determine the consensus bases. In fact, the number of homopolymers (A, AA, AAA, AAAA, ….) is determined based on all the signals present at a certain position.
The section on ‘fullDistribution’ gives a histogram of all the normalized signal values (the value 0 means no base incorporated during that flow, this happens many times). You will see peaks around (but not exactly at) 1.0, 2.0 etc. These peaks are mentioned under “distributionPeaks”. Based on this information, newbler determines thresholds for the border between n A’s and n+1 A’s (or C’s, G’s T’s). These thresholds are mentioned under thresholdsUsed. For example, “threshold = 3, 4, 3.38” means the threshold between three and four bases lies at 3.38. Finally the “interpolationAmount” is used for signals over the highest peak mentioned, (but I do not know how this number is used…).

Section 7: consensusResults
Here is the really interesting stuff: the metrics of the assembly. But first, a summary of the reads, i.e. what was mentioned in section 4 and 5 is summarized under “readStatus” and “pairedReadStatus”. In addition, for all the reads it is mentioned whether they were
–    assembled fully or partially (in the latter case not all of the read was used in a contig)
–    singleton, i.e. reads without overlap to other reads
–    repeat, i.e. the read had many seeds which were present in many other reads, indicating that they were from repeat regions
–    outlier, i.e. reads which aligned, but in a way that was in inconsistent with the majority of reads
–    too short, the (trimmed) read was shorter than the minimum length used by newbler (which is 50 bases, unless when paired end reads are present, in which case it is 20 bases)
Also, for paired end reads, the per-library average distance between halves is mentioned, as well as the standard deviation of the average.

library
{
libraryName     = "3kb_PE";
pairDistanceAvg = 2675.5;
pairDistanceDev = 668.9;
}

This number is calculated, as mentioned above, by using the pairs where both halves map to the same contig. The average distance is then used to estimate the distance between scaffolded contigs. The library name is usually the filename. It is possible, however, to have several sff files, coming from the same paired end library, grouped under one library name, with a common paired end distance, by changing the way newbler is run (it involves the “addRun” command, more on this another time). Note, by the way, that the standard deviation is always 25% of the average distance (so, this number is ‘set’ by newbler, not calculated)…

“scaffoldMetrics”:

{
numberOfScaffolds   = #####;
numberOfBases       = #########;

avgScaffoldSize     = #####;
N50ScaffoldSize     = ######;
largestScaffoldSize = #######;
}

Here, the number of scaffolds and bases (including gaps) are mentioned, as well as average and largest scaffold size. The N50 scaffold size is defined such that “half of the assembled bases reside in scaffolds having a length of at least the N50 scaffold length” Or, when you sum up the lengths of the scaffolds from longest to shortest, the N50 size is the size of the scaffold where your sum is hits or crosses half the total length. N50 is a more effective number to compare assemblies, as it is reflecting the length distribution of the scaffolds, and is less influenced by the total number of scaffolds than the average.

Next is “largeContigMetrics”:

{
numberOfContigs   = ######;
numberOfBases     = #########;

avgContigSize     = ####;
N50ContigSize     = ####;
largestContigSize = #####;

Q40PlusBases      = #########, ##.##%;
Q40MinusBases     = #######, #.##%;
}

Large contigs are those of at least 500 bp, (this cutoff can been set differently). The same metrics as for the scaffolds are given (this time, the total length obviously does not include gaps). The number of Q40 plus bases indicates the bases with a phred-like consensus quality score of at least 40 (meaning a 1:10000 [1 in 104.0

] chance of the base being wrong). It used to be said that for a finished genome, 100% of the bases had a quality score of at least 40.

Finally, the number and total length of all the contigs, i.e. those with a minimum length of 100 bp (again, this number can be changed) is listed.

Note that not all contigs are used in scaffolds. I feel this metric should be included in this file, but I might get back on how to obtain it using some small-scale bioinformatics.

Advertisements

36 Responses to “Newbler output I: the 454NewblerMetrics.txt file”

  1. potato Jr. said

    Glad you explained well this “incomprehensible” output. I have this output and like to find out the number of single and paired reads.

    sffFile = “FP8LRLE01.sff”;
    numberOfReads = 87450, 133982;
    numberOfBases = 28604420, 23944283;
    numWithPairedRead = 51014;

    We cannot really get the number of single and paired reads from this, can we? Since we do not know how much reads are dropped during the process.

    thanks
    HI

    • contig said

      Well, the numbers after the comma will tell you:

      133982 reads total (shotgun + pairs, where each half counts as one read)
      51014 paired reads, which count as 102028 reads in the previous number
      So, 133982-102028=31954 shotgun reads.

      31954 shotgun reads + 51014 paired reads = 82968 reads in total (this is the trimmed number of reads; he number of untrimmed reads was 87450). By the way, the file had 61% paired reads, which is a good number…

      In conclusion, the numbers are there but ‘hidden’.

      Hope this helps!

  2. Phillip said

    You write:

    phred-like consensus quality score of at least 40 (meaning a 1:104.0 chance of the base being wrong

    Should that not read “1:10000”, rather than “1:104.0”?

    • flxlex said

      Thanks for catching that, it is corrected now. It is actually a 1:10000 chance, not 1:1000. Divide the phred value by ten, raise ten to that power and you got the ‘X’ of the 1:X chance…

  3. Steven Sullivan said

    Thanks for the excellent blog.
    How do you interpret the ‘Alignment depth’ part of NewblerMetrics.txt (Software Release: 2.3 (091027_1459)? I’m particularly curious about what comes at the end:

    peakDepth
    estimatedGenomeSize

    Any idea how Newbler calculates them?

    • flxlex said

      Above these lines is a sort of histogram of read depths (for the contig, I assume). Using the peak of this distribution, and the total amount of reads on can estimate the genome size. I guess this is what the lines stand for. The manual only mentions the ‘histogram’ and does not explain the lines you refer to. I don’t know how this is calculated exactly.

      In conclusion, I don’t know if you can trust the estimate…

      • babyfaruda said

        What is the peakDepth in the file? Is it the read depth or the average coverage of the genome? Thanks!

      • It is related to the coverage, but it is not the coverage. Coverage is raw bases divided by genome size (or assembly size). This is the peak of the distribution of the alignment depth of the reads forming the contigs, I think.

  4. Thomas said

    Why there are two N50 values in scaffoldMatrices section of 454NewblerMatrices.txt file?

    Eg:
    scaffoldMetrics
    {
    numberOfScaffolds = 10055;
    numberOfBases = 26566567;

    avgScaffoldSize = 2642;
    N50ScaffoldSize = 2549, 4163;
    largestScaffoldSize = 27581;

    }

    • flxlex said

      What version of newbler are you using? I am pretty the second number is the scaffold N50 number, i.e. the number of scaffolds of size N50 or larger

      • Sergio said

        the output with 2 N50 numbers is coming out of version 2.5. Couldn’t find what it means either.

    • Guy Leoanrd said

      Is there any update to this?

      I am using v2.5.3 and get the below metrics, I don’t think that the second number can be the N50 as analysis of the scaffolds from this run gives very good coverage compared to an EST library of the same taxa.

      numberOfScaffolds = 628;
      numberOfBases = 24855089;

      avgScaffoldSize = 45656;
      N50ScaffoldSize = 257482, 31;
      largestScaffoldSize = 1278887;

    • Guy Leoanrd said

      I figured out what the second statistic on the N50 line is.

      The first number is very definitely the N50, the second number is the L50 which is the number of sequences larger than or equal to the N50 sequence size!

  5. Gerald said

    Hi Flxlex,
    thanks for your useful blog…
    I have a question regarding the difference between align reads and assembled + Partially assembled…
    I thought that numAlignedReads = numberAssembled + numberPartial….but it’s not the case
    In the 454NewblerMetrics.txt of our run, we have :
    12753 + 7252 = 20005 assembled or partially assembled reads whereas we have 20057 aligned reads

    where can be the missing reads?

    many thanks for your help

    Gérald

    ...
    consensusResults
    {
    readStatus
    {
    numAlignedReads = 20057, 22.82%;
    numAlignedBases = 6126444, 19.90%;
    inferredReadError = 3.97%, 243502;

    numberAssembled = 12753;
    numberPartial = 7252;
    numberSingleton = 64297;
    numberRepeat = 55;
    numberOutlier = 3544;
    numberTooShort = 0;
    }

    ...
    }

    • flxlex said

      I had not yet done that math, interesting…

      Some of the ‘missing reads’ might be among those labeled ‘Repeat’, while some originally aligned reads perhaps were being kicked out during later stages of the assembly. Come to think of it, outlier reads are actually aligned, but usually in an inconsistent way relative to the contig graph, but I am not sure if these are counted under the numAlignedReads metric… So, the short answer is: it is a bit fishy…

      • Gerald said

        OK, thanks for your answer.
        I just sent an email to Roche about this.
        I will post their answer here.

        Gérald

      • Gerald said

        Answer from Roche support
        Dear Gérald,
        Just as a confirmation: numAlignedReads are the sum of the numAssembled and the numPartial.
        As this isn’t the case in your assembly it seems that you found a bug in the software. Therefore, our developers ask me if you could provide the sff files. They would need these files for investigation.

        Best regards,

      • flxlex said

        Well well… I cannot get these number to add up for a few assemblies I just checked either…

  6. Gerald said

    Answer from Roche today :
    Below please find some further information on your request regarding the numAlignedReads:

    The 454ReadStatus.txt file has two kinds of accnos.

    The ones that just have a line like this:

    GYAGOPT01AKGTV Repeat

    And the ones that have a line like this:
    GYAGOPT01AF64O Repeat contig00992 5 – contig00992 202 +

    The first kind are inferred (screened out prior to assembly due to high frequency kmers), and the second kind are not inferred.

    If you do “grep Repeat 454ReadStatus.txt | grep -c contig” this will give you the count of non-inferred repeats, and when you add those to numberPartial and numberAssembled, you reach the numAlignedReads total.

    We’ve requested that the bug be turned into an enhancement request that makes these categories more clear in 454ReadStatus.txt and breaks them out into distinct categories in 454NewblerMetrics.txt.

    Best regards,

  7. GM said

    Hi Flxlex,

    Thanks for this wonderful blog. It has been my bible in my 454 adventures!

    I had a quick question. Is there an existing way of converting the 454NewblerMetrics.txt file to a tab-delimited file? I could write a python script, but I dont want to reinvent something already present.

    Thanks

  8. Dim said

    Dear Flxlex,
    I was interested in finding out the number of reads that were aligned into contigs and the number of reads that remained as singletons. Based on the previous posts, I’m assuming i need to look at “numAlignedReads” to see the number of reads that were aligned into contigs and “numberSingleton” to see the number of reads that remained as singletons. Is that correct?

    My other question is, is the numAlignedReads + numberSingleton suppose to = totalNumberOfReads?
    In the example I’ve showed below it doesn’t add up.

    runMetrics
    {
    totalNumberOfReads = 2167524;
    totalNumberOfBases = 815827296;

    consensusResults
    {
    readStatus
    {
    numAlignedReads = 812802, 37.50%;
    numAlignedBases = 294544779, 36.10%;
    inferredReadError = 1.10%, 3251821;

    numberAssembled = 606164;
    numberPartial = 206595;
    numberSingleton = 1339740;
    numberRepeat = 343;
    numberOutlier = 6546;
    numberTooShort = 8136;
    }

    Since it doesnt add up, if i was to exclude the numAlignedReads and numberSingleton what else makes up totalNumberOfReads?
    Thank you for your time, Dim.

    • flxlex said

      If you scroll up a bit and look for the question asked by Gerald, you will see that
      – numAlignedReads is supposed to be the sum of numAssembled and numPartial
      – a bug in newbler makes that this is not so

  9. Jordi said

    Hi flxlex,
    Regard the number of reads you get after a transcriptome assembly, looking at the NewblerMetrics I realize that I have a 83% of aligned reads (i.e 296.470) but when I try to get the “numreads” from the 454AllContigs.fna, I get a far larger number (2020868). What’s going on??
    I mean:
    From NewblerMetrics:
    numberOfReads = 408505, 355140;
    numAlignedReads = 296470, 83.48%;
    From 454AllContigs.fna:
    >contig00001 length=151 numreads=142 gene=isogroup00001 status=ig_thresh
    >contig00002 length=258 numreads=247 gene=isogroup00001 status=ig_thresh
    numreads >>> 296470.

    I used that way to know the number of aligned reads 3 years ago from an assembly done with Newbler release 1.1 but I’ve done another trnascriptome assembly with the last one (2.6) and I don’t know exactly if there’s a difference between how both program versions deal the reads.

    • flxlex said

      As was pointed out on SeqAnswers, what is mentioned for ‘numreads’ is the number of reads that contributed to the contig. Some of these reads may be entirely aligned within the contig, other reads will have part of their sequences aligned in (an)other contig(s). There is a ‘-rip’ option to force newbler to place each read uniquely into one contig, but I would not recommended this for transcriptome assemblies.

      • Jordi said

        Thank so much flxlex.
        So according to your opinion, the best way to get the number of assembled reads would be looking at the ReadStatus file, wouldn’t it? And as I’ve read here (https://contig.wordpress.com/2010/03/11/newbler-output-i-the-454newblermetrics-txt-file/) the Assembled+Partial+non-inferred Repeats should be the total assembled reads in Newbler 2.6. But what about Newbler 1.1?? In NewblerMetrics I didn’t have non-inferred repeats and readStatus statement only has assembled, partial,singleton,repeat and outlier reads.
        Thanks again

      • flxlex said

        I would have to guess here, but most likely in newbler 1.1, anything not marked as singleton, repeat or outlier is assembled. There used to be a bug where some of the numbers didn’t add up, I am unsure about whether this also was the case for version 1.1, though…

      • Jordi said

        Thank you flxlex. I’ll ask to 454 Customer Service in order to get more information. I’ll post here if you think could be useful.
        Regards

  10. Jordi said

    That was the Roche answer:
    “We do not have a guideline to compare assemblies from different software versions, especially with very early versions.
    The numbers you are calculating are okay (although there is a much easier way to have a look at it now, please see below). I am not sure why your numbers differ, but it might be due to pair end reads that are internally split by the assembler (provided you have pair end data). If you are truly interested in this discrepancy, I can research it more intensively.
    In order to get a feeling of the quality of an assembly, we can now look at the 454NewblerMetrics.txt file (unfortunately this metric file was not available for very early software versions). There you can find a summary of several metrics calculated from the assembly. It’s a bunch of metrics, but the most important ones are at the bottom of the page, under the section #Consensus Results.”

    Hope this helps to any other kind of doubts.

  11. ARW said

    Hello Flxlex,
    great blog!!!
    I need some help on the 454NewblerMetrics.txt file. I was asked to have a second look on some BACs sequenced on 454. Unfortunately, I realized that the reads were not trimmed correctly as there are considerable vector sequences left in the original data. I now removed those sequences and would like to run a new de-novo assembly on a standalone Newbler installation on my server. Since I have the 454NewblerMetrics.txt file available from the original assembly (done by an external company), I was wondering whether it is possible to extract all the optional parameters used by the company from that file. My intention is to run a new assembly using the trimmed reads in a very comparable way as the original assembly. I tried it without any additional optional parameters, but the assembly produced much more contigs (ca. 150 conticgs) than the original one (ca. 25 contigs with vector contamination).

    Thank you all for any help.

    • flxlex said

      Unfortunately, the 454NewblerMetrics.txt file will not tell you the parameters that were used. If you can ask the company that performed the original assembly for the settings they used that would be best.

  12. Steven Sullivan said

    v2.8 Newbler now reports N50Scaffoldsize as a pair of numbers, e.g.

    N50ScaffoldSize = 6570, 798;

    there is, of course, no explanation for the second number in the manual…..

    Any guesses?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
%d bloggers like this: