In the post on what is new in newbler version 2.6, I introduced the -scaffold option. Briefly, with this option instances (i.e. the consensus sequence) of repeats are placed in gaps. As I mentioned, setting -scaffold results in two extra files. With this post, I will explain these in detail.
Archive for the ‘Newbler output’ Category
Posted by lexnederbragt on July 12, 2011
Posted by lexnederbragt on May 9, 2011
One of you asked in the comments: “Is there an existing way of converting the 454NewblerMetrics.txt file to a tab-delimited file?”
I have in fact written a script for that. We use it all the time in our group for newbler assemblies, and I am hereby sharing it with you. The perl script, called newblermetrics.pl, needs to be given a 454NewblerMetrics.txt file from a newbler assembly. It works both on shotgun assemblies, with or without paired end data, and on cDNA assemblies (for which it includes the isogroups and isotigs metrics in the output). It will not work on mapping projects (gsMapper/runmapping commands).
The script produces an output like this:
Number of reads 975240
Number of bases 275262092
Number of reads trimmed 1195883 122.6%
Number of bases trimmed 256085747 93.0%
Number of reads assembled 1065078 89.1%
Number partial 14365 1.2%
Number singleton 105760 8.8%
Number repeat 7248 0.6%
Number outlier 3432 0.3%
Number too short 0 0.0%
Number of scaffolds 12
Number of bases 5799904
Average scaffold size 483325
N50 scaffold size 5479633
Largest scaffold size 5479633
Large Contig Metrics
Number of contigs 479
Number of bases 5694980
Average contig size 11889
N50 contig size 44505
Largest contig size 160534
Q40 plus bases 5686792 99.86%
All Contig Metrics
Number of contigs 1748
Number of bases 6114087
Average contig size 3498
Library Pair distance average (bp)
The script is available for download here: http://sourceforge.net/projects/newblertools/files/newblermetrics. I’d appreciate any feedback!
UPDATE Dag Ahren and Björn Canbäck made a web version of the script, accessible here: http://mbio-serv2.mbioekol.lu.se/apps/newblerMetrics.html
Posted by lexnederbragt on April 5, 2011
Sometimes you might observe very short contigs, some even having high read depth. You might see these for example when
– you choose ‘-a 1’ (or ‘-a 0’) as a setting during the assembly, forcing newbler to output all contigs of whatever length (normally the lower limit is 100 bp)
– you run an assembly using the cDNA option, where the lower limit is set to 1
– you use the 454ContigGraph.txt file, in which all contigs of whatever length are listed
The -minlen option requires by default a minimum length of 50 (20 when paired reads are part of the dataset), and the default minimum overlap between reads is 40 bases, so how are contigs so short possible at all?
There appear to be several reasons for these contigs (the information below was kindly provided by the newbler developers; disclaimer: I might have misunderstood them… ):
– microsatellites are very short repeats that the alignment loops through, causing a very short (2bp, 3bp, 4bp) alignment with ultra-high depth.
– very deep alignments (with lots of reads) can shatter: enough variation accumulates to break the alignment into pieces, some of which may be very short
– at the end of contigs, variations in the (light) signal distributions of homopolymers can also cause small contigs ‘breaking off’
Another very strange type of contig is one that mentions in the fasta header ‘numreads=1’. How can one single read become a contig? It should be labelled a singleton, right? Well, these ‘contigs’ can be explained also…
A multiple read alignment grows when reads are added to it. After such an addition, checks are run on the alignment. Addition of new reads may actually result in an alignment being broken; in some cases a part is taken out and placed in its own alignment. During the detangling phase, reads may be removed from a set of aligned reads. For these parts taken out of alignments this may mean that only a single read is left in the alignment. Newbler then keeps this read as a contig (perhaps they should remove these instead, but who am I to complain…).
A singleton read is a read that did not show any significant overlap (by default, a 40 bp window of at least 90% similarity) with any other reads. These ‘numreads=1’ contigs are not singletons as they (or part of them) actually had sufficient overlap for them to have been part of an alignment.
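If you want to see how many of these ‘numreads=1’ contigs your own assembly contains, a small sketch like the one below could do the job. It only looks at the fasta header lines and picks out the ‘numreads=’ value. Note that the exact header layout in the example (contig name, then length, then numreads) is my assumption for illustration, as are the function name and the example records.

```python
import re

# Assumed header layout, for illustration only:
#   >contig00002  length=61   numreads=1
NUMREADS = re.compile(r"numreads=(\d+)")

def single_read_contigs(fasta_lines):
    """Return the names of contigs whose header reports numreads=1."""
    names = []
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        match = NUMREADS.search(line)
        if match and int(match.group(1)) == 1:
            # The contig name is the first word after the '>' sign
            names.append(line[1:].split()[0])
    return names

# Made-up example records, not taken from a real assembly
example = [
    ">contig00001  length=2310   numreads=154",
    "ACGT",
    ">contig00002  length=61   numreads=1",
    "ACGT",
]
print(single_read_contigs(example))
```

For a real assembly you would pass in the lines of the contig fasta file instead of the example list.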
Many people ask about these strange contigs, both in the comments on this blog, and on sites such as seqanswers.com. I hope this post makes the situation around these contigs a bit less confusing…
Newbler output VI: the ‘status’ files (454TrimStatus.txt, 454ReadStatus.txt, 454PairStatus.txt) and the 454AlignmentInfo.tsv file
Posted by lexnederbragt on May 20, 2010
The files that are the topic of this post are all tables, i.e. tab separated text files. The ‘status’ files describe what happened with all the reads and the paired end halves, while the AlignmentInfo file summarizes the contig alignments.
Accno           Trimpoints Used  Used Trimmed Length  Orig Trimpoints  Orig Trimmed Length  Raw Length
ERGMJHS01CYVHW  5-78             74                   5-98             94                   100
ERGMJHS01D6IHL  5-116            112                  5-116            112                  161
ERGMJHS01DYTX5  5-127            123                  5-127            123                  173
ERGMJHS01DYDH0  5-78             74                   5-78             74                   124
ERGMJHS01ECEGM  5-256            252                  5-256            252                  271
ERGMJHS01CRQ8D  5-272            268                  5-272            268                  273
ERGMJHS01ECMVT  5-260            256                  5-260            256                  270
ERGMJHS01EZ7VU  5-41             37                   5-61             57                   62
ERGMJHS01ERDXB  5-207            203                  5-207            203                  252
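Judging from the rows above, the trimmed length is simply the end trimpoint minus the start trimpoint plus one (for example, 5-78 gives 74). The sketch below checks this for a few rows copied from the table and reports how many extra bases were trimmed between the original and the used trimpoints. The function names and the way I split the header into six columns are my own reading of the example, not something taken from the newbler documentation.

```python
def parse_trimpoints(field):
    """Split a trimpoints field such as '5-78' into (start, end)."""
    start, end = field.split("-")
    return int(start), int(end)

def trimmed_length(field):
    """Length of the kept region, counting both trimpoints."""
    start, end = parse_trimpoints(field)
    return end - start + 1

# Rows copied from the table above:
# accno, used trimpoints, used length, orig trimpoints, orig length, raw length
rows = [
    ("ERGMJHS01CYVHW", "5-78", 74, "5-98", 94, 100),
    ("ERGMJHS01EZ7VU", "5-41", 37, "5-61", 57, 62),
    ("ERGMJHS01ECEGM", "5-256", 252, "5-256", 252, 271),
]

extra_trimmed = {}
for accno, used, used_len, orig, orig_len, raw_len in rows:
    # The reported lengths agree with end - start + 1
    assert trimmed_length(used) == used_len
    assert trimmed_length(orig) == orig_len
    # Bases removed on top of the original trimming
    extra_trimmed[accno] = orig_len - used_len

print(extra_trimmed)
```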
This file describes what (trimmed) part of the read was considered for alignment. The columns describe:
Posted by lexnederbragt on April 13, 2010
The single file I’ll discuss today has in fact almost the entire assembly in it, besides the actual sequences (although even some of these are also included, see below). As explained in my first post, newbler (as many other assembly programs) builds a contig graph. Contigs are the nodes, and reads spanning between them (starting in one contig and continuing or ending in another) indicate the edges. All the information on this graph, except the actual read alignments and consensus contigs, is in the 454ContigGraph.txt file.
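To make the graph idea a bit more concrete, here is a minimal sketch of such a contig graph as a Python dictionary. It does not read the 454ContigGraph.txt file; the contig names and read counts are made up, and the sketch ignores which end of a contig the reads connect to. It only illustrates contigs as nodes and spanning reads as edges.

```python
from collections import defaultdict

def build_contig_graph(edges):
    """Build an adjacency map from (contig_a, contig_b, spanning_reads) tuples."""
    graph = defaultdict(dict)
    for a, b, n_reads in edges:
        # Store the edge in both directions so either contig can be looked up
        graph[a][b] = n_reads
        graph[b][a] = n_reads
    return graph

# Hypothetical edges: two contigs plus the number of reads spanning between them
edges = [
    ("contig00001", "contig00002", 12),
    ("contig00002", "contig00003", 7),
    ("contig00002", "contig00004", 5),
]
graph = build_contig_graph(edges)

# Contigs connected to more than two others are branch points in the graph
branching = sorted(c for c, neighbours in graph.items() if len(neighbours) > 2)
print(branching)
```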
The file is divided into several sections; in each one except the first, the lines start with a capital letter.
Section 1) Contig statistics
Posted by lexnederbragt on March 22, 2010
The files most people are after when they do an assembly must be these: the actual contig and scaffold sequences. The contigs are in the files 454AllContigs and 454LargeContigs. ‘All’ indicates by default contigs of at least 100 bp, while ‘Large’ contigs are at least 500 bp. These lower limits can be set during assembly.
The ‘fna’ files contain the sequences (bases) in fasta format (I actually do not know why this extension was chosen over ‘fasta’ or ‘fa’, which are most often used). The ‘qual’ files contain phred-like quality scores (see previous post). The contigs are in the same order between fna and qual files, and the quality scores are in the same order as the bases:
Posted by lexnederbragt on March 11, 2010
With this post, I’ll start going through the output files newbler generates. Some of these will be described in detail as they contain a lot of important information.
For today’s post, we’ll start with the 454NewblerMetrics.txt file. This file contains a lot of details on the reads used during the assembly, as well as the resulting contigs and, in the case of paired end reads, scaffolds.
The file starts with some metadata, such as the date of the assembly, where it is located, and what version of newbler was used. For this post, I used a file of an assembly generated with version 2.3 and both shotgun and paired end read files. Note that the output will be slightly different for a mapping project (to be described in a later post) than for an assembly project.
Section 1: runData
path = "/your/path/yourfile1.sff";
numberOfReads = ######, ######;
numberOfBases = #########, #########;
For each input file, the numbers mentioned are reads and bases in the file, reads and bases after trimming.
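Lines of this ‘key = value, value;’ form are easy to pull apart with a regular expression. The sketch below is one possible way to do that and to write the result as tab-delimited text. The numbers in the example are invented (the real file has your own counts where I showed ‘#’ signs), the function names are mine, and for simplicity the sketch handles a single input file only; with several files, later values would overwrite earlier ones.

```python
import re

# Matches lines such as: numberOfReads = 1000, 950;
PAIR = re.compile(r"^\s*(numberOfReads|numberOfBases)\s*=\s*(\d+)\s*,\s*(\d+)\s*;")

def parse_counts(lines):
    """Collect 'key = raw, trimmed;' lines into {key: (raw, trimmed)}."""
    counts = {}
    for line in lines:
        match = PAIR.match(line)
        if match:
            counts[match.group(1)] = (int(match.group(2)), int(match.group(3)))
    return counts

def to_tsv(counts):
    """Format the counts as tab-delimited text with a header row."""
    rows = ["metric\traw\ttrimmed"]
    for key, (raw, trimmed) in counts.items():
        rows.append(f"{key}\t{raw}\t{trimmed}")
    return "\n".join(rows)

# Invented numbers, laid out like the runData lines shown above
sample = [
    'path = "/your/path/yourfile1.sff";',
    "numberOfReads = 1000, 950;",
    "numberOfBases = 250000, 230000;",
]
print(to_tsv(parse_counts(sample)))
```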