An assembly of reads, contigs and scaffolds

A blog on all things newbler and beyond

Cross-posted: make Newbler open source

Posted by lexnederbragt on January 31, 2014

Cross posted from flxlexblog.wordpress.com.

The Newbler assembler and mapper (gsAssembler, gsMapper) was developed especially for working with reads from the Roche/454 Life Sciences sequencing technology. It is one of the best programs for this type of data, and scored well in the Assemblathon 2 competition. Newbler has been used for many large and small genome assemblies (numerous bacteria, Atlantic cod, bonobo, tomato, to name a few). Recently, Newbler has added support for using multiple sequencing technologies, making it one of the few hybrid assembly programs available. At the Advances in Genome Biology and Technology (AGBT) meeting in 2013, Roche announced having used Newbler with a hybrid 454 and Illumina dataset to improve upon the human genome.

However, the Newbler program is not open source. Luckily, researchers only need to fill out an online form to get a free copy of the software. Still, this has hampered the widespread adoption of the program. Newbler, for example, was not included in assembly evaluations like GAGE and GAGE-B. That Roche/454 does not want to make the source code for Newbler available is partly understandable from a commercial standpoint: at least one competitor technology (Life Tech/Ion Torrent) with a similar sequencing error model could benefit from access to the code. In fact, in a blog post, I showed Newbler to be superior to an open-source program when assembling Ion Torrent mate-pair data.

More worrying is that the hundreds of projects that used Newbler as part of the analysis are fundamentally irreproducible without the source code for each of the different versions. This is especially the case for projects, such as the Atlantic cod genome project, that were given access to development versions of the code, incorporating elements not available to the general community.

Last October, Roche announced it will shut down its 454 sequencing business in mid-2016. Whatever one may feel about this decision, it further strengthens the argument for Roche/454 to make the Newbler source code open source. Otherwise, Newbler is likely to disappear after the 454 shutdown too, meaning that large swathes of the literature cannot be recapitulated from the raw data. Also, long after the 454 shutdown, many researchers will have to process their 454 sequencing data, and many may still want to rely on Newbler for that purpose.

There are several other reasons why I feel the research community should be given access to the source code of Newbler. Newbler represents a very valuable contribution to the field of genome assembly and mapping. Software developers can learn from the algorithms and implementations in the Newbler code, opening up the possibility of reusing these in other programs. Also, there is the hope that developers will improve upon the program, for example by adding support for other sequencing technologies, or for assembling reads longer than the current maximum of 2 kbp.

So I hereby ask the readers of this blog for help: I have set up an online petition asking for Roche/454 to make the Newbler source code available at the latest at the time of the 454 shutdown. Please sign the petition here. Additionally, spread the word (e.g., on twitter or your own blog). Thanks in advance!

I intend to hand over the results of the petition to a Roche representative at the Advances in Genome Biology and Technology (AGBT) meeting (February 12-15, 2014).

Finally, head over to my other blog to tell me about your Newbler experiences!

(Thanks to Nick Loman for his constructive comments on an earlier version of this post)

Posted in Miscellaneous

Newbler input III: a quick fix for the new Illumina fastq header

Posted by lexnederbragt on September 1, 2011

One unfortunate drawback of working with Illumina sequences is the many changes to the format of their fastq read files. The quality scoring has been changed several times since the first Solexa reads became available. It appears they have now settled on the Sanger style, see this Wikipedia entry.

(Source: thepoolandspashoponline.com.au)

Regrettably, with their latest software upgrade (Casava 1.8), the headers (sequence identifiers) in the fastq files have changed. The change is described in the aforementioned Wikipedia entry; basically, some elements have been added, some have changed order, and there are now two parts separated by a space.

I wouldn’t have written this blogpost if this change had not been relevant for newbler: we were lucky enough to enjoy direct reading of Illumina fastq files (with newbler determining the quality scoring type) starting with newbler 2.6. newbler also matches mate-pairs (Illumina read 1 and read 2), so that these can be used as paired-ends by newbler (to build scaffolds). By the way, FASTQ files from the NCBI/EBI Sequence Read Archive are also correctly parsed for mate pairs, but here the filename is used for determining read 1 and read 2.
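To give a rough idea of the kind of header rewriting involved, here is a minimal Python sketch. It is not the fix from the full post: it simply turns a Casava 1.8 style identifier (two parts separated by a space, the second starting with the read number) into the older style ending in /1 or /2. The function names and the example identifier are hypothetical, four-line fastq records are assumed, and whether your newbler version pairs the reads based on the rewritten names is something to test yourself.

```python
def convert_header(header):
    """Rewrite a Casava 1.8 fastq header to the older '/1' or '/2' style.

    Headers without a space-separated second part are returned unchanged.
    """
    if not header.startswith("@") or " " not in header:
        return header
    name, description = header.split(" ", 1)
    # In Casava 1.8 the second part starts with the read number (1 or 2).
    read_number = description.split(":", 1)[0]
    if read_number not in ("1", "2"):
        return header
    return "{}/{}".format(name, read_number)


def convert_fastq(lines):
    """Yield fastq lines with the header (line 1 of every 4) rewritten."""
    for index, line in enumerate(lines):
        line = line.rstrip("\n")
        yield convert_header(line) if index % 4 == 0 else line


if __name__ == "__main__":
    # Hypothetical identifier in the Casava 1.8 layout
    example = "@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG"
    print(convert_header(example))
```

Only the first line of each record is touched, so quality lines that happen to start with '@' are left alone.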

Read the rest of this entry »

Posted in Newbler input | 5 Comments »

Newbler output V: the 454ContigScaffolds.txt and 454ScaffoldContigs.fna

Posted by lexnederbragt on July 12, 2011

Filling the gaps (picture from http://www.ifaonline.co.uk)

In the post on what is new in newbler version 2.6, I introduced the -scaffold option. Briefly, with this option instances (i.e. the consensus sequence) of repeats are placed in gaps. As I mentioned, setting -scaffold results in two extra files. With this post, I will explain these in detail.

Read the rest of this entry »

Posted in How it works, Newbler output | 1 Comment »

What is new in newbler 2.6

Posted by lexnederbragt on July 12, 2011

The latest version of newbler, version 2.6, has some welcome additions for input and output. As I have so far only treated de novo assembly, I will skip the updates on the gsMapper (except for mentioning that it is now able to provide a bam file using the -bam option).

Read the rest of this entry »

Posted in Using newbler | 21 Comments »

A script for converting the 454NewblerMetrics.txt file to a tab-separated file

Posted by lexnederbragt on May 9, 2011

(source: Wikimedia commons)

One of you asked in the comments: “Is there an existing way of converting the 454NewblerMetrics.txt file to a tab-delimited file?”

I have in fact written a script for that. We use it all the time in our group for newbler assemblies, and I am hereby sharing it with you. The perl script, called newblermetrics.pl, needs to be given a 454NewblerMetrics.txt file from a newbler assembly. It works both on shotgun assemblies, with or without paired end data, and on cDNA assemblies (for which it includes the isogroups and isotigs metrics in the output). It will not work on mapping projects (gsMapper/runmapping commands).

The script produces an output like this:

Input
Number of reads    975240
Number of bases    275262092
Number of reads trimmed    1195883    122.6%
Number of bases trimmed    256085747    93.0%

Consensus results
Number of reads assembled    1065078    89.1%
Number partial    14365    1.2%
Number singleton    105760    8.8%
Number repeat    7248    0.6%
Number outlier    3432    0.3%
Number too short    0    0.0%

Scaffold Metrics
Number of scaffolds    12
Number of bases    5799904
Average scaffold size    483325
N50 scaffold size    5479633
Largest scaffold size    5479633

Large Contig Metrics
Number of contigs    479
Number of bases    5694980
Average contig size    11889
N50 contig size    44505
Largest contig size    160534
Q40 plus bases    5686792    99.86%

All Contig Metrics
Number of contigs    1748
Number of bases    6114087
Average contig size    3498

Library    Pair distance average (bp)
lib_3kb.sff    2542.8
lib_8kb.sff    7601.6

The script is available for download here: http://sourceforge.net/projects/newblertools/files/newblermetrics. I’d appreciate any feedback!

UPDATE Dag Ahren and Björn Canbäck made a web version of the script, accessible here: http://mbio-serv2.mbioekol.lu.se/apps/newblerMetrics.html

Posted in Newbler output, scripts | 11 Comments »

Newbler output IV: on ultra-short and single-read contigs

Posted by lexnederbragt on April 5, 2011

Ultra-short contigs...

Sometimes you might observe very short contigs, some even having high read depth. You might see these for example when
– you choose ‘-a 1’ (or ‘-a 0’) as a setting during the assembly, forcing newbler to output all contigs of whatever length (normally the lower limit is 100 bp)
– you run an assembly using the cDNA option; here the lower limit is set to 1
– you use the 454ContigGraph.txt file, in which all contigs of whatever length are listed

By default, the -minlen option requires a minimum read length of 50 (20 when paired reads are part of the dataset), and the default minimum overlap between reads is 40 bases, so how are such short contigs possible at all?

There appear to be several reasons for these contigs (the information below was kindly provided by the newbler developers; disclaimer: I might have misunderstood them… ):

– microsatellites are very short repeats that the alignment loops through, causing a very short (2bp, 3bp, 4bp) alignment with ultra-high depth.
– very deep alignments (with lots of reads) can cause shattering, caused by accumulation of enough variation to break the alignment into pieces, some of which may be very short
– at the end of contigs, variations in the (light) signal distributions of homopolymers can also cause small contigs ‘breaking off’

Another very strange type of contig is one that mentions in the fasta header ‘numreads=1’. How can one single read become a contig? It should be labelled a singleton, right? Well, these ‘contigs’ can be explained also…
A multiple read alignment grows when reads are added to it. After such an addition, checks are run on the alignment. Addition of new reads may actually result in an alignment being broken; in some cases a part is taken out and placed in its own alignment. During the detangling phase, reads may also be removed from a set of aligned reads. For the parts taken out of alignments, this may mean that only a single read is left in the alignment. Newbler then keeps this read as a contig (perhaps they should remove these instead, but who am I to complain…).

A singleton read is a read that did not show any significant overlap (by default, a 40 bp window of at least 90% similarity) with any other reads. These ‘numreads=1’ contigs are not singletons as they (or part of them) actually had sufficient overlap for them to have been part of an alignment.
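If you want to see how many of these single-read contigs your own assembly contains, something like the following Python sketch could be used. It assumes fasta headers of the form '>contig00002  length=85   numreads=1'; only the numreads field is mentioned above, so the exact header layout, the example records and the function names are assumptions to check against your own contig fasta file.

```python
import re

# Matches key=value fields such as 'length=85' or 'numreads=1'
HEADER_FIELD = re.compile(r"(\w+)=(\S+)")


def parse_header(header):
    """Return (contig name, dict of key=value fields) from a fasta header."""
    name = header.lstrip(">").split()[0]
    fields = dict(HEADER_FIELD.findall(header))
    return name, fields


def single_read_contigs(lines):
    """Yield the names of contigs whose header reports numreads=1."""
    for line in lines:
        if line.startswith(">"):
            name, fields = parse_header(line)
            if fields.get("numreads") == "1":
                yield name


if __name__ == "__main__":
    # Hypothetical records; sequences shortened for the example
    example = [
        ">contig00001  length=1520   numreads=212",
        "ACGT",
        ">contig00002  length=85   numreads=1",
        "ACGT",
    ]
    print(list(single_read_contigs(example)))
```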

Many people ask about these strange contigs, both in the comments on this blog, and on sites such as seqanswers.com. I hope this post makes the situation around these contigs a bit less confusing…

Posted in Newbler output | 4 Comments »

What is new in newbler version 2.5.3

Posted by lexnederbragt on March 22, 2011

(source: Wikimedia commons)

Recently, newbler version 2.5.3 became available. With this post, I’ll describe the changes between this version and the previous one (2.3). As I have not yet described the gsMapper function of newbler, I only discuss changes relevant to assembly here (gsAssembler, runAssembly).

Read the rest of this entry »

Posted in Using newbler | 14 Comments »

Newbler input II: sequencing reads from other platforms

Posted by lexnederbragt on January 21, 2011

A sanger sequence read electropherogram (source: wikipedia)

Both the runMapping and runAssembly programs are able to take in reads from other platforms, at least Sanger reads and Illumina reads. As long as the reads are in fasta format, with an optional quality file, newbler accepts and uses these reads. When the fasta files contain paired end (mate pair) reads, newbler can actually be made to use the pair information.

In general, it is a good idea to clean your fasta sequences before adding them to newbler: remove vectors, linkers, low quality parts of reads, or entire low quality reads first.
Also note that, while for sff files a symbolic link is generated in the assembly or project folder (still present after the program is finished when the -nrm flag is set), fasta files are not included in this way.

Read the rest of this entry »

Posted in Newbler input | 20 Comments »

Newbler input I: the sff file

Posted by lexnederbragt on October 28, 2010

Newbler can obviously take in 454 reads, but also other read types: regular Sanger reads, any sequence in a fasta file (at most 2000 bp), and perhaps also Illumina reads.

Sff files are the standard output of the 454 sequencing machine. ‘sff’ stands for ‘standard flowgram format’. The 454 sequencing method does not determine the sequence base by base, but measures homopolymer lengths (the number of consecutive ‘A’s, ‘C’s, ‘G’s and ‘T’s in a sequence). Nucleotides are flowed over the sequencing plate in a fixed order (T-A-C-G) and a light signal is generated during nucleotide incorporation. The strength of the light signal is proportional to the number of bases built in (at least up to a certain number, around 7). As the flow order is always the same, in some flows no base can be built in, leading to a signal strength of approximately 0.
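The relationship between flows and bases can be sketched in a few lines of Python. This is a simplified illustration and not how the 454 software actually calls bases: it just rounds each signal to the nearest whole number and takes that as the homopolymer length. The signal values and the function name are made up for the example.

```python
def flowgram_to_sequence(signals, flow_order="TACG"):
    """Translate flow signal strengths into a base sequence.

    Each signal is rounded to the nearest whole number, which is taken
    as the homopolymer length for the nucleotide flowed at that position.
    """
    bases = []
    for index, signal in enumerate(signals):
        # The flow order repeats: T, A, C, G, T, A, C, G, ...
        nucleotide = flow_order[index % len(flow_order)]
        homopolymer_length = int(signal + 0.5)  # round half up
        bases.append(nucleotide * homopolymer_length)
    return "".join(bases)


if __name__ == "__main__":
    # Hypothetical signals for eight flows: T, A, C, G, T, A, C, G
    signals = [1.02, 0.04, 2.10, 0.97, 0.03, 3.05, 0.00, 1.12]
    print(flowgram_to_sequence(signals))  # TCCGAAAG
```

Flows with a signal close to 0 contribute no base, which is why a read covers far more flows than it has bases.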

The sff file contains all the bases, quality values and signal strengths, in contrast to the fna and qual files. Note that sff files can, by definition, contain reads from only one type of chemistry, i.e. either GS 20, GS FLX or GS FLX Titanium reads.

Sff files are binary files, meaning that they cannot be accessed by regular text-based tools. 454 has its own tools to manipulate sff files and extract information from them (sfffile, sffinfo), but other programs and scripts can also be used. Examples are sff_extract, flower and sff2fasta, or the biopython parser; there is nothing for bioperl yet (I have not tested any of these – use at your own discretion…). When one uses 454’s sffinfo command on an sff file without parameters, all information contained in the file is reported in text format. The remainder of this post will describe that output. Read the rest of this entry »

Posted in Newbler input | 23 Comments »

Running newbler: de novo transcriptome assembly II: the output files

Posted by lexnederbragt on September 21, 2010

This post describes the transcriptome-specific output files, or the differences between the files for a transcriptome assembly relative to a regular assembly. For (aspects of) files not treated in this post, have a look at these previous posts.

Alternative rope splicing (source: Wikimedia commons)

1) 454NewblerMetrics.txt
The differences for this file, relative to the same file for a ‘normal’ assembly (described in this post), are metrics on the isogroups and isotigs:

isogroupMetrics
{
numberOfIsogroups     = #####;
avgContigCnt          = #.#;
largestContigCnt      = ####;
numberWithOneContig   = #####;

avgIsotigCnt          = #.#;
largestIsotigCnt      = ##;
numberWithOneIsotig   = #####;
}

Besides the number of isogroups, the average and maximum number of contigs per isogroup are listed, as well as the number of isogroups with only one contig. Below that, the average and maximum number of isotigs per isogroup are listed, as well as the number of isogroups with only one isotig. Read the rest of this entry »
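A block like the isogroupMetrics one above is easy to turn into tab-separated lines. The Python sketch below is not the newblermetrics.pl script; it only assumes the 'name { key = value; }' layout shown above, handles a single level of braces, and uses invented numbers in the example.

```python
import re

# 'key = value;' lines inside a block
ASSIGNMENT = re.compile(r"^\s*(\w+)\s*=\s*(.+?);\s*$")
# A bare word on its own line, taken as the name of the next block
BLOCK_NAME = re.compile(r"^\s*(\w+)\s*$")


def metrics_to_rows(lines):
    """Yield (block name, key, value) tuples from 'name { key = value; }' text."""
    block = None    # name of the block we are currently inside, if any
    pending = None  # last bare word seen, used when a '{' follows
    for line in lines:
        stripped = line.strip()
        if stripped == "{":
            block = pending
        elif stripped == "}":
            block = None
        else:
            assignment = ASSIGNMENT.match(line)
            name = BLOCK_NAME.match(line)
            if assignment and block is not None:
                yield block, assignment.group(1), assignment.group(2)
            elif name:
                pending = name.group(1)


if __name__ == "__main__":
    # Invented values in the layout of the isogroupMetrics block
    example = """isogroupMetrics
{
numberOfIsogroups     = 12345;
avgContigCnt          = 1.8;

avgIsotigCnt          = 1.4;
}""".splitlines()
    for row in metrics_to_rows(example):
        print("\t".join(row))
```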

Posted in Newbler output | 11 Comments »