An assembly of reads, contigs and scaffolds

A blog on all things newbler and beyond

Posts Tagged ‘newbler’

Cross-posted: make Newbler open source

Posted by lexnederbragt on January 31, 2014

Cross posted from flxlexblog.wordpress.com.

The Newbler assembler and mapper (gsAssembler, gsMapper) was developed especially for working with the reads from the Roche/454 Life Science sequencing technology. It is one of the best programs to deal with this type of data, scoring well in the assemblathon 2 competition. Newbler has been used for many large and small genome assemblies (numerous bacteria, Atlantic cod, bonobo, tomato, to name a few). Recently, Newbler has added support for using multiple sequencing technologies, making it one of the few hybrid assembly programs available. At the Advances in Genome Biology and Technology (AGBT) in 2013, Roche announced having used the Newbler program with a hybrid 454 and Illumina dataset to improve upon the human genome.

However, the Newbler program is not open source. Luckily, researchers only need to fill out an online form to get a free copy of the software. Still, the closed source has hampered the widespread adoption of this program. Newbler, for example, was not included in assembly evaluations like GAGE and GAGE-B. That Roche/454 does not want to make the source code for Newbler available is partly understandable from a commercial standpoint: at least one competitor technology (Life Tech/Ion Torrent) with a similar sequencing error model could benefit from access to the code. In fact, in a blog post, I showed Newbler to be superior to an open-source program when assembling Ion Torrent mate-pair data.

More worrying is that the hundreds of projects that used Newbler as part of the analysis are fundamentally irreproducible without the source code for each of the different versions. This is especially the case for projects, such as the Atlantic cod genome project, that have been given access to development versions of the code, incorporating elements not available to the general community.

Last October, Roche announced it will shut down its 454 sequencing business in mid-2016. Whatever one may feel about this decision, it further strengthens the argument for Roche/454 to make the Newbler source code open source. Otherwise, Newbler is likely to disappear after the 454 shutdown too, meaning that large swathes of the literature cannot be recapitulated from the raw data. Also, long after the 454 shutdown, many researchers will have to process their 454 sequencing data, and many may still want to rely on Newbler for that purpose.

There are several other reasons why I feel the research community should be given access to the source code of Newbler. Newbler represents a very valuable contribution to the field of genome assembly and mapping. Software developers can learn from the algorithms and implementations in the Newbler code, which opens the way to reusing them in other programs. Also, there is the hope that developers will improve upon the program, for example by adding support for other sequencing technologies, or for assembling reads longer than the current maximum of 2 kbp.

So I hereby ask the readers of this blog for help: I have set up an online petition asking for Roche/454 to make the Newbler source code available at the latest at the time of the 454 shutdown. Please sign the petition here. Additionally, spread the word (e.g., on twitter or your own blog). Thanks in advance!

I intend to hand over the results of the petition to a Roche representative at the Advances in Genome Biology and Technology (AGBT) meeting (February 12-15, 2014).

Finally, head over to my other blog to tell me about your Newbler experiences!

(Thanks to Nick Loman for his constructive comments on an earlier version of this post)

Posted in Miscellaneous

Newbler input III: a quick fix for the new Illumina fastq header

Posted by lexnederbragt on September 1, 2011

One unfortunate drawback of working with Illumina sequences is the many changes to the format of their fastq read files. The quality scoring has been changed several times since the first Solexa reads became available. It appears they have now settled on the Sanger style, see this Wikipedia entry.

Regrettably, with their latest software upgrade (Casava 1.8), the headers (sequence identifiers) in the fastq files have changed. The change is described in the aforementioned Wikipedia entry; basically, some elements have been added, some have changed order, and there are now two parts separated by a space.

I wouldn’t have written this blogpost if this change had not been relevant for newbler: we were lucky enough to enjoy direct reading of Illumina fastq files (with newbler determining the quality scoring type) starting with newbler 2.6. newbler also matches mate-pairs (Illumina read 1 and read 2), so that these can be used as paired-ends by newbler (to build scaffolds). By the way, FASTQ files from the NCBI/EBI Sequence Read Archive are also correctly parsed for mate pairs, but here the filename is used for determining read 1 and read 2.
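To make the header change concrete, here is a minimal Python sketch that rewrites a Casava 1.8 style header (two parts separated by a space, with the read number at the start of the second part) into the older form where the read number follows a slash. The function names are hypothetical, the example header is the one used in the Wikipedia entry, and whether newbler needs exactly this form is an assumption on my part. The sketch also assumes plain four-line FASTQ records without wrapped sequence lines.

```python
def casava18_to_slash_header(header):
    """Rewrite '@<id> <read>:<filtered>:<control>:<index>' as '@<id>/<read>'.

    Headers that do not look like the two-part Casava 1.8 form are
    returned unchanged (apart from the trailing newline).
    """
    header = header.rstrip("\n")
    if not header.startswith("@"):
        raise ValueError("not a FASTQ header line: %r" % header)
    parts = header.split(" ", 1)
    if len(parts) != 2:
        return header  # one-part header, nothing to do
    read_id, description = parts
    read_number = description.split(":", 1)[0]
    if read_number not in ("1", "2"):
        return header  # unexpected second part, leave as-is
    return read_id + "/" + read_number


def convert_fastq(lines):
    """Yield FASTQ lines with the header of each 4-line record rewritten.

    The record position (every fourth line) is used rather than a leading
    '@', because quality lines may also start with '@'.
    """
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        yield casava18_to_slash_header(line) if i % 4 == 0 else line
```

Calling `convert_fastq` on an open file handle and writing the yielded lines to a new file would give a converted copy; the original file is left untouched.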

Read the rest of this entry »

Posted in Newbler input

Newbler output V: the 454ContigScaffolds.txt and 454ScaffoldContigs.fna

Posted by lexnederbragt on July 12, 2011

Filling the gaps (picture from http://www.ifaonline.co.uk)

In the post on what is new in newbler version 2.6, I introduced the -scaffold option. Briefly, with this option instances (i.e. the consensus sequence) of repeats are placed in gaps. As I mentioned, setting -scaffold results in two extra files. With this post, I will explain these in detail.

Read the rest of this entry »

Posted in How it works, Newbler output

Newbler output IV: on ultra-short and single-read contigs

Posted by lexnederbragt on April 5, 2011

Ultra-short contigs...

Sometimes you might observe very short contigs, some even having high read depth. You might see these for example when
– you choose ‘-a 1’ (or ‘-a 0’) as a setting during the assembly, forcing newbler to output all contigs of whatever length (normally the lower limit is 100 bp)
– you run an assembly using the cDNA option, here the lower limit is set to 1
– you use the 454ContigGraph.txt file, in which all contigs of whatever length are listed

The -minlen option requires by default a minimum read length of 50 (20 when paired reads are part of the dataset), and the default minimum overlap between reads is 40 bases, so how are such short contigs possible at all?

There appear to be several reasons for these contigs (the information below was kindly provided by the newbler developers; disclaimer: I might have misunderstood them… ):

– microsatellites are very short repeats that the alignment loops through, causing a very short (2bp, 3bp, 4bp) alignment with ultra-high depth.
– very deep alignments (with lots of reads) can cause shattering, caused by accumulation of enough variation to break the alignment into pieces, some of which may be very short
– at the end of contigs, variations in the (light) signal distributions of homopolymers can also cause small contigs ‘breaking off’

Another very strange type of contig is one that mentions in the fasta header ‘numreads=1’. How can one single read become a contig? It should be labelled a singleton, right? Well, these ‘contigs’ can be explained also…
A multiple read alignment grows when reads are added to it. After such an addition, checks are run on the alignment. Adding new reads may actually result in an alignment being broken; in some cases a part is taken out and placed in its own alignment. During the detangling phase, reads may also be removed from a set of aligned reads. For the parts taken out of alignments, this may mean that only a single read is left in the alignment. Newbler then keeps this read as a contig (perhaps they should remove these instead, but who am I to complain…).

A singleton read is a read that did not show any significant overlap (by default, a 40 bp window of at least 90% similarity) with any other reads. These ‘numreads=1’ contigs are not singletons as they (or part of them) actually had sufficient overlap for them to have been part of an alignment.
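The default overlap criterion (a 40 bp window of at least 90% similarity) can be illustrated with a small Python sketch. This is not newbler's actual implementation, which is not public; the function name and the input (two rows of a pairwise alignment of equal length, with '-' for gaps) are assumptions made for illustration only.

```python
def has_overlap_window(aligned_a, aligned_b, window=40, min_identity_percent=90):
    """Return True if some stretch of `window` alignment columns has at
    least `min_identity_percent` percent identical bases.

    Both inputs are rows of one pairwise alignment, so they must have the
    same length; a column containing a gap ('-') counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    n = len(aligned_a)
    if n < window:
        return False
    matches = [a == b and a != "-" for a, b in zip(aligned_a, aligned_b)]
    # Integer arithmetic avoids floating point surprises at the threshold.
    needed_times_100 = window * min_identity_percent
    current = sum(matches[:window])
    if current * 100 >= needed_times_100:
        return True
    # Slide the window one column at a time, updating the match count.
    for i in range(window, n):
        current += matches[i] - matches[i - window]
        if current * 100 >= needed_times_100:
            return True
    return False
```

With the default values, 36 matching columns out of 40 are enough, while 35 are not. A read for which no other read passes such a test would be a singleton in the sense described above.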

Many people ask about these strange contigs, both in the comments on this blog, and on sites such as seqanswers.com. I hope this post makes the situation around these contigs a bit less confusing…

Posted in Newbler output

Newbler input II: sequencing reads from other platforms

Posted by lexnederbragt on January 21, 2011

A sanger sequence read electropherogram (source: wikipedia)

Both the runMapping and runAssembly programs are able to take in reads from other platforms, at least Sanger reads and Illumina reads. As long as the reads are in fasta format, with an optional quality file, newbler accepts and uses these reads. When the fasta files contain paired end (mate pair) reads, newbler can actually be made to use the pair information.

In general, it is a good idea to clean your fasta sequences before adding them to newbler: remove vectors, linkers, low quality parts of reads, or entire low quality reads first.
Also note that, while for sff files a symbolic link is generated in the assembly or project folder (still present after the program is finished when the -nrm flag is set), fasta files are not included in this way.

Read the rest of this entry »

Posted in Newbler input

Newbler input I: the sff file

Posted by lexnederbragt on October 28, 2010

Newbler can obviously take in the 454 reads, but also other read types: regular Sanger reads, any sequence in a fasta file (at most 200 bp), and perhaps also Illumina reads.

Sff files are the standard output of the 454 sequencing machine. ‘sff’ stands for ‘standard flowgram format’. The 454 sequencing method does not determine the sequence base by base, but measures homopolymer length (the number of consecutive ‘A’s, ‘C’s, ‘G’s or ‘T’s in a sequence). Nucleotides are flowed over the sequencing plate in a fixed order (T-A-C-G) and a light signal is generated during nucleotide incorporation. The strength of the light signal is proportional to the number of bases built in (at least up to a certain number, around 7). As the flow order is always the same, in some flows no base can be built in, leading to a signal of strength (+/-) 0.
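The relationship between flow signals and bases can be sketched in a few lines of Python. This is a deliberately naive illustration, assuming the signals are given as floating point numbers and that simple rounding is good enough; the function name is hypothetical, and real base calling is more refined than this.

```python
def flows_to_bases(flow_values, flow_order="TACG"):
    """Translate a list of flow signal strengths into a base sequence.

    Each signal is rounded to the nearest whole number, which is taken as
    the length of the homopolymer built in during that flow. The flow
    order repeats cyclically (T, A, C, G, T, A, ...).
    """
    bases = []
    for i, signal in enumerate(flow_values):
        base = flow_order[i % len(flow_order)]
        # Round to the nearest integer; slightly negative noise counts as 0.
        homopolymer_length = max(0, int(signal + 0.5))
        bases.append(base * homopolymer_length)
    return "".join(bases)


# One T, no A, two C's, no G, no T, three A's, one C
print(flows_to_bases([1.02, 0.05, 1.96, 0.0, 0.1, 3.1, 0.98]))  # TCCAAAC
```

The flows with a signal close to 0 are the ones where no base could be built in, which is why roughly half or more of the flows in a read contribute nothing to the sequence.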

The sff file contains all the bases, quality values and signal strengths, in contrast to the fna and qual files. Note that sff files can, by definition, contain reads from only one type of chemistry, i.e. either GS 20, GS FLX or GS FLX Titanium reads.

Sff files are binary files, meaning that they cannot be accessed by regular text-based tools. 454 has its own tools to manipulate sff files and extract information from them (sfffile, sffinfo), but other programs/scripts can also be used. Examples are sff_extract, flower, sff2fasta, and the biopython parser; there is nothing for bioperl yet (I have not tested any of these – use at your own discretion…). When one uses 454’s sffinfo command on an sff file without parameters, all information contained in the file is reported in text format. The remainder of this post will describe that output. Read the rest of this entry »

Posted in Newbler input

Running newbler: de novo transcriptome assembly I

Posted by lexnederbragt on August 31, 2010

RNA (source: wikimedia.org)

Since version 2.3, newbler has a -cdna option for de novo transcriptome assembly. In this post, I’ll explain the principles and setting up the transcriptome assembly. The next post will discuss the output of a transcriptome assembly.

1) Principles of transcriptome assembly

As with other assembly projects, the first steps for transcriptome assembly are identical, and newbler builds a contig graph, see this post. Ideally, the reads coming from the transcript of a certain gene should result in a single contig. However, because of splice variants (and other sequence particularities), there may be several contigs for each transcript, which themselves form a small contig graph. Splice variants will result in reads that, relative to other reads, have an insert (representing an additional exon in the transcript), thereby breaking the contig graph, see the figure.

Relationship between exons, contigs and isotigs

So, for transcriptome projects, there will be numerous subgraphs, each potentially representing one gene. Each of these subgraphs is called an isogroup. Next, newbler will traverse the contigs in the subgraph of each isogroup to generate transcript variants, which are called isotigs; again, see the figure. There are certain rules for this traversing step, for example, for starting the path and for ending it. Another rule, for complex graphs, is a cutoff such that no more than a maximum number of isotigs is generated per isogroup (by default set to 100 isotigs). If fully traversing the graph would result in more isotigs than this maximum, the contigs of this isogroup are reported in the output instead of the isotigs. Read the rest of this entry »
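The traversal with its cutoff can be sketched in Python as follows. This is only an illustration under stated assumptions, not newbler's algorithm: the function name and data layout are hypothetical, the subgraph is assumed to be acyclic, paths start at contigs without a predecessor and end at contigs without a follower (the real start and end rules are not described here), and the default of 100 is the cutoff mentioned above.

```python
def enumerate_isotigs(edges, contigs, max_isotigs=100):
    """Enumerate all paths through the contig subgraph of one isogroup.

    `edges` maps a contig name to the list of contigs that may follow it.
    Returns ('isotigs', paths) when the number of paths stays within
    `max_isotigs`, and ('contigs', contigs) when it does not.
    """
    has_incoming = {target for targets in edges.values() for target in targets}
    starts = [c for c in contigs if c not in has_incoming]
    paths = []
    # Depth-first traversal with an explicit stack of partial paths.
    stack = [[start] for start in reversed(starts)]
    while stack:
        path = stack.pop()
        followers = edges.get(path[-1], [])
        if not followers:
            paths.append(path)  # reached an end contig: one isotig
            if len(paths) > max_isotigs:
                return "contigs", list(contigs)  # too complex, give up
            continue
        for next_contig in reversed(followers):
            stack.append(path + [next_contig])
    return "isotigs", paths


# Exon skipping: contig2 is present in one splice variant only
edges = {"contig1": ["contig2", "contig3"], "contig2": ["contig3"]}
contigs = ["contig1", "contig2", "contig3"]
print(enumerate_isotigs(edges, contigs))
```

For this small example the two paths contig1-contig2-contig3 and contig1-contig3 are found, corresponding to the two transcript variants. Stopping as soon as the cutoff is exceeded keeps the work bounded for highly branched graphs, where the number of paths can grow exponentially.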

Posted in How it works, Using newbler

Running newbler: de novo assembly

Posted by lexnederbragt on June 10, 2010

This post is about how to start up newbler for de novo assembly projects. I will describe setting up newbler using the command line. Most of the options I will mention are also available through the GUI version, but I will not describe how to use them here.

For a description of the progress that newbler reports during assembly, please check this post. The different output files are described in a series of previous posts.

1) default newbler on one or more files

runAssembly /data/sff/EYV886410.sff

This is the simplest way of running newbler: just provide it with one sff file. It will generate a folder with a name along the lines of P_yyyy_mm_dd_hh_min_sec_runAssembly and put all output in there. If you want to have control over the name of this folder, use

runAssembly -o project1 /data/sff/EYV886410.sff

-o sets the name of the folder newbler will write all output to, in this case ‘project1’

Read the rest of this entry »

Posted in Using newbler

Newbler output VI: the ‘status’ files (454TrimStatus.txt, 454ReadStatus.txt, 454PairStatus.txt) and the 454AlignmentInfo.tsv file

Posted by lexnederbragt on May 20, 2010

The files that are the topic of this post are all tables, i.e. tab separated text files. The ‘status’ files describe what happened with all the reads and the paired end halves, while the AlignmentInfo file summarizes the contig alignments.

The fact that these files are tabular makes for easy parsing using perl/python or, my favorite, awk.
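As a minimal sketch of such parsing in Python, the function below reads a 454TrimStatus.txt style table and reports the reads for which the part used in the alignment is shorter than the originally trimmed read. The function name is hypothetical, and the code assumes that the header fields are tab-separated and carry exactly the column names shown in the example for this file; the three sample rows are copied from that example.

```python
import csv
import io

SAMPLE = (
    "Accno\tTrimpoints Used\tUsed Trimmed Length\t"
    "Orig Trimpoints\tOrig Trimmed Length\tRaw Length\n"
    "ERGMJHS01CYVHW\t5-78\t74\t5-98\t94\t100\n"
    "ERGMJHS01D6IHL\t5-116\t112\t5-116\t112\t161\n"
    "ERGMJHS01EZ7VU\t5-41\t37\t5-61\t57\t62\n"
)


def reads_trimmed_further(handle):
    """Yield (accession, bases_lost) for every read whose used trimmed
    length is shorter than its original trimmed length."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        lost = int(row["Orig Trimmed Length"]) - int(row["Used Trimmed Length"])
        if lost > 0:
            yield row["Accno"], lost


for accession, lost in reads_trimmed_further(io.StringIO(SAMPLE)):
    print(accession, lost)
```

For a real file, an open file handle can be passed in place of the `io.StringIO` object. Reading the table by column name rather than by position keeps the code readable, at the price of depending on the exact header text.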

1) 454TrimStatus.txt

Accno           Trimpoints Used  Used Trimmed Length  Orig Trimpoints  Orig Trimmed Length  Raw Length
ERGMJHS01CYVHW  5-78             74                   5-98             94                   100
ERGMJHS01D6IHL  5-116            112                  5-116            112                  161
ERGMJHS01DYTX5  5-127            123                  5-127            123                  173
ERGMJHS01DYDH0  5-78             74                   5-78             74                   124
ERGMJHS01ECEGM  5-256            252                  5-256            252                  271
ERGMJHS01CRQ8D  5-272            268                  5-272            268                  273
ERGMJHS01ECMVT  5-260            256                  5-260            256                  270
ERGMJHS01EZ7VU  5-41             37                   5-61             57                   62
ERGMJHS01ERDXB  5-207            203                  5-207            203                  252

This file describes what (trimmed) part of the read was considered for alignment. The columns describe:

Posted in Newbler output

Newbler output III: the 454ContigGraph.txt file

Posted by lexnederbragt on April 13, 2010

The single file I’ll discuss today has in fact almost the entire assembly in it, besides the actual sequences (although even some of these are also included, see below). As explained in my first post, newbler (as many other assembly programs) builds a contig graph. Contigs are the nodes, and reads spanning between them (starting in one contig and continuing or ending in another) indicate the edges. All the information on this graph, except the actual read alignments and consensus contigs, is in the 454ContigGraph.txt file.

The file is divided into several sections, for each one the lines start with a capital letter, except for the first section.

Putting together an assembly...

Section 1) Contig statistics

Read the rest of this entry »

Posted in Newbler output