An assembly of reads, contigs and scaffolds

A blog on all things newbler and beyond

Posts Tagged ‘contig’

Newbler output V: the 454ContigScaffolds.txt and 454ScaffoldContigs.fna

Posted by lexnederbragt on July 12, 2011

Filling the gaps (picture from http://www.ifaonline.co.uk)

In the post on what is new in newbler version 2.6, I introduced the -scaffold option. Briefly, with this option instances (i.e. the consensus sequence) of repeats are placed in gaps. As I mentioned, setting -scaffold results in two extra files. With this post, I will explain these in detail.

Read the rest of this entry »

Posted in How it works, Newbler output | Tagged: , , , , , , | 1 Comment »

Newbler output IV: on ultra-short and single-read contigs

Posted by lexnederbragt on April 5, 2011

Ultra-short contigs...

Sometimes you might observe very short contigs, some even having high read depth. You might see these for example when
– you choose ‘-a 1’ (or ‘-a 0’) as a setting during the assembly, forcing newbler to output all contigs of whatever length (normally the lower limit is 100 bp)
– you run an assembly using the cDNA option, here the lower limit is set to 1
– you use the 454ContigGraph.txt file, in which all contigs of whatever length are listed

The -minlen option requires by default a minimum length of 50 (20 when paired reads are part of the dataset), and the default minimum overlap between reads is 40 bases, so how are contigs so short possible at all?

There appear to be several reasons for these contigs (the information below was kindly provided by the newbler developers; disclaimer: I might have misunderstood them… ):

– microsatellites are very short repeats that the alignment loops through, causing a very short (2bp, 3bp, 4bp) alignment with ultra-high depth.
– very deep alignments (with lots of reads) can cause shattering, caused by accumulation of enough variation to break the alignment into pieces, some of which may be very short
– at the end of contigs, variations in the (light) signal distributions of homopolymers can also cause small contigs ‘breaking off’

Another very strange type of contig is one that mentions in the fasta header ‘numreads=1’. How can one single read become a contig? It should be labelled a singleton, right? Well, these ‘contigs’ can be explained also…
A multiple read alignment grows when reads added to it. After such an addition, there are checks run on the alignment. Addition of new reads may actually result in an alignment being broken, in some cases a part is taken out and placed in its own alignment. During the detangling phase, reads may be removed from a set of aligned reads and. For these parts taken out of alignments this may mean that onlu a single read is left in the alignment. Newbler then keeps this read as a contig (perhaps they should remove these instead, but who am I to complain…).

A singleton read is a read that did not show any significant overlap (by default, a 40 bp window of at least 90% similarity) with any other reads. These ‘numreads=1’ contigs are not singletons as they (or part of them) actually had sufficient overlap for them to have been part of an alignment.

Many people ask about these strange contigs, both in the comments on this blog, and on sites such as seqanswers.com. I hope this post makes the situation around these contigs a bit less confusing…

Posted in Newbler output | Tagged: , , , , | 4 Comments »

Running newbler: de novo transcriptome assembly I

Posted by lexnederbragt on August 31, 2010

RNA (source: wikimedia.org)

Since version 2.3, newbler has a -cdna option for de novo transcriptome assembly. In this post, I’ll explain the principles and setting up the transcriptome assembly. The next post will discuss the output of a transcriptome assembly.

1) Principles of transcriptome assembly

As with other assembly projects, the first steps for transcriptome assembly are identical, and newbler builds a contig graph, see this post. Ideally, the reads coming from the transcript of a certain gene should result in a single contig. However, because of splice-variants (and other sequence particularities), there may be several contigs for each transcript, which themselves form a small contig graph. Splice-variants will result in reads that , relative to other reads have an insert (representing an additional exon in the transcript), thereby breaking the contig graph, see the figure.

Relationship between exons, contigs and isotigs

So, for transcriptome projects, there will be numerous subgraphs each potentially representing one gene. Each of these subgraphs are called an isogroup. Next, newbler will traverse the contigs in the subgraphs of each isogroup to generate transcript variants, which are called isotigs, again, see the figure. There are certain rules for this traversing step, for example, for starting the path and for ending it. Another rule, for complex graphs, is a cutoff such that no more than a maximum number of isotigs are generated per isogroup (by default set to 100 isotigs). If fully traversing the graph will result in more isotigs than this maximum, the contigs of this isogroup are reported in the output instead of the isotigs. Read the rest of this entry »

Posted in How it works, Using newbler | Tagged: , , , , , , | 4 Comments »

Newbler output VI: the ‘status’ files (454TrimStatus.txt, 454ReadStatus.txt, 454PairStatus.txt) and the 454AlignmentInfo.tsv file

Posted by lexnederbragt on May 20, 2010

The files that are the topic of this post are all tables, i.e. tab separated text files. The ‘status’ files describe what happened with all the reads and the paired end halves, while the AlignmentInfo file summarizes the contig alignments.

The fact that these files are tabular makes for easy parsing using by perl/python or, my favorite, awk.

1) 454TrimStatus.txt

Accno   Trimpoints Used Used Trimmed Length     Orig Trimpoints Orig Trimmed Length     Raw Length
ERGMJHS01CYVHW  5-78    74      5-98    94      100
ERGMJHS01D6IHL  5-116   112     5-116   112     161
ERGMJHS01DYTX5  5-127   123     5-127   123     173
ERGMJHS01DYDH0  5-78    74      5-78    74      124
ERGMJHS01ECEGM  5-256   252     5-256   252     271
ERGMJHS01CRQ8D  5-272   268     5-272   268     273
ERGMJHS01ECMVT  5-260   256     5-260   256     270
ERGMJHS01EZ7VU  5-41    37      5-61    57      62
ERGMJHS01ERDXB  5-207   203     5-207   203     252

This file describes what (trimmed) part of the read was considered for alignment. The columns describe:

Posted in Newbler output | Tagged: , , , , , , | 14 Comments »

Newbler output III: the 454ContigGraph.txt file

Posted by lexnederbragt on April 13, 2010

The single file I’ll discuss today has in fact almost the entire assembly in it, besides the actual sequences (although even some of these are also included, see below). As explained in my first post, newbler (as many other assembly programs) builds a contig graph. Contigs are the nodes, and reads spanning between them (starting in one contig and continuing or ending in another) indicate the edges. All the information on this graph, except the actual read alignments and consensus contigs, is in the 454ContigGraph.txt file.

The file is divided into several sections, for each one the lines start with a capital letter, except for the first section.

Putting together an assembly...

Section 1) Contig statistics

Read the rest of this entry »

Posted in Newbler output | Tagged: , , , , , , | 33 Comments »

Newbler output II: contigs and scaffolds sequence files, and the 454Scaffolds.txt file

Posted by lexnederbragt on March 22, 2010

The files most people are after when they do an assembly must be these: the actual contig and scaffold sequences. The contigs are in the files 454AllContigs and 454LargeContigs. ‘All’ indicates by default contigs of at least 100 bp, while ‘Large’ contigs are at least 500 bp. These lower limits can be set during assembly.
The ‘fna’ files contain the sequences (bases) in fasta format (I actually do not why this extension was chosen over ‘fasta’ or ‘fa’ which are most often used). The ‘qual’ files contain phred-like quality scores (see previous post). The contigs are in the same order between fna and qual files, and the quality scores are in the same order as the bases:

Read the rest of this entry »

Posted in Newbler output | Tagged: , , , , , , , | 19 Comments »

Newbler output I: the 454NewblerMetrics.txt file

Posted by lexnederbragt on March 11, 2010

With this post, I’ll start going through the output files newbler generates. Some of these will be described in detail as they contain a lot of important information.

For today’s post, we’ll start with the 454NewblerMetrics.txt file. This file contains a lot of details on the reads used during the assembly, as well as the resulting contigs and, in the case of paired end reads, scaffolds.

The file starts with some metadata, such as the date of the assembly, where is is located, and what version of newbler was used. For this post, I used a file of as assembly generated with version 2.3 and both shotgun and paired end read files. Note that the output will be slightly different for a mapping project (to be described in a later post) than for an assembly project.

Section 1: runData

file
{
path = "/your/path/yourfile1.sff";

numberOfReads = ######, ######;
numberOfBases = #########, #########;
}

For each input file, the numbers mentioned are reads and bases in the file, reads and bases after trimming.

Read the rest of this entry »

Posted in Newbler output | Tagged: , , , , | 36 Comments »

How newbler works

Posted by lexnederbragt on February 9, 2010

I thought to start by explaining briefly how newbler works. I’ll do this by following the output newbler generates during the assembly process. This information is displayed during assembly, and can also be found in the 454NewblerProgress.txt file. It is a good thing anyways to have a look at this file, as it sometimes displays certain warnings (see below).

This example assembly is based on a read dataset consisting of both shotgun reads, and paired end reads (for more on 454 paired end reads, have a look here).

The first thing you’ll see is a message stating that the assembly computation started, and which version of newbler you used.

Then, you’ll see messages for each input file saying Indexing XXXXXXX.sff…, and a counter. During indexing, newbler scans the input file, performs some checks and trims the reads (sometimes more than the base-calling software already did). One of the checks is for possible 3′ and 5′ primers: if a certain percentage of reads contains the same sequence on either the 3′ or 5′ end, this is mentioned. I’ve had some surprises here, such as finding out that reads I got from another group contained an adaptor sequence, which caused problems during the assembly. More on primer removal later…

If an input sff file contains paired end reads, this will be mentioned, as well as the number of reads that contained the paired end linker sequence, for example:

224024 reads, 58599257 bases, 112080 paired reads.

Next:

Setting up long overlap detection…
XXXXX reads to align
Building a tree for YYYYYY seeds…
Computing long overlap alignments…

The first phase of assembly is finding overlap between reads. Newbler splits this phase into one for long reads (this goes very fast) and shorter reads (can take quite some time). As aligning all reads against each other would take too long time, newbler (and many other programs) actually make seeds, 16-mers of each read, where each seed starts 12 bases upstream of the previous one. These seed length and step sizes can be changed if you want (I’ve never tried this, though). When two different reads have identical seeds the program tries to extend the overlap between the reads until the minimum overlap (default 40 bp) with the minimum alignment percentage default 90%) has been reached. These settings can also be changed and influence the alignment stringency, this I will come back to in a later post. Read the rest of this entry »

Posted in How it works | Tagged: , , , , , , | 57 Comments »