An assembly of reads, contigs and scaffolds

A blog on all things newbler and beyond

Newbler output IV: on ultra-short and single-read contigs

Posted by lexnederbragt on April 5, 2011

Ultra-short contigs...

Sometimes you might observe very short contigs, some even having high read depth. You might see these for example when
– you choose ‘-a 1’ (or ‘-a 0’) as a setting during the assembly, forcing newbler to output all contigs of whatever length (normally the lower limit is 100 bp)
– you run an assembly using the cDNA option; here the lower limit is set to 1
– you use the 454ContigGraph.txt file, in which all contigs of whatever length are listed

By default, the -minlen option requires reads to be at least 50 bases long (20 when paired reads are part of the dataset), and the default minimum overlap between reads is 40 bases, so how are such short contigs possible at all?

There appear to be several reasons for these contigs (the information below was kindly provided by the newbler developers; disclaimer: I might have misunderstood them… ):

– microsatellites are very short repeats that the alignment loops through, causing a very short (2bp, 3bp, 4bp) alignment with ultra-high depth.
– very deep alignments (with lots of reads) can shatter: enough variation accumulates to break the alignment into pieces, some of which may be very short
– at the end of contigs, variations in the (light) signal distributions of homopolymers can also cause small contigs ‘breaking off’

Another very strange type of contig is one that mentions ‘numreads=1’ in the fasta header. How can one single read become a contig? It should be labelled a singleton, right? Well, these ‘contigs’ can also be explained…
A multiple read alignment grows when reads are added to it. After such an addition, checks are run on the alignment. Addition of new reads may actually result in an alignment being broken; in some cases a part is taken out and placed in its own alignment. During the detangling phase, reads may also be removed from a set of aligned reads. For these parts taken out of alignments, this may mean that only a single read is left in the alignment. Newbler then keeps this read as a contig (perhaps they should remove these instead, but who am I to complain…).
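
If you want to see how many of these odd contigs an assembly contains, the fasta headers are enough. Below is a minimal Python sketch, assuming headers of the form ‘>contigNNNNN length=… numreads=…’ as they appear in newbler's contig fasta files. The function names, the example headers and the 100 bp cut-off (taken from the default lower limit mentioned above) are my own choices for illustration, not anything newbler provides.

```python
import re

# Assumed header layout, e.g. ">contig00012 length=49 numreads=1"
HEADER_FIELD = re.compile(r"(\w+)=(\S+)")


def parse_header(line):
    """Split one fasta header into the contig name and its key=value fields."""
    name, _, rest = line.lstrip(">").strip().partition(" ")
    return name, dict(HEADER_FIELD.findall(rest))


def flag_contigs(lines, min_length=100):
    """Yield (name, length, numreads, tags) for ultra-short or single-read contigs.

    min_length is a hypothetical cut-off; 100 bp mirrors the default
    lower limit for reported contigs mentioned in the post.
    """
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        name, fields = parse_header(line)
        if "length" not in fields or "numreads" not in fields:
            continue  # header does not follow the assumed layout
        length = int(fields["length"])
        numreads = int(fields["numreads"])
        tags = []
        if length < min_length:
            tags.append("ultra-short")
        if numreads == 1:
            tags.append("single-read")
        if tags:
            yield name, length, numreads, tags


# Made-up headers, for illustration only
example = [
    ">contig00001 length=2350 numreads=412",
    ">contig00002 length=4 numreads=1893",
    ">contig00003 length=212 numreads=1",
]
flagged = list(flag_contigs(example))
for name, length, numreads, tags in flagged:
    print(name, length, numreads, ",".join(tags))
```

To run this on a real assembly, pass an open file handle instead of the example list, for instance `flag_contigs(open("454AllContigs.fna"))`.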

A singleton read is a read that did not show any significant overlap (by default, a 40 bp window of at least 90% similarity) with any other reads. These ‘numreads=1’ contigs are not singletons as they (or part of them) actually had sufficient overlap for them to have been part of an alignment.
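
To make the ‘40 bp window of at least 90% similarity’ criterion concrete, here is a small Python sketch. It assumes two reads that have already been aligned to each other (equal-length strings, with ‘-’ for gaps) and simply asks whether any 40-column window reaches 90% identity. This only illustrates the criterion as described above; it is not how newbler actually finds overlaps, and the function name and parameters are hypothetical.

```python
def has_overlap_window(aligned_a, aligned_b, window=40, min_identity_pct=90):
    """Return True if some run of `window` alignment columns has
    at least `min_identity_pct` percent identical bases.

    Both inputs are assumed to be rows of a pairwise alignment,
    so they must have the same length; '-' marks a gap.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    # One flag per column: True when both reads show the same base
    matches = [a == b and a != "-" for a, b in zip(aligned_a, aligned_b)]
    if len(matches) < window:
        return False  # alignment shorter than one window

    # Integer comparison avoids floating-point rounding issues
    def enough(count):
        return count * 100 >= window * min_identity_pct

    # Sliding window: update the match count one column at a time
    count = sum(matches[:window])
    if enough(count):
        return True
    for i in range(window, len(matches)):
        count += matches[i] - matches[i - window]
        if enough(count):
            return True
    return False
```

With the default settings, a 40-column alignment may contain at most 4 mismatching columns and still count as an overlap; with 5 mismatches it drops to 87.5% and fails the test.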

Many people ask about these strange contigs, both in the comments on this blog, and on sites such as seqanswers.com. I hope this post makes the situation around these contigs a bit less confusing…


4 Responses to “Newbler output IV: on ultra-short and single-read contigs”

  1. Jordi said

    Hi again!
    I am wondering why sometimes I have had an assembly step with contigs like the following:
    >contig00004 length=161 numreads=0 gene=isogroup00001 status=isotig
    >contig00012 length=49 numreads=0 gene=isogroup00001 status=isotig
    >contig00013 length=55 numreads=0 gene=isogroup00001 status=isotig

    If numreads=1 was really strange, what about numreads=0?

    Here:
    http://seqanswers.com/forums/showthread.php?t=7802

    I have found a related question, but there was no answer.
    Just to point out some strange assembly results….
    Regards

    • flxlex said

      I’m afraid this looks like a bug – surely one cannot build a 161 bp contig without reads :-) . Perhaps a later (or earlier) version of newbler doesn’t show this?

  2. Natascha said

    Thank you for this very useful post!

    I would like to enquire, how does Newbler deal with reads that contain repetitive elements (such as microsatellites)?

    You mentioned that microsatellites could be one reason for ultra-short contigs.
    So, are they all included into the assembly, or is there a step where Newbler might mask/filter such reads out?

    Thank you in advance :)

    • Newbler will not filter them out. However, in our experience, 454 reads that hit such a sequence often end in those repeats, never being able to yield sequence at the other end. This then breaks contigs at these repeats.
