An assembly of reads, contigs and scaffolds

A blog on all things newbler and beyond

Newbler output III: the 454ContigGraph.txt file

Posted by lexnederbragt on April 13, 2010

The single file I’ll discuss today contains almost the entire assembly, apart from the actual sequences (although some of these are included too, see below). As explained in my first post, newbler (like many other assembly programs) builds a contig graph. Contigs are the nodes, and reads spanning between them (starting in one contig and continuing or ending in another) form the edges. All the information on this graph, except the actual read alignments and consensus contigs, is in the 454ContigGraph.txt file.

The file is divided into several sections. Within each section, every line starts with the same capital letter; only the first section has no such letter.

Putting together an assembly...

Section 1) Contig statistics

1 contig00001 27007 29.7
2 contig00002 34455 29.1
3 contig00003 35840 28.0
4 contig00004 32644 30.0
5 contig00005 20873 27.8

This first part consists, for each contig, of

  • the number (identifier) of the contig
  • the contig name (‘contig’ followed by the contig number, padded with leading zeros to at least 5 digits)
  • the length of the contig
  • its read depth.

‘Read depth’ is defined as the total number of included bases from all the reads aligned to generate the consensus contig sequence, divided by the contig length. Note that reads at the ends of contigs can contribute as little as 1 base (yes, I’ve seen it) to the contig. In that case, only that single base is counted towards the total. Most reads, however, count over their entire length.
This section is the only one where the contigs are named with the actual word ‘contig’. In the remainder of the file, the contig numbers are used.
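
As a minimal sketch (my own illustration, not something newbler ships), the first section can be read into Python as follows. It assumes the columns are separated by whitespace and that section 1 lines are the only ones starting with a number; `ContigStat`, `parse_contig_stats` and `mean_depth` are hypothetical names. Since depth is aligned bases divided by contig length, multiplying the two gives back the number of aligned bases, which allows a length-weighted mean depth.

```python
from typing import NamedTuple


class ContigStat(NamedTuple):
    number: int    # contig number, as used in the rest of the file
    name: str      # e.g. 'contig00001'
    length: int    # contig length in bp
    depth: float   # read depth


def parse_contig_stats(lines):
    """Collect section-1 lines; lines of all other sections start with a letter."""
    stats = []
    for line in lines:
        fields = line.split()
        if len(fields) == 4 and fields[0].isdigit():
            stats.append(ContigStat(int(fields[0]), fields[1],
                                    int(fields[2]), float(fields[3])))
    return stats


def mean_depth(stats):
    """Length-weighted mean read depth over the given contigs."""
    total_length = sum(s.length for s in stats)
    aligned_bases = sum(s.length * s.depth for s in stats)
    return aligned_bases / total_length


# The five example lines shown above
example = """\
1 contig00001 27007 29.7
2 contig00002 34455 29.1
3 contig00003 35840 28.0
4 contig00004 32644 30.0
5 contig00005 20873 27.8
"""

stats = parse_contig_stats(example.splitlines())
print(len(stats), "contigs, weighted mean depth", round(mean_depth(stats), 2))
```

On a real file you would pass the open file handle instead of `example.splitlines()`.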

Section 2) Edges (lines starting with ‘C’)

C       26      3'      221     3'      24
C       27      3'      1754    5'      23
C       27      3'      62      5'      22
C       212     5'      1924    3'      27
C       21      5'      1034    5'      21
C       28      5'      1034    5'      25
C       29      3'      1895    3'      45
C       127     5'      31      3'      32

Edges are reads that align from the end (5’ or 3’) of one contig into the end (5’ or 3’) of another. The columns represent:

  • the letter ‘C’
  • the contig number on the left end of the edge
  • 5’ or 3’ to indicate which end of the contig the left edge refers to
  • the contig number at the right end of the edge
  • 5’ or 3’ to indicate which end of the contig the right edge refers to
  • the depth of the edge.

‘Edge depth’ represents the number of reads that align to form this edge.

From the example it already becomes clear that the graph can be complicated: contig 27 has two 3’ edges. From the first section of the file, it appears that its depth is also about twice the average, indicating that contig 27 most likely is a collapsed repeat (see the post on how newbler works). The same holds for contig 1034, which has about triple the average depth, and three 5’ edges (the third one is not shown in the small excerpt).
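
To find such branching contig ends automatically, something along these lines could work. This is a sketch under the assumption that the ‘C’ lines are whitespace-separated with exactly the six columns listed above; `parse_edges` and `edges_per_end` are hypothetical helper names.

```python
from collections import defaultdict


def parse_edges(lines):
    """Return (left contig, left end, right contig, right end, depth) tuples."""
    edges = []
    for line in lines:
        f = line.split()
        if len(f) == 6 and f[0] == "C":
            edges.append((int(f[1]), f[2], int(f[3]), f[4], int(f[5])))
    return edges


def edges_per_end(edges):
    """Map each (contig, end) to the list of ends it is connected to."""
    ends = defaultdict(list)
    for left, left_end, right, right_end, depth in edges:
        ends[(left, left_end)].append((right, right_end, depth))
        # Avoid counting an edge twice when it loops back onto the same end
        if (left, left_end) != (right, right_end):
            ends[(right, right_end)].append((left, left_end, depth))
    return ends


# The example lines shown above
example = """\
C       26      3'      221     3'      24
C       27      3'      1754    5'      23
C       27      3'      62      5'      22
C       212     5'      1924    3'      27
C       21      5'      1034    5'      21
C       28      5'      1034    5'      25
C       29      3'      1895    3'      45
C       127     5'      31      3'      32
"""

ends = edges_per_end(parse_edges(example.splitlines()))
# Contig ends with more than one edge are candidates for collapsed repeats
branching = sorted(end for end, links in ends.items() if len(links) > 1)
print(branching)
```

In this excerpt only the 3’ end of contig 27 and the 5’ end of contig 1034 have more than one edge, matching what was noted above.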

Section 3) Scaffolds (lines starting with ‘S’)

S       21      249301  1048:+;gap:1209;1049:+;gap:222;1050:+;gap:1329;1051:+;gap:721;1052:+;gap:729;1053:+;gap:542;1054:+;gap:807;1055:+;gap:305;1056:+;gap:644;1057:+

The information in this section is partly overlapping with that in the 454Scaffolds.txt file (described in the previous post). The first three columns are:

  • the letter ‘S’
  • the scaffold number (this would make this scaffold00021)
  • the scaffold length

The last column describes how the scaffold is built up out of contigs and gaps:

1048:+;gap:1209;1049:+;gap:222;1050:+;gap:1329;1051:+

means that scaffold00021

  • starts with contig 1048 (contig01048), in the + (forward) orientation (i.e. 5’ to 3’)
  • followed by a gap of 1209 bp
  • followed by contig 1049
  • a 222 bp gap
  • contig 1050
  • etc…
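
The layout column is easy to take apart in code. A minimal sketch, assuming the items are separated by ‘;’ and each item is either `gap:<length>` or `<contig number>:<orientation>`, as in the example (`parse_scaffold_line` is a hypothetical name):

```python
def parse_scaffold_line(line):
    """Split an 'S' line into scaffold number, length and an ordered layout."""
    fields = line.split()
    if len(fields) != 4 or fields[0] != "S":
        raise ValueError("not an 'S' line")
    number, length = int(fields[1]), int(fields[2])
    layout = []
    for item in fields[3].split(";"):
        key, value = item.split(":")
        if key == "gap":
            layout.append(("gap", int(value)))          # gap length in bp
        else:
            layout.append(("contig", int(key), value))  # contig number, orientation
    return number, length, layout


# The example line shown above
example = ("S       21      249301  1048:+;gap:1209;1049:+;gap:222;1050:+;"
           "gap:1329;1051:+;gap:721;1052:+;gap:729;1053:+;gap:542;1054:+;"
           "gap:807;1055:+;gap:305;1056:+;gap:644;1057:+")

number, length, layout = parse_scaffold_line(example)
contigs = [item[1] for item in layout if item[0] == "contig"]
total_gap = sum(item[1] for item in layout if item[0] == "gap")
print(f"scaffold{number:05d}: {len(contigs)} contigs, {total_gap} bp in gaps")
```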

Section 4) Thru-flow information (lines starting with ‘I’)

I       34      TCTTATAAAGAAACGGTTTATTATATAAGTAGTATCTGGGAAAAGGCAGATTTTTTTTCCCAAAAGATTAAAGGGCATTGGG      15:1805-3'..207-3';14:1805-3'..1973-5'
I       35      AACTTTTCCTCCGTAAATACCGTTAATGTTTCTGGAAATTCAGTTACATTAGACACCAGTATTGGAAATGGAGCAATTGACTTTATTGGTTCAACCCTTGCTGGA       10:36-3'..36-5'
I       91      ACCACTTATTTCGA  85:93-5'..92-5'

For (very) short contigs that consist of short repeats, reads that have that repeat in them are often longer than the repeat/contig length. In this case, the reads would ‘flow through’ the short contig, starting in a contig outside it, and ending in yet another contig. This section has an entry for all contigs shorter than 256 bp, and if there are flow-through reads, some statistics on these are included. The columns are:

  • The letter ‘I’
  • The contig number
  • The contig sequence (note, this is only included if the contig has fewer than 256 columns in the alignment)
  • The ‘Thru-flow’ information

In the example for contig 34 (or contig00034):

15:1805-3'..207-3';14:1805-3'..1973-5'

This means that

  • there are 15 reads that come from the 3’ end of contig 1805, flow through contig 34, and continue at the 3’ end of contig 207, AND
  • there are 14 reads that come from the 3’ end of contig 1805, flow through contig 34, and continue at the 5’ end of contig 1973.
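
The thru-flow column can be decoded with a regular expression. The sketch below assumes the pattern `count:contig-end..contig-end` seen in the examples, with straight apostrophes for 5' and 3'. I have not checked how lines without flow-through reads look, so the code simply returns an empty list when nothing matches; `parse_thru_flow_line` is a hypothetical name.

```python
import re

# count:contigA-end..contigB-end, for example 15:1805-3'..207-3'
FLOW = re.compile(r"(\d+):(\d+)-([35]')\.\.(\d+)-([35]')")


def parse_thru_flow_line(line):
    """Return the contig number and a list of
    (reads, from contig, from end, to contig, to end) tuples."""
    fields = line.split()
    if len(fields) < 2 or fields[0] != "I":
        raise ValueError("not an 'I' line")
    contig = int(fields[1])
    flows = []
    # Search only the last column; a bare sequence contains no digits
    # and therefore never matches the pattern.
    if len(fields) > 2:
        for m in FLOW.finditer(fields[-1]):
            flows.append((int(m[1]), int(m[2]), m[3], int(m[4]), m[5]))
    return contig, flows


# Two of the example lines shown above
line_34 = ("I       34      TCTTATAAAGAAACGGTTTATTATATAAGTAGTATCTGGGAAAAGGCAGATT"
           "TTTTTTCCCAAAAGATTAAAGGGCATTGGG      15:1805-3'..207-3';14:1805-3'..1973-5'")
line_91 = "I       91      ACCACTTATTTCGA  85:93-5'..92-5'"

for line in (line_34, line_91):
    contig, flows = parse_thru_flow_line(line)
    for reads, src, src_end, dst, dst_end in flows:
        print(f"{reads} reads: contig {src} ({src_end}) -> contig {contig} "
              f"-> contig {dst} ({dst_end})")
```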

Section 5) Single-end Read flow information (lines starting with ‘F’)

F       3       1033/40/0.0     1851/57/36.1;1808/41/68.3
F       4       124/67/0.0      117/46/0.0;5/3/101.3
F       5       -       1008/31/0.0

This section contains information on where reads end (or start) that flow out of the contig in question (i.e. have their start or end in the contig in question, but do not align entirely in it). The columns are:

  • The letter ‘F’
  • The contig number
  • The flow information for reads flowing from the 5’ end of the contig
  • The flow information for reads flowing from the 3’ end of the contig

In the example for contig 4 (or contig00004):

124/67/0.0      117/46/0.0;5/3/101.3

  • For the 5’ flows: 67 reads flow from the 5’ end of contig 4 and terminate in contig 124; the average distance from the 5’ end of contig 4 to the end of contig 124 into which the reads flow is 0.0 bp. Zero bp? Yes, this means that the two contigs are right next to each other without a gap in between. Zero-base gaps between contigs are logical if you understand the contig graph principle: collapsed repeat contigs branch off on either end into single-copy contigs. People who first start mining newbler assemblies are sometimes frustrated to find contigs that seem to belong next to each other without a gap…
  • For the 3’ flows: 46 reads flow from the 3’ end of contig 4 and terminate in contig 117; again, these contigs are next to each other.
  • An additional 3 reads flow from the 3’ end of contig 4 and terminate in contig 5; the average distance from the 3′ end of contig 4 to the end of contig 5 into which the reads flow is 101.3 bp. Contig 117 is 101 bp, indicating that the reads most likely flow through this contig!

Note that the ‘minus’ for the 5’ flows of contig 5 in the excerpt above indicates there are no such flows for this contig.
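
A small sketch for reading these lines, assuming each flow is written as `contig/reads/distance`, multiple flows are separated by ‘;’, and ‘-’ means no flows (the function names are hypothetical). Flows with a distance of 0.0 then give the gapless neighbors directly.

```python
def parse_flow_field(field):
    """Parse one flow column into (target contig, reads, mean distance) tuples."""
    if field == "-":
        return []
    flows = []
    for item in field.split(";"):
        target, reads, distance = item.split("/")
        flows.append((int(target), int(reads), float(distance)))
    return flows


def parse_read_flow_line(line):
    """Return contig number, 5' flows and 3' flows for an 'F' line."""
    fields = line.split()
    if len(fields) != 4 or fields[0] != "F":
        raise ValueError("not an 'F' line")
    return int(fields[1]), parse_flow_field(fields[2]), parse_flow_field(fields[3])


# The example lines shown above
example = """\
F       3       1033/40/0.0     1851/57/36.1;1808/41/68.3
F       4       124/67/0.0      117/46/0.0;5/3/101.3
F       5       -       1008/31/0.0
"""

flows_by_contig = {}
for line in example.splitlines():
    contig, five_prime, three_prime = parse_read_flow_line(line)
    flows_by_contig[contig] = (five_prime, three_prime)

# Contigs reached at an average distance of 0.0 bp are gapless neighbors
five_prime, three_prime = flows_by_contig[4]
gapless = [target for target, reads, dist in five_prime + three_prime if dist == 0.0]
print("gapless neighbors of contig 4:", gapless)
```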

Section 6) Paired-end Read flow information (lines starting with ‘P’)

This section describes essentially the same information as the previous one, except that it deals with the paired end reads.

As you can see, this file contains a lot of information. What use is it? One immediate use is the read depth described in the first section. When you plot the distribution of read depths, you get a feel for the overall coverage (‘oversampling’) of the assembly. Also, contigs with unusually low depth could indicate contamination, those with unusually high depth collapsed repeats. In fact, read depth turns out to be correlated to the number of copies present in the genome, a fact that my colleagues and I exploited in a paper available here.
Also, the information about which contigs are gapless neighbors could come in handy.
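
As a rough illustration of the depth idea (this is not the method from the paper, just a sketch), one could divide each contig’s depth by the median depth and round the result to get a crude copy number. The depths for contigs 1 to 5 below come from the excerpt in section 1; the values for contigs 27 and 1034 are made up for this example, chosen to be about two and three times the typical depth as described earlier.

```python
from statistics import median


def copy_number_estimates(depths):
    """Map contig number -> (depth relative to the median, rounded copy number)."""
    typical = median(depths.values())
    return {contig: (depth / typical, round(depth / typical))
            for contig, depth in depths.items()}


depths = {
    1: 29.7, 2: 29.1, 3: 28.0, 4: 30.0, 5: 27.8,  # from the section 1 excerpt
    27: 58.9,    # hypothetical value, roughly twice the typical depth
    1034: 88.0,  # hypothetical value, roughly three times the typical depth
}

estimates = copy_number_estimates(depths)
for contig, (ratio, copies) in sorted(estimates.items()):
    print(f"contig {contig}: {ratio:.2f}x median depth, about {copies} copies")
```

A ratio well below 1 would be the ‘unusually low depth’ case mentioned above. The median is used rather than the mean so that a few high-depth repeats do not shift the baseline.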

In addition, I think it would be very useful if a browser for the contig graph existed. It would allow for looking visually at neighboring contigs, indicate which contigs are repeated and where these could be placed, explain gaps in scaffolds etc. I once used a very simple approach, treating contigs as ‘dots’, disregarding the 5’ and 3’ ends, contig length and depth, only indicating contig edges, and made a graph in VisANT. This program will accept a simple table describing ‘from’, ‘to’ and direction (+1 or -1, which I set to +1 for all). A nice feature of VisANT is that it allows for determining all possible shortest paths between selected nodes. I used it to check whether certain contigs could be (close) neighbors in a bacterial genome assembly.
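
The same ‘contigs as dots’ simplification can be tried without VisANT. The sketch below builds an undirected graph from the ‘C’ lines, ignoring 5’ and 3’ ends and depths exactly as described, and runs a breadth-first search. Note that it returns only one shortest path, not all of them as VisANT does; `build_adjacency` and `shortest_path` are hypothetical names.

```python
from collections import deque


def build_adjacency(edge_lines):
    """Undirected contig graph from 'C' lines, ignoring ends and edge depth."""
    adjacency = {}
    for line in edge_lines:
        f = line.split()
        if len(f) == 6 and f[0] == "C":
            a, b = int(f[1]), int(f[3])
            adjacency.setdefault(a, set()).add(b)
            adjacency.setdefault(b, set()).add(a)
    return adjacency


def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns one shortest path as a list, or None."""
    if start not in adjacency or goal not in adjacency:
        return None
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in sorted(adjacency[node]):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


# The edge lines from the section 2 example
example = """\
C       26      3'      221     3'      24
C       27      3'      1754    5'      23
C       27      3'      62      5'      22
C       212     5'      1924    3'      27
C       21      5'      1034    5'      21
C       28      5'      1034    5'      25
C       29      3'      1895    3'      45
C       127     5'      31      3'      32
"""

adjacency = build_adjacency(example.splitlines())
print(shortest_path(adjacency, 21, 28))  # connected via contig 1034
print(shortest_path(adjacency, 26, 27))  # not connected in this excerpt
```

Because the contig ends are ignored, a path found this way only shows that two contigs might be close; it does not prove that the path is consistent with the 5’ and 3’ orientations.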

I hope I was able to make some sense out of this file. As always, feel free to ask questions in the comments section!

(Images from Wikimedia commons, here and here)


33 Responses to “Newbler output III: the 454ContigGraph.txt file”

  1. Delphine said

    Thanks for this information.
    I use Newbler and there are again some mysteries for me.
    For example, how Newbler can write something as this in 454ContigGraph.txt in section1):
    20 contig00020 0 10.0

    Length of the contig 0 ??

    I don’t understand, is it a bug?

  2. Doug Senalik said

    Your blog is currently the only real Google hit to 454ContigGraph.txt, and I am learning that this file is full of important information. Thanks for taking the time to post this information!

    I used the graphviz program “neato” to do something similar to your VisANT illustration. With a sufficient pile of Perl code to do the heavy lifting, it is possible to visualize the 5′ and 3′ ends of contigs separately and draw the links, and even add a read number on the edges connecting contigs.

  3. Mike said

    Doug – Do you by chance have the source code that you are using to create those graphs? I can create the graphs manually but I am unable to parse the data automatically from the 454ContigGraph.txt file to get a script that Graphviz can use. I also run into problems when I adjust node size to reflect contig size as it appears you have done. If I adjust the node size, the graph becomes disjointed and none of the edges line up properly. (Oddly enough, I don’t have this problem when using the “Dot” function of Graphviz, only when I try to do an undirected graph with “Neato”.)

    Thanks.

  4. Tony said

    Hi Flxlex,
    Thanks for blogging this very useful information. In the last column describing how the scaffold is built out of contigs and gaps (section 3), why do the contigs have consecutive numbers and always in forward direction? This seems to be conflicting with your VisANT and Doug’s Contig Network illustrations (that is, a scaffold can be built from contigs with random numbers coming from the assembler as long as they are connected). Sorry, I must have misunderstood it. Can you explain?

    Thanks

    • flxlex said

      Newbler numbers the contigs according to where they are in scaffolds, and puts them in the ‘+’ orientation by default. You will occasionally find contig numbers ‘missing’ in scaffolds, but these are usually found as neighbors in the contig graph (i.e. potentially have an, as yet unresolved, place in a gap)

      • Tony said

        I am confused. So, the contig numbers in “S” section are NOT the contig numbers in “C” section, right? Does the contig network graph actually depict the “C” section?

      • flxlex said

        Sorry for the confusion, but the contig numbers in the ‘C’ section are the same as in the ‘S’ section. Just not in the examples I used (I took a part of the ‘C’ section in the beginning, of the ‘S’ from the middle)…

  5. dimitra said

    Dear Flxex,
    I’m a little confused with the contig statistics. For some of my contigs, i have a “length of the contig” = 1. How is this possible?
    Also, i find that most contigs generated are around 1-2kb. Is Newbler biased towards this contig size?
    Thank you for your time.

    • dsenalik said

      When I have a contig of length 1, it is telling me that I have an indel of 1 b.p., either a variant between 2 alleles, or 2 copies of a gene. I have a picture on this page: http://www.vcru.wisc.edu/simonlab/sdata/software/index.html#contignet. Sometimes there will be two short contigs, both linked to the same contigs on either end, and if it’s 1 b.p. then you have a high-confidence SNP.

      I suspect the contigs at 1-2kb is really a reflection of the average distance between common repeat motifs in the genome plus allele variants (like the 1 b.p. contigs above), i.e. related to the average length of single-copy sequence. But I am totally just making this up, anybody know for sure?

      • flxlex said

        Thanks, dsenalik. I hadn’t seen your reply before I wrote mine. Yes, SNPs could be causing the 1 bp contigs, it looks like. If there is enough coverage, then it could be like you suggest, many repeats in this size class. But my bet is on low coverage…

    • flxlex said

      For your first question, have a look at this post: https://contig.wordpress.com/2011/04/05/newbler-output-iv-on-ultra-short-and-single-read-contigs/

      The fact that you find mostly 1-2 kb contigs should not be the result of newbler being biased. Without knowing more about your project, I would say you have low coverage, or transcriptome reads…

  6. Konstantin said

    Hello,
    Thanks a lot for explaining the content of the contig graph file! It really helps me. I also use this graph after visualization. But I have some problems understanding the order and directions of these contigs. Could you explain how to find the sequences of the reads shared between two contigs, for example to avoid overlap between these contigs? Can you recommend any software for this purpose? I’d be glad of any help.

    • flxlex said

      I’m sorry, but I do not really understand your question, what kind of sequences are you trying to find? Can you try to rephrase?

      • Konstantin said

        Sorry. How can I find the names of the reads that are shared between two contigs, and get the sequences of these reads? By the way, is it possible to assemble the whole genome based only on the contig graph structure?

      • flxlex said

        The information on where the reads are located is available from the ace file, or with the latest version from the bam file. There is limited information in the 454ReadStatus file, but there only the start and end points of the read are mentioned (a read could be in other contigs in the middle portion). For your second question: in a way, the contig graph is the best newbler can do when it comes to assembling the reads. So, unless there is other information you can add, one cannot get a better assembly by only using the graph.

  7. alessa said

    Dear Flxex,

    Thanks for this information!! But I am still confused about some statistics… mainly read depth… I want to know the average number of reads used per contig. Can I get that value from the average of the read depth over all contigs, or is it just the number of reads assembled divided by the number of contigs?
    Thanks in advance!!

    • flxlex said

      For read depth, you shouldn’t use the number of reads, but the number of bases in those reads. So, average read depth could be calculated by total number of bases used for the assembly, divided by genome size, or by the total length of all the contigs (note that the total assembly size is usually shorter than the genome size).

  8. SB said

    Great blog :-)

    I am working on a couple of bacterial genomes assembled de novo with Newbler 2.5. Contigs number range from about 90-120, depending on the strain.

    My contigs graph file doesn’t have a scaffold “S” part. What would that mean? No scaffolds could be assembled at all?

    By reading the contig graph file I found a lot of logic (contigs with plasmid annotation seem to hang together) so it looks promising. Recently I read this paper “Nagarajan et al., Finishing genomes with limited resources: lessons from an ensemble of microbial genomes BMC Genomics 2010, 11:242” which suggests a script (found here: http://cbcb.umd.edu/finishing/) to assemble scaffolds and close gaps. Did anyone try this one?

    • flxlex said

      If you didn’t add paired end (better called mate pair) data to your assembly, newbler won’t generate scaffolds. I have not looked into the reference you mentioned – you could ask the community at seqanswers instead for people’s experiences.

  9. Konika Chawla said

    Hi, I am using Newbler for assembly of reads from a sff file. I use the runAssembly function and get contigs, but do not get any 454ContigGraph.txt or 454Scaffold.txt. I also want to know how to find the read coverage for this assembly; how do I calculate it? Thanks.

    • I find it surprising you do not get the 454ContigGraph.txt file. With really old versions of Newbler, you had to add the -g flag to get it, if I remember correctly. You will only get scaffolds if Newbler sees paired end reads (more correctly called ‘mate pair reads’) in your input. Average coverage is the number of bases in the reads divided by genome size (or assembly total size).

      • Konika Chawla said

        Thanks. Got the coverage. But I checked I have Newbler2.7 and I had used the command runAssembly -o test_konika GMFR293.273.RL12.sff , will try with a -g again.

      • Konika Chawla said

        Is it possible to identify the type of sequencing, example if its paired end read or not, from the .sff file?

      • No. The best way to find out about this is feeding the sff file to Newbler, and seeing what it says. If it is paired end (25% or more of the reads have the linker sequence), Newbler will indicate that the file has paired end reads. An alternative is to check how many paired reads sff_extract gives you, see http://bioinf.comav.upv.es/sff_extract/
