An assembly of reads, contigs and scaffolds

A blog on all things newbler and beyond

Running newbler: de novo transcriptome assembly II: the output files

Posted by lexnederbragt on September 21, 2010

This post describes the transcriptome-specific output files, or the differences between the files for a transcriptome assembly relative to a regular assembly. For (aspects of) files not treated in this post, have a look at these previous posts.

Alternative rope splicing (source: Wikimedia commons)

1) 454NewblerMetrics.txt
The differences for this file, relative to the same file for a ‘normal’ assembly (described in this post), are metrics on the isogroups and isotigs:

isogroupMetrics
{
numberOfIsogroups     = #####;
avgContigCnt          = #.#;
largestContigCnt      = ####;
numberWithOneContig   = #####;

avgIsotigCnt          = #.#;
largestIsotigCnt      = ##;
numberWithOneIsotig   = #####;
}

Besides the number of isogroups, the average and maximum number of contigs per isogroup are listed, as well as the number of isogroups with only one contig. Below that, the average and maximum number of isotigs per isogroup are listed, as well as the number of isogroups with one isotig.
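If you want these numbers in a script, a minimal Python sketch could look like the one below. It assumes the block has exactly the flat 'name { key = value; }' shape shown above; the function name and all values in the example are made up for illustration (the real file has numbers where this post shows '#'), and blocks that contain nested blocks would need more work.

```python
import re

def parse_metrics_blocks(text):
    """Return {block_name: {key: value_as_string}} for flat 'name { key = value; }' blocks."""
    blocks = {}
    # [^{}]* only matches blocks that contain no nested braces.
    for name, body in re.findall(r"(\w+)\s*\{([^{}]*)\}", text):
        entries = re.findall(r"(\w+)\s*=\s*([^;]+);", body)
        # Values are kept as strings; convert them yourself where needed.
        blocks[name] = {key: value.strip() for key, value in entries}
    return blocks

# Hypothetical values, NOT taken from a real assembly.
example = """\
isogroupMetrics
{
numberOfIsogroups     = 100;
avgContigCnt          = 1.5;
largestContigCnt      = 12;
numberWithOneContig   = 80;

avgIsotigCnt          = 1.2;
largestIsotigCnt      = 7;
numberWithOneIsotig   = 85;
}
"""

metrics = parse_metrics_blocks(example)
print(metrics["isogroupMetrics"]["numberOfIsogroups"])
```

The same function should also pick up the isotigMetrics block, as long as it is flat like this one.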

isotigMetrics
{
numberOfIsotigs       = #####;
avgContigCnt          = #.#;
largestContigCnt      = ##;
numberWithOneConitg   = #####;

numberOfBases     = ########;
avgIsotigSize     = ###;
N50IsotigSize     = ####;
largestIsotigSize = #####;
}

The number of isotigs is followed by the average and maximum number of contigs per isotig, and the number of isotigs with one contig (note the typo ‘Conitg’…). Below that, the total length of all isotigs combined is listed, as well as the average, N50 and longest isotig length.
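For those unfamiliar with the N50: a common definition is the length L such that all sequences of length L or longer together contain at least half of the total bases. The sketch below implements that definition in Python; whether newbler computes N50IsotigSize in exactly this way (for example with or without a minimum length cutoff) is an assumption. The lengths used are the ten isotig totals of isogroup 156 from the layout example in section 3.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no lengths given")

# 'Total' column of the isogroup00156 layout example in this post.
isotig_lengths = [1385, 1343, 1154, 1112, 2792, 2561, 1813, 1134, 1582, 354]

print(sum(isotig_lengths))   # total number of bases: 15230
print(max(isotig_lengths))   # largest isotig: 2792
print(n50(isotig_lengths))   # N50 under the definition above: 1582
```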

2) A note on isogroup numbering
The isogroups are ordered such that the lowest numbers (identifiers, e.g. isogroup00001, isogroup00002, …) are the isogroups that consist only of contigs, followed by the isogroups with two or more isotigs in the middle, and ending with isogroups with only one contig.

3) 454IsotigsLayout.txt
This file represents in a schematic (‘graphical’) way how the isotigs are built up out of contigs.
As an example, let’s take isogroup 156:

>isogroup00156  numIsotigs=10  numContigs=10
___Length : 425   655   217   341   238   679   1116  149   108   137   (bp)
___Contig : 5625  5638 19956  6778  5613  6320 19957  5627  5628  8964  Total:
isotig02293       >>>>>       >>>>> >>>>>             >>>>>              1385
isotig02294       >>>>>       >>>>> >>>>>                   >>>>>        1343
isotig02295 >>>>>             >>>>> >>>>>             >>>>>              1154
isotig02296 >>>>>             >>>>> >>>>>                   >>>>>        1112
isotig02297       >>>>>       >>>>>       >>>>> <<<<<                    2792
isotig02298 >>>>>             >>>>>       >>>>> <<<<<                    2561
isotig02299       >>>>>       >>>>>       >>>>>                   <<<<<  1813
isotig02300       >>>>>       >>>>>                               <<<<<  1134
isotig02301 >>>>>             >>>>>       >>>>>                   <<<<<  1582
isotig02302             <<<<<                                     <<<<<   354

(Note that I added ‘_’ symbols to the second and third lines in order to have the columns align as they should; in the real file these are spaces, but I can’t get these to show up here…).
The first row explains that Isogroup 156 (‘gene’) contains 10 contigs forming 10 isotigs.
The table lists the contigs as columns, and the isotigs as rows.
The ‘Length’ row lists the lengths of the individual contigs.
The contig numbers are listed below (so, ‘5625’ is really contig05625 in other files).

So, isotig02293 is built up out of the following contigs:
5638, 6778, 5613, 5627, all in the ‘forward’ orientation, represented by the right-pointing ‘>’ symbol.
isotig02294 is built up out of the same contigs, with one difference: contig 5627 is replaced by 5628. Most likely, these two contigs are quite similar, but not similar enough to be collapsed into a single contig.
isotig02299 and isotig02300 are good candidates for splice variants, as they are identical except for the third contig of isotig02299 (6320), which is missing in isotig02300. Note that the last contig of these isotigs is in the reverse orientation (‘<’ symbols).
Isotig02302 is a bit special; it consists of only two contigs, and it is the only one with contig 19956 (the 217 bp contig)…
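The reasoning above (two isotigs that are identical except for one contig that is skipped in the shorter one) can be sketched in Python as follows. The contig lists are read off by hand from the layout of isogroup 156 above; the function names are made up for this illustration, orientation is ignored, and a hit is only a candidate, not evidence of real alternative splicing.

```python
from itertools import combinations

# Contig order per isotig, transcribed by hand from the isogroup00156 layout.
isotigs = {
    "isotig02293": ("5638", "6778", "5613", "5627"),
    "isotig02294": ("5638", "6778", "5613", "5628"),
    "isotig02295": ("5625", "6778", "5613", "5627"),
    "isotig02296": ("5625", "6778", "5613", "5628"),
    "isotig02297": ("5638", "6778", "6320", "19957"),
    "isotig02298": ("5625", "6778", "6320", "19957"),
    "isotig02299": ("5638", "6778", "6320", "8964"),
    "isotig02300": ("5638", "6778", "8964"),
    "isotig02301": ("5625", "6778", "6320", "8964"),
    "isotig02302": ("19956", "8964"),
}

def skips_one_contig(longer, shorter):
    """True if removing exactly one contig from 'longer' gives 'shorter'."""
    if len(longer) != len(shorter) + 1:
        return False
    return any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer)))

def skipped_contig_pairs(isotigs):
    """List (longer_isotig, shorter_isotig) pairs that differ by one skipped contig."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(sorted(isotigs.items()), 2):
        if len(a) < len(b):
            name_a, a, name_b, b = name_b, b, name_a, a
        if skips_one_contig(a, b):
            pairs.append((name_a, name_b))
    return pairs

print(skipped_contig_pairs(isotigs))  # [('isotig02299', 'isotig02300')]
```

Pairs such as isotig02293 and isotig02294, where one contig is swapped for a similar one rather than skipped, are deliberately not reported by this check.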

The isogroups listed last are the ones consisting of only 1 isotig/contig, for example:

>isogroup25158  numIsotigs=1  numContigs=1
Length : 420   (bp)
Contig : 10111 Total:
isotig32465 >>>>>   420

At the end of the file, there is a bunch of summary statistics. First, there are five ‘histograms’:
Contigs per isogroup:

NumContigsInIsogroup    NumIsogroups
1            21133
2              390
3             2302



103                1
107                1
127                1
202                1
1096                1

Isotigs per isogroup:

NumIsotigsInIsogroup    NumIsogroups
1            21494
2             2488
3              444
4              285


73                1
75                1
95                1
97                1

Contigs per isotig:

NumContigsInIsotig    NumIsotigs
1            21264
2             5754
3             2324


14               65
15               32
17                1
18               16

Length of contigs (in 100 bp bins):

ContigLength    NumContigs
0-100     4300
100-200     5014
200-300     3549


5900-6000        1
9800-9900        1
11500-11600        1

Length of isotigs (also in 100 bp bins):

IsotigLength    NumIsotigs
0-100      478
100-200      272
200-300      791
300-400     1429


9200-9300        1
9800-9900        2
11500-11600        1

Finally, at the very end, there are summary statistics on the ‘Filter status’:

Filter status:
#isotig #none   #cyclyc #ig_thresh      #it_thresh      #icl_thresh     #edge_thresh
32541   162     432     1096    1273    119     122

  • #isotig: the number of isotigs
  • #none: the number of contigs that did not make it into isotigs because the traversal of the graph stopped (which was caused by crossing one of the thresholds for traversal)
  • #cyclyc: contig graphs can be cyclic, causing the traversal to hit the same contig more than once, in which case the traversal is stopped. I guess the name is another typo (cyclic?)
  • #ig_thresh: number of contigs that were not traversed because of crossing the Isogroup Threshold
  • #it_thresh: number of contigs that were not traversed because of crossing the Isotig Threshold
  • #icl_thresh: number of contigs that were not traversed because of crossing the Isotig Contig Length Threshold
  • #edge_thresh: I actually don’t know what this one represents…

For an explanation of these thresholds, see this previous post.
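To pull the ‘Filter status’ counts into a script, something like the Python sketch below could be used. It assumes the header and the values are whitespace-separated and line up one to one, as in the example above; the function name is made up, and the column names are kept exactly as they appear in the file (including ‘cyclyc’).

```python
filter_status_text = """\
#isotig #none   #cyclyc #ig_thresh      #it_thresh      #icl_thresh     #edge_thresh
32541   162     432     1096    1273    119     122
"""

def parse_filter_status(text):
    """Map each column name (without the leading '#') to its integer count."""
    header_line, value_line = [line for line in text.splitlines() if line.strip()][:2]
    names = [name.lstrip("#") for name in header_line.split()]
    values = [int(value) for value in value_line.split()]
    return dict(zip(names, values))

status = parse_filter_status(filter_status_text)
print(status["isotig"])      # 32541
print(status["ig_thresh"])   # 1096
```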

4) 454Isotigs.txt
This file is the equivalent of the 454Scaffolds.txt file from a regular assembly (see my post on this file /2010/03/22/newbler-output-ii-contigs-and-scaffolds-sequence-files-and-the-454scaffolds-txt-file/). It follows the ‘AGP’ format.

Taking the same isogroup as example, listing the first two isotigs:

isotig02293     1       655     1       W       contig05638     1       655     +
isotig02293     656     997     2       W       contig06778     1       342     +
isotig02293     998     1236    3       W       contig05613     1       239     +
isotig02293     1237    1385    4       W       contig05627     1       149     +
isotig02294     1       655     1       W       contig05638     1       655     +
isotig02294     656     997     2       W       contig06778     1       342     +
isotig02294     998     1235    3       W       contig05613     1       238     +
isotig02294     1236    1343    4       W       contig05628     1       108     +

Columns are: isotig name, start base, end base, incremental number, ‘W’ for fragment (as opposed to ‘N’ for gap, which is not relevant for isotigs), contig name, start base for the contig (always 1), end base of the contig (identical to the contig’s length), orientation (‘+’ or ‘-’ for forward or reverse, respectively).
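Because the columns are this regular, the file is easy to turn into a per-isotig list of contigs. The Python sketch below does this for the eight lines shown above; it assumes nine whitespace-separated columns per line and only ‘W’ rows, and the function name is made up for this example.

```python
from collections import defaultdict

agp_text = """\
isotig02293     1       655     1       W       contig05638     1       655     +
isotig02293     656     997     2       W       contig06778     1       342     +
isotig02293     998     1236    3       W       contig05613     1       239     +
isotig02293     1237    1385    4       W       contig05627     1       149     +
isotig02294     1       655     1       W       contig05638     1       655     +
isotig02294     656     997     2       W       contig06778     1       342     +
isotig02294     998     1235    3       W       contig05613     1       238     +
isotig02294     1236    1343    4       W       contig05628     1       108     +
"""

def read_isotig_layout(agp_lines):
    """Return {isotig: [(contig, orientation, length_used), ...]} in file order."""
    layout = defaultdict(list)
    for line in agp_lines:
        fields = line.split()
        if len(fields) != 9 or fields[4] != "W":
            continue  # skip blank lines and anything that is not a 'W' row
        isotig, _start, _end, _part, _kind, contig, c_start, c_end, strand = fields
        layout[isotig].append((contig, strand, int(c_end) - int(c_start) + 1))
    return dict(layout)

layout = read_isotig_layout(agp_text.splitlines())
for isotig, parts in layout.items():
    # The summed contig lengths should equal the isotig length.
    print(isotig, [contig for contig, _, _ in parts], sum(n for _, _, n in parts))
```

For these two isotigs the summed lengths come out at 1385 and 1343, matching the last end coordinate of each isotig.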

5) 454Isotigs.fna, 454Isotigs.qual
These files replace the 454LargeContigs files and contain the actual isotig sequences. The fasta headers for these files look like these:

>isotig02293  gene=isogroup00156  length=1385  numContigs=4
>isotig02294  gene=isogroup00156  length=1343  numContigs=4
>isotig02295  gene=isogroup00156  length=1154  numContigs=4
>isotig02296  gene=isogroup00156  length=1112  numContigs=4

These represent four of the isotigs of isogroup 156.

6) 454AllContigs.fna, 454AllContigs.qual

>contig05638  length=655  numreads=4  gene=isogroup00156  status=isotig
>contig06778  length=342  numreads=4  gene=isogroup00156  status=isotig
>contig00487  length=208  numreads=18  gene=isogroup00001  status=ig_thresh
>contig37963  length=333  numreads=8  gene=isogroup00013  status=edge_thresh
>contig00487  length=208  numreads=18  gene=isogroup00001  status=ig_thresh
>contig35882  length=1  numreads=3  gene=isogroup00014  status=icl_thresh
>contig00498  length=999  numreads=32  gene=isogroup00039  status=cyclic
>contig20279  length=3  numreads=80  gene=isogroup00013  status=none

These files have all the contigs. Note that ‘All’ here refers to contigs of 1 bp and longer, whereas in a regular assembly this file by default contains contigs of 100 bp and larger. Also note that in the 454NewblerMetrics file (described above), the number of All contigs is again different, but when I checked, it did not look like it represents the contigs with a lower limit of 100 bp…
The fasta headers of the contigs list the usual length and number of reads, but in addition, they list to which isogroup the contig belongs, and the ‘filter status’, as described above for the 454IsotigsLayout.txt file.
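Since the header attributes are simple key=value pairs, tallying contigs per filter status takes only a few lines. The Python sketch below uses the headers shown above as input; it assumes whitespace-separated key=value attributes after the sequence name, and the function name is made up for this example.

```python
from collections import Counter

headers = [
    ">contig05638  length=655  numreads=4  gene=isogroup00156  status=isotig",
    ">contig06778  length=342  numreads=4  gene=isogroup00156  status=isotig",
    ">contig00487  length=208  numreads=18  gene=isogroup00001  status=ig_thresh",
    ">contig37963  length=333  numreads=8  gene=isogroup00013  status=edge_thresh",
    ">contig35882  length=1  numreads=3  gene=isogroup00014  status=icl_thresh",
    ">contig00498  length=999  numreads=32  gene=isogroup00039  status=cyclic",
    ">contig20279  length=3  numreads=80  gene=isogroup00013  status=none",
]

def parse_header(header):
    """Split a fasta header into its name and key=value attributes (all strings)."""
    name, *attributes = header.lstrip(">").split()
    fields = dict(attribute.split("=", 1) for attribute in attributes)
    fields["name"] = name
    return fields

records = [parse_header(header) for header in headers]
status_counts = Counter(record["status"] for record in records)
print(status_counts["isotig"])   # 2
print(records[0]["gene"])        # isogroup00156
```

The same function should work on the 454Isotigs.fna headers from section 5, for example to group isotigs by their gene (isogroup) attribute.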

7) 454Isotigs.ace and/or consed files; 454AlignmentInfo.tsv
These are as per the normal assembly output, but contain the isotigs, and those contigs that were not included in any isotig.

8) 454RefLink.txt
This file basically lists the isogroups and their corresponding isotigs (or contigs, for isogroups that consist only of contigs).

#name    product    mrnaAcc    protAcc    geneName    prodName    locusLinkId    omimId
isogroup00001        contig00568
isogroup00001        contig00604
isogroup00001        contig00621
isogroup00001        contig37347



isogroup00156        isotig02293
isogroup00156        isotig02294
isogroup00156        isotig02295
isogroup00156        isotig02296
isogroup00156        isotig02297
isogroup00156        isotig02298
isogroup00156        isotig02299
isogroup00156        isotig02300
isogroup00156        isotig02301

The header line, though, is based on the UCSC’s reflink.txt file. The manual states “This can be used as an annotation file for further mapping projects of the cDNA / transcriptome assembly products (isotigs and contigs)”, but I have no idea how to actually use this file for that purpose…
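One thing the file is at least handy for is a quick isogroup-to-member lookup. The Python sketch below builds such a lookup from a few of the lines shown above; it assumes that, as displayed here, the first non-empty field is the isogroup and the second is the isotig or contig, and that the header line starts with ‘#’. The function name is made up, and this is not the annotation use the manual has in mind.

```python
from collections import defaultdict

reflink_text = """\
#name    product    mrnaAcc    protAcc    geneName    prodName    locusLinkId    omimId
isogroup00001        contig00568
isogroup00001        contig00604
isogroup00001        contig00621
isogroup00001        contig37347
isogroup00156        isotig02293
isogroup00156        isotig02294
"""

def read_reflink(lines):
    """Return {isogroup: [member, ...]} from lines laid out as shown in this post."""
    members = defaultdict(list)
    for line in lines:
        if line.startswith("#"):
            continue  # header line
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or incomplete line
        members[fields[0]].append(fields[1])
    return dict(members)

members = read_reflink(reflink_text.splitlines())
print(members["isogroup00156"])  # ['isotig02293', 'isotig02294']
```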

That should cover the transcriptome output!

11 Responses to “Running newbler: de novo transcriptome assembly II: the output files”

  1. BJD said

    This has been a very helpful blog for me.

    Am I correct to assume that the isotigs in “454IsotigsLayout.txt” are just all combinations of *potential* contig configurations, but without long-read data, the true existence of each combination isn’t necessarily supported? Do you have any experience with long-read (e.g. sanger) data adjusting the isotig configurations to just those empirically supported combinations?

    Thanks,

    • flxlex said

      You are correct, the isotigs are constructed as potential transcripts based on the contig graph from the read alignment. I have a transcriptome assembly where both 454 and Sanger reads (from different individuals and tissues) are included, but I have not done a systematic comparison of isotigs with and without Sanger reads. In theory, adding Sanger reads should help.

  2. Vespa said

    Hi Flxlex,

    Thanks for consolidating the 454 output transcriptome output here…very helpful.

    A question: someone asked if the number of reads in an isotig is available. From what I can see, one has to get the contigs in an isotig and add up the reads in each contig, unless I’m missing something. We can get the unique depth and get a feel for the number of reads going into an isotig, right, but that won’t give us number of reads per isotig. Am I missing anything here?

    Also, how important do you think this is (number of reads/isotig) as compared to the qual scores? Or is this an apples and oranges comparison?

    • flxlex said

      The number of reads in an isotig is not available. It would be wrong to add up all reads in the contigs making up an isotig, as many reads are (by definition) present in more than one contig in the isotig. Looking at the read depth of the contigs within the isogroup (e.g. the average) is a good starting point; the depths are listed in the 454ContigGraph.txt file. Alternatively, you could map reads back to isotigs, but it would be best to select one isotig per isogroup to ensure unique mapping. There surely is a correlation between depth and quality values, and there should be fewer frame-shifts due to homopolymer errors in high-coverage isotigs, so it is not really apples and oranges.

  3. Vespa said

    Flxlex,

    Thanks much for the reply. Makes a great deal of sense to me.

  4. DEP said

    Hi Flxlex,

    Thanks so much for your posts, they are very helpful for me also.

    Though I was wondering if you would know whether the reported orientation of the contigs within an isotig (item 3) is determined by codon composition or is this simply the original direction of the majority of reads as produced by the FLX sequencer?

    I thought during library preparation the adaptors were added without maintaining transcript orientation, this would be very useful if it predicted the actual direction of the transcript.

  5. Fenix said

    Hi Flxlex,

    I found in many cases, the isotigs # reported in 454NewblerMetrics.txt and the sum of histogram reported in 454IsotigsLayout.txt are the same, BUT not in total isotigs reported in 454Isotigs.fna. Do you know why? I checked the difference occurred in each length range (bin), thus it seems not caused by the cutoff of short length isotigs.

    Thanks.

    • Fenix said

      I know what happened. The unique contigs, reported as isotigs with only one contig, are included in the 454Isotigs.fna file, but are not counted as “isotig” in 454NewblerMetrics.txt and 454IsotigsLayout.txt.
