An assembly of reads, contigs and scaffolds

A blog on all things newbler and beyond

Newbler output VI: the ‘status’ files (454TrimStatus.txt, 454ReadStatus.txt, 454PairStatus.txt) and the 454AlignmentInfo.tsv file

Posted by lexnederbragt on May 20, 2010

The files that are the topic of this post are all tables, i.e. tab separated text files. The ‘status’ files describe what happened with all the reads and the paired end halves, while the AlignmentInfo file summarizes the contig alignments.

The fact that these files are tabular makes them easy to parse with perl/python or, my favorite, awk.

1) 454TrimStatus.txt

Accno   Trimpoints Used Used Trimmed Length     Orig Trimpoints Orig Trimmed Length     Raw Length
ERGMJHS01CYVHW  5-78    74      5-98    94      100
ERGMJHS01D6IHL  5-116   112     5-116   112     161
ERGMJHS01DYTX5  5-127   123     5-127   123     173
ERGMJHS01DYDH0  5-78    74      5-78    74      124
ERGMJHS01ECEGM  5-256   252     5-256   252     271
ERGMJHS01CRQ8D  5-272   268     5-272   268     273
ERGMJHS01ECMVT  5-260   256     5-260   256     270
ERGMJHS01EZ7VU  5-41    37      5-61    57      62
ERGMJHS01ERDXB  5-207   203     5-207   203     252

This file describes what (trimmed) part of the read was considered for alignment. The columns describe:

  • Accno: the unique read ID, where the first 7 characters describe the unique run ID, followed by the lane number, followed by the encoded x and y coordinates of the read on the picotiterplate.
  • Trimpoints Used: the start and end position of the part of the read newbler used. Most of the time, the start will be position 5, as the first four bases of every read comprise the key sequence that identifies the read as a sample read (as opposed to control reads that have different key sequences). Also, in contrast to traditional Sanger reads, read quality is usually high from the very first bases read after the sequencing primer. When MIDs (454’s Multiplex IDentifiers) or other tags/barcodes have been used during library generation, and the reads were split according to the tag (which removes the tag from the read), the starting position will be higher accordingly.
  • Used Trimmed Length: the length of the part of the read newbler used
  • Orig Trimpoints: the start and end position of the trimmed read as it was given to newbler. These positions are the result of the trimming steps of the signal processing software (thanks to Steven Sullivan for pointing out that the original trimpoints come from the signal processing steps, not image processing…)
  • Orig Trimmed Length: the corresponding original trimmed length
  • Raw Length: the full length of the read, before any trimming by the signal processing software

Comparing the Used Trimmed Length with the Orig Trimmed Length shows that for some reads, newbler trims even further than the signal processing software. Also, the usable part of a read can get shorter when the ‘trimming database’ option (-vt) was used during assembly, for example to remove vector/adaptor/primer sequences.
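To make the comparison between the two trimmed lengths concrete, here is a minimal Python sketch that lists the reads newbler trimmed further than the signal processing software did. It assumes the file is tab-separated with one header line and the six columns shown above; the function name further_trimmed is made up for this illustration and is not part of newbler.

```python
import csv
import io


def further_trimmed(handle):
    """Yield (accno, used_length, orig_length) for reads trimmed further by newbler.

    Assumes the six tab-separated columns described in the post:
    Accno, Trimpoints Used, Used Trimmed Length,
    Orig Trimpoints, Orig Trimmed Length, Raw Length.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header line
    for row in reader:
        if len(row) < 6:
            continue  # skip blank or incomplete lines
        accno, used_length, orig_length = row[0], int(row[2]), int(row[4])
        if used_length < orig_length:
            yield accno, used_length, orig_length


# Three rows copied from the example above, joined with tabs.
example = (
    "Accno\tTrimpoints Used\tUsed Trimmed Length\t"
    "Orig Trimpoints\tOrig Trimmed Length\tRaw Length\n"
    "ERGMJHS01CYVHW\t5-78\t74\t5-98\t94\t100\n"
    "ERGMJHS01D6IHL\t5-116\t112\t5-116\t112\t161\n"
    "ERGMJHS01EZ7VU\t5-41\t37\t5-61\t57\t62\n"
)

for accno, used_length, orig_length in further_trimmed(io.StringIO(example)):
    print(accno, "lost", orig_length - used_length, "more bases")
```

For a real assembly, pass an open file handle for 454TrimStatus.txt instead of the io.StringIO example.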

Another section of the same file:

FQL5QBG02GX6EQ_left     5-171   167     5-296   292     299
FQL5QBG02GX6EQ_right    217-296 80      5-296   292     299
FQL5QBG02GUPVF  255-255 1       5-255   251     265
FQL5QBG02IFXSU_left     5-173   169     5-305   301     308
FQL5QBG02IFXSU_right    219-305 87      5-305   301     308
FQL5QBG02GXQUO  29-268  240     5-268   264     268
FQL5QBG02JS960  5-270   266     5-275   271     304
FQL5QBG02H0VJ7_left     5-145   141     5-238   234     259
FQL5QBG02H0VJ7_right    190-238 49      5-238   234     259
FQL5QBG02HASXU  62-304  243     5-304   300     313

Here, some of the reads have ‘_left’ or ‘_right’ added at the end of the read ID (Accno). This indicates that the read was a paired end read (the linker sequence was detected in the read), and for this file, these reads get split into their constituent left and right halves. Note that, for example, for read FQL5QBG02GX6EQ, the position of the linker sequence can be determined from the trimpoints: from position 172 (following the last position of the left part) to 216 (just before the starting position of the right part). Also note that some reads of the same run are not paired end reads. These reads either lack the linker altogether (a result of the paired end library generation procedure), or have too few bases (fewer than 20) on one side of the linker to give two mappable read halves. These reads are used as normal shotgun reads.
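The linker position worked out above for FQL5QBG02GX6EQ can be derived for every split read. The sketch below is one way to do that in Python; it assumes you have already extracted (Accno, Trimpoints Used) pairs from the file, and the function names linker_span and linker_spans are hypothetical. Strictly speaking it reports the span between the two halves, which I take to be the linker, although newbler may trim a few extra bases on either side.

```python
def linker_span(left_trimpoints, right_trimpoints):
    """Return (first, last) read position lying between the two pair halves."""
    left_end = int(left_trimpoints.split("-")[1])
    right_start = int(right_trimpoints.split("-")[0])
    return left_end + 1, right_start - 1


def linker_spans(rows):
    """Map each paired end read ID to the span between its halves.

    rows is an iterable of (accno, trimpoints_used) tuples; reads without
    a '_left'/'_right' suffix (shotgun reads) are ignored.
    """
    halves = {}
    for accno, trimpoints in rows:
        for suffix in ("_left", "_right"):
            if accno.endswith(suffix):
                read_id = accno[: -len(suffix)]
                halves.setdefault(read_id, {})[suffix] = trimpoints
    return {
        read_id: linker_span(pair["_left"], pair["_right"])
        for read_id, pair in halves.items()
        if len(pair) == 2  # only reads for which both halves are present
    }


# Rows taken from the second example section above.
rows = [
    ("FQL5QBG02GX6EQ_left", "5-171"),
    ("FQL5QBG02GX6EQ_right", "217-296"),
    ("FQL5QBG02GUPVF", "255-255"),
    ("FQL5QBG02IFXSU_left", "5-173"),
    ("FQL5QBG02IFXSU_right", "219-305"),
]
spans = linker_spans(rows)
print(spans["FQL5QBG02GX6EQ"])  # (172, 216), the positions given in the text
```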

2) 454ReadStatus.txt

Accno   Read Status     5' Contig       5' Position     5' Strand       3' Contig       3' Position     3' Strand
ERGMJHS01CYVHW  Assembled       contig00011     610     +       contig00011     685     -
ERGMJHS01CJOXV  PartiallyAssembled      contig00115     8069    -       contig00115     7943    +
ERGMJHS01DYDH0  Singleton
ERGMJHS01EZ7VU  Repeat
ERGMJHS01A8MP3  Outlier
FQL5QBG02GDUSS_left     Assembled       contig00106     3130    +       contig00106     3242    -
FQL5QBG02GDUSS_right    Assembled       contig00106     5787    -       contig00106     5759    +

This file describes where reads ended up after assembly was complete. For paired end reads, the ‘fate’ of each half is reported on a separate line. Columns are:

  • Accno: the unique read ID
  • Read Status: this can be
    – Assembled: the read was placed in one or more contigs
    – PartiallyAssembled: only part of the read was used for making contigs
    – Singleton: there was no (significant) overlap between this read and all the others
    – Repeat: the read was most likely derived from a repeated part of the genome. More technically: more than 70% of a read’s seeds (see this post) hit to more than 70 other reads.
    – Outlier: a problematic read, e.g. a chimeric read
    – TooShort: the trimmed portion of the read was below the length threshold. This minimum can be set with the -minlen flag during assembly. When it is not set, and no paired end reads are included, it is 50 bases; for an assembly with paired ends, it is 20 bases (if I’m not mistaken).
  • 5′ Contig, 5′ Position, 5′ Strand: the contig and the position in it where the 5’ end of the read’s alignment begins, and the orientation of the read relative to the contig (‘+’ or ‘-‘ for forward and reverse strand, respectively)
  • 3′ Contig, 3′ Position, 3′ Strand: similar for the 3’ end of the read’s alignment

Note that only the start and end of each read’s alignment are shown. Due to the way newbler builds contigs, the middle of a read could be aligned within one or even several other contigs. It follows, then, that this file cannot be used for determining all the reads that were used to build a contig, or all the contigs that a read is a part of.
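As an illustration of what can be pulled out of this file, the Python sketch below tallies the Read Status column and collects the reads whose 5’ and 3’ ends are reported in different contigs. It assumes a tab-separated file with one header line and the eight columns listed above; summarize_read_status is a name made up for this example. In line with the caveat above, it only sees the ends of each alignment.

```python
from collections import Counter
import io


def summarize_read_status(handle):
    """Return (status_counts, split_reads) for a 454ReadStatus.txt-like file.

    split_reads lists reads whose 5' and 3' alignment ends are reported
    in different contigs. Only alignment ends are visible in this file.
    """
    counts = Counter()
    split_reads = []
    next(handle, None)  # skip the header line
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank lines
        accno, status = fields[0], fields[1]
        counts[status] += 1
        # Contig columns are only filled in for (partially) assembled reads.
        if len(fields) >= 8 and fields[2] and fields[5] and fields[2] != fields[5]:
            split_reads.append(accno)
    return counts, split_reads


# Rows copied from the example above, joined with tabs.
example = "\n".join([
    "Accno\tRead Status\t5' Contig\t5' Position\t5' Strand"
    "\t3' Contig\t3' Position\t3' Strand",
    "ERGMJHS01CYVHW\tAssembled\tcontig00011\t610\t+\tcontig00011\t685\t-",
    "ERGMJHS01CJOXV\tPartiallyAssembled\tcontig00115\t8069\t-\tcontig00115\t7943\t+",
    "ERGMJHS01DYDH0\tSingleton",
    "ERGMJHS01EZ7VU\tRepeat",
]) + "\n"

counts, split_reads = summarize_read_status(io.StringIO(example))
for status, number in counts.most_common():
    print(status, number)
```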

3) 454PairStatus.txt

Template        Status  Distance        Left Contig     Left Pos        Left Dir        Right Contig    Right Pos       Right Dir       Left Distance   RightDistance
FQL5QBG02GDUSS  SameContig      2657    contig00106     3130    +       contig00106     5787    -
FQL5QBG02GRUHY  Link    1366    contig00208     267     -       contig00207     3298    +       267     1099
FQL5QBG02HRDSS  OneUnmapped     -       Unmapped                        contig00017     10630   -
FQL5QBG02FS0NM  BothUnmapped    -       Unmapped                        Unmapped
FQL5QBG02IIB8R  MultiplyMapped  -       Repeat                  contig00173     207     -
FQL5QBG02IJDOE  FalsePair       -       contig00015     72252   +       contig01166     7528    -

This file describes for each paired end read, how it ended up in the assembly.  Columns are:

  • Template: the read ID
  • Status: this can be:
    – SameContig: both halves of the paired end read mapped to (or, for long enough halves, were assembled into) the same contig with a consistent orientation (i.e. the halves ‘point towards each other’ as paired end halves should). These reads have been used to determine the library insert size.
    – Link: the reads mapped to different contigs, close enough to the ends of these contigs so that they could be used to link the contigs together into a scaffold.
    – OneUnmapped: only one of the halves was mapped, the other not
    – BothUnmapped: neither the left nor the right half was mapped
    – MultiplyMapped: one or both of the halves mapped to multiple contigs (repeated reads)
    – FalsePair: both halves were mapped, but either to the same contig with incorrect orientation, or with a distance between the halves that was outside of the accepted range for the library.
    So, of all these status categories, only the ones marked as ‘Link’ were actually used for scaffolding…
  • Distance:
    – for reads that map to the same contig: the distance between the halves
    – for reads that Link contigs into scaffolds: the sum of the distances from the position of each half to the end of the contig. So, the total distance between the halves for these pairs would be the distance mentioned in the 454PairStatus.txt file, plus the gap distance between the contigs. This total should then be consistent with the paired end library insert size.
  • Left Contig, Left Pos, Left Dir: the contig ID, position (of the 5’ end) and orientation (‘+’ or ‘-‘ for forward and reverse strand, respectively) of the mapped left half. Left Contig can also be marked as ‘Unmapped‘ or ‘Repeat’
  • Right Contig, Right Pos, Right Dir: similar for the right half. Note that ‘position’ here refers to the position of the 3’ end of the right half.
  • Left Distance: for reads that ‘Link’ contigs only: the distance from the 5’ end of the left half, to the end of the contig
  • Right Distance: for reads that ‘Link’ contigs only: the distance from the 3’ end of the right half, to the end of the contig

From this, it follows logically that for reads marked as ‘Link’, the sum of the Left and Right Distance columns is the same as the number listed in the Distance column (column 3). In the example above, 267 + 1099 = 1366.
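The relation between the three distance columns is easy to check programmatically. The following Python sketch does so for every ‘Link’ pair; it assumes a tab-separated file with one header line and addresses columns by position (Distance is the 3rd, Left Distance the 10th and Right Distance the 11th) rather than by header name. The function name link_distance_check is hypothetical.

```python
import csv
import io


def link_distance_check(handle):
    """Yield (template, distance, left + right) for every 'Link' pair.

    Columns are addressed by position: Status is the 2nd column, Distance
    the 3rd, Left Distance the 10th and Right Distance the 11th.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header line
    for row in reader:
        if len(row) >= 11 and row[1] == "Link":
            yield row[0], int(row[2]), int(row[9]) + int(row[10])


# Two rows copied from the example above, joined with tabs.
example = (
    "Template\tStatus\tDistance\tLeft Contig\tLeft Pos\tLeft Dir"
    "\tRight Contig\tRight Pos\tRight Dir\tLeft Distance\tRightDistance\n"
    "FQL5QBG02GDUSS\tSameContig\t2657\tcontig00106\t3130\t+"
    "\tcontig00106\t5787\t-\n"
    "FQL5QBG02GRUHY\tLink\t1366\tcontig00208\t267\t-"
    "\tcontig00207\t3298\t+\t267\t1099\n"
)

for template, distance, summed in link_distance_check(io.StringIO(example)):
    print(template, distance, summed, "OK" if distance == summed else "MISMATCH")
```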

For pair halves marked as ‘Repeat’, the mapping information is not reported in this file. It is possible to obtain the mapping results by adding the -pair or -pairt flags during assembly. This will result in the 454TagPairAlign.txt file, which describes all alignments of pair halves shorter than 50 bases (these are not assembled, but mapped to contigs afterwards, see my first post). The file can either report all alignments (-pair), or a tabulated summary (-pairt).

4) 454AlignmentInfo.tsv

Position        Consensus       Quality Score   Unique Depth    Align Depth     Signal  StdDeviation
>contig00001    1
1       C       64      24      29      0.99    0.08
2       T       64      24      29      0.94    0.10
3       C       64      24      29      0.91    0.07
4       A       64      24      29      1.93    0.10
5       A       64      24      29      1.93    0.10
6       T       64      24      29      1.03    0.08
7       A       64      23      28      0.95    0.09
8       T       64      23      28      1.93    0.08
9       T       64      23      28      1.93    0.08
10      A       64      22      27      0.99    0.08

This file gives a consensus alignment overview for each position in each contig. Normally, this file is only present in the output when the project contains fewer than 4 million reads, and less than 40 Mb total assembled contig length. For larger assemblies, adding -info to the command line will output this file.

The information for each contig starts with a line giving the contig ID, e.g. >contig00001. The number that follows is always ‘1’ for assemblies (but can be different for mapping projects, perhaps the subject of a future post…)
Columns are:

  • Position: position in the contig
  • Consensus: consensus contig nucleotide (base) at this position
  • Quality Score: consensus contig quality score at this position
  • Unique Depth: the number of reads that align to (cover) the position, restricted to unique reads only (a significant proportion of 454 reads are duplicates as a result of two beads being present in the same microreactor during emulsion PCR).
  • Align Depth: the number of all reads that align to the position (including duplicates)
  • Signal, StdDeviation: the average flow signal and the corresponding standard deviation for the flows at that position. Note that for stretches of identical bases, these numbers are identical (as 454 sequencing basically reads homopolymer lengths), e.g. see positions 4 and 5.
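A typical use of this file is to summarize coverage per contig. The Python sketch below computes the mean Unique Depth and Align Depth for each contig; it assumes a tab-separated layout in which every contig block starts with a ‘>’ line, as described above, and the function name mean_depths is made up for this example.

```python
import io


def mean_depths(handle):
    """Return {contig: (mean unique depth, mean align depth)}.

    Each contig block starts with a '>contigID' line; data lines start
    with the position and carry Unique Depth and Align Depth in the
    4th and 5th columns.
    """
    totals = {}  # contig -> [sum of unique depth, sum of align depth, positions]
    contig = None
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if fields[0].startswith(">"):
            contig = fields[0][1:]
            totals[contig] = [0, 0, 0]
        elif contig is not None and len(fields) >= 5 and fields[0].isdigit():
            entry = totals[contig]
            entry[0] += int(fields[3])
            entry[1] += int(fields[4])
            entry[2] += 1
    return {
        name: (unique / n, align / n)
        for name, (unique, align, n) in totals.items()
        if n > 0
    }


# The ten positions shown above, joined with tabs.
example = "\n".join([
    "Position\tConsensus\tQuality Score\tUnique Depth\tAlign Depth\tSignal\tStdDeviation",
    ">contig00001\t1",
    "1\tC\t64\t24\t29\t0.99\t0.08",
    "2\tT\t64\t24\t29\t0.94\t0.10",
    "3\tC\t64\t24\t29\t0.91\t0.07",
    "4\tA\t64\t24\t29\t1.93\t0.10",
    "5\tA\t64\t24\t29\t1.93\t0.10",
    "6\tT\t64\t24\t29\t1.03\t0.08",
    "7\tA\t64\t23\t28\t0.95\t0.09",
    "8\tT\t64\t23\t28\t1.93\t0.08",
    "9\tT\t64\t23\t28\t1.93\t0.08",
    "10\tA\t64\t22\t27\t0.99\t0.08",
]) + "\n"

for name, (unique, align) in mean_depths(io.StringIO(example)).items():
    print(name, unique, align)
```

The difference between the two means gives a rough idea of how much of the coverage comes from duplicate reads.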

In closing, with these last four posts, I have described the most important output files, and the ones that usually are present by default. With a little programming skill, one should be able to distill all necessary information from a newbler assembly using these files.

14 Responses to “Newbler output VI: the ‘status’ files (454TrimStatus.txt, 454ReadStatus.txt, 454PairStatus.txt) and the 454AlignmentInfo.tsv file”

  1. Steven Sullivan said

    Your Newbler Output series of articles goes from “III” directly to “VI” — a typo, I think, unless there is special content behind a paywall? ;>

  2. Steven Sullivan said

    re: 454ReadStatus.txt
    “Note that only the starting and end of the each read’s alignment are shown. Due to the way newbler builds contigs, the middle of a read could be aligned within one or even several other contigs. It follows then, that this file can not be used for determining all the reads that were used to build a contig, or all the contigs that a read is a part of.”

    What if any file *can* be used for those purposes, then?

    • flxlex said

      The 454Contigs.ace file contains the complete alignments and could be used for this. If it is not part of the output, you can get it with the -ace flag. Check the consed documentation for the file format.

      • Sam Hunter said

        Using the “-rip” switch is supposed to disable splitting reads over multiple contigs yet I still see a number of “PartiallyAssembled” reads in 454ReadStatus.txt. Does this mean that part of the read was thrown out? Shouldn’t 454ReadStatus.txt be usable for determining all of the reads that were used to build a contig with this switch enabled?

      • flxlex said

        PartiallyAssembled means, as you say, that part of the read was not aligned and discarded. As far as I know, the -rip option has no effect on this. I am a bit uncertain whether the 454ReadStatus file is adjusted with -rip enabled. In principle, as you write, you should see start and finish positions mentioned for each read in the same contig, but I haven’t checked that this is the case myself…

  3. seb said

    Need details when you say:
    “Trimpoints Used: the start and end position of the part of the read newbler used.”

    Newbler used this range for:
    – the assembly
    – clipping the read
    ???

    In my case I have strange results :

    GWFSY8M04EHA5K 226-226 1 1-226 226 226
    GWFSY8M04EWR11 181-181 1 1-181 181 181
    GWFSY8M04E0JPI 382-382 1 1-391 391 391
    GWFSY8M04EPWF3 344-344 1 1-344 344 344
    GWFSY8M04ERKHD 225-225 1 1-225 225 225
    GWFSY8M04EXOCA 250-250 1 1-250 250 250
    GWFSY8M04D67WW 190-190 1 1-190 190 190

    • flxlex said

      Newbler used the range for clipping the read, and aligning the remainder (trimmed range, second column). However, what you show is really strange. I checked one of my own assemblies, and I see some reads with trimlength==1 (third column); some of these have the last base as the only base, but none have the pattern you show. When I checked the reads-of-length-1bp-after-trimming in the 454ReadStatus file, they do not appear there, so they are not included in the alignment. But the TooShort metric in the 454NewblerMetrics.txt file does not list any too short reads, either… A bug perhaps?

      By the way, 454 reads usually have position 5 as first trimposition (because the first four are the universal key sequence present on all reads), while yours start at 1 (original trimpoints, column 4). Any idea why?

      • These reads were trimmed by newbler for whatever reason. To ensure each entry in SFF file has at least one nucleotide after trimming, newbler/sfffile adjusts trimpoints NOT to overlap. So what you see above is that those reads were effectively discarded. But, they got discarded as TooShort reads because 1nt blah” but followed with an empty line instead of the sequence.

      • Thanks, Martin!

  4. you wrote:
    “Orig Trimpoints: the start and end part of the trimmed read as it was given to newbler. These positions are the result of the image processing software trimming steps”

    I think this is reporting the result of signal processing, not just image processing. (Image processing, where correction takes place, precedes signal processing, where quality filtering takes place; output of signal processing is the .sff used by newbler. The read lengths reported in these .sff files match the lengths reported in the Orig Trimpoints column.)

    Does anyone know what causes newbler to trim a read still further? Is it simply trimming off unaligned parts of assembled reads?

    • flxlex said

      You are of course right in that I should have written ‘signal processing’. Duly noted and corrected.

      Newbler is not trimming off the unaligned parts, as the 454TrimStatus file is already generated before the alignment phase begins. I think newbler likes to use higher quality parts of reads to aid in assembly, while the image processing is more relaxed in its trimming.

  5. Risham said

    Hi Flxlex

    I am trying to annotate contigs obtained from two different cultivars of my experimental plant. I generated contigs with newbler for the two cultivars separately and for a pooled data set in which I pooled the raw sff files of the two individuals for assembly. So I have contigs for 3 sets: a) cultivar 1 b) cultivar 2 c) Cultivar_1_2_Pooled. This was done because the two cultivars are diverse and this could generate a draft genome of the experimental plant. Now when I am annotating the 3 sets of contigs, genes that come out in a particular orientation in both cultivars reverse their polarity in the pooled set. I am unable to understand this. For example, gene A is annotated on the plus strand in both cultivars individually, but in the pooled set of contigs it shows up on the minus strand. Is this related to how newbler assembles two different sff files together?

    • flxlex said

      Contigs can be in any orientation, so what you are seeing is most likely a chance thing. It could just as well have been that the strands were similar for cultivar 1 and the pooled assembly, but opposite for cultivar 2…
