An assembly of reads, contigs and scaffolds

A blog on all things newbler and beyond

Newbler output II: contigs and scaffolds sequence files, and the 454Scaffolds.txt file

Posted by lexnederbragt on March 22, 2010

The files most people are after when they do an assembly must be these: the actual contig and scaffold sequences. The contigs are in the files 454AllContigs and 454LargeContigs. ‘All’ indicates by default contigs of at least 100 bp, while ‘Large’ contigs are at least 500 bp. These lower limits can be set during assembly.
The ‘fna’ files contain the sequences (bases) in fasta format (I actually do not why this extension was chosen over ‘fasta’ or ‘fa’ which are most often used). The ‘qual’ files contain phred-like quality scores (see previous post). The contigs are in the same order between fna and qual files, and the quality scores are in the same order as the bases:

>contig00005 length=962   numreads=77
CgaCTAGTATTGACACCCACAGTGAACTAACTATTGGTAACTATTATTAGGAACATGTAACTTGCATCAGGTACAGGTAACTAAAGGTATGTCTATTTAC

>contig00005 length=962   numreads=77
64 25 38 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 6464 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64

Note the lower case bases in the beginning of the sequence; these correspond to quality values below 40.

The fasta header:

>contig00005 length=962   numreads=77

gives the unique contig number, its length in bp and the number of reads in the alignment used to build this contig. Note however, that this number represents all reads that were aligned, regardless of whether the read was aligned over its entire length, or just a part of it. For this to make sense, please remember what I explained in the first post on the contig graph: reads can appear in more than one contig. Although the number of reads mentioned in the fasta header gives some indication on the read depth, it is not a good proxy. For the actual read depth (total number of bases aligned to generate the consensus contig sequence, divided over the contig length), please refer to the 454ContigGraph.txt file, which will be the subject of the next post.

The 454Scaffolds.fna and .qual files contain the scaffolds, which are nothing more than selected contigs, interspersed with gaps indicated by strings of ‘N’. The number of ‘N’s represent the gap length estimate, with a lower gap length limit of 20 bases. The contigs (between the gaps) in the scaffolds can be found in the 454AllContigs.fna (or the 454LargeCorntigs.fna) file, except those that are below the ‘All contigs’ lower length limit, usually 100 bp.

Finally, the 454Scaffolds.txt file contains a description of the scaffolds, showing which contigs are making up each scaffold, and what the gap sizes are in between them. The files follows the NCBI scaffold layout ‘AGP’ format.

A portion of an example 454Scaffolds.txt file is shown here:

scaffold00001   1       27007   1       W       contig00001     1       27007   +
scaffold00001   27008   27315   2       N       308     fragment        yes
scaffold00001   27316   61770   3       W       contig00002     1       34455   +
scaffold00001   61771   63341   4       N       1571    fragment        yes
scaffold00001   63342   99181   5       W       contig00003     1       35840   +
scaffold00001   99182   99489   6       N       308     fragment        yes
scaffold00001   99490   132133  7       W       contig00004     1       32644   +

Note column 5, which indicates whether the line describes a contig: ‘W’ or gap: ‘N’. For each unique scaffold number/name (column 1), the starting and ending position are in columns 2 and 3, an incremental line number in column 4 (this starts at ‘1’ for each next scaffold).

For the contig lines (‘W’ in column 5), the contig name is given in column 6, followed by the start and end position of the part of the contig, which in the case of 454 newbler assemblies always are the first and last base of the contig. The last column indicates the orientation of the contig in the scaffold, ‘+’ means forward, ‘-‘ reverse. Newbler adjusts the orientation of contigs such that this column always will be ‘+’.

For the lines indicating gaps (‘N’ in column 5), the gap length is given in column 6, followed by the gap type, which in the case of newbler assemblies always is ‘fragment’ (i.e. a gap between sequences). The final column is always ‘yes’ for gaps lines, which indicates that “there is evidence of linkage between the adjacent lines”.

A note on scaffold definitions
I would define a scaffold as ‘two or more contigs connected by a minimum number of consistent paired end reads’ (where newbler’s minimum is 2 paired reads). In the output of newbler, besides these scaffolds, you will also find all unscaffolded contigs of at least 2000 bp, as these constitute significant parts of the genome sequenced. This means that in the 454Scaffolds.fna file, there will be scaffolds without any gap bases. These scaffolds will be represented in the 454Scaffolds.txt file by a single line. I tend to remove these scaffolds from an assembly and only report on genuine scaffolds, with at least one gap…

Advertisements

19 Responses to “Newbler output II: contigs and scaffolds sequence files, and the 454Scaffolds.txt file”

  1. Aaron said

    Nice series of posts. I’m also a Newbler newbie trying to make sense of it’s output. Thanks!

  2. Shawn said

    I’ve always wondered about one thing in the 454*contig files. Sometimes there are contigs that say “numreads=1”. How are these different from the singletons that don’t get placed in contigs?

    • flxlex said

      These contigs are quite small, right? I guess that such a contig consists of part of a read whose other end(s) are aligned to other reads in another contig. So in a way it is an artifact of the contig graph. As the other part of the read is aligned to other reads, the read itself is not a singleton. Singletons are reads that have no (significant) alignment to any other read.

  3. Ketil said

    Hi, Lex, and thanks for pointing me here. I now subscibed to your RSS with Google Reader, keep it coming!

    The .fna suffix is sometimes used to separate nucleotide sequences (Fasta nucleic acids?) from amino acids, which use .faa. In addition to .fasta and .fa, I’ve also seen .fst – and there’s probably more.

    -k

  4. Trine said

    I like !

  5. ganga jeena said

    >contig150144 length=354 numreads=1
    >contig290757 length=103 numreads=1
    >contig290787 length=103 numreads=1
    >contig290874 length=103 numreads=1
    >contig291165 length=103 numreads=1
    >contig291512 length=102 numreads=1
    >contig291822 length=102 numreads=1
    >contig292501 length=101 numreads=1
    >contig292575 length=101 numreads=1
    >contig292626 length=101 numreads=1
    >contig293593 length=100 numreads=1

    [root@x4640 assembly]# head -1 454PairStatus.txt

    Template Status Distance Left Contig Left Pos Left Dir Right Contig Right Pos Right Dir Left Distance RightDistance

    [root@x4640 assembly]# less 454PairStatus.txt

    [root@x4640 assembly]# grep “GSMQLLJ02HDJFR” 454PairStatus.txt

    [root@x4640 assembly]# grep “GSMQLLJ02HDJFR” 454ReadStatus.txt

    GSMQLLJ02HDJFR Assembled contig305853 1 + contig291822 102 –

    [root@x4640 assembly]# grep “contig281520” 454ReadStatus.txt

    GMCWRKQ02FWW3F Assembled contig281520 1 + contig256431 99 –

    [root@x4640 assembly]# grep “contig150144” 454ReadStatus.txt

    GKCF20101DFS67 PartiallyAssembled contig397503 1 + contig150144 354 –
    [root@x4640 assembly]#

    How is it that only one base is mapped to some contig and longer to some other and longer one reported with numreads=1 status.

    Can someone explain
    I am getting many such contigs
    Regards,
    Ganga Jeena

  6. Delphine said

    Hi!

    There are lowercase bases in file 454AllContigs.fna. Is it possible to avoid that? Newbler can output contigs only with upercase (so only if quality value isn’t below 40)?

    Thanks,

    • flxlex said

      Nope, not possible. You will have to remove unwanted bases or contigs yourself. But a Q39 base is still very worthwhile, IMHO…

  7. Soni said

    Great blog Flxlex,

    Would you know how gsMapper decides on a consensus especially over a region that has structural variations? So I am looking in 454HCStructVars.txt and 454HCStructRearrangements.txt to identify large structural variations (SV). When I look at the assembly over the region with the variation (e.g. deletion) the deletion is not reflected, as in there is sequence spanning the “deleted” region. The variation frequency for these SVs is also high (around 70%).

    Would you know what’s going on or what’s the best way to interpret the SV output from gsMapper?

    Cheers

    Soni

    • flxlex said

      I am sorry, but I haven’t enough experience with gsMapper to be able to answer your question. Perhaps you could ask it on seqanswers.com, or on biostar.stackexchange.com?

  8. Nathan Watson-Haigh said

    In my 454Scaffold.txt file I have consecutive lines of contigs with no interveaning gaps. When I look at these boundaries, there are clipped reads that span into the neighbouring contig.

    Can you explain how these come about in the Newbler assembly/scaffolding process? Is there any reason when such neighbouring contigs cannot be merged into a single longer contig?

    Cheers,
    Nathan

  9. Nino said

    I have to do some analysis on a bacterial resequenced genome and I just received the data after assembly with Newbler. I did not receive any scaffold file while I have the AllContigs and ContigsGraph files. Is there a way to obtain longer not overlapping sequences from the AllContigs files? Or I have to consider the AllContigs file as the assembled genome? thanks

  10. Westeros said

    Thanks for the great blog !
    Since reads can be used in more than one contig, what would be a good proxy for the transcriptional level ?
    Thanks !

    • I wouldn’t use 454/Roche software to do RNA-seq with 454 data, there is no module for that. There may be other software that uses output from gsMApper (the mapping application), but I don’t know of any…

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
%d bloggers like this: