An assembly of reads, contigs and scaffolds

A blog on all things newbler and beyond

Newbler input I: the sff file

Posted by lexnederbragt on October 28, 2010

Newbler can obviously take in the 454 reads, but also other read types: regular Sanger reads, any sequence in a fasta file (at most 200 bp), and perhaps also Illumina reads.

Sff files are the standard output of the 454 sequencing machine. ‘sff’ stands for ‘Standard Flowgram Format’. The 454 sequencing method does not determine the sequence base by base, but measures homopolymer lengths (the number of consecutive ‘A’s, ‘C’s, ‘G’s or ‘T’s in a sequence). Nucleotides are flowed over the sequencing plate in a fixed order (T-A-C-G), and a light signal is generated when a nucleotide is incorporated. The strength of the light signal is proportional to the number of bases incorporated (at least up to a certain number, around 7). As the flow order is always the same, in many flows no base can be incorporated for a given read, leading to a signal of strength (+/-) 0.

The sff file contains all the bases, quality values and signal strengths, in contrast to the fna and qual files, which contain only the bases and only the quality values, respectively. Note that sff files can, by definition, contain reads from only one type of chemistry, i.e. either GS 20, GS FLX or GS FLX Titanium reads.

Sff files are binary files, meaning that they cannot be read with regular text-based tools. 454 has its own programs to manipulate sff files and extract information from them (sfffile, sffinfo), but other programs/scripts can also be used. Examples are sff_extract, flower, sff2fasta and the biopython parser; there is nothing for bioperl yet (I have not tested any of these – use at your own discretion…). When one runs 454’s sffinfo command on an sff file without parameters, all information contained in the file is reported in text format. The remainder of this post describes that output.

Each sff file starts with a ‘common header’:

Common Header:
Magic Number: 0x2E736666
Version: 0001
Index Offset: 110544
Index Length: 3173
# of Reads: 35
Header Length: 840
Key Length: 4
# of Flows: 800
Flowgram Code: 1
Flow Chars: TACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACG
Key Sequence: TCAG

  • Magic number: identical for all sff files (0x2E736666; the newbler manual explains that this is ‘the uint32_t encoding of the string ".sff"’)
  • Version: also identical for all sff files (0001)
  • Index offset and length: has to do with the index of the binary sff file (points to the location of the index in the file)
  • # of reads: stored in the sff file
  • Header length: looks like it is 440 for GS FLX reads, 840 for GS FLX Titanium reads
  • Key length: the length (in bases) of the key sequence that each read starts with, so far always 4
  • # of Flows: each flow consists of a base that is flowed over the plate; for GS20, there were 168 flows (42 cycles of all four nucleotides), 400 for GS FLX (100 cycles) and 800 for Titanium (200 cycles)
  • Flowgram code: kind of the version of coding the flowgrams (signal strengths); so far, ‘1’ for all sff files
  • Flow Chars: a string of ‘# of Flows’ characters (168, 400 or 800) giving the bases in flow order (‘TACG’ repeated, up to now)
  • Key Sequence: the first four bases of each read are either added during library preparation (they are the last bases of the ‘A’ adaptor) or are part of the control beads. For example, Titanium sample beads have key sequence TCAG (default library protocol, as in this example) or GACT (rapid library protocol); control beads have CATG or ATGC. Control reads never make it into sff files…
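
The common header fields listed above can also be read directly from the binary file. The following is a minimal Python sketch, assuming the big-endian layout of the publicly documented SFF format (a 31-byte fixed part, followed by the flow characters, the key sequence and zero padding up to a multiple of 8 bytes). The function and variable names are made up for illustration, and the header bytes are rebuilt in memory from the example values above rather than read from a real file.

```python
import struct

# Fixed-size part of the common header (big-endian, 31 bytes), assumed layout:
# magic number, version, index offset, index length, number of reads,
# header length, key length, number of flows, flowgram code.
FIXED = struct.Struct(">I4sQIIHHHB")
MAGIC = 0x2E736666  # the bytes ".sff" read as one unsigned 32-bit integer


def padded_header_length(n_flows, key_length):
    """Fixed part + flow chars + key sequence, rounded up to a multiple of 8."""
    raw = FIXED.size + n_flows + key_length
    return (raw + 7) // 8 * 8


def parse_common_header(data):
    """Parse the common header from the first bytes of an sff file."""
    (magic, version, index_offset, index_length, n_reads,
     header_length, key_length, n_flows, flowgram_code) = FIXED.unpack_from(data, 0)
    if magic != MAGIC:
        raise ValueError("magic number mismatch: not an sff file")
    start = FIXED.size
    flow_chars = data[start:start + n_flows].decode("ascii")
    key_start = start + n_flows
    key_sequence = data[key_start:key_start + key_length].decode("ascii")
    return {
        "version": version,
        "index_offset": index_offset,
        "index_length": index_length,
        "n_reads": n_reads,
        "header_length": header_length,
        "key_length": key_length,
        "n_flows": n_flows,
        "flowgram_code": flowgram_code,
        "flow_chars": flow_chars,
        "key_sequence": key_sequence,
    }


# Rebuild the example header in memory (a stand-in for the start of a real file).
body = FIXED.pack(MAGIC, b"\x00\x00\x00\x01", 110544, 3173, 35,
                  padded_header_length(800, 4), 4, 800, 1)
body += b"TACG" * 200 + b"TCAG"
body = body.ljust(padded_header_length(800, 4), b"\x00")

header = parse_common_header(body)
print(header["header_length"], header["n_reads"], header["key_sequence"])
```

Under this layout, 31 + 800 + 4 = 835 bytes are padded to 840 for Titanium, and 31 + 400 + 4 = 435 bytes are padded to 440 for GS FLX, which would explain the two header lengths noted above.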

Each read has the following structure:

>F7K88GK01BMPI0
Run Prefix: R_2009_12_18_15_27_42_
Region #: 1
XY Location: 0551_2346

Run Name: R_2009_12_18_15_27_42_FLX########_Administrator_yourrunname
Analysis Name: D_2009_12_19_01_11_43_XX_fullProcessing
Full Path: /data/R_2009_12_18_15_27_42_FLX########_Administrator_yourrunname/D_2009_12_19_01_11_43_XX_fullProcessing/

Read Header Len: 32
Name Length: 14
# of Bases: 500
Clip Qual Left: 15
Clip Qual Right: 490
Clip Adap Left: 0
Clip Adap Right: 0

Flowgram: 1.03 0.00 1.01 0.02 0.00 0.96 0.00 1.00 0.00 1.04 0.00 0.00 0.97 0.00 0.96 0.02 0.00 1.04 0.01 1.04 0.00 0.97 0.96 0.02 0.00 1.00 0.95 1.04 0.00 0.00 2.04 0.02 0.03 1.05 0.99 0.01 2.84 0.03 0.05 0.97 0.12 0.00 1.01 0.05 0.97 0.01 2.89 0.04 0.09 1.05 0.15 0.00 2.84 0.06 1.00 0.01 0.13 1.01 0.09 0.98 0.01 0.05 1.01 0.06 0.00 1.04 3.72 0.03 0.00 0.96 1.97 0.04 0.01 1.97 0.12 0.98 0.02 0.08 0.95 0.12 ...
Flow Indexes: 1 3 6 8 10 13 15 18 20 22 23 26 27 28 31 31 34 35 37 37 37 40 43 45 47 47 47 50 53 53 53 55 58 60 63 66 67 67 67 67 70 71 71 74 74 76 79 82 83 86 86 88 88 91 93 96 97 99 102 105 ...
Bases: tcagatcagacacgCCACTTTGCTCCCATTTCAGCACCCCACCAAGCACAAGGCTGTCATCCCAATTGGACGGACAGATATGAGGTTAGCATTGGAAACCAATTCAGTCCCTAATTATTCACGACTGAACCCAGCGACAATTGGACATGGATTCATTTTTCAACTTGATTTGTTGTTGTAAAAGCACTGAAGAAGATGCCGCAACAAGAGCTTCCAAAGTTTCCCACCGGATCGACGGTACCCTTTCCCTATGAATCTCCTTATCCTCAGCAGACAGCTTTGATGGACACGCTGCTCGAGTGTTTGCAGCAAAAGGATCACGATGATTCAACATGGCGCCAAACCAATGACAGCCATAGCAAGAACAAGAAGAAACCCCGTGCGGCCGTGATGATGTTGGAGTCTCCTACCGGCACTGGCAAGTCTCTATCTTTGGCGTGTAGTGCCATGGCGTGGCTCAAGTACTGCGAACAACGAGATTTGACTGCAGaagaagaatc
Quality Scores: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 38 38 38 40 40 40 39 39 39 40 34 34 34 40 40 40 40 39 26 26 26 26 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 ...

  • >F7K88GK01BMPI0: this is the read name, or “universal accession number.” ‘F7K88G’ encodes the timestamp of the run, ‘K’ is a random character, ’01’ indicates the region (lane) number on the plate, ‘BMPI0’ encodes the x,y location of the read on the plate.
  • Run prefix: a run folder name starts with ‘R’ and the time the run started: R_yyyy_mm_dd_hh_min_sec_
  • Region #: the region (lane) on the plate the read originated from
  • XY location: the location of the read on the plate
  • Run name: R_yyyy_mm_dd_hh_min_sec_machineName_userName_yourrunname
  • Analysis name: after a run, a subfolder is made with the image/basecalling analysis results, the foldername starts with ‘D’ and the time the analysis started: D_yyyy_mm_dd_hh_min_sec_machineName_analysisType
  • Full path: of the analysis results that the sff file originated from (on the GS FLX instrument: /data/R_…/D_…)
  • Read header len: 32 for all files as far as I can tell
  • Name length: the length of the read name (14), see above
  • # of bases: the total number of bases called for the read (before clipping)
  • Clip qual left: the position of the first base to be included after clipping. This is usually 5, because the first four bases are the key sequence. In this example it is 15, because the read had a 10-base MID sequence: the example sff file is the result of splitting the original sff file, and during splitting the MID sequence is ‘removed’, i.e. the clipping point is set beyond the end of the MID.
  • Clip qual right: the position of the last base to be included after (quality) clipping.
  • Clip adap left and right: I actually wouldn’t know what these represent, but perhaps under certain circumstances, adaptors can be ‘removed’ this way.
  • Flowgram: for each flow, the normalized signal strength, or actually the homopolymer length estimate, as a floating-point number with two digits after the decimal point.
  • Flow Indexes: for each called base, the number of the flow it was called from; flows considered to be ‘0’ (i.e. no signal) are skipped, and a flow that yields several bases is listed once per base.
  • Bases: the determined DNA sequence. Lower-case bases lie outside the clipping points, i.e. they are clipped off.
  • Quality scores: the phred quality scores corresponding to the bases
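
To make the clipping fields concrete, here is a small Python sketch of how the clip positions could be applied to a read. It assumes that positions are 1-based and inclusive, that a value of 0 means ‘not set’, and that the quality and adaptor clip points are combined by taking the innermost ones, as described in the public SFF format documentation. The function name and the toy read (the start and end of the example read glued together) are made up for illustration.

```python
def clip_read(bases, qual_left, qual_right, adap_left=0, adap_right=0):
    """Return the part of a read that remains after clipping.

    Positions are 1-based and inclusive; 0 means 'no clip point set'.
    """
    # Left clip: the innermost (largest) of the left clip points, at least 1.
    left = max(1, qual_left, adap_left)
    # Right clip: the innermost (smallest) of the right clip points that are set.
    right_points = [p for p in (qual_right, adap_right) if p > 0]
    right = min(right_points) if right_points else len(bases)
    return bases[left - 1:right]


# Toy read: key + MID (14 bases), 12 kept bases, 10 clipped bases at the end.
toy_read = "tcagatcagacacg" + "CCACTTTGCTCC" + "aagaagaatc"
print(clip_read(toy_read, qual_left=15, qual_right=26))
```

For the real example read, the same call with the values 15 and 490 would return the 476 upper-case bases.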

How to ‘translate’ the flowgram values into bases? The start of the example flowgram has these signals:

1.03 0.00 1.01 0.02 0.00 0.96 0.00 1.00 0.00 1.04 0.00 0.00 0.97 0.00 0.96 0.02 0.00 1.04 0.01 1.04 0.00 0.97 0.96 0.02 0.00 1.00 0.95 1.04 0.00 0.00 2.04 0.02 0.03 1.05 0.99 0.01 2.84 0.03

Rounding of the numbers:

1 0 1 0 0 1 0 1 0 1 0 0 1 0 1 0 0 1 0 1 0 1 1 0 0 1 1 1 0 0 2 0 0 1 1 0 3 0

With the flow order TACG, this translates into

1 T’s; 0 A’s; 1 C’s; 0 G’s; 0 T’s; 1 A’s etc, or TCAGATCAGACACGCCACTTT
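
The same translation can be written as a few lines of Python. This is only a sketch of the naive ‘round each signal’ approach described here, not the actual 454 base caller; the function name is made up for illustration. It also returns the flow number of every called base, which corresponds to the ‘Flow Indexes’ line of the sffinfo output.

```python
def call_bases(flowgram, flow_chars):
    """Naively translate flow signals into bases by rounding each signal."""
    bases = []
    flow_indexes = []
    for flow_number, (signal, nucleotide) in enumerate(zip(flowgram, flow_chars), start=1):
        n = int(signal + 0.5)  # round to the nearest homopolymer length
        bases.append(nucleotide * n)
        flow_indexes.extend([flow_number] * n)  # one entry per called base
    return "".join(bases), flow_indexes


# The first 38 signals of the example flowgram.
signals = [1.03, 0.00, 1.01, 0.02, 0.00, 0.96, 0.00, 1.00, 0.00, 1.04,
           0.00, 0.00, 0.97, 0.00, 0.96, 0.02, 0.00, 1.04, 0.01, 1.04,
           0.00, 0.97, 0.96, 0.02, 0.00, 1.00, 0.95, 1.04, 0.00, 0.00,
           2.04, 0.02, 0.03, 1.05, 0.99, 0.01, 2.84, 0.03]

sequence, indexes = call_bases(signals, "TACG" * 10)
print(sequence)  # TCAGATCAGACACGCCACTTT
print(indexes)
```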

The figure is a graphic representation of the flowgram, with another example of ‘reading’ the sequence from it. Note that for some signals, the intensity is such that it is hard to determine whether for example there are two or three bases at that position. This inherent property of pyrosequencing leads to the well-known homopolymer (over- and undercall) errors.
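
One way to see which flows are at risk of such over- or undercalls is to list the signals that lie far from a whole number. The Python sketch below does this for a handful of signals taken from the example flowgram; the function name and the margin of 0.25 are arbitrary choices for illustration, not something sffinfo or newbler reports.

```python
def ambiguous_flows(flowgram, margin=0.25):
    """Return (position, signal) pairs whose signal is far from a whole number."""
    flagged = []
    for position, signal in enumerate(flowgram, start=1):
        nearest = int(signal + 0.5)
        if abs(signal - nearest) > margin:
            flagged.append((position, signal))
    return flagged


# A selection of signals from the example flowgram (positions are within this list).
selection = [1.03, 2.04, 2.84, 0.12, 3.72, 1.97]
print(ambiguous_flows(selection))
```

In this selection only the 3.72 signal is flagged: it was called as four bases in the example read, but it is not that far from three and a half.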

23 Responses to “Newbler input I: the sff file”

  1. […] Science software suite, will list the content of the binary sff file in text format (see the post on my other blog). Other, open source/access tools, such the ones my mention on my blog, might do this as well. Here […]

  2. […] template. For each read, one or more of these bases gets incorporated, or none at all (see also an entry on this at my other […]

  3. XICO2KX said

    “# of Flows: each flow consists of a base that is flowed over the plate; for GS20, there were 168 flows (42 cycles of all four nucleotides), 400 for GS FLX (100 cycles) and 800 for Titanium (200 cycles)”
    Do you know the number of flows values for the “GS Junior” and for the new “GS FLX+”?
    Thank you very much!

    • flxlex said

      Since the GS Junior is running GS FLX Titanium chemistry, the number of flows is 800. GS FLX+ is double that, 1600 flows (400 cycles).

      • XICO2KX said

        Thanks for the information!
        By the way, do you know what is the default paired-ends linker used in the new “GS FLX+” chemistry?
        Is it the same one as in “FLX” or the same one (actually 2) in “Titanium” or a completely new one?
        Thank you very much once again!

      • flxlex said

        As far as I know, the protocol for generating paired-end reads hasn’t changed, so the linker should still be the same. I also heard Roche suggesting it is of no use to sequence paired-end libraries on GS FLX+ instead of Titanium, as the extra length will not significantly help in the mapping of the pair halves.

      • FLX+ uses the same linker as Titanium (you have to look for both its forward and reverse-complement forms). The runs have 1600 flows (unlike FLX+ shotgun runs, which have 1779 flows).

        By the old FLX linker you probably meant the one forming a hairpin; that was in GS20 and GS FLX standard times.

        I don’t agree it is wasteful to do paired-end sequencing on FLX+. When the linker is not around position 200 but at, say, 400, you don’t get the second half without the extra length. But 8 kb and definitely 20 kb libraries are highly redundant, and that is more of a concern. That, however, is about the bench work and molecular biology, and has nothing to do with sequencing “issues”.

      • Yes, maybe the extra length will give you a higher fraction of pairs (i.e. read through the whole linker), but it may not be a very significant increase. And yes, the duplication rate is a bigger issue for long libraries.

  4. Carlos said

    “When one uses 454′s sfffile command on an sff file without parameters, all information contained in the file is reported in text format.”

    Should be sffinfo.
    Very good post anyway!

  5. luciana maria de hollanda said

    How do I find out whether a sequence in the sff file is forward or reverse?

    • flxlex said

      Although I am unsure what you mean, I guess the answer is you can not, until you map the sequence to a reference genome or contigs.

      • LUCIANA said

        How does the Roche 454 FLX software know whether a sequence in the .sff file is forward or reverse?

      • flxlex said

        It can’t know the orientation based on the information from the sff file. Sequencing is random in that respect. During assembly, overlaps between reads are found by comparing both the forward and the reverse orientation of each read. During mapping, again both orientations are checked.

  6. Björn said

    Thank you very much for that post.

    I was wondering if it is possible to correct the homopolymer errors in an sff file directly.
    That could be done with the corresponding Biopython package, but I’m not sure if there is additional information that would be destroyed by that.

    • flxlex said

      How would you go about to correct the homopolymer errors? And what information may be destroyed?

      • Björn said

        At the moment I’m converting them to FASTQ and then using Nesoni with some additional Illumina reads, which works quite well.
        I was just wondering if there was an option to simply change the sequences and corresponding flows in the sff files.

      • flxlex said

        There is no such tool that I know of – you’d have to write it yourself. In principle this is possible, but you would have to be careful to get all the changes right.

  7. Bell said

    After one uses sffinfo to obtain a text file, how does one convert this text file back into sff format? Can sffinfo be used to do that?

    • flxlex said

      No. One option that I happen to know of that can do it is biopython. Check http://biopython.org/DIST/docs/api/Bio.SeqIO.SffIO-module.html. Note that it says:

      You can also use the Bio.SeqIO.write() function with the “sff” format. Note
      that this requires all the flow information etc, and thus is probably only
      useful for SeqRecord objects originally from reading another SFF file (and
      not the trimmed SeqRecord objects from parsing an SFF file as “sff-trim”).

  8. Jordi said

    Hi all! That site appears to be a reference site to 454 data! Thank you so much!
    I would like to explain what is going on with a certain sff file. That file resulted from a “sfffile -o output.sff file1.sff file2.sff” command line, but when I try to use the GS Amplicon Variant Analyzer software I get the following error:

    “ERROR: No region number found in file name ‘output.sff’
    in ‘/tmp/tmphGrKNp’ at line 183”

    Using Roche tools (sfffile and sffinfo) I am able to know that output.sff file actually results from a combination of file1.sff and file2.sff, I mean: (just as an example)
    MID1 file1.sff=32 reads
    MID1 file2.sff=8 reads
    MID1 output.sff= 40 reads

    I suspect that there are some problems regarding the region the reads originated from, but I don’t know how to fix it in order to visualize the data with GS Amplicon Variant Analyzer.
    Any advice would be appreciated!
    Thanks in advance!

  9. Student said

    Hello! How can I identify a SNP in a heterozygous individual in this flowgram detected by 454? Thanks.
