An assembly of reads, contigs and scaffolds

A blog on all things newbler and beyond

Posts Tagged ‘sff’

Newbler input I: the sff file

Posted by lexnederbragt on October 28, 2010

Newbler can obviously take in the 454 reads, but also other read types: regular Sanger reads, any sequence in a fasta file (at most 200 bp), and perhaps also Illumina reads.

Sff files are the standard output of the 454 sequencing machine. ‘sff’ stands for ‘standard flowgram file’. The 454 sequencing method determines the sequence not base by base, but measures homopolymer length (the number of consecutive ‘A’s, ‘C’s, ‘G’s and ‘T’s on a sequence). Nucleotides are flown over the sequencing plate in a determined order (T-A-C-G) and a light signal is generated during nucleotide incorporation. The strength of the light signal is proportional to the number of bases built in (at least up to a certain number, around 7). As the flow order is always the same, for certain sequences no base can be built in, leading to a signal of strength (+/-) 0.

The sff file contains all the bases, quality values and signal strengths, in contrast to the fna and qual files. Note that sff files can, by definition, contain reads from only one type of chemistry, i.e. either GS 20, GS FLX or GS FLX Titanium reads.

Sff files are binary files, meaning that they can not be accessed by regular text-based tools. 454 has its own scripts to manipulate sffiles and extract information from them (sfffile, sffinfo), but other programs/scripts can also be used to extract information from them. Example programs are sff_extract, flower, sff2fasta, or use the biopython parser, nothing for bioperl yet (I have not tested any of these – use at your own discretion…). When one uses 454’s sffinfo command on an sff file without parameters, all information contained in the file is reported in text format. The remainder of this post will describe that output. Read the rest of this entry »

Posted in Newbler input | Tagged: , , | 23 Comments »