An assembly of reads, contigs and scaffolds

A blog on all things newbler and beyond

Archive for the ‘Newbler input’ Category

Newbler input III: a quick fix for the new Illumina fastq header

Posted by lexnederbragt on September 1, 2011

One unfortunate drawback of working with Illumina sequences is the many changes to the format of their fastq read files. The quality scoring has been changed several times since the first Solexa reads became available. It appears they have now settled on the Sanger style, see this Wikipedia entry.
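To make the difference between the scoring styles concrete, here is a minimal sketch (not part of newbler or of Illumina's software). It assumes the commonly documented offsets: Sanger-style quality characters are ASCII code minus 33, while the older Solexa/Illumina files used an offset of 64. The function names are made up for this illustration, and the offset guess is only a rough heuristic.

```python
def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string to a list of integer scores.

    offset=33 is the Sanger style; offset=64 was used by older
    Solexa/Illumina files.
    """
    return [ord(char) - offset for char in quality_string]


def guess_offset(quality_string):
    """Rough heuristic: characters below ASCII 59 (';') cannot occur
    in an offset-64 file, so their presence points to offset 33.
    A string without such characters is ambiguous; 64 is only a guess.
    """
    lowest = min(ord(char) for char in quality_string)
    return 33 if lowest < 59 else 64


print(decode_qualities("II5+"))        # Sanger style
print(decode_qualities("hhT", 64))     # older Illumina style
```

In practice one would look at many reads before deciding on the offset, since a single high-quality read looks the same under both assumptions.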

(Source: thepoolandspashoponline.com.au)

Regrettably, with their latest software upgrade (Casava 1.8), the headers (sequence identifiers) in the fastq files have changed. The change is described in the aforementioned Wikipedia entry; basically, some elements have been added, some have changed order, and there are now two parts separated by a space.

I wouldn’t have written this blogpost if this change had not been relevant for newbler: we were lucky enough to enjoy direct reading of Illumina fastq files (with newbler determining the quality scoring type) starting with newbler 2.6. newbler also matches mate-pairs (Illumina read 1 and read 2), so that these can be used as paired-ends by newbler (to build scaffolds). By the way, FASTQ files from the NCBI/EBI Sequence Read Archive are also correctly parsed for mate pairs, but here the filename is used for determining read 1 and read 2.
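As an illustration of the kind of fix that is possible (not necessarily the one described in the rest of this entry), the sketch below rewrites a Casava 1.8 header so that the read number ends up as a "/1" or "/2" suffix, as in the older headers. It assumes the two-part layout described on Wikipedia, where the second part starts with the read number, and it assumes plain four-line fastq records. The function names are hypothetical, and whether newbler accepts exactly this form is an assumption.

```python
def casava18_to_old_header(header):
    """Turn '@name 1:N:0:ATCACG' into '@name/1'.

    Headers without a space-separated second part are returned unchanged.
    """
    name, separator, description = header.rstrip("\n").partition(" ")
    if not separator:
        return name
    read_number = description.split(":")[0]
    return f"{name}/{read_number}"


def convert_fastq_lines(lines):
    """Rewrite the header of every record, assuming 4 lines per record.

    Counting lines (rather than testing for a leading '@') avoids
    mistaking a quality line that starts with '@' for a header.
    """
    for index, line in enumerate(lines):
        if index % 4 == 0:
            yield casava18_to_old_header(line) + "\n"
        else:
            yield line


print(casava18_to_old_header(
    "@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG"))
```

To convert a whole file, one could pass an open file handle to convert_fastq_lines and write the resulting lines to a new file.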

Read the rest of this entry »

Posted in Newbler input | 5 Comments »

Newbler input II: sequencing reads from other platforms

Posted by lexnederbragt on January 21, 2011

A Sanger sequence read electropherogram (source: Wikipedia)

Both the runMapping and runAssembly programs are able to take in reads from other platforms, at least Sanger reads and Illumina reads. As long as the reads are in fasta format, with an optional quality file, newbler accepts and uses these reads. When the fasta files contain paired end (mate pair) reads, newbler can actually be made to use the pair information.

In general, it is a good idea to clean your fasta sequences before adding them to newbler: remove vectors, linkers, low quality parts of reads, or entire low quality reads first.
Also note that, while for sff files a symbolic link is generated in the assembly or project folder (still present after the program is finished when the -nrm flag is set), fasta files are not included in this way.

Read the rest of this entry »

Posted in Newbler input | 20 Comments »

Newbler input I: the sff file

Posted by lexnederbragt on October 28, 2010

Newbler can obviously take in the 454 reads, but also other read types: regular Sanger reads, any sequence in a fasta file (at most 2000 bp), and perhaps also Illumina reads.

Sff files are the standard output of the 454 sequencing machine. ‘sff’ stands for ‘standard flowgram format’. The 454 sequencing method does not determine the sequence base by base, but measures homopolymer lengths (the number of consecutive ‘A’s, ‘C’s, ‘G’s or ‘T’s in a sequence). Nucleotides are flowed over the sequencing plate in a fixed order (T-A-C-G) and a light signal is generated during nucleotide incorporation. The strength of the light signal is proportional to the number of bases incorporated (at least up to a certain number, around 7). As the flow order is always the same, in many flows the offered nucleotide does not match the next base of the template and nothing is incorporated, leading to a signal strength of (+/-) 0.
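The flow principle can be sketched in a few lines of code. The function below is a simplified, noise-free model and not how the 454 software works: it walks through a sequence using the T-A-C-G flow order and reports, for each flow, how many bases would be incorporated. Real signals are noisy decimal values rather than whole numbers, and the function name is made up for this illustration.

```python
FLOW_ORDER = "TACG"


def ideal_flowgram(sequence):
    """Return a list of (nucleotide, homopolymer length) per flow.

    A length of 0 means the flowed nucleotide did not match the next
    base, so nothing was incorporated in that flow.
    """
    sequence = sequence.upper()
    for base in sequence:
        if base not in FLOW_ORDER:
            raise ValueError(f"unsupported base: {base!r}")

    signals = []
    position = 0
    flow_index = 0
    while position < len(sequence):
        nucleotide = FLOW_ORDER[flow_index % len(FLOW_ORDER)]
        run_length = 0
        # Incorporate as many consecutive matching bases as possible.
        while position < len(sequence) and sequence[position] == nucleotide:
            run_length += 1
            position += 1
        signals.append((nucleotide, run_length))
        flow_index += 1
    return signals


print(ideal_flowgram("TTACCCG"))
print(ideal_flowgram("GA"))
```

The second example shows the zero signals: for the sequence GA, the T, A and C flows pass without incorporation before the G is built in, and another T flow passes before the A.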

The sff file contains all the bases, quality values and signal strengths, in contrast to the fna and qual files. Note that sff files can, by definition, contain reads from only one type of chemistry, i.e. either GS 20, GS FLX or GS FLX Titanium reads.

Sff files are binary files, meaning that they cannot be read with regular text-based tools. 454 has its own programs to manipulate sff files and extract information from them (sfffile, sffinfo), but other programs and scripts can also be used. Examples are sff_extract, flower and sff2fasta, and Biopython has an sff parser; there is nothing for BioPerl yet (I have not tested any of these – use at your own discretion…). When one runs 454’s sffinfo command on an sff file without parameters, all information contained in the file is reported in text format. The remainder of this post will describe that output.

Read the rest of this entry »

Posted in Newbler input | 23 Comments »