An assembly of reads, contigs and scaffolds

A blog on all things newbler and beyond

Newbler input III: a quick fix for the new Illumina fastq header

Posted by lexnederbragt on September 1, 2011

One unfortunate drawback of working with Illumina sequences is the many changes to the format of their fastq read files. The quality scoring has been changed several times since the first Solexa reads became available. It appears they have now settled on the Sanger style (see this Wikipedia entry).


Regrettably, with their latest software upgrade (Casava 1.8), the headers (sequence identifiers) in the fastq files have changed. The change is described in the aforementioned Wikipedia entry; basically, some elements have been added, some have changed order, and there are now two parts separated by a space.
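
To make the new layout concrete, here is a minimal Python sketch that splits a Casava 1.8+ header into its parts. The field names follow the Wikipedia fastq entry mentioned above; the class name `CasavaHeader` and the function name `parse_casava18_header` are made up for this illustration, and the sketch assumes exactly seven colon-separated fields before the space and four after it.

```python
from typing import NamedTuple


class CasavaHeader(NamedTuple):
    # First part (before the space)
    instrument: str
    run_number: str
    flowcell_id: str
    lane: str
    tile: str
    x: str
    y: str
    # Second part (after the space)
    read: str            # member of the pair: "1" or "2"
    is_filtered: str     # "Y" or "N"
    control_number: str
    index: str


def parse_casava18_header(line: str) -> CasavaHeader:
    """Split a Casava 1.8+ '@' header line into named fields (illustrative only)."""
    line = line.rstrip("\n")
    if not line.startswith("@"):
        raise ValueError("not a fastq header line: %r" % line)
    parts = line[1:].split(" ", 1)
    if len(parts) != 2:
        raise ValueError("no space-separated second part: %r" % line)
    first = parts[0].split(":")
    second = parts[1].split(":")
    if len(first) != 7 or len(second) != 4:
        raise ValueError("unexpected number of fields: %r" % line)
    return CasavaHeader(*first, *second)
```

Calling `parse_casava18_header` on a header line gives access to, for example, `.flowcell_id` and `.read`, which are the pieces that matter for pairing.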

I wouldn’t have written this blogpost if this change had not been relevant for newbler: we were lucky enough to enjoy direct reading of Illumina fastq files (with newbler determining the quality scoring type) starting with newbler 2.6. newbler also matches mate-pairs (Illumina read 1 and read 2), so that these can be used as paired-ends by newbler (to build scaffolds). By the way, FASTQ files from the NCBI/EBI Sequence Read Archive are also correctly parsed for mate pairs, but here the filename is used for determining read 1 and read 2.

The new Illumina fastq header (from Casava 1.8 and beyond) still allows direct reading of the fastq files by newbler but, because of the change in the header format, the pairing information is no longer understood. These reads are therefore used as shotgun reads instead.

When I asked 454 Life Sciences about this, they confirmed newbler 2.6’s behaviour on the new Illumina fastq headers, and came up with a helpful tip on how to work around it while we await a new newbler version that fixes the problem. The solution unfortunately requires you to make a copy of the fastq file, with the old-style header. For this, you can use your favorite bioinformatics command or language, but here I use an awk command. It adjusts the ‘@’ header line, but leaves the ‘+’ header line blank (potentially saving some disk space):

cat new-style_.fastq | awk '{if (NR % 4 == 1) {split($1, arr, ":"); printf "%s_%s:%s:%s:%s:%s#0/%s (%s)\n", arr[1], arr[3], arr[4], arr[5], arr[6], arr[7], substr($2, 1, 1), $0} else if (NR % 4 == 3){print "+"} else {print $0} }' > old-style.fastq

An example on a fake fastq file:
Original, Casava 1.8+ fastq file (header and sequence taken from the Wikipedia fastq page)

@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
@EAS139:136:FC706VJ:2:2104:15343:197393 2:Y:18:ATCACG
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC

Old-style fastq file copy

@EAS139_FC706VJ:2:2104:15343:197393#0/1 (@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG)
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
@EAS139_FC706VJ:2:2104:15343:197393#0/2 (@EAS139:136:FC706VJ:2:2104:15343:197393 2:Y:18:ATCACG)
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC

Note that the original header is added at the end in parentheses. If you do not want or need that, simply remove the space and the ‘(%s)’ just before the ‘\n’ from the command. Also, the flowcell ID is appended to the instrument name (joined by an underscore), and the run number (136 in the example) is dropped.
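
For those who prefer Python to awk, the same conversion could be sketched roughly as below. It mirrors the awk command above and assumes the header layout of the example (seven colon-separated fields, a space, then a second part starting with the read number). The function names `to_old_style`, `convert` and `convert_file` are invented for this sketch.

```python
def to_old_style(header: str, keep_original: bool = True) -> str:
    """Rewrite one Casava 1.8+ '@' header line in the old style."""
    fields = header.split()
    if len(fields) < 2:
        raise ValueError("header has no second part: %r" % header)
    first = fields[0].split(":")
    if len(first) < 7:
        raise ValueError("too few fields in header: %r" % header)
    read = fields[1][0]  # first character of the second part: 1 or 2
    # instrument_flowcell:lane:tile:x:y#0/read (the run number is dropped)
    new = "%s_%s:%s:%s:%s:%s#0/%s" % (
        first[0], first[2], first[3], first[4], first[5], first[6], read)
    if keep_original:
        new += " (%s)" % header
    return new


def convert(lines, keep_original: bool = True):
    """Yield converted lines, assuming strict 4-line fastq records."""
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        if i % 4 == 0:      # '@' header line
            yield to_old_style(line, keep_original)
        elif i % 4 == 2:    # '+' line, left blank as in the awk command
            yield "+"
        else:               # sequence and quality lines are unchanged
            yield line


def convert_file(src: str, dst: str, keep_original: bool = True) -> None:
    """Write an old-style copy of the fastq file src to dst."""
    with open(src) as fin, open(dst, "w") as fout:
        for out in convert(fin, keep_original):
            fout.write(out + "\n")
```

Passing `keep_original=False` leaves out the bracketed copy of the original header. Like the awk command, this assumes each record takes exactly four lines, so fastq files with wrapped sequence lines would not be handled correctly.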

How do you know whether newbler accepted your reads as pairs? During the initial ‘Indexing’ step (parsing of the read files), newbler will report for fastq files (on the screen and in the 454NewblerProgress.txt file):

Indexing reads.fastq (with quality scores)...
-> XXXXXX reads, YYYYYYYYYY bases, XXXXXX marked as matepairs.

This indicates successful parsing as paired-end reads. Note that all reads here were marked as pairs.

If the file was read, but without pairing information, newbler will report:

Indexing reads.fastq (with quality scores)...
-> XXXXXX reads, YYYYYYYYYY bases.

In addition, after assembly the estimated library insert size will be reported in the 454NewblerMetrics.txt file for fastq files with paired reads.
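
If you have many read files, checking the progress report by eye gets tedious. Below is a small Python sketch that pulls the read and matepair counts out of the text of 454NewblerProgress.txt. It assumes the report lines look exactly like the examples shown above (an ‘Indexing’ line followed by a line with the counts) and that file names contain no spaces; the function name `pairing_report` is made up for this sketch.

```python
import re

# Patterns modelled on the example output in this post (an assumption,
# not an official description of the newbler progress format).
INDEXING_RE = re.compile(r"^\s*Indexing\s+(?P<name>\S+)")
COUNTS_RE = re.compile(
    r"(?P<reads>\d+) reads, (?P<bases>\d+) bases"
    r"(?:, (?P<pairs>\d+) marked as matepairs)?"
)


def pairing_report(progress_text: str):
    """Return a list of (file name, reads, reads marked as matepairs).

    The matepair count is 0 when newbler reported no pairing information.
    """
    results = []
    current = None
    for line in progress_text.splitlines():
        m = INDEXING_RE.match(line)
        if m:
            current = m.group("name")
            continue
        m = COUNTS_RE.search(line)
        if m and current is not None:
            pairs = int(m.group("pairs") or 0)
            results.append((current, int(m.group("reads")), pairs))
            current = None
    return results
```

Reading the file with `open("454NewblerProgress.txt").read()` and passing the text to `pairing_report` gives one tuple per indexed file; a file with a matepair count of 0 was read as shotgun reads only.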

Note that I couldn’t do any extensive testing of the awk command, for lack of new-style Illumina fastq files to try it on. So, use the command at your own risk, and if you find problems please let me know through the comments!


5 Responses to “Newbler input III: a quick fix for the new Illumina fastq header”

  1. MagnusAR said

    Newbler 2.6 rather strangely reports “marked as matepairs” for fastq files from NCBI/SRA, even though “non paired-end” is chosen when setting up project.
    Ex. 454NewblerProgress.txt: “506149 reads, 261765010 bases, 500890 marked as matepairs”

    IDs (in TrimStatus.txt) are changed so that “/2” is added: SRR094479.104349/2
    Since all reads have changed to “/2” it probably does not matter at all.
    But one wonders why this happens …

    • flxlex said

      Newbler automatically recognises pairing from SRA files, so you cannot ‘turn it off’ when setting up the project. One thing you could try is to replace the ‘@’ header in the fastq files to something random, e.g. a running number. That should make newbler not accept the reads as pairs.

  2. Yongshan Lang said

    Can Newbler 2.6 recognise the new Illumina fastq quality encoding?

  3. Renato Oliveira said

    4 years later, Newbler 2.9 still can’t handle Illumina paired-end data. But thanks to you, I could adjust the header in the .fastq files and now Newbler recognizes them as mate-pairs. Thank you!
