Newbler input II: sequencing reads from other platforms
Posted by lexnederbragt on January 21, 2011
Both the runMapping and runAssembly programs are able to take in reads from other platforms, at least Sanger reads and Illumina reads. As long as the reads are in fasta format, with an optional quality file, newbler accepts and uses these reads. When the fasta files contain paired end (mate pair) reads, newbler can actually be made to use the pair information.
In general, it is a good idea to clean your fasta sequences before adding them to newbler: remove vectors, linkers, low quality parts of reads, or entire low quality reads first.
Also note that, while for sff files a symbolic link is generated in the assembly or project folder (still present after the program is finished when the -nrm flag is set), fasta files are not included in this way.
1) Unpaired (single-end) Sanger reads
These can be added simply by telling newbler the location of one or more fasta files:
runAssembly -o project1 /data/sanger/reads.fasta /data/sff/EYV886410.sff
Or, when you have more than one Sanger reads file:
runAssembly -o project1 /data/sanger/reads1.fasta /data/sanger/reads2.fasta /data/sff/EYV886410.sff
If you have a file with the corresponding quality values, make sure to give it the same filename as the fasta file, but change the ending to ‘.qual’ (and put the file in the same folder). Newbler will always check whether there is such a file. So, in the first example above, placing your quality file in /data/sanger and calling it reads.qual will do the trick.
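Since newbler picks up the quality file silently, it can be worth checking the naming before starting a run. The following is a minimal sketch in Python that only encodes the rule described above (same folder, same name, ending changed to ‘.qual’); the function names are my own and not part of newbler.

```python
from pathlib import Path


def expected_qual_path(fasta_path):
    """Return where the quality file should sit according to the rule above:
    same folder, same base name, ending replaced by '.qual'."""
    return Path(fasta_path).with_suffix(".qual")


def missing_qual_files(fasta_paths):
    """List the fasta files that have no matching .qual file next to them."""
    return [str(p) for p in fasta_paths if not expected_qual_path(p).is_file()]
```

Calling missing_qual_files(["/data/sanger/reads1.fasta", "/data/sanger/reads2.fasta"]) returns the fasta files that would go into the assembly without quality values.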
2) Paired Sanger reads
The pairing information needs to be put in the fasta header of each sequence so that newbler understands which reads belong together. There is no need to join the two reads into one and add the 454-specific paired-end linker (as is sometimes suggested in forums); this will actually not work.
A read whose fasta header looks like this:
>plate12_G08_F template=plate12_G08 dir=F library=fosmid1
tells newbler that it is from a paired end library called ‘fosmid1’, in the forward orientation, and that the sequencing template (e.g. clone, or in this case, fosmid) was ‘plate12_G08’. Newbler then will look for a corresponding reverse read with this fasta header:
>plate12_G08_R template=plate12_G08 dir=R library=fosmid1
You can actually add multiple reads with the same header (if you have duplicate sequencing attempts from the same template), but newbler will only pick the ‘best’ one for assembly or mapping, where the alignment length and similarity determine which read is best.
The ‘library’ name is used by newbler to group reads in the same way as for sff files (a per-library average insert distance is calculated, see this post). You will see the library name appear in the 454NewblerMetrics.txt file.
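Before handing a paired fasta file to newbler, it can help to check that every template really has both a forward and a reverse read. Below is a minimal sketch in Python, assuming headers in the ‘template= dir= library=’ layout shown above; the function names are hypothetical, and treating ‘dir’ as case-insensitive is my own choice, not documented newbler behaviour.

```python
from collections import defaultdict


def parse_header(header):
    """Split '>name key=value ...' into (name, {key: value})."""
    fields = header.lstrip(">").split()
    name = fields[0]
    info = dict(f.split("=", 1) for f in fields[1:] if "=" in f)
    return name, info


def unpaired_templates(headers):
    """Return sorted (library, template) tuples lacking a forward or reverse read."""
    seen = defaultdict(set)
    for header in headers:
        _, info = parse_header(header)
        if "template" in info and "dir" in info:
            key = (info.get("library", ""), info["template"])
            seen[key].add(info["dir"].upper())
    return sorted(key for key, dirs in seen.items() if dirs != {"F", "R"})
```

Feeding it all lines of a fasta file that start with ‘>’ gives the templates for which only one direction is present.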
How you get the fasta headers of your paired reads into this format is up to you, as it depends very much on the original header format of your Sanger reads. It usually requires some scripting, or sed/awk commands. For example, I once had a set of BAC-ends with fasta headers like this:
>bac-190o01.fb140_b1.SCF length=577 sp3=clipped
>bac-190o01.rb140_b1.SCF length=674 sp3=clipped
and used this command to adjust them for newbler:
sed 's/-\(.*\)\.\([rf]\)b.*/-\1.\2 template=\1 dir=\2 library=BACends/' INFILE.fna > OUTFILE.fna
after which they looked like this:
>bac-190o01.f template=190o01 dir=f library=BACends
>bac-190o01.r template=190o01 dir=r library=BACends
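For those who prefer a script over sed, the same header rewrite can be sketched in Python. This assumes headers shaped exactly like the two BAC-end examples above (a prefix, a dash, the template, a dot, then ‘f’ or ‘r’ followed by ‘b’); the function names and the default library name are only for illustration.

```python
import re

# '>prefix-template.<f|r>b...' as in the BAC-end example headers above.
BAC_HEADER = re.compile(r"^>(?P<prefix>[^-\s]+)-(?P<template>[^.\s]+)\.(?P<dir>[rf])b")


def newbler_header(line, library="BACends"):
    """Rewrite one BAC-end header line; return unrecognized lines unchanged."""
    m = BAC_HEADER.match(line)
    if m is None:
        return line
    return ">{prefix}-{template}.{dir} template={template} dir={dir} library={lib}".format(
        lib=library, **m.groupdict()
    )


def convert(in_path, out_path, library="BACends"):
    """Copy a fasta file, rewriting only the header lines."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            line = line.rstrip("\n")
            if line.startswith(">"):
                line = newbler_header(line, library)
            dst.write(line + "\n")
```

Unlike the sed command, this version leaves any header it does not recognize untouched, which makes it easier to spot reads that did not follow the expected naming.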
Note: you have to ‘force’ newbler to take in the reads as paired end reads by including the -p flag:
runAssembly -o project1 -p /data/sanger/paired_reads.fasta /data/sff/EYV886410.sff
At the beginning of the assembly, newbler should then mention this:
1 read file successfully added as explicit paired-end file.
paired_reads.fasta (with quality scores)
3) Sanger reads for closing gaps
If you have a genome assembly for which you did some PCRs to close gaps, and sequenced the PCR products using Sanger technology, you can try to use these reads to have newbler close the gaps for you. I must admit that I have not seen this being done with success yet, but in principle it should work. However, as newbler needs more than one read in an alignment to build a contig, I would recommend adding several non-identical copies of each read to the assembly. The copies should be non-identical because newbler takes only one copy of identical reads. Making such copies can be done by shifting the start and end position of the copies. For example, for a 600 nt read you could create three copies like this:
– copy 1 from position 1 to 580
– copy 2 from position 10 to 590
– copy 3 from position 20 to 600
Make sure to give each copy a unique fasta header…
4) Illumina reads
In principle, Illumina reads can be added by converting the fastq files to fasta and quality files, with Sanger quality values, adjusting the fasta headers as described above, and feeding them to newbler. However, people who have tried this have so far reported newbler crashing when these reads were being assembled (e.g. here). My only experience is with adding a tiny amount of Illumina reads to a much larger 454 read dataset, and that worked well.
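The conversion itself can be sketched in a few lines of Python. Several things here are assumptions on my part: the fastq file has exactly four lines per read, read names end in ‘/1’ or ‘/2’ to mark the direction, and you know the quality offset of your files (Illumina pipeline versions 1.3 to 1.7 wrote Phred+64, version 1.8 and later Phred+33). The function name is hypothetical, and I have not verified that newbler accepts every character that can occur in Illumina read names.

```python
def fastq_to_fasta_qual(fastq_path, fasta_path, qual_path, library, offset):
    """Write a fasta file and a quality file with integer Phred scores,
    adding template/dir/library to the header when the read name ends in /1 or /2."""
    with open(fastq_path) as fq, open(fasta_path, "w") as fa, open(qual_path, "w") as ql:
        while True:
            name = fq.readline().rstrip()
            if not name:
                break  # end of file
            seq = fq.readline().rstrip()
            fq.readline()  # the '+' separator line is not needed
            quals = fq.readline().rstrip()

            read_id = name[1:].split()[0]
            template, _, mate = read_id.rpartition("/")
            header = ">" + read_id
            if template and mate in ("1", "2"):
                direction = "F" if mate == "1" else "R"
                header += " template=%s dir=%s library=%s" % (template, direction, library)

            # Convert the quality characters to plain Phred scores.
            scores = [str(ord(c) - offset) for c in quals]
            fa.write("%s\n%s\n" % (header, seq))
            ql.write("%s\n%s\n" % (header, " ".join(scores)))
```

To let newbler find the quality values, give the quality file the same name as the fasta file with the ending changed to ‘.qual’, as described for Sanger reads.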
Illumina runs typically come in two files per lane, one for each read direction (forward and reverse). You can also add each of these as a separate file, and newbler will still pair the reads up with their mates. As an example for two files from lane 4:
runAssembly -o project1 -p /data/sanger/s_4_1_sequence.fasta -p /data/sanger/s_4_2_sequence.fasta /data/sff/EYV886410.sff
It seems that the command line I gave above does not work. I have had better success with the newAssembly/addRun/runProject approach, both for a single Illumina file and for one file per run half:
addRun -lib libname -p /data/sanger/s_4_1_sequence.fasta
addRun -lib libname -p /data/sanger/s_4_2_sequence.fasta
Here, you specifically tell newbler that the files belong to the same library. Newbler seems to ignore the library name given after -lib and takes the one specified in the fasta header instead.