An assembly of reads, contigs and scaffolds

A blog on all things newbler and beyond

Newbler input II: sequencing reads from other platforms

Posted by lexnederbragt on January 21, 2011

A sanger sequence read electropherogram (source: wikipedia)

Both the runMapping and runAssembly programs are able to take in reads from other platforms, at least Sanger reads and Illumina reads. As long as the reads are in fasta format, with an optional quality file, newbler accepts and uses these reads. When the fasta files contain paired end (mate pair) reads, newbler can actually be made to use the pair information.

In general, it is a good idea to clean your fasta sequences before adding them to newbler: remove vectors, linkers, low quality parts of reads, or entire low quality reads first.
Also note that, while for sff files a symbolic link is generated in the assembly or project folder (still present after the program is finished when the -nrm flag is set), fasta files are not included in this way.

1) Unpaired (single-end) Sanger reads
These can be simply added by telling newbler the location of a one or more fasta files:

runAssembly -o project1 /data/sanger/reads.fasta /data/sff/EYV886410.sff

Or, when you have more than one Sanger reads file:

runAssembly -o project1 /data/sanger/reads1.fasta /data/sanger/reads2.fasta /data/sff/EYV886410.sff

If you have a file with the corresponding quality files, make sure to use the same filename, but change the ending to ‘.qual’ (and put the file in the same folder). Newbler will always check whether there is such a file. So, in the above example, placing your quality file in /data/sanger and calling it reads.qual will do the trick.

2) Paired Sanger reads
The pairing information needs to be put in the fasta header for each sequence in order for newbler to understand which reads belong together. So, there is no need to join the two reads into one and add the 454 specific paired-end linker (as sometimes is suggested in forums), this actually will not work.
A read whose fasta header looks like this:

>plate12_G08_F template=plate12_G08 dir=F library=fosmid1

tells newbler that it is from a paired end library called ‘fosmid1’, in the forward orientation, and that the sequencing template (e.g. clone, or in this case, fosmid) was ‘plate12_G08’. Newbler then will look for a corresponding reverse read with this fasta header:

>plate12_G08_R template=plate12_G08 dir=R library=fosmid1

You can actually add multiple reads with the same header (if you have duplicate sequencing attempts from the same template), but newbler will only pick the ‘best’ one for assembly or mapping, where the alignment length and similarity is determining which read is best.

The ‘library’ name is used by newbler to group reads in the same way as for sff files (a per-library average insert distance is calculated, see this post). You will see the library name appear in the 454NewblerMetrics.txt file.

How you make your fasta headers for paired reads into this format is a bit up to you, as it is very much dependent on the format of the headers of your Sanger reads. It usually requires some scripting, or sed/awk commands. For example, I once had a set of BAC-ends wth fasta headers like this:

>bac-190o01.fb140_b1.SCF length=577 sp3=clipped
>bac-190o01.rb140_b1.SCF length=674 sp3=clipped

and use this command o adjust them for newbler

sed 's/-\(.*\).\([rf]\)b.*/-\1.\2 template=\1 dir=\2 library=BACends/' INFILE.fna>OUTFILE.fna

after which they looked like this:

>bac-190o01.f template=190o01 dir=f library=BACends
>bac-190o01.r template=190o01 dir=r library=BACends

Note: you have to ‘force’ newbler to take in the reads as paired end reads by including the -p flag:
runAssembly -o project1 -p /data/sanger/paired_reads.fasta /data/sff/EYV886410.sff
At the beginning of the assembly, newbler should then mention this:

1 read file successfully added as explicit paired-end file.
paired_reads.fasta (with quality scores)

3) Sanger reads for closing gaps
If you have a genome assembly for which you did some PCRs to close gaps, and sequenced the PCR products using the Sanger technology, you can actually try to use these reads to have newbler close the gaps for you. I must admit that I have not seen this being done with success yet, but in principle it should work. However, as Newbler needs more than one read in an alignment to build a contig, I would recommend adding several non-identical copies of each read to the assembly. The copies should be non-identical because newbler takes only one copy of identical reads. Making such copies can be done by shifting the start and end position of the copies. For example, for a 600 nt read you could create three copies as such:
–    copy 1 from position 1 to 580
–    copy 2 from position 10 to 590
–    copy 2 from position 20 to 600

Make sure to give each copy a unique fasta header…

4) Illumina reads
In principle, Illumina reads can be added by converting the fastq files to fasta and quality files, with Sanger quality values, adjusting the fasta headers as described above, and feeding them to newbler. However, people who have tried this have so far reported newbler crashing when these read were being assembled (e.g. here). My only experience is trying to adda tiny amount of Illumina reads to a much larger 454 read dataset, and that worked well.

Again, converting Illumina fastq files to fasta and (Sanger-style) qual files can be done in several ways, see for example the comments to this post or the SEQanswers forums.

Illumina runs typically come in two files per lane, one for each read direction (forward and reverse). You can also add each of these in a separate file, and newbler will still be pairing the reads up with their mates. As an example for two files from lane 4:

runAssembly -o project1 -p /data/sanger/s_4_1_sequence.fasta -p /data/sanger/s_4_2_sequence.fasta /data/sff/EYV886410.sff

UPDATE

It seems that the commandline I gave above does not work. I have had better success by using the newAssembly/addRun/runProject approach, both for a single Illumina file as well as one file per run half:

newAssembly project1
cd project1
addRun /data/sff/EYV886410.sff
addRun -lib libname -p /data/sanger/s_4_1_sequence.fasta
addRun -lib libname -p /data/sanger/s_4_2_sequence.fasta
runProject

Here, you specifically tell newbler that the files belong to the same library. Newbler seems to ignore the library name given after -lib and takes the one specified in the fasta header instead.

About these ads

19 Responses to “Newbler input II: sequencing reads from other platforms”

  1. flxlex said

    I updated the last part after trying out newbler with some SRA Illumina reads

  2. HL said

    what’s the limit for input read length?
    I want to feed newbler some contigs which is already assembled from illumina data by other program.

    • flxlex said

      The limit is 2000 bp. See also the fire ant genome paper where they performed de novo assembly of the Illumina reads, and then chopped the contigs into 300 bp pieces (with 200 bp overlap) and added these pieces as reads to the 454 reads for assembly using newbler.

      • Doug Senalik said

        Just for the sake of absolute precision, the limit is < 2000, i.e. a maximum of 1999 bp. I just confirmed this in newbler 2.5.3. I'm going to try the fire ant strategy (chopping the contigs) and see how that works.

  3. Steven Sullivan said

    Thanks for the tips on including paired-end Sanger reads. But I notice some confusing things in output from my attempt to do so (which included 2 sets of 454 WGS reads, 1 set of 454 3kb paired end reads, and 1 set of sanger paired end read, as input). I made my Sanger read deflines 454-friendly, e.g.

    >FPCA001J_F template=FPCA001J dir=F library=barnwell
    >FPCA001J_R template=FPCA001J dir=R library=barnwell

    and made sure to tell Newbler that the Sanger file was paired-end reads (which I think worked, since it gave me the “file successfully added as explicit paired-end file” message). NB The sanger reads had no associated .qual file.

    For ease of discussion, I’ll call the two 454 WGS read input files ‘454_WGS1.sff’ and ‘454_WGS2.sff’ , the 454 paired end read input file ‘454_pe.sff’, and the Sanger paired end read input file ‘sanger_pe.fa’

    It’s hard for me to tell from NewblerMetrics whether the Sanger file was interpreted correctly because:
    – under ‘Input Information/runData’ it shows paths to 454_WGS1.sff, 454_WGS2.sff, and sanger_pe.fa, and under ‘/paired end’ it shows the path to 454_pe.sff only. This seems incorrect.

    – under ‘OperationMetrics/readAlignmentResults’ it shows paths to 454_WGS1.sff and 454_WGS2.sff only, and under ‘/pairedReadResults’ it shows paths to 454_pe.sff and sanger_pe.fa. This seems correct.

    – under Consensus Results/consensusResults/pairedReadStatus/library , there is only one library name, ‘454_pe.sff’. This seems incorrect; I would have expected ‘barnwell’ to be there too as a second library.

    Finally, in 454PairStatus , the sanger reads DO show up, but some have a status called ‘NoThreshold’ which is not listed in the 454 literature, or in your discussion of 454PairStatus.txt. E.g.,

    FPCA001J FalsePair
    FPCA014J NoThreshold
    FPCA015J BothUnmapped

    • flxlex said

      First, the fact that your Sanger reads end up in the 454PairStatus.txt file as forming pairs indicates that newbler successfully used them as such. However, I think you have too few pairs for them to make a real impact. That is most likely the reason why the library is not listed under ‘Consensus Results/consensusResults/pairedReadStatus/library’.
      pairedReadData part. the ‘NoThreshold’ listing has to d with that to, as I think it means the reads where found on the same contig in the correct orientation relative to each other, but as the library does not have an estimated pair insert size, a status (TruePair, SameContig or FalsePair) could not be determine and (but you are right, this is undocumented).

      The strange behavior with the sanger file missing under the ‘Input Information/runData’ section is I think a bug. It looks like I have a similar case with a Sanger input file not listed under that heading.

      Assuming you have a valid sat of pairs, one thing you could try is to make multiple copies of the sanger reads, but make sure to not have them be completely identical (newbler would labels them duplicates). If you take off at least one (perhaps a few) bases at the beginning that should work. Newbler needs at least two pairs to make links, and a minimum aligned within the same contig to determine a library insert size.

      Good luck!

  4. Steven Sullivan said

    (btw, my 121 Sanger paired end reads are mostly between 450-900 nt, though a few are shorter; I also noticed now that a few of the reads have no mate pair, only a F or R read alone; I wonder if this confuses Newbler?)

  5. Steven Sullivan said

    Very informative, thanks. Regarding ‘NoThreshold’, what’s actually happening there is the F and R reads are being assigned to different contigs, e.g.

    Template Status Distance Left_Contig Left_Pos Left_Dir Right_Contig Right_Pos Right_Dir
    FPCA014J NoThreshold 4666 contig01429 2506 – contig01670 775 +
    FPCA040J NoThreshold 27627 contig00145 2094 + contig00147 14191 –
    FPCA073J NoThreshold 26184 contig00156 4066 – contig00154 5098 +

    so ‘NoThreshold’ must mean something else.

    When Newbler found a Sanger mate pair mapping to the same contig, it called them ‘FalsePairs’ — there were just three of these in my data, and in two cases they mapped in the wrong orientation, but in one case (the third one below) the orientation is correct, so I’m not sure what the problem is.

    FPCA001J FalsePair 33551 contig02332 45406 – contig02332 11855 +
    FPCA017J FalsePair 33340 contig00290 72423 – contig00290 39083 +
    FPCA328J FalsePair 33919 contig02192 8218 + contig02192 42137 –

    My Sanger reads are all from a ~33kB insert size fosmid library (also evident in the ‘FalsePairs’ distances); is that too large an insert size for Newbler to handle? If not, is there a way to ‘tell’ Newbler that the expected insert size is 33kB?

    my Sanger reads come from a ~33kb insert size fosmid library; can Newbler be ‘told’ this? Is that insert size too large for Newbler?

    Template Status Distance Left_Contig Left_Pos Left_Dir Right_Contig Right_Pos Right_Dir
    FPCA001J FalsePair 33551 contig02332 45406 – contig02332 11855 +
    FPCA017J FalsePair 33340 contig00290 72423 – contig00290 39083 +
    FPCA328J FalsePair 33919 contig02192 8218 + contig02192 42137 -

    • flxlex said

      First on the ones that map to the same contig with 33 kb distances: all of them are in fact in the correct orientation! One – and one +, with the – one farthest away. The fact that you only have three of them explains why you don’t see a library distance, there is not enough information for that. These would have been labeled ‘SameContig’ when there was such a distance (perhaps these should also have been labeled ‘NoThreshold’). The ‘NoThreshold’ ones would have been used to link contigs into scaffolds.

      You could try a tool like bamus of sspace for scaffolding using your limited forsmid ends. Or get (many) more fosmid ends…

      • Steven Sullivan said

        Oops! I see now you’re right about those orientations!

        Anyway, while these fosmids ends are probably too few to make much difference to the assembly, it’s still educational for me to work through the process like this. Along those lines I’m going to try the tactic you suggested, of creating multiple staggered copies of each read of a mate pair, which might allow Newbler to assign a library distance and a better status to the reads. I presume then that Newbler can handle >30kb insert sizes, even though the maximum insert size of 454 paired end libraries is 20 kb?

      • flxlex said

        Newbler 2.5.3 can now handle BAC-size insert libraries, so you should be OK. Still, I am afraid you might not succeed using newbler with the limited amount of pairs you have…

  6. Steven Sullivan said

    sorry for the duplication at the end of that post — bad editing on my part.

  7. Lee said

    How can you tell Newbler how far apart your reads are? Is there a parameter for that?

    • flxlex said

      No, unfortunately you cannot. Newbler determines that itself from aligning the read pairs to the contigs (which inplies that there should be sufficiently long contigs to begin with). But, you can run other tools for scaffolding newbler contigs as well (sspace, bambus, …).

  8. Col said

    Thanks for the library newAssembly project update.

    Newbler 2.6 also now allows input of FastQ files. This is a nice enhancement, though it certainly does take a while with 300000 454 and 30mil *2 Illumina reads.

    • flxlex said

      Correct! But, is it worth the wait? :-)

    • Col said

      Just an update on that.

      Newbler actually crashed – i.e. could not proceed further – after trying to assemble about 770k Illumina PE reads. I was expecting it to manage a few more reads, if not perhaps all of them.

      This is for a bacterial genome assembly project (6MB). I’ll now try the de novo assembly step with Velvet and Abyss first, then giving Illumina contigs to Newbler.

      • flxlex said

        You could try giving newbler the reads in chunks (incremental assembly), but there is no guarantee it will go crash-free. It looks to me like the use of Illumina reads for newbler is basically unexplored territory, with many bugs/crashes to be expected…

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
Follow

Get every new post delivered to your Inbox.

Join 105 other followers

%d bloggers like this: