An assembly of reads, contigs and scaffolds

A blog on all things newbler and beyond

Running newbler: de novo assembly

Posted by lexnederbragt on June 10, 2010

This post is about how to start up newbler for de novo assembly projects. I will describe setting up newbler using the command line. Most of the options I will mention are also available through the GUI version, but I will not describe how to use them here.

For a description of the progress that newbler reports during assembly, please check this post. For a description of the different output files, these are described in a series of previous posts.

1) default newbler on one or more files

runAssembly /data/sff/EYV886410.sff

This is the most simple way of running newbler: just provide it with one sff file. It will generate a folder called along the lines of P_yyyy_mm_dd_hh_min_sec_runAssembly and put all output in there. If you want to have control over the name of this folder, use

runAssembly -o project1 /data/sff/EYV886410.sff

-o describes the name of the folder newbler will provide all output in, in this case ‘project1’

Using more files is most efficiently done by having them all in the same folder and writing

runAssembly -o project1 /data/sff/*.sff

It is also possible to provide several files in different folders:

runAssembly -o project1 /data/sff/EYV886410.sff /data/reads/FQL5QBG02.sff

2) Keeping track of different types (‘libraries’) of reads
When you have reads coming from different libraries, newbler can keep track of which library the sff files belong to. This is most useful in the situation where you have a mix of shotgun reads and paired-end reads, with more than one file per paired end library, and/or paired end libraries with different insert sizes.

During assembly, newbler will calculate the average pair distance for paired-end reads that map to the same contig, and use this average to estimate gap sizes between contigs, see the post on how newbler works.

If you provide all files at once, newbler will consider each input sff file containing paired-end reads separately (i.e. as coming from a separate library), and calculate the average distance for reads from each file. Using the setup described below will result in newbler using files from the same paired-end library together, and calculating the average distance for the reads on a per-library basis. The average distances are also reported in the 454NewblerMetrics.txt file (see this post).

For this to work, we need to set up the project with several commands

newAssembly project2
addRun -lib shotgun project2 /data/sff/EY*
addRun -lib 3kb_PE project2 /data/sff/EZ5TR6A01.sff
addRun -lib 3kb_PE project2 /data/sff/FQL5QBG02.sff
addRun -lib 8kb_PE project2 /data/sff/ GD4EG4306_8kb.sff
runProject project2

The newAssembly command sets up the project, creates a new folder called ‘project2’
addRun is used to add runs, with the -lib option to give a name to the library. Here, we added a bunch of shotgun containing sff files, two files from the same 3kb paired-end library (called ‘3kb_PE’), and one file from an 8kb library (called ‘8kb_PE’). Note the order of the arguments here: addRun – options – projectdirname – sff files.
Note that for shotgun reads, there is not really a need to indicate the library, newbler does not use this information. It might help you to keep track of where files come from, though…
runProject starts the actual assembly

In your terminal, this will look something like this:

bash-3.2$ newAssembly project2
Created assembly project directory project2
bash-3.2$ addRun -lib shotgun project2 /data/sff/EY*
10 read files successfully added.
EYV886409.sff
EYV886410.sff
EYV886411.sff
EYV886412.sff
EYV886413.sff
EYV886414.sff
EYV886415.sff
EYV886416.sff
EYZ1CS001.sff
EYZ1CS002.sff
bash-3.2$ addRun -lib 3kb_PE project2 /data/sff/EZ5TR6A01.sff
1 read file successfully added.
EZ5TR6A01.sff  (paired-end)
bash-3.2$ addRun -lib 3kb_PE project2 /data/sff/FQL5QBG02.sff
1 read file successfully added.
FQL5QBG02.sff  (paired-end)
bash-3.2$ addRun -lib 8kb_PE project2 /data/sff/GD4EG4306_8kb.sff
1 read file successfully added.
GD4EG4306_8kb.sff  (paired-end)
bash-3.2$ runProject project2
<lots of output>

Now, in the 454NewblerMetrics.txt file, you should get library insert size average for the two PE libraries.

NOTE: newbler will check each sff file for the presence of reads with the PE linker sequence. If at 25% of the first 500 reads have this linker, the reads will be treated as coming from a paired-end library. If your file has fewer paired end reads, you can force newbler to treat it as a paired-end file by adding the –p parameter:

addRun -p -lib 3kb_PE project2 /data/cyano_sff/FQL5QBG02.sff

At the end of assemblies set up using newAssembly, addRun and runProject, the folder structure is different than for the first example.

In the output folder (‘project2’) there is

  • an (empty) file called 454Project.xml
  • the folder called sff with the symlinks to the sff files used
  • a folder called assembly with the usual output, and in addition the 454AssemblyProject.xml (this xml file contains description of all input files and parameters). This folder also contains a number of hidden files (some quite large) that newbler used to store information regarding the assembly.

It is, in fact, possible to add reads to a finished assembly, by invoking addRun another time, and then running the project again (runProject), thereby only aligning the added reads, so-called incremental assembly. However, I would recommend always starting from the beginning. If you, on the other hand, want to change aspects of the output of a finished assembly, simply invoke runProject again with a parameter that, for example, generates an extra file (see below for an overview of such parameters). Newbler will then not recomputed the entire assembly, just generate the new output.

What follows are some parameters that are useful for influencing the assembly process.

3) Minimum overlap length (-ml) and minimum overlap identity (-mi)
Although the developers of newbler optimized parameters to get the best assembly in most cases, sometimes it is useful to adjust these two. They control the stringency of the alignment of reads at the beginning of the assembly. Increasing this stringency works well when:

  • the sample has low-frequency contamination DNA. In this case, only the reads that ‘belong together’ assemble together into contigs, without accidentally ‘pulling in’ contaminating reads. For example, for the bacterial assemblies based on contaminated read datasets described in this (my) paper, increasing mi and ml yielded better results for the genomes of interest.
  • the read dataset is low in coverage relative to the genome size. When the total number of bases in the read dataset is less than, say 10 times the genome size (less than 10x coverage), at default settings, newbler sometimes tries to find/force overlaps between reads that are not really there, slowing down the alignment phase. Increasing the stringency will speed up the assembly.

An example set up for adjusting -mi and -ml would be:

runAssembly -o project1 -mi 96 -ml 60 /data/sff/EYV886410.sff /data/reads/ FQL5QBG02.sff

Or, in the case of newAssembly, addRun and runProject, use:

runProject -mi 96 -ml 60 project2

I recommend testing -mi from 90 (the default) to 98 (very stringent) and -ml from 40 (the default) to 100 or more (very stringent).

4) Large datasets and parallelization
For large genomes (say, over 100 Mb) with a good coverage in terms of bases input, using -large will use some shortcuts to speed up (read: ‘make possible to complete’) the assembly. The cost is more contigs and more reads determined to be singletons. Also, using a multi-cpu machine and setting the number of cpus available to newbler with the -cpu parameter will result in a significant speed increase. During alignment, newbler will spread the task over all cpus. For those in the known, this parallelization is threaded, not MPI based, meaning that the cpus need to be on the same core/machine. The default number of cpus newbler will use is one. When more than one cpu is used, rerunning the exact same assembly will result in small differences in the resulting number of contigs.

Examples of use:

runAssembly -o project1 -large -cpu 8 /data/sff/EYV886410.sff

runProject -large -cpu 8 project2

5) Restricting the reads used during assembly
Sometimes, reads have adaptor sequences at the beginning and/or end, e.g. as a results of the sample prep procedure. Alternatively, the dataset could contain reads that should not be included. For example when a BAC is sequenced, there is usually some E. coli host DNA present in the sample, which will result in reads not coming from the BAC. Newbler allows for removal of such sequences by:

  • for adaptors: using -vt and a (multi) fasta vector trimming file with the sequences to be removed; newbler will check for a match at the ends of reads and exclude matching bases for assembly.
  • for contaminating reads: using -vs and a (multi) fasta screening file with the sequences to be checked against (e.g. the E. coli genome); newbler will check entire reads against the screening sequences and exclude matching reads for assembly.

Examples:

runAssembly -o project1 -vt /data/seq/adaptors.fasta /data/sff/EYV886410.sff /data/reads/ FQL5QBG02.sff

runProject -vs /data/seq/e_coli_genome.fasta project2

Alternatively, it is possible to provide either a list of reads that should be used for the assembly, excluding the other reads that might be present in the input, or, to provide a list of reads to exclude from the total set of reads provided. The ‘list’ should contain the read identifiers (14 characters). Use -fi to provide the include list, and -fe for the exclude list.

Finally, it is possible to exclude reads below a certain length by specifying the -minlen parameter, which can be set from 15 to 45 bases:

runAssembly -o project1 -minlen 45 /data/sff/EYV886410.sff

runProject -minlen 45 project2

6) Parameters influencing the output newbler will generate
All of the parameters mentioned here need to be added to the runAssembly or runProject commands. ‘#’ indicates that a number needs to be added after the parameter.

-a # and -l #
These determine the minimum length of contigs in the 454AllContigs.fna and 454LargeContigs.fna files, respectively. Defaults are 100 for -a, and 500 for -l.

-info
For large projects (more than 4 million reads, more than 40 MB genome), the output of the 454AlignmentInfo.tsv file is suppressed. Use -info to generate this file in this case.

-no
Setting this flag suppresses all output. This can be useful to quickly test if your assembly will finish, or in the case of incremental assemblies (see above), saving time and generating all output only at the very end.

-ace, -acedir
These flags control the output of one or more 454Contigs.ace file(s). The ‘ace’ file is a file format for describing assemblies, i.e. all alignments and contig consensus sequences, however, without scaffolding information. For newbler assemblies, it is the only output file containing the actual read alignments. The ace file originated as the output of the phrap assembly program. Other assembly programs sometimes can generate it as well (newbler is one of them) or there are converter scripts between the different formats. Somewhere in this document is a description of the ace file format. The ace file can be used for assembly visualization in programs such as consed (see below), or Hawkeye. While the -ace flag generates one ace file containing all contigs, using -acedir generates a subdirectory with a separate ace file for each contig.

-consed
Consed is an assembly viewing, editing and finishing tool. It takes the ace file as input and allows for visualization of the assembly. Using the -consed flag will generate, besides the ace file, a set of folders and files that (in theory) should allow for editing the newbler assembly in consed.

-rip
In contrast to other assembly programs, newbler splits reads that that are aligned at the end of a contig, and continue in another contig, thus allowing for reads to be placed in more than one contig. The -rip flag ensures that reads are placed in one contig only (I assume where the majority of it’s bases are aligned).

-nrm
As described above, the runAssembly program output has a different folder structure than the runProject-based assembly. The latter has the original folder structure, with additional hidden files. When using runAssembly, setting the -nrm flag will keep the folder structure the same as for assemblies started with runProject. Logically, this flag cannot be used with runProject.

Finally, there are many more parameters to choose from, but I think you will get very far using the ones described here. Maybe I’ll spend another post on the remaining parameters and options…

Advertisements

65 Responses to “Running newbler: de novo assembly”

  1. sabrina said

    Hi!
    Do you have any suggestion on how to use illumina paired end reads (.txt files) in gsassemblerm as a paired end read? It would be a great help.
    Thanks

    • flxlex said

      It is most likely possible, but:
      – I’m not sure how good newbler is in this kind of hybrid assembly, you might want to try other software (mira? celera? phusion? CLCBio? …?)
      – it requires some work, including tweaking of the fasta headers (‘names’) of your sequences.

      I have not done this myself, so here is my guestimate:
      1) split the fastq into fasta and qual files
      2) convert the qual values, if necessary, to phred based quality scores (no negative values) (check out seqanswers.com for tips)
      3) modify the fasta headers of the sequences so that they look like this:

      >originalreadname_1 template=originalreadname dir=F library=somename
      >originalreadname_2 template=originalreadname dir=R library=somename

      _1 and _2 refer to the first and second pass of sequencing. Template should be unique for each pair of reads, newbler will match up reads according to this field. If you have different libraries (perhaps with different insert sizes), use different names for the library field.

      • I came across the need to do this yesterday and decided to try it with a couple of sed commands.

        1) Convert fastq to fasta

        sed '/^@/!d;s//>/;N' asp5_leaf_read1.fq > asp5_leaf_read1.fna
        sed '/^@/!d;s//>/;N' asp5_leaf_read2.fq > asp5_leaf_read2.fna

        2) Reformat the headers of the fasta files

        sed -e 's/>\(.*\)#0\/\(.*\)/\1 template=\1 dir=\2 library=asp5/' -e 's/dir=1/dir=f/' asp5_leaf_read1.fna > asp5_leaf_read1_newbler.fna

        sed -e 's/>\(.*\)#0\/\(.*\)/\1 template=\1 dir=\2 library=asp5/' -e 's/dir=2/dir=r/' asp5_leaf_read2.fna > asp5_leaf_read1_newbler.fna

        3) Produce qual files with read names matching the fasta files


        sed '/^+/!d;s//>/;N' asp5_leaf_read1.fq | sed 's/>\\(.\*\\)#0\/\\(.\*\\)/>\1/' > asp5_leaf_read1_newbler.qual
        sed '/^+/!d;s//>/;N' asp5_leaf_read2.fq | sed 's/>\\(.\*\\)#0\/\\(.\*\\)/>\1/' > asp5_leaf_read2_newbler.qual

        Those quality scores then need rescaling.

        Steps 1 and 2 can be piped as well but I wanted to keep the unmodified fasta files for use somewhere else.

        Hopefully the formatting I’ve used inside the tags has worked but if you get an error I can find a way to send a plain text version. This was all much, much quicker than the steps described in the newbler manual for how to deal with Sanger reads.

      • Ok, so the formatting was completely wrong. Here’s another attempt

        sed ‘/^@/!d;s//>/;N’ asp5_leaf_read1.fq > asp5_leaf_read1.fna
        sed ‘/^@/!d;s//>/;N’ asp5_leaf_read2.fq > asp5_leaf_read2.fna

        sed -e ‘s/>\(.*\)#0\/\(.*\)/\1 template=\1 dir=\2 library=asp5/’ -e ‘s/dir=1/dir=f/’ asp5_leaf_read1.fna > asp5_leaf_read1_newbler.fna

        sed -e ‘s/>\(.*\)#0\/\(.*\)/\1 template=\1 dir=\2 library=asp5/’ -e ‘s/dir=2/dir=r/’ asp5_leaf_read2.fna > asp5_leaf_read1_newbler.fna

        Produce qual files with read names matching the fasta files
        sed ‘/^+/!d;s//>/;N’ asp5_leaf_read1.fq | sed ‘s/>\(.*\)#0\/\(.*\)/>\1/’ > asp5_leaf_read1_newbler.qual
        sed ‘/^+/!d;s//>/;N’ asp5_leaf_read2.fq | sed ‘s/>\(.*\)#0\/\(.*\)/>\1/’ > asp5_leaf_read2_newbler.qual

      • flxlex said

        Those quality scores then need rescaling.

        I remember once using biopython to split fastq into fasta and qual, and simultaneously converting the quality scores. Another time I could not do that, and as I was dealing with solexa quality values from 2007 or later, I converted to phred values based on this formula:
        $Q = 10 * log(1 + 10 ** ($sQ / 10.0)) / log(10);
        Source: http://maq.sourceforge.net/qual.shtml

        Perl code:

        while (){
        if (/>/){ # lines with the fasta header: just print
        print $_;
        next
        }
        # quality value line
        chomp;
        my @values=split;
        print join " ", map ( int(10 * log(1 + 10 ** ($_ / 10.0)) / log(10)), @values);
        print "\n";
        }

      • Will said

        Actually, looking for the @ character as the indicator of the first line, of the four lines of interest, could be problematic as the score of the first value could actually have the “@” character there.

  2. Soni said

    Hi Flxlex,

    Would you have any advice on how to choose the best parameters for my assembly? I have run assemblies by varying the -mi and -ml values but am having difficulty determining which of the parameter sets are resulting in the best assembly.

    Many thanks

    Soni

    • flxlex said

      One way to assess assembly quality is the N50 value, the higher this value the better the assembly in terms of contig/scaffold length. However, see this discussion on LinkedIn (if you have an account there): http://www.linkedin.com/groupItem?view=&gid=1902623&type=member&item=17186621
      If there is a closely related genome available you could try to map the contigs/scaffolds to that genome and see which setting gives the best mapping results.

      Is there a specific reason why you are trying different -mi and -ml settings? If the sample is clean (just the DNA of the species you want to sequence) than the default values usually work best.

      • sabrina said

        Dear Flxlex:

        Regarding the use of illumina paired end read in newbler, I have split the .txt file in fasta and qual file and also have changed the headers to feed newbler. Newbler starts the assembly process and can start to index those files. But unfortunately after a while during indexing step it says that the computation goes out of memory. I have tried this several times with different illumina paired end data sets. I am using ~144 G ram and 16 CPUs. Do you have any suggestion for it?

        Regarding the -ml and -mi parameter change you have mentioned that these parameters can affect the assembly if the sample is not clean (not only the desired genome)? Can you explain more?

        Another thing I wanted to ask that I have tried a transcriptome assembly with the -large option, though its not recommended. It gave the same results with lesser time except that in newbler metrics file it has shown a huge amount of repeat in the readstatus numberrepeat. Can you pls refer me something to understand the large algorithm properly?

        Thanks a lot for summarizing the newbler backgrounds so nicely.

        regards
        ~Sabrina

      • flxlex said

        Regarding the memory problem: it could very well be that the 144 Gb ram is not enough. Newbler stores a lot of information in memory… Have you tried taking a small subset of the Illumina reads?

        Regarding -mi and -ml: if there are reads in the sample from a different genome, and these reads are potentially similar to the sequence of the genome of interest, having higher alignment stringency can prevent these contaminating reads from forming overlaps (and contigs) with the reads from the genome of interest. Instead, they will form contigs with themselves, making for easy removal afterward (based on blast analysis, or read depth).

        Regarding -large: this setting tries to make alignment go faster. One of the ways it does that is by having a lower threshold for calling a read a repeat. A repeat read has x number of seeds present in at least y number of other reads; with -large, x and/or y are set lower (I don’t know the exact numbers here).
        Are you sure you get the same results? If the assembly finishes without -large, please use that one!

      • Ketil said

        However, see this discussion on LinkedIn (if you have an account there): http://www.linkedin.com/groupItem?view=&gid=1902623&type=member&item=17186621

        You not only need an account (which I have, just to forward the constant spam from them /dev/null), but you need to be a member of some group they won’t tell you the name of.

        Could you – or somebody else with access to it – summarize this discussion?

      • flxlex said

        The group is called NGS (Next Generation Sequencing) and I think this links to them: http://www.linkedin.com/groups?mostPopular=&gid=1902623

  3. sabrina said

    Thanks a lot for the reply, flxlex. it was really informative reply.

    Ya I have tried a small subset. say, I have total 8 pairs and i tried with subsets of two. newbler indexes the paired txts , then indexes the sff but after starting of assembly it stops and goes out of memory. Do you think omitting the -m option can be of use ?

    regarding the -large option the both assembly with and without large gave quite same outputs, not exact. But the significant differences are found in numberrepeat (very huge difference), in run metrics seedhitsFound and overlapsfound.I have compared that to get the idea of large. Thanks a lot again for your replies :)

    • flxlex said

      Aha, you used the -m option. Don’t! This option requires substantially more amounts of memory and is only useful for smaller projects to speed up the assembly.

      Good luck then and you’re welcome,

      flxlex

      • sabrina said

        Thanks!! its working i guess :) btw can u suggest a assembler for hybrid assembly with 454 and illumina data? thanks again :)

      • flxlex said

        Besides newbler? SOAPdenovo, Celera, Mira, Velvet, you name it. Check seqanswers.com for discussions and this page for software. Personally, I have limited experience with combining illumina and 454, but I will start looking into this soon!

  4. […] Illumina data.  There are a few blog posts discussing this, for instance  Sed to the rescue and Running Newbler de novo assembly. The challenge is to reformat the sequences from the Illumina FastQ format, to Sanger-style Fasta […]

  5. NewblerNube said

    Hey Newblers,

    I try the command:
    /home/projects/assembly/newbler> runAssembly -o ass1 readfile.fasta

    And I get the error:
    Error in Context:getProjectLock.
    The project lock file /home/projects/assembly/newbler/ass1/assembly/.RunLock is locked.
    Unable to get the project file lock.
    Error: Unable to create project directory: ass1
    Could not use a project lock file.
    Error in runAssembly: Error while creating the assembly project.
    Usage: runAssembly [-o projdir] [-nrm] [-p (sfffile | [regionlist:]analysisDir)]… (sfffile | [regionlist:]analysisDir)…

    I get this error while ass1 directory exists, and the program definitely has permission to create it…

    Does anybody have a clue to what the problem is?

    Cheers,
    T

    • NewblerNube said

      Correction: I get this error while ass1 directory DOES NOT exist, and the program definitely has permission to create it…

    • NewblerNube said

      Okay, found the problem. I am running Newbler on a cluster with 1700 cores, which requires a special filesystem to handle the many I/O requests. I had my files on an ordinary filesystem, which the cluster can only write to in some special cases. So The error was due to inability to write to the project directory.

      Cheers,
      T

  6. Steven Sullivan said

    As I understand the architecture of the PSSC Labs Titanium cluster that 454 sold us, it has (I think) 5 nodes (1 head node + 4 slaves) and (I think ) 20 cores (I don’t have a spec sheet handy). A couple of my genomes are in the range of 150-180 Mb, and are highly repetitive. I use -large on these otherwise they don’t complete. Should I also be trying the -cpu option, or does it parallelize by default?

  7. ganga jeena said

    hi,
    I assembled a 3kb library 454 data and 8kb lib 454data along with wgs 454 data using newbler2.3 (GUI).
    GUI doesnt take any info on insert size and linkers are same for both 3kb lib and 8 kb library.
    How does it calculate the insert size ??
    I see an erroneous calculation..
    Is it that we missed something while prparing 8 kb libray or the software is going awry.
    Please explain.

    pairedReadStatus

    {
    numberWithBothMapped = 1157826;
    numberWithOneUnmapped = 229595;
    numberMultiplyMapped = 715851;
    numberWithBothUnmapped = 11362;

    3Kb library library
    {
    libraryName = “GP5JY2V01.sff”;
    pairDistanceAvg = 2614.3;
    pairDistanceDev = 653.6;
    }

    8Kb library
    library
    {
    libraryName = “GSMQLLJ01.sff”;
    pairDistanceAvg = 2719.4;
    pairDistanceDev = 679.8;
    }

    library
    {
    libraryName = “GSMQLLJ02.sff”;
    pairDistanceAvg = 2702.5;
    pairDistanceDev = 675.6;
    }

    Looking at

    • flxlex said

      Weird….

      It is true that you cannot tell newbler /GUI or command line) what the expected distance is. The distance is calculated based on the pairs that where both laves map to the same contigs (the othr pairs are then used for scaffolding).

      What are the contig metrics? There are two options that make the most sense: Either you do in fact have two 3kb libraries, or there are hardly any contigs of at least 8kb, so that newbler does not manage to map the paired reads at 8kb and therefore maps them to the short contigs instead.

  8. Sophie said

    Hi Flxlex,
    I have a query regarding the -large option. I’m assembling metagenomic sequencing data (DNA) and am wondering whether it is a good idea to use the -large option for assembly? If it is possible to assemble the data without using this option would you recommend that instead? The reason I ask is that the manual recommends using -large for large/complex genomes, and the data I am assembling is a complex mix of many different species of microbes. I noticed that you recommended NOT to use -large for Sabrina and her transcriptome assembly but I wasn’t sure if this was a transcriptome specific piece of advice or not…

    I am using the latest version of Newbler (2.5.3)

    Thanks in advance!

    Sophie

    P.S. Your site is great! Thanks for being so helpful to everyone!

    • flxlex said

      Assembly without -large will usually be better. The -large option is needed when the assembly otherwise would take too long time, it takes some shortcuts. You can try first with -large, but be sure to try without it also!

      For your data set, you might consider increasing the alignment stringency settings (-mi up to 98% and -ml up to 100 bases, for example), so that pieces from different genomes that show some similarity do not get collapsed in a single contig.

      Good luck!

  9. Oksana said

    We have shotgun data from 1 haploid individual and paired libraries from several haploid individuals. We used -mi 98 -ml 100 for shotgun read assembly and -mi 98 -ml 200 for paired.
    Do you think we should use the same -ml value throughout all assembly steps or it is ok to increase stringency?
    Ideally we would like to use our paired data only for scaffolding. Do you have any suggestions?
    Thank you.

    • flxlex said

      For the haploid shotgun assembly, you don’t need to increase the stringency, as that will assemble nicely since it is a single haploid. For the paired read dataset, higher stringency might result in a more fragmented assembly, but this you will have to try.

      Using the paired end reads for scaffolding only can be dine with Bambus (part of AMOS, http://sourceforge.net/apps/mediawiki/amos/index.php?title=Bambus_2.0) or SSPACE (http://www.baseclear.com/sequencing/data-analysis/bioinformatics-tools/sspace/). I do not have experience with either one of these.

      You could try to assemble all your reads with newbler, and play around with stringency (I would recommend even trying settings less stringent than the default). One option to try is the -het flag, for heterozygous genomes. Especially if you were able to try reads just coming from two haploids, this would mimic a heterozygote.

      Good luck!

      • Oksana said

        Thank you for advices.

        We increase the stringency because our genome is estimated to be very repetitive (36% unique, 41% repeats interspearsed with unique sequences, 23% of very repetitive).

        Will increased stringency for paired end assembly affect the contigs assembled with shotgun data (with less stringent parameters)?

      • flxlex said

        With high stringency, you might see repeats appearing in multiple, very similar, contigs. If that is what you want go ahead. Alternatively, with the default stringency, repeats might collapse into a single contig.

        Your question I do not completely understand: are you referring to an assembly of shotgun and pared reads together? Incremental assembly?

      • Oksana said

        We do an incremental assembly:
        step 1: shotgun (-mi 98 -ml 100)
        step 2: 8kb paired (-mi 98 -ml 200)
        step 3: 20kb paired (-mi 98 -ml 200)

      • flxlex said

        Aha. I don’t actually know if the higher stringency in the second rounds will affect the alignment of the shotgun reads from the first round. It will influence the alignment of the newly added reads. So, perhaps this is a clever strategy!

  10. Reetu said

    Hi Flxlex,

    Thanks for providing so much of information. I am getting some problem while assembly of same data using newbler 2.0 and newbler 2.5. Using newbler 2.0 someone in my lab has done the assembly. Now, I have same data+some more new reads and I am trying to assemble all with newbler 2.5 (I dnt have access to older version). Surprisingly, I am getting poor assembly than the previous one (though I added some more reads also). Even, if I use only the older data then also assembly is poor than the previous one. Please have a look at the result statistics below. Please help !!!!

    Many thanks,
    Reetu

    using newbler 2.0:
    runMetrics
    {
    totalNumberOfReads = 69525;
    totalNumberOfBases = 13326516;

    numberSearches = 28924;
    seedHitsFound = 32185984, 1112.78;
    overlapsFound = 7060022, 244.09, 21.94%;
    overlapsReported = 6890790, 238.24, 21.41%;
    overlapsUsed = 72959, 2.52, 1.06%;
    }

    readAlignmentResults
    {
    }

    pairedReadResults
    {
    file
    {
    path = “—“;

    numAlignedReads = 46406, 66.75%;
    numAlignedBases = 9176978, 68.86%;
    inferredReadError = 1.65%, 151468;

    numberWithBothMapped = 10586;
    numWithOneUnmapped = 4645;
    numWithMultiplyMapped = 819;
    numWithBothUnmapped = 4628;
    }

    }
    scaffoldMetrics
    {
    numberOfScaffolds = 387;
    numberOfBases = 415181;

    avgScaffoldSize = 1072;
    N50ScaffoldSize = 1153;
    largestScaffoldSize = 6836;
    }

    largeContigMetrics
    {
    numberOfContigs = 430;
    numberOfBases = 404882;

    avgContigSize = 941;
    N50ContigSize = 978;
    largestContigSize = 4013;

    Q40PlusBases = 358229, 88.48%;
    Q39MinusBases = 46653, 11.52%;
    }

    using newbler 2.5:
    runMetrics
    {
    totalNumberOfReads = 68572;
    totalNumberOfBases = 13325544;

    numberSearches = 28784;
    seedHitsFound = 5300738, 184.16;
    overlapsFound = 1343330, 46.67, 25.34%;
    overlapsReported = 1250160, 43.43, 93.06%;
    overlapsUsed = 118047, 4.10, 9.44%;
    }

    readAlignmentResults
    {
    }

    pairedReadResults
    {
    file
    {
    path = “—–“;

    numAlignedReads = 42923, 62.60%;
    numAlignedBases = 8747891, 65.65%;
    inferredReadError = 1.79%, 156223;

    numberWithBothMapped = 9170;
    numWithOneUnmapped = 3231;
    numWithMultiplyMapped = 1997;
    numWithBothUnmapped = 5410;
    }

    }
    scaffoldMetrics
    {
    numberOfScaffolds = 13;
    numberOfBases = 35296;

    avgScaffoldSize = 2715;
    N50ScaffoldSize = 2916, 6;
    largestScaffoldSize = 3800;

    }

    largeContigMetrics
    {
    numberOfContigs = 343;
    numberOfBases = 298675;

    avgContigSize = 870;
    N50ContigSize = 890;
    largestContigSize = 3800;

    Q40PlusBases = 272351, 91.19%;
    Q39MinusBases = 26324, 8.81%;
    }

    • flxlex said

      Unless you have switched the data, the 2.5 assembly looks much better: 13 scaffolds versus 387!

      • Reetu said

        thanks a lot lex for your prompt reply. dataset is same. I have assembled reads of total 4 different genotypes. Samething I have done for other 3 genotypes i.e adding new reads and created new assemblies and that looks better in terms of no of scaffolds, length of scf and also number of genes covered by assemblies. I m getting problem with assembly of this genotype only. This was a poor assembly as compared to other assemblies of other genotypes bt its even worse when I am adding new reads to it. Its fine that number of scaffolds has been decreased from 387 to 13 bt have a look at the number of bases in scaffolds (earlier it was 415181 now its only 35296). Also the old assembly was covering 22 genes and after adding new reads its covering only 12 genes. Thats weird. Can you plz suggest somthing on this.

        Thanks,
        Reetu

      • flxlex said

        Yes, this is very weird indeed. I would look at the other read status metrics. Your paired-end read stats went down a bit between 2.0 and 2.5. Assuming the input data and settings are exactly the same, I must admit I have no clue what could be going on here…

  11. Reetu said

    Hi Lex,

    I got the problem. The problem is in calculating the pair-end jump library distance. Older verion is calculating this dist 750 which is not correct. Newer version is not calculating this distance coz of small size of contigs. Do you know the way to specify this distance?

    Thanks,
    Reetu

    • flxlex said

      Unfortunately, there is no way of telling newbler the library insert size, it can only get to that by using the distances of the pairs where both halves map to the same contig. You will have to find a way to increase contig size, then!

  12. Alexis said

    Hi Flexlex,

    Is there any way to specify the linker sequence for mate pair reads while running Newbler using sff files? I have some mate pair reads (2x25bp) that are too short for Newbler to use when submitting the fasta files, so I want to submit sff files that contains left-read + linker + right_read and hope newbler will consider those reads long enough and split them to mate pair reads when doing the calculation. But I can’t find an option to specify the linker sequence for newbler. The data I have uses linker sequences different from the 454 default linker sequence. Also when running the mate pair short reads using fasta files, I used “ml=20, minlen=20” but still got no assembly out of the reads. Any suggestions? Thanks!

    Alexis

    • flxlex said

      Unfortunately, what you want is not possible. Only the native 454 linker(s) are supported for sff files. What you could do is take the approach described in this post. Explicitly giving the reads as paired end (the ‘-p’ flag in the addRun command) sets the minimum readlength from 50 to 20 bases, allowing you short reads to be assembled. At the very least, your reads can then be used for linking contigs into scaffolds. Tweaking the ml parameter (as you wrote) could perhaps allow them to be aligned as well. All of this assuming you have longer reads available as well…

  13. eyla said

    hi Flxlex..

    do you know the standard adapter sequence used in 454 flx sequencing?

  14. Elizabeth Bent said

    Hi, I am planning to use a Nextera kit (generates fragments about 250-500 bp, random sizes) on amplicons that range in size from 300 bp to 900bp. I am wondering if I can use Newbler to assemble those amplicons (de novo assembly). Will the conserved regions in the amplicons be a problem? There is an alternate tool for amplicon assembly called EMIRGE-amplicon, and it may work better, but I wondered if it might be possible to assemble sequences (like plant rbcL sequences, where you need the whole amplicon to ID the plant) from fragments from a mixture of amplicons, without too many mis-assemblies. Can I modify the assembly parameters to make them very stringent, say like 70 bp overlapping with very few mismatches? Thanks for your help -Liz

    • flxlex said

      Yes, conserved regions between the amplicons will be a problem. Are you sure your strategy will give you full length correctly assembled amplicon sequences, and not a consensus sequence, or many small fragments? Just from the abstract, EMIRGE may be a better choice.

  15. Chris said

    Hi Flxlex,

    do you know whether there exists any option that prevents that partials occur? So that the Reads are aligned in their full length.
    It would be apparently very helpful for my project if such a function exist…

    Thanks!
    Chris

    • flxlex said

      The first thing that comes to mind is the -rip option. That at least ensures that a read is placed in one unique contig only. However, I am unsure whether that prevents partials. Come to think of it: perhaps this could work: the -ml option (minimum alignment length) allows for a percentage, i.e. an alignment is only used when it involves, say at minimum 80% of the read length (by default, it is 40 bp). You could set it to 100%:
      -ml 100%
      But, it may prevent many alignments from forming…

      • Chris said

        Thanks a lot for the prompt reply!
        I already tried -rip. And the result seemed a bit inconsistent. It looks like it had prevented the ripping just from a part of the reads. At some positions the reads where not assembled in full length.
        Do you have any idea how this can happen?
        I think the -ml option might not achieve what I would expect here… Because I would like to retain the reads to get overlaps which I expect/need.

      • flxlex said

        Read were perhaps trimmed? I have never used -rip, so I can’t be sure.

  16. AKrithika said

    Hi Flxlex,

    Thanks for the excellent blog! I have a question. Not sure if anyone has encountered this previously.
    Does newbler produce different results for the same input each time its run? I used Newbler2.6 with the default parameters .

    Thanks,
    Krithika

    • flxlex said

      The latest newbler versions – I think from 2.5 or 2.6 – shpuld give the exact same result when run multiple times on the same data if you use a single cpu. For runs with multiple cpus, the results may vary a bit in terms of exact contig sequences and N50 lengths etc.

  17. Genny said

    Dear Flxlex,
    Thank you for very detailed guide to various Newbler issues. I am trying to run de novo assembly and ran into an error. Hope you can give some advice.

    Data: Illumina reads in fastq files, 1 PE library (read lengths 101bp, insert size 550bp), 4 MP libraries (read lengths 36-151bp, insert size 2kb, 4kb, 7kb, 10kb)

    Newbler settings: runProject -large -e 50 -minlen 35 -notrim -cpu 8 -mi 40 -ml 90 -rip -scaffold -consed -a 101 -l 5000 -s 10000 ../newbler
    Wallclock time 11:00:56:03. Server output no error but i got this from 454NewblerProgress.txt

    ——–
    …. (Indexing all of my fastq files)
    Indexing a_r2.fastq (with quality scores)…
    -> 2315243 reads, 247708610 bases, 2001176 marked as matepairs.
    Indexing a_r1.fastq (with quality scores)…
    -> 2315243 reads, 262063751 bases, 2127972 marked as matepairs.
    Setting up overlap detection…
    -> 839346085 of 839346085, 839346085 reads to align
    Starting seed building…
    -> 839346085 of 839346085
    Building a tree for 6482759320 seeds…
    Computing alignments…
    -> 839287000 of 839346085
    107720131 sequences marked as repeats, 75581952 in alignments, removing from 1543849 chords…
    -> 839346085 of 839346085
    Checkpointing…
    Detangling alignments…
    -> Level 4, Phase 9, Round 2…
    Checkpointing…
    Building contigs/scaffolds…
    -> 7647 scaffolds, 127853 scaffold contigs
    Computing signals…
    -> 495000000 of 842804286…Reading flowgrams…
    Error: Unable to read Fastq entry.

    ———

    Thanks a lot.

    G.

    • Weird. Are you sure no files were moved during the assembly, or contact to a disk lost? I would just try again…

      • Genny said

        Thank you for your quick reply. My files are in server with Lustre file system and weren’t moved, as far as I know. I am trying again by PE libraries first because it took so long (11 days) to get to that error. And i got this error:
        …..
        Indexing r1.fastq (with quality scores)…
        -> 183136036 reads, 18273680214 bases, 182193867 marked as matepairs.
        Setting up overlap detection…
        -> 770120462 of 770120462, 770120462 reads to align
        Starting seed building…
        -> 770120462 of 770120462
        Building a tree for 5851843518 seeds…
        Computing alignments…

        Call Stack:
        7: fatal(char const*, …)
        6: AlignUpdate::asmFillSeqCHAR(int, int, int, int, std::vector<CHAR, std::allocator >&, bool, bool)
        5: AlignUpdate::updateChords(Sequence*, Sequence*, DiffAlign*, bool, bool, bool, DiffAlign*, Timer*)
        4: AlignPhase::asmJobUpdate(AlignAsmJob*, bool, Logger*)
        3: AlignAsmUpdateThread::run()
        2: Thread::_dispatch(void*)
        1: /lib64/libpthread.so.0() [0x38738079d1]

        Error: Initialization failure in ThreadMutex constructor. Please report this error to your customer support representative.

        So i am trying this run again. You have any idea? Thanks a bunch

        G.

      • No, sorry, this is really a crash…

  18. ZA said

    Thank you so much for the detailed information in regards to Newbler Assembler. I have started using Newbler Assembler through its GUI on our Bio Linux 8 Workstation on Intel Xeon E5-2640v2 128 GB of RAM.

    Unfortunately, I am running into an error with Newbler. I am working with interleaved Illumina Pair End reads (275 bp). I have enabled scaffolding and low depth overlap.

    As Newbler is computing the alignment, I end with the following error.

    Error: The _dispatch computation ran out of memory and is unable to continue.

    I was wondering what should I try to do to rectify this error. Many thanks in advance for any help and suggestions.

    • This is one of the clearest error messages :-) You need a server with more memory (available RAM) to get the assembly finished, it is as simple as that…

      • ZA said

        Thank you for the quick reply. That is weird, because I was tracking the RAM that was being used by the workstation. The used RAM did not exceed more than 20 to 25 GB, before this error was generated. Is there any methods to increase the amount of RAM to use for Newbler?

      • No. But you could track memory usage in more detail, maybe there was a sudden spike just before it crashed. I use this command for that: `top -b -d 300 -u username >top_output.txt`. Setting ‘d’ (time in seconds between running top) to a lower value will give more memory usage more frequently.

  19. merodev said

    Hi flxlex,
    Thanks for the wonderful description of Newbler. I have a question regarding quality parameters on Newbler. I tried Newbler 2.8 with default settings for assembly on 454 reads. The assembly metrics file shows that ~10% reads were not used. I was wondering what quality criteria does Newbler use for filtering or trimming. Any information regarding this would be very helpful.
    Thank you.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
%d bloggers like this: