An assembly of reads, contigs and scaffolds

A blog on all things newbler and beyond

Running newbler: more de novo assembly parameters (and a hidden one)

Posted by lexnederbragt on July 16, 2010

Trimming (reads) by running newbler

There is a long list of options/flags/parameters for a newbler assembly, some of which have been treated in the previous post. In this post I will describe some more parameters. At the end, as a bonus, I will share a parameter that is not mentioned in the current documentation…

-ss -sl -sc -ais -ads
These parameters control read overlap detection (there are two more, -mi and -ml, which I described in the previous post). More on seeds and overlap detection is described in the post explaining how newbler works. I never change these parameters as I assume 454 has done a good job optimizing them. But I would love to hear from people who have tried adjusting these parameters and seen the effect…

-ss sets the seed step, i.e. how many bases further down the read the next seed starts (default: 12)
-sl sets the seed length (default: 16)
-sc sets how many seeds need to match between two reads before they are deemed overlapping (I think) (default: 1)
-ais and -ads set the alignment identity score and the alignment difference score; these are used to sort overlaps when there are multiple ones (defaults: 2 and -3, respectively)
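
To make the seed step and seed length a bit more concrete, here is a small Python sketch. It is only an illustration of the idea and not newbler's actual code: the function names are made up for this example, and the way shared seeds are counted (an exact substring lookup) is a simplifying assumption.

```python
def seed_positions(read_length, seed_length=16, seed_step=12):
    """Start offsets of the seeds along a read of the given length."""
    return list(range(0, read_length - seed_length + 1, seed_step))


def seeds(read, seed_length=16, seed_step=12):
    """Cut a read into seeds of seed_length bases, one every seed_step bases."""
    return [read[i:i + seed_length]
            for i in seed_positions(len(read), seed_length, seed_step)]


def shared_seed_count(read_a, read_b, seed_length=16, seed_step=12):
    """Count the seeds of read_a that occur exactly somewhere in read_b.

    Assumption for illustration only: a real assembler would use an index
    of seeds rather than a substring search.
    """
    return sum(1 for s in seeds(read_a, seed_length, seed_step) if s in read_b)
```

With the defaults, a 40-base read gets seeds starting at offsets 0, 12 and 24, so consecutive seeds overlap by 4 bases. Under this toy model, the -sc value would be the minimum `shared_seed_count` needed before two reads are considered candidates for overlap.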

-e
If (parts of) the genome you are sequencing are covered by many, many reads, say more than 50x coverage, it is possible that small sequencing errors between the reads will force newbler to artificially make two contigs of a region, where there should only be one. Telling newbler in advance about the depth using the -e parameter will adjust for this. An example could be a BAC/cosmid/Fosmid, where, since these are relatively short, there is a good chance you will have many more reads than you actually would need. If you don’t know the depth of the read dataset, just run a normal assembly first and have a look at the 454ContigGraph.txt file, described here.
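
If you want a single number for the depth after that first assembly, a length-weighted average over the contigs is one way to get it. The Python sketch below assumes that the contig lines in 454ContigGraph.txt have four whitespace-separated columns (number, contig name, length, average depth) and that all other lines in the file can be skipped; the function names are made up for this example.

```python
def contig_depths(lines):
    """Yield (name, length, depth) for contig lines, skipping everything else.

    Assumed contig line layout: <number> <contigNNNNN> <length> <depth>
    """
    for line in lines:
        fields = line.split()
        if (len(fields) == 4 and fields[0].isdigit()
                and fields[1].startswith("contig")):
            yield fields[1], int(fields[2]), float(fields[3])


def weighted_mean_depth(lines):
    """Average depth over all contigs, weighted by contig length."""
    total_bases = 0.0
    total_length = 0
    for _name, length, depth in contig_depths(lines):
        total_bases += length * depth
        total_length += length
    return total_bases / total_length if total_length else 0.0
```

Calling `weighted_mean_depth(open("454ContigGraph.txt"))` would then give a rough estimate to round and pass to -e. Weighting by length keeps the many short, low-depth contigs from dominating the average.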

-m
This parameter forces newbler to keep all sequence data in memory instead of on disc. It will make assembly faster, but requires larger amounts of memory. I have never tried this, so I don’t know how much it speeds up, nor how much memory newbler needs in this case.

-qo
For large assemblies, the output generation phase will take a long time (newbler has to go through all the flowgrams twice, and so far, this stage is not yet parallelized). To get a quick idea of what the assembly looks like, you could suppress parts of the output generation with this flag. In particular, newbler will not go through all the flow signal intensities to calculate average values, which are needed to determine consensus base quality. As a result, there will be more errors in the contigs, but at least you will get a feeling for the number and lengths of contigs/scaffolds, N50 etc. If you used the -nrm flag, or started the assembly with runProject, you can actually restart the assembly to get the full output by writing runProject projectname, see also this post.

-nobig, -noace
With -nobig, the following (usually large) files will not be included in the output: ACE/consed files, 454PairAlign.txt, 454AlignmentInfo.tsv. With -noace, the ace file will not be generated.

-ar -at -ad
These parameters control how reads are entered in the ace file. -ar will result in the entire raw read (after basecalling) being added, with -at only the trimmed portion of the read will be added, and -ad resets to the default, which is trimmed.

-tr
And now for a hidden option that is not mentioned in the manual. I got special permission from my contacts at 454 to describe this parameter, but they wanted me to stress that it is not yet fully supported, but will be in the next software release (i.e. use at your own risk). -tr will result in two files, 454TrimmedReads.fna and 454TrimmedReads.qual. These files contain the reads after trimming (by newbler). Newbler describes the trimpoints in the 454TrimStatus.txt file, and uses these to generate these output files. Quite handy if you quickly need access to the reads as newbler used them! Another use of these files is to extract the singletons, by using the read IDs from the reads labeled “Singleton” in the 454ReadStatus.txt file, and a script that pulls these out of the 454TrimmedReads.fna file.
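
A script for pulling out the singletons could look something like the Python sketch below. It rests on two assumptions that you should check against your own files: that 454ReadStatus.txt is tab-separated with the read ID in the first column and the status in the second, and that each fasta header in 454TrimmedReads.fna starts with the read ID. The function names are made up for this example.

```python
def singleton_ids(status_lines):
    """Collect the IDs of all reads whose status column says 'Singleton'.

    Assumed layout: tab-separated, read ID in column 1, status in column 2.
    """
    ids = set()
    for line in status_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[1] == "Singleton":
            ids.add(fields[0])
    return ids


def filter_fasta(fasta_lines, wanted_ids):
    """Yield only the fasta records whose ID (first word of the header) is wanted."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].split()
            keep = bool(header) and header[0] in wanted_ids
        if keep:
            yield line
```

To use it, open both files, pass `singleton_ids(status_file)` as the second argument to `filter_fasta(fasta_file, ...)`, and write the lines it yields to a new fasta file.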

-notrim
With this flag set, newbler will not do any additional trimming (based on quality, or on primers/adaptors/vectors etc. that you might have added using -vt or -vs). In combination with the -tr option, newbler will output untrimmed (instead of trimmed) reads in fasta+qual format.


8 Responses to “Running newbler: more de novo assembly parameters (and a hidden one)”

  1. Gustavo said

    Hi Flxlex Very useful blog, congratulations!! I am
    wondering if you tried to use this parameter: 0 There is no info
    about it even in Newbler manual. I ask this because of problems
    with cluster memory, I tried to assemble >60 sff from a
    plant seq project, and crashed. I did some trials with -m,
    incremental assembly, etc. without success. Suggestions will be
    very welcome! thanks in advance.

    • Marko said

      Plants/RNAseq:
      1. Assemble 1-2 sff files first and extract contigs with high coverage into repeats.fasta.
      2. Rerun the assembly with more sff files, using repeats.fasta as “vector/contaminant” sequence and dialing up the seed length (-sl) to the 30-40 range:
      -vt repeats.fasta -sl 3
      PS: Make sure you have enough memory, the newbler uses ONLY head node on the cluster. Titanium cluster will be no good for this, try system with 128GB or more RAM.
      PPS: Try MIRA or celera assemblers.

  2. Gustavo said

    Hello again, it looks like the parameter did not appear, maybe
    because of the xml format. This is the one:
    <assemblerBatchSize>0</assemblerBatchSize>

    • flxlex said

      I have no clue what this parameter is about, I’m afraid. About your project, do you have enough memory?

      • Gustavo said

        Hi Flxlex
        the parameter is contained in the 454AssemblyProject.xml, between heterozygoteMode and numCPU. I tried runs with 5 and 10 batches, but it did not work, which is why I am asking you for help.
        Regarding memory, we do have 24G per node and because I have lots of sff to assemble, I am doing it in increments. First assembly was 30 sff, next 10, and after that 5 sff per increment. very slow approach, but it is working for me.

      • flxlex said

        We needed a lot less memory after we threw out the reads that consisted entirely of short tandem repeats (STRs, 1-4mers), e.g. ACACACAC. We used TRF (http://tandem.bu.edu/trf/trf.html) to find STRs, allowing for 10 bases of ‘normal’ sequence at the end. We also threw out paired reads where at least one half was STR only.

        Oh, and I assume you use the ‘large’ option?

        Good luck!
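
To illustrate the kind of filter meant here, below is a toy Python version. It is not TRF and not the procedure we actually ran: it only recognises perfect repeats of a 1-4 base unit, whereas TRF tolerates mismatches, and it reads ‘at the end’ as the last 10 bases of the read. Function names and thresholds are made up for this example.

```python
def has_period(seq, k):
    """True if every base equals the base k positions further along."""
    return all(seq[i] == seq[i + k] for i in range(len(seq) - k))


def is_str_only(seq, max_unit=4, slack=10):
    """True if the read, ignoring its last `slack` bases, is one perfect STR.

    Simplification: only exact repeats of a 1..max_unit base unit count.
    """
    core = seq[:len(seq) - slack] if len(seq) > slack else ""
    if len(core) < 2 * max_unit:
        return False  # too short to call it a repeat
    return any(has_period(core, k) for k in range(1, max_unit + 1))


def keep_pair(half_a, half_b):
    """Keep a read pair only if neither half is STR only."""
    return not (is_str_only(half_a) or is_str_only(half_b))
```

For example, a read made up of ACACAC… followed by ten bases of other sequence would be flagged, while a read with a longer stretch of ordinary sequence would be kept.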

  3. Delphine said

    Hi Flxlex,

    I tried the option -e 26 but it seems this parameter didn’t work, because in 454ContigGraph.txt, I obtain something like:
    178 contig00178 242 15.1
    179 contig00179 227 10.8
    180 contig00180 212 4.8
    181 contig00181 185 4.0
    182 contig00182 178 4.5
    183 contig00183 148 1.0
    184 contig00184 145 2.4
    185 contig00185 138 2.4
    186 contig00186 123 1.9
    187 contig00187 125 1.0
    188 contig00188 125 1.0
    189 contig00189 114 9.6
    190 contig00190 112 2.8
    191 contig00191 106 1.0
    192 contig00192 99 2.9
    193 contig00193 82 1.0
    194 contig00194 76 1.0
    195 contig00195 64 1.0

    I don’t understand how I can obtain some contigs with a depth around 1 ?
    Have you an idea?

    Thanks again for your site,

    • flxlex said

      First, the -e option is used to indicate the expected average depth. The depth obtained for the final contigs will (usually) follow a Poisson-like distribution around a certain maximum. The -e option is normally only necessary if the expected depth is over 50x. You cannot ‘force’ newbler to produce only contigs of a certain depth.
      Second, on the question about very low depth contigs, see this post. Hope this helps!
