An assembly of reads, contigs and scaffolds

A blog on all things newbler and beyond

Archive for February, 2010

How newbler works

Posted by lexnederbragt on February 9, 2010

I thought I would start by briefly explaining how newbler works. I’ll do this by following the output newbler generates during the assembly process. This information is displayed during assembly, and can also be found in the 454NewblerProgress.txt file. It is worth having a look at this file in any case, as it sometimes contains warnings (see below).

This example assembly is based on a read dataset consisting of both shotgun reads and paired end reads (for more on 454 paired end reads, have a look here).

The first thing you’ll see is a message stating that the assembly computation started, and which version of newbler you used.

Then, you’ll see messages for each input file saying Indexing XXXXXXX.sff…, and a counter. During indexing, newbler scans the input file, performs some checks and trims the reads (sometimes more than the base-calling software already did). One of the checks is for possible 3′ and 5′ primers: if a certain percentage of reads contains the same sequence on either the 3′ or 5′ end, this is mentioned. I’ve had some surprises here, such as finding out that reads I got from another group contained an adaptor sequence, which caused problems during the assembly. More on primer removal later…

If an input sff file contains paired end reads, this will be mentioned, as well as the number of reads that contained the paired end linker sequence, for example:

224024 reads, 58599257 bases, 112080 paired reads.
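If you want to collect these numbers from the progress output, a small script can do it. The following is only a sketch: it assumes the summary line looks exactly like the example above, and the function name `parse_read_summary` is my own, not something newbler provides.

```python
import re

# Pattern for a summary line such as:
#   224024 reads, 58599257 bases, 112080 paired reads.
# Assumption: the line has exactly this layout, as in the example above.
SUMMARY_PATTERN = re.compile(r"(\d+) reads, (\d+) bases, (\d+) paired reads\.")


def parse_read_summary(line):
    """Return the three counts from a summary line, or None if it does not match."""
    match = SUMMARY_PATTERN.search(line)
    if match is None:
        return None
    reads, bases, paired_reads = (int(value) for value in match.groups())
    return {"reads": reads, "bases": bases, "paired_reads": paired_reads}


if __name__ == "__main__":
    print(parse_read_summary("224024 reads, 58599257 bases, 112080 paired reads."))
```

Lines that do not match the pattern simply return `None`, so the function can be run over every line of the file.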

Next:

Setting up long overlap detection…
XXXXX reads to align
Building a tree for YYYYYY seeds…
Computing long overlap alignments…

The first phase of assembly is finding overlaps between reads. Newbler splits this phase into one pass for long reads (this goes very fast) and one for shorter reads (which can take quite some time). As aligning all reads against each other would take too long, newbler (like many other programs) makes seeds: 16-mers of each read, where each seed starts 12 bases after the start of the previous one. The seed length and step size can be changed if you want (I’ve never tried this, though). When two different reads have identical seeds, the program tries to extend the overlap between the reads until the minimum overlap (default 40 bp) with the minimum alignment percentage (default 90%) has been reached. These settings can also be changed and influence the alignment stringency; I will come back to this in a later post.
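To make the seeding idea concrete, here is a minimal sketch in Python. It is not newbler's actual implementation: the function names and the toy reads are mine, and it only follows the description above (16-mer seeds, one every 12 bases, reads sharing an identical seed become candidates for overlap extension). The extension step itself, with its minimum overlap and alignment percentage, is left out.

```python
from collections import defaultdict


def make_seeds(read, seed_length=16, step=12):
    """Yield (offset, seed) pairs: one seed_length-mer every `step` bases."""
    for start in range(0, len(read) - seed_length + 1, step):
        yield start, read[start:start + seed_length]


def candidate_pairs(reads, seed_length=16, step=12):
    """Return the set of read-name pairs that share at least one identical seed.

    `reads` maps a read name to its sequence. Only these candidate pairs
    would need a full overlap alignment afterwards.
    """
    seed_index = defaultdict(set)
    for name, sequence in reads.items():
        for _, seed in make_seeds(sequence, seed_length, step):
            seed_index[seed].add(name)

    pairs = set()
    for names in seed_index.values():
        ordered = sorted(names)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                pairs.add((first, second))
    return pairs


if __name__ == "__main__":
    # Toy data (hypothetical): two reads cut from the same 60-base sequence
    # and one unrelated read.
    genome = "ACGTTGCAAGCTTAGGCTAACGTTAGCCATGGATCCGATTACAGGCTTAACCGGTTAGCA"
    reads = {
        "read_a": genome[0:40],
        "read_b": genome[12:52],
        "read_c": "T" * 40,
    }
    print(candidate_pairs(reads))
```

The point of the index is that each read is only compared with reads that share a seed with it, rather than with every other read in the dataset.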

Posted in How it works

Introduction

Posted by lexnederbragt on February 9, 2010

With this blog I intend to share some of my experiences with the newbler assembly program from 454, also known as gsAssembler, or gsMapper. It is the software suite developed by 454 Life Sciences to be used with the sequence data coming from the GS FLX sequencing instrument.

Reads: those are the fragments, small bits and pieces that I write about in the blog.

Contigs: together, these fragments make larger pieces of information on the subject.

Scaffolds: by building bridges between the subjects I hope to reach a complete overview of the newbler program. Granted, there will be gaps, as there are also gaps in my knowledge, but in the end, the information should be a useful guide to newbler.

I learned a lot about newbler when working with data from bacterial genome assemblies, and the cod genome project, for which I am one of the bioinformaticists. I am also connected to the 454 node of the Norwegian High-Throughput Sequencing Centre (NSC), where many of our users are relying on newbler for their projects.

The first post will describe step by step how newbler generates contigs and scaffolds from reads. So, let’s start the assembly!

Posted in Miscellaneous