1 velvet: algorithms for de novo short assembly using de bruijn graphs march 12, 2008 daniel r....

1

Velvet: Algorithms for De Novo Short Read Assembly Using De Bruijn Graphs

March 12, 2008

Daniel R. Zerbino and Ewan Birney

Presenter: Seunghak Lee

2

What is a de Bruijn Graph?

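A “de Bruijn graph” is a directed graph

An edge represents an overlap between sequences of symbols

Symbols: S = {s_1, s_2, …, s_m}; the vertices V are the length-n sequences over S

E = {((v_1, v_2, …, v_n), (w_1, w_2, …, w_n)) : v_2 = w_1, v_3 = w_2, …, v_n = w_{n-1}}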
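The edge condition can be checked directly on a tiny alphabet. The following is a minimal sketch, assuming vertices are tuples of n symbols; the function name `de_bruijn_graph` is illustrative and not taken from the slides.

```python
from itertools import product


def de_bruijn_graph(symbols, n):
    """Return (vertices, edges) of the de Bruijn graph of length-n sequences."""
    vertices = list(product(symbols, repeat=n))
    edges = [
        (v, w)
        for v in vertices
        for w in vertices
        # Edge condition: v_2 = w_1, v_3 = w_2, ..., v_n = w_{n-1}
        if v[1:] == w[:-1]
    ]
    return vertices, edges


vertices, edges = de_bruijn_graph("AC", 2)
print(len(vertices), "vertices,", len(edges), "edges")
```

With m symbols there are m^n vertices, and each vertex has exactly m outgoing edges, one per symbol that can be appended.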

3

Introduction

New sequencing techniques are commercially available (e.g. 454 Sequencing, Solexa)

454 Sequencing ~ 100 – 200bp

Solexa ~ 30bp

Algorithms for whole genome shotgun (WGS) assembly are not suitable for short reads

An overlap graph with one node per read is extremely large

Short reads produce more ambiguous connections in the assembly

4

Introduction (cont)

The Euler assembler (Pevzner 2001) used k-mers as the nodes of a de Bruijn graph

Reads are mapped as paths through the de Bruijn graph

High redundancy does not affect the number of nodes

“Velvet” deals effectively with experimental errors and repeats by using de Bruijn graphs built from k-mers
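The point about redundancy can be illustrated with a small sketch: if nodes are distinct k-mers, repeating the same reads many times leaves the node set unchanged. The helper names `kmers` and `node_set` are illustrative assumptions, and reverse complements are ignored here for brevity.

```python
def kmers(read, k):
    """All k-mers of a read, in order of position."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def node_set(reads, k):
    """Distinct k-mers over all reads (one graph node per distinct k-mer)."""
    return {kmer for read in reads for kmer in kmers(read, k)}


reads = ["ACGTAC", "CGTACG"]
once = node_set(reads, 4)
tenfold = node_set(reads * 10, 4)  # same reads at ten times the coverage
print(len(once), len(tenfold))
```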

5

De Bruijn Graphs – structure

Structure

6

De Bruijn Graphs – structure (cont)

Adjacent k-mers overlap by k-1 nucleotides

Each node is attached to a twin node

The twin node holds the reverse series of reverse-complement k-mers

This captures overlaps between reads from opposite strands

The union of a node and its twin node is called a “block”

The last k-mer of a node overlaps with the first k-mer of its destination node
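A short sketch of the node and twin relationship, assuming a node is stored as an ordered list of k-mers. The function names are illustrative, not Velvet's own.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq, k):
    """All k-mers of a sequence, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def twin(node_kmers):
    """Reverse series of the reverse-complement k-mers of a node."""
    return [reverse_complement(kmer) for kmer in reversed(node_kmers)]


node = kmers("ACGGT", 3)
twin_node = twin(node)
print(node, twin_node)
```

The twin is the same series of k-mers one would read off the opposite strand, and adjacent k-mers in both series still overlap by k-1 nucleotides.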

7

De Bruijn Graphs – construction

Construction

Reads are hashed with a predefined k-mer length

Small k → increased connectivity → more ambiguous repeats

Large k → increased specificity → decreased connectivity

Choose k by balancing “sensitivity” against “specificity”
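The trade-off can be seen on a toy sequence containing a 4 bp repeat. This sketch counts k-mers that are followed by more than one distinct k-mer; the sequence and the function name `branching_kmers` are made up for illustration.

```python
from collections import defaultdict


def branching_kmers(seq, k):
    """k-mers that are followed by more than one distinct k-mer in seq."""
    successors = defaultdict(set)
    for i in range(len(seq) - k):
        successors[seq[i:i + k]].add(seq[i + 1:i + k + 1])
    return sorted(kmer for kmer, nxt in successors.items() if len(nxt) > 1)


seq = "TACGATTACGCC"  # "TACG" occurs twice, followed by different bases
for k in (3, 4, 5):
    print(k, branching_kmers(seq, k))
```

Once k exceeds the repeat length, the ambiguity disappears, at the cost of requiring longer exact overlaps between reads.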

8

De Bruijn Graphs – construction (cont)

For each k-mer, the hash table records the ID of the first read that contains it and its position in that read

Each k-mer is recorded together with its reverse complement

A node is created for each run of k-mers between distinct interruption points

Reads are traced through the graph

Directed arcs are created where necessary
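The hashing step might look like the sketch below. Storing each k-mer under the lexicographically smaller of itself and its reverse complement is one common way to record both strands together; it is an assumption here, not necessarily how Velvet lays out its table.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Single key shared by a k-mer and its reverse complement (assumed scheme)."""
    return min(kmer, reverse_complement(kmer))


def build_kmer_table(reads, k):
    """Map each k-mer to (read ID, position) of its first occurrence."""
    table = {}
    for read_id, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            # setdefault keeps the first read and position seen for this key
            table.setdefault(canonical(read[pos:pos + k]), (read_id, pos))
    return table


table = build_kmer_table(["ACGTT", "AACGT"], 3)
print(table)
```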

9

De Bruijn Graphs – simplification

Simplify the chains of blocks

No information is lost

If node A has only one outgoing arc, to node B, and node B has only one ingoing arc → merge A and B
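The merge rule can be sketched on a plain directed graph. This is a simplified illustration that ignores twin nodes; the data layout (a dict of node sequences plus a set of arcs) and the function name `simplify` are assumptions.

```python
def simplify(seqs, arcs, k):
    """Merge chains: A -> B is merged when A has one outgoing arc and B one ingoing arc.

    seqs: dict node ID -> sequence; arcs: set of (source, destination) pairs.
    Sequences of adjacent nodes are assumed to overlap by k-1 nucleotides.
    """
    seqs, arcs = dict(seqs), set(arcs)
    merged = True
    while merged:
        merged = False
        for a, b in sorted(arcs):
            out_a = [d for s, d in arcs if s == a]
            in_b = [s for s, d in arcs if d == b]
            if a != b and out_a == [b] and in_b == [a]:
                seqs[a] += seqs[b][k - 1:]  # append only the non-overlapping part
                arcs.discard((a, b))
                # B's outgoing arcs now leave from the merged node A
                arcs = {(a if s == b else s, d) for s, d in arcs}
                del seqs[b]
                merged = True
                break
    return seqs, arcs


seqs = {1: "ACG", 2: "CGT", 3: "GTA", 4: "GTC"}
arcs = {(1, 2), (2, 3), (2, 4)}
new_seqs, new_arcs = simplify(seqs, arcs, k=3)
print(new_seqs, new_arcs)
```

Nodes 1 and 2 are merged into one node, while the branch towards nodes 3 and 4 is left untouched because the merged node has two outgoing arcs.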

10

De Bruijn Graphs – error removal

Velvet focuses on “topological features” of the graph

First step: remove tips

Tip: a chain of nodes that is disconnected on one end

Two criteria are used: (1) length and (2) minority count

Length: remove a tip if it is shorter than 2k bp, since two nearby errors can create a tip of up to 2k bp

[Figure: two nearby errors, each spanning k bp]

11

De Bruijn Graphs – error removal (cont)

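Minority count: the arc leading into the tip has multiplicity m, lower than the multiplicity n of the alternative arc (m < n)

Starting from node B, going through the tip is an alternative to a more common path

[Figure: a tip branching off at node B; arc multiplicities m and n; nodes A and C]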
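The two tip criteria can be combined into a single predicate. This is a minimal sketch, assuming both criteria must hold for a tip to be removed; the function and parameter names are illustrative.

```python
def is_removable_tip(tip_length_bp, k, tip_multiplicity, other_multiplicities):
    """Decide whether a tip should be removed.

    tip_multiplicity: multiplicity m of the arc leading into the tip.
    other_multiplicities: multiplicities n of the other arcs leaving the
    junction node where the tip attaches to the graph.
    """
    short_enough = tip_length_bp < 2 * k
    minority = any(tip_multiplicity < n for n in other_multiplicities)
    return short_enough and minority


print(is_removable_tip(40, 31, 1, [12]))   # short and in the minority
print(is_removable_tip(80, 31, 1, [12]))   # too long to be removed
print(is_removable_tip(40, 31, 12, [3]))   # the tip is the more common path
```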

12


Second step: remove bubbles using Tour Bus

Redundant paths start and end at the same nodes

Bubbles are created by errors or by biological variants such as SNPs

[Figure: a bubble]

13


Tour Bus

1. Detect redundant paths

2. Compare them using dynamic programming methods

3. If similar, merge them
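Steps 2 and 3 can be sketched with a standard edit-distance computation, which is one dynamic programming method for comparing two path sequences. The `max_divergence` threshold is a hypothetical parameter chosen for illustration, not a value from the slides.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # match or substitution
            ))
        prev = curr
    return prev[-1]


def should_merge(path_a, path_b, max_divergence=0.2):
    """Merge two redundant paths if they differ in a small fraction of positions."""
    longest = max(len(path_a), len(path_b))
    return longest > 0 and edit_distance(path_a, path_b) / longest <= max_divergence


print(should_merge("ACGTACGT", "ACGAACGT"))
print(should_merge("ACGT", "TTTT"))
```

A single-base difference, as an error or SNP would produce, keeps the two paths similar enough to merge, while unrelated sequences stay separate.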

14


Third step: remove erroneous connections

Erroneous connections are removed after the Tour Bus algorithm

They are removed with a basic coverage cutoff

Genuine short nodes which cannot be simplified in the graph should have high coverage
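A coverage cutoff amounts to a simple filter over nodes. The sketch below assumes coverage is available per node; the node names and the cutoff value are made up for illustration.

```python
def remove_low_coverage(coverage, cutoff):
    """Keep only the nodes whose coverage reaches the cutoff."""
    return {node: cov for node, cov in coverage.items() if cov >= cutoff}


coverage = {"n1": 35.0, "n2": 2.5, "n3": 28.0}
kept = remove_low_coverage(coverage, cutoff=5.0)
print(sorted(kept))
```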

15

Breadcrumb: resolution of repeats

1. Using read pairs, pair up the long nodes

2. Flag paired reads using unambiguous long nodes


16



17


Extend the nodes as far as possible using the flagged paired reads

All nodes between A and B are paired up to either A or B
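The first Breadcrumb step, pairing up long nodes through read pairs, might be sketched as follows. Everything here is a hypothetical simplification: the input layout (a mapping from read to node), the `min_length` threshold for calling a node "long", and the function name are assumptions for illustration only.

```python
from collections import Counter


def pair_long_nodes(read_pairs, node_of_read, node_length, min_length):
    """Count the read pairs that link two different long nodes."""
    links = Counter()
    for left, right in read_pairs:
        a, b = node_of_read.get(left), node_of_read.get(right)
        if a is None or b is None or a == b:
            continue  # unmapped read, or both mates fall on the same node
        if node_length[a] >= min_length and node_length[b] >= min_length:
            links[tuple(sorted((a, b)))] += 1
    return links


read_pairs = [("r1", "r2"), ("r3", "r4"), ("r5", "r6")]
node_of_read = {"r1": "A", "r2": "B", "r3": "A", "r4": "B", "r5": "A", "r6": "x"}
node_length = {"A": 500, "B": 400, "x": 40}
links = pair_long_nodes(read_pairs, node_of_read, node_length, min_length=100)
print(links)
```

Two read pairs support the link between long nodes A and B; the pair touching the short node x is ignored at this stage.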

18

Experimental Results

Test the error removal pipeline on simulated data

Simulated reads are from E. coli, S. cerevisiae, C. elegans, and H. sapiens

Coverage density vs N50 for H. sapiens

N50 is limited by the natural repetition of the reference genome

[Figure: N50 against coverage density; series: Ideal, + Error (1%), + SNP]
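For reference, N50 is the contig length such that contigs of at least that length contain half of the total assembled bases. A minimal sketch with made-up contig lengths:

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L cover at least half the total."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # no contigs


print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

Here the total is 32 bases, and the two contigs of length 8 already account for 16 of them, so the N50 is 8.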

19

Experimental Results (cont)

Test error removal pipeline on experimental data

A 173,428 bp human BAC was sequenced using Solexa machines

Reads were 35bp long, and k=31

Tour Bus increased sensitivity by correcting errors and preserved the integrity of the graph structure

20


21


22

Conclusions

Velvet is a de Bruijn graph based sequence assembly method for short reads

Errors are handled by removing tips and by the Tour Bus algorithm

A large number of repeats are resolved by the Breadcrumb algorithm

Velvet was assessed using simulated and real datasets and it performed well
