rapid global alignments

Post on 31-Dec-2015


Rapid Global Alignments

How to align genomic sequences in (more or less) linear time

Methods to CHAIN Local Alignments

Sparse Dynamic Programming: O(N log N)

The Problem: Find a Chain of Local Alignments

(x, y) → (x', y') requires x < x' and y < y'

Each local alignment has a weight

FIND the chain with highest total weight
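As a baseline for the sparse method, the chaining problem can be solved with a simple quadratic dynamic program. The sketch below is only an illustration: it assumes each local alignment is reduced to a single point (x, y) with a weight, and the name `best_chain_naive` is hypothetical.

```python
def best_chain_naive(alignments):
    """O(N^2) chaining. `alignments` is a list of (x, y, weight) tuples
    (an assumed point-like representation of local alignments)."""
    # Process alignments from left to right.
    order = sorted(range(len(alignments)), key=lambda i: alignments[i][0])
    V = {}     # V[i]: best total weight of a chain ending in alignment i
    prev = {}  # prev[i]: predecessor of i in that chain
    for pos, i in enumerate(order):
        xi, yi, wi = alignments[i]
        V[i], prev[i] = wi, None
        for j in order[:pos]:
            xj, yj, _ = alignments[j]
            # j can precede i only if it is strictly smaller in both x and y.
            if xj < xi and yj < yi and V[j] + wi > V[i]:
                V[i], prev[i] = V[j] + wi, j
    if not V:
        return 0, []
    end = max(V, key=V.get)
    best = V[end]
    chain = []
    while end is not None:
        chain.append(end)
        end = prev[end]
    return best, chain[::-1]


score, chain = best_chain_naive([(1, 1, 5), (2, 3, 6), (3, 2, 3), (4, 4, 4)])
print(score, chain)
```

The inner loop over all earlier alignments is what sparse DP replaces with a logarithmic-time lookup.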

Sparse DP for rectangle chaining

• 1,…, N: rectangles

• (hj, lj): y-coordinates of rectangle j

• w(j): weight of rectangle j

• V(j): optimal score of chain ending in j

• L: list of triplets (lj, V(j), j)

L is sorted by lj

L is implemented as a balanced binary tree

[Figure: a rectangle along the y-axis, with its two y-coordinates labeled h and l]

Sparse DP for rectangle chaining

Main idea:

• Sweep through x-coordinates

• To the right of b, anything chainable to a is chainable to b

• Therefore, if V(b) > V(a), rectangle a is “useless” – remove it

• In L, keep rectangles j sorted with increasing lj-coordinates ⇒ sorted with increasing V(j)

[Figure: rectangles a and b, labeled with their scores V(a) and V(b)]

Sparse DP for rectangle chaining

Go through rectangle x-coordinates, from left to right:

1. When on the leftmost end of rectangle i, compute V(i)

a. j: rectangle in L, with largest lj < hi

b. V(i) = w(i) + V(j)

2. When on the rightmost end of i, possibly store V(i) in L:

a. j: rectangle in L, with largest lj ≤ li

b. If V(i) > V(j):

i. INSERT (li, V(i), i) in L

ii. REMOVE all (lk, V(k), k) with V(k) ≤ V(i) & lk ≥ li
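The sweep above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the slides' implementation: each rectangle is a tuple (x_left, x_right, h, l, w) with h ≤ l, chaining j before i requires lj < hi and the right end of j strictly left of the left end of i, and a sorted Python list searched with `bisect` stands in for the balanced binary tree (so an insertion costs O(N) here rather than O(log N)). The name `chain_rectangles` is hypothetical.

```python
from bisect import bisect_left, bisect_right


def chain_rectangles(rects):
    """rects: list of (x_left, x_right, h, l, w) with h <= l (assumed layout).
    Returns (best score, list of rectangle indices in the best chain)."""
    n = len(rects)
    if n == 0:
        return 0, []
    V = [0] * n
    prev = [None] * n
    # One event per rectangle end; at equal x, left ends (0) sort before
    # right ends (1), which keeps the x-ordering of a chain strict.
    events = []
    for i, (x_left, x_right, h, l, w) in enumerate(rects):
        events.append((x_left, 0, i))
        events.append((x_right, 1, i))
    events.sort()

    keys = []   # l-coordinates of the rectangles kept in L, increasing
    items = []  # parallel list of (V(j), j); V is increasing as well
    for _, kind, i in events:
        _, _, h, l, w = rects[i]
        if kind == 0:
            # Leftmost end: j = rectangle in L with largest lj < hi.
            p = bisect_left(keys, h)
            if p > 0:
                V[i] = w + items[p - 1][0]
                prev[i] = items[p - 1][1]
            else:
                V[i] = w
        else:
            # Rightmost end: j = rectangle in L with largest lj <= li.
            p = bisect_right(keys, l)
            if p == 0 or V[i] > items[p - 1][0]:
                # Entries with lk >= li and V(k) <= V(i) are consecutive.
                q = bisect_left(keys, l)
                r = q
                while r < len(keys) and items[r][0] <= V[i]:
                    r += 1
                keys[q:r] = [l]
                items[q:r] = [(V[i], i)]

    end = max(range(n), key=V.__getitem__)
    best = V[end]
    chain = []
    while end is not None:
        chain.append(end)
        end = prev[end]
    return best, chain[::-1]


rects = [(0, 2, 0, 2, 5), (3, 5, 3, 5, 6), (1, 4, 6, 8, 3),
         (6, 8, 1, 2, 4), (6, 9, 6, 9, 2)]
print(chain_rectangles(rects))
```

Because V increases along L, the entry found by the binary search at a left end is also the best-scoring predecessor, which is why a single lookup suffices.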


Example

[Figure: five rectangles in the x-y plane, labeled 1: 5, 2: 6, 3: 3, 4: 4, 5: 2; y-axis ticks at 2, 5, 6, 9, 10, 11, 12, 14, 15, 16]

Time Analysis

1. Sorting the x-coords takes O(N log N)

2. Going through x-coords: N steps

3. Each of N steps requires O(log N) time:

• Searching L takes log N

• Inserting to L takes log N

• All deletions are consecutive, so log N per deletion

• Each element is deleted at most once: N log N for all deletions

• Recall that INSERT, DELETE, SUCCESSOR, take O(log N) time in a balanced binary search tree

Putting it All Together: Fast Global Alignment Algorithms

1. FIND local alignments

2. CHAIN local alignments

Method    FIND           CHAIN

GLASS     k-mers         hierarchical DP

MUMmer    Suffix Tree    sparse DP

Avid      Suffix Tree    hierarchical DP

LAGAN     CHAOS          sparse DP

LAGAN: Pairwise Alignment

1. FIND local alignments

2. CHAIN local alignments

3. DP restricted around chain
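To illustrate what restricting the DP buys, here is a hedged sketch of a banded Needleman-Wunsch score. It is a simplification: the band is a fixed radius around the main diagonal rather than around a chain of anchors, the scoring parameters are assumed, and the full matrix is allocated for clarity even though only the band is filled.

```python
def banded_nw(x, y, radius, match=1, mismatch=-1, gap=-2):
    """Global alignment score using only cells with |i - j| <= radius.
    Returns -inf if no alignment fits inside the band."""
    n, m = len(x), len(y)
    NEG = float("-inf")
    F = [[NEG] * (m + 1) for _ in range(n + 1)]
    F[0][0] = 0
    for i in range(n + 1):
        lo = max(0, i - radius)
        hi = min(m, i + radius)
        for j in range(lo, hi + 1):
            if i == 0 and j == 0:
                continue
            best = NEG
            if i > 0 and j > 0:
                s = match if x[i - 1] == y[j - 1] else mismatch
                best = max(best, F[i - 1][j - 1] + s)
            if i > 0:
                best = max(best, F[i - 1][j] + gap)  # gap in y
            if j > 0:
                best = max(best, F[i][j - 1] + gap)  # gap in x
            F[i][j] = best
    return F[n][m]


print(banded_nw("GAAGTT", "GACTT", radius=1))
```

Only about (2 · radius + 1) cells per row are visited, so the work grows with the band width instead of with the full product of the sequence lengths.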

LAGAN

1. Find local alignments

2. Chain: O(N log N) L.I.S. (longest increasing subsequence)

3. Restricted DP
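The L.I.S. chaining step can be sketched with the standard O(N log N) longest-increasing-subsequence routine. The sketch assumes unweighted, point-like anchors (x, y); the function names are hypothetical, and the real tool may weight or represent anchors differently.

```python
from bisect import bisect_left


def lis_indices(values):
    """Indices of one longest strictly increasing subsequence of `values`."""
    tails = []     # tails[k]: smallest last value of a subsequence of length k+1
    tail_idx = []  # index in `values` of that last value
    prev = [None] * len(values)
    for i, v in enumerate(values):
        k = bisect_left(tails, v)
        if k == len(tails):
            tails.append(v)
            tail_idx.append(i)
        else:
            tails[k] = v
            tail_idx[k] = i
        prev[i] = tail_idx[k - 1] if k > 0 else None
    out = []
    i = tail_idx[-1] if tail_idx else None
    while i is not None:
        out.append(i)
        i = prev[i]
    return out[::-1]


def chain_anchors(anchors):
    """Longest chain of (x, y) anchors, strictly increasing in both coordinates."""
    # Sorting ties in x by decreasing y prevents two anchors with the same x
    # from both entering the chain.
    pts = sorted(anchors, key=lambda p: (p[0], -p[1]))
    return [pts[i] for i in lis_indices([y for _, y in pts])]


print(chain_anchors([(1, 1), (2, 5), (3, 2), (4, 3), (4, 4)]))
```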

LAGAN: recursive call

• What if a box is too large? Recursive application of LAGAN, with a more sensitive word search

A trick to save on memory

“necks” have tiny tracebacks

…only store tracebacks

Multiple Sequence Alignments

Overview

• Definition

• Scoring Schemes

• Algorithms

Definition

• Given N sequences x1, x2,…, xN: Insert gaps (-) in each sequence xi, such that

• All sequences have the same length L

• Score of the global map is maximum

• A faint similarity between two sequences becomes significant if present in many

• Multiple alignments can help improve the pairwise alignments

Scoring Function

• Ideally: Find alignment that maximizes probability that sequences evolved from common ancestor, according to some phylogenetic model

• More on phylogenetic models later

[Figure: tree relating sequences x, y, z, w, v to an unknown common ancestor (?)]

Scoring Function

• A comprehensive model would have too many parameters, too inefficient to optimize

• Possible simplifications

Ignore phylogenetic tree

Statistically independent columns:

S(m) = G(m) + Σ_i S(m_i)

m: alignment matrix

G: function penalizing gaps

Scoring Function: Sum Of Pairs

Definition: Induced pairwise alignment

A pairwise alignment induced by the multiple alignment

Example:

x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG

Induces:

x: ACGCGG-C      x: AC-GCGG-C      y: AC-GCGAG
y: ACGC-GAG      z: GCCGC-GAG      z: GCCGCGAG

Sum Of Pairs (cont’d)

• The sum-of-pairs score of an alignment is the sum of the scores of all induced pairwise alignments

S(m) = Σ_{k<l} s(m_k, m_l)

s(mk, ml): score of induced alignment (k,l)
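A small sketch of the sum-of-pairs computation, using the x, y, z example above. The scoring values (match +1, mismatch -1, gap -2) and the function names are assumptions chosen for illustration; the slides do not fix a particular pairwise score.

```python
from itertools import combinations


def induced_pair(a, b):
    """Pairwise alignment induced by two rows of a multiple alignment:
    columns in which both rows have a gap are dropped."""
    cols = [(p, q) for p, q in zip(a, b) if not (p == "-" and q == "-")]
    return "".join(p for p, _ in cols), "".join(q for _, q in cols)


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of a pairwise alignment under an assumed linear gap penalty."""
    score = 0
    for p, q in zip(a, b):
        if p == "-" or q == "-":
            score += gap
        elif p == q:
            score += match
        else:
            score += mismatch
    return score


def sum_of_pairs(rows):
    """Sum of the scores of all induced pairwise alignments."""
    return sum(pair_score(*induced_pair(a, b)) for a, b in combinations(rows, 2))


x, y, z = "AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"
print(induced_pair(x, y))
print(induced_pair(y, z))
print(sum_of_pairs([x, y, z]))
```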

Sum Of Pairs (cont’d)

• Heuristic way to incorporate evolution tree:

• Weighted SOP:

S(m) = Σ_{k<l} w_kl s(m_k, m_l)

w_kl: weight decreasing with distance

[Figure: evolutionary tree with leaves Human, Mouse, Duck, Chicken]

Consensus

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
CAG-CTATCAC--GACCGC----TCGATTTGCTCGAC

CAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

• Find optimal consensus string m* to maximize

S(m) = Σ_i s(m*, m_i)

s(mk, ml): score of pairwise alignment (k,l)
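Under one deliberately simple assumption, namely that columns are independent and s counts one point per position where the consensus agrees with a row, the optimal consensus is the most frequent symbol in each column. The sketch below implements only that special case; the function names are hypothetical, and ties are broken by whichever symbol appears first in the column.

```python
from collections import Counter


def column_consensus(rows):
    """Most frequent symbol per column (gaps count as symbols)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def consensus_score(consensus, rows):
    """Sum over rows of the number of positions agreeing with the consensus."""
    return sum(c == r for row in rows for c, r in zip(consensus, row))


rows = ["AC-T", "ACGT", "TCGT"]
m_star = column_consensus(rows)
print(m_star, consensus_score(m_star, rows))
```

With a richer pairwise score (mismatch costs that depend on the symbols, or affine gaps), per-column voting is no longer guaranteed to be optimal.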

Multiple Sequence Alignments

Algorithms

1. Multidimensional Dynamic Programming

Generalization of Needleman-Wunsch:

S(m) = Σ_i S(m_i)

(sum of column scores)

F(i1, i2, …, iN) = max over all neighbors nbr of the cell of ( F(nbr) + S(nbr) )

• Example: in 3D (three sequences):

• 7 neighbors/cell

F(i,j,k) = max {
  F(i-1, j-1, k-1) + S(xi, xj, xk),
  F(i-1, j-1, k  ) + S(xi, xj, - ),
  F(i-1, j,   k-1) + S(xi, -,  xk),
  F(i-1, j,   k  ) + S(xi, -,  - ),
  F(i,   j-1, k-1) + S(-,  xj, xk),
  F(i,   j-1, k  ) + S(-,  xj, - ),
  F(i,   j,   k-1) + S(-,  -,  xk)
}
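The three-sequence recurrence can be sketched directly. The column score S used here is an assumed sum-of-pairs score (match +1, mismatch -1, gap -2, gap against gap 0); the sketch returns only the optimal score, without traceback.

```python
from itertools import product

# The 7 = 2^3 - 1 neighbors of a cell: every non-zero 0/1 step vector.
MOVES = [d for d in product((0, 1), repeat=3) if any(d)]


def column_score(a, b, c, match=1, mismatch=-1, gap=-2):
    """Assumed sum-of-pairs score of one alignment column."""
    s = 0
    for p, q in ((a, b), (a, c), (b, c)):
        if p == "-" and q == "-":
            continue
        if p == "-" or q == "-":
            s += gap
        elif p == q:
            s += match
        else:
            s += mismatch
    return s


def align3_score(x, y, z):
    """Optimal score of a global alignment of three sequences."""
    n, m, o = len(x), len(y), len(z)
    NEG = float("-inf")
    F = [[[NEG] * (o + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    F[0][0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            for k in range(o + 1):
                for di, dj, dk in MOVES:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue
                    # A step of 1 consumes a character; a step of 0 places a gap.
                    a = x[i - 1] if di else "-"
                    b = y[j - 1] if dj else "-"
                    c = z[k - 1] if dk else "-"
                    cand = F[pi][pj][pk] + column_score(a, b, c)
                    if cand > F[i][j][k]:
                        F[i][j][k] = cand
    return F[n][m][o]


print(align3_score("AG", "AG", "AC"))
```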

1. Multidimensional Dynamic Programming

Running Time:

1. Size of matrix: L^N,

where L = length of each sequence

and N = number of sequences

2. Neighbors/cell: 2^N − 1

Therefore: O(2^N L^N)

1. Multidimensional Dynamic Programming

2. Progressive Alignment

• Multiple Alignment is NP-complete

• Most used heuristic: Progressive Alignment

Algorithm:

1. Align two of the sequences xi, xj

2. Fix that alignment

3. Align a third sequence xk to the alignment xi,xj

4. Repeat until all sequences are aligned

Running Time: O(N L^2)
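The four steps can be sketched as repeatedly aligning one new sequence against the columns of the alignment built so far. Everything specific here is an assumption made for illustration: sequences are added in input order rather than by similarity, columns are scored by sum of pairs with a linear gap penalty (match +1, mismatch -1, gap -2), and the function names are hypothetical.

```python
def align_to_alignment(rows, seq, match=1, mismatch=-1, gap=-2):
    """Align `seq` to a fixed alignment `rows` (equal-length gapped strings).
    Existing columns are never changed; all-gap columns may be inserted."""
    L, n, depth = len(rows[0]), len(seq), len(rows)
    cols = ["".join(r[c] for r in rows) for c in range(L)]

    def col_vs(col, ch):
        # Score of symbol `ch` against every symbol already in the column.
        s = 0
        for a in col:
            if a == "-" and ch == "-":
                continue
            s += gap if (a == "-" or ch == "-") else (match if a == ch else mismatch)
        return s

    # F[i][j]: best score for the first i columns and first j characters of seq.
    F = [[0] * (n + 1) for _ in range(L + 1)]
    for i in range(1, L + 1):
        F[i][0] = F[i - 1][0] + col_vs(cols[i - 1], "-")
    for j in range(1, n + 1):
        F[0][j] = F[0][j - 1] + gap * depth
    for i in range(1, L + 1):
        for j in range(1, n + 1):
            F[i][j] = max(F[i - 1][j - 1] + col_vs(cols[i - 1], seq[j - 1]),
                          F[i - 1][j] + col_vs(cols[i - 1], "-"),
                          F[i][j - 1] + gap * depth)

    # Traceback, rebuilding the old columns and the new row.
    out_cols, new_row = [], []
    i, j = L, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + col_vs(cols[i - 1], seq[j - 1]):
            out_cols.append(cols[i - 1]); new_row.append(seq[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + col_vs(cols[i - 1], "-"):
            out_cols.append(cols[i - 1]); new_row.append("-"); i -= 1
        else:
            out_cols.append("-" * depth); new_row.append(seq[j - 1]); j -= 1
    out_cols.reverse()
    new_row.reverse()
    return ["".join(c[r] for c in out_cols) for r in range(depth)] + ["".join(new_row)]


def progressive_align(seqs):
    """Add sequences one at a time, freezing the alignment built so far."""
    aln = [seqs[0]]
    for s in seqs[1:]:
        aln = align_to_alignment(aln, s)
    return aln


for row in progressive_align(["GAAGTT", "GACTT", "GAACTG", "GTACTG"]):
    print(row)
```

Each call fills a table of size (alignment length) × (sequence length), which is where the quadratic factor in the running time comes from; this sketch also pays an extra factor for rescoring each column against all rows.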

2. Progressive Alignment

• When evolutionary tree is known:

Align closest first, in the order of the tree

Example:

Order of alignments:

1. (x,y)

2. (z,w)

3. (xy, zw)

[Figure: tree with leaves x, y, z, w]

CLUSTALW: progressive alignment

CLUSTALW: most popular multiple protein alignment

Algorithm:

1. Find all dij: alignment dist (xi, xj)

2. Construct a tree

(Neighbor-joining hierarchical clustering)

3. Align nodes in order of decreasing similarity

+ a large number of heuristics

CLUSTALW & the CINEMA viewer

MLAGAN: progressive alignment of DNA

Given N sequences, phylogenetic tree

Align pairwise, in order of the tree (LAGAN)

[Figure: phylogenetic tree with leaves Human, Baboon, Mouse, Rat]

MLAGAN: main steps

Given a collection of sequences, and a phylogenetic tree

1. Find local alignments for every pair of sequences x, y

2. Find anchors between every pair of sequences, similar to LAGAN anchoring

3. Progressive alignment

• Multi-Anchoring based on reconciling the pairwise anchors

• LAGAN-style limited-area DP

4. Optional refinement steps

MLAGAN: multi-anchoring

To anchor the (X/Y) and (Z) alignments:

[Figure: pairwise anchors X-Z and Y-Z, shown against the X/Y alignment and Z]

Heuristics to improve multiple alignments

• Iterative refinement schemes

• A*-based search

• Consistency

• Simulated Annealing

• …

Iterative Refinement

One problem of progressive alignment:

• Initial alignments are “frozen” even when new evidence comes

Example:

x: GAAGTT
y: GAC-TT   (frozen!)

z: GAACTG
w: GTACTG

Now it is clear that the correct y = GA-CTT

Iterative Refinement

Algorithm (Barton-Sternberg):

1. Align most similar xi, xj

2. Align xk most similar to (xixj)

3. Repeat 2 until (x1…xN) are aligned

4. For j = 1 to N: remove xj, and realign it to x1…xj-1xj+1…xN

5. Repeat 4 until convergence

Note: Guaranteed to converge

Iterative Refinement

For each sequence y:

1. Remove y

2. Realign y (while rest fixed)

[Figure: alignment cube with axes x, y, z; the x,z projection is fixed while y is allowed to vary]

Iterative Refinement

Example: align (x,y), (z,w), (xy, zw):

x: GAAGTTA
y: GAC-TTA
z: GAACTGA
w: GTACTGA

After realigning y:

x: GAAGTTA
y: G-ACTTA   (+ 3 matches)
z: GAACTGA
w: GTACTGA

Iterative Refinement

Example not handled well:

x:  GAAGTTA
y1: GAC-TTA
y2: GAC-TTA
y3: GAC-TTA

z:  GAACTGA
w:  GTACTGA

Realigning any single yi changes nothing

Restricted MDP

Here is another way to improve a multiple alignment:

1. Construct progressive multiple alignment m

2. Run MDP, restricted to radius R from m

Running Time: O(2^N R^(N-1) L)

Restricted MDP

Run MDP, restricted to radius R from m

[Figure: alignment cube with axes x, y, z; the search is confined to radius R around m]

Running Time: O(2^N R^(N-1) L)

Restricted MDP

x:  GAAGTTA
y1: GAC-TTA
y2: GAC-TTA
y3: GAC-TTA

z:  GAACTGA
w:  GTACTGA

• Within radius 1 of the optimal ⇒ Restricted MDP will fix it.

Optional refinement steps in MLAGAN

• Limited-area iterative refinement

• Radius-r 3-sequence refinement on each node of the tree
