Post on 31-Dec-2015
Rapid Global Alignments
How to align genomic sequences in (more or less) linear time
Methods to CHAIN Local Alignments
Sparse Dynamic Programming: O(N log N)
The Problem: Find a Chain of Local Alignments
(x, y) → (x', y') requires x < x' and y < y'
Each local alignment has a weight
FIND the chain with highest total weight
Sparse DP for rectangle chaining
• 1,…, N: rectangles
• (hj, lj): y-coordinates of rectangle j
• w(j): weight of rectangle j
• V(j): optimal score of chain ending in j
• L: list of triplets (lj, V(j), j)
L is sorted by lj
L is implemented as a balanced binary tree
[Figure: a rectangle with its y-coordinates h and l marked on the y axis]
Sparse DP for rectangle chaining
Main idea:
• Sweep through x-coordinates
• To the right of b, anything chainable to a is chainable to b
• Therefore, if V(b) > V(a), rectangle a is “useless” – remove it
• In L, keep rectangles j sorted by increasing lj-coordinate; they are then also sorted by increasing V(j)
[Figure: rectangles a and b with scores V(a) and V(b)]
Sparse DP for rectangle chaining
Go through rectangle x-coordinates, from left to right:
1. When on the leftmost end of rectangle i, compute V(i)
a. j: rectangle in L, with largest lj < hi
b. V(i) = w(i) + V(j)
2. When on the rightmost end of i, possibly store V(i) in L:
a. j: rectangle in L, with largest lj ≤ li
b. If V(i) > V(j):
i. INSERT (li, V(i), i) in L
ii. REMOVE all (lk, V(k), k) with V(k) ≤ V(i) & lk ≥ li
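The sweep above can be sketched in Python as follows. This is a minimal illustration, not the lecture's implementation: the tuple layout `(x1, x2, h, l, w)`, the function name `chain_rectangles`, and the use of two parallel sorted lists in place of a balanced binary tree are all assumptions made for brevity (list insertion is O(N), so this sketch does not achieve the O(N log N) bound; a balanced tree would).

```python
from bisect import bisect_left, bisect_right

def chain_rectangles(rects):
    """rects: list of (x1, x2, h, l, w) with x1 <= x2, h <= l, w > 0 (assumed layout).
    Rectangle j can precede i when x2_j < x1_i and l_j < h_i.
    Returns the highest total weight of any chain."""
    events = []
    for i, (x1, x2, h, l, w) in enumerate(rects):
        events.append((x1, 0, i))  # left end; sorts before right ends at equal x
        events.append((x2, 1, i))  # right end
    events.sort()

    ls, vs = [], []          # the list L: l-values and V-values, both increasing
    V = [0] * len(rects)
    for _, kind, i in events:
        _, _, h, l, w = rects[i]
        if kind == 0:
            # Step 1: best predecessor is the entry with largest l_j < h_i.
            p = bisect_left(ls, h)
            V[i] = w + (vs[p - 1] if p > 0 else 0)
        else:
            # Step 2: compare with the entry with largest l_j <= l_i.
            p = bisect_right(ls, l)
            if p == 0 or V[i] > vs[p - 1]:
                # Remove entries with l_k >= l_i and V(k) <= V(i); they are consecutive.
                q = bisect_left(ls, l)
                r = q
                while r < len(ls) and vs[r] <= V[i]:
                    r += 1
                ls[q:r] = [l]
                vs[q:r] = [V[i]]
    return max(V, default=0)
```

For instance, with the hypothetical input `[(0,2,0,2,5), (3,5,3,5,6), (3,4,0,1,3), (6,8,6,7,4), (6,7,2,4,2)]` the best chain uses the first, second, and fourth rectangles.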
Example
[Figure: five rectangles in the x-y plane, labeled rectangle: weight as 1: 5, 2: 6, 3: 3, 4: 4, 5: 2]
Time Analysis
1. Sorting the x-coords takes O(N log N)
2. Going through x-coords: N steps
3. Each of N steps requires O(log N) time:
• Searching L takes log N
• Inserting to L takes log N
• All deletions are consecutive, so log N per deletion
• Each element is deleted at most once: N log N for all deletions
• Recall that INSERT, DELETE, SUCCESSOR, take O(log N) time in a balanced binary search tree
Putting it All Together: Fast Global Alignment Algorithms
1. FIND local alignments
2. CHAIN local alignments
          FIND          CHAIN
GLASS:    k-mers        hierarchical DP
MUMmer:   suffix tree   sparse DP
AVID:     suffix tree   hierarchical DP
LAGAN:    CHAOS         sparse DP
LAGAN: Pairwise Alignment
1. FIND local alignments
2. CHAIN local alignments
3. DP restricted around chain
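Step 3 can be illustrated with a banded global alignment. The sketch below is a simplification, not LAGAN's procedure: it assumes the chain runs along the main diagonal and restricts the DP to cells with |i - j| ≤ radius. The function name and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration.

```python
def banded_global_score(x, y, radius, match=1, mismatch=-1, gap=-1):
    """Global alignment score using only cells (i, j) with |i - j| <= radius.
    Assumes the chain to follow is the main diagonal (illustrative simplification)."""
    n, m = len(x), len(y)
    if abs(n - m) > radius:
        raise ValueError("band too narrow to reach the end cell")
    F = {(0, 0): 0}  # sparse table: only cells inside the band are stored
    for i in range(n + 1):
        for j in range(max(0, i - radius), min(m, i + radius) + 1):
            if i == 0 and j == 0:
                continue
            best = float("-inf")
            if i > 0 and j > 0 and (i - 1, j - 1) in F:
                s = match if x[i - 1] == y[j - 1] else mismatch
                best = max(best, F[i - 1, j - 1] + s)
            if i > 0 and (i - 1, j) in F:
                best = max(best, F[i - 1, j] + gap)
            if j > 0 and (i, j - 1) in F:
                best = max(best, F[i, j - 1] + gap)
            F[i, j] = best
    return F[n, m]
```

The table holds roughly (2·radius + 1)·N cells instead of N·M, which is the point of restricting the DP to a neighborhood of the chain.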
LAGAN
1. Find local alignments
2. Chain: O(N log N), via longest increasing subsequence (L.I.S.)
3. Restricted DP
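The L.I.S. idea behind step 2 can be sketched as below. This is an unweighted simplification under stated assumptions: anchors are hypothetical `(x, y)` points with distinct x values, and every anchor counts equally, whereas real anchors carry weights.

```python
from bisect import bisect_left

def longest_chain_length(anchors):
    """anchors: list of (x, y) points with distinct x (assumed).
    Returns the size of the largest subset that increases in both x and y,
    using the O(N log N) longest-increasing-subsequence method."""
    tails = []  # tails[k] = smallest final y of any increasing run of length k + 1
    for _, y in sorted(anchors):
        p = bisect_left(tails, y)   # first tail >= y, so runs stay strictly increasing
        if p == len(tails):
            tails.append(y)
        else:
            tails[p] = y
    return len(tails)
```

Sorting by x takes O(N log N), and each anchor then costs one binary search.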
LAGAN: recursive call
• What if a box is too large? Apply LAGAN recursively, with a more sensitive word search
A trick to save on memory
“necks” have tiny tracebacks
…only store tracebacks
Multiple Sequence Alignments
Overview
• Definition
• Scoring Schemes
• Algorithms
Definition
• Given N sequences x1, x2,…, xN: Insert gaps (-) in each sequence xi, such that
• All sequences have the same length L
• Score of the global map is maximum
• A faint similarity between two sequences becomes significant if present in many
• Multiple alignments can help improve the pairwise alignments
Scoring Function
• Ideally: Find alignment that maximizes probability that sequences evolved
from common ancestor, according to some phylogenetic model
• More on phylogenetic models later
[Figure: tree relating sequences x, y, z, w, v, with a node marked "?"]
Scoring Function
• A comprehensive model would have too many parameters and would be too inefficient to optimize
• Possible simplifications
Ignore phylogenetic tree
Statistically independent columns:
S(m) = G(m) + Σ_i S(m_i)
m: alignment matrix
G: function penalizing gaps
Scoring Function: Sum Of Pairs
Definition: Induced pairwise alignment
A pairwise alignment induced by the multiple alignment
Example:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG
Induces:
x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG
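An induced pairwise alignment is obtained by taking two rows of the multiple alignment and dropping every column where both rows have a gap. A minimal sketch (the function name is an assumption):

```python
def induced_pairwise(row_a, row_b):
    """Project a multiple alignment onto two of its rows:
    keep every column except those where both rows are gaps."""
    cols = [(p, q) for p, q in zip(row_a, row_b) if not (p == "-" and q == "-")]
    return "".join(p for p, _ in cols), "".join(q for _, q in cols)

x, y, z = "AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"
print(induced_pairwise(x, y))  # column 3 is a gap in both rows and is dropped
print(induced_pairwise(y, z))  # column 6 is a gap in both rows and is dropped
```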
Sum Of Pairs (cont’d)
• The sum-of-pairs score of an alignment is the sum of the scores of all induced pairwise alignments
S(m) = Σ_{k<l} s(m_k, m_l)
s(m_k, m_l): score of induced alignment (k, l)
Sum Of Pairs (cont’d)
• Heuristic way to incorporate evolution tree:
[Figure: evolutionary tree with leaves Human, Mouse, Duck, Chicken]
• Weighted SOP:
S(m) = Σ_{k<l} w_kl s(m_k, m_l)
w_kl: weight decreasing with distance
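The sum-of-pairs score and its weighted variant can be computed directly from the rows. The sketch below assumes a simple pairwise score (match +1, mismatch -1, gap -1, columns with two gaps ignored); these values, the function names, and the `weights` dictionary keyed by row-index pairs are illustrative assumptions.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score of the pairwise alignment induced by rows a and b (assumed scoring)."""
    s = 0
    for p, q in zip(a, b):
        if p == "-" and q == "-":
            continue                      # column absent from the induced alignment
        if p == "-" or q == "-":
            s += gap
        else:
            s += match if p == q else mismatch
    return s

def sum_of_pairs(rows, weights=None):
    """Sum over all row pairs k < l of w_kl * s(m_k, m_l); w_kl defaults to 1."""
    weights = weights or {}
    return sum(weights.get((k, l), 1) * pair_score(rows[k], rows[l])
               for k, l in combinations(range(len(rows)), 2))

rows = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]
print(sum_of_pairs(rows))                       # unweighted
print(sum_of_pairs(rows, weights={(0, 1): 2}))  # pair (x, y) counted twice as heavily
```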
Consensus
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
CAG-CTATCAC--GACCGC----TCGATTTGCTCGAC

CAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC   (consensus)
• Find optimal consensus string m* to maximize
S(m) = Σ_i s(m*, m_i)
s(m_k, m_l): score of pairwise alignment (k, l)
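As a simplified illustration of the consensus idea: if s just counts columns where two rows carry the same character, then picking the most frequent character in each column maximizes the sum. The sketch below makes that assumption, which is not stated in the slides, and breaks ties by first occurrence; it is not a general consensus-string algorithm.

```python
from collections import Counter

def column_consensus(rows):
    """Most frequent character per column (gaps count as characters).
    Ties go to the character seen first in the column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))

print(column_consensus(["ACGT", "ACGA", "TCGA"]))
```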
Multiple Sequence Alignments
Algorithms
1. Multidimensional Dynamic Programming
Generalization of Needleman-Wunsch:
S(m) = Σ_i S(m_i)
(sum of column scores)
F(i1, i2, …, iN) = max over all neighbors nbr of the cube of (F(nbr) + S(nbr))
• Example: in 3D (three sequences):
• 7 neighbors/cell
F(i,j,k) = max{
  F(i-1, j-1, k-1) + S(xi, xj, xk),
  F(i-1, j-1, k  ) + S(xi, xj, - ),
  F(i-1, j  , k-1) + S(xi, - , xk),
  F(i-1, j  , k  ) + S(xi, - , - ),
  F(i  , j-1, k-1) + S( - , xj, xk),
  F(i  , j-1, k  ) + S( - , xj, - ),
  F(i  , j  , k-1) + S( - , - , xk) }
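The three-sequence recurrence can be written out as a short program. This sketch assumes S is a sum-of-pairs column score with match +1, mismatch -1, gap -1 (an illustrative choice, not given in the slides) and returns only the optimal score, without traceback; the names `MOVES`, `sp_column`, and `align3_score` are hypothetical.

```python
from itertools import product

# The 7 neighbors of a cell: every nonzero step (di, dj, dk) in {0, 1}^3.
MOVES = [m for m in product((0, 1), repeat=3) if any(m)]

def sp_column(a, b, c, match=1, mismatch=-1, gap=-1):
    """Sum-of-pairs score of one alignment column (assumed scoring values)."""
    score = 0
    for p, q in ((a, b), (a, c), (b, c)):
        if p == "-" and q == "-":
            continue
        if p == "-" or q == "-":
            score += gap
        else:
            score += match if p == q else mismatch
    return score

def align3_score(x, y, z):
    """Optimal global alignment score of three sequences by 3D dynamic programming."""
    n1, n2, n3 = len(x), len(y), len(z)
    NEG = float("-inf")
    F = [[[NEG] * (n3 + 1) for _ in range(n2 + 1)] for _ in range(n1 + 1)]
    F[0][0][0] = 0
    for i in range(n1 + 1):
        for j in range(n2 + 1):
            for k in range(n3 + 1):
                for di, dj, dk in MOVES:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue
                    # A step of 1 consumes a character; a step of 0 emits a gap.
                    col = (x[pi] if di else "-",
                           y[pj] if dj else "-",
                           z[pk] if dk else "-")
                    F[i][j][k] = max(F[i][j][k], F[pi][pj][pk] + sp_column(*col))
    return F[n1][n2][n3]
```

The table has (L+1)^3 cells and each cell examines 7 neighbors, which is why this approach does not scale beyond a handful of short sequences.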
1. Multidimensional Dynamic Programming
Running Time:
1. Size of matrix: L^N;
Where L = length of each sequence
N = number of sequences
2. Neighbors/cell: 2^N – 1
Therefore………………………… O(2^N L^N)
1. Multidimensional Dynamic Programming
2. Progressive Alignment
• Multiple Alignment is NP-complete
• Most used heuristic: Progressive Alignment
Algorithm:
1. Align two of the sequences xi, xj
2. Fix that alignment
3. Align a third sequence xk to the alignment xi,xj
4. Repeat until all sequences are aligned
Running Time: O(N L^2)
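The steps above can be sketched as a loop that repeatedly aligns the next sequence to the alignment built so far. This is a minimal illustration under stated assumptions: sequences are added in input order rather than by any similarity criterion, columns are compared with a sum-of-pairs score (match +1, mismatch -1, gap -1), and all function names are hypothetical.

```python
def col_score(ca, cb, match=1, mismatch=-1, gap=-1):
    """Score of placing column ca (from one alignment) against column cb."""
    s = 0
    for p in ca:
        for q in cb:
            if p == "-" and q == "-":
                continue
            s += gap if "-" in (p, q) else (match if p == q else mismatch)
    return s

def align_profiles(A, B):
    """Needleman-Wunsch over columns. A, B: lists of equal-length gapped rows."""
    n, m = len(A[0]), len(B[0])
    colsA = [tuple(r[i] for r in A) for i in range(n)]
    colsB = [tuple(r[j] for r in B) for j in range(m)]
    gapA, gapB = ("-",) * len(A), ("-",) * len(B)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] + col_score(colsA[i - 1], gapB)
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] + col_score(gapA, colsB[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + col_score(colsA[i - 1], colsB[j - 1]),
                          F[i - 1][j] + col_score(colsA[i - 1], gapB),
                          F[i][j - 1] + col_score(gapA, colsB[j - 1]))
    out, i, j = [], n, m          # traceback, preferring the diagonal on ties
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + col_score(colsA[i - 1], colsB[j - 1]):
            out.append(colsA[i - 1] + colsB[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + col_score(colsA[i - 1], gapB):
            out.append(colsA[i - 1] + gapB); i -= 1
        else:
            out.append(gapA + colsB[j - 1]); j -= 1
    out.reverse()
    return ["".join(col[r] for col in out) for r in range(len(A) + len(B))]

def progressive_align(seqs):
    """Align seqs[0] with seqs[1], fix the result, then add the rest one by one."""
    aln = [seqs[0]]
    for s in seqs[1:]:
        aln = align_profiles(aln, [s])
    return aln
```

Once two rows are merged, their relative gaps are never revisited; later steps can only insert whole gap columns into the fixed block.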
2. Progressive Alignment
• When evolutionary tree is known:
Align closest first, in the order of the tree
Example:
Order of alignments:
1. (x,y)
2. (z,w)
3. (xy, zw)
[Figure: evolutionary tree with leaves x, y, z, w]
CLUSTALW: progressive alignment
CLUSTALW: the most popular multiple protein alignment tool
Algorithm:
1. Find all dij: alignment dist (xi, xj)
2. Construct a tree
(Neighbor-joining hierarchical clustering)
3. Align nodes in order of decreasing similarity
+ a large number of heuristics
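Step 1 can be illustrated with a plain pairwise distance computation. The sketch below uses edit distance as a stand-in for the alignment distance d_ij and simply reports the closest pair; it does not build a neighbor-joining tree, and it is not CLUSTALW's own distance measure. All names are hypothetical.

```python
from itertools import combinations

def edit_distance(a, b):
    """Levenshtein distance with unit costs, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]

def distance_matrix(seqs):
    """All d_ij for i < j, keyed by index pair."""
    return {(i, j): edit_distance(seqs[i], seqs[j])
            for i, j in combinations(range(len(seqs)), 2)}

def closest_pair(seqs):
    """Index pair with the smallest distance: the pair to align first."""
    d = distance_matrix(seqs)
    return min(d, key=d.get)
```

A guide tree built from these distances would then fix the order in which the remaining sequences and sub-alignments are merged.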
CLUSTALW & the CINEMA viewer
MLAGAN: progressive alignment of DNA
Given N sequences, phylogenetic tree
Align pairwise, in order of the tree (LAGAN)
[Figure: phylogenetic tree with leaves Human, Baboon, Mouse, Rat]
MLAGAN: main steps
Given a collection of sequences, and a phylogenetic tree
1. Find local alignments for every pair of sequences x, y
2. Find anchors between every pair of sequences, similar to LAGAN anchoring
3. Progressive alignment
• Multi-Anchoring based on reconciling the pairwise anchors
• LAGAN-style limited-area DP
4. Optional refinement steps
MLAGAN: multi-anchoring
To anchor the (X/Y) and (Z) alignments:
[Figure: pairwise anchor sets XZ and YZ, shown with the X/Y alignment against Z]
Heuristics to improve multiple alignments
• Iterative refinement schemes
• A*-based search
• Consistency
• Simulated Annealing
• …
Iterative Refinement
One problem of progressive alignment:
• Initial alignments are “frozen” even when new evidence arrives
Example:
x: GAAGTT
y: GAC-TT   (frozen!)
z: GAACTG
w: GTACTG
Now it is clear that the correct y = GA-CTT
Iterative Refinement
Algorithm (Barton-Sternberg):
1. Align most similar xi, xj
2. Align xk most similar to (xixj)
3. Repeat 2 until (x1…xN) are aligned
4. For j = 1 to N, remove xj, and realign to x1…xj-1xj+1…xN
5. Repeat 4 until convergence
Note: Guaranteed to converge
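Steps 4 and 5 can be sketched as a remove-and-realign loop that accepts a change only when the score strictly increases, which is what makes it terminate. Everything here is an illustrative assumption: the score counts identical non-gap pairs, and `toy_realign` is a brute-force placeholder that tries every placement of the removed sequence into the existing columns (exponential, suitable only for tiny inputs). A real implementation would realign with DP.

```python
from itertools import combinations

def match_score(rows):
    """Number of identical non-gap character pairs over all row pairs (toy score)."""
    return sum(1 for a, b in combinations(rows, 2)
                 for p, q in zip(a, b) if p == q and p != "-")

def strip_gap_columns(rows):
    """Drop columns that contain only gaps."""
    keep = [c for c in range(len(rows[0])) if any(r[c] != "-" for r in rows)]
    return ["".join(r[c] for c in keep) for r in rows]

def toy_realign(rest, seq):
    """Hypothetical realigner: best placement of seq into the columns of rest."""
    width = max(len(rest[0]), len(seq))
    rest = [r.ljust(width, "-") for r in rest]   # pad if seq is longer
    best_row, best = None, -1
    for cols in combinations(range(width), len(seq)):
        row = ["-"] * width
        for c, ch in zip(cols, seq):
            row[c] = ch
        row = "".join(row)
        s = match_score(rest + [row])
        if s > best:
            best_row, best = row, s
    return rest, best_row

def refine(aln, realign=toy_realign, score=match_score, max_rounds=10):
    """Remove each row in turn, realign it to the rest, keep strict improvements."""
    aln = list(aln)
    for _ in range(max_rounds):
        improved = False
        for j in range(len(aln)):
            rest = strip_gap_columns(aln[:j] + aln[j + 1:])
            new_rest, new_row = realign(rest, aln[j].replace("-", ""))
            cand = new_rest[:j] + [new_row] + new_rest[j:]
            if score(cand) > score(aln):
                aln, improved = cand, True
        if not improved:
            break
    return aln
```

Because the score can only go up and is bounded, the loop stops after finitely many accepted changes.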
Iterative Refinement
For each sequence y:
1. Remove y
2. Realign y (while rest fixed)
[Figure: alignment path in a cube with axes x, y, z; the x,z projection is fixed and y is allowed to vary]
Iterative Refinement
Example: align (x,y), (z,w), (xy, zw):
x: GAAGTTA
y: GAC-TTA
z: GAACTGA
w: GTACTGA
After realigning y:
x: GAAGTTA
y: G-ACTTA   + 3 matches
z: GAACTGA
w: GTACTGA
Iterative Refinement
Example not handled well:
x: GAAGTTA
y1: GAC-TTA
y2: GAC-TTA
y3: GAC-TTA
z: GAACTGA
w: GTACTGA
Realigning any single yi changes nothing
Restricted MDP
Here is another way to improve a multiple alignment:
1. Construct progressive multiple alignment m
2. Run MDP, restricted to radius R from m
Running Time: O(2^N R^(N-1) L)
Restricted MDP
Run MDP, restricted to radius R from m
[Figure: alignment cube with axes x, y, z]
Running Time: O(2^N R^(N-1) L)
Restricted MDP
x: GAAGTTA
y1: GAC-TTA
y2: GAC-TTA
y3: GAC-TTA
z: GAACTGA
w: GTACTGA
• Within radius 1 of the optimal, so Restricted MDP will fix it.
Optional refinement steps in MLAGAN
• Limited-area iterative refinement
• Radius-r 3-sequence refinement on each node of the tree