CS 5263 Bioinformatics
Multiple Sequence Alignment
• Motivation:
  – A faint similarity between two sequences becomes very significant if it is present in many sequences
• Definition:
  – Given N sequences x1, x2, …, xN: insert gaps (-) in each sequence xi such that
    • All sequences have the same length L
    • The score of the alignment is maximum
• Two issues:
  – How to score an alignment?
  – How to find a (nearly) optimal alignment?
Scoring function - first assumption
• Columns are independent
  – As in pair-wise alignment
• Therefore, the score of an alignment is the sum of all columns
• Need to decide how to score a single column
Scoring function (cont’d)
• Ideally:
  – An n-dimensional matrix, where n is the number of sequences
  – E.g. an entry (A, C, C, G, -) for aligning 5 sequences
  – Total number of parameters: (k+1)^n, where k is the alphabet size
• Direct estimation of such scores is difficult
  – Too many parameters to estimate
  – Even more difficult if we need to consider phylogenetic relationships
[Figure: phylogenetic tree (or evolution tree) relating sequences x, y, z, w, v]
Scoring Function (cont’d)
• Compromises:
  – Compute from pair-wise scores
    • Option 1: Based on the sum of all pair-wise scores
    • Option 2: Based on scores against a consensus sequence
  – Other options
    • Consider tree topology explicitly
    • Information-theory-based score
    • Difficult to optimize
Scoring Function: Sum Of Pairs
Definition: Induced pairwise alignment
A pairwise alignment induced by the multiple alignment
Example:
  x: AC-GCGG-C
  y: AC-GC-GAG
  z: GCCGC-GAG

Induces (columns in which both sequences have a gap are dropped):
  x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
  y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG
Sum Of Pairs (cont’d)
• The sum-of-pairs (SP) score of an alignment is the sum of the scores of all induced pairwise alignments
$S(m) = \sum_{k<l} s(m^k, m^l)$
$s(m^k, m^l)$: score of the induced alignment of sequences k and l
Example:
  x: AC-GCGG-C
  y: AC-GC-GAG
  z: GCCGC-GAG

Scoring matrix:
      A   C   G   T   -
  A   1  -1  -1  -1  -1
  C  -1   1  -1  -1  -1
  G  -1  -1   1  -1  -1
  T  -1  -1  -1   1  -1
  -  -1  -1  -1  -1   0

Column scores, e.g.:
  Column 1 (A,A,G): (A,A) + (A,G) x 2 = -1
  Column 4 (G,G,G): (G,G) x 3 = 3
  Column 8 (-,A,A): (-,A) x 2 + (A,A) = -1
Total score = (-1) + 3 + (-2) + 3 + 3 + (-2) + 3 + (-1) + (-1) = 5
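The column-by-column calculation can be reproduced with a few lines of Python. This is a minimal sketch: the function names (`s`, `column_score`, `sp_score`) are illustrative, and the pair scores are hard-coded from the matrix in this example (match 1, gap against gap 0, everything else -1).

```python
from itertools import combinations

def s(a: str, b: str) -> int:
    """Pair score from the example matrix: match 1, gap-gap 0, otherwise -1."""
    if a == "-" and b == "-":
        return 0
    return 1 if a == b else -1

def column_score(column) -> int:
    """Sum of the scores of all pairs of symbols in one alignment column."""
    return sum(s(a, b) for a, b in combinations(column, 2))

def sp_score(rows) -> int:
    """Sum-of-pairs score: columns are independent, so add up the column scores."""
    return sum(column_score(col) for col in zip(*rows))

alignment = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]
per_column = [column_score(col) for col in zip(*alignment)]
total = sp_score(alignment)
print(per_column, total)
```

Summing over columns gives the same value as summing the scores of the three induced pairwise alignments, because a gap-gap column contributes 0.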
Sum Of Pairs (cont’d)
• Drawback: no evolutionary characterization
  – Every sequence is treated as derived from all others
• Heuristic way to incorporate an evolution tree
  – Weighted Sum of Pairs:
    $S(m) = \sum_{k<l} w_{kl} \, s(m^k, m^l)$
    $w_{kl}$: weight decreasing with distance
[Figure: evolution tree with leaves Human, Mouse, Chicken, Duck]
Consensus score
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
CAG-CTATCAC--GACCGC----TCGATTTGCTCGAC

Consensus sequence:
CAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

• Find the optimal consensus string m* that maximizes
$S(m) = \sum_i s(m^*, m^i)$
$s(m^*, m^i)$: score of the pairwise alignment between the consensus m* and sequence i
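A consensus score can be sketched in Python under two assumptions that the slide does not state: the consensus is compared with each row column by column (so it has the same length as the alignment), and pair scores are the simple match 1 / gap-gap 0 / otherwise -1 scheme used in the sum-of-pairs example. Because columns are scored independently, the best consensus symbol can then be picked one column at a time. All names here are illustrative.

```python
ALPHABET = "ACGT-"  # assumed DNA alphabet plus the gap symbol

def s(a: str, b: str) -> int:
    """Assumed pair score: match 1, gap-gap 0, otherwise -1."""
    if a == "-" and b == "-":
        return 0
    return 1 if a == b else -1

def consensus_score(consensus: str, rows) -> int:
    """Sum over rows of the column-wise score between the consensus and the row."""
    return sum(s(c, r) for row in rows for c, r in zip(consensus, row))

def best_consensus(rows) -> str:
    """Choose, for each column, the symbol that maximizes that column's score."""
    return "".join(
        max(ALPHABET, key=lambda c: sum(s(c, r) for r in col))
        for col in zip(*rows)
    )

rows = ["AC", "AC", "AG"]
m_star = best_consensus(rows)
print(m_star, consensus_score(m_star, rows))
```

Ties between symbols are broken by the order of `ALPHABET`, which is an arbitrary choice of this sketch.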
Multiple Sequence Alignments Algorithms
• Can also be global or local
  – We only talk about global for now
• A simple method
  – Do pairwise alignment between all pairs
  – Combine the pairwise alignments into a single multiple alignment
  – Is this going to work?
Compatible pairwise alignments
Sequences:
  AAAATTTT
  TTTTGGGG
  AAAAGGGG

Pairwise alignments:
  AAAATTTT----    AAAATTTT----    ----TTTTGGGG
  ----TTTTGGGG    AAAA----GGGG    AAAA----GGGG

Combined multiple alignment:
  AAAATTTT----
  ----TTTTGGGG
  AAAA----GGGG
Incompatible pairwise alignments
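Sequences:
  AAAATTTT
  TTTTGGGG
  GGGGAAAA

Pairwise alignments:
  AAAATTTT----    ----AAAATTTT    TTTTGGGG----
  ----TTTTGGGG    GGGGAAAA----    ----GGGGAAAA

Combined multiple alignment: ? (no single multiple alignment induces all three pairwise alignments)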
Multidimensional Dynamic Programming (MDP)
Generalization of Needleman-Wunsch:
• Find the longest path in a high-dimensional cube
  – As opposed to a two-dimensional grid
• Uses an N-dimensional matrix
  – As opposed to a two-dimensional array
• Entry F(i1, …, iN) represents the score of the optimal alignment of s1[1..i1], …, sN[1..iN]
$F(i_1, i_2, \ldots, i_N) = \max_{\text{nbr}} \big( F(\text{nbr}) + S(\text{current}) \big)$, where the maximum runs over all neighbors of the cell
• Example: in 3D (three sequences):
• $2^3 - 1 = 7$ neighbors/cell
$$
F(i,j,k) = \max \begin{cases}
F(i-1,j-1,k-1) + S(x_i, x_j, x_k), \\
F(i-1,j-1,k) + S(x_i, x_j, -), \\
F(i-1,j,k-1) + S(x_i, -, x_k), \\
F(i,j-1,k-1) + S(-, x_j, x_k), \\
F(i-1,j,k) + S(x_i, -, -), \\
F(i,j-1,k) + S(-, x_j, -), \\
F(i,j,k-1) + S(-, -, x_k)
\end{cases}
$$
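The three-sequence recurrence can be written out directly. The sketch below assumes the column score S is the sum-of-pairs score with the match 1 / gap-gap 0 / otherwise -1 pair scores from the earlier example; it returns only the optimal score, not the alignment, and the names `msa3_score` and `MOVES` are illustrative.

```python
from itertools import combinations, product

def s(a: str, b: str) -> int:
    """Assumed pair score: match 1, gap-gap 0, otherwise -1."""
    if a == "-" and b == "-":
        return 0
    return 1 if a == b else -1

def column_score(col) -> int:
    return sum(s(a, b) for a, b in combinations(col, 2))

# The 2^3 - 1 = 7 predecessor offsets: each sequence either advances (1) or gets a gap (0).
MOVES = [d for d in product((0, 1), repeat=3) if any(d)]

def msa3_score(x: str, y: str, z: str):
    """Optimal sum-of-pairs score of a global alignment of three sequences."""
    neg_inf = float("-inf")
    F = [[[neg_inf] * (len(z) + 1) for _ in range(len(y) + 1)]
         for _ in range(len(x) + 1)]
    F[0][0][0] = 0
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            for k in range(len(z) + 1):
                for di, dj, dk in MOVES:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue  # predecessor lies outside the cube
                    col = (x[i - 1] if di else "-",
                           y[j - 1] if dj else "-",
                           z[k - 1] if dk else "-")
                    F[i][j][k] = max(F[i][j][k], F[pi][pj][pk] + column_score(col))
    return F[len(x)][len(y)][len(z)]

best = msa3_score("ACGCGGC", "ACGCGAG", "GCCGCGAG")
print(best)
```

The three inputs are the ungapped sequences from the sum-of-pairs example, so the optimum found here is at least the score of 5 computed for that alignment.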
Multidimensional Dynamic Programming (MDP)
[Figure: cube showing cell (i,j,k) and its seven neighbors (i-1,j-1,k-1), (i-1,j-1,k), (i-1,j,k-1), (i,j-1,k-1), (i-1,j,k), (i,j-1,k), (i,j,k-1)]
Multidimensional Dynamic Programming (MDP)
Running Time:
1. Size of matrix: $L^N$,
   where L = length of each sequence
         N = number of sequences
2. Neighbors/cell: $2^N - 1$
Therefore: $O(2^N L^N)$
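To get a feel for these numbers, the two factors can be multiplied out for a few values of N. This is only illustrative arithmetic; the helper name `mdp_cost` and the choice of L = 250 are assumptions made for the example.

```python
def mdp_cost(n_seqs: int, length: int):
    """Return (cells, neighbors per cell, total cell updates) for full MDP."""
    cells = length ** n_seqs          # size of the N-dimensional matrix, L^N
    neighbors = 2 ** n_seqs - 1       # predecessors examined per cell, 2^N - 1
    return cells, neighbors, cells * neighbors

for n in (2, 3, 6):
    cells, neighbors, updates = mdp_cost(n, 250)
    print(f"N={n}: {cells:.2e} cells x {neighbors} neighbors = {updates:.2e} updates")
```

Going from 2 to 6 sequences of length 250 raises the cell count from 62,500 to roughly 2.4 x 10^14, which is why full MDP is impractical beyond a handful of sequences.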
Faster MDP
• Carrillo & Lipman, 1988
  – Branch and bound
  – Other heuristics
• Implemented in a tool called MSA
• Practical for about 6 sequences of length about 200-300.
Faster MDP
• Basic idea: bounds on the optimal score of a multiple alignment can be pre-computed
  – Upper bound: sum of the optimal pair-wise alignment scores, i.e.
    $S(m) = \sum_{k<l} s(m^k, m^l) \le \sum_{k<l} s(k, l)$
    where m is the optimal multiple alignment, $s(m^k, m^l)$ is the score of the alignment between k and l induced by m, and $s(k, l)$ is the score of the optimal alignment between k and l
  – Lower bound: score computed by any approximate algorithm (such as the ones we’ll talk about next)
  – For any partial path, if S_current + S_prospective < lower bound (S_prospective being the best score still attainable from that point), we can give up that path
  – Guarantees optimality
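The upper bound can be checked numerically on the earlier three-sequence example. The sketch below assumes linear gap costs with the match 1 / gap-gap 0 / otherwise -1 pair scores; `nw_score`, `induced_score`, and `upper_bound` are illustrative names, and the code computes only the bound, not the branch-and-bound search itself.

```python
from itertools import combinations

def s(a: str, b: str) -> int:
    """Assumed pair score: match 1, gap-gap 0, otherwise -1."""
    if a == "-" and b == "-":
        return 0
    return 1 if a == b else -1

def nw_score(x: str, y: str) -> int:
    """Optimal global pairwise alignment score (Needleman-Wunsch, gap cost -1)."""
    prev = [-j for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [-i] + [0] * len(y)
        for j in range(1, len(y) + 1):
            cur[j] = max(prev[j - 1] + s(x[i - 1], y[j - 1]),  # align x_i with y_j
                         prev[j] - 1,                          # x_i against a gap
                         cur[j - 1] - 1)                       # y_j against a gap
        prev = cur
    return prev[-1]

def induced_score(row_k: str, row_l: str) -> int:
    """Score of the pairwise alignment induced by two rows of a multiple alignment."""
    return sum(s(a, b) for a, b in zip(row_k, row_l))

def upper_bound(seqs) -> int:
    """Sum of optimal pairwise scores: an upper bound on any SP score."""
    return sum(nw_score(a, b) for a, b in combinations(seqs, 2))

msa = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]
seqs = [row.replace("-", "") for row in msa]
sp = sum(induced_score(a, b) for a, b in combinations(msa, 2))
bound = upper_bound(seqs)
print(sp, bound)
```

Each induced pairwise alignment is one candidate pairwise alignment, so it can never score higher than the optimal one for that pair; summing over pairs gives the inequality.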
Progressive Alignment
• Multiple Alignment is NP-hard
• Most used heuristic: Progressive Alignment
Algorithm:
1. Align two of the sequences xi, xj
2. Fix that alignment
3. Align a third sequence xk to the alignment xi,xj
4. Repeat until all sequences are aligned
Running Time: $O(N L^2)$
  Each alignment takes $O(L^2)$
  Repeat N times
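Steps 1 to 4 can be sketched as follows. The sketch makes several assumptions not stated on the slide: sequences are added in the order given (no guide tree), pair scores are match 1 / gap-gap 0 / otherwise -1, and a column of one alignment is scored against a column of another by summing the pair scores across the two groups. The names `align_profiles` and `progressive_align` are illustrative.

```python
from itertools import product

GAP = "-"

def s(a: str, b: str) -> int:
    """Assumed pair score: match 1, gap-gap 0, otherwise -1."""
    if a == GAP and b == GAP:
        return 0
    return 1 if a == b else -1

def cross_score(col_a, col_b) -> int:
    """Sum of pair scores between the symbols of two columns."""
    return sum(s(a, b) for a, b in product(col_a, col_b))

def align_profiles(A, B):
    """Globally align two fixed alignments (lists of equal-length rows)."""
    cols_a, cols_b = list(zip(*A)), list(zip(*B))
    n, m = len(cols_a), len(cols_b)
    gap_a, gap_b = (GAP,) * len(A), (GAP,) * len(B)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[(0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0], back[i][0] = F[i - 1][0] + cross_score(cols_a[i - 1], gap_b), (1, 0)
    for j in range(1, m + 1):
        F[0][j], back[0][j] = F[0][j - 1] + cross_score(gap_a, cols_b[j - 1]), (0, 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j], back[i][j] = max(
                (F[i - 1][j - 1] + cross_score(cols_a[i - 1], cols_b[j - 1]), (1, 1)),
                (F[i - 1][j] + cross_score(cols_a[i - 1], gap_b), (1, 0)),
                (F[i][j - 1] + cross_score(gap_a, cols_b[j - 1]), (0, 1)),
            )
    # Trace back from the last cell, emitting merged columns in reverse order.
    merged, i, j = [], n, m
    while (i, j) != (0, 0):
        di, dj = back[i][j]
        merged.append((cols_a[i - 1] if di else gap_a) + (cols_b[j - 1] if dj else gap_b))
        i, j = i - di, j - dj
    merged.reverse()
    return ["".join(col[r] for col in merged) for r in range(len(A) + len(B))]

def progressive_align(seqs):
    """Add one sequence at a time to a growing, frozen alignment."""
    msa = [seqs[0]]
    for seq in seqs[1:]:
        msa = align_profiles(msa, [seq])
    return msa

inputs = ["GAAGTT", "GACTT", "GAACTG", "GTACTG"]
msa = progressive_align(inputs)
print("\n".join(msa))
```

Because `align_profiles` accepts two alignments of any size, the same function could also merge two sub-alignments such as (xy) and (zw) when a guide tree dictates that order.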
Progressive Alignment
• When the evolutionary tree is known:
  – Align closest first, in the order of the tree
Example:
Order of alignments:
1. (x,y)
2. (z,w)
3. (xy, zw)
[Figure: evolutionary tree with leaves x, y, z, w, in which x pairs with y and z pairs with w]
Progressive Alignment: CLUSTALW
CLUSTALW: most popular multiple protein alignment tool
Algorithm:
1. Find all dij: alignment distance between xi and xj
   • High alignment score => short distance
2. Construct a tree
   (Neighbor-joining hierarchical clustering. Will discuss in future)
3. Align nodes in order of decreasing similarity
   + a large number of heuristics
CLUSTALW example
• S1 ALSK
• S2 TNSD
• S3 NASK
• S4 NTSD

Distance matrix:
      s1  s2  s3  s4
  s1   0   9   4   7
  s2       0   8   3
  s3           0   7
  s4               0

[Figure: guide tree joining s1 with s3 and s2 with s4]

Align s1 and s3:
  -ALSK
  NA-SK
Align s2 and s4:
  -TNSD
  NT-SD
Align the two alignments:
  -ALSK
  -TNSD
  NA-SK
  NT-SD
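The order in which the example groups its sequences can be derived from the distance matrix. Note that the sketch below uses plain average-linkage agglomerative clustering as a stand-in, not the neighbor-joining method that CLUSTALW uses; on this small matrix it yields the same grouping as the tree in the example. The function name `merge_order` and the dictionary layout are assumptions of the sketch.

```python
from itertools import combinations

# Distance matrix from the example, keyed by unordered pairs of sequence names.
DIST = {
    frozenset(("s1", "s2")): 9, frozenset(("s1", "s3")): 4, frozenset(("s1", "s4")): 7,
    frozenset(("s2", "s3")): 8, frozenset(("s2", "s4")): 3, frozenset(("s3", "s4")): 7,
}

def merge_order(names, dist):
    """Repeatedly merge the two closest clusters (average linkage)."""
    clusters = [(name,) for name in names]

    def cluster_distance(a, b):
        # Mean distance over all cross-cluster pairs of sequences.
        return sum(dist[frozenset((p, q))] for p in a for q in b) / (len(a) * len(b))

    order = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda ab: cluster_distance(*ab))
        order.append((a, b))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return order

order = merge_order(["s1", "s2", "s3", "s4"], DIST)
for a, b in order:
    print(a, "+", b)
```

The closest pair (s2, s4 at distance 3) is merged first, then s1 and s3 (distance 4), and finally the two pairs, matching "align nodes in order of decreasing similarity".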
Iterative Refinement
Problems with progressive alignment:
• Depends on pair-wise alignments
• If sequences are very distantly related, there is a much higher likelihood of errors
• Initial alignments are “frozen” even when new evidence arrives
Example:
  x: GAAGTT
  y: GAC-TT    (frozen!)
  z: GAACTG
  w: GTACTG
Now clear: the correct y should be GA-CTT
Iterative Refinement
Algorithm (Barton-Sternberg):
1. Align the most similar xi, xj
2. Align the xk most similar to (xixj)
3. Repeat 2 until (x1…xN) are aligned
4. For j = 1 to N,
   remove xj, and realign it to x1…xj-1xj+1…xN
5. Repeat 4 until convergence
(Steps 1-3 are progressive alignment)
Iterative Refinement (cont’d)
For each sequence y
1. Remove y
2. Realign y (while the rest is fixed)
[Figure: alignment path in a cube with axes x, y, z; the x,z projection is fixed while y is allowed to vary]
Note: Guaranteed to converge (why?)
Running time: $O(k N L^2)$, k: number of iterations
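The remove-and-realign loop can be sketched as below. The assumptions are labeled here because the slides do not fix them: the objective is the sum-of-pairs score with match 1 / gap-gap 0 / otherwise -1 pair scores, a realignment is kept only if it strictly improves that score, and the loop stops after a full pass without improvement (or after `max_rounds`, a hypothetical safety limit). All function names are illustrative.

```python
from itertools import combinations

GAP = "-"

def s(a, b):
    """Assumed pair score: match 1, gap-gap 0, otherwise -1."""
    if a == GAP and b == GAP:
        return 0
    return 1 if a == b else -1

def sp_score(rows):
    return sum(s(a, b) for col in zip(*rows) for a, b in combinations(col, 2))

def drop_gap_columns(rows):
    """Remove columns that contain only gaps (left behind when a row is removed)."""
    cols = [c for c in zip(*rows) if any(ch != GAP for ch in c)]
    return ["".join(c[r] for c in cols) for r in range(len(rows))]

def align_to_profile(profile, seq):
    """Optimally align seq to a fixed alignment; returns (profile rows, seq row)."""
    cols = list(zip(*profile))
    n, m = len(cols), len(seq)
    gap_col = (GAP,) * len(profile)
    vs = lambda col, ch: sum(s(a, ch) for a in col)  # column against one symbol
    F = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[(0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0], back[i][0] = F[i - 1][0] + vs(cols[i - 1], GAP), (1, 0)
    for j in range(1, m + 1):
        F[0][j], back[0][j] = F[0][j - 1] + vs(gap_col, seq[j - 1]), (0, 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j], back[i][j] = max(
                (F[i - 1][j - 1] + vs(cols[i - 1], seq[j - 1]), (1, 1)),
                (F[i - 1][j] + vs(cols[i - 1], GAP), (1, 0)),
                (F[i][j - 1] + vs(gap_col, seq[j - 1]), (0, 1)),
            )
    out, i, j = [], n, m
    while (i, j) != (0, 0):
        di, dj = back[i][j]
        out.append((cols[i - 1] if di else gap_col, seq[j - 1] if dj else GAP))
        i, j = i - di, j - dj
    out.reverse()
    new_profile = ["".join(col[r] for col, _ in out) for r in range(len(profile))]
    return new_profile, "".join(ch for _, ch in out)

def refine(rows, max_rounds=10):
    """Remove each row in turn, realign it to the rest, keep strict improvements."""
    rows = list(rows)
    for _ in range(max_rounds):
        improved = False
        for j in range(len(rows)):
            rest = drop_gap_columns(rows[:j] + rows[j + 1:])
            new_rest, new_row = align_to_profile(rest, rows[j].replace(GAP, ""))
            candidate = new_rest[:j] + [new_row] + new_rest[j:]
            if sp_score(candidate) > sp_score(rows):
                rows, improved = candidate, True
        if not improved:
            break
    return rows

start = ["GAAGTTA", "GAC-TTA", "GAACTGA", "GTACTGA"]
refined = refine(start)
print("\n".join(refined), sp_score(start), sp_score(refined))
```

Under these assumptions the score never decreases from one accepted step to the next and is bounded above, which is one way to see why such a loop must terminate.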
Iterative Refinement
Example: align (x,y), (z,w), (xy, zw):
  x: GAAGTTA
  y: GAC-TTA
  z: GAACTGA
  w: GTACTGA
After realigning y:
  x: GAAGTTA
  y: G-ACTTA    (+ 3 matches)
  z: GAACTGA
  w: GTACTGA
Iterative Refinement
• Example not handled well:
  x:  GAAGTTA
  y1: GAC-TTA
  y2: GAC-TTA
  y3: GAC-TTA
  z:  GAACTGA
  w:  GTACTGA
Realigning any single yi changes nothing
Restricted MDP
• Similar to bounded DP in pair-wise alignment
1. Construct a progressive multiple alignment m
2. Run MDP, restricted to radius R from m
Running Time: $O(2^N R^{N-1} L)$
[Figure: alignment path in a cube with axes x, y, z]
Restricted MDP
  x:  GAAGTTA
  y1: GAC-TTA
  y2: GAC-TTA
  y3: GAC-TTA
  z:  GAACTGA
  w:  GTACTGA
• This alignment is within radius 1 of the optimal, so restricted MDP will fix it.
Other approaches
• Statistical learning methods
  – Profile Hidden Markov Models
  – Will discuss in future lectures
• Consistency-based methods
  – Still rely on pairwise alignment
    • But consider a third sequence when aligning two sequences
    • If block A in seq x aligns to block B in seq y, and both align to block C in seq z, we have higher confidence that the alignment between A and B is reliable
    • Essentially: change the scoring system according to consistency
    • Then apply DP as in other approaches
  – Pioneered by a tool called T-Coffee
Multiple alignment tools
• Clustal W (Thompson, 1994)
  – Most popular
• T-Coffee (Notredame, 2000)
  – Another popular tool
  – Consistency-based
  – Slower than Clustal W, but generally more accurate for more distantly related sequences
• MUSCLE (Edgar, 2004)
  – Iterative refinement
  – More efficient than most others
• DIALIGN (Morgenstern, 1998, 1999, 2005)
  – “local”
• Align-m (Walle, 2004)
  – “local”
• PROBCONS (Do, 2004)
  – Probabilistic consistency-based
  – Best accuracy on benchmarks
• ProDA (Phuong, 2006)
  – Allows repeated and shuffled regions
In summary
• Multiple alignment scoring functions
  – Sum of pairs
  – Other functions exist, but are less used
• Multiple alignment algorithms:
  – MDP
    • Optimal
    • Too slow
    • Branch & Bound doesn’t solve the problem entirely
  – Progressive alignment: Clustal W
  – Iterative refinement
  – Restricted MDP
  – Consistency-based
  (The last four are heuristic.)