Multiple Sequence Alignment
With thanks to Eric Stone and Steffen Heber, North Carolina State University
3
Definition: Multiple sequence alignment
alignment:
ATTTG-
ATTTGC
AT-TGC
no alignment (rows differ in length):
ATTTG
ATTTGC
ATT-GC
no alignment (one column contains only gaps):
ATTT-G-
ATTT-GC
AT-T-GC
• Given a set of sequences, a multiple sequence alignment is an assignment of gap characters such that
– the resulting sequences have the same length
– no column contains only gaps
6
Application: Characterize protein families
7
Application: Discover conserved pattern
• A faint similarity between two sequences may become detectable if present in many
8
Application: Recover phylogenetic tree
Pig   ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
Dog   --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
Human KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
Mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
***. ::: .: .. . : . . * . *: *
[Figure: phylogenetic tree with leaves pig, dog, human, mouse]
9
Multiple sequence alignment (MSA)
gi|89271|pir||A39486           PAKFKMKYWGVASFLQKGNDDHWIIDTDYDTYAAQYSCRLQNLDGTCADS
gi|132403|sp|P18902|RETB_BOVIN PAKFKMKYWGVASFLQKGNDDHWIIDTDYETFAVQYSCRLLNLDGTCADS
gi|5803139|ref|NP_006735.1|    PAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADS
gi|6174963|sp|Q00724|RETB_MOUS PAKFKMKYWGVASFLQRGNDDHWIIDTDYDTFALQYSCRLQNLDGTCADS
gi|132407|sp|P04916|RETB_RAT   PAKFKMKYWGVASFLQRGNDDHWIIDTDYDTFALQYSCRLQNLDGTCADS
• Generalizes pairwise sequence alignment (PSA)
– Multiple simply means three or more sequences
• In PSA, paired residues assumed to be homologous
– In MSA, columns of residues assumed to be homologous
• Are there fundamental differences between MSA and PSA?
– What distinguishes MSA from PSA?
10
Biological distinction
• Homology (common ancestry) in PSA is symmetric
• Phylogeny renders homology in MSA asymmetric
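[Figure: for the pair …TTACG… / …TTGCG…, a change A to G is equivalent (=) to a change G to A; for the four sequences …TTACG…, …TTACG…, …TTGCG…, …TTGCG… placed on a tree, A to G and G to A are not equivalent (≠)]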
11
MSA makes a statement about homology
• Like pairwise sequence alignment
– Multiple sequence alignment asserts the homology of its columns
• Unlike pairwise sequence alignment, interpretation requires a phylogeny
[Figure: tree relating the sequences NYLS, NYLS and NFLS]
12
Statistical distinction
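[Figure: residues a and b under the match model M and under the random model R; for MSA, three residues a, b, c]
$$\frac{\Pr(a, b \mid M)}{\Pr(a, b \mid R)} = \frac{p_{ab}}{q_a q_b}$$
• PSA compares a "match model" M to a "random model" R
• In MSA, how do we define M without phylogeny?
• How were PSA match probabilities $p_{ab}$ obtained?
– Can something similar be done for MSA?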
13
Computational distinction
• Even for PSA, exhaustive search is exponential
– But optimal PSA can be found by DP in O(mn)
• Clearly MSA has higher complexity than PSA
– But by how much?
• Is finding the optimal MSA still feasible?
[Figure: pairwise DP cell F(i, j), reached from F(i-1, j-1) with score s(xi, yj), or from F(i-1, j) or F(i, j-1) with gap penalty -d]
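The pairwise recurrence in the figure can be sketched as follows. This is a minimal illustration, assuming a linear gap penalty d and a toy substitution function (`toy_score` is hypothetical, not a real scoring matrix); it returns only the optimal score, not the alignment.

```python
def needleman_wunsch_score(x, y, s, d):
    """Optimal global alignment score of x and y.

    s(a, b) is the substitution score, d the (positive) linear gap penalty.
    Fills an (m+1) x (n+1) table, hence O(mn) time and space.
    """
    m, n = len(x), len(y)
    # F[i][j] = best score for aligning the prefixes x[:i] and y[:j]
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = -d * i  # x[:i] aligned against gaps only
    for j in range(1, n + 1):
        F[0][j] = -d * j  # y[:j] aligned against gaps only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # pair xi with yj
                F[i - 1][j] - d,                          # xi against a gap
                F[i][j - 1] - d,                          # yj against a gap
            )
    return F[m][n]


def toy_score(a, b):
    """Hypothetical substitution score: +1 for identity, -1 otherwise."""
    return 1 if a == b else -1
```

For example, `needleman_wunsch_score("ACGT", "AGT", toy_score, 2)` pairs three identical residues and pays for one gap.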
14
The problem of global MSA
• Given k sequences: x1, x2,…, xk
• Insert gaps (-) in each sequence xi such that
– All sequences have the same length L
– Score of the global map is maximum
• MSA is more sensitive than PSA
– A faint similarity between two sequences may become detectable if present in many
• More sequences may increase alignment quality
– But the cost is added complexity
15
How can alignment columns be scored?
• In pairwise sequence alignment
– Scores quantify the exchangeability of the residues/gap in the pair
• In multiple sequence alignment
– A similar treatment is more complicated and requires use of phylogeny
• One solution:
– Evaluate MSA through its constituent PSAs
• These PSAs are called induced pairwise alignments
16
Induced pairwise alignments
• Example MSA:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG
• Induces three PSAs (columns in which both rows have a gap are dropped):
x: ACGCGG-C     x: AC-GCGG-C     y: AC-GCGAG
y: ACGC-GAG     z: GCCGC-GAG     z: GCCGCGAG
• The MSA can be scored by summing over the induced PSAs
– This is called the “Sum-of-pairs” approach
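Extracting an induced pairwise alignment amounts to taking two rows of the MSA and dropping every column in which both rows have a gap. A minimal sketch using the example MSA above (the function name is illustrative):

```python
def induced_pairwise(row_a, row_b):
    """Project an MSA onto two of its rows.

    Columns where both rows hold a gap carry no information about the
    pair, so they are removed; all other columns are kept in order.
    """
    kept = [(a, b) for a, b in zip(row_a, row_b) if not (a == "-" and b == "-")]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


msa = {"x": "AC-GCGG-C", "y": "AC-GC-GAG", "z": "GCCGC-GAG"}
xy = induced_pairwise(msa["x"], msa["y"])
xz = induced_pairwise(msa["x"], msa["z"])
yz = induced_pairwise(msa["y"], msa["z"])
```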
17
Example: Sum-of-pairs score
Substitution scores (BLOSUM 60):
     F    Y    G    D
F    5   -2   -2   -1
Y         7    1   -5
G              4   -3
D                   5
Gap penalty: -8

MSA:
x: F-G
y: F-G
z: FYD

Induced PSAs and their scores:
x: FG      x: F-G      y: F-G
y: FG      z: FYD      z: FYD
5 + 4 = 9
5 - 8 - 3 = -6
5 - 8 - 3 = -6
• Sum-of-pairs score: 9 - 6 - 6 = -3
– What is the computational complexity?
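The worked example can be checked with a short script. This is a sketch that hard-codes the score table and gap penalty from this slide; treating a gap-gap pair as contributing 0 is the same as dropping that column from the induced PSA.

```python
from itertools import combinations

# Upper-triangular score table from the slide.
SCORES = {
    ("F", "F"): 5, ("F", "Y"): -2, ("F", "G"): -2, ("F", "D"): -1,
    ("Y", "Y"): 7, ("Y", "G"): 1, ("Y", "D"): -5,
    ("G", "G"): 4, ("G", "D"): -3,
    ("D", "D"): 5,
}
GAP = -8


def pair_score(a, b):
    """Score one column of an induced pairwise alignment."""
    if a == "-" and b == "-":
        return 0  # column absent from the induced PSA
    if a == "-" or b == "-":
        return GAP
    return SCORES[(a, b)] if (a, b) in SCORES else SCORES[(b, a)]


def sum_of_pairs(rows):
    """Sum the scores of all k(k-1)/2 induced pairwise alignments."""
    return sum(
        pair_score(a, b)
        for r1, r2 in combinations(rows, 2)
        for a, b in zip(r1, r2)
    )


total = sum_of_pairs(["F-G", "F-G", "FYD"])
```

Each of the k(k-1)/2 pairs is scanned once over L columns, so scoring a given MSA this way takes O(k^2 L) time.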
18
Problem: Finding the optimal MSA
• Given k sequences: x1, x2,…, xk
• Sum-of-pairs score for any MSA of k sequences is
– Sum of scores of all k(k-1)/2 induced PSAs
• Seek MSA A which maximizes sum-of-pairs score
$$S(A) = \sum_{i<j} S(A_{ij})$$
where $S(A_{ij})$ is the score of $A_{ij}$, the PSA of sequences xi and xj induced by the MSA A
• Clearly exhaustive search is not an option
– Can we rely on dynamic programming?
19
Dynamic Programming
• Similar to pairwise alignments, multiple sequence alignments can be computed by dynamic programming
[Figure: DP matrix F in 2D (two sequences) and in 3D (three sequences)]
20
Generalized Needleman-Wunsch
• Given 3 sequences x, y, and z
• Main iteration loop:
F(i,j,k) = max{
F(i-1, j-1, k-1) + S(xi, yj, zk),
F(i-1, j-1, k  ) + S(xi, yj, - ),
F(i-1, j  , k-1) + S(xi, - , zk),
F(i-1, j  , k  ) + S(xi, - , - ),
F(i  , j-1, k-1) + S(- , yj, zk),
F(i  , j-1, k  ) + S(- , yj, - ),
F(i  , j  , k-1) + S(- , - , zk) }
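The three-sequence recurrence can be sketched directly. This is an illustration under stated assumptions: columns are scored by sum-of-pairs with hypothetical toy parameters (match +1, mismatch -1, residue-gap -2, gap-gap 0), and only the optimal score is returned.

```python
from itertools import combinations, product


def column_score(col, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of one column (toy parameters, assumed)."""
    total = 0
    for a, b in combinations(col, 2):
        if a == "-" and b == "-":
            continue  # gap-gap pairs contribute nothing
        if a == "-" or b == "-":
            total += gap
        else:
            total += match if a == b else mismatch
    return total


def align3_score(x, y, z):
    """Optimal SP score for three sequences via the 3D recurrence."""
    seqs = (x, y, z)
    # The 2^3 - 1 = 7 moves: which sequences contribute a residue.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    F = {(0, 0, 0): 0}
    # Lexicographic order guarantees predecessors are filled first.
    for cell in product(*(range(len(s) + 1) for s in seqs)):
        if cell == (0, 0, 0):
            continue
        best = float("-inf")
        for move in moves:
            prev = tuple(c - m for c, m in zip(cell, move))
            if min(prev) < 0:
                continue  # move would step outside the cube
            col = tuple(
                s[c - 1] if m else "-" for s, c, m in zip(seqs, cell, move)
            )
            best = max(best, F[prev] + column_score(col))
        F[cell] = best
    return F[tuple(len(s) for s in seqs)]
```

The table has one cell per triple of prefix lengths, which is where the n^k space and 2^k - 1 neighbors per cell come from.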
21
Analysis of algorithm
• Given k sequences of length n:
– Space for matrix: $O(n^k)$
– Neighbors/cell: $2^k - 1$
– Time to compute SP score: $O(k^2)$
– Overall runtime: $O(k^2 2^k n^k)$
• Implications
– Can align about 7 relatively short (length 200 - 300) sequences in a reasonable amount of time
• $2^7 \cdot 200^7$ > 1,600,000,000,000,000,000
– Exact optimality is generally not attainable
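The size of that estimate is easy to reproduce. A small sketch, where `dp_work` is a hypothetical helper that, like the figure quoted above, omits the k^2 factor for scoring each column:

```python
def dp_work(k, n):
    """Rough count 2^k * n^k: neighbors per cell times number of cells."""
    return (2 ** k) * (n ** k)


seven_short = dp_work(7, 200)   # seven sequences of length 200
pairwise = dp_work(2, 200)      # two sequences of length 200, for contrast
```

Going from two to seven sequences of the same length multiplies the work by more than thirteen orders of magnitude.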
22
Heuristics for multiple sequence alignment
• Exact optimality is too slow, even by dynamic programming
• Alternative:
– Seek good suboptimal solutions that are attainable in reasonable time
• Key questions:
– What is a “good” suboptimal solution?
– What is “reasonable” time?
• Heuristics focus on an intelligent reduction of search space
– Divide-and-conquer alignment
– Greedy alignment (progressive)
23
Divide-and-conquer alignment (DCA)
• Idea: Reduce search space for dynamic programming by cutting the sequences.
• Algorithm:
1. Cut sequences into fragments until fragments can be aligned by DP
2. Build multiple alignments by DP
3. Concatenate the resulting alignments
24
Reduction of search space
[Figure: three-dimensional DP search space with axes Sequence 1, Sequence 2, Sequence 3]
Cut points optimize: C = S_prefix + S_suffix - S_complete
25
Greater reduction of search space
26
Progressive alignment
• Idea:
– Build multiple sequence alignment from a series of pairwise alignments
• Strategy:
– Choose two sequences to align (optimally)
– Hold pairwise alignment fixed, treat as a new sequence, and iterate
• For n sequences:
– Requires n - 1 pairwise sequence alignments
• Does the order matter?
– What criteria are used to choose the sequences?
27
Guide tree
• Binary tree
– Leaves correspond to sequences
– Internal nodes represent alignments
– Root corresponds to final MSA
• The guide tree specifies the order of alignment
• Usually constructed from matrix of pairwise distances between sequences
[Figure: guide tree with leaves ATC, ATG, TCG, TCC; the internal nodes hold the alignments ATC/ATG and TCG/TCC; the root holds the MSA ATC-, ATG-, -TCG, -TCC]
28
Simple approach to distance matrix D
• Example sequences:
A ACGCGTTGGGCGATGGCAAC
B ACGCGTTGGGCGACGGTAAT
C ACGCATTGAATGATGATAAT
D ACACATTGAGTGATAATAAT
• Simple approach
– Count mismatches
• Pairwise distance matrix:
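The mismatch counts for the four example sequences can be computed with a few lines. A minimal sketch, assuming the sequences are already the same length and compared position by position (function names are illustrative):

```python
from itertools import combinations

SEQS = {
    "A": "ACGCGTTGGGCGATGGCAAC",
    "B": "ACGCGTTGGGCGACGGTAAT",
    "C": "ACGCATTGAATGATGATAAT",
    "D": "ACACATTGAGTGATAATAAT",
}


def mismatch_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(1 for p, q in zip(a, b) if p != q)


def distance_matrix(seqs):
    """Symmetric matrix of pairwise mismatch counts, as nested dicts."""
    D = {name: {name: 0} for name in seqs}
    for u, v in combinations(seqs, 2):
        D[u][v] = D[v][u] = mismatch_distance(seqs[u], seqs[v])
    return D


D = distance_matrix(SEQS)
for name in SEQS:
    print(name, [D[name][other] for other in SEQS])
```

A and B are close (3 mismatches), as are C and D (3), while every pair drawn from different groups differs at 6 to 8 positions.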
29
From pairwise distances to a tree
• Using this information, a tree can be drawn:
A ACGCGTTGGGCGATGGCAAC
B ACGCGTTGGGCGACGGTAAT
C ACGCATTGAATGATGATAAT
D ACACATTGAGTGATAATAAT
• Is it guaranteed that the distances exactly fit a tree?
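[Figure: unrooted tree; A and B join with branch lengths 2 and 1, C and D join with branch lengths 1 and 2, and the two pairs are connected by an internal branch of length 4]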
30
Guide tree
[Figure: guide tree over sequences 1 to 5]
31
Guide tree
[Figure: guide tree over sequences 1 to 5, with internal node labels added]
32
Guide tree
[Figure: guide tree over sequences 1 to 5, with further internal node labels added]
33
Guide tree
[Figure: guide tree over sequences 1 to 5, with all internal node labels shown]
34
Progressive alignment
• Follow branching order of guide tree to build MSA
• Problem: We may have to align
– Two sequences
– A sequence and an alignment
– Two alignments
[Figure: guide tree over sequences 1 to 5]
35
How to align two alignments?
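Idea: Dynamic programming; treat columns like single positions.
Example:
a = GTC      b = GTT
    GTA          GTT
[Figure: DP matrix F with the three moves for the last column (align a[3] and b[3]; align a[3] and gap; align b[3] and gap), shown with the candidate alignments GTC/GTA/GTT/GTT, GT-C/GT-A/GTT-/GTT-, and GTT-/GTT-/GT-T/GT-T]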
36
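Scoring an alignment of alignments
1 PEEKSAVTAL
2 GEEKAAVLAL
3 PADKTNVKAA
4 AADKTNVKAA

5 EGEWGLVLHV
6 AAEKTKIRSA

Score for aligning the column (T, L, K, K) of the first alignment with the column (V, I) of the second:
$$\frac{1}{8}\left[\, s(T,V) + s(T,I) + s(L,V) + s(L,I) + 2\,s(K,V) + 2\,s(K,I) \,\right]$$
• Average over all possibilities, possibly weighted
– Generalizes PSA of two sequences to two profiles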
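The averaging rule can be sketched as a function over two columns. This is an unweighted illustration that assumes neither column contains gaps; the substitution function `s` is supplied by the caller, and the `k_only` scorer below is purely hypothetical, chosen so the result is easy to check by hand.

```python
from itertools import product


def profile_column_score(col_a, col_b, s):
    """Average substitution score over all residue pairs, one residue
    taken from each column. With 4 and 2 residues this averages the
    8 pairs, counting repeated residues once per occurrence."""
    pairs = list(product(col_a, col_b))
    return sum(s(a, b) for a, b in pairs) / len(pairs)


def k_only(a, b):
    """Hypothetical scorer: 1 if the first residue is K, else 0."""
    return 1 if a == "K" else 0


score = profile_column_score("TLKK", "VI", k_only)
```

Because K occurs twice in the first column, its pairs with V and I are each counted twice, which matches the factor 2 in the score above.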
37
Complexity of progressive alignment
The time required to align k sequences of length n:
• For progressive alignment: $O(k^2 n^2)$
• Compare with dynamic programming: $O(k^2 2^k n^k)$
• BUT
– Is there any guarantee on the quality of the progressive MSA?
38
Progressive alignments with ClustalW
• Clustal is the most popular method of MSA
39
ClustalW: Overview
[Figure: pairwise alignments of all pairs (1 + 2, 1 + 3, 1 + 4, 2 + 3, 2 + 4, 3 + 4), then distance matrix, then guide tree, then progressive alignment]
1. Compute pairwise alignments (DP)
2. Convert similarities into distances
3. Build guide tree from distances by Neighbor Joining
4. Align with respect to guide tree
40
ClustalW server at EBI
http://www2.ebi.ac.uk/clustalw/
41
Input: Sequences in FASTA format
• ClustalW example:
– RBP protein sequences from five species:
– Human, mouse, rat, cow, pig
42
Generate all 10 PSAs
Input: Five RBP sequences

Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score: 84
Sequences (1:3) Aligned. Score: 84
Sequences (1:4) Aligned. Score: 91
Sequences (1:5) Aligned. Score: 92
Sequences (2:3) Aligned. Score: 99
Sequences (2:4) Aligned. Score: 86
Sequences (2:5) Aligned. Score: 85
Sequences (3:4) Aligned. Score: 85
Sequences (3:5) Aligned. Score: 84
Sequences (4:5) Aligned. Score: 96

Best score is the alignment between rat and mouse
• Highest scoring alignment ⇒ closest sequences on guide tree
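Picking out the closest pair from this log is mechanical. A minimal sketch that parses the score lines with a regular expression; the conversion `1 - score/100` assumes the reported score is a percent identity, which is an assumption of this example rather than something stated in the log.

```python
import re

LOG = """\
Sequences (1:2) Aligned. Score: 84
Sequences (1:3) Aligned. Score: 84
Sequences (1:4) Aligned. Score: 91
Sequences (1:5) Aligned. Score: 92
Sequences (2:3) Aligned. Score: 99
Sequences (2:4) Aligned. Score: 86
Sequences (2:5) Aligned. Score: 85
Sequences (3:4) Aligned. Score: 85
Sequences (3:5) Aligned. Score: 84
Sequences (4:5) Aligned. Score: 96
"""

PATTERN = re.compile(r"Sequences \((\d+):(\d+)\) Aligned\. Score: (\d+)")


def parse_scores(log):
    """Map each sequence pair (i, j) to its pairwise alignment score."""
    return {(int(i), int(j)): int(sc) for i, j, sc in PATTERN.findall(log)}


def to_distance(score):
    """Assumed conversion: treat the score as percent identity."""
    return 1.0 - score / 100.0


scores = parse_scores(LOG)
closest = max(scores, key=scores.get)          # highest-scoring pair
distances = {pair: to_distance(sc) for pair, sc in scores.items()}
```

The pair with the highest score has the smallest distance, so it is joined first when the guide tree is built.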
43
Use PSA scores to create guide tree
((gi|5803139|ref|NP_006735.1|:0.04284,(gi|6174963|sp|Q00724|RETB_MOUS:0.00075,gi|132407|sp|P04916|RETB_RAT:0.00423):0.10542):0.01900,gi|89271|pir||A39486:0.01924,gi|132403|sp|P18902|RETB_BOVIN:0.01902);
[Figure: guide tree with leaves Rat RBP, Mouse RBP, Pig RBP, Cow RBP, Human RBP]
44
Progressively align sequences
• Make a MSA based on the order in the guide tree
– Start with the two most closely related sequences
– Then add the next closest sequence
– Continue until all sequences are added to the MSA
• Rule: “once a gap, always a gap.”
[Figure: guide tree over sequences 1 to 5, leading to the progressive MSA]
45
ClustalW: “once a gap, always a gap”
Closely related:
x: ACGCGGC
y: ACGC-GC
Distantly related:
x: ACGCGGC
y: ACTT-TC
• There are many possible ways to make a MSA
– Where gaps are added is a critical question
• In which case are gap locations most reliable?
• Gaps often added to the first two sequences
– To maintain the initial gap choices is to trust that those gaps are most believable
46
Output: MSA of 5 RBP sequences
CLUSTAL W (1.82) multiple sequence alignment

gi|89271|pir||A39486           MEWVWALVLLAALGSAQAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP 50
gi|132403|sp|P18902|RETB_BOVIN ------------------ERDCRVSSFRVKENFDKARFAGTWYAMAKKDP 32
gi|5803139|ref|NP_006735.1|    MKWVWALLLLAAW--AAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP 48
gi|6174963|sp|Q00724|RETB_MOUS MEWVWALVLLAALGGGSAERDCRVSSFRVKENFDKARFSGLWYAIAKKDP 50
gi|132407|sp|P04916|RETB_RAT   MEWVWALVLLAALGGGSAERDCRVSSFRVKENFDKARFSGLWYAIAKKDP 50
                                                 ********************:* ***:*****

gi|89271|pir||A39486           EGLFLQDNIVAEFSVDENGHMSATAKGRVRLLNNWDVCADMVGTFTDTED 100
gi|132403|sp|P18902|RETB_BOVIN EGLFLQDNIVAEFSVDENGHMSATAKGRVRLLNNWDVCADMVGTFTDTED 82
gi|5803139|ref|NP_006735.1|    EGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTED 98
gi|6174963|sp|Q00724|RETB_MOUS EGLFLQDNIIAEFSVDEKGHMSATAKGRVRLLSNWEVCADMVGTFTDTED 100
gi|132407|sp|P04916|RETB_RAT   EGLFLQDNIIAEFSVDEKGHMSATAKGRVRLLSNWEVCADMVGTFTDTED 100
                               *********:*******.*:************.**:**************

gi|89271|pir||A39486           PAKFKMKYWGVASFLQKGNDDHWIIDTDYDTYAAQYSCRLQNLDGTCADS 150
gi|132403|sp|P18902|RETB_BOVIN PAKFKMKYWGVASFLQKGNDDHWIIDTDYETFAVQYSCRLLNLDGTCADS 132
gi|5803139|ref|NP_006735.1|    PAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADS 148
gi|6174963|sp|Q00724|RETB_MOUS PAKFKMKYWGVASFLQRGNDDHWIIDTDYDTFALQYSCRLQNLDGTCADS 150
gi|132407|sp|P04916|RETB_RAT   PAKFKMKYWGVASFLQRGNDDHWIIDTDYDTFALQYSCRLQNLDGTCADS 150
                               ****************:*******:****:*:* ****** *********

* asterisks indicate identity in a column
47
ClustalW has sophisticated gap treatment
• Gap opening and extension penalties depend on:
– Weight matrix
– Sequence similarity, length, difference in sequence length
– Position of gaps and residues at gaps
• Motivation:
– If the positions of secondary structure elements (α-helices, β-strands) are known in all or some of the sequences
– Gap penalties could be increased inside secondary structure elements and decreased outside them
– Forcing gaps to occur most often in loop regions
50
Sequence weighting in ClustalW
• Choose root such that mean of branch lengths on either side are equal
• For each sequence compute distance from root
• Adjust branches that are used several times
• Use distances as weight factors in SP score
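[Figure: guide tree with root; leaves A and B have branches of length 0.2 and 0.1 and share a branch of length 0.3 to the root; leaf C has a branch of length 0.6 to the root]
wA = 0.2 + 0.3/2 = 0.35
wB = 0.1 + 0.3/2 = 0.25
wC = 0.6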
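The weights above follow from dividing each branch length equally among the leaves below it and summing along the path from the root. A minimal sketch, assuming a rooted tree encoded as nested lists of `(branch_length, child)` pairs (this encoding is an assumption of the example, not ClustalW's internal format):

```python
def leaves(node):
    """All leaf names at or below a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return [node]
    return [leaf for _, child in node for leaf in leaves(child)]


def sequence_weights(node, weights=None):
    """Accumulate, for every leaf, its share of each branch on the path
    to the root: a branch shared by m leaves contributes length / m."""
    if weights is None:
        weights = {}
    for length, child in node:
        below = leaves(child)
        for leaf in below:
            weights[leaf] = weights.get(leaf, 0.0) + length / len(below)
        if not isinstance(child, str):
            sequence_weights(child, weights)
    return weights


# Root with two children: the (A, B) subtree on a 0.3 branch, and C on 0.6.
TREE = [(0.3, [(0.2, "A"), (0.1, "B")]), (0.6, "C")]
w = sequence_weights(TREE)
```

Sequences that share much of their path to the root, such as A and B here, split the shared branch and so receive less weight than an isolated sequence such as C.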
51
Weighted sum-of-pairs score
• Sequence pairs may be assigned weights to reduce the influence of very similar sequences on the alignment score
• This leads to a weighted sum-of-pairs score (WSP):
$$WSP(m_i) = \sum_{k<l} w_{kl}\, s(m_k, m_l)$$
where $w_{kl}$ is the weight factor
52
Additional features of ClustalW
• Individual weights are assigned to sequences
– Closely related ⇒ less weight
• Scoring matrices are varied depending on the presence of conserved or divergent sequences, e.g.
PAM20 80-100% id
PAM60 60-80% id
PAM120 40-60% id
PAM250 0-40% id
53
Shortcomings of progressive approaches
• Progressive MSA strongly dependent upon initial alignments
• If sequences aligned at each step are similar
– Progressive approach works well
• If MSA is built on dubious PSAs
– Errors in alignment propagated and amplified
• Post-processing solution:
– Iterative refinement
54
Dangers of progressive alignment
• Initial alignments are “frozen” even when new evidence is introduced
• Example:
x: GAAGTT
y: GAC-TT   (frozen by initial PSA)
z: GAACTG
w: GTACTG
Additional sequences make clear that y: GA-CTT
[Figure: guide tree with leaves x, y, z, w]
55
Iterative refinement of progressive MSA
[Figure: alignment of x, y, z; x and z are held fixed (fixed projection) while y is allowed to vary]
• For each j = 1 to N
– Remove sequence xj and realign to remaining alignment of x1,…,xj-1,xj+1,…,xN
• Repeat until alignment converges
56
Ex: Iterative refinement
• Progressive alignment (x,y), (z,w), (xy, zw):
x: GAAGTTA
y: GAC-TTA
z: GAACTGA
w: GTACTGA
• After realigning y to the remainder:
x: GAAGTTA
y: G-ACTTA   (+ 3 matches)
z: GAACTGA
w: GTACTGA
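The gain of three matches can be verified by counting, for the row y, the columns in which it carries the same residue as each other row. A small sketch (the helper name is illustrative, and identity counting stands in for a full substitution score):

```python
def matches_to_others(rows, name):
    """Count identical residue pairs between one row and all other rows."""
    target = rows[name]
    return sum(
        1
        for other, row in rows.items()
        if other != name
        for a, b in zip(target, row)
        if a == b and a != "-"
    )


before = {"x": "GAAGTTA", "y": "GAC-TTA", "z": "GAACTGA", "w": "GTACTGA"}
after = dict(before, y="G-ACTTA")

gain = matches_to_others(after, "y") - matches_to_others(before, "y")
```

Moving the gap in y lines its C up with the C in z and w and its A with the A shared by all three other rows, at the cost of one A match in column 2.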
70
Evaluating methods of MSA
• Since DP for MSA is prohibitively slow
– Heuristic methods are required
• Heuristics vary in both speed and accuracy
• Speed is well quantified by computational complexity
– How can accuracy be quantified?
• Idea: Construct benchmark sets of “correct” MSAs
– Evaluate ability of methods to reconstruct “correct” alignment
• Metric: Q = proportion of correctly aligned residues
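One common way to make Q concrete is to count aligned residue pairs: Q is then the fraction of residue pairs aligned in the reference that are also aligned in the test alignment. The sketch below uses that pairwise definition as an assumption; benchmark suites may define or restrict the measure differently.

```python
from itertools import combinations


def aligned_pairs(rows):
    """Set of residue pairs sharing a column. Each residue is identified
    by (row index, position in the ungapped sequence)."""
    counters = [0] * len(rows)
    pairs = set()
    for col in zip(*rows):
        present = []
        for idx, ch in enumerate(col):
            if ch != "-":
                present.append((idx, counters[idx]))
                counters[idx] += 1
        pairs.update(combinations(present, 2))
    return pairs


def q_score(test_rows, reference_rows):
    """Fraction of reference residue pairs reproduced by the test MSA.
    Assumes the reference aligns at least one pair of residues."""
    ref = aligned_pairs(reference_rows)
    return len(ref & aligned_pairs(test_rows)) / len(ref)


reference = ["AC-G", "ACTG"]
shifted = ["ACG-", "ACTG"]
q_same = q_score(reference, reference)
q_shifted = q_score(shifted, reference)
```

In the shifted alignment the final G of the first sequence is paired with T instead of G, so two of the three reference pairs survive.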
71
Performance of different alignment tools
Algorithm   BAliBASE (237)     PREFAB (1932)        SABmark (698)
            Q       t          Q       t            Q       t
Align-m     0.804   19:25      -       -            0.352   56:44
DIALIGN     0.832   2:53       0.572   12:25:00     0.410   8:28
CLUSTALW    0.861   1:07       0.589   2:57:00      0.439   2:16
MAFFT       0.882   1:18       0.648   2:36:00      0.442   7:33
T-Coffee    0.883   21:31      0.636   144:51:00    0.456   59:10
MUSCLE      0.896   1:05       0.648   3:11:00      0.464   20:42
PROBCONS    0.910   5:32       0.668   19:41:00     0.505   17:20
72
Popular heuristic methods for MSA
• ClustalW: Most widely used
– http://www.ebi.ac.uk/clustalw/
• T-Coffee: Better but slower
– http://www.ch.embnet.org/software/TCoffee.html
• ProbCons: Most accurate
– http://probcons.stanford.edu/
• MUSCLE: Most scalable
– http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py
73
Summary
• Multiple alignments make a statement about homology
• Optimal solution can be found by DP, but prohibitively slow
• Faster heuristics necessary for most applications
• Most heuristic methods based on progressive method
• Variants include pre-processing and post-processing
• Choice of method dictated by tradeoffs of time and accuracy
74
Postscript: The chicken and the egg
[Figure: cycle linking the steps "initial alignment", "estimate phylogenetic tree", "compute sequence weights", and "refine alignment using weighted SP score", illustrated with a phylogenetic tree and the alignment rows DGMNAGLAQ-VIA, DGM-ASLAQGVI-, -----SIPGVDK-]
75
Alignment ↔ phylogeny ↔ alignment ↔ ...
Pig   ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
Dog   --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
Human KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
Mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
***. ::: .: .. . : . . * . *: *
[Figure: phylogenetic tree with leaves pig, dog, human, mouse]
• Didn’t we use the tree to build the alignment?
– How can we use the alignment to build the tree?