
Multiple Sequence Alignment

With thanks to Eric Stone and Steffen Heber, North Carolina State University


Definition: Multiple sequence alignment

alignment:
ATTTG-
ATTTGC
AT-TGC

no alignment (rows differ in length):
ATTTG
ATTTGC
ATT-GC

no alignment (column 5 contains only gaps):
ATTT-G-
ATTT-GC
AT-T-GC

• Given a set of sequences, a multiple sequence alignment is an assignment of gap characters such that

– the resulting sequences have the same length

– no column contains only gaps
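The two conditions of the definition can be checked mechanically. The following is a minimal sketch; the function name `is_msa` is made up for illustration, and the three inputs are the example rows from this slide.

```python
def is_msa(rows, gap="-"):
    """Return True if the gapped rows satisfy both conditions of the definition."""
    if not rows:
        return False
    # Condition 1: all gapped sequences have the same length.
    if len({len(r) for r in rows}) != 1:
        return False
    # Condition 2: no column consists only of gap characters.
    return all(any(c != gap for c in column) for column in zip(*rows))


print(is_msa(["ATTTG-", "ATTTGC", "AT-TGC"]))     # True
print(is_msa(["ATTTG", "ATTTGC", "ATT-GC"]))      # False: lengths differ
print(is_msa(["ATTT-G-", "ATTT-GC", "AT-T-GC"]))  # False: column 5 is all gaps
```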


Application: Characterize protein families


Application: Discover conserved pattern

• A faint similarity between two sequences may become detectable if present in many


Application: Recover phylogenetic tree

Pig    ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
Dog    --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
Human  KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
Mouse  -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

***. ::: .: .. . : . . * . *: *

[Figure: phylogenetic tree with leaves pig, dog, human, mouse]

Multiple sequence alignment (MSA)

gi|89271|pir||A39486            PAKFKMKYWGVASFLQKGNDDHWIIDTDYDTYAAQYSCRLQNLDGTCADS
gi|132403|sp|P18902|RETB_BOVIN  PAKFKMKYWGVASFLQKGNDDHWIIDTDYETFAVQYSCRLLNLDGTCADS
gi|5803139|ref|NP_006735.1|     PAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADS
gi|6174963|sp|Q00724|RETB_MOUS  PAKFKMKYWGVASFLQRGNDDHWIIDTDYDTFALQYSCRLQNLDGTCADS
gi|132407|sp|P04916|RETB_RAT    PAKFKMKYWGVASFLQRGNDDHWIIDTDYDTFALQYSCRLQNLDGTCADS

• Generalizes pairwise sequence alignment (PSA)

– Multiple simply means three or more sequences

• In PSA, paired residues assumed to be homologous

– In MSA, columns of residues assumed to be homologous

• Are there fundamental differences between MSA and PSA?

– What distinguishes MSA from PSA?

Biological distinction

• Homology (common ancestry) in PSA is symmetric

• Phylogeny renders homology in MSA asymmetric

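[Figure: in a pairwise alignment of …TTACG… and …TTGCG…, pairing A with G is equivalent to pairing G with A. With four sequences (…TTACG…, …TTACG…, …TTGCG…, …TTGCG…) related by a tree, a change from A to G is not equivalent to a change from G to A.]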

MSA makes a statement about homology

• Like pairwise sequence alignment

– Multiple sequence alignment asserts the homology of its columns

• Unlike pairwise sequence alignment, interpretation requires a phylogeny

[Figure: tree with node labels NYLS, NYLS, NFLS]

Statistical distinction

[Figure: sequences a, b, c; the pair a, b is shown under the match model M and under the random model R]

$$\frac{\Pr(a, b \mid M)}{\Pr(a, b \mid R)} = \frac{p_{ab}}{q_a q_b}$$

• PSA compares a "match model" M to a "random model" R

• In MSA, how do we define M without phylogeny?

• How were PSA match probabilities p_ab obtained?

– Can something similar be done for MSA?

Computational distinction

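• Even for PSA, exhaustive search is exponential

– But optimal PSA can be found by DP in O(mn)

• Clearly MSA has higher complexity than PSA

– But by how much?

• Is finding the optimal MSA still feasible?

[Figure: DP cell F(i, j) computed from F(i-1, j-1) with score s(xi, yj), and from F(i-1, j) and F(i, j-1) with gap penalty -d]

$$F(i,j) = \max\{\, F(i-1,j-1) + s(x_i, y_j),\; F(i-1,j) - d,\; F(i,j-1) - d \,\}$$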
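The pairwise recurrence pictured on this slide can be sketched as follows, assuming a linear gap penalty d and a caller-supplied substitution function s. The toy scoring function `match_score` (+1 match, -1 mismatch) is an assumption made for the example, not part of the slides.

```python
def needleman_wunsch_score(x, y, s, d):
    """Optimal global pairwise alignment score with linear gap penalty d (d > 0)."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    # Boundary: a prefix aligned entirely against gaps.
    for i in range(1, m + 1):
        F[i][0] = -d * i
    for j in range(1, n + 1):
        F[0][j] = -d * j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # align x_i with y_j
                F[i - 1][j] - d,                          # x_i against a gap
                F[i][j - 1] - d,                          # y_j against a gap
            )
    return F[m][n]


def match_score(a, b):
    """Toy substitution score: +1 for identical symbols, -1 otherwise."""
    return 1 if a == b else -1


print(needleman_wunsch_score("ACGT", "AGT", match_score, 2))  # 1
```

The two nested loops visit each of the (m+1)(n+1) cells once, which is where the O(mn) bound comes from.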


The problem of global MSA

• Given k sequences: x1, x2, …, xk

• Insert gaps (-) in each sequence xi such that

– All sequences have the same length L

– Score of the global alignment is maximum

• MSA is more sensitive than PSA

– A faint similarity between two sequences may become detectable if present in many

• More sequences may increase alignment quality

– But the cost is added complexity

How can alignment columns be scored?

• In pairwise sequence alignment

– Scores quantify the exchangeability of the residues/gap in the pair

• In multiple sequence alignment

– A similar treatment is more complicated and requires use of phylogeny

• One solution:

– Evaluate MSA through its constituent PSAs

• These PSAs are called induced pairwise alignments

Induced pairwise alignments

• Example MSA:

x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG

• Induces three PSAs:

– Columns in which both sequences have a gap are dropped

x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG

• The MSA can be scored by summing over the induced PSAs

– This is called the “sum-of-pairs” approach
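As a sketch of how the induced pairwise alignments can be derived, the snippet below projects the example MSA onto each pair of rows and drops columns where both rows have a gap. The function name `induced_pairwise` is made up for illustration.

```python
from itertools import combinations


def induced_pairwise(msa, gap="-"):
    """Map each pair of row names to its induced pairwise alignment."""
    result = {}
    for (a, row_a), (b, row_b) in combinations(msa.items(), 2):
        # Keep only columns where at least one of the two rows has a residue.
        cols = [(p, q) for p, q in zip(row_a, row_b) if not (p == gap and q == gap)]
        result[(a, b)] = ("".join(p for p, _ in cols), "".join(q for _, q in cols))
    return result


msa = {"x": "AC-GCGG-C", "y": "AC-GC-GAG", "z": "GCCGC-GAG"}
for pair, rows in induced_pairwise(msa).items():
    print(pair, rows)
# ('x', 'y') ('ACGCGG-C', 'ACGC-GAG')
# ('x', 'z') ('AC-GCGG-C', 'GCCGC-GAG')
# ('y', 'z') ('AC-GCGAG', 'GCCGCGAG')
```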


Example: Sum-of-pairs score

Substitution scores (BLOSUM 60):

     F   Y   G   D
F    5  -2  -2  -1
Y        7   1  -5
G            4  -3
D                5

Gap penalty: -8

MSA:

x: F-G
y: F-G
z: FYD

Induced PSAs and their scores:

x: FG            x: F-G               y: F-G
y: FG            z: FYD               z: FYD
5 + 4 = 9        5 - 8 - 3 = -6       5 - 8 - 3 = -6

• Sum-of-pairs score: 9 - 6 - 6 = -3

– What is the computational complexity?
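The example can be reproduced with a short sketch. The score table and gap penalty are copied from this slide; scoring a gap-gap pair as 0 is an assumption that mirrors dropping such columns from the induced pairwise alignment.

```python
from itertools import combinations

# Upper-triangular substitution scores as listed on the slide.
SCORES = {
    ("F", "F"): 5, ("F", "Y"): -2, ("F", "G"): -2, ("F", "D"): -1,
    ("Y", "Y"): 7, ("Y", "G"): 1, ("Y", "D"): -5,
    ("G", "G"): 4, ("G", "D"): -3,
    ("D", "D"): 5,
}
GAP_PENALTY = -8


def pair_score(a, b, gap="-"):
    if a == gap and b == gap:
        return 0  # the column vanishes in the induced pairwise alignment
    if a == gap or b == gap:
        return GAP_PENALTY
    return SCORES[(a, b)] if (a, b) in SCORES else SCORES[(b, a)]


def sum_of_pairs(rows):
    """Sum the pair scores over every column and every pair of rows."""
    return sum(
        pair_score(a, b)
        for column in zip(*rows)
        for a, b in combinations(column, 2)
    )


print(sum_of_pairs(["F-G", "F-G", "FYD"]))  # -3
```

Summing column by column gives the same total as summing the three induced pairwise scores, because each pair of rows contributes once per column.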


Problem: Finding the optimal MSA

• Given k sequences: x1, x2, …, xk

• Sum-of-pairs score for any MSA of k sequences is

– Sum of scores of all k(k-1)/2 induced PSAs

• Seek MSA A which maximizes the sum-of-pairs score

$$S(A) = \sum_{i<j} S(A_{ij})$$

where S(A_ij) is the score of A_ij, the PSA of sequences xi and xj induced by the MSA A

• Clearly exhaustive search is not an option

– Can we rely on dynamic programming?

Dynamic Programming

• Similar to pairwise alignments, multiple sequence alignments can be computed by dynamic programming

[Figure: DP matrix F in 2D (two sequences) and in 3D (three sequences)]

Generalized Needleman-Wunsch

• Given 3 sequences x, y, and z

• Main iteration loop:

F(i,j,k) = max{ F(i-1, j-1, k-1) + S(xi, yj, zk),
                F(i-1, j-1, k  ) + S(xi, yj, - ),
                F(i-1, j  , k-1) + S(xi, - , zk),
                F(i-1, j  , k  ) + S(xi, - , - ),
                F(i  , j-1, k-1) + S(- , yj, zk),
                F(i  , j-1, k  ) + S(- , yj, - ),
                F(i  , j  , k-1) + S(- , - , zk) }
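A minimal sketch of the three-sequence recurrence follows. It assumes that the column score S is the sum-of-pairs score with a linear gap penalty d and that gap-gap pairs score 0; `match_score` is a toy substitution function introduced only for the example.

```python
from itertools import combinations, product


def sp_column(column, s, d):
    """Sum-of-pairs score of one column; gap-gap pairs contribute nothing."""
    total = 0
    for a, b in combinations(column, 2):
        if a == "-" and b == "-":
            continue
        total += -d if "-" in (a, b) else s(a, b)
    return total


def align3_score(x, y, z, s, d):
    """Optimal sum-of-pairs score for three sequences by 3D dynamic programming."""
    X, Y, Z = len(x), len(y), len(z)
    neg_inf = float("-inf")
    F = [[[neg_inf] * (Z + 1) for _ in range(Y + 1)] for _ in range(X + 1)]
    F[0][0][0] = 0
    # The 7 predecessor moves: every non-empty subset of sequences advances.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    for i in range(X + 1):
        for j in range(Y + 1):
            for k in range(Z + 1):
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue
                    column = (
                        x[pi] if di else "-",
                        y[pj] if dj else "-",
                        z[pk] if dk else "-",
                    )
                    candidate = F[pi][pj][pk] + sp_column(column, s, d)
                    F[i][j][k] = max(F[i][j][k], candidate)
    return F[X][Y][Z]


def match_score(a, b):
    return 1 if a == b else -1


print(align3_score("ACG", "ACG", "ACG", match_score, 2))  # 9
```

The list `moves` has 2^3 - 1 = 7 entries, one per term of the max above; for k sequences the same construction yields 2^k - 1 predecessors per cell.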


Analysis of algorithm

• Given k sequences of length n:

– Space for matrix: O(n^k)

– Neighbors/cell: 2^k - 1

– Time to compute SP score: O(k^2)

– Overall runtime: O(k^2 2^k n^k)

• Implications

– Can align about 7 relatively short (length 200 - 300) sequences in a reasonable amount of time

• 2^7 · 200^7 > 1,600,000,000,000,000,000

– Exact optimality is generally not attainable
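The arithmetic behind these bounds is easy to check. The sketch below evaluates the terms for k = 7 and n = 200; the function name is made up, and constant factors are ignored.

```python
def exact_msa_operations(k, n):
    """Order-of-magnitude operation count k^2 * 2^k * n^k for exact DP."""
    return k**2 * 2**k * n**k


k, n = 7, 200
print(n**k)                        # cells in the DP matrix: 12800000000000000
print(2**k * n**k)                 # 1638400000000000000, i.e. more than 1.6e18
print(exact_msa_operations(k, n))  # adds the k^2 factor for the SP score
```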


Heuristics for multiple sequence alignment

• Exact optimality is too slow, even by dynamic programming

• Alternative:

– Seek good suboptimal solutions that are attainable in reasonable time

• Key questions:

– What is a “good” suboptimal solution?

– What is “reasonable” time?

• Heuristics focus on an intelligent reduction of search space

– Divide-and-conquer alignment

– Greedy alignment (progressive)

Divide-and-conquer alignment (DCA)

• Idea: Reduce search space for dynamic programming by cutting the sequences.

• Algorithm:

1. Cut sequences into fragments until fragments can be aligned by DP

2. Build multiple alignments by DP

3. Concatenate the resulting alignments

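Reduction of search space

[Figure: three-dimensional DP lattice with axes Sequence 1, Sequence 2, Sequence 3; cutting the sequences restricts the search space]

Cut points optimize: C = S_prefix + S_suffix - S_complete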
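To make the cut-point criterion concrete, here is a sketch restricted to two sequences, which is a simplification: DCA itself combines such pairwise quantities over all pairs. For a fixed cut in x it searches for the cut in y that maximizes C = S_prefix + S_suffix - S_complete. The helper names and the toy scoring function are assumptions for illustration.

```python
def nw_matrix(x, y, s, d):
    """Full Needleman-Wunsch matrix; F[i][j] scores x[:i] against y[:j]."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = -d * i
    for j in range(1, n + 1):
        F[0][j] = -d * j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
    return F


def best_cut(x, y, cut_x, s, d):
    """For a fixed cut position in x, return the best cut in y and its C value."""
    prefix = nw_matrix(x, y, s, d)
    # Suffix scores come from aligning the reversed sequences.
    suffix = nw_matrix(x[::-1], y[::-1], s, d)
    complete = prefix[len(x)][len(y)]

    def c(j):
        return prefix[cut_x][j] + suffix[len(x) - cut_x][len(y) - j] - complete

    j_best = max(range(len(y) + 1), key=c)
    return j_best, c(j_best)


def match_score(a, b):
    return 1 if a == b else -1


print(best_cut("ACGTACGT", "ACGACGT", 4, match_score, 2))
```

In the pairwise case the best C is 0, because some optimal alignment passes through the chosen cut; with more than two sequences the pairwise cuts generally conflict, which is why a compromise is optimized.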


Greater reduction of search space


Progressive alignment

• Idea:

– Build multiple sequence alignment from a series of pairwise alignments

• Strategy:

– Choose two sequences to align (optimally)

– Hold pairwise alignment fixed, treat as a new sequence, and iterate

• For n sequences:

– Requires n - 1 pairwise sequence alignments

• Does the order matter?

– What criteria are used to choose the sequences?

Guide tree

• Binary tree

– Leaves correspond to sequences

– Internal nodes represent alignments

– Root corresponds to final MSA

• The guide tree specifies the order of alignment

• Usually constructed from matrix of pairwise distances between sequences

[Figure: guide tree with leaves ATC, ATG, TCG, TCC. One internal node holds the alignment ATC / ATG, the other holds TCG / TCC, and the root holds the final MSA:]

ATC-
ATG-
-TCG
-TCC

Simple approach to distance matrix D

• Example sequences:

A ACGCGTTGGGCGATGGCAAC

B ACGCGTTGGGCGACGGTAAT

C ACGCATTGAATGATGATAAT

D ACACATTGAGTGATAATAAT

• Simple approach

– Count mismatches

• Pairwise distance matrix:
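The mismatch counts can be computed directly from the four example sequences. This is a minimal sketch; the function name `mismatches` is made up for illustration.

```python
from itertools import combinations

seqs = {
    "A": "ACGCGTTGGGCGATGGCAAC",
    "B": "ACGCGTTGGGCGACGGTAAT",
    "C": "ACGCATTGAATGATGATAAT",
    "D": "ACACATTGAGTGATAATAAT",
}


def mismatches(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(p != q for p, q in zip(a, b))


D = {(i, j): mismatches(seqs[i], seqs[j]) for i, j in combinations(seqs, 2)}
for pair, dist in D.items():
    print(pair, dist)
# ('A', 'B') 3
# ('A', 'C') 7
# ('A', 'D') 8
# ('B', 'C') 6
# ('B', 'D') 7
# ('C', 'D') 3
```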


From pairwise distances to a tree

• Using this information, a tree can be drawn:

A ACGCGTTGGGCGATGGCAAC

B ACGCGTTGGGCGACGGTAAT

C ACGCATTGAATGATGATAAT

D ACACATTGAGTGATAATAAT

• Is it guaranteed that the distances exactly fit a tree?

[Figure: unrooted tree with leaves A, B, C, D and branch lengths 4, 1, 2, 2, 1]
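Whether four distances fit a tree exactly can be tested with the four-point condition: of the three ways to split four taxa into two pairs, the two largest distance sums must be equal. The sketch below applies it to the mismatch counts of the example sequences A to D (3, 7, 8, 6, 7, 3); the function name is made up. In general the condition need not hold, so an exact fit is not guaranteed.

```python
def four_point_ok(D, w, x, y, z, tol=1e-9):
    """Check the four-point condition for taxa w, x, y, z under distances D."""
    def dist(p, q):
        return D[(p, q)] if (p, q) in D else D[(q, p)]

    sums = sorted([
        dist(w, x) + dist(y, z),
        dist(w, y) + dist(x, z),
        dist(w, z) + dist(x, y),
    ])
    # Tree-like distances: the two largest sums coincide.
    return abs(sums[2] - sums[1]) <= tol


D = {("A", "B"): 3, ("A", "C"): 7, ("A", "D"): 8,
     ("B", "C"): 6, ("B", "D"): 7, ("C", "D"): 3}
print(four_point_ok(D, "A", "B", "C", "D"))  # True: 6, 14, 14

D_perturbed = dict(D)
D_perturbed[("A", "D")] = 9
print(four_point_ok(D_perturbed, "A", "B", "C", "D"))  # False: 6, 14, 15
```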

Guide tree

[Figure, built up over four slides: guide tree over sequences 1 to 5, with nodes 6, 7, and 8 added step by step]

Progressive alignment

• Follow branching order of guide tree to build MSA

• Problem: We may have to align

– Two sequences

– A sequence and an alignment

– Two alignments

[Figure: guide tree over sequences 1 to 5]

How to align two alignments?

Idea: Dynamic programming; treat columns like single positions.

Example:

a = GTC      b = GTT
    GTA          GTT

[Figure: DP matrix F with three possible moves into a cell]

Align a[3] and b[3]:
GTC
GTA
GTT
GTT

Align a[3] and gap:
GT-C
GT-A
GTT-
GTT-

Align b[3] and gap:
GTT-
GTT-
GT-T
GT-T

Scoring an alignment of alignments

1 PEEKSAVTAL
2 GEEKAAVLAL
3 PADKTNVKAA
4 AADKTNVKAA

5 EGEWGLVLHV
6 AAEKTKIRSA

Score (column T, L, K, K of sequences 1-4 against column V, I of sequences 5-6):

$$\frac{1}{8}\left[\, s(T,V) + s(T,I) + s(L,V) + s(L,I) + 2\,s(K,V) + 2\,s(K,I) \,\right]$$

• Average over all possibilities, possibly weighted

– Generalizes PSA of two sequences to two profiles
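The averaging idea can be sketched as a column score plugged into the ordinary pairwise DP, with alignment columns playing the role of single positions. This is an unweighted sketch under stated assumptions: a linear gap penalty d, gap-gap pairs scoring 0, and a toy substitution function `match_score`. The function names are made up for illustration.

```python
def column_score(col_a, col_b, s, d):
    """Average score over all cross pairs of two alignment columns."""
    total = 0
    for p in col_a:
        for q in col_b:
            if p == "-" and q == "-":
                continue
            total += -d if "-" in (p, q) else s(p, q)
    return total / (len(col_a) * len(col_b))


def align_alignments(A, B, s, d):
    """Align two alignments (lists of equal-length rows), column against column."""
    cols_a, cols_b = list(zip(*A)), list(zip(*B))
    m, n = len(cols_a), len(cols_b)
    gap_a, gap_b = ("-",) * len(A), ("-",) * len(B)
    F = [[0.0] * (n + 1) for _ in range(m + 1)]
    move = [[""] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = F[i - 1][0] + column_score(cols_a[i - 1], gap_b, s, d)
        move[i][0] = "a"
    for j in range(1, n + 1):
        F[0][j] = F[0][j - 1] + column_score(gap_a, cols_b[j - 1], s, d)
        move[0][j] = "b"
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j], move[i][j] = max([
                (F[i - 1][j - 1] + column_score(cols_a[i - 1], cols_b[j - 1], s, d), "ab"),
                (F[i - 1][j] + column_score(cols_a[i - 1], gap_b, s, d), "a"),
                (F[i][j - 1] + column_score(gap_a, cols_b[j - 1], s, d), "b"),
            ])
    # Trace back, emitting merged columns; existing gaps are never removed.
    merged, i, j = [], m, n
    while i > 0 or j > 0:
        step = move[i][j]
        if step == "ab":
            merged.append(cols_a[i - 1] + cols_b[j - 1])
            i, j = i - 1, j - 1
        elif step == "a":
            merged.append(cols_a[i - 1] + gap_b)
            i -= 1
        else:
            merged.append(gap_a + cols_b[j - 1])
            j -= 1
    merged.reverse()
    return ["".join(row) for row in zip(*merged)], F[m][n]


def match_score(a, b):
    return 1 if a == b else -1


rows, score = align_alignments(["GTC", "GTA"], ["GTT", "GTT"], match_score, 2)
print(rows, score)  # ['GTC', 'GTA', 'GTT', 'GTT'] 1.0
```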


Complexity of progressive alignment

The time required to align k sequences of length n:

• For progressive alignment: O(k^2 n^2)

• Compare with dynamic programming: O(k^2 2^k n^k)

• BUT

– Is there any guarantee on the quality of the progressive MSA?

Progressive alignments with ClustalW

• Clustal is the most popular method of MSA


ClustalW: Overview

[Figure: pairwise alignments (1 + 2, 3 + 4, 1 + 3, 1 + 4, 2 + 4, 2 + 3) → distance matrix → guide tree → progressive alignment]

1. Compute pairwise alignments (DP)

2. Convert similarities into distances

3. Build guide tree from distances by Neighbor Joining

4. Align with respect to guide tree

ClustalW server at EBI

http://www2.ebi.ac.uk/clustalw/


Input: Sequences in FASTA format

• ClustalW example:

– RBP protein sequences from five species:

– Human, mouse, rat, cow, pig


Generate all 10 PSAs

Input: Five RBP sequences

Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score: 84
Sequences (1:3) Aligned. Score: 84
Sequences (1:4) Aligned. Score: 91
Sequences (1:5) Aligned. Score: 92
Sequences (2:3) Aligned. Score: 99
Sequences (2:4) Aligned. Score: 86
Sequences (2:5) Aligned. Score: 85
Sequences (3:4) Aligned. Score: 85
Sequences (3:5) Aligned. Score: 84
Sequences (4:5) Aligned. Score: 96

Best score is the alignment between rat and mouse

• Highest scoring alignment ⇒ closest sequences on guide tree
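As a sketch of the step from pairwise scores to distances, the snippet below parses the log lines shown on this slide and picks the closest pair. Two assumptions are made here: the score is treated as a percent identity and converted with distance = 1 - score/100, and the input order is taken to be human, mouse, rat, cow, pig as listed on the input slide, which is consistent with the best pair being mouse and rat.

```python
import re

LOG = """\
Sequences (1:2) Aligned. Score: 84
Sequences (1:3) Aligned. Score: 84
Sequences (1:4) Aligned. Score: 91
Sequences (1:5) Aligned. Score: 92
Sequences (2:3) Aligned. Score: 99
Sequences (2:4) Aligned. Score: 86
Sequences (2:5) Aligned. Score: 85
Sequences (3:4) Aligned. Score: 85
Sequences (3:5) Aligned. Score: 84
Sequences (4:5) Aligned. Score: 96
"""

# Assumed input order, taken from the species list on the input slide.
NAMES = {1: "human", 2: "mouse", 3: "rat", 4: "cow", 5: "pig"}


def parse_scores(log):
    """Extract {(i, j): score} from ClustalW-style pairwise alignment log lines."""
    pattern = re.compile(r"Sequences \((\d+):(\d+)\) Aligned\. Score: (\d+)")
    return {(int(a), int(b)): int(score) for a, b, score in pattern.findall(log)}


scores = parse_scores(LOG)
distances = {pair: 1 - score / 100 for pair, score in scores.items()}
closest = min(distances, key=distances.get)
print(closest, [NAMES[i] for i in closest])  # (2, 3) ['mouse', 'rat']
```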


Use PSA scores to create guide tree

((gi|5803139|ref|NP_006735.1|:0.04284,(gi|6174963|sp|Q00724|RETB_MOUS:0.00075,gi|132407|sp|P04916|RETB_RAT:0.00423):0.10542):0.01900,gi|89271|pir||A39486:0.01924,gi|132403|sp|P18902|RETB_BOVIN:0.01902);

[Figure: guide tree with leaves Rat RBP, Mouse RBP, Pig RBP, Cow RBP, Human RBP, shown beside the pairwise alignment scores listed above]

Progressively align sequences

• Make a MSA based on the order in the guide tree

– Start with the two most closely related sequences

– Then add the next closest sequence

– Continue until all sequences are added to the MSA

• Rule: “once a gap, always a gap.”

[Figure: progressive MSA following a guide tree over sequences 1 to 5]

ClustalW: “once a gap, always a gap”

Closely related:
x: ACGCGGC
y: ACGC-GC

Distantly related:
x: ACGCGGC
y: ACTT-TC

• There are many possible ways to make a MSA

– Where gaps are added is a critical question

• In which case are gap locations most reliable?

• Gaps often added to the first two sequences

– To maintain the initial gap choices is to trust that those gaps are most believable

Output: MSA of 5 RBP sequences

CLUSTAL W (1.82) multiple sequence alignment

gi|89271|pir||A39486            MEWVWALVLLAALGSAQAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP 50
gi|132403|sp|P18902|RETB_BOVIN  ------------------ERDCRVSSFRVKENFDKARFAGTWYAMAKKDP 32
gi|5803139|ref|NP_006735.1|     MKWVWALLLLAAW--AAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP 48
gi|6174963|sp|Q00724|RETB_MOUS  MEWVWALVLLAALGGGSAERDCRVSSFRVKENFDKARFSGLWYAIAKKDP 50
gi|132407|sp|P04916|RETB_RAT    MEWVWALVLLAALGGGSAERDCRVSSFRVKENFDKARFSGLWYAIAKKDP 50
                                                  ********************:* ***:*****

gi|89271|pir||A39486            EGLFLQDNIVAEFSVDENGHMSATAKGRVRLLNNWDVCADMVGTFTDTED 100
gi|132403|sp|P18902|RETB_BOVIN  EGLFLQDNIVAEFSVDENGHMSATAKGRVRLLNNWDVCADMVGTFTDTED 82
gi|5803139|ref|NP_006735.1|     EGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTED 98
gi|6174963|sp|Q00724|RETB_MOUS  EGLFLQDNIIAEFSVDEKGHMSATAKGRVRLLSNWEVCADMVGTFTDTED 100
gi|132407|sp|P04916|RETB_RAT    EGLFLQDNIIAEFSVDEKGHMSATAKGRVRLLSNWEVCADMVGTFTDTED 100
                                *********:*******.*:************.**:**************

gi|89271|pir||A39486            PAKFKMKYWGVASFLQKGNDDHWIIDTDYDTYAAQYSCRLQNLDGTCADS 150
gi|132403|sp|P18902|RETB_BOVIN  PAKFKMKYWGVASFLQKGNDDHWIIDTDYETFAVQYSCRLLNLDGTCADS 132
gi|5803139|ref|NP_006735.1|     PAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADS 148
gi|6174963|sp|Q00724|RETB_MOUS  PAKFKMKYWGVASFLQRGNDDHWIIDTDYDTFALQYSCRLQNLDGTCADS 150
gi|132407|sp|P04916|RETB_RAT    PAKFKMKYWGVASFLQRGNDDHWIIDTDYDTFALQYSCRLQNLDGTCADS 150
                                ****************:*******:****:*:* ****** *********

* asterisks indicate identity in a column

ClustalW has sophisticated gap treatment

• Gap opening and extension penalties depend on:

– Weight matrix

– Sequence similarity, length, difference in sequence length

– Position of gaps and residues at gaps

• Motivation:

– If positions of all secondary structure elements (α-helices, β-strands) are known in all or some of the sequences

– Could increase the gap penalties inside secondary structure elements and decrease them outside

– Forcing gaps to occur most often in loop regions
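The motivation above can be illustrated with a toy sketch of position-specific gap-opening penalties. This is not ClustalW's actual formula: the function name, the base penalty, and both scaling factors are hypothetical values chosen only to show penalties being raised inside helices and strands and lowered in loops.

```python
def gap_open_penalties(secondary_structure, base=10.0, inside_factor=2.0, loop_factor=0.5):
    """Per-position gap-opening penalties from a secondary-structure string.

    'H' (helix) and 'E' (strand) positions get a raised penalty; every other
    symbol is treated as loop and gets a lowered one. All numbers are illustrative.
    """
    return [
        base * (inside_factor if symbol in ("H", "E") else loop_factor)
        for symbol in secondary_structure
    ]


# Helix, loop, strand: gaps become cheap only in the loop positions.
print(gap_open_penalties("HHH--EE"))  # [20.0, 20.0, 20.0, 5.0, 5.0, 20.0, 20.0]
```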


Sequence weighting in ClustalW

• Choose root such that mean of branch lengths on either side are equal

• For each sequence compute distance from root

• Adjust branches that are used several times

• Use distances as weight factors in SP score

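[Figure: guide tree. Leaves A and B join an internal node by branches of length 0.2 and 0.1; that node joins the root by a branch of length 0.3; leaf C joins the root by a branch of length 0.6.]

wA = 0.2 + 0.3/2 = 0.35

wB = 0.1 + 0.3/2 = 0.25

wC = 0.6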
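The weights above follow from walking from the root to each leaf and dividing every branch length by the number of leaves that share it. The sketch below encodes the example tree as nested lists; this representation and the function names are assumptions for illustration.

```python
def count_leaves(node):
    """A node is either a leaf name (str) or a list of (child, branch_length)."""
    if isinstance(node, str):
        return 1
    return sum(count_leaves(child) for child, _ in node)


def leaf_weights(tree):
    """Weight of each leaf: sum over its root path of length / leaves below the branch."""
    weights = {}

    def visit(node, accumulated):
        if isinstance(node, str):
            weights[node] = accumulated
            return
        for child, length in node:
            visit(child, accumulated + length / count_leaves(child))

    visit(tree, 0.0)
    return weights


# Root with two children: the (A, B) subtree on a 0.3 branch, and leaf C on a 0.6 branch.
tree = [([("A", 0.2), ("B", 0.1)], 0.3), ("C", 0.6)]
print(leaf_weights(tree))  # approximately {'A': 0.35, 'B': 0.25, 'C': 0.6}
```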


Weighted sum-of-pairs score

• Sequence pairs may be assigned weights to reduce the influence of very similar sequences on the alignment score

• This leads to a weighted sum-of-pairs score (WSP):

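$$\mathrm{WSP}(m_i) = \sum_{k<l} w_{kl}\, s(m_i^k, m_i^l)$$

where m_i is column i of the alignment, m_i^k is the symbol of sequence k in that column, and w_kl is the weight factor for the pair of sequences k and l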
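A weighted column score can be sketched as follows. The pair weight is taken to be the product of the two sequence weights, w_kl = w_k · w_l, which is an assumption of this sketch; the linear gap penalty d and the toy substitution function are likewise illustrative.

```python
from itertools import combinations


def weighted_sp_column(column, weights, s, d):
    """Weighted sum-of-pairs score of one column, with w_kl = w_k * w_l (assumed)."""
    total = 0.0
    for (k, a), (l, b) in combinations(enumerate(column), 2):
        if a == "-" and b == "-":
            continue
        pair = -d if "-" in (a, b) else s(a, b)
        total += weights[k] * weights[l] * pair
    return total


def match_score(a, b):
    return 1 if a == b else -1


# Two similar, down-weighted sequences and one distinct sequence with a larger weight.
print(weighted_sp_column("AAC", [0.35, 0.25, 0.6], match_score, 2))
```

With these weights the matching pair contributes 0.35 · 0.25 = 0.0875, while the two mismatching pairs contribute -0.21 and -0.15, so the column total is about -0.2725.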


Additional features of ClustalW

• Individual weights are assigned to sequences

– Closely related ⇒ less weight

• Scoring matrices are varied depending on the presence of conserved or divergent sequences, e.g.

PAM20     80-100% id
PAM60     60-80% id
PAM120    40-60% id
PAM250    0-40% id

Shortcomings of progressive approaches

• Progressive MSA strongly dependent upon initial alignments

• If sequences aligned at each step are similar

– Progressive approach works well

• If MSA is built on dubious PSAs

– Errors in alignment propagated and amplified

• Post-processing solution:

– Iterative refinement

Dangers of progressive alignment

• Initial alignments are “frozen” even when new evidence is introduced

• Example:

Frozen by initial PSA:
x: GAAGTT
y: GAC-TT

z: GAACTG
w: GTACTG

Additional sequences make clear that y: GA-CTT

[Figure: guide tree with leaves x, y, z, w]

Iterative refinement of progressive MSA

[Figure: sequences x, y, z; x and z are held fixed (projection) while y is allowed to vary]

• For each j = 1 to N

– Remove sequence xj and realign to remaining alignment of x1, …, xj-1, xj+1, …, xN

• Repeat until alignment converges

Ex: Iterative refinement

• Progressive alignment (x,y), (z,w), (xy, zw):

x: GAAGTTA
y: GAC-TTA
z: GAACTGA
w: GTACTGA

• After realigning y to the remainder:

x: GAAGTTA
y: G-ACTTA   (+ 3 matches)
z: GAACTGA
w: GTACTGA
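One refinement step on this example can be sketched with a brute-force realignment: y is removed, and every placement of its gaps against the fixed rows x, z, w is scored by the number of identical residue pairs. Real refinement would realign by DP with a proper scoring scheme; the brute-force search and the match-count objective are simplifications chosen for illustration.

```python
from itertools import combinations


def matches(row, others):
    """Count identical, non-gap residue pairs between `row` and each fixed row."""
    return sum(a == b and a != "-" for other in others for a, b in zip(row, other))


def realign_row(seq, others):
    """Try every gap placement of `seq` against the fixed alignment; keep the best."""
    width = len(others[0])
    n_gaps = width - len(seq)
    best = None
    for gap_positions in combinations(range(width), n_gaps):
        residues = iter(seq)
        row = "".join("-" if i in gap_positions else next(residues) for i in range(width))
        score = matches(row, others)
        if best is None or score > best[0]:
            best = (score, row)
    return best


others = ["GAAGTTA", "GAACTGA", "GTACTGA"]  # x, z, w held fixed
before = matches("GAC-TTA", others)
score, row = realign_row("GACTTA", others)
print(row, score - before)  # G-ACTTA 3
```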


Evaluating methods of MSA

• Since DP for MSA is prohibitively slow

– Heuristic methods are required

• Heuristics vary in both speed and accuracy

• Speed is well quantified by computational complexity

– How can accuracy be quantified?

• Idea: Construct benchmark sets of “correct” MSAs

– Evaluate ability of methods to reconstruct “correct” alignment

• Metric: Q = proportion of correctly aligned residues
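One common way to make such a metric concrete is to count residue pairs: Q is the fraction of residue pairs aligned in the reference that are also aligned in the test alignment. Treating Q this way is an assumption of the sketch below, and the function names are made up for illustration.

```python
from itertools import combinations


def aligned_pairs(rows, gap="-"):
    """Set of residue pairs sharing a column; a residue is (row index, residue index)."""
    counters = [0] * len(rows)
    pairs = set()
    for column in zip(*rows):
        present = []
        for r, symbol in enumerate(column):
            if symbol != gap:
                present.append((r, counters[r]))
                counters[r] += 1
        pairs.update(combinations(present, 2))
    return pairs


def q_score(test_rows, reference_rows):
    """Fraction of reference residue pairs that the test alignment reproduces."""
    reference = aligned_pairs(reference_rows)
    return len(reference & aligned_pairs(test_rows)) / len(reference)


reference = ["GAAGTTA", "G-ACTTA"]
test = ["GAAGTTA", "GAC-TTA"]
print(q_score(test, reference))  # 4 of 6 reference pairs recovered
```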


Performance of different alignment tools

Q = accuracy, t = running time

Algorithm    BAliBASE (237)     PREFAB (1932)        SABmark (698)
             Q       t          Q       t            Q       t
Align-m      0.804   19:25      -       -            0.352   56:44
DIALIGN      0.832    2:53      0.572    12:25:00    0.410    8:28
CLUSTALW     0.861    1:07      0.589     2:57:00    0.439    2:16
MAFFT        0.882    1:18      0.648     2:36:00    0.442    7:33
T-Coffee     0.883   21:31      0.636   144:51:00    0.456   59:10
MUSCLE       0.896    1:05      0.648     3:11:00    0.464   20:42
PROBCONS     0.910    5:32      0.668    19:41:00    0.505   17:20

Popular heuristic methods for MSA

• ClustalW: Most widely used

– http://www.ebi.ac.uk/clustalw/

• T-Coffee: Better but slower

– http://www.ch.embnet.org/software/TCoffee.html

• ProbCons: Most accurate

– http://probcons.stanford.edu/

• MUSCLE: Most scalable

– http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py

Summary

• Multiple alignments make a statement about homology

• Optimal solution can be found by DP, but prohibitively slow

• Faster heuristics necessary for most applications

• Most heuristic methods based on progressive method

• Variants include pre-processing and post-processing

• Choice of method dictated by tradeoffs of time and accuracy


Postscript: The chicken and the egg

[Figure: cycle of steps]

initial alignment → estimate phylogenetic tree → compute sequence weights → refine alignment using weighted SP score → estimate phylogenetic tree → ...

DGMNAGLAQ-VIA
DGM-ASLAQGVI-
-----SIPGVDK-

Alignment ↔ phylogeny ↔ alignment ↔ ...

Pig    ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
Dog    --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
Human  KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
Mouse  -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

***. ::: .: .. . : . . * . *: *

[Figure: phylogenetic tree with leaves pig, dog, human, mouse]

• Didn’t we use the tree to build the alignment?

– How can we use the alignment to build the tree?
