Multiple Sequence Alignment With thanks to Eric Stone and Steffen Heber, North Carolina State University


Post on 03-Dec-2019


Definition: Multiple sequence alignment

alignment:
ATTTG-
ATTTGC
AT-TGC

no alignment (sequences differ in length):
ATTTG
ATTTGC
ATT-GC

no alignment (one column contains only gaps):
ATTT-G-
ATTT-GC
AT-T-GC

• Given a set of sequences, a multiple sequence alignment is an assignment of gap characters such that
– the resulting sequences have the same length
– no column contains only gaps
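The two conditions of the definition can be checked mechanically. The sketch below is illustrative only; the function name `is_valid_msa` is an assumption, not part of the slides.

```python
def is_valid_msa(rows, gap="-"):
    """Return True if the gapped rows satisfy both conditions of the definition."""
    if not rows:
        return False
    # Condition 1: all gapped sequences have the same length.
    if len({len(r) for r in rows}) != 1:
        return False
    # Condition 2: no column consists only of gap characters.
    return all(any(ch != gap for ch in col) for col in zip(*rows))


print(is_valid_msa(["ATTTG-", "ATTTGC", "AT-TGC"]))     # equal lengths, no all-gap column
print(is_valid_msa(["ATTTG", "ATTTGC", "ATT-GC"]))      # unequal lengths
print(is_valid_msa(["ATTT-G-", "ATTT-GC", "AT-T-GC"]))  # fifth column is all gaps
```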


Application: Characterize protein families


Application: Discover conserved pattern

• A faint similarity between two sequences may become detectable if present in many


Application: Recover phylogenetic tree

Pig    ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
Dog    --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
Human  KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
Mouse  -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

***. ::: .: .. . : . . * . *: *

[Figure: phylogenetic tree with leaves pig, dog, human, mouse]


Multiple sequence alignment (MSA)

gi|89271|pir||A39486            PAKFKMKYWGVASFLQKGNDDHWIIDTDYDTYAAQYSCRLQNLDGTCADS
gi|132403|sp|P18902|RETB_BOVIN  PAKFKMKYWGVASFLQKGNDDHWIIDTDYETFAVQYSCRLLNLDGTCADS
gi|5803139|ref|NP_006735.1|     PAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADS
gi|6174963|sp|Q00724|RETB_MOUS  PAKFKMKYWGVASFLQRGNDDHWIIDTDYDTFALQYSCRLQNLDGTCADS
gi|132407|sp|P04916|RETB_RAT    PAKFKMKYWGVASFLQRGNDDHWIIDTDYDTFALQYSCRLQNLDGTCADS

• Generalizes pairwise sequence alignment (PSA)
– Multiple simply means three or more sequences
• In PSA, paired residues are assumed to be homologous
– In MSA, columns of residues are assumed to be homologous
• Are there fundamental differences between MSA and PSA?
– What distinguishes MSA from PSA?


Biological distinction

• Homology (common ancestry) in PSA is symmetric

• Phylogeny renders homology in MSA asymmetric

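[Figure: for the pair …TTACG… and …TTGCG…, pairing A with G is the same as pairing G with A (=); for the four sequences …TTACG…, …TTACG…, …TTGCG…, …TTGCG… related by a tree, the arrangements of A and G are not equivalent (≠)]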


MSA makes a statement about homology

• Like pairwise sequence alignment
– Multiple sequence alignment asserts the homology of its columns
• Unlike pairwise sequence alignment, interpretation requires a phylogeny

[Figure: tree relating NYLS (top) to NYLS and NFLS (below)]


Statistical distinction

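[Figure: residues a and b, with ancestor c, under the match model M and the random model R]

\[ \frac{\Pr(a, b \mid M)}{\Pr(a, b \mid R)} = \frac{p_{ab}}{q_a \, q_b} \]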

• PSA compares a "match model" M to a "random model" R

• In MSA, how do we define M without phylogeny?

• How were PSA match probabilities p_ab obtained?
– Can something similar be done for MSA?
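For the pairwise case, the odds ratio of the match model to the random model is usually turned into an additive score by taking its logarithm. The sketch below shows that step. The probabilities are hypothetical numbers chosen for illustration, and the function name `log_odds` is an assumption.

```python
import math


def log_odds(p_ab, q_a, q_b):
    """Log-odds score log(p_ab / (q_a * q_b)) for aligning residues a and b."""
    return math.log(p_ab / (q_a * q_b))


# Hypothetical background frequencies and match probabilities (illustration only).
q = {"A": 0.25, "G": 0.25}
print(log_odds(0.10, q["A"], q["A"]))  # pair more frequent than chance: positive score
print(log_odds(0.02, q["A"], q["G"]))  # pair less frequent than chance: negative score
```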


Computational distinction

• Even for PSA, exhaustive search is exponential
– But optimal PSA can be found by DP in O(mn)
• Clearly MSA has higher complexity than PSA
– But by how much?
• Is finding the optimal MSA still feasible?

[Figure: DP cell dependencies for PSA. F(i, j) is reached from F(i-1, j-1) with score s(xi, yj), and from F(i-1, j) or F(i, j-1) with gap penalty -d]
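The O(mn) dynamic program for PSA can be sketched as follows, using a linear gap penalty d. The function names and the match/mismatch scoring function are assumptions made for illustration.

```python
def needleman_wunsch_score(x, y, s, d):
    """Optimal global alignment score of x and y with linear gap penalty d."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, m + 1):
        F[i][0] = -d * i
    for j in range(1, n + 1):
        F[0][j] = -d * j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # align x_i with y_j
                F[i - 1][j] - d,                          # x_i against a gap
                F[i][j - 1] - d,                          # y_j against a gap
            )
    return F[m][n]


def simple_score(a, b):
    # Hypothetical scoring: +1 for a match, -1 for a mismatch.
    return 1 if a == b else -1


print(needleman_wunsch_score("ACGCGGC", "ACGCGC", simple_score, d=2))
```

The table has (m+1)(n+1) cells and each cell looks at three predecessors, which is where the O(mn) bound comes from.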


The problem of global MSA

• Given k sequences: x1, x2, …, xk
• Insert gaps (-) in each sequence xi such that
– All sequences have the same length L
– Score of the global map is maximum
• MSA is more sensitive than PSA
– A faint similarity between two sequences may become detectable if present in many
• More sequences may increase alignment quality
– But the cost is added complexity


How can alignment columns be scored?

• In pairwise sequence alignment
– Scores quantify the exchangeability of the residues/gap in the pair
• In multiple sequence alignment
– A similar treatment is more complicated and requires use of phylogeny
• One solution:
– Evaluate MSA through its constituent PSAs
• These PSAs are called induced pairwise alignments


Induced pairwise alignments

• Example MSA:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG

• Induces three PSAs:
x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG

• The MSA can be scored by summing over the induced PSAs
– This is called the “Sum-of-pairs” approach
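An induced pairwise alignment is obtained by keeping two rows of the MSA and dropping every column in which both rows have a gap. A minimal sketch, with `induced_pairwise` as an assumed helper name:

```python
from itertools import combinations


def induced_pairwise(msa, a, b, gap="-"):
    """Project the MSA onto rows a and b, dropping columns that are gaps in both."""
    cols = [(p, q) for p, q in zip(msa[a], msa[b]) if not (p == gap and q == gap)]
    return "".join(p for p, _ in cols), "".join(q for _, q in cols)


msa = {"x": "AC-GCGG-C", "y": "AC-GC-GAG", "z": "GCCGC-GAG"}
for a, b in combinations(msa, 2):
    print(a, b, induced_pairwise(msa, a, b))
```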


Example: Sum-of-pairs score

BLOSUM 60 (excerpt):

     F   Y   G   D
F    5  -2  -2  -1
Y        7   1  -5
G            4  -3
D                5

Gap penalty: -8

MSA:
x: F-G
y: F-G
z: FYD

Induced PSAs and their scores:

x: FG
y: FG     5 + 4 = 9

x: F-G
z: FYD    5 - 8 - 3 = -6

y: F-G
z: FYD    5 - 8 - 3 = -6

• Sum-of-pairs score: 9 - 6 - 6 = -3
– What is the computational complexity?
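The worked example can be reproduced in a few lines. The score table and gap penalty are copied from the example; treating a gap-gap pair as contributing 0 (because that column is absent from the induced PSA) and the function names are assumptions of this sketch.

```python
from itertools import combinations

# Substitution scores as listed in the example (upper triangle only).
SCORES = {
    ("F", "F"): 5, ("F", "Y"): -2, ("F", "G"): -2, ("F", "D"): -1,
    ("Y", "Y"): 7, ("Y", "G"): 1, ("Y", "D"): -5,
    ("G", "G"): 4, ("G", "D"): -3,
    ("D", "D"): 5,
}
GAP_PENALTY = -8


def pair_score(a, b, gap="-"):
    if a == gap and b == gap:
        return 0  # this column does not appear in the induced PSA
    if a == gap or b == gap:
        return GAP_PENALTY
    return SCORES.get((a, b), SCORES.get((b, a)))


def sum_of_pairs(rows):
    """Sum the scores of all induced pairwise alignments, column by column."""
    return sum(
        pair_score(a, b)
        for r1, r2 in combinations(rows, 2)
        for a, b in zip(r1, r2)
    )


print(sum_of_pairs(["F-G", "F-G", "FYD"]))  # 9 - 6 - 6 = -3
```

For k rows of length L this visits k(k-1)/2 pairs per column, so scoring a given MSA takes time proportional to k² L.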


Problem: Finding the optimal MSA

• Given k sequences: x1, x2,…, xk

• Sum-of-pairs score for any MSA of k sequences is
– Sum of scores of all k(k-1)/2 induced PSAs
• Seek MSA A which maximizes the sum-of-pairs score

\[ S(A) = \sum_{i<j} S(A_{ij}) \]

where S(A_ij) is the score of A_ij, the PSA of sequences xi and xj induced by the MSA A
• Clearly exhaustive search is not an option
– Can we rely on dynamic programming?


Dynamic Programming

• Similar to pairwise alignments, multiple sequence alignments can be computed by dynamic programming

[Figure: DP matrix F in 2D (two sequences) and in 3D (three sequences)]


Generalized Needleman-Wunsch

• Given 3 sequences x, y, and z

• Main iteration loop:

F(i,j,k) = max{
  F(i-1, j-1, k-1) + S(xi, yj, zk),
  F(i-1, j-1, k  ) + S(xi, yj, - ),
  F(i-1, j  , k-1) + S(xi, - , zk),
  F(i-1, j  , k  ) + S(xi, - , - ),
  F(i  , j-1, k-1) + S(- , yj, zk),
  F(i  , j-1, k  ) + S(- , yj, - ),
  F(i  , j  , k-1) + S(- , - , zk)
}
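A direct, unoptimized sketch of this three-sequence recurrence follows. It assumes a sum-of-pairs column score with linear gap penalty d and a score of 0 for gap-gap pairs; the function names and the scoring function in the usage line are illustrative assumptions.

```python
from itertools import product


def column_score(col, s, d, gap="-"):
    """Sum-of-pairs score of one column; gap-gap pairs contribute 0."""
    total = 0
    for a in range(len(col)):
        for b in range(a + 1, len(col)):
            p, q = col[a], col[b]
            if p == gap and q == gap:
                continue
            total += -d if gap in (p, q) else s(p, q)
    return total


def align3_score(x, y, z, s, d):
    """Optimal sum-of-pairs score of a global alignment of x, y and z."""
    seqs = (x, y, z)
    # The 2^3 - 1 = 7 moves: each sequence either advances (1) or takes a gap (0).
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    F = {(0, 0, 0): 0}
    # product() yields cells in lexicographic order, so predecessors come first.
    for cell in product(*(range(len(t) + 1) for t in seqs)):
        if cell == (0, 0, 0):
            continue
        best = None
        for m in moves:
            prev = tuple(c - step for c, step in zip(cell, m))
            if min(prev) < 0:
                continue
            col = [t[c - 1] if step else "-" for t, c, step in zip(seqs, cell, m)]
            cand = F[prev] + column_score(col, s, d)
            if best is None or cand > best:
                best = cand
        F[cell] = best
    return F[(len(x), len(y), len(z))]


def match_score(a, b):
    # Hypothetical scoring: +1 for a match, -1 for a mismatch.
    return 1 if a == b else -1


print(align3_score("ACG", "AG", "ACG", match_score, d=2))
```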


Analysis of algorithm

• Given k sequences of length n:
– Space for matrix: O(n^k)
– Neighbors/cell: 2^k - 1
– Time to compute SP score: O(k^2)
– Overall runtime: O(k^2 2^k n^k)
• Implications
– Can align about 7 relatively short (length 200 - 300) sequences in a reasonable amount of time
• 2^7 · 200^7 > 1,600,000,000,000,000,000
– Exact optimality is generally not attainable
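The size of that number is easy to confirm with exact integer arithmetic. The variable names below are illustrative.

```python
k, n = 7, 200

cells = n ** k          # entries in the k-dimensional DP matrix
per_cell = 2 ** k       # roughly the number of neighbors examined per entry

print(f"{per_cell * cells:,}")           # 1,638,400,000,000,000,000
print(f"{k ** 2 * per_cell * cells:,}")  # including the O(k^2) sum-of-pairs cost
```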


Heuristics for multiple sequence alignment

• Exact optimality is too slow, even by dynamic programming

• Alternative:
– Seek good suboptimal solutions that are attainable in reasonable time
• Key questions:
– What is a “good” suboptimal solution?
– What is “reasonable” time?
• Heuristics focus on an intelligent reduction of search space
– Divide-and-conquer alignment
– Greedy alignment (progressive)


Divide-and-conquer alignment (DCA)

• Idea: Reduce search space for dynamic programming by cutting the sequences.

• Algorithm:
1. Cut sequences into fragments until fragments can be aligned by DP
2. Build multiple alignments by DP
3. Concatenate the resulting alignments


Reduction of search space

[Figure: three-dimensional DP lattice with axes Sequence 1, Sequence 2 and Sequence 3]

Cut points optimize: C = S_prefix + S_suffix - S_complete


Greater reduction of search space


Progressive alignment

• Idea:
– Build multiple sequence alignment from a series of pairwise alignments
• Strategy:
– Choose two sequences to align (optimally)
– Hold pairwise alignment fixed, treat as a new sequence, and iterate
• For n sequences:
– Requires n - 1 pairwise sequence alignments
• Does the order matter?
– What criteria are used to choose the sequences?


Guide tree

• Binary tree
– Leaves correspond to sequences
– Internal nodes represent alignments
– Root corresponds to final MSA
• The guide tree specifies the order of alignment
• Usually constructed from matrix of pairwise distances between sequences

[Figure: guide tree with leaves ATC, ATG, TCG, TCC; internal nodes hold the alignments ATC/ATG and TCG/TCC; the root holds the final MSA ATC-, ATG-, -TCG, -TCC]


Simple approach to distance matrix D

• Example sequences:

A ACGCGTTGGGCGATGGCAAC

B ACGCGTTGGGCGACGGTAAT

C ACGCATTGAATGATGATAAT

D ACACATTGAGTGATAATAAT

• Simple approach
– Count mismatches

• Pairwise distance matrix:
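Counting mismatches for the four example sequences gives the pairwise distances directly. This is a small sketch; the names `seqs`, `mismatches` and `D` are assumptions.

```python
from itertools import combinations

seqs = {
    "A": "ACGCGTTGGGCGATGGCAAC",
    "B": "ACGCGTTGGGCGACGGTAAT",
    "C": "ACGCATTGAATGATGATAAT",
    "D": "ACACATTGAGTGATAATAAT",
}


def mismatches(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(s, t))


D = {(p, q): mismatches(seqs[p], seqs[q]) for p, q in combinations(seqs, 2)}
for pair, dist in D.items():
    print(pair, dist)
```

The result is A-B 3, A-C 7, A-D 8, B-C 6, B-D 7, C-D 3.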


From pairwise distances to a tree

• Using this information, a tree can be drawn:

A ACGCGTTGGGCGATGGCAAC

B ACGCGTTGGGCGACGGTAAT

C ACGCATTGAATGATGATAAT

D ACACATTGAGTGATAATAAT

• Is it guaranteed that the distances exactly fit a tree?

[Figure: tree with leaves A, B, C, D and branch lengths 4, 1, 2, 2, 1]


Guide tree

[Figure: step-by-step construction of a guide tree for sequences 1 to 5, shown over four slides]


Progressive alignment

• Follow branching order of guide tree to build MSA

• Problem: We may have to align
– Two sequences
– A sequence and an alignment
– Two alignments

[Figure: guide tree over sequences 1 to 5]


How to align two alignments?

Idea: Dynamic programming; treat columns like single positions.

Example:

a =
GTC
GTA

b =
GTT
GTT

The three cases considered in the DP matrix F:

Align a[3] and b[3]:
GTC
GTA
GTT
GTT

Align a[3] and gap:
GT-C
GT-A
GTT-
GTT-

Align b[3] and gap:
GTT-
GTT-
GT-T
GT-T


Scoring an alignment of alignments

1 PEEKSAVTAL
2 GEEKAAVLAL
3 PADKTNVKAA
4 AADKTNVKAA

5 EGEWGLVLHV
6 AAEKTKIRSA

\[ \text{Score} = \frac{1}{8}\left[ s(T,V) + s(T,I) + s(L,V) + s(L,I) + 2\,s(K,V) + 2\,s(K,I) \right] \]

• Average over all possibilities, possibly weighted
– Generalizes PSA of two sequences to two profiles
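The unweighted average over all residue pairs from two columns can be sketched as below. It assumes the columns contain no gaps, and `toy_score` is a hypothetical stand-in for a real substitution matrix.

```python
from itertools import product


def profile_column_score(col_a, col_b, s):
    """Average of s(a, b) over every residue pair drawn from the two columns."""
    pairs = list(product(col_a, col_b))
    return sum(s(a, b) for a, b in pairs) / len(pairs)


def toy_score(a, b):
    # Hypothetical scoring function standing in for a substitution matrix.
    return 2 if a == b else -1


# Column (T, L, K, K) from sequences 1-4 against column (V, I) from sequences 5-6.
print(profile_column_score("TLKK", "VI", toy_score))
```

Because K occurs twice in the first column, the pairs (K, V) and (K, I) are each counted twice among the 8 pairs, which matches the factor 2 in the formula.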


Complexity of progressive alignment

The time required to align k sequences of length n:

• For progressive alignment: O(k^2 n^2)
• Compare with dynamic programming: O(k^2 2^k n^k)
• BUT
– Is there any guarantee on the quality of the progressive MSA?


Progressive alignments with ClustalW

• Clustal is the most popular method of MSA


ClustalW: Overview

[Figure: pairwise alignments (1 + 2, 1 + 3, 1 + 4, 2 + 3, 2 + 4, 3 + 4) → distance matrix → guide tree → progressive alignment]

1. Compute pairwise alignments (DP)
2. Convert similarities into distances
3. Build guide tree from distances by Neighbor Joining
4. Align with respect to guide tree


ClustalW server at EBI

http://www2.ebi.ac.uk/clustalw/


Input: Sequences in FASTA format

• ClustalW example:
– RBP protein sequences from five species:
– Human, mouse, rat, cow, pig


Input: Five RBP sequences

Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score: 84
Sequences (1:3) Aligned. Score: 84
Sequences (1:4) Aligned. Score: 91
Sequences (1:5) Aligned. Score: 92
Sequences (2:3) Aligned. Score: 99
Sequences (2:4) Aligned. Score: 86
Sequences (2:5) Aligned. Score: 85
Sequences (3:4) Aligned. Score: 85
Sequences (3:5) Aligned. Score: 84
Sequences (4:5) Aligned. Score: 96

best score is alignment between rat and mouse

Generate all 10 PSAs

• Highest scoring alignment ⇒ closest sequences on guide tree
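One simple way to turn the ten scores into distances is sketched below. It assumes each score can be read as a percentage and uses d = 1 - score/100. That conversion is an assumption of this sketch, not a statement of what the ClustalW program does internally.

```python
# Pairwise scores copied from the ClustalW log above.
scores = {
    (1, 2): 84, (1, 3): 84, (1, 4): 91, (1, 5): 92, (2, 3): 99,
    (2, 4): 86, (2, 5): 85, (3, 4): 85, (3, 5): 84, (4, 5): 96,
}

# Assumed conversion: treat the score as a percentage, d = 1 - score / 100.
distances = {pair: 1 - score / 100 for pair, score in scores.items()}

# The pair with the highest score has the smallest distance.
closest = min(distances, key=distances.get)
print(closest, round(distances[closest], 2))
```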


((gi|5803139|ref|NP_006735.1|:0.04284,(gi|6174963|sp|Q00724|RETB_MOUS:0.00075,gi|132407|sp|P04916|RETB_RAT:0.00423):0.10542):0.01900,gi|89271|pir||A39486:0.01924,gi|132403|sp|P18902|RETB_BOVIN:0.01902);

Use PSA scores to create guide tree

[Figure: guide tree with leaves Rat RBP, Mouse RBP, Pig RBP, Cow RBP, Human RBP]


Progressively align sequences

• Make an MSA based on the order in the guide tree
– Start with the two most closely related sequences
– Then add the next closest sequence
– Continue until all sequences are added to the MSA
• Rule: “once a gap, always a gap.”

[Figure: progressive MSA built along the guide tree over sequences 1 to 5]


ClustalW: “once a gap, always a gap”

Closely related:
x: ACGCGGC
y: ACGC-GC

Distantly related:
x: ACGCGGC
y: ACTT-TC

• There are many possible ways to make an MSA
– Where gaps are added is a critical question
• In which case are gap locations most reliable?
• Gaps are often added to the first two sequences
– To maintain the initial gap choices is to trust that those gaps are most believable


Output: MSA of 5 RBP sequences

CLUSTAL W (1.82) multiple sequence alignment

gi|89271|pir||A39486            MEWVWALVLLAALGSAQAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP 50
gi|132403|sp|P18902|RETB_BOVIN  ------------------ERDCRVSSFRVKENFDKARFAGTWYAMAKKDP 32
gi|5803139|ref|NP_006735.1|     MKWVWALLLLAAW--AAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP 48
gi|6174963|sp|Q00724|RETB_MOUS  MEWVWALVLLAALGGGSAERDCRVSSFRVKENFDKARFSGLWYAIAKKDP 50
gi|132407|sp|P04916|RETB_RAT    MEWVWALVLLAALGGGSAERDCRVSSFRVKENFDKARFSGLWYAIAKKDP 50
                                                  ********************:* ***:*****

gi|89271|pir||A39486            EGLFLQDNIVAEFSVDENGHMSATAKGRVRLLNNWDVCADMVGTFTDTED 100
gi|132403|sp|P18902|RETB_BOVIN  EGLFLQDNIVAEFSVDENGHMSATAKGRVRLLNNWDVCADMVGTFTDTED 82
gi|5803139|ref|NP_006735.1|     EGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTED 98
gi|6174963|sp|Q00724|RETB_MOUS  EGLFLQDNIIAEFSVDEKGHMSATAKGRVRLLSNWEVCADMVGTFTDTED 100
gi|132407|sp|P04916|RETB_RAT    EGLFLQDNIIAEFSVDEKGHMSATAKGRVRLLSNWEVCADMVGTFTDTED 100
                                *********:*******.*:************.**:**************

gi|89271|pir||A39486            PAKFKMKYWGVASFLQKGNDDHWIIDTDYDTYAAQYSCRLQNLDGTCADS 150
gi|132403|sp|P18902|RETB_BOVIN  PAKFKMKYWGVASFLQKGNDDHWIIDTDYETFAVQYSCRLLNLDGTCADS 132
gi|5803139|ref|NP_006735.1|     PAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADS 148
gi|6174963|sp|Q00724|RETB_MOUS  PAKFKMKYWGVASFLQRGNDDHWIIDTDYDTFALQYSCRLQNLDGTCADS 150
gi|132407|sp|P04916|RETB_RAT    PAKFKMKYWGVASFLQRGNDDHWIIDTDYDTFALQYSCRLQNLDGTCADS 150
                                ****************:*******:****:*:* ****** *********

* asterisks indicate identity in a column


ClustalW has sophisticated gap treatment

• Gap opening and extension penalties depend on:
– Weight matrix
– Sequence similarity, length, difference in sequence length
– Position of gaps and residues at gaps
• Motivation:
– If the positions of secondary structure elements (α-helices, β-strands) are known in all or some of the sequences
– Gap penalties could be increased inside ss elements and decreased outside them
– Forcing gaps to occur most often in loop regions


Sequence weighting in ClustalW

• Choose the root such that the mean branch lengths on either side are equal
• For each sequence compute distance from root
• Adjust branches that are used several times
• Use distances as weight factors in SP score

[Figure: rooted guide tree with leaves A, B, C and branch lengths 0.3, 0.2, 0.6, 0.1]

wA = 0.2 + 0.3/2 = 0.35
wB = 0.1 + 0.3/2 = 0.25
wC = 0.6
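The weights in this example can be reproduced by dividing each branch length equally among the leaves below it and summing the shares along each leaf's path to the root. The sketch assumes a tree encoded as nested lists of (subtree, branch length) pairs; that encoding, the tree shape (A and B joined below a 0.3 branch, C attached directly to the root) and the function names are assumptions inferred from the weights shown.

```python
def leaves(node):
    """Names of all leaves below a node; a leaf is represented by its name."""
    if isinstance(node, str):
        return [node]
    return [name for child, _ in node for name in leaves(child)]


def sequence_weights(node, weights=None):
    """Split every branch length equally among the leaves underneath it."""
    if weights is None:
        weights = {}
    if isinstance(node, str):
        weights.setdefault(node, 0.0)
        return weights
    for child, length in node:
        below = leaves(child)
        for name in below:
            weights[name] = weights.get(name, 0.0) + length / len(below)
        sequence_weights(child, weights)
    return weights


# Root -> (internal node, 0.3) -> A (0.2), B (0.1); root -> C (0.6).
tree = [([("A", 0.2), ("B", 0.1)], 0.3), ("C", 0.6)]
print(sequence_weights(tree))
```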


Weighted sum-of-pairs score

• Sequence pairs may be assigned weights to reduce the influence of very similar sequences on the alignment score

• This leads to a weighted sum-of-pairs score (WSP):

\[ \mathrm{WSP}(m_i) = \sum_{k<l} w_{kl} \, s(m_i^k, m_i^l) \]

where w_kl is the weight factor


Additional features of ClustalW

• Individual weights are assigned to sequences
– Closely related ⇒ less weight

• Scoring matrices are varied depending on the presence of conserved or divergent sequences, e.g.

PAM20 80-100% id

PAM60 60-80% id

PAM120 40-60% id

PAM250 0-40% id


Shortcomings of progressive approaches

• Progressive MSA is strongly dependent upon initial alignments
• If sequences aligned at each step are similar
– Progressive approach works well
• If MSA is built on dubious PSAs
– Errors in alignment are propagated and amplified
• Post-processing solution:
– Iterative refinement


Dangers of progressive alignment

• Initial alignments are “frozen” even when new evidence is introduced
• Example:

x: GAAGTT
y: GAC-TT   (frozen by initial PSA)

z: GAACTG
w: GTACTG

Additional sequences make clear that y: GA-CTT

[Figure: guide tree over x, y, z, w]


Iterative refinement of progressive MSA

[Figure: sequences x, y, z; the projection onto x and z is held fixed while y is allowed to vary]

• For each j = 1 to N
– Remove sequence xj and realign it to the remaining alignment of x1, …, xj-1, xj+1, …, xN

• Repeat until alignment converges


Ex: Iterative refinement

• Progressive alignment (x,y), (z,w), (xy, zw):

x: GAAGTTA
y: GAC-TTA
z: GAACTGA
w: GTACTGA

• After realigning y to the remainder:

x: GAAGTTA
y: G-ACTTA   + 3 matches
z: GAACTGA
w: GTACTGA


Evaluating methods of MSA

• Since DP for MSA is prohibitively slow
– Heuristic methods are required
• Heuristics vary in both speed and accuracy
• Speed is well quantified by computational complexity
– How can accuracy be quantified?
• Idea: Construct benchmark sets of “correct” MSAs
– Evaluate ability of methods to reconstruct “correct” alignment

• Metric: Q = proportion of correctly aligned residues
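One common way to make Q concrete is to count residue pairs: the fraction of residue pairs aligned in the reference that are also aligned in the test alignment. The sketch below uses that pair-based reading, which is an assumption about the metric. The two alignments in the usage lines are borrowed from the iterative refinement example purely for illustration; calling one of them the reference is hypothetical.

```python
def aligned_pairs(rows, gap="-"):
    """Set of residue pairs ((seq, pos), (seq, pos)) that share a column."""
    counters = [0] * len(rows)
    pairs = set()
    for col in zip(*rows):
        present = []
        for k, ch in enumerate(col):
            if ch != gap:
                present.append((k, counters[k]))  # (sequence index, residue index)
                counters[k] += 1
        pairs.update(
            (present[a], present[b])
            for a in range(len(present))
            for b in range(a + 1, len(present))
        )
    return pairs


def q_score(test, reference):
    """Fraction of reference residue pairs that the test alignment reproduces."""
    ref = aligned_pairs(reference)
    return len(ref & aligned_pairs(test)) / len(ref)


reference = ["GAAGTTA", "G-ACTTA"]  # hypothetical "correct" alignment
test = ["GAAGTTA", "GAC-TTA"]       # same two sequences, aligned differently
print(q_score(test, reference))
```

Both alignments must list the same ungapped sequences in the same order, and the reference must contain at least one aligned pair.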


Performance of different alignment tools

Algorithm    BAliBASE (237)       PREFAB (1932)          SABmark (698)
             Q       t            Q       t              Q       t
Align-m      0.804   19:25        -       -              0.352   56:44
DIALIGN      0.832   2:53         0.572   12:25:00       0.410   8:28
CLUSTALW     0.861   1:07         0.589   2:57:00        0.439   2:16
MAFFT        0.882   1:18         0.648   2:36:00        0.442   7:33
T-Coffee     0.883   21:31        0.636   144:51:00      0.456   59:10
MUSCLE       0.896   1:05         0.648   3:11:00        0.464   20:42
PROBCONS     0.910   5:32         0.668   19:41:00       0.505   17:20


Popular heuristic methods for MSA

• ClustalW: Most widely used
– http://www.ebi.ac.uk/clustalw/
• T-Coffee: Better but slower
– http://www.ch.embnet.org/software/TCoffee.html
• ProbCons: Most accurate
– http://probcons.stanford.edu/
• MUSCLE: Most scalable
– http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py


Summary

• Multiple alignments make a statement about homology

• Optimal solution can be found by DP, but prohibitively slow

• Faster heuristics necessary for most applications

• Most heuristic methods based on progressive method

• Variants include pre-processing and post-processing

• Choice of method dictated by tradeoffs of time and accuracy


Postscript: The chicken and the egg

[Figure: cycle linking an alignment (DGMNAGLAQ-VIA, DGM-ASLAQGVI-, -----SIPGVDK-) and a phylogenetic tree: initial alignment → estimate phylogenetic tree → compute sequence weights → refine alignment using weighted SP score]


Alignment ↔ phylogeny ↔ alignment ↔ ...

pig    ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
Dog    --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
Human  KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
Mouse  -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

***. ::: .: .. . : . . * . *: *

[Figure: phylogenetic tree with leaves pig, dog, human, mouse]

• Didn’t we use the tree to build the alignment?
– How can we use the alignment to build the tree?