chapter 6. multiple sequence alignment methods

Chapter 6.Chapter 6.

Multiple sequence alignment methodsMultiple sequence alignment methods

(C) 2000, 2001 SNU CSE Biointelligence Lab

2

OutlineOutline

What a multiple alignment means Scoring a multiple alignment Multidimensional Dynamic Programming Progressive alignment methods Multiple alignment by profile HMM training


3

Multiple alignmentMultiple alignment Biologists produce high quality multiple

alignments by hand using expert knowledge of protein sequence evolution. Highly conserved regions Buried hydrophobic residues Influence of protein structure Expected patterns of insertions and deletions


4

Multiple alignmentMultiple alignment Manual multiple sequence alignment is tedius.

Automatic MSA methods are needed. In general, an automatic method must have a way to ass

ign a score so that better MSA get better scores. Scoring a multiple alignment and searching over possib

le alignments should be distinguished. In probabilistic modelling, scoring function is primary concern. One of goals in probabilistic modeling is to incorporate as man

y of an expert’s evaluation criteria as possible into scoring procedure.


5

What a multiple alignment What a multiple alignment meansmeans In a multiple sequence alignment, homologous

residues among a set of sequences are aligned together in columns. ‘Homologous’ is meant in both the structural and

evolutionary sense. Ideally, a column of aligned residues occupy similar

three-dimensional structural positions and all diverge from a common ancestral residue.


6

What a multiple alignment What a multiple alignment meansmeans Manually aligned example-10 imunoglobulin supe

rfamily A crystal structure of 1tlk(telokin) is known The telokin structure and alignments to other related se

qyences reveal conserved characteristics of the I-set immunoglobulin superfamily fold, including eight conserved β-strands and certain key residues in the sequences, such as two completely conserved cysteines in the b and f strands which form a disulfide bond in the core of the folded structure.


7

What a multiple alignment What a multiple alignment meansmeans


8

What a multiple alignment What a multiple alignment meansmeans Except for trivial cases, it is not possible to create

a single ‘correct’ multiple alignment. Given pair of divergent but clearly homologus protein s

equences, usually only 50% of the individual residues were superposable.

The Globin family, often used as a ‘typical’ problem in computational work, is in fact exceptional:almost the entire structure is convserved among divergent sequences.

Even the definition of ‘structurally superposable’ is subjective and can be expected to vary among experts.


9

What a multiple alignment What a multiple alignment meansmeans Our ability to define a single ‘correct’ alignment will vary

with the relatedness of the sequences being aligned. An alignment of very similar sequences will generally be unambig

uous, but there alignments are not of great interest to us. For cases of interest, there is no objective way to define an unambi

guously correct alignment. Usually a small subset of key residues will be identifiable which can

be aligned unambiguously for all the sequences in a family almost regardless of sequence divergence.

Core structal elements will also tend to be conserved and meaningfully alignable.


10

Scoring a multiple alignmentScoring a multiple alignment Two important features of multiple alignments

Some positions are more conserved than others. The sequences are not independent, but instead are relat

ed by a phylogenetic tree.


11

Scoring a multiple alignmentScoring a multiple alignment An idealised way

Specifty a complete probabilistic model of molecular sequence evolution.

The probability of a multiple alignment can be calculated using evolutionary model.

We don’t have enough data to build such a model Workable approximation:partly or entirely ignore the p

hylogenetic tree while doing some sort of position-specific scoring.


12

Scoring a multiple alignmentScoring a multiple alignment Simplifying assumption

Individual columns of an alignment are statistically independent. Then scoring function can be written as

Mi: column i of the multiple alignment m S(mi):the score for column i G:an function for scoring the gaps that occur in the alignments.

Unspecified function-affine scoring function can be used


13

Scoring a multiple alignment-Scoring a multiple alignment-Minimum EntropyMinimum Entropy Minimum Entropy

More variability in an alignment will be described by a higher entropy. Exactly matching sequences will have 0 entropy (completely organized)

To find the best alignment we want to have the minimum entropy.


14

Scoring a multiple alignment-Scoring a multiple alignment-Minimum EntropyMinimum Entropy Minimum entropy

Counting the residues in each column

Probability of residue a in column I (ML estimate)

Probability of a column(independence assumed)

Entropy is the negative log of the probability of the column.


15

Scoring a multiple alignment-Scoring a multiple alignment-Minimum EntropyMinimum Entropy

Treating columns as statistically independent-Leaving out knowledge of phylogeny.

Actually very similar to HMM without gap information

The assumption that the sequences are independent can be reasonable if representative sequence of a sequence family s carefully chosen.

A variety of tree-based wdighting schemes have been proposed to deal with this problem to partially compensate for the defects of the sequence independence assumption.


16

Scoring a multiple alignment-Scoring a multiple alignment-Sum of PairsSum of Pairs Sum of pairs

Standard method of scoring multiple alignment Similarity to HMM formulation

Do not use phylogenetic tree Assumes statistical indepedence for the columns.

Not HMM formulation though


17

Scoring a multiple alignment-Scoring a multiple alignment-Sum of pairsSum of pairs Sum of pairs

Columns are scored by SP function using a substitution scoring matrix such as a PAM or BLOSUM matrix.

Use linear gap function or score affine gaps separately. Sum N(N-1)/2 pairwise scores


18

Scoring a multiple alignment-Scoring a multiple alignment-Sum of pairsSum of pairs Problem of Sum of pairs

Sum of scores are not probabilistic correct extension to log-odds score.

Correct log-odds score extension:

SP score:

Evolutionary events are over-counted, a problem which increases as the number of sequemces increases.


19

Scoring a multiple alignment-Scoring a multiple alignment-Sum of pairsSum of pairs Example

an alignment of N sequences which all have leucine(L) at a certain position.

BLOSUM50 s(L,L)=5 The SP score of the column is 5N(N-1)/2

If instead there were one glycine(G) and N-1 Ls BLOSUM50 s(G,L)=-4 The SP score of the column is worse than the score for a colu

mn of all Ls by a fraction of 9(N-1) / 5N(N-1)/2 =18/5N


20

Scoring a multiple alignment-Scoring a multiple alignment-Sum of pairsSum of pairs Difference is 18/5N

Relative difference between score between the correct and incorrect allignment decreases with the no. of sequences

Yet, if we have MORE evidence that L is conserved then an outlier out to DECREASE the score more.


21

Multidimensional Dynamic Multidimensional Dynamic ProgrammingProgramming It is possible to generalise pairwise DP alignment to the ali

gnment of N sequences.


22

Multidimensional Dynamic Multidimensional Dynamic ProgrammingProgramming Assumptions

The columns of an alignment are statistically independent The gaps are scored with a linear gap cost

Then the overall score S(m) for an alignment can be calculated as a sum of the scores for each column.


23

Multidimensional Dynamic Multidimensional Dynamic ProgrammingProgramming


24

Multidimensional Dynamic Multidimensional Dynamic ProgrammingProgramming Simplifying the notation


25

Multidimensional Dynamic Multidimensional Dynamic ProgrammingProgramming Straightforward Multidimensional DP

Pros It can find optimal solution. Arbitary column scoring function can be used Only assumption is that column scores are independent.

Cons There are 2^N-1 gap combinations for each entry Huge computational complexity-O(2^N L^N)


26

Multidimensional Dynamic Multidimensional Dynamic Programming-MSAProgramming-MSA

MSA can reduce the volume of the multidimensional dynamic programing matrix that needs to be examined

Optimally align up to 5-7 protein sequences of reasonable length(200-300 residues)


27

Multidimensional Dynamic Multidimensional Dynamic Programming-MSAProgramming-MSA Assumptions

SP scoring system The score of a multiple alignment is the sum of the scores of all pairwise

alignment defined by the multiple alignment. Then the score of the complete alignment is given by

Let be the optimal pairwise alignment of k,l


28

Multidimensional Dynamic Multidimensional Dynamic Programming-MSAProgramming-MSA We can obtain a lower bound on the score of any pairwise alignment t

hat can occur in the optimal multiple alignment. Assume that we have a lower bound σ(a) on the score of the optimal

multiple alignment, then for optimal multiple alignment a

We only need to consider pairwise alignment of k and l that score better than

A good bound σ(a) can be obtained by any fast heurist algorithm Optimal pairwise alignment can be found using dynamic programmin

g


29

Multidimensional Dynamic Multidimensional Dynamic Programming-MSAProgramming-MSA Now find the complete set of coordinate pairs (ik,il) such that the

best alignment of xk to xl through (ik,il) scores more than The costly multidimensional dynamic programming algorithm can be r

estricted to evaluate only cells in the intersection of all theses sets: I,e, cels (i1,i2,…iN) for which (ik,il) is in for all k,l.


30

Progressive alignment Progressive alignment methodsmethods Most commonly used approach

Works by constructing a succession of pairwise alignmensts.

Initially, two sequences are chosen and aligned by standard pairwise alignment;this alignment is fixed.

Then, a third sequence is chosen and aligned to the first alignment

This process is iterated until all sequences have been aligned.


31

Progressive alignment Progressive alignment methodsmethods Basically heuristic

It does not separate the scoring and optimising. It does not directly optimise any global scoring functio

n. Fast and efficient, Generates reasonable result


32

Progressive alignment Progressive alignment methodsmethods Differences between PA algorithms

The way that they choose the order to do the alignment Whether the progression involves only alignment of seq

uences to a single growing alignment or whether subfamilies are built up on a tree structure and,at certain points, alignments are aligned to alignments.

Procedure used to align and score sequences or alignments against existing alignmetns.


33

Progressive alignment methods- Feng-DooProgressive alignment methods- Feng-Doolittle progressive multiple alignmentlittle progressive multiple alignment

Calculate a diagonal matrix of N(N-1)/2 distances between all pairs of N sequences by standard pairwise alignment. Compute a distance matrix D=-log(S)

Construct a Guide tree from the distance matrix using a clustering algorithm

Starting from the first node added to the tree, align the child nodes. Repeat for all other nodes in the order that they were added to the tree.


34

Progressive alignment methods-Feng-DooliProgressive alignment methods-Feng-Doolittle progressive multiple alignmentttle progressive multiple alignment

Converting alignment scores to distances Doesn’t need to be accurate-the goal is only to create an approximate

guide tree, not an evolutionary tree.

In phylogenetic tree construction, more care must be taken


35

Progressive alignment methods-Feng-Progressive alignment methods-Feng-Doolittle progressive multiple alignmeDoolittle progressive multiple alignmentnt Clustering

Done with The Fitch-Margooliash algorithm Sequence-Sequence alignments

Done with usual pairwise dynamic programming. A sequence is added to an existing group by aligning it pairwis

e to each sequence in the group in turn. The highest scoring pairwise alignment determines how the se

quence will be aligned to the group.


36

Progressive alignment methods-Feng-DooliProgressive alignment methods-Feng-Doolittle progressive multiple alignmentttle progressive multiple alignment

‘Once a gap,always a gap’ rule After an alignment is completed, gap symbols are replaced with a

neutral X character. This rule allows pairwise sequenc alignments to be used to guide t

he alignment of sequences to groups or groups to groups; otherwise, any given pairwise sequence alignment would not necessarily be consistent with the pre-existing alignment of a group.

Desirable side effect:encouraging gaps to occur in the same columns in subsequent pairwise alignments.

Not needed in profile-based progressive alignment algorithms


37

Progressive alignment Progressive alignment methodsmethods A problem with the Feng-Doolittle approach

all alignments are determined by pairwise sequence alignments.

It is advantageous to use position-specific information from the group’s multiple alignment to align a new sequence to it. (e.g. degree of sequence conservation)

Many progressive alignment methods use pairwise alignment of sequences to profiles or of profiles to profiles as a subroutine which is used many times in the process.


38

Progressive alignment Progressive alignment methodsmethods Linear gap scoring case

s(-,a)=s(a,-)=-g and s(-,-)=0 Two profiles: sequence 1..n and n+1… N Global alignment is

The first two sums are unaffected by the global alignment(s(-,-)=0) Therefore the optimal alignment of the two profiles can be obtained b

y only optimising the last sum with the cross terms, which can be done exactly like a standard pairwise alignment.


39

Progressive alignment methods-Progressive alignment methods-CLUSTAWCLUSTAW Profile-based progresive multiple alignment Works in much the same way as the Feng-Doolitle

method except for its carefully tuned use of profile alignment methods.

Uses various heuristics.


40

Progressive alignment methods-Progressive alignment methods-CLUSTAWCLUSTAW Construct a distance matrix of all N(N-1)/2 pairs b

y pairwise dynamic programming. Construct a guide tree by a neighbour-joining clust

ering algorithm. Progressively align at nodes in order of decreasing

similarity, using sequence-sequence, sequence-profile, and profile-profile alignment. Scoring is basically SP.


41

Progressive alignment methods-Progressive alignment methods-CLUSTAWCLUSTAW Heuristics used

Sequences are weighted to compensate for biased representation in large subfamilies.

The substitution matrix is chosen on the basis of the similarity expected of the alignment.

Position-specific gap-open penalties are used. Gap penalties are increased if there are no gaps in a

column but gaps occur nearby in the alignment.


42

Progressive alignment methods-Progressive alignment methods-Iterative refinement methodsIterative refinement methods Problem with progressive alignment

Subalignments are frozen. Once a group of sequemces has been aligned, their alignment t

o each other cannot be changed at a later stage as more data arrive.

Iterative refinement methods attempt to circumvent this problem.


43

Progressive alignment methods-Progressive alignment methods-Iterative refinement methodsIterative refinement methods Iterative refinement method

An initial alignment is generated Then one sequence (or a set of sequences) is taken out and realigned to a

profile of the remaining aligned sequences. If a meaningful score is being optimized, this either increases the overall

score or results in the same score. Another sequence is chosen and realigned, and so on, until alignment does

not change

Guaranteed to converged to a local maximum.


44

Progressive alignment methods-Progressive alignment methods-Iterative refinement methodsIterative refinement methods Barton-Sternberg multiple alignment

Find the two sequences with the highest pairwise similarity and align them using standard pairwise DP alignment.

Find the sequence that is most similar to a profile of the alignment of the first two, and align it to the first two by profile-sequence alignment. Repeat until all sequences have been included in the multiple aligment.

Remove sequence x1 and realign it to a profile of the other aligned sequences x2,… xN by profile-sequence alignment. Repeat for sequences x2…xN.

Repeat the previous realignment step a fixed number of times, or until the alignment score converges.


45

Multiple alignment by profile Multiple alignment by profile HMM trainingHMM training Sequence profiles could be recast in probabilistic f

orm as profile HMMs. Profile HMMs could simply be used in place of standar

d profiles in progressive or iterative alignment methods. Ad hoc SP scoring scheme can be replaced by more ex

plicit profile HMM assumption. Profile HMMs can also be trained from initially unalign

ed sequences using the Baum-Welch EM


46

Multiple alignment by profile HMM Multiple alignment by profile HMM training- Multiple alignment with a known training- Multiple alignment with a known profile HMMprofile HMM Before we estimate a model and a multiple

alignment simultaneously we consider the simpler problem of obtaining a multiple alignment from a known model. When we have a multiple alignment and a model of a

small representative set of sequences in a family, and we wish to use that model to align a large member of other family members altogether.


47

Multiple alignment by profile HMM Multiple alignment by profile HMM training- Multiple alignment with a training- Multiple alignment with a known profile HMMknown profile HMM We know how to align a sequence to a profile HM

M-Viterbi algorithm Construction a multiple alignment just requires cal

culating a Viterbi alignment for each individual sequence. Residues aligned to the same profile HMM match state

are aligned in columns.


48

Multiple alignment by profile HMM Multiple alignment by profile HMM training-Multiple alignment with a training-Multiple alignment with a known profile HMMknown profile HMM Given a preliminary alignment, HMM can align further

sequences.


49

Multiple alignment by profile HMM Multiple alignment by profile HMM training- Multiple alignment with a training- Multiple alignment with a known profile HMMknown profile HMM


50

Multiple alignment by profile HMM Multiple alignment by profile HMM training- Multiple alignment with a training- Multiple alignment with a known profile HMMknown profile HMM Importance difference with other MSA programs

Viterbi path through HMM identifies inserts Profile HMM does not align inserts Other multiple alighment algorithms align the whole se

quences.


51

Multiple alignment by profile HMM Multiple alignment by profile HMM training- Multiple alignment with a known training- Multiple alignment with a known profile HMMprofile HMM HMM doesn’t attempt to align residues assigned t

o insert states. The insert state residues usually represent part of the se

quences which are atypical, unconserved, and not meaningfully alignable.

This is a biologically realistic view of multiple alignment


52

Multiple alignment by profile HMM Multiple alignment by profile HMM training- Profile HMM training from training- Profile HMM training from unaligned sequencesunaligned sequences Harder problem-estimating both a model and a mu

ltiple alignment from initially unaligned sequences. Initialization:Choose the length of the profile HMM an

d initialize parameters. Training:Estimate the model using the Baum-Welch alg

orithm or the Viterbi alternative. Multiple Alignment:Align all sequences to the final mo

del using the Viterbi algorithm and build a multiple alignment as described in the previous section.


53

Multiple alignment by profile HMM Multiple alignment by profile HMM training- Profile HMM training from training- Profile HMM training from unaligned sequencesunaligned sequences Initial Model

The only decision that must be made in choosing an initial structure for Baum-Welch estimation is the length of the model M.

A commonly used rule is to set M be the average length of the training sequence.

We need some randomness in initial parameters to avoid local maxima.


54

Multiple alignment by profile Multiple alignment by profile HMM trainingHMM training Avoiding Local maxima

Baum-Welch algorithm is guaranteed to find a LOCAL maxima.

Models are usually quite long and there are many opportunities to get stuck in a wrong solution.

Multidimensional dynamic programming finds global optima, but is not practical.

Solution Start again many times from different initial models. Use some form of stochastic search algorithm, e.g. simulated

annealing.


55

Multiple alignment by profile HMM Multiple alignment by profile HMM training-Simulated annealingtraining-Simulated annealing Theoretical basis

Some compounds only crystallise if they are slowly annealed from high temperature to low temperature.

One can introduce an artificial temperature T, and by the laws of statistical physics the probabiliy of a configuration x is given by the Gibbs distribution.

In the limit of T->0, the system is ‘frozen’ in the limit of T->infinity, the system is ‘molten’

The minimum can be found by sampling this probability distribution at a high temperature first, and then at gradually decreasing temperatures.


56

Multiple alignment by profile HMM Multiple alignment by profile HMM training-Simulated annealingtraining-Simulated annealing For an HMM, a natural energy function is

Approximations Noise injection during Baum-Welch reestimation Simulated annealing Viterbi estimation of HMMs


57

Multiple alignment by profile HMM Multiple alignment by profile HMM training-Simulated annealingtraining-Simulated annealing Noise injection during Baum-Welch reestimation

Add noise to the counts estimated in the forward-backward procedure

Let the size of this noise decrease slowly.


58

Multiple alignment by profile HMM Multiple alignment by profile HMM training-Simulated annealingtraining-Simulated annealing Simulated annealing Viterbi estimation of HMMs

Model is trained by a simulated annealing variant of the Viterbi approximation to Baum-Welch estimation.

Viterbi estimation selects the highest probability path π of each seqeence x. Simulate annealing samples each path π according to the likelihood of the

path given the current model as modified by a temperature T.


59

Multiple alignment by profile HMM Multiple alignment by profile HMM training-Simulated annealingtraining-Simulated annealing Scheduling the temperature

A whole science (or art) itself There are theoretical result for simulated annealing saying that if the

temperature is lowered slowly enough, finding the optimum is guaranteed. In practice a simple exponentially or linearly decreasing schedule is often

used.


60

Multiple alignment by profile HMM -Multiple alignment by profile HMM -Comparison to Gibbs samplingComparison to Gibbs sampling The ‘Gibbs sampler’ algorithm described by Lawrence et a

l.[1993] has substantial similarities. The problem was to simultaneously find the motif positions and to

estimate the parameters for a consensus statistical model of them. The statistical model used is essentially a profile HMM with no ins

ert or delete states. In HMM framework, both SA algorithm and the Gibbs sa

mpler are stochastic variants of the Viterbi algorithm of EM. The Gibbs sampler is like running simulated annealing viterbi algo

rithm at a constant T=1, where alignments are sampled from a probability distribution unmodified by any effect of a temperature factor.


61

Multiple alignment by profile Multiple alignment by profile HMM training-Model surgeryHMM training-Model surgery After(or during) training a model, we can look at the align

ment it produces and decide that model needs some modification. Some of the match states are redundant Some insert states absorb too much sequence

Model sugery If a match state is used by less than ½ of training sequences, delete

its module (match-insert-delete states) If more than ½ of training sequences use a certain insert state, exp

and it into n new modules, where n is average length of insertions Ad hoc, but works well

chapter 6. multiple sequence alignment methods

Documents