pairwise sequence alignment - algorithms in bioinformatics

MPI for Developmental Biology, Tübingen

Pairwise sequence alignment

Christoph Dieterich

Department of Evolutionary Biology
Max Planck Institute for Developmental Biology

Outline: Pairwise alignment; Substitution matrices; Gap penalties; Global alignment algorithm; Local alignment algorithm

Example

Alignment between very similar human alpha- and beta globins:

GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL
G+ +VK+HGKKV  A+++++AH+D++ +++++LS+LH  KL
GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL

Example

Plausible alignment to leghaemoglobin from yellow lupin:

GSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKL
++ ++++H+ KV   + +A  ++          +L+ L+++H+ K
NNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG

Example

A spurious high-scoring alignment of human alpha globin to a nematode glutathione S-transferase homologue:

GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL
GS+ + G + +D L ++ H+ D+ A +AL D ++AH+
GSGYLVGDSLTFVDLLVAQHTADLL--AANAALLDEFPQFKAHQE

Assessing the quality of an alignment

The goal is to use similarity-based alignments to uncover homology while avoiding homoplasy.

Homoplasy: random mutations that appear in parallel or convergently in two different lineages.

The scoring model

Computation of an alignment critically depends on the choice of parameters. Generally, no existing scoring model can be applied to all situations.

If evolutionary relationships between the sequences are to be reconstructed, scoring matrices based on mutation rates are usually applied.

If protein domains are compared, the scoring matrices should be based on the composition of the domains and their substitution frequencies.

Substitution matrices

To be able to score an alignment, we need to determine score terms for each aligned residue pair.

Definition

A substitution matrix $S$ over an alphabet $\Sigma = \{a_1, \ldots, a_\kappa\}$ has $\kappa \times \kappa$ entries, where each entry $(i, j)$ assigns a score for a substitution of the letter $a_i$ by the letter $a_j$ in an alignment.

Substitution matrices

Basic idea: Follow the scheme of statistical hypothesis testing.

$$\mathrm{Score}\binom{a}{b} = \frac{f(a, b)}{f(a) \cdot f(b)}$$

The frequencies of the letters $f(a)$ as well as the substitution frequencies $f(a, b)$ stem from a representative data set.

Null hypothesis / Random model

Given a pair of aligned sequences (without gaps), the null hypothesis states that the two sequences are unrelated (not homologous). The alignment is then random with a probability described by the model R. The unrelated or random model R assumes that in each aligned pair of residues the two residues occur independently of each other. Then the probability of the two sequences is:

$$P(X, Y \mid R) = P(X \mid R)\, P(Y \mid R) = \prod_i p_{x_i} \prod_i p_{y_i}.$$

Match model

In the match model M, describing the alternative hypothesis, aligned pairs of residues occur with a joint probability $p_{ab}$, which is the probability that a and b have each evolved from some unknown original residue c as their common ancestor. Thus, the probability for the whole alignment is:

$$P(X, Y \mid M) = \prod_i p_{x_i y_i}.$$

Odds ratio

The ratio of the two gives a measure of the relative likelihood that the sequences are related (model M) as opposed to being unrelated (model R). This ratio is called the odds ratio:

$$\frac{P(X, Y \mid M)}{P(X, Y \mid R)} = \frac{\prod_i p_{x_i y_i}}{\prod_i p_{x_i} \prod_i p_{y_i}} = \prod_i \frac{p_{x_i y_i}}{p_{x_i} p_{y_i}}$$

Log-odds ratio

To obtain an additive scoring scheme, we take the logarithm (base 2 is usually chosen) to get the log-odds ratio:

$$S = \log\left(\frac{P(X, Y \mid M)}{P(X, Y \mid R)}\right) = \log\left(\prod_i \frac{p_{x_i y_i}}{p_{x_i} p_{y_i}}\right) = \sum_i s(x_i, y_i),$$

with

$$s(a, b) := \log\left(\frac{p_{ab}}{p_a p_b}\right).$$
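The additivity of the log-odds score can be checked numerically. The following is a minimal Python sketch under stated assumptions: the two-letter alphabet and all probability values are toy numbers invented for illustration, and the function names `log_odds` and `alignment_score` are hypothetical.

```python
import math

def log_odds(p_ab: float, p_a: float, p_b: float) -> float:
    """s(a, b) = log2(p_ab / (p_a * p_b))."""
    return math.log2(p_ab / (p_a * p_b))

def alignment_score(x: str, y: str, joint: dict, background: dict) -> float:
    """Sum of the per-column log-odds scores of an ungapped alignment."""
    if len(x) != len(y):
        raise ValueError("an ungapped alignment needs sequences of equal length")
    return sum(log_odds(joint[(a, b)], background[a], background[b])
               for a, b in zip(x, y))

# Toy model (assumed values): background frequencies and joint pair probabilities
background = {"A": 0.5, "B": 0.5}
joint = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1}

score = alignment_score("AAB", "ABB", joint, background)
print(round(score, 4))
```

The summed score equals the base-2 logarithm of the odds ratio P(X, Y | M) / P(X, Y | R) computed directly from the products, which is the point of taking logarithms.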

PAM matrices

Definition (PAM)

One point accepted mutation (1 PAM) is defined as an expected number of substitutions per site of 0.01. A 1 PAM substitution matrix is thus derived from any evolutionary model by setting the row sum of off-diagonal terms to 0.01 and adjusting the diagonal terms to keep the row sum equal to 1.

Definition (Jukes-Cantor Model)

The basic assumption is equality of substitution frequency for any nucleotide at any site. Thus, changing a nucleotide to each of the three remaining nucleotides has probability $\alpha$ per time unit. The rate of nucleotide substitution per site per time unit is then $r = 3\alpha$.

PAM matrices

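Let’s build a PAM 1 matrix under a Jukes-Cantor model of sequence evolution.

$$\begin{pmatrix}
1 - 3\alpha & \alpha & \alpha & \alpha \\
\alpha & 1 - 3\alpha & \alpha & \alpha \\
\alpha & \alpha & 1 - 3\alpha & \alpha \\
\alpha & \alpha & \alpha & 1 - 3\alpha
\end{pmatrix}$$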

PAM matrices

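We scale the matrix entries such that the expected number of substitutions per site is $0.01 = 3\alpha$ and obtain a probability matrix (off-diagonal entries rounded to three decimals):

$$\begin{pmatrix}
0.99 & 0.003 & 0.003 & 0.003 \\
0.003 & 0.99 & 0.003 & 0.003 \\
0.003 & 0.003 & 0.99 & 0.003 \\
0.003 & 0.003 & 0.003 & 0.99
\end{pmatrix}$$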

PAM matrices

A scoring matrix is then obtained by computing the log-odds ratios:

$$s(a, b) := \log\left(\frac{p_{ab}}{p_a p_b}\right)$$

with $p_A = p_C = p_G = p_T = 0.25$ and joint probabilities as given by the PAM probability matrix.

PAM matrices

This leads to the following substitution score matrix:

$$\begin{pmatrix}
398 & -438 & -438 & -438 \\
-438 & 398 & -438 & -438 \\
-438 & -438 & 398 & -438 \\
-438 & -438 & -438 & 398
\end{pmatrix}$$
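The printed scores can be reproduced with a few lines of Python. This is a sketch under assumptions that are not stated in the slides: base-2 logarithms, a scaling factor of 100, and truncation toward zero. These choices were picked only because they yield 398 and -438. As in the text, the entries of the probability matrix are used directly as $p_{ab}$.

```python
import math

DIAGONAL = 0.99       # diagonal entry of the PAM 1 probability matrix
OFF_DIAGONAL = 0.003  # off-diagonal entry as printed (0.01 / 3, rounded)
BACKGROUND = 0.25     # pA = pC = pG = pT
SCALE = 100           # assumed scaling factor, inferred from the printed scores

def score(p_ab: float) -> int:
    """Scaled log-odds score; int() truncates toward zero (an assumption)."""
    return int(SCALE * math.log2(p_ab / (BACKGROUND * BACKGROUND)))

nucleotides = "ACGT"
S = [[score(DIAGONAL if a == b else OFF_DIAGONAL) for b in nucleotides]
     for a in nucleotides]

for row in S:
    print(row)
```

Matches receive a positive score because they are far more frequent under the match model than under the random model, while each individual substitution is much rarer than chance and is scored negatively.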

BLOCKS and BLOSUM matrices

The BLOSUM matrices were derived from the BLOCKS database [1]. Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins.

[1] Henikoff, S. and Henikoff, J.G. (1992) Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A 89(22):10915-9. BLOCKS database server: http://blocks.fhcrc.org/

BLOCKS and BLOSUM matrices

For the scoring matrices of the BLOSUM (= BLOcks SUbstitution Matrix) family, all blocks of the database are evaluated columnwise. For each possible pair of amino acids, the frequency $f(a_i, a_j)$ with which the pair $(a_i, a_j)$ occurs within the columns is determined.

BLOCKS and BLOSUM matrices

Altogether there are $\binom{n}{2}$ possible pairs that we can draw from this alignment. We now assume that the observed frequencies are equal to the frequencies in the population. Then

$$p_{aa} = \frac{\text{observed number of pairs } (a, a)}{\binom{n}{2}}$$

The observed frequency of a single amino acid is generally computed as $p_a = p_{aa} + \sum_{b \neq a} p_{ab}/2$. For this example we then get $p_A = 0.8 + 0.2/2 = 0.9$ and $p_C = 0.1$.
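The pair counting can be sketched in Python. The column used here, nine A and one C, is a hypothetical input chosen because it reproduces the frequencies quoted above ($p_{AA} = 0.8$, $p_{AC} = 0.2$); the function names are likewise illustrative.

```python
from collections import Counter
from itertools import combinations

def pair_frequencies(column: str) -> dict:
    """Relative frequency of each unordered residue pair in one alignment column."""
    pairs = Counter(tuple(sorted(p)) for p in combinations(column, 2))
    total = sum(pairs.values())  # equals n choose 2
    return {pair: count / total for pair, count in pairs.items()}

def residue_frequencies(pair_freq: dict) -> dict:
    """p_a = p_aa + sum over b != a of p_ab / 2."""
    freq = Counter()
    for (a, b), p in pair_freq.items():
        if a == b:
            freq[a] += p
        else:
            freq[a] += p / 2
            freq[b] += p / 2
    return dict(freq)

column = "AAAAAAAAAC"  # hypothetical column: nine A and one C, so n = 10
p_pair = pair_frequencies(column)
p_single = residue_frequencies(p_pair)
print(p_pair)
print(p_single)
```

With n = 10 there are 45 pairs in total, 36 of type (A, A) and 9 of type (A, C), which gives the pair frequencies 0.8 and 0.2.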

BLOCKS and BLOSUM matrices

Different levels of the BLOSUM matrix can be created by differentially weighting the degree of similarity between sequences. For example, a BLOSUM62 matrix is calculated from protein blocks such that if two sequences are more than 62% identical, then the contribution of these sequences is weighted to sum to one.

BLOCKS and BLOSUM matrices

BLOSUM62 is scaled so that its values are in half-bits, i.e. the log-odds were multiplied by $2/\log_2 2$ (that is, by 2 when the logarithm is taken to base 2) and then rounded to the nearest integer value.

    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A   4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R  -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N  -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D  -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
...

Gap penalties

Gaps are undesirable and thus penalized. The standard cost associated with a gap of length $g$ is given either by a linear score

$$\gamma(g) = -g d$$

or an affine score

$$\gamma(g) = -d - (g - 1) e,$$

where $d$ is the gap open penalty and $e$ is the gap extension penalty.

Gap penalties

Usually, e < d, with the result that fewer isolated gaps are produced, as shown in the following comparison:

Linear gap penalty:
GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL
GSAQVKGHGKK--------VA--D----A-SALSDLHAHKL

Affine gap penalty:
GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL
GSAQVKGHGKKVADA---------------SALSDLHAHKL
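To see why the affine score favours the second alignment, the gap costs of the two gapped rows above can be compared directly. This is a sketch only: the penalty values d = 12 and e = 2 are hypothetical, and only the gap terms are scored, not the substitution terms.

```python
import re

def linear_gap(g: int, d: float) -> float:
    """Linear gap score: gamma(g) = -g * d."""
    return -g * d

def affine_gap(g: int, d: float, e: float) -> float:
    """Affine gap score: gamma(g) = -d - (g - 1) * e."""
    return -d - (g - 1) * e

def gap_lengths(row: str) -> list:
    """Lengths of the maximal runs of '-' in one row of an alignment."""
    return [len(run) for run in re.findall(r"-+", row)]

scattered = "GSAQVKGHGKK--------VA--D----A-SALSDLHAHKL"
contiguous = "GSAQVKGHGKKVADA---------------SALSDLHAHKL"
d, e = 12, 2  # hypothetical gap open and gap extension penalties

totals = {}
for name, row in [("scattered", scattered), ("contiguous", contiguous)]:
    gaps = gap_lengths(row)
    linear_total = sum(linear_gap(g, d) for g in gaps)
    affine_total = sum(affine_gap(g, d, e) for g in gaps)
    totals[name] = (gaps, linear_total, affine_total)
    print(name, gaps, linear_total, affine_total)
```

Both rows contain 15 gap positions, so the linear score cannot tell them apart on gap cost alone. The affine score charges the open penalty once per gap, so four separate gaps cost more than a single gap of the same total length.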

Alignment algorithms

Given a scoring scheme, we need an algorithm that computes the highest-scoring alignment of two sequences.

As for the edit distance-based alignments, we will discuss alignment algorithms based on dynamic programming. They are guaranteed to find the optimal scoring alignment.

Note of caution: Optimal pairwise alignment algorithms are of complexity O(n · m).

Global alignment: Needleman-Wunsch algorithm

Problem

Consider the problem of obtaining the best global alignment of two sequences. The Needleman-Wunsch algorithm is a dynamic program that solves this problem.

Idea: Build up an optimal alignment using previous solutions for optimal alignments of smaller substrings.

Global alignment: Needleman-Wunsch algorithm

We define a matrix

$$F : \{0, 1, \ldots, n\} \times \{0, 1, \ldots, m\} \to \mathbb{R}$$

in which $F(i, j)$ equals the best score of an alignment of the two prefixes $(x_1, x_2, \ldots, x_i)$ and $(y_1, y_2, \ldots, y_j)$.

Global alignment: Needleman-Wunsch algorithm

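The matrix has one column for each of 0, x1, . . . , xn and one row for each of 0, y1, . . . , ym, with F (0, 0) in the top-left corner. The cell F (i, j) is computed from its three neighbours:

              x(i−1)             x(i)
y(j−1)    F (i − 1, j − 1)    F (i, j − 1)
                         ↘       ↓
y(j)      F (i − 1, j)    →   F (i, j)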

We obtain $F(i, j)$ as the largest score arising from these three options:

$$F(i, j) := \max \begin{cases}
F(i - 1, j - 1) + s(x_i, y_j) \\
F(i - 1, j) - d \\
F(i, j - 1) - d
\end{cases}$$

This is applied repeatedly until the whole matrix $F(i, j)$ is filled with values.

Recursion

To complete the description of the recursion, we need to set the values of F (i, 0) and F (0, j) for i ≠ 0 and j ≠ 0:

We set F (i, 0) = −i · d for i = 0, 1, . . . , n and
we set F (0, j) = −j · d for j = 0, 1, . . . , m.

The final value F (n, m) contains the score of the best global alignment between X and Y.

Example of a global alignment matrix

D     0    G    A    T    T    A    G
0     0   -2   -4   -6   -8  -10  -12
A    -2   -1   -1   -3   -5   -7   -9
T    -4   -3   -2    0   -2   -4   -6
T    -6   -5   -4   -1    1   -1   -3
A    -8   -7   -4   -3   -1    2    0
C   -10   -9   -6   -5   -3    0    1

The entries follow from the recursion with s(a, a) = 1, s(a, b) = −1 and d = 2.

Pseudo code of Needleman-Wunsch

Input: two sequences X and Y
Output: optimal alignment and score α
Initialization:
  Set F (i, 0) := −i · d for all i = 0, 1, 2, . . . , n
  Set F (0, j) := −j · d for all j = 0, 1, 2, . . . , m
  Set T (i, 0) := (i − 1, 0) for all i = 1, 2, . . . , n
  Set T (0, j) := (0, j − 1) for all j = 1, 2, . . . , m
For i = 1, 2, . . . , n do:
  For j = 1, 2, . . . , m do:
    Set F (i, j) := max { F (i − 1, j − 1) + s(xi, yj), F (i − 1, j) − d, F (i, j − 1) − d }
    Set backtrace T (i, j) to the maximizing pair (i′, j′)
The score is α := F (n, m)
Set (i, j) := (n, m)
repeat
  if T (i, j) = (i − 1, j − 1) print the column (xi, yj)
  else if T (i, j) = (i − 1, j) print the column (xi, −)
  else print the column (−, yj)
  Set (i, j) := T (i, j)
until (i, j) = (0, 0).

The columns are printed from the last to the first, so the output has to be reversed to read the alignment from left to right.
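The pseudo code translates fairly directly into Python. The following is a minimal sketch, not a reference implementation: it assumes a simple match/mismatch scoring function with a linear gap penalty (the defaults 1, −1 and d = 2 are illustrative), and instead of storing T it recomputes during traceback which of the three options was maximal. The function name `needleman_wunsch` is hypothetical.

```python
def needleman_wunsch(x: str, y: str, match: int = 1, mismatch: int = -1, d: int = 2):
    """Global alignment with linear gap penalty d.

    Returns (F, score, aligned_x, aligned_y), where F[i][j] is the best
    score for aligning the prefixes x[:i] and y[:j].
    """
    n, m = len(x), len(y)

    def s(a: str, b: str) -> int:
        return match if a == b else mismatch

    # Initialization: aligning a prefix against the empty prefix costs only gaps
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * d
    for j in range(1, m + 1):
        F[0][j] = -j * d

    # Main recursion
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)

    # Traceback from (n, m) to (0, 0); columns are collected in reverse order
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F, F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


F, score, ax, ay = needleman_wunsch("GATTAG", "ATTAC")
print(score)
print(ax)
print(ay)
```

When several options tie, this sketch prefers the diagonal step, so it reports one optimal alignment out of possibly many.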

Complexity of Needleman-Wunsch

We need to store (n + 1) × (m + 1) numbers. Each number takes a constant number of calculations to compute: three sums and a max.

Hence, for filling the matrix, the algorithm requires O(nm) time and memory. Given the filled matrix, the construction of the alignment is done in time O(n + m).

Local alignment: Smith-Waterman algorithm

Global alignment is applicable when we have two similar sequences that we want to align from end to end, e.g. two homologous genes from related species.

Local alignment: Smith-Waterman algorithm

Problem

Global alignment is inapplicable to modular sequences.

Here we would like to find the best match between substrings of two sequences.

Local alignment: Smith-Waterman algorithm

TCCCAGTTATGTCAGGGGACACGAGCATGCAGAGAC

AATTGCCGCCGTCGTTTTCAGCAGTTATGTCAGATC

Here the score of an alignment between two substrings would be larger than the score of an alignment between the full-length strings.

Local alignment: Smith-Waterman algorithm

Definition

Let $X = x_1 \ldots x_n$ and $Y = y_1 \ldots y_m$ be two sequences over an alphabet $\Sigma$. Let $\delta$ be a score function for an alignment. A local alignment of X and Y is a global alignment of substrings $X' = x_{i_1} \ldots x_{i_2}$ and $Y' = y_{j_1} \ldots y_{j_2}$. An alignment $A = (X', Y')$ of substrings $X'$ and $Y'$ is an optimal local alignment of X and Y with respect to $\delta$ if

$$\delta(A) = \max_{A'} \{\, \delta(X', Y') \mid X' \text{ is a substring of } X,\ Y' \text{ is a substring of } Y \,\}$$

Example

Let X = AAAAACTCTCTCT and Y = GCGCGCGCAAAAA. Let s(a, a) = +1, s(a, b) = −1 and s(a, −) = s(−, a) = −2 be a scoring function. Then an optimal local alignment

          AAAAA(CTCTCTCT)
          |||||
(GCGCGCGC)AAAAA

in this case has a score of 5, whereas the optimal global alignment

AAAAACTCTCTCT
     | |
GCGCGCGCAAAAA

has a score of −9 (2 matches and 11 mismatches).

Local alignment: Smith-Waterman algorithm

The Smith-Waterman local alignment algorithm (Smith, T. and Waterman, M. Identification of common molecular subsequences. J. Mol. Biol. 147:195-197, 1981) is a modification of the global alignment algorithm.

Modification in main recursion

In the main recursion, we set the value of $F(i, j)$ to zero if all attainable values at position $(i, j)$ are negative:

$$F(i, j) = \max \begin{cases}
0, \\
F(i - 1, j - 1) + s(x_i, y_j), \\
F(i - 1, j) - d, \\
F(i, j - 1) - d.
\end{cases}$$

The value $F(i, j) = 0$ indicates that we should start a new alignment at $(i, j)$. This is because, if the best alignment up to $(i, j)$ has a negative score, then it is better to start a new one rather than to extend the old one.

Base conditions

For local alignments we need to set F (i, 0) = 0 and F (0, j) = 0 for all i = 0, 1, 2, . . . , n and j = 0, 1, 2, . . . , m.

Modification in traceback

Instead of starting the traceback at (n, m), we start it at the cell with the highest score: argmax F (i, j). The traceback ends upon arrival at a cell with score 0, which corresponds to the start of the alignment.

Traceback via recursion

Input: Similarity matrix M of two strings s = s1 . . . sm and t = t1 . . . tn
Output: Optimal local alignment (s′, t′) of s and t
Procedure Align(i, j):
  if M(i, j) = 0 then
    s′ := ε
    t′ := ε
  else
    if M(i, j) = M(i − 1, j) + g then
      (u, v) := Align(i − 1, j)
      s′ := concat(u, si)
      t′ := concat(v, '−')
    else if M(i, j) = M(i, j − 1) + g then
      (u, v) := Align(i, j − 1)
      s′ := concat(u, '−')
      t′ := concat(v, tj)
    else
      (u, v) := Align(i − 1, j − 1)
      s′ := concat(u, si)
      t′ := concat(v, tj)
  return (s′, t′)

Here g is the score of a single gap position, i.e. g = −d in the notation of the recursion, and Align is first called at the cell with the highest score.

Local alignment: Smith-Waterman algorithm

For this algorithm to work, we require that the expected score for a random match is negative, i.e. that

$$\sum_{a, b \in \Sigma} p_a \cdot p_b \cdot s(a, b) < 0,$$

where $p_a$ and $p_b$ are the probabilities for seeing the symbol a or b, respectively, at any given position. Otherwise, matrix entries will tend to be positive, producing long matches between random sequences.
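The condition is easy to check for a concrete scoring scheme. The sketch below assumes a uniform DNA background ($p_a = 0.25$ for every nucleotide) and compares two illustrative scoring functions; the function name `expected_score` and the second scheme (+5 / −1) are hypothetical.

```python
from itertools import product

def expected_score(background: dict, s) -> float:
    """Expected score of a random aligned pair: sum of p_a * p_b * s(a, b)."""
    return sum(background[a] * background[b] * s(a, b)
               for a, b in product(background, repeat=2))

uniform_dna = {c: 0.25 for c in "ACGT"}  # assumed background frequencies

def simple(a: str, b: str) -> int:
    return 1 if a == b else -1   # match +1, mismatch -1

def generous(a: str, b: str) -> int:
    return 5 if a == b else -1   # hypothetical scheme with a large match reward

print(expected_score(uniform_dna, simple))
print(expected_score(uniform_dna, generous))
```

Under the uniform background, 4 of the 16 ordered pairs are matches, so the +1/−1 scheme has a negative expectation and is suitable for local alignment, whereas the +5/−1 scheme has a positive expectation and would reward aligning random sequences.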

Local vs. Global Alignment

The Global Alignment Problem tries to find the optimal path between vertices (0, 0) and (n, m) in the matrix graph.

The Local Alignment Problem tries to find the optimal path among paths between arbitrary vertices (i, j) and (i′, j′) in the matrix graph such that i < i′ and j < j′.

Example

Smith-Waterman matrix of the sequences GATTAG and ATTAC with s(a, a) = 1, s(a, b) = −1 and s(a, −) = s(−, a) = −2:

F    0    G    A    T    T    A    G
0
A
T
T
A
C

Score:          Alignment:
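The matrix can be filled in by hand or with a short program. The following Python sketch implements the local alignment recursion with the scoring values given above; it is one possible implementation under assumptions, using an iterative traceback rather than the recursive procedure, and the function name `smith_waterman` is hypothetical.

```python
def smith_waterman(x: str, y: str, match: int = 1, mismatch: int = -1, d: int = 2):
    """Local alignment with linear gap penalty d.

    Returns (F, best_score, aligned_x, aligned_y), where F[i][j] is the best
    score of a local alignment ending at x[i-1] and y[j-1], or 0.
    """
    n, m = len(x), len(y)

    def s(a: str, b: str) -> int:
        return match if a == b else mismatch

    # Base conditions: first row and first column stay 0
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(0,
                          F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Traceback starts at the highest-scoring cell and stops at a cell with score 0
    ax, ay = [], []
    i, j = best_pos
    while F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F, best, "".join(reversed(ax)), "".join(reversed(ay))


x, y = "GATTAG", "ATTAC"
F, score, ax, ay = smith_waterman(x, y)

# Print the matrix in the layout used above: columns for x, rows for y
print("F  0 " + " ".join(f"{c:>2}" for c in x))
for j in range(len(y) + 1):
    label = "0" if j == 0 else y[j - 1]
    print(label + " " + " ".join(f"{F[i][j]:>2}" for i in range(len(x) + 1)))
print("Score:", score)
print("Alignment:", ax, ay)
```

If several cells share the highest score, this sketch keeps the first one found, so it reports a single optimal local alignment.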
