pairwise sequence alignment h.c.huang @ 2005/9/22 2005 autumn / ym / bioinformatics

Pairwise Sequence Alignment

H.C.Huang @ 2005/9/22

2005 Autumn / YM / Bioinformatics

Outline

MotivationA simple alignment problemScoring alignments

Substition matrices Gap penalties

Dynamic programmingGlobal vs. local alignment

Reference

D.W. Mount / Bioinformatics / Ch.3

Slides fromProf. C.H.ChangUW / Genomic Informatics / W.S. Noble

http://www.bioinformaticsonline.org/graphics/cover.html

Example

Brief History Robinson (1938) described the 1st algorithm

for Longest Common Subsequence problem Levenshtein (1966) introduced “edit distance”

between strings Min. # of edit operations needed to transform from

one string to another No algorithm described

Gibbs & McIntyre (1970): dot matrix analysis Needleman & Wunsch (1970) “re-discovered”

the algorithm for sequence alignment using DP Smith & Waterman (1981) proposed a clever

modification of DP for local alignment

Brief History

Knuth (1973) suggested a filtering idea for pattern matching高德纳 (Donald Knuth)

Dumas & Ninio (1982) first applied the filtering idea for fast sequence comparison

Lipman & Pearson (1985): FASTAAltschul et al. (1990): BLAST

Pairwise Alignment Methods

Dotplot analysis methodDynamic programming (DP)

algorithmWord / k-tuple methods

How dotplot analysis works

How dotplot analysis works

Direct Repeat

Reversed Repeat

Aligned

sequence

To reduce random noise in dotplots

Specify a window size, wExamine w residues at a timeCount how many matches are

among the w pairs of residues

Dotlet

http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html

A simple alignment problem.

Problem: find the best pairwise alignment of GAATC and CATAC.

Scoring alignments

• We need a way to measure the quality of a candidate alignment.

• Alignment scores consist of two parts: a substition matrix, and a gap penalty.

GAATCCATAC

GAATC-CA-TAC

GAAT-CC-ATAC

GAAT-CCA-TAC

-GAAT-CC-A-TAC

GA-ATCCATA-C

Scoring aligned bases

Purine A G

Pyrimidine C T

Transition (high score)

Transversion(low score)


Purine A G

Pyrimidine C T

Transition

Transversion

A C G T

A 10 -5 0 -5

C -5 10 -5 0

G 0 -5 10 -5

T -5 0 -5 10

A hypothetical substitution matrix:

GAATCCATAC

-5 + 10 + -5 + -5 + 10 = 5


Purine A G

Pyrimidine C T

Transition (cheap)

Transversion(expensive)

A C G T

A 10 -5 0 -5

C -5 10 -5 0

G 0 -5 10 -5

T -5 0 -5 10

A hypothetical substitution matrix:

GAAT-CCA-TAC

-5 + 10 + ? + 10 + ? + 10 = ?

• Linear gap penalty: every gap receives a score of d.

• Affine gap penalty: opening a gap receives a score of d; extending a gap receives a score of e.

Scoring gaps

GAAT-C d=-4CA-TAC

-5 + 10 + -4 + 10 + -4 + 10 = 17

G--AATC d=-4CATA--C e=-1

-5 + -4 + -1 + 10 + -4 + -1 + 10 = 5

A simple alignment problem.

• Problem: find the best pairwise alignment of GAATC and CATAC.

• Use a linear gap penalty of -4.

• Use the following substitution matrix:

A C G T

A 10 -5 0 -5

C -5 10 -5 0

G 0 -5 10 -5

T -5 0 -5 10

How many possibilities?

• How many different alignments of two sequences of length N exist?

GAATCCATAC

GAATC-CA-TAC

GAAT-CC-ATAC

GAAT-CCA-TAC

-GAAT-CC-A-TAC

GA-ATCCATA-C

DP matrix

G A A T C

C

A

T

A

C

-8

The value in position (i,j) is the score of the best alignment of the first i

positions of the first sequence versus the first j positions of the

second sequence.

-G-CAT

DP matrix

G A A T C

C

A

T -8 -12A

C

Moving horizontally in the matrix introduces a

gap in the sequence along the left edge.

-G-ACAT-

DP matrix

G A A T C

C

A

T -8

A -12C

Moving vertically in the matrix introduces a gap in the sequence along

the top edge.

-G--CATA

Initialization

G A A T C

0

C

A

T

A

C

Introducing a gap

G A A T C

0 -4

C

A

T

A

C

G-

DP matrix

G A A T C

0 -4

C -4

A

T

A

C

-C

DP matrix

G A A T C

0 -4

C -4 -8

A

T

A

C

DP matrix

G A A T C

0 -4

C -4 -5

A

T

A

C

GC

DP matrix

G A A T C

0 -4 -8 -12 -16 -20

C -4 -5

A -8

T -12

A -16

C -20

-----CATAC

DP matrix

G A A T C

0 -4 -8 -12 -16 -20

C -4 -5

A -8 ?

T -12

A -16

C -20

DP matrix

G A A T C

0 -4 -8 -12 -16 -20

C -4 -5

A -8 -4

T -12

A -16

C -20

-4

0 -4

-GCA

G-CA

--GCA-

-4 -9 -12

DP matrix

G A A T C

0 -4 -8 -12 -16 -20

C -4 -5

A -8 -4

T -12 ?

A -16 ?

C -20 ?

DP matrix

G A A T C

0 -4 -8 -12 -16 -20

C -4 -5

A -8 -4

T -12 -8

A -16 -12

C -20 -16

DP matrix

G A A T C

0 -4 -8 -12 -16 -20

C -4 -5 ?

A -8 -4 ?

T -12 -8 ?

A -16 -12 ?

C -20 -16 ?

DP matrix

G A A T C

0 -4 -8 -12 -16 -20

C -4 -5 -9

A -8 -4 5

T -12 -8 1

A -16 -12 2

C -20 -16 -2

What is the alignment associated with this

entry?

DP matrix

G A A T C

0 -4 -8 -12 -16 -20

C -4 -5 -9

A -8 -4 5

T -12 -8 1

A -16 -12 2

C -20 -16 -2

-G-ACATA

DP matrix

G A A T C

0 -4 -8 -12 -16 -20

C -4 -5 -9

A -8 -4 5

T -12 -8 1

A -16 -12 2

C -20 -16 -2 ?

Find the optimal alignment, and

its score.

DP matrix

G A A T C

0 -4 -8 -12 -16 -20

C -4 -5 -9 -13 -12 -6

A -8 -4 5 1 -3 -7

T -12 -8 1 0 11 7

A -16 -12 2 11 7 6

C -20 -16 -2 7 11 17

DP matrix

G A A T C

0 -4 -8 -12 -16 -20

C -4 -5 -9 -13 -12 -6

A -8 -4 5 1 -3 -7

T -12 -8 1 0 11 7

A -16 -12 2 11 7 6

C -20 -16 -2 7 11 17

GA-ATCCATA-C

DP matrix

G A A T C

0 -4 -8 -12 -16 -20

C -4 -5 -9 -13 -12 -6

A -8 -4 5 1 -3 -7

T -12 -8 1 0 11 7

A -16 -12 2 11 7 6

C -20 -16 -2 7 11 17

GAAT-CCA-TAC

DP matrix

G A A T C

0 -4 -8 -12 -16 -20

C -4 -5 -9 -13 -12 -6

A -8 -4 5 1 -3 -7

T -12 -8 1 0 11 7

A -16 -12 2 11 7 6

C -20 -16 -2 7 11 17

GAAT-CC-ATAC

DP matrix

G A A T C

0 -4 -8 -12 -16 -20

C -4 -5 -9 -13 -12 -6

A -8 -4 5 1 -3 -7

T -12 -8 1 0 11 7

A -16 -12 2 11 7 6

C -20 -16 -2 7 11 17

GAAT-C-CATAC

Multiple solutions

When a program returns a sequence alignment, it may not be the only best alignment.

GA-ATCCATA-C

GAAT-CCA-TAC

GAAT-CC-ATAC

GAAT-C-CATAC

DP in equation form

• Align sequence x and y.

• F is the DP matrix; s is the substitution matrix; d is the linear gap penalty.

djiF

djiF

yxsjiF

jiF

F

ji

1,

,1

,1,1

max,

00,0

DP in equation form

1,1 jiF

jiF , jiF ,1

1, jiF

d

d ji yxs ,

Dynamic programming

Yes, it’s a weird name.DP is closely related to recursion

and to mathematical induction.We can prove that the resulting

score is optimal.

Dynamic Programming

A method for solving recursive problemBreak a problem into smaller

subproblemsSolve subproblems optimally,

recursivelyUse these optimal solutions to

construct an optimal solution for the original problem

Summary

Scoring a pairwise alignment requires a substition matrix and gap penalties.

Dynamic programming is an efficient algorithm for finding the optimal alignment.

Entry (i,j) in the DP matrix stores the score of the best-scoring alignment up to those positions.

DP iteratively fills in the matrix using a simple mathematical rule.

A simple exercise

A C G T

A 2 -7 -5 -7

C -7 2 -7 -5

G -5 -7 2 -7

T -7 -5 -7 2

A A G

A

G

C 1,1 jiF

jiF , jiF ,1

1, jiF

d

d ji yxs ,

Find the optimal alignment of AAG and AGC.Use a gap penalty of d=-5.

A simple exercise

A C G T

A 2 -7 -5 -7

C -7 2 -7 -5

G -5 -7 2 -7

T -7 -5 -7 2

A A G

0

A

G

C 1,1 jiF

jiF , jiF ,1

1, jiF

d

d ji yxs ,


A simple exercise

A C G T

A 2 -7 -5 -7

C -7 2 -7 -5

G -5 -7 2 -7

T -7 -5 -7 2

A A G

0 -5 -10 -15

A -5

G -10

C -15 1,1 jiF

jiF , jiF ,1

1, jiF

d

d ji yxs ,


A simple exercise

A C G T

A 2 -7 -5 -7

C -7 2 -7 -5

G -5 -7 2 -7

T -7 -5 -7 2

A A G

0 -5 -10 -15

A -5 2 -3 -8

G -10 -3 -3 -1

C -15 -8 -8 -6 1,1 jiF

jiF , jiF ,1

1, jiF

d

d ji yxs ,


Traceback

Start from the lower right corner and trace back to the upper left.

Each arrow introduces one character at the end of each aligned sequence.

A horizontal move puts a gap in the left sequence.

A vertical move puts a gap in the top sequence.

A diagonal move uses one character from each sequence.






A simple example

A A G

0 -5

A 2 -3

G -1

C -6







A simple example

A A G

0 -5

A 2 -3

G -1

C -6


AAG- AAG--AGC A-GC

Local alignment

A single-domain protein may be homologous to a region within a multi-domain protein.

Usually, an alignment that spans the complete length of both sequences is not required.

BLAST allows local alignments

Global alignment

Local alignment

Global alignment DP



djiF

djiF

yxsjiF

jiF

F

ji

1,

,1

,1,1

max,

00,0

Local alignment DP



0

1,,1

,1,1

max,

00,0

djiFdjiF

yxsjiF

jiF

F

ji

Local DP in equation form

1,1 jiF

jiF , jiF ,1

1, jiF

d

d ji yxs ,

0

Local alignment

Two differences with respect to global alignment:No score is negative.Traceback begins at the highest score in the

matrix and continues until you reach 0.Global alignment algorithm:

Needleman-Wunsch.Local alignment algorithm:

Smith-Waterman.

A simple example

A C G T

A 2 -7 -5 -7

C -7 2 -7 -5

G -5 -7 2 -7

T -7 -5 -7 2

A A G

A

G

C 1,1 jiF

jiF , jiF ,1

1, jiF

d

d ji yxs ,

Find the optimal local alignment of AAG and AGC.Use a gap penalty of d=-5.

0

A simple example

A C G T

A 2 -7 -5 -7

C -7 2 -7 -5

G -5 -7 2 -7

T -7 -5 -7 2

A A G

0 0 0 0

A 0

G 0

C 0 1,1 jiF

jiF , jiF ,1

1, jiF

d

d ji yxs ,


0

A simple example

A C G T

A 2 -7 -5 -7

C -7 2 -7 -5

G -5 -7 2 -7

T -7 -5 -7 2

A A G

0 0 0 0

A 0 2 2 0

G 0 0 0 4

C 0 0 0 0 1,1 jiF

jiF , jiF ,1

1, jiF

d

d ji yxs ,


0

A simple example

A C G T

A 2 -7 -5 -7

C -7 2 -7 -5

G -5 -7 2 -7

T -7 -5 -7 2

A A G

0 0 0 0

A 0 2 2 0

G 0 0 0 4

C 0 0 0 0 1,1 jiF

jiF , jiF ,1

1, jiF

d

d ji yxs ,


0

AGAG

Local alignment

A C G T

A 2 -7 -5 -7

C -7 2 -7 -5

G -5 -7 2 -7

T -7 -5 -7 2

A A G

0 0 0 0

G 0

A 0

A 0

G 0

G 0

C 0

1,1 jiF

jiF , jiF ,1

1, jiF

d

d ji yxs ,

Find the optimal local alignment of AAG and GAAGGC.Use a gap penalty of d=-5.

0

Local alignment

A C G T

A 2 -7 -5 -7

C -7 2 -7 -5

G -5 -7 2 -7

T -7 -5 -7 2

A A G

0 0 0 0

G 0 0 0 2

A 0 2 2 0

A 0 2 4 0

G 0 0 0 6

G 0 0 0 1

C 0 0 0 0

1,1 jiF

jiF , jiF ,1

1, jiF

d

d ji yxs ,


0

Substitution matrices

A C G T

A 2 -7 -5 -7

C -7 2 -7 -5

G -5 -7 2 -7

T -7 -5 -7 2

A A G

0 0 0 0

G 0 0 0 2

A 0 2 2 0

A 0 2 4 0

G 0 0 0 6

G 0 0 0 1

C 0 0 0 0

1,1 jiF

jiF , jiF ,1

1, jiF

d

d ji yxs ,


0

Where did this substitution matrix

come from?

Nucleic acid PAM matrices

• PAM = point accepted mutation• 1 PAM = 1% probability of mutation at each

sequence position.• A uniform PAM1 matrix:

A G T C

A 0.99 0.00333 0.00333 0.00333

G 0.00333 0.99 0.00333 0.00333

T 0.00333 0.00333 0.99 0.00333

C 0.00333 0.00333 0.00333 0.99

Transitions and transversions

• Transitions (A G or C T) are more likely than transversions (A T or G C)

• Assume that transitions are three times as likely:

A G T C

A 0.99 0.006 0.002 0.002

G 0.006 0.99 0.002 0.002

T 0.002 0.002 0.99 0.006

C 0.002 0.002 0.006 0.99

Distant relatives

• If the probability of a substitution is 2%, simply multiply the probabilities from 1% by themselves.

• A PAM N matrix is computed by raising PAM 1 to the Nth power.

A G T C

A 0.98014 0.011888 0.003984 0.003984

G 0.011888 0.98014 0.003984 0.003984

T 0.003984 0.003984 0.98014 0.011888

C 0.003984 0.003984 0.011888 0.98014

Better gap scoring>gi|729942|sp|P40601|LIP1_PHOLU Lipase 1 precursor (Triacylglycerol lipase) Length = 645

Score = 33.5 bits (75), Expect = 5.9 Identities = 32/180 (17%), Positives = 70/180 (38%), Gaps = 9/180 (5%)

Query: 2038 IYSLYGLYNVPYENLFVEAIASYSDNKIRSKSRRVIATTLETVGYQTANGKYKSESYTGQ 2097 +++ YGL+ Y+ ++ Y D K +R ++ + N + G+Sbjct: 441 VFTAYGLWRY-YDKGWISGDLHYLDMKYEDITRGIVLNDW----LRKENASTSGHQWGGR 495

Query: 2098 LMAGYTYMMPENINLTPLAGLRYSTIKDKGYKETGTTYQNLTVKGKNYNTFDGLLGAKVS 2157 + AG+ + + +P+ + KGY+E+G + + Y++ G LG ++Sbjct: 496 ITAGWDIPLTSAVTTSPIIQYAWDKSYVKGYRESGNNSTAMHFGEQRYDSQVGTLGWRLD 555

Query: 2158 SNINVNEIVLTPELYAMVDYAFKNKVSAIDARLQGMTAPLPTNSFKQSKTSFDVGVGVTA 2217 +N P ++ F +K I + + + S KQ + +G+ ASbjct: 556 TNFG----YFNPYAEVRFNHQFGDKRYQIRSAINSTQTSFVSESQKQDTHWREYTIGMNA 611

Real gaps are often more than one letter long.

Affine gap penaltyLETVGYW----L

-5 -1 -1 -1

Separate penalties for gap opening and gap extension.

This requires modifying the DP algorithm to store three values in each box.

Summary

Local alignment finds the best match between subsequences.

Smith-Waterman local alignment algorithm:No score is negative.Trace back from the largest score in the

matrix.Substitution matrices represent the

probability of mutations.Affine gap penalties include a large gap

opening penalty and small gap extension penalty.

pairwise sequence alignment h.c.huang @ 2005/9/22 2005 autumn / ym / bioinformatics

Documents