pairwise sequence alignment


Upload: terri

Post on 18-Jan-2016



TRANSCRIPT

Page 1: Pairwise sequence  Alignment

Pairwise sequence Alignment

Page 2: Pairwise sequence  Alignment

Types of Alignment
• Global alignment: aligning the whole sequences
  • Appropriate when aligning two very closely related sequences
• Local alignment: aligning certain regions in the sequences
  • Appropriate for aligning multi-domain protein sequences
• It is important to use the “appropriate” type

Distinction between global and local alignments of two sequences.

Page 3: Pairwise sequence  Alignment
Page 4: Pairwise sequence  Alignment

How do we compute the best alignment?

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

Too many possible alignments: >> 2^N (exercise)
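To get a feel for how quickly the number of alignments grows, here is a minimal sketch that counts them under one common convention: every column is either a letter pair, a letter over a gap, or a gap over a letter, and columns of two gaps are not allowed. The function name `count_alignments` is illustrative, and other counting conventions give different numbers.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(m: int, n: int) -> int:
    """Count gapped alignments of a length-m string with a length-n string."""
    if m == 0 or n == 0:
        # Only one way left: every remaining letter faces a gap.
        return 1
    return (
        count_alignments(m - 1, n - 1)  # last column pairs two letters
        + count_alignments(m - 1, n)    # last column: letter of x over a gap
        + count_alignments(m, n - 1)    # last column: gap over a letter of y
    )


# Compare the count for two length-N sequences with 2^N.
for length in (1, 2, 3, 10, 20):
    print(length, count_alignments(length, length), 2 ** length)
```

Even for N = 10 the count is already far larger than 2^10, which is why scoring every alignment is not an option.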

Page 5: Pairwise sequence  Alignment

Sequence Alignment

Unaligned:

AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

Aligned:

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Definition: Given two strings x = x1x2...xM, y = y1y2…yN, an alignment is an assignment of gaps to positions 0,…, M in x, and 0,…, N in y, so as to line up each letter in one sequence with either a letter, or a gap in the other sequence.

Page 6: Pairwise sequence  Alignment

Alignment is additive

Observation: the score of aligning x1……xM with y1……yN is additive.

Say that x1…xi xi+1…xM aligns to y1…yj yj+1…yN.

The two scores add up:

F(x[1:M], y[1:N]) = F(x[1:i], y[1:j]) + F(x[i+1:M], y[j+1:N])
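The additivity above can be checked numerically. The following is a small sketch that scores an alignment column by column and compares the whole score with the sum of two parts. The scoring values (match = 1, mismatch = -1, gap = -2) and the name `score_alignment` are assumptions chosen for illustration, not values from the slides. The property relies on each column being scored independently, as with a linear gap penalty.

```python
def score_alignment(row_x: str, row_y: str, match: int = 1,
                    mismatch: int = -1, gap: int = -2) -> int:
    """Sum the column scores of an alignment given as two equal-length rows."""
    if len(row_x) != len(row_y):
        raise ValueError("alignment rows must have the same length")
    total = 0
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            total += gap          # letter facing a gap
        elif a == b:
            total += match        # identical letters
        else:
            total += mismatch     # substitution
    return total


# The aligned pair from the Sequence Alignment slide.
x_row = "-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---"
y_row = "TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC"

k = 12  # any split point between two columns works
whole = score_alignment(x_row, y_row)
parts = score_alignment(x_row[:k], y_row[:k]) + score_alignment(x_row[k:], y_row[k:])
print(whole, parts)  # the two numbers are equal
```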

Page 7: Pairwise sequence  Alignment

Calculation of an alignment score

Page 8: Pairwise sequence  Alignment

DP Algorithms for Pairwise Alignment

• The number of all possible pairwise alignments (if gaps are allowed) is exponential in the length of the sequences
• Therefore, the approach of “score every possible alignment and choose the best” is infeasible in practice
• Efficient algorithms for pairwise alignment have been devised using dynamic programming (DP)

Page 9: Pairwise sequence  Alignment

We will first consider the global alignment algorithm of Needleman and Wunsch (1970).

We will then explore the local alignment algorithm of Smith and Waterman (1981).

Finally, we will consider BLAST, a heuristic version of Smith-Waterman. We will cover BLAST in detail on Monday.

Two kinds of sequence alignment: global and local

Page 63

Page 10: Pairwise sequence  Alignment

• Two sequences can be compared in a matrix along x- and y-axes.

• If they are identical, a path along a diagonal can be drawn

• Find the optimal subpaths, and add them up to achieve the best score. This involves
  -- adding gaps when needed
  -- allowing for conservative substitutions
  -- choosing a scoring system (simple or complicated)

• N-W is guaranteed to find optimal alignment(s)

Global alignment with the algorithm of Needleman and Wunsch (1970)

Page 63

Page 11: Pairwise sequence  Alignment
Page 12: Pairwise sequence  Alignment

Initial stage of filling in the DP matrix

The two sequences, Sm and Sn, are written across the top and down the left side of a matrix, respectively.

An extra row and column labeled “gap” are added to allow the alignment to begin with a gap of any length in either sequence. The gap row and column are filled with penalty scores for gaps of increasing lengths, as indicated. A zero is placed in the upper left box, corresponding to no gaps in either sequence.

For sequences of lengths m and n, the matrix is (m+1) x (n+1); in this example it is 9 x 10.
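The initial stage can be sketched as follows for sequences of lengths 8 and 9, which gives the 9 x 10 matrix mentioned above. This sketch assumes a linear gap penalty of -8 per gap position and places the first sequence on the rows; the function name `init_matrix` is illustrative.

```python
def init_matrix(m: int, n: int, gap: int = -8) -> list[list[int]]:
    """Build an (m+1) x (n+1) matrix with the gap row and gap column filled in."""
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * gap   # leading gap of length i in the column sequence
    for j in range(1, n + 1):
        F[0][j] = j * gap   # leading gap of length j in the row sequence
    return F


F = init_matrix(8, 9)
print(len(F), "x", len(F[0]))  # 9 x 10
print(F[0])                    # gap row: 0, -8, -16, ...
print([row[0] for row in F])   # gap column: 0, -8, -16, ...
```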

Page 13: Pairwise sequence  Alignment

Gap=-8

Gap=-4

Page 14: Pairwise sequence  Alignment

[1] set up a matrix

[2] score the matrix

[3] identify the optimal alignment(s)

Three steps to global alignment with the Needleman-Wunsch algorithm
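The three steps can be sketched in a few lines. This is a minimal illustration, not the implementation from the slides: it assumes a simple match/mismatch score (1 / -1) and a linear gap penalty (-2) instead of a substitution matrix, and it returns only one optimal alignment even when several exist.

```python
def needleman_wunsch(x: str, y: str, match: int = 1,
                     mismatch: int = -1, gap: int = -2):
    """Return (score, aligned_x, aligned_y) for one optimal global alignment."""
    m, n = len(x), len(y)

    # [1] Set up the matrix, including the gap row and gap column.
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * gap
    for j in range(1, n + 1):
        F[0][j] = j * gap

    # [2] Score the matrix: each cell is the best of diagonal, up, and left.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)

    # [3] Identify an optimal alignment by tracing back from the last cell.
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and x[i - 1] == y[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[m][n], "".join(reversed(ax)), "".join(reversed(ay))


print(needleman_wunsch("ACGT", "AGT"))
```

The traceback prefers the diagonal move when several moves are tied, so other equally good alignments may be missed.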

Page 63

Page 15: Pairwise sequence  Alignment

Four possible outcomes in aligning two sequences

[1] identity (stay along a diagonal)
[2] mismatch (stay along a diagonal)
[3] gap in one sequence (move vertically!)
[4] gap in the other sequence (move horizontally!)

Page 64

Page 16: Pairwise sequence  Alignment

Page 64

Page 17: Pairwise sequence  Alignment
Page 18: Pairwise sequence  Alignment
Page 19: Pairwise sequence  Alignment

Necessary values in adjacent cells

Page 20: Pairwise sequence  Alignment

x (x1x2...xm) and y (y1y2...yn)

The matrix has (m+1) rows labeled 0➝m and (n+1) columns labeled 0➝n

The rows correspond to the residues of sequence x, and the columns correspond to the residues of sequence y

S0,0 + s(x1,y1) = 0 + s(I,T) = 0 - 1 = -1
S1,0 + g = -8 - 8 = -16
S0,1 + g = -8 - 8 = -16

S1,1 is the maximum of these three values: -1.
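The calculation for a single cell can be written as a tiny helper. This is a sketch with illustrative names (`cell_score`, `diag`, `up`, `left`); the numbers in the usage lines are the ones from this slide, with s(I,T) = -1 and g = -8.

```python
def cell_score(diag: int, up: int, left: int, pair_score: int, gap: int):
    """Return (score, move) for one DP cell from its three neighbours."""
    candidates = {
        "diagonal": diag + pair_score,  # align the two residues
        "up": up + gap,                 # gap in one sequence
        "left": left + gap,             # gap in the other sequence
    }
    move = max(candidates, key=candidates.get)
    return candidates[move], move


# S0,0 = 0, S1,0 = S0,1 = -8, s(I,T) = -1, g = -8
print(cell_score(0, -8, -8, pair_score=-1, gap=-8))
```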

Page 21: Pairwise sequence  Alignment

s11 is the score for an a1-b1 match added to 0 in the upper left position.

Trial values for s12 are calculated and the maximum score is chosen. Trial 1 is to add the score for the a1-b2 match to s11 and subtract a penalty for a gap of size 1. The other three trials shown by arrows include gap penalties and so likely cannot yield a higher score than trial 1.

Page 22: Pairwise sequence  Alignment

Global alignment of two protein sequences by the Needleman-Wunsch algorithm with enhancements by Smith and Waterman.

• sequence 1 = MNALSDRT and

• sequence 2 = MGSDRTTET.

• Notice the subsequence SDRT in the two sequences which one might expect to be aligned if the sequences are aligned properly.

JMB, 1970

Page 23: Pairwise sequence  Alignment

-12 is the penalty for opening the gap in the alignment, and -4 is the penalty for each additional sequence character in the gap. Use PAM250

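S0,0 + s(x1,y1) = 0 + s(M,M) = 0 + 6 = 6
S1,0 + g = -12 - 12 = -24
S0,1 + g = -12 - 12 = -24

S1,1 = 6, the maximum of the three candidates, which correspond to these alignments of the first residues:

M      M -    - M
M      - M    M -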

Page 24: Pairwise sequence  Alignment

sequence 1   M   -   N   A   L   S   D   R   T
sequence 2   M   G   S   D   R   T   T   E   T
score        6 -12   1   0  -3   1   0  -1   3   = -5

sequence 1   M   N   A   -   L   S   D   R   T
sequence 2   M   G   S   D   R   T   T   E   T
score        6 -12   1   0  -3   1   0  -1   3   = -5
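The total of -5 can be reproduced by summing the column scores. In this sketch the per-column substitution scores are copied from the first alignment above, and the gap column uses the stated penalties (-12 to open a gap, -4 for each additional character). The helper name `affine_gap_penalty` is illustrative.

```python
def affine_gap_penalty(length: int, gap_open: int = -12, gap_extend: int = -4) -> int:
    """Penalty for a gap of the given length: opening cost plus extensions."""
    if length == 0:
        return 0
    return gap_open + gap_extend * (length - 1)


# Column scores of the first alignment: M/M, -/G, N/S, A/D, L/R, S/T, D/T, R/E, T/T
column_scores = [6, affine_gap_penalty(1), 1, 0, -3, 1, 0, -1, 3]
print(sum(column_scores))     # total alignment score
print(affine_gap_penalty(3))  # a gap of three characters: -12 - 4 - 4
```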

Page 25: Pairwise sequence  Alignment

Example 2

score(H,P) = -2, gap penalty = -8 (linear)

         -    H    E    A    G    A    W    G    H    E    E
    -    0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
    P   -8   -2
    A  -16
    W  -24
    H  -32
    E  -40
    A  -48
    E  -56

Page 26: Pairwise sequence  Alignment

Example contd.

score(E,P) = 0, score(E,A) = -1, score(H,A) = -2

         -    H    E    A    G    A    W    G    H    E    E
    -    0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
    P   -8   -2   -8
    A  -16  -10   -3
    W  -24
    H  -32
    E  -40
    A  -48
    E  -56

Page 27: Pairwise sequence  Alignment

         -    H    E    A    G    A    W    G    H    E    E
    -    0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
    P   -8   -2   -8  -16  -24  -33  -42  -49  -57  -65  -73
    A  -16  -10   -3   -4  -12  -19  -28  -36  -44  -52  -60
    W  -24  -18  -11   -6   -7  -15   -4  -12  -21  -29  -37
    H  -32  -14  -18  -13   -8   -9  -12   -6   -2  -11  -19
    E  -40  -22   -8  -16  -16   -9  -12  -14   -6    4   -5
    A  -48  -30  -16   -3  -11  -11  -12  -12  -14   -4    2
    E  -56  -38  -24  -11   -6  -12  -14  -15  -12   -8    2

The value in the final cell is the best score for the alignment

Optimal alignment:

H E A G A W G H E - E
- P - - A W - H E A E

Page 28: Pairwise sequence  Alignment

Alignments and Paths through Example 3

- t a c g c a a
- a c g t g a a t t

Page 29: Pairwise sequence  Alignment

- t a c g c a a
- a c g t g a a t t

t a c g - c a a - -
- a c g t g a a t t

Page 30: Pairwise sequence  Alignment

- t a c g c a a
- a c g t g a a t t

t - - a c g c a - - a
a c g t g - - a a t t

Page 31: Pairwise sequence  Alignment

Example 4

Alignment:
- T G C A T - A -
A T - C - T G A T

Page 32: Pairwise sequence  Alignment

Tracing back a solution (I)

Page 33: Pairwise sequence  Alignment

Tracing back a solution (II)

The algorithm is called with PRINT-LCS(b,V,n,m)
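The traceback idea can be sketched for the longest common subsequence (LCS) problem that PRINT-LCS belongs to. This is an illustrative version, not the slide's pseudocode: `b` stores a backtracking pointer for each cell, and the traceback is written as a loop rather than a recursive procedure. The two strings are the ones from Example 4 with the gaps removed.

```python
def lcs(v: str, w: str) -> str:
    """Return one longest common subsequence of v and w."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]   # LCS lengths of prefixes
    b = [[""] * (m + 1) for _ in range(n + 1)]  # backtracking pointers
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if v[i - 1] == w[j - 1]:
                s[i][j], b[i][j] = s[i - 1][j - 1] + 1, "diag"
            elif s[i - 1][j] >= s[i][j - 1]:
                s[i][j], b[i][j] = s[i - 1][j], "up"
            else:
                s[i][j], b[i][j] = s[i][j - 1], "left"

    # Follow the pointers back from the last cell; diagonal moves emit a letter.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if b[i][j] == "diag":
            out.append(v[i - 1]); i -= 1; j -= 1
        elif b[i][j] == "up":
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


common = lcs("TGCATA", "ATCTGAT")
print(common, len(common))
```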

Page 34: Pairwise sequence  Alignment

Computing Distance

$$
d_{i,j} = \min
\begin{cases}
d_{i-1,j} + 1 \\
d_{i,j-1} + 1 \\
d_{i-1,j-1}, & \text{if } v_i = w_j
\end{cases}
$$

Only deletions/insertions are allowed
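The recurrence translates directly into code. This sketch assumes the usual boundary conditions, d(i,0) = i and d(0,j) = j, which the slide does not state explicitly; the function name `indel_distance` is illustrative.

```python
def indel_distance(v: str, w: str) -> int:
    """Distance between v and w when only insertions and deletions are allowed."""
    n, m = len(v), len(w)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # delete the first i letters of v
    for j in range(m + 1):
        d[0][j] = j  # insert the first j letters of w
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = min(d[i - 1][j] + 1, d[i][j - 1] + 1)
            if v[i - 1] == w[j - 1]:
                best = min(best, d[i - 1][j - 1])  # free diagonal step on a match
            d[i][j] = best
    return d[n][m]


print(indel_distance("TGCATA", "ATCTGAT"))
```

For the Example 4 sequences the result equals the number of gap columns in the alignment shown there.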

Page 35: Pairwise sequence  Alignment

N-W is guaranteed to find optimal alignments, although the algorithm does not search all possible alignments.

It is an example of a dynamic programming algorithm: an optimal path (alignment) is identified by incrementally extending optimal subpaths. Thus, a series of decisions is made at each step of the alignment to find the pair of residues with the best score.

Needleman-Wunsch: dynamic programming

Page 67

Page 36: Pairwise sequence  Alignment

Local sequence alignment

• Suppose we have a long DNA sequence (e.g., 4000 bp) and we want to compare it with the complete yeast genome (12.5M bp).

• What if only a portion of our query, say 200 bp in length, has strong similarity to a gene in yeast?
  – Can we find this 200 bp portion using (semi-)global alignment?
    Probably not. Because we are trying to align the complete 4000 bp sequence, a random alignment may get a better score than the one that aligns the 200 bp portion to the similar gene in yeast.

Page 37: Pairwise sequence  Alignment

Global alignment (Needleman-Wunsch) extends from one end of each sequence to the other

Local alignment finds optimally matching regions within two sequences (“subsequences”)

Local alignment is almost always used for database searches such as BLAST. It is useful to find domains (or limited regions of homology) within sequences

Smith and Waterman (1981) solved the problem of performing optimal local sequence alignment. Other methods (BLAST, FASTA) are faster but less thorough.

Global alignment versus local alignment

Page 69

Page 38: Pairwise sequence  Alignment

How the Smith-Waterman algorithm works

Set up a matrix between two proteins (size m+1, n+1)

No values in the scoring matrix can be negative! S ≥ 0

The score in each cell is the maximum of four values:
[1] s(i-1, j-1) + the new score at [i,j] (a match or mismatch)
[2] s(i,j-1) – gap penalty
[3] s(i-1,j) – gap penalty
[4] zero
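The four-way maximum can be sketched as a matrix fill. The scoring values here (match = 2, mismatch = -1, gap = -2) and the function name `smith_waterman_matrix` are assumptions for illustration; gap penalties are passed as negative numbers and added. The function also records the highest-scoring cell, which is where a local alignment ends.

```python
def smith_waterman_matrix(p: str, q: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2):
    """Fill the local-alignment matrix; return (H, best_score, best_cell)."""
    H = [[0] * (len(q) + 1) for _ in range(len(p) + 1)]  # first row/column stay 0
    best, best_cell = 0, (0, 0)
    for i in range(1, len(p) + 1):
        for j in range(1, len(q) + 1):
            s = match if p[i - 1] == q[j - 1] else mismatch
            H[i][j] = max(
                H[i - 1][j - 1] + s,  # [1] match or mismatch
                H[i][j - 1] + gap,    # [2] gap
                H[i - 1][j] + gap,    # [3] gap
                0,                    # [4] start a fresh alignment here
            )
            if H[i][j] > best:
                best, best_cell = H[i][j], (i, j)
    return H, best, best_cell


H, best, best_cell = smith_waterman_matrix("TTACGTT", "GGACGGG")
print(best, best_cell)  # the shared ACG scores three matches
```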

Page 69

Page 39: Pairwise sequence  Alignment

Local alignment

The major difference between this scoring matrix and the Needleman-Wunsch matrix is that there are no negative scores in the Smith-Waterman scoring matrix. The effect of this change is that an alignment can begin anywhere without receiving a negative penalty from a previous low-scoring alignment.

sequence 1   S   D   R   T
sequence 2   S   D   R   T
score        2   4   6   3   = 15

Page 40: Pairwise sequence  Alignment

Example

Q: E Q L L K A L E F K L
P: K V L E F G Y

Linear gap model
Gap = -1
Match = 4
Mismatch = -2

Matrix layout: Q runs across the top (columns - E Q L L K A L E F K L) and P runs down the side (rows - K V L E F G Y).

Page 41: Pairwise sequence  Alignment

Example

Q: E Q L L K A L E F K L
P: K V L E F G Y

Linear gap model
Gap = -1
Match = 4
Mismatch = -2

      -   E   Q   L   L   K   A   L   E   F   K   L
  -   0   0   0   0   0   0   0   0   0   0   0   0
  K   0
  V   0
  L   0
  E   0
  F   0
  G   0
  Y   0

Page 42: Pairwise sequence  Alignment

Example

Q: E Q L L K A L E F K L
P: K V L E F G Y

Linear gap model
Gap = -1
Match = 4
Mismatch = -2

      -   E   Q   L   L   K   A   L   E   F   K   L
  -   0   0   0   0   0   0   0   0   0   0   0   0
  K   0   0   0   0   0   4   3   2   1   0   4   3
  V   0   0   0   0   0   3   2   1   0   0   3   2
  L   0   0   0   4   4   3   2   6   5   4   3   7
  E   0   4   3   3   3   2   1   5  10   9   8   7
  F   0   3   2   2   2   1   0   4   9  14  13  12
  G   0   2   1   1   1   0   0   3   8  13  12  11
  Y   0   1   0   0   0   0   0   2   7  12  11  10

Page 43: Pairwise sequence  Alignment


Page 44: Pairwise sequence  Alignment

Example

Q: E Q L L K A L E F K L P: K V L E F G Y


Alignment

Q: K A - L E F
P: K - V L E F

Page 45: Pairwise sequence  Alignment

Example

Q: E Q L L K A L E F K L P: K V L E F G Y


Alignment

Q: K - A L E F
P: K V - L E F

Page 46: Pairwise sequence  Alignment

Example

Q: E Q L L K A L E F K L P: K V L E F G Y


Alignment

Q: K A L E F
P: K V L E F
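The example can be reproduced with a short sketch that fills the matrix and then traces back from the highest-scoring cell until a zero is reached. It uses the example's parameters (Gap = -1, Match = 4, Mismatch = -2). The function name and the tie-breaking rule (diagonal first, then up, then left) are assumptions of this sketch, so it reports only one of the equally scoring alignments, the gap-free one.

```python
def smith_waterman(p: str, q: str, match: int = 4,
                   mismatch: int = -2, gap: int = -1):
    """Return (score, aligned_q, aligned_p) for one optimal local alignment."""
    H = [[0] * (len(q) + 1) for _ in range(len(p) + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, len(p) + 1):
        for j in range(1, len(q) + 1):
            s = match if p[i - 1] == q[j - 1] else mismatch
            H[i][j] = max(H[i - 1][j - 1] + s, H[i - 1][j] + gap,
                          H[i][j - 1] + gap, 0)
            if H[i][j] > best:
                best, bi, bj = H[i][j], i, j

    # Trace back from the maximum cell; stop at the first zero.
    ap, aq = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if p[i - 1] == q[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:      # diagonal: pair two residues
            ap.append(p[i - 1]); aq.append(q[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:      # up: residue of P over a gap
            ap.append(p[i - 1]); aq.append("-"); i -= 1
        else:                                   # left: residue of Q over a gap
            ap.append("-"); aq.append(q[j - 1]); j -= 1
    return best, "".join(reversed(aq)), "".join(reversed(ap))


score, q_aligned, p_aligned = smith_waterman("KVLEFGY", "EQLLKALEFKL")
print(score)
print("Q:", q_aligned)
print("P:", p_aligned)
```

Changing the order of the three checks in the traceback would yield one of the other alignments with a gap in each sequence, since all three reach the same cell with the same score.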

Page 47: Pairwise sequence  Alignment

Another Example

Find the local alignment between:

Q: G C T G G A A G G C A T
P: G C A G A G C A C G

Linear gap model
Gap = -4
Match = +5
Mismatch = -4

Initial matrix (Q across the top, P down the side):

     --   G   C   T   G   G   A   A   G   G   C   A   T
 --   0   0   0   0   0   0   0   0   0   0   0   0   0
  G   0
  C   0
  A   0
  G   0
  A   0
  G   0
  C   0
  A   0
  C   0
  G   0

Page 48: Pairwise sequence  Alignment

Another Example

Filled matrix (Q across the top, P down the side):

     --   G   C   T   G   G   A   A   G   G   C   A   T
 --   0   0   0   0   0   0   0   0   0   0   0   0   0
  G   0   5   1   0   5   5   1   0   5   5   1   0   0
  C   0   1  10   6   2   1   1   0   1   1  10   6   2
  A   0   0   6   6   2   0   6   6   2   0   6  15  11
  G   0   5   2   2  11   7   3   2  11   7   3  11  11
  A   0   1   1   0   7   7  11   8   7   7   3   8   7
  G   0   5   1   0   5  11   7   7  13  12   8   4   4
  C   0   0  10   6   2   7   7   3   9   8  17  13   9
  A   0   0   6   6   2   3  11  12   8   5  13  22  18
  C   0   0   5   2   2   0   7   8   8   4  18  18  18
  G   0   5   1   1   7   7   5   4  13  13  14  14  14

Q’s subsequence: G A A G – G C A
P’s subsequence: G C A G A G C A

Page 49: Pairwise sequence  Alignment