Dynamic Programming for Pairwise Alignment 2

Dr Alexei Drummond
Department of Computer Science
Semester 2, 2006


Review

Dynamic programming algorithm for global alignment (Needleman & Wunsch)

Given sequences $X = (x_1, x_2, \ldots, x_m)$ and $Y = (y_1, y_2, \ldots, y_n)$:

$F(i,j)$ = score of the best alignment between $(x_1, x_2, \ldots, x_i)$ and $(y_1, y_2, \ldots, y_j)$


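Principle of Optimality

The optimal alignment of $(x_1, x_2, \ldots, x_i)$ with $(y_1, y_2, \ldots, y_j)$, with score $F(i,j)$, looks like

• the optimal alignment of $(x_1, \ldots, x_{i-1})$ with $(y_1, \ldots, y_{j-1})$, followed by $x_i$ aligned with $y_j$: score $F(i-1, j-1) + s(x_i, y_j)$

or

• the optimal alignment of $(x_1, \ldots, x_i)$ with $(y_1, \ldots, y_{j-1})$, followed by $y_j$ aligned with a gap: score $F(i, j-1) - d$

or

• the optimal alignment of $(x_1, \ldots, x_{i-1})$ with $(y_1, \ldots, y_j)$, followed by $x_i$ aligned with a gap: score $F(i-1, j) - d$

so

$$
F(i,j) = \max \begin{cases}
F(i-1, j-1) + s(x_i, y_j) \\
F(i, j-1) - d \\
F(i-1, j) - d
\end{cases}
$$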

Basis

$F(0,0) = 0$

$(x_1, \ldots, x_i)$ aligned entirely with gaps: $F(i,0) = F(i-1,0) + s(x_i, -)$

$(y_1, \ldots, y_j)$ aligned entirely with gaps: $F(0,j) = F(0,j-1) + s(-, y_j)$

With a linear gap penalty, $s(x_i, -) = s(-, y_j) = -d$, so $F(i,0) = -id$ and $F(0,j) = -jd$.
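The recurrence and basis can be sketched in Python as follows. This is a minimal illustration, not the lecturer's code: the function name `needleman_wunsch`, the match/mismatch scoring function and the gap penalty `d = 1` are assumptions chosen to keep the example small.

```python
def needleman_wunsch(x, y, s, d):
    """Fill the global-alignment matrix F and trace back one optimal alignment."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # basis: x prefix aligned to gaps
        F[i][0] = F[i - 1][0] - d
    for j in range(1, n + 1):          # basis: y prefix aligned to gaps
        F[0][j] = F[0][j - 1] - d
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i][j - 1] - d,
                          F[i - 1][j] - d)
    # Traceback from (m, n) to (0, 0), re-deriving which case gave each cell.
    ax, ay, i, j = [], [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif j > 0 and F[i][j] == F[i][j - 1] - d:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
        else:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
    return F, "".join(reversed(ax)), "".join(reversed(ay))


# Hypothetical scoring: +1 for a match, -1 for a mismatch.
score = lambda a, b: 1 if a == b else -1
F, ax, ay = needleman_wunsch("AW", "AGW", score, d=1)
print(F[-1][-1], ax, ay)  # 1 A-W AGW
```

Ties between the three cases are broken in favour of the diagonal move here; other tie-breaking rules give other alignments with the same optimal score.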

Filling up table

[Figure: the F matrix, rows 0, 1, 2, ..., m indexed by X and columns 0, 1, 2, ..., n indexed by Y, starting from the entry 0 in cell (0,0); the optimal alignment score is marked.]

Constructing alignment

[Figure: the same F matrix, used to construct the alignment from the optimal alignment score.]

Example

F matrix (rows: X = PAWHEAE, columns: Y = HEAGAWGHEE)

              H    E    A    G    A    W    G    H    E    E
      0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P    -8   -2   -9  -17  -25  -33  -42  -49  -57  -65  -73
A   -16  -10   -3   -4  -12  -20  -28  -36  -44  -52  -60
W   -24  -18  -11   -6   -7  -15   -5  -13  -21  -29  -37
H   -32  -14  -18  -13   -8   -9  -13   -7   -3  -11  -19
E   -40  -22   -8  -16  -16   -9  -12  -15   -7    3   -5
A   -48  -30  -16   -3  -11  -11  -12  -12  -15   -5    2
E   -56  -38  -24  -11   -6  -12  -14  -15  -12   -9    1

Optimal alignment score: the bottom-right entry, $F(m,n) = 1$

Alignment

Y  H E A G A W G H E - E
X  - - P - A W - H E A E

Time and space

F matrix: $(m+1) \times (n+1)$ table entries ⇒ $\Theta(mn)$ space

Each entry computed in constant time ⇒ $\Theta(mn)$ time

Smith & Waterman algorithm

Computes local alignment, i.e. looks for the best alignment of subsequences of X and Y, ignoring scores of regions on either side.

[Figure: X and Y with the best subsequence alignment marked.]

Recurrences

$$
F(i,j) = \max \begin{cases}
0 \\
F(i-1, j-1) + s(x_i, y_j) \\
F(i, j-1) - d \\
F(i-1, j) - d
\end{cases}
$$

Basis: $F(i,0) = F(0,j) = 0$
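A minimal Python sketch of these local-alignment recurrences follows. The function name `smith_waterman`, the +2/-1 scoring and the gap penalty `d = 1` are illustrative assumptions; the traceback starts at the highest-scoring cell and stops at the first cell whose value is 0, which is the standard formulation rather than something stated on the slide.

```python
def smith_waterman(x, y, s, d):
    """Best local alignment score and one alignment achieving it."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]   # basis: row 0 and column 0 are 0
    best, bi, bj = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(0,
                          F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i][j - 1] - d,
                          F[i - 1][j] - d)
            if F[i][j] > best:
                best, bi, bj = F[i][j], i, j
    # Trace back from the maximum until a zero cell is reached.
    ax, ay, i, j = [], [], bi, bj
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i][j - 1] - d:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
        else:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


# Hypothetical scoring: +2 for a match, -1 for a mismatch.
score = lambda a, b: 2 if a == b else -1
best, ax, ay = smith_waterman("PAWHE", "TAWGHE", score, d=1)
print(best, ax, ay)  # 7 AW-HE AWGHE
```

The unmatched leading residues (P in the first sequence, T in the second) are left out of the reported alignment, which is the point of the extra 0 in the maximum.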


Example

F matrix (rows: X = PAWHEAE, columns: Y = HEAGAWGHEE)

         H   E   A   G   A   W   G   H   E   E
     0   0   0   0   0   0   0   0   0   0   0
P    0   0   0   0   0   0   0   0   0   0   0
A    0   0   0   5   0   5   0   0   0   0   0
W    0   0   0   0   2   0  20  12   4   0   0
H    0  10   2   0   0   0  12  18  22  14   6
E    0   2  16   8   0   0   4  10  18  28  20
A    0   0   8  21  13   5   0   4  10  20  27
E    0   0   6  13  18  12   4   0   4  16  26

Alignment (ends at the highest-scoring entry, 28)

Y  A W G H E
X  A W - H E

Repeated (local) matches

Long sequences: interested in all local alignments with significant score, > threshold T,

e.g. copies of a repeated domain or motif in a protein.

X = sequence containing motif

Y = target sequence

Method is asymmetric.

[Figure: Y with several regions that match parts of X.]

Principle of Optimality

Given sequences $X = (x_1, x_2, \ldots, x_m)$ and $Y = (y_1, y_2, \ldots, y_n)$

Define $F(i,j)$ $(i \ge 1)$ = best sum of match scores in $(x_1, x_2, \ldots, x_i)$ and $(y_1, y_2, \ldots, y_j)$, assuming $y_j$ is in a matched region and the match ends in $x_i$ or $y_j$

Ends of matches

$F(0,0) = 0$

$F(0,j)$ = best sum of completed match scores to $(y_1, y_2, \ldots, y_j)$, assuming that $y_j$ is not in a matched region

$$
F(0,j) = \max \begin{cases}
F(0, j-1) \\
F(i, j-1) - T, \quad i = 1, \ldots, m
\end{cases}
$$

Row 0 therefore marks unmatched regions and ends of matches in Y.

General recurrence

$$
F(i,j) = \max \begin{cases}
F(0, j) & \text{start of new match} \\
F(i-1, j-1) + s(x_i, y_j) & \text{extension of previous match} \\
F(i, j-1) - d & \text{extension of previous match} \\
F(i-1, j) - d & \text{extension of previous match}
\end{cases}
$$
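The row-0 rule and the general recurrence can be combined into a short Python sketch. It is an illustration under assumptions: the function name `repeated_matches`, the +3/-3 scoring, `d = 2` and `T = 4` are all hypothetical, column 0 is taken to be all zeros, and one extra column is appended to row 0 to hold the final total.

```python
def repeated_matches(x, y, s, d, T):
    """Total score of all non-overlapping matches of x in y that score above T."""
    m, n = len(x), len(y)
    # Columns 0..n follow y; column n+1 of row 0 is the extra cell for the total.
    F = [[0] * (n + 2) for _ in range(m + 1)]
    for j in range(1, n + 2):
        # Row 0: stay unmatched, or close a match that ended in column j-1.
        F[0][j] = max([F[0][j - 1]] + [F[i][j - 1] - T for i in range(1, m + 1)])
        if j == n + 1:
            break
        for i in range(1, m + 1):
            F[i][j] = max(F[0][j],                                   # start of new match
                          F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),   # extend by a pair
                          F[i][j - 1] - d,                           # extend by a gap
                          F[i - 1][j] - d)                           # extend by a gap
    return F[0][n + 1], F


# Hypothetical scoring: +3 for a match, -3 for a mismatch.
score = lambda a, b: 3 if a == b else -3
total, F = repeated_matches("AW", "AWGAW", score, d=2, T=4)
print(total)  # 4
```

In this toy run the motif AW occurs twice in the target; each copy scores 6 and pays the threshold 4 when it is closed, so the total is 2 + 2 = 4.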

Filling up table

[Figure: sequence of frames showing the F matrix, rows 0, 1, 2, ..., m indexed by X and columns 0, 1, 2, ..., n indexed by Y, being filled starting from the entry 0 in cell (0,0); the final frame marks the optimal sum of alignment scores.]


Example

F matrix (rows: X = PAWHEAE, columns: Y = HEAGAWGHEE)

         H   E   A   G   A   W   G   H   E   E
     0   0   0   0   1   1   1   1   1   3   9
P    0   0   0   0   1   1   1   1   1   3   9
A    0   0   0   5   1   6   1   1   1   3   9
W    0   0   0   0   2   1  21  13   5   3   9
H    0  10   2   0   1   1  13  19  23  15   9
E    0   2  16   8   1   1   5  11  19  29  21
A    0   0   8  21  13   6   1   5  11  21  28
E    0   0   6  13  18  12   4   1   5  17  27

Extra cell for final total score: 9

Alignment

Y  H E A G A W G H E E
X  H E A . A W - H E .

Overlap matches

[Figure: four configurations in which X and Y overlap.]

Don't penalize overhanging ends, i.e. set $F(i,0) = F(0,j) = 0$

Otherwise

$$
F(i,j) = \max \begin{cases}
F(i-1, j-1) + s(x_i, y_j) \\
F(i, j-1) - d \\
F(i-1, j) - d
\end{cases}
$$
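A minimal sketch of overlap matching in Python, under stated assumptions: the name `overlap_score`, the +2/-1 scoring and `d = 2` are illustrative, and the best overlap is read off as the maximum over the last row and last column, which is the usual convention and is not spelled out on the slide.

```python
def overlap_score(x, y, s, d):
    """Best overlap-alignment score: overhanging ends are not penalized."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]   # F(i,0) = F(0,j) = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i][j - 1] - d,
                          F[i - 1][j] - d)
    # The alignment may end anywhere on the bottom row or right-hand column.
    best = max(max(F[m]), max(row[n] for row in F))
    return best, F


# Hypothetical scoring: +2 for a match, -1 for a mismatch.
score = lambda a, b: 2 if a == b else -1
best, F = overlap_score("TTAW", "AWCC", score, d=2)
print(best)  # 4
```

Here the suffix AW of the first sequence overlaps the prefix AW of the second, and neither the leading TT nor the trailing CC costs anything.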


Example

F matrix (rows: X = PAWHEAE, columns: Y = HEAGAWGHEE)

         H   E   A   G   A   W   G   H   E   E
     0   0   0   0   0   0   0   0   0   0   0
P    0  -2  -1  -1  -2  -1  -4  -2  -2  -1  -1
A    0  -2  -2   4  -1   3  -4  -4  -4  -3  -2
W    0  -3  -5  -4   1  -4  18  10   2  -6  -6
H    0  10   2  -6  -6  -1  10  16  20  12   4
E    0   2  16   8   0  -7   2   8  16  26  18
A    0  -2   8  21  13   5  -3   2   8  18  25
E    0   0   4  13  18  12   4  -4   2  14  24

Alignment

Y  G A W G H E E
X  P A W - H E A

Affine gap penalties

• Affine score: $\gamma(g) = -d - (g-1)e$, where $d$ is the gap-open penalty and $e$ is the gap-extension penalty

• Different penalties associated with extending alignment with gap symbol

Y = C C T W P
X = C S T W -

different from

Y = C C T W P
X = C S T - -
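As a small worked illustration of the affine score, the sketch below evaluates $\gamma(g)$ for the two gap lengths in the example alignments. The function name `affine_gap` and the values d = 12, e = 2 are assumptions picked for the example.

```python
def affine_gap(g, d, e):
    """Affine gap score: -d for opening the gap, -e for each further position."""
    return -d - (g - 1) * e


d, e = 12, 2                      # hypothetical gap-open and gap-extension penalties
print(affine_gap(1, d, e))        # -12: single gap, as in X = C S T W -
print(affine_gap(2, d, e))        # -14: gap of length 2, as in X = C S T - -
print(2 * affine_gap(1, d, e))    # -24: two separate gaps of length 1
```

With e smaller than d, one gap of length 2 is penalized less than two separate gaps of length 1.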

General recurrence

For $i, j > 0$:

$$
F(i,j) = \max \begin{cases}
F(i-1, j-1) + s(x_i, y_j) \\
F(k, j) + \gamma(i-k), & k = 0, 1, \ldots, i-1 \\
F(i, k) + \gamma(j-k), & k = 0, 1, \ldots, j-1
\end{cases}
$$

• First case: extend by matching $x_i$ and $y_j$

• Second case: extend with a gap of length $i-k$ ($x_{k+1}, \ldots, x_i$ aligned to gap symbols)

• Third case: extend with a gap of length $j-k$ ($y_{k+1}, \ldots, y_j$ aligned to gap symbols)

Problem: procedure runs in worst-case time $\Theta(n^3)$

$\Theta(n^2)$ version

Extra variables

$M(i,j)$ = best score of alignment of $(x_1, x_2, \ldots, x_i)$ and $(y_1, y_2, \ldots, y_j)$ given that $x_i$ is aligned with $y_j$

$I_x(i,j)$ = best score of alignment of $(x_1, x_2, \ldots, x_i)$ and $(y_1, y_2, \ldots, y_j)$ given that $x_i$ is aligned with a gap

$I_y(i,j)$ = best score of alignment of $(x_1, x_2, \ldots, x_i)$ and $(y_1, y_2, \ldots, y_j)$ given that $y_j$ is aligned with a gap

Recurrences

For $i, j > 0$:

$$
M(i,j) = \max \begin{cases}
M(i-1, j-1) + s(x_i, y_j) \\
I_x(i-1, j-1) + s(x_i, y_j) \\
I_y(i-1, j-1) + s(x_i, y_j)
\end{cases}
$$

$$
I_x(i,j) = \max \begin{cases}
M(i-1, j) - d & x_i \text{ aligned to start of gap} \\
I_x(i-1, j) - e & x_i \text{ aligned to continuation of gap}
\end{cases}
$$

$$
I_y(i,j) = \max \begin{cases}
M(i, j-1) - d & y_j \text{ aligned to start of gap} \\
I_y(i, j-1) - e & y_j \text{ aligned to continuation of gap}
\end{cases}
$$

Procedure runs in worst-case time $\Theta(n^2)$
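The three recurrences translate into the following Python sketch for the global case. The boundary values are not given on the slide; here they are assumed to be $M(0,0) = 0$, $I_x(i,0) = \gamma(i)$, $I_y(0,j) = \gamma(j)$ and minus infinity elsewhere on the border. The function name and the scoring values are likewise illustrative.

```python
NEG_INF = float("-inf")


def affine_global_score(x, y, s, d, e):
    """Global alignment score with affine gaps, using the M, Ix, Iy recurrences."""
    m, n = len(x), len(y)
    M = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Ix = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Iy = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0
    for i in range(1, m + 1):          # assumed boundary: leading gap of length i
        Ix[i][0] = -d - (i - 1) * e
    for j in range(1, n + 1):          # assumed boundary: leading gap of length j
        Iy[0][j] = -d - (j - 1) * e
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = s(x[i - 1], y[j - 1])
            M[i][j] = max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1]) + pair
            Ix[i][j] = max(M[i - 1][j] - d, Ix[i - 1][j] - e)
            Iy[i][j] = max(M[i][j - 1] - d, Iy[i][j - 1] - e)
    return max(M[m][n], Ix[m][n], Iy[m][n])


# Hypothetical scoring: +2 match, -1 mismatch, gap-open 3, gap-extend 1.
score = lambda a, b: 2 if a == b else -1
print(affine_global_score("ACCCT", "AT", score, d=3, e=1))  # -1
```

The optimum aligns A with A and T with T around one gap of length 3: 2 + (-3 - 2·1) + 2 = -1. Each cell needs only a constant number of lookups, which is where the quadratic running time comes from.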

Linear space alignment

[Figure: sequence of frames showing the F matrix, rows 0, 1, 2, ..., m indexed by X and columns 0, 1, 2, ..., n indexed by Y; the later frames mark row $\lfloor m/2 \rfloor$.]

Linear space algorithm

[Figure: scores computed from the top, $F_{top}(j)$, plus scores computed from the bottom, $F_{bot}(j)$, give $F_{top}(j) + F_{bot}(j)$; column $k$ is marked.]

Choose $k \in \{0, 1, \ldots, n\}$ such that $F_{top}(k) + F_{bot}(k)$ is maximized

$k$ is on path of optimal alignment
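The split step can be sketched in Python as below. It assumes the linear gap penalty $d$ from the earlier recurrences and splits X at row $\lfloor m/2 \rfloor$; the names `last_row` and `split_column` and the scoring values are hypothetical. Only two rows of the matrix are held at any time.

```python
def last_row(x, y, s, d):
    """Last row of the global F matrix for x against y, keeping two rows in memory."""
    prev = [-j * d for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [-i * d] + [0] * len(y)
        for j in range(1, len(y) + 1):
            cur[j] = max(prev[j - 1] + s(x[i - 1], y[j - 1]),
                         cur[j - 1] - d,
                         prev[j] - d)
        prev = cur
    return prev


def split_column(x, y, s, d):
    """Column k where an optimal path crosses the middle row of the matrix."""
    mid = len(x) // 2
    f_top = last_row(x[:mid], y, s, d)
    # Scores "from the bottom": align the reversed second half to reversed y.
    f_bot = last_row(x[mid:][::-1], y[::-1], s, d)[::-1]
    totals = [t + b for t, b in zip(f_top, f_bot)]
    k = max(range(len(y) + 1), key=totals.__getitem__)
    return mid, k, totals[k]


# Hypothetical scoring: +1 for a match, -1 for a mismatch.
score = lambda a, b: 1 if a == b else -1
print(split_column("AW", "AGW", score, d=1))  # (1, 1, 1)
```

The full linear-space method would recurse on the two sub-problems, x[:mid] against y[:k] and x[mid:] against y[k:]; that recursion is left out of this sketch.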

Linear space alignment: Hirschberg's insight

[Figure: the F matrix with corners (0,0) and (m,n); row $\lfloor m/2 \rfloor$ and column $k$ are marked.]

Software for pairwise alignment

Pure D.P. runs in $\Theta(mn)$ time

Example

100 million residues in database

Search sequence of length 10,000

Number of F matrix cells to be calculated: $10^{12}$

Computer speed: 10 million cells a second

Total time: 100,000 seconds = 28 hours (approx.)
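The arithmetic behind this estimate can be checked in a few lines of Python; the variable names are only for illustration.

```python
database_residues = 100_000_000      # 100 million residues
query_length = 10_000
cells_per_second = 10_000_000        # 10 million cells a second

cells = database_residues * query_length
seconds = cells / cells_per_second
hours = seconds / 3600

print(f"{cells:.0e} cells, {seconds:,.0f} s, {hours:.1f} h")
# 1e+12 cells, 100,000 s, 27.8 h
```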

Heuristic methods

FASTA (Pearson & Lipman, 1988)

Words in X and Y (length ktup)

[Figure: lookup table of words, e.g. cgtta, each with a list of matches ..., (i, j), ..., where i is the position in X and j is the position in Y.]

• sort matches on j − i
• extend best matches (ungapped)
• join neighbouring matches by inserting gaps
• realign best matches by dynamic programming
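The first step in the list above (finding shared words and sorting them on j − i) can be sketched as follows. This is only an illustration of the word-lookup idea, not the FASTA program itself; the function name `ktup_diagonals` and the toy sequences are assumptions.

```python
from collections import Counter, defaultdict


def ktup_diagonals(x, y, ktup):
    """Count exact word matches between x and y on each diagonal j - i."""
    index = defaultdict(list)                 # word -> positions i in x
    for i in range(len(x) - ktup + 1):
        index[x[i:i + ktup]].append(i)
    diagonals = Counter()
    for j in range(len(y) - ktup + 1):
        for i in index.get(y[j:j + ktup], ()):
            diagonals[j - i] += 1             # the match (i, j) lies on diagonal j - i
    return diagonals.most_common()            # best diagonals first


print(ktup_diagonals("acgtacgt", "ttacgtac", ktup=4))
# [(2, 3), (-2, 2)]
```

Diagonals with many word matches are the candidates that the later steps extend, join and finally realign by dynamic programming.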


Sensitivity

Tradeoff

High values of ktup: faster search, but may miss significant matches

Low values of ktup: catches more matches, but slower

ktup = 1 for sensitivity close to dynamic programming

Available from

http://www.fasta.bioch.virginia.edu/

Example

>short1.seq Length: 100 August 1, 2003 11:09 Type: N Check: 5940
atgaaattaacagcaatagctaaagcaacattagcattaggaatattaacaacaggtgtgatgacagcagaaagtcaaactgtaaacgcgaaagtaaagt

>short2.seq Length: 100 August 1, 2003 10:43 Type: N Check: 1744
atgaagatgacagcaattgcgaaagccagtttagctctaagtattttagcgactggggttataacatcaacggctcaaactgtaaatgcgagcgaacatg

/seqprg/slib/bin/lalign -N 5000 -n -r "+5/-4" -f -12 -g -4 -w 75 -q @ @

resetting to DNA matrix
resetting to DNA matrix
LALIGN finds the best local alignments between two sequences
version 2.1u03 April 2000
Please cite: X. Huang and W. Miller (1991) Adv. Appl. Math. 12:373-381

resetting to DNA matrix
alignments < E( 0.05):score: 51 (50 max)
Comparison of:
(A) @ short1.seq Length: 100 August 1, 2003 11:09 Type - 100 nt
(B) @ short2.seq Length: 100 August 1, 2003 10:43 Typ - 100 nt
using matrix file: DNA, gap penalties: -12/-4 E(limit) 0.05

71.4% identity in 91 nt overlap (1-91:1-91); score: 221 E(10000): 3.7e-12

10 20 30 40 50 60 70short1 ATGAAATTAACAGCAATAGCTAAAGCAACATTAGCATTAGGAATATTAACAACAGGTGTGATGACAGCAGAAAGT ::::: : :::::::: :: ::::: : ::::: :: : :: ::: : :: :: :: :: ::: :: :short2 ATGAAGATGACAGCAATTGCGAAAGCCAGTTTAGCTCTAAGTATTTTAGCGACTGGGGTTATAACATCAACGGCT 10 20 30 40 50 60 70

80 90short1 CAAACTGTAAACGCGA ::::::::::: ::::short2 CAAACTGTAAATGCGA 80 90

----------

Input sequences

Output matches


Example

More matches

64.1% identity in 39 nt overlap (17-54:32-69); score: 53 E(10000): 3.7e+02

20 30 40 50short1 TAGCTAAAGCAACATTAGC-ATTAGGAATATTAACAACA ::::: : : ::::: : : :: ::: :::: ::short2 TAGCTCTAAGTATTTTAGCGACTGGGGTTAT-AACATCA 40 50 60

----------

73.9% identity in 23 nt overlap (60-77:6-28); score: 53 E(10000): 3.7e+02

60 70short1 GATGACAGCA-----GAAAGTCA :::::::::: ::::: ::short2 GATGACAGCAATTGCGAAAGCCA 10 20

----------

BLAST

Developed by Altschul et al. (1990)

Preprocesses query sequence

Makes list of “neighbourhood words” with match > T

Tries to extend “seed” matches (ungapped) in database sequences

GAPPED-BLAST looks for gapped alignments
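The idea of a neighbourhood word list can be illustrated with a brute-force Python sketch: enumerate every word of the same length and keep those scoring above T against the query word. This is not how BLAST is implemented; the function name, the DNA alphabet, the +5/-4 scoring and T = 5 are assumptions for the example.

```python
from itertools import product


def neighbourhood(word, alphabet, s, T):
    """All words of len(word) whose ungapped score against word exceeds T."""
    return sorted(
        "".join(candidate)
        for candidate in product(alphabet, repeat=len(word))
        if sum(s(a, b) for a, b in zip(word, candidate)) > T
    )


# Hypothetical scoring: +5 for a match, -4 for a mismatch.
score = lambda a, b: 5 if a == b else -4
words = neighbourhood("ACG", "ACGT", score, T=5)
print(len(words), words[:3])  # 10 ['AAG', 'ACA', 'ACC']
```

With these values the list holds the exact word (score 15) and the nine words with one mismatch (score 6); words with two mismatches score -3 and are dropped.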


Genetics Computer Group package

GCG at University of Wisconsin

Commercial package (http://www.gcg.com/)

* assemble * backtranslate * bestfit * blast * breakup * chopup * circles * codonfrequency * codonpreference * coilscan * compare * composition * compresstext * comptable * consensus * correspond * corrupt * dataset * detab * distances * diverge * domes * dotplot * extractpeptide * fasta * fasta_parsable_output * fetch * figure * findpatterns * fingerprint * fitconsensus * foldrna * framealign * frames * framesearch * fromembl * fromfasta * fromgenbank * fromig * frompir * fromstaden * gap * gapshow * gcgtoblast * gelassemble * geldisassemble

* gelenter * gelintroduction * gelmerge * gelstart * gelview * getseq * growtree * helicalwheel * hthscan * isoelectric * lineup * listfile * lookup * lprint * map * mapplot * mapsort * mfold * moment * motifs * mountains * name * names * netblast * nooverlap * olddistances * onecase * overlap * paupdisplay * paupsearch * pepdata * pepplot * peptidemap * peptidesort * peptidestructure * pileup * plasmidmap * plotfold * plotsimilarity * plotstructure * plottest * pretty * prime * profileanalysis * profilegap * profilemake

* profilescan * profilesearch * profilesegments * publish * red * reformat * repeat * replace * reverse * sample * seg * segments * seqed * seqlab * setkeys * setplot * shiftover * shuffle * simplify * spew * spscan * squiggles * statplot * stemloop * stringsearch * symbol * terminator * testcode * tfasta * tofasta * toig * topir * tostaden * translate * whats_new_9.0 * whats_new_9.1 * window * wordsearch * xnu


GAP

GAP (“Global Alignment Program” ?)

Needleman & Wunsch algorithm

Input in GCG format

Use GETSEQ

!!NA_SEQUENCE 1.0 GETSEQ from gcg, August 14, 19103 12:19.

Length: 389 August 14, 19103 12:19 Type: N Check: 9580 ..

1 AAATGATAAA CTATTTTACT TTATGTCTAA GGTCTTTCAT AATATGAAAT

51 AGAATGTAGA TATTGCAACA ATAGCATTTT TGGAGACAGC TACCTCCTTT

101 ACCAGGAATA ATCTTTGCAT GTCACATTTA GAGATAAAGC TCAAAATGCA

151 AATCCTTCCC CTGAGAGTGG GAAAGCATTA ACAAATGAGA GTGGGAAAAG

201 CATTAACAAA GCATTAACAC AGGTCTTTAC ATATTCAAAA TATTAAACTA

251 ATGCTAGGAT TATAGACTTG ATTTTAAGAC ATGGTAGTTA ATAGAAAAGT

301 TCTAGATTGA AAACAATTTT GCAAAAATAT ACATTTGGTA TATGTGTATA

351 TATGTATGTG GTATATATAT ATCNACTAGG GAAAATATA

Example

<heman.lsb.sbs.auckland.ac.nz:/usr/users/gcg/359Stuff>gap

Gap uses the algorithm of Needleman and Wunsch to find the alignment of two complete sequences that maximizes the number of matches and minimizes the number of gaps.

GAP of what sequence 1 ? Hs#S374655.gcg

Begin (* 1 *) ? End (* 389 *) ? Reverse (* No *) ?

to what sequence 2 (* Hs#S374655.gcg *) ? Hs#S1117589.gcg

Begin (* 1 *) ? End (* 323 *) ? Reverse (* No *) ?

What is the gap creation penalty (* 50 *) ?

What is the gap extension penalty (* 3 *) ?

What should I call the paired output display file (* Hs#S374655.pair *) ?

Aligning ................-. Aligning ................-..

Gaps: 0 Quality: 3080 Quality Ratio: 9.536 % Similarity: 95.356 Length: 389

Display file

GAP of: Hs#S374655.gcg check: 9580 from: 1 to: 389
GETSEQ from gcg, August 14, 19103 12:19.
to: Hs#S1117589.gcg check: 8814 from: 1 to: 323
GETSEQ from gcg, August 14, 19103 12:20.

Symbol comparison table: /usr/users/gcg/gcgcore/data/rundata/nwsgapdna.cmp CompCheck: 8760 Gap Weight: 50 Average Match: 10.000 Length Weight: 3 Average Mismatch: 0.000 Quality: 3080 Length: 389 Ratio: 9.536 Gaps: 0 Percent Similarity: 95.356 Percent Identity: 95.356 Match display thresholds for the alignment(s): | = IDENTITY : = 5 . = 1 Hs#S374655.gcg x Hs#S1117589.gcg August 18, 19103 17:59 .. . . . . . 1 AAATGATAAACTATTTTACTTTATGTCTAAGGTCTTTCATAATATGAAAT 50 ||||||||||||||||||||||||||||||||||||||||||||||| 1 ...TGATAAACTATTTTACTTTATGTCTAAGGTCTTTCATAATATGAAAT 47 . . . . . 51 AGAATGTAGATATTGCAACAATAGCATTTTTGGAGACAGCTACCTCCTTT 100 |||||||||||||||||||||||||||||||||||||||||||||||||| 48 AGAATGTAGATATTGCAACAATAGCATTTTTGGAGACAGCTACCTCCTTT 97 . . . . . 101 ACCAGGAATAATCTTTGCATGTCACATTTAGAGATAAAGCTCAAAATGCA 150 |||||||||||||||||||||||||||||||||||||||||||| ||||| 98 ACCAGGAATAATCTTTGCATGTCACATTTAGAGATAAAGCTCAAGATGCA 147 . . . . . 151 AATCCTTCCCCTGAGAGTGGGAAAGCATTAACAAATGAGAGTGGGAAAAG 200 |||||||||||||||||||||||||||||||||||||||||||||||||| 148 AATCCTTCCCCTGAGAGTGGGAAAGCATTAACAAATGAGAGTGGGAAAAG 197 . . . . . 201 CATTAACAAAGCATTAACACAGGTCTTTACATATTCAAAATATTAAACTA 250 |||||||||||||||||||||||||||||||||||||||||||||||||| 198 CATTAACAAAGCATTAACACAGGTCTTTACATATTCAAAATATTAAACTA 247 . . . . . 251 ATGCTAGGATTATAGACTTGATTTTAAGACATGGTAGTTAATAGAAAAGT 300 ||||||||||||||||||||||||||| |||||||| ||||| 248 ATGCTAGGATTATAGACTTGATTTTAAACATGGGTAGTTATAGAAAAAGG 297 . . . . . 301 TCTAGATTGAAAACAATTTTGCAAAAATATACATTTGGTATATGTGTATA 350 |||||||||||||||| ||| ||| 298 TCTAGATTGAAAACAAATTTTGCAAA........................ 323 . .

Bestfit

<heman.lsb.sbs.auckland.ac.nz:/usr/users/gcg/359Stuff>bestfit

BestFit makes an optimal alignment of the best segment of similarity between two sequences. Optimal alignments are found by inserting gaps to maximize the number of matches using the local homology algorithm of Smith and Waterman.

BESTFIT of what sequence 1 ? short1.gcg

Begin (* 1 *) ? End (* 100 *) ? Reverse (* No *) ?

to what sequence 2 (* short1.gcg *) ? short2.gcg

Begin (* 1 *) ? End (* 100 *) ? Reverse (* No *) ?

What is the gap creation penalty (* 50 *) ?

What is the gap extension penalty (* 3 *) ?

What should I call the paired output display file (* short1.pair *) ?

Aligning ....-. Aligning ....-.

Gaps: 0 Quality: 416 Quality Ratio: 4.571 % Similarity: 71.429 Length: 91

Smith & Waterman algorithm

Local alignment

Same interface as GAP

Bestfit display file

BESTFIT of: short1.gcg check: 2998 from: 1 to: 100

GETSEQ from gcg, August 18, 19103 15:25.

to: short2.gcg check: 6455 from: 1 to: 100

GETSEQ from gcg, August 18, 19103 15:26.

Symbol comparison table: /usr/users/gcg/gcgcore/data/rundata/swgapdna.cmp CompCheck: 2335

Gap Weight: 50 Average Match: 10.000 Length Weight: 3 Average Mismatch: -9.000

Quality: 416 Length: 91 Ratio: 4.571 Gaps: 0 Percent Similarity: 71.429 Percent Identity: 71.429

Match display thresholds for the alignment(s): | = IDENTITY : = 5 . = 1

short1.gcg x short2.gcg August 18, 19103 15:27 ..

. . . . . 1 atgaaattaacagcaatagctaaagcaacattagcattaggaatattaac 50 ||||| | |||||||| || ||||| | ||||| || | || ||| | 1 atgaagatgacagcaattgcgaaagccagtttagctctaagtattttagc 50 . . . . 51 aacaggtgtgatgacagcagaaagtcaaactgtaaacgcga 91 || || || || ||| || |||||||||||| |||| 51 gactggggttataacatcaacggctcaaactgtaaatgcga 91


Wordsearch

Algorithm similar to algorithm of Wilbur and Lipman (1983).

Compares one sequence (the query) to any group of sequences.

Comparisons can be viewed as set of dot-plots.

Search finds registers of comparison (diagonals) that have the largest number of short perfect matches (words).

Best segment of similarity along each diagonal viewed with program SEGMENTS.

Wordsearch example

<heman.lsb.sbs.auckland.ac.nz:/usr/users/gcg/359Stuff>wordsearch

WordSearch identifies sequences in the database that share large numbers of common words in the same register of comparison with your query sequence. The output of WordSearch can be displayed with Segments.

WORDSEARCH with what query sequence ? short1.gcg

Begin (* 1 *) ? End (* 100 *) ?

Search for query in what sequence(s) (* GenEMBL:* *) ? short2.gcg

What word size (* 6 *) ?

List how many best diagonals (* 50 *) ? 4

Integrate how many adjacent diagonals (* 3 *) ?

What should I call the output file (* short1.word *) ?

1 short2.gcg Len: 100

6-mers found: 168 Diagonals with words: 6 Total diagonals: 398 Sequences searched: 1 CPU time: 00.03

Output file: /usr/users/gcg/359Stuff/short1.word


Short1.word contents

!!SEQUENCE_LIST 1.0 (Nucleotide) WORDSEARCH of: /usr/users/gcg/359Stuff/short1.gcg check: 2998 from: 1 to:100

GETSEQ from gcg, August 18, 19103 15:25.

TO: short2.gcg Sequences: 1 Total-length: 100 August 18, 19103 15:47

Word-size: 6 Words: 168 Diagonals: 6 Total-diagonals: 398 Integral-width: 3 Alphabet: 4 List-size: 4 CPU minutes: 0.00

Sequence      Strd  Diag  Score  Width  Documentation ..

/short2.gcg   +        0     20      3  GETSEQ from gcg, August 18, 19103 15:26.
/short2.gcg   +      -54     10      3  GETSEQ from gcg, August 18, 19103 15:26.
/short2.gcg   -      -47      7      3  GETSEQ from gcg, August 18, 19103 15:26.
/short2.gcg   +      -69      7      3  GETSEQ from gcg, August 18, 19103 15:26.


Run SEGMENTS

<heman.lsb.sbs.auckland.ac.nz:/usr/users/gcg/359Stuff>segments

Segments aligns and displays the segments of similarity found by WordSearch.

(BestFit) SEGMENTS from what WORDSEARCH file ? short1.word

What should I call the output file (* short1.pairs *) ?

Aligning ....-. /usr/users/gcg/359Stuff/short2.gcg 100 bp Gaps: 0 Quality:500 / Length: 98
Aligning ..-. /usr/users/gcg/359Stuff/short2.gcg 100 bp Gaps: 0 Quality:100 / Length: 10
Aligning ..-. /usr/users/gcg/359Stuff/short2.gcg 100 bp Gaps: 0 Quality:112 / Length: 32
Aligning .-. /usr/users/gcg/359Stuff/short2.gcg 100 bp Gaps: 0 Quality: 96 / Length: 16

Short1.pairs contents

(BestFit) SEGMENTS from: short1.word August 18, 19103 15:48

(Nucleotide) WORDSEARCH of: /usr/users/gcg/359Stuff/short1.gcg check: 2998 from: 1 to: 100
GETSEQ from gcg, August 18, 19103 15:25.
TO: short2.gcg Sequences: 1 Total-length: 100 August 18, 19103 15:47
Word-size: 6 Words: 168 Diagonals: 6 Total-diagonals: 398
Integral-width: 3 Alphabet: 4 List-size: 4 CPU minutes: 0.00

AvMatch: 3.84 AvMisMatch: -6.00 GapWeight: 50 LengthWeight: 3 ..

Match display thresholds for the alignment(s): | = IDENTITY : = 3 . = 1

short1.gcg check: 2998 from: 1 to: 100/usr/users/gcg/359Stuff/short2.gcg check: 6455 from: 1 to: 100 GETSEQ from gcg, August 18, 19103 15:26. Gaps: 0 Quality: 500 Ratio:5.102 Score:20 Width:3 Limits: +/-4 . . . . . 1 atgaaattaacagcaatagctaaagcaacattagcattaggaatattaac 50 ||||| | |||||||| || ||||| | ||||| || | || ||| | 1 atgaagatgacagcaattgcgaaagccagtttagctctaagtattttagc 50 . . . . 51 aacaggtgtgatgacagcagaaagtcaaactgtaaacgcgaaagtaaa 98 || || || || ||| || |||||||||||| |||| | | | 51 gactggggttataacatcaacggctcaaactgtaaatgcgagcgaaca 98

Short1.pairs (continued)

short1.gcg check: 2998 from: 54 to: 100
/usr/users/gcg/359Stuff/short2.gcg check: 6455 from: 1 to: 100
GETSEQ from gcg, August 18, 19103 15:26.
Gaps: 0 Quality: 100 Ratio:10.000 Score:10 Width:3 Limits: +/-4

60 gatgacagca 69
   ||||||||||
 6 gatgacagca 15

short1.gcg check: 2998 from: 47 to: 100 /Reverse/usr/users/gcg/359Stuff/short2.gcg check: 6455 from: 1 to: 100 GETSEQ from gcg, August 18, 19103 15:26. Gaps: 0 Quality: 112 Ratio:3.500 Score:7 Width:3 Limits: +/-4 . . . 40 ctaatgctaatgttgctttagctattgctgtt 9 | | ||| || | ||||||| | | || 14 caattgcgaaagccagtttagctctaagtatt 45

short1.gcg check: 2998 from: 69 to: 100/usr/users/gcg/359Stuff/short2.gcg check: 6455 from: 1 to: 100 GETSEQ from gcg, August 18, 19103 15:26. Gaps: 0 Quality: 96 Ratio:6.000 Score:7 Width:3 Limits: +/-4 . 79 actgtaaacgcgaaag 94 || | || ||||||| 10 acagcaattgcgaaag 25

EMBOSS

What is EMBOSS?
The European Molecular Biology Open Software Suite. EMBOSS is a package of high-quality FREE Open Source software for sequence analysis.

Applications in EMBOSS
The EMBOSS programs and their documentation.

User Documentation
Tutorial, Command syntax, Sequences and Databases, Reference

Jemboss and other Interfaces
Many groups are creating graphical interfaces to EMBOSS. Jemboss is our supported interface.

Downloading the software
You can download, install and run the software on most UNIX computers.
It is known to work on: Irix, AIX (4.3.3 and 5.1), Red Hat, SuSe, Debian, HPUX11/IA64, MacOSX, Mandrake, NetBSD, Slackware, Solaris, Tru64 Unix (Full support soon. Loan machine being arranged)
It is reported to work on: FreeBSD, OSF, SuSE-PPC

LATEST NEWS: Release 2.7.1 available as of 3rd June 2003

Suite of programs

getorf          HGMP    Finds and extracts open reading frames (ORFs)
helixturnhelix  HGMP    Finds nucleic acid binding domains.
hmoment         HGMP    Hydrophobic moment calculation
iep             HGMP    Calculates the isoelectric point of a protein
infoalign       HGMP    Information on a multiple sequence alignment
infoseq         HGMP    Displays some simple information about sequences
isochore        Sanger  Plots isochores in large DNA sequences
jembossctl      HGMP    Jemboss Authentication Control
lindna          Norway  Draws linear maps of DNA constructs
listor          HGMP    Writes a list file of the logical OR of two sets of sequences
marscan         HGMP    Finds MAR/SAR sites in nucleic sequences
maskfeat        HGMP    Mask off features of a sequence
maskseq         HGMP    Mask off regions of a sequence.
matcher         Sanger  Local alignment of two sequences
megamerger      HGMP    Merge two large overlapping nucleic acid sequences
merger          HGMP    Merge two overlapping sequences
msbar           HGMP    Mutate sequence beyond all recognition
mwcontam        HGMP    Shows molwts that match across a set of files
mwfilter        HGMP    Filter noisy molwts from mass spec output
needle          HGMP    Needleman-Wunsch global alignment.