2. comparing biological sequences : sequence alignment

64
An Introduction to Bioinformatics 2. Comparing biological sequences: sequence alignment

Upload: aminia

Post on 12-Jan-2016

55 views

Category:

Documents


2 download

DESCRIPTION

2. Comparing biological sequences : sequence alignment. DNA Sequence Comparison: First Success Story. Finding sequence similarities with genes of known function is a common approach to infer a newly sequenced gene’s function - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: 2. Comparing biological sequences :         sequence alignment

An Introduction to Bioinformatics

2. Comparing biological sequences: sequence alignment

Page 2: 2. Comparing biological sequences :         sequence alignment

An Introduction to Bioinformatics

DNA Sequence Comparison: First Success Story • Finding sequence similarities with genes of

known function is a common approach to infer a newly sequenced gene’s function

• In 1984 Russell Doolittle and colleagues found similarities between cancer-causing gene (v-sys in Simian Sarcoma Virus) and normal growth factor (PDGF) gene

Page 3: 2. Comparing biological sequences :         sequence alignment

3

An Introduction to Bioinformatics

So what?• Identifying the similarity between PDGF and

the viral oncogene helped lead to a modern hypothesis of cancer.

• Identifying the similarity between ATP binding ion channels and the Cystic Fibrosis gene led to a modern hypothesis of CF.

• Comparing sequences can yield biological insight.

3

Page 4: 2. Comparing biological sequences :         sequence alignment

An Introduction to Bioinformatics

Similarity• Similar genes may have similar functions

• Similar genes may have similar evolutionary origins

• Similarity between sequences is not a vague idea---it can be quantified

• But...how do we compute it?

Page 5: 2. Comparing biological sequences :         sequence alignment

5

An Introduction to Bioinformatics

Complete DNA Sequences

nearly 200 complete genomes have been

sequenced

Page 6: 2. Comparing biological sequences :         sequence alignment

6

An Introduction to Bioinformatics

Evolution

Page 7: 2. Comparing biological sequences :         sequence alignment

7

An Introduction to Bioinformatics

Evolution at the DNA level

…ACGGTGCAGTTACCA…

…AC----CAGTCCACCA…

Mutation

SEQUENCE EDITS

REARRANGEMENTS

Deletion

InversionTranslocationDuplication

Page 8: 2. Comparing biological sequences :         sequence alignment

8

An Introduction to Bioinformatics

Evolutionary Rates

OK

OK

OK

X

X

Still OK?

next generation

Page 9: 2. Comparing biological sequences :         sequence alignment

9

An Introduction to Bioinformatics

Sequence conservation implies function

Alignment is the key to• Finding important regions• Determining function• Uncovering the evolutionary forces

Page 10: 2. Comparing biological sequences :         sequence alignment

10

An Introduction to Bioinformatics

Sequence Alignment

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

DefinitionGiven two strings x = x1x2...xM, y = y1y2…yN,

an alignment is an assignment of gaps to positions0,…, N in x, and 0,…, N in y, so as to line up each

letter in one sequence with either a letter, or a gapin the other sequence

AGGCTATCACCTGACCTCCAGGCCGATGCCCTAGCTATCACGACCGCGGTCGATTTGCCCGAC

Page 11: 2. Comparing biological sequences :         sequence alignment

11

An Introduction to Bioinformatics

Apologia for a diversion...

• Computing similarity is detail-oriented, and we need to do some preliminary work first:• The Manhattan Tourist Problem introduces

grids, graphs and edit graphs

11

Page 12: 2. Comparing biological sequences :         sequence alignment

12

An Introduction to Bioinformatics

Manhattan Tourist Problem

Where to?

See the most stuff in the least time.

Page 13: 2. Comparing biological sequences :         sequence alignment

An Introduction to Bioinformatics

Manhattan Tourist Problem (MTP)Imagine seeking a path (from source to sink) to travel (only eastward and southward) with the most number of attractions (*) in the Manhattan grid

Sink*

*

*

*

*

**

* *

*

*

Source

Page 14: 2. Comparing biological sequences :         sequence alignment

An Introduction to Bioinformatics

Manhattan Tourist Problem: Formulation

Goal: Find the longest path in a weighted grid.

Input: A weighted grid G with two distinct vertices, one labeled “source” and the other labeled “sink”

Output: A longest path in G from “source” to “sink”

Page 15: 2. Comparing biological sequences :         sequence alignment

An Introduction to Bioinformatics

MTP: An Example

3 2 4

0 7 3

3 3 0

1 3 2

4

4

5

6

4

6

5

5

8

2

2

5

0 1 2 3

0

1

2

3

j coordinatesi co

ord

inate

s

13

source

sink

4

3 2 4 0

1 0 2 4 3

3

1

1

2

2

2

419

95

15

23

0

20

3

4

Page 16: 2. Comparing biological sequences :         sequence alignment

An Introduction to Bioinformatics

MTP: An Example

3 2 4

0 7 3

3 3 0

1 3 2

4

4

5

6

4

6

5

5

8

2

2

5

0 1 2 3

0

1

2

3

j coordinatesi co

ord

inate

s

13

source

sink

4

3 2 4 0

1 0 2 4 3

3

1

1

2

2

2

419

95

15

34

0

20

3

4

1 4

10 17

22

30 32

Page 17: 2. Comparing biological sequences :         sequence alignment

An Introduction to Bioinformatics

MTP: Simple Recursive ProgramMT(n,m) x MT(n-1,m)+ length of the edge from (n- 1,m) to (n,m) y MT(n,m-1)+ length of the edge from (n,m-1) to (n,m) return max{x,y}

Page 18: 2. Comparing biological sequences :         sequence alignment

An Introduction to Bioinformatics

MTP: Simple Recursive ProgramMT(n,m) x MT(n-1,m)+ length of the edge from (n- 1,m) to (n,m) y MT(n,m-1)+ length of the edge from (n,m-1) to (n,m) return max{x,y}

Slow, for the same reason that RecursiveChange was slow

Page 19: 2. Comparing biological sequences :         sequence alignment

An Introduction to Bioinformatics

1

5

0 1

0

1

i

source

1

5S1,0 = 5

S0,1 = 1

• Instead of recursion, store the result in an array S

MTP: Dynamic Programmingj

Page 20: 2. Comparing biological sequences :         sequence alignment

An Introduction to Bioinformatics

MTP: Dynamic Programming

1 2

5

3

0 1 2

0

1

2

source

1 3

5

8

4

S2,0 = 8

i

S1,1 = 4

S0,2 = 33

-5

j

Page 21: 2. Comparing biological sequences :         sequence alignment

An Introduction to Bioinformatics

MTP: Dynamic Programming

1 2

5

3

0 1 2 3

0

1

2

3

i

source

1 3

5

8

8

4

0

5

8

103

5

-5

9

131-5

S3,0 = 8

S2,1 = 9

S1,2 = 13

S3,0 = 8

j

Page 22: 2. Comparing biological sequences :         sequence alignment

An Introduction to Bioinformatics

MTP: Dynamic Programming

1 2 5

-5 1 -5

-5 3

0

5

3

0

3

5

0

10

-3

-5

0 1 2 3

0

1

2

3

i

source

1 3 8

5

8

8

4

9

13 8

9

12

S3,1 = 9

S2,2 = 12

S1,3 = 8

j

Page 23: 2. Comparing biological sequences :         sequence alignment

An Introduction to Bioinformatics

MTP: Dynamic Programming (cont’d)

1 2 5

-5 1 -5

-5 3 3

0 0

5

3

0

3

5

0

10

-3

-5

-5

2

0 1 2 3

0

1

2

3

i

source

1 3 8

5

8

8

4

9

13 8

12

9

15

9

j

S3,2 = 9

S2,3 = 15

Page 24: 2. Comparing biological sequences :         sequence alignment

An Introduction to Bioinformatics

MTP: Dynamic Programming (cont’d)

1 2 5

-5 1 -5

-5 3 3

0 0

5

3

0

3

5

0

10

-3

-5

-5

2

0 1 2 3

0

1

2

3

i

source

1 3 8

5

8

8

4

9

13 8

12

9

15

9

j

0

1

16S3,3 = 16

(showing all back-traces)

Done!

Page 25: 2. Comparing biological sequences :         sequence alignment

An Introduction to Bioinformatics

MTP: Recurrence

Computing the score for a point (i,j) by the recurrence relation:

si, j = max si-1, j + weight of the edge between (i-1, j) and (i, j)

si, j-1 + weight of the edge between (i, j-1) and (i, j)

the running time is n x m for a n by m grid

(n = # of rows, m = # of columns)

Page 26: 2. Comparing biological sequences :         sequence alignment

26

An Introduction to Bioinformatics

Manhattan is not a perfect grid

Page 27: 2. Comparing biological sequences :         sequence alignment

An Introduction to Bioinformatics

Manhattan Is Not A Perfect Grid

What about diagonals?

• The score at point B is given by:

sB =max of

sA1 + weight of the edge (A1, B)

sA2 + weight of the edge (A2, B)

sA3 + weight of the edge (A3, B)

B

A3

A1

A2

Page 28: 2. Comparing biological sequences :         sequence alignment

An Introduction to Bioinformatics

Manhattan Is Not A Perfect Grid (cont’d)

Computing the score for point x is given by the recurrence relation:

sx = max

of

sy + weight of vertex (y, x) where y є Predecessors(x)

• Predecessors (x) – set of vertices that have edges leading to x

•The running time for a graph G(V, E) (V is the set of all vertices and E is the set of all edges) is O(E) since each edge is evaluated once

Page 29: 2. Comparing biological sequences :         sequence alignment

An Introduction to Bioinformatics

Traversing the Manhattan Grid •3 different strategies:•a) Column by column•b) Row by row•c) Along diagonals

a) b)

c)

Page 30: 2. Comparing biological sequences :         sequence alignment

An Introduction to Bioinformatics

Aligning DNA Sequences

V = ATCTGATG

W = TGCATAC

n = 8

m = 7

A T C T G A T G

T G C A T A C

V

W

match

deletioninsertion

mismatch

indels

4123

matches mismatches insertions deletions

Page 31: 2. Comparing biological sequences :         sequence alignment

An Introduction to Bioinformatics

Longest Common Subsequence (LCS) Problem

• Given two sequences

v = v1 v2…vm and w = w1 w2…wn

• The LCS of v and w is a sequence of positions in

v: 1 < i1 < i2 < … < it < m

and a sequence of positions in

w: 1 < j1 < j2 < … < jt < n

such that it -th letter of v equals to jt-letter of w and t is maximal

Page 32: 2. Comparing biological sequences :         sequence alignment

An Introduction to Bioinformatics

LCS: Example

A T -- C T G A T G

-- T G C T -- A -- C

elements of v

elements of w

--

A

1

2

0

1

2

2

3

3

4

3

5

4

5

5

6

6

6

7

7

8

j coords:

i coords:

Matches shown in red

positions in v:positions in w:

2 < 3 < 4 < 61 < 3 < 5 < 6

The LCS Problem can be expressed using the grid similar to Manhattan Tourist Problem grid…

0

0

(0,0)(1,0)(2,1)(2,2)(3,3)(3,4)(4,5)(5,5)(6,6)(7,6)(8,7)

Page 33: 2. Comparing biological sequences :         sequence alignment

An Introduction to Bioinformatics

LCS: Dynamic Programming• Find the LCS of two

stringsInput: A weighted graph G with two distinct vertices, one labeled “source” one labeled “sink”Output: A longest path in G from “source” to “sink”•Solve using an LCS edit

graph with diagonals replaced with +1 edges

Page 34: 2. Comparing biological sequences :         sequence alignment

An Introduction to Bioinformatics

LCS Problem as Manhattan Tourist Problem

T

G

C

A

T

A

C

1

2

3

4

5

6

7

0i

A T C T G A T C0 1 2 3 4 5 6 7 8

j

Page 35: 2. Comparing biological sequences :         sequence alignment

An Introduction to Bioinformatics

Edit Graph for LCS Problem

T

G

C

A

T

A

C

1

2

3

4

5

6

7

0i

A T C T G A T C0 1 2 3 4 5 6 7 8

j

Page 36: 2. Comparing biological sequences :         sequence alignment

An Introduction to Bioinformatics

Computing LCS

Let vi = prefix of v of length i: v1 … vi

and wj = prefix of w of length j: w1 … wj

The length of LCS(vi,wj) is computed by:

si, j =maxsi-1, j

si, j-1

si-1, j-1 + 1 if vi = wj

Page 37: 2. Comparing biological sequences :         sequence alignment

An Introduction to Bioinformatics

Computing LCS (cont’d)

si,j = MAX si-1,j + 0 si,j -1 + 0 si-1,j -1 + 1, if vi = wj

i,j

i-1,j

i,j -1

i-1,j -1

1 0

0

Page 38: 2. Comparing biological sequences :         sequence alignment

An Introduction to Bioinformatics

LCS Problem RevisitedV = ATCTGATG = V1, V2, …, Vn where n = 8W = TGCATAC = W1, W2, …, Wm where m = 7

A T C T G A T G

T G C A T A C

V

W

2 3

1 3

4

5

6

6

i

j

i 1 , i 2 , i 3 , i 4 2 3 4 6j 1 , j 2 , j 3 , j 4

1 3 5 6

< < <

< < <

Manhattan Tourist Problem is also the example of maximizing problem

i 1 ,j 1 i 2 ,j 2 i 3 ,j 3

i 1 ,j 1

i 2 ,j 2...i t ,j t

Page 39: 2. Comparing biological sequences :         sequence alignment

An Introduction to Bioinformatics

Alignment Grid Example

0 1 2 3 4

0

1

2

3

4

W A T C G

A

T

G

T

V

0 1 2 2 3 4

V = A T - G T

| | |

W= A T C G –

0 1 2 3 4 4

Page 40: 2. Comparing biological sequences :         sequence alignment

An Introduction to Bioinformatics

Aligning Sequences without Insertions and Deletions: Hamming DistanceGiven two DNA sequences V and W :

V :

• The Hamming distance: dH(V, W) = 8 is large but the sequences are very similar

ATATATATATATATATW :

Page 41: 2. Comparing biological sequences :         sequence alignment

An Introduction to Bioinformatics

Aligning Sequences with Insertions and Deletions

V : ATATATATATATATATW : ----

However, by shifting one sequence over one position:

• The edit distance: dH(v, w) = 2.

• Using Hamming distance neglects insertions and deletions in DNA

Page 42: 2. Comparing biological sequences :         sequence alignment

An Introduction to Bioinformatics

Edit DistanceLevenshtein (1966) introduced edit distance of two strings as the minimum number of elementary operations (insertions, deletions, and substitutions) to transform one string into the other

d(v,w) = MIN no. of elementary operations

to transform v w

Page 43: 2. Comparing biological sequences :         sequence alignment

An Introduction to Bioinformatics

Edit Distance (cont’d)

V = ATATATAT

W = TATATATA

Edit distance: d(v, w) = 2 (one insertion and one deletion)

ith letter of v compare with ith letter of w

W = TATATATA

Just one shift

Make it all line up

V = - ATATATAT

Page 44: 2. Comparing biological sequences :         sequence alignment

An Introduction to Bioinformatics

Edit Distance: Example

• 5 edit operations: TGCATAT ATCCGAT

• TGCATAT (delete last T)• TGCATA (delete last A)• TGCAT (insert A at front)• ATGCAT (substitute C for 3rd G)• ATCCAT (insert G before last A) • ATCCGAT (Done)

Page 45: 2. Comparing biological sequences :         sequence alignment

An Introduction to Bioinformatics

Edit Distance: Example (cont’d)

4 edit operations: TGCATAT ATCCGAT

TGCATAT (insert A at front)ATGCATAT (delete 6th T)ATGCATA (substitute G for 5th A)ATGCGTA (substitute C for 3rd G)ATCCGAT (Done)

Page 46: 2. Comparing biological sequences :         sequence alignment

An Introduction to Bioinformatics

Alignment: 2 row representation

Alignment : 2 * k matrix ( k > m, n )

A T -- G T A T --

A T C G -- A -- C

letters of v

letters of w

T

T

ATCTGATTGCATA

v :w :

m = 7 n = 6

4 matches 2 insertions

2 deletions

Given 2 DNA sequences v and w:

Page 47: 2. Comparing biological sequences :         sequence alignment

An Introduction to Bioinformatics

The Alignment Grid •2 sequences used for grid

•V = ATGTTAT•W = ATCGTAC

•Every alignment path is from source to sink

Page 48: 2. Comparing biological sequences :         sequence alignment

48

An Introduction to Bioinformatics

Alignment as a Path in the Edit Graph

0 1 2 2 3 4 5 6 7 70 1 2 2 3 4 5 6 7 7 A T _ G T T A T _A T _ G T T A T _ A T C G T _ A _ CA T C G T _ A _ C0 1 2 3 4 5 5 6 6 7 0 1 2 3 4 5 5 6 6 7

(0,0) , (1,1) , (2,2), (2,3), (0,0) , (1,1) , (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7)(7,6), (7,7)

- Corresponding path -

Page 49: 2. Comparing biological sequences :         sequence alignment

49

An Introduction to Bioinformatics

Alignments in Edit Graph (cont’d)

and represent indels in v and w with score 0.

represent matches with score 1.• The score of the alignment path is 5.

Page 50: 2. Comparing biological sequences :         sequence alignment

50

An Introduction to Bioinformatics

Alignment as a Path in the Edit Graph

Every path in the edit graph corresponds to an alignment:

Page 51: 2. Comparing biological sequences :         sequence alignment

51

An Introduction to Bioinformatics

Alignment as a Path in the Edit Graph

Old AlignmentOld Alignment 01223012234545677677v= AT_Gv= AT_GTTTTAT_AT_w= ATCGw= ATCGT_T_A_CA_C 01234012345555667667

New AlignmentNew Alignment 01223012234545677677v= AT_Gv= AT_GTTTTAT_AT_w= ATCGw= ATCG_T_TA_CA_C 01234012344545667667

Page 52: 2. Comparing biological sequences :         sequence alignment

52

An Introduction to Bioinformatics

Alignment as a Path in the Edit Graph

01201222343455667777v= ATv= AT__GTGTTTAATT__w= ATw= ATCCGTGT__AA__CC 01201233454555666677

(0,0) , (1,1) , (2,2), (0,0) , (1,1) , (2,2), (2,3),(2,3), (3,4), (4,5), (3,4), (4,5), (5,5),(5,5), (6,6), (6,6), (7,6),(7,6), (7,7)(7,7)

Page 53: 2. Comparing biological sequences :         sequence alignment

53

An Introduction to Bioinformatics

Alignment: Dynamic Programming

si,j = si-1, j-1+1 if vi = wj

max si-1, j

si, j-1

Page 54: 2. Comparing biological sequences :         sequence alignment

54

An Introduction to Bioinformatics

Dynamic Programming Example

Initialize 1st row and 1st column to be all zeroes.

Or, to be more precise, initialize 0th row and 0th column to be all zeroes.

Page 55: 2. Comparing biological sequences :         sequence alignment

55

An Introduction to Bioinformatics

Dynamic Programming Example

Si,j = Si-1, j-1

max Si-1, j

Si, j-1

value from NW +1, if vi = wj

value from North (top) value from West (left)

Page 56: 2. Comparing biological sequences :         sequence alignment

56

An Introduction to Bioinformatics

Alignment: Backtracking

Arrows show where the score originated from.

if from the top

if from the left

if vi = wj

Page 57: 2. Comparing biological sequences :         sequence alignment

57

An Introduction to Bioinformatics

Backtracking Example

Find a match in row and column 2.

i=2, j=2,5 is a match (T).

j=2, i=4,5,7 is a match (T).

Since vi = wj, si,j = si-1,j-1 +1

s2,2 = [s1,1 = 1] + 1 s2,5 = [s1,4 = 1] + 1s4,2 = [s3,1 = 1] + 1s5,2 = [s4,1 = 1] + 1s7,2 = [s6,1 = 1] + 1

Page 58: 2. Comparing biological sequences :         sequence alignment

58

An Introduction to Bioinformatics

Backtracking Example

Continuing with the dynamic programming algorithm gives this result.

Page 59: 2. Comparing biological sequences :         sequence alignment

59

An Introduction to Bioinformatics

Alignment: Dynamic Programming

si,j = si-1, j-1+1 if vi = wj

max si-1, j

si, j-1

Page 60: 2. Comparing biological sequences :         sequence alignment

60

An Introduction to Bioinformatics

Alignment: Dynamic Programming

si,j = si-1, j-1+1 if vi = wj

max si-1, j+0

si, j-1+0

This recurrence corresponds to the Manhattan Tourist problem (three incoming edges into a vertex) with all horizontal and vertical edges weighted by zero.

Page 61: 2. Comparing biological sequences :         sequence alignment

61

An Introduction to Bioinformatics

LCS Algorithm1. LCS(v,w)

2. for i 1 to n

3. si,0 0

4. for j 1 to m

5. s0,j 0

6. for i 1 to n

7. for j 1 to m

8. si-1,j

9. si,j max si,j-1

10. si-1,j-1 + 1, if vi = wj

11. “ “ if si,j = si-1,j

• bi,j “ “ if si,j = si,j-1

• “ “ if si,j = si-1,j-1 + 1

• return (sn,m, b)

Page 62: 2. Comparing biological sequences :         sequence alignment

62

An Introduction to Bioinformatics

Now What?

• LCS(v,w) created the alignment grid

• Now we need a way to read the best alignment of v and w

• Follow the arrows backwards from sink

Page 63: 2. Comparing biological sequences :         sequence alignment

63

An Introduction to Bioinformatics

Printing LCS: Backtracking

1. PrintLCS(b,v,i,j)2. if i = 0 or j = 03. return

4. if bi,j = “ “5. PrintLCS(b,v,i-1,j-1)

6. print vi

7. else

8. if bi,j = “ “9. PrintLCS(b,v,i-1,j)10. else11. PrintLCS(b,v,i,j-1)

Page 64: 2. Comparing biological sequences :         sequence alignment

64

An Introduction to Bioinformatics

LCS Runtime

• It takes O(nm) time to fill in the nxm dynamic programming matrix.

• Why O(nm)? The pseudocode consists of a nested “for” loop inside of another “for” loop to set up a nxm matrix.