pairwise alignment ii - bgutabio172/wiki.files/... · 67 some results most pairwise sequence...

Pairwise alignment II

Post on 15-May-2020


Agenda

- Previous Lesson: Administration (Minhala) + Introduction

- Review Dynamic Programming

- Pairwise Alignment Biological Motivation

Today:

- Quick Review: Sequence Alignment (Global, Local, Variants).

- Heuristic Search


Ilan Smoly and Dan Gusfield, WABI 2015 in Atlanta, Georgia


Sequence Comparison (cont)

• We seek the following similarities between sequences:

  – Find similar proteins: allows us to predict function and structure.

  – Locate similar subsequences in DNA: allows us to identify, e.g., regulatory elements.

  – Locate DNA sequences that might overlap: helps in sequence assembly.


Comparison methods

• Global alignment – Finds the best alignment across the entire length of both sequences.

• Local alignment – Finds regions of similarity in parts of the sequences.

[Figure: schematic comparison of global and local alignment.]


Global Alignment

• Algorithm of Needleman and Wunsch (1970)

• Finds the alignment of two complete sequences:

ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ

• Some global alignment programs “trim ends”


Local Alignment

• Algorithm of Smith and Waterman (1981).

• Makes an optimal alignment of the best segment of similarity between two sequences.

ADLG    CDRYFQ
||||    |||| |
ADLG    CDRYYQ

• Can return a number of highly aligned segments.


Global Alignment: Algorithm

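$C(i,j)$ = cost of an optimum alignment of $S_{1..i}$ and $T_{1..j}$

$S_{1..i}$ = prefix of length $i$ of $S$

$T_{1..j}$ = prefix of length $j$ of $T$

$$w(a,b) = \begin{cases} \text{match score} & \text{if } a = b \\ \text{mismatch score} & \text{if } a \neq b \end{cases}$$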


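Theorem. $C(i,j)$ satisfies the following relationships:

Initial conditions: $C(i,0) = g \cdot i$, $C(0,j) = g \cdot j$

Recurrence relation: for $1 \le i \le n$, $1 \le j \le m$:

$$C(i,j) = \max \begin{cases} C(i-1,j-1) + w(S_i, T_j) \\ C(i-1,j) + g \\ C(i,j-1) + g \end{cases}$$

where $g$ is the (negative) score of aligning a character with a space.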


Computation Procedure

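$$C(i,j) = \max\{\, C(i-1,j-1) + w(S_i,T_j),\; C(i-1,j) + g,\; C(i,j-1) + g \,\}$$

where $g$ is the score of aligning a character with a space.

[Figure: the table is filled from $C(0,0)$ to $C(n,m)$; each cell $C(i,j)$ is computed from its three neighbours $C(i-1,j-1)$, $C(i-1,j)$ and $C(i,j-1)$.]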
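The computation procedure can be sketched in a few lines of Python. This is a minimal illustration, not the lecturers' code: the function name `nw_matrix` and its parameters are hypothetical, and the default scores (+10 match, -2 mismatch, -5 space) are taken from the worked example in these slides.

```python
def nw_matrix(s, t, match=10, mismatch=-2, gap=-5):
    """Fill the global-alignment table C, where C[i][j] scores s[:i] against t[:j]."""
    n, m = len(s), len(t)
    C = [[0] * (m + 1) for _ in range(n + 1)]
    # Initial conditions: a prefix aligned against nothing costs one space per character.
    for i in range(1, n + 1):
        C[i][0] = gap * i
    for j in range(1, m + 1):
        C[0][j] = gap * j
    # Recurrence: each cell depends on its diagonal, upper and left neighbours.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            w = match if s[i - 1] == t[j - 1] else mismatch
            C[i][j] = max(C[i - 1][j - 1] + w,   # align s[i] with t[j]
                          C[i - 1][j] + gap,     # align s[i] with a space
                          C[i][j - 1] + gap)     # align t[j] with a space
    return C


C = nw_matrix("CATTCAC", "CTCGCAGC")
print(C[-1][-1])  # score of the optimal global alignment
```

The row sequence `CATTCAC` is an assumption: it is the sequence that reproduces the table values of the worked example, whose bottom-right cell is 33.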


+10 for match, -2 for mismatch, -5 for space

             λ    C    T    C    G    C    A    G    C
    λ        0   -5  -10  -15  -20  -25  -30  -35  -40
    C       -5   10    5
    A      -10
    T      -15
    T      -20
    C      -25
    A      -30
    C      -35


             λ    C    T    C    G    C    A    G    C
    λ        0   -5  -10  -15  -20  -25  -30  -35  -40
    C       -5   10    5    0   -5  -10  -15  -20  -25
    A      -10    5    8    3   -2   -7    0   -5  -10
    T      -15    0   15   10    5    0   -5   -2   -7
    T      -20   -5   10   13    8    3   -2   -7   -4
    C      -25  -10    5   20   15   18   13    8    3
    A      -30  -15    0   15   18   13   28   23   18
    C      -35  -20   -5   10   13   28   23   26   33

Traceback can yield both optimum alignments.
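A traceback can be sketched as follows: starting from the bottom-right cell, follow every neighbour that could have produced the current value. The function name `nw_all_alignments` is hypothetical, and the row sequence `CATTCAC` is an assumption chosen because it reproduces the table values above.

```python
def nw_all_alignments(s, t, match=10, mismatch=-2, gap=-5):
    """Return the optimal global score and every optimal alignment of s and t."""
    n, m = len(s), len(t)
    C = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        C[i][0] = gap * i
    for j in range(1, m + 1):
        C[0][j] = gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            w = match if s[i - 1] == t[j - 1] else mismatch
            C[i][j] = max(C[i - 1][j - 1] + w, C[i - 1][j] + gap, C[i][j - 1] + gap)

    def walk(i, j):
        """Yield (aligned_s, aligned_t) for every optimal path from (0, 0) to (i, j)."""
        if i == 0 and j == 0:
            yield "", ""
            return
        if i > 0 and j > 0:
            w = match if s[i - 1] == t[j - 1] else mismatch
            if C[i][j] == C[i - 1][j - 1] + w:      # came from the diagonal
                for a, b in walk(i - 1, j - 1):
                    yield a + s[i - 1], b + t[j - 1]
        if i > 0 and C[i][j] == C[i - 1][j] + gap:  # came from above: space in t
            for a, b in walk(i - 1, j):
                yield a + s[i - 1], b + "-"
        if j > 0 and C[i][j] == C[i][j - 1] + gap:  # came from the left: space in s
            for a, b in walk(i, j - 1):
                yield a + "-", b + t[j - 1]

    return C[n][m], list(walk(n, m))


score, alignments = nw_all_alignments("CATTCAC", "CTCGCAGC")
for a, b in alignments:
    print(a)
    print(b)
    print()
```

Enumerating all paths can take exponential time when many alignments tie, so this sketch is only suitable for small examples like the one on the slide.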


Local Alignment: Smith-Waterman

• Best score for aligning part of the sequences

  – Often beats the global alignment score

ATTGCAGTG-TCGAGCGTCAGGCT

ATTGCGTCGATCGCAC-GCACGCT

Global Alignment

Local Alignment

CATATTGCAGTGGTCCCGCGTCAGGCT

TAAATTGCGT-GGTCGCACTGCACGCT


Local Alignment: Motivation

• Ignoring stretches of non-coding DNA:

  – Non-coding regions are more likely to be subjected to mutations than coding regions.

  – Local alignment between two sequences is likely to be between two exons.

• Locating protein domains:

  – Proteins of different kinds and of different species often exhibit local similarities.

  – Local similarities may indicate "functional subunits".


>Human DNA

CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA

>Mouse DNA

CATGCGTCTGACgctttttgctagcgatatcggactATCGATATA

Global vs. Local alignment

Alignment of two Genomic sequences


Global vs. Local alignment

Alignment of two genomic sequences

Global Alignment:

H:CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
Mouse:CATGCGTCTGACgct---ttttgctagcgatatcggactATCGAT-ATA
      ****** *****         *     ***      *  ****** ***

Local Alignment:

H:CATGCGACTGAC
Mouse:CATGCGTCTGAC

H:ATCGATCATA
Mouse:ATCGAT-ATA


>Human DNA

CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA

>Human mRNA

CATGCGACTGACATCGATCATA

Global vs. Local alignment

Alignment of DNA and mRNA


Global vs. Local alignment

Alignment of DNA and mRNA

Global Alignment:

DNA: CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
mRNA:CATGCGACTGAC---------------------------ATCGATCATA
     ************                           **********

Local Alignment:

DNA: CATGCGACTGAC
mRNA:CATGCGACTGAC

DNA: ATCGATCATA
mRNA:ATCGATCATA


Global vs. Local alignment

Sequences: DorothyHodgkin and DorothyCrowfootHodgkin

Global alignment:

DOROTHY--------HODGKIN
DOROTHYCROWFOOTHODGKIN

Local alignment:

DOROTHY        HODGKIN
DOROTHY        HODGKIN


Global vs. Local Alignment

Source: Jones and Pevzner


Local Alignment: Algorithm

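$C[i,j]$ = score of optimally aligning a suffix of $S_{1..i}$ with a suffix of $T_{1..j}$.

Initialize the top row and the leftmost column to zero.

$$C[i,j] = \max \begin{cases} C[i-1,j-1] + \mathrm{score}(s_i, t_j) \\ C[i-1,j] + g \\ C[i,j-1] + g \\ 0 \end{cases}$$

where $g$ is the (negative) score of aligning a character with a space.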
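The local-alignment recurrence differs from the global one only in the zero initialization and the extra 0 in the maximum. A minimal sketch, with a hypothetical function name `sw_score` and default scores assumed from the example that follows in these slides (+1 match, -1 mismatch, -5 space):

```python
def sw_score(s, t, match=1, mismatch=-1, gap=-5):
    """Return (best score, cell where it occurs, table) for local alignment of s and t."""
    n, m = len(s), len(t)
    # Top row and leftmost column stay at zero.
    C = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            w = match if s[i - 1] == t[j - 1] else mismatch
            C[i][j] = max(0,                      # start a fresh local alignment here
                          C[i - 1][j - 1] + w,
                          C[i - 1][j] + gap,
                          C[i][j - 1] + gap)
            if C[i][j] > best:                    # remember the first best cell
                best, best_pos = C[i][j], (i, j)
    return best, best_pos, C


best, pos, _ = sw_score("CATTCAC", "CTCGCAGC")
print(best, pos)
```

The best local alignment ends at the cell holding the maximum; tracing back from it until a 0 is reached recovers the aligned segments. The row sequence `CATTCAC` is again an assumption chosen to match the table values.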


+1 for a match, -1 for a mismatch, -5 for a space

           λ  C  T  C  G  C  A  G  C
    λ      0  0  0  0  0  0  0  0  0
    C      0  1  0  1  0  1  0  0  1
    A      0  0  0  0  0  0  2  0  0
    T      0  0  1  0  0  0  0  1  0
    T      0  0  1  0  0  0  0  0  0
    C      0  1  0  2  0  1  0  0  1
    A      0  0  0  0  1  0  2  0  0
    C      0  1  0  1  0  2  0  1  1


Reducing space requirements

• O(mn) tables are often the limiting factor in computing large alignments

• There is a linear space technique that only doubles the time required [Hirschberg, 1977]


IDEA: We only need the previous row to calculate the next.

             λ    C    T    C    G    C    A    G    C
    λ        0    0    0    0    0    0    0    0    0
    C        0   10    5   10    5   10    5    0   10
    A        0    5    8    5    8    5   20   15   10
    T
    T
    C
    A
    G
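The idea of keeping only the previous row can be sketched as below for the global-alignment score. The function name `nw_score_two_rows` is hypothetical and the default scores are assumed from the earlier worked example; the sketch returns only the score, since recovering the alignment itself in linear space needs the divide-and-conquer technique discussed next.

```python
def nw_score_two_rows(s, t, match=10, mismatch=-2, gap=-5):
    """Global alignment score using O(len(t)) memory instead of a full table."""
    prev = [gap * j for j in range(len(t) + 1)]   # row 0 of the table
    for i in range(1, len(s) + 1):
        curr = [gap * i]                          # column 0 of row i
        for j in range(1, len(t) + 1):
            w = match if s[i - 1] == t[j - 1] else mismatch
            curr.append(max(prev[j - 1] + w,      # diagonal neighbour
                            prev[j] + gap,        # neighbour above
                            curr[j - 1] + gap))   # neighbour to the left
        prev = curr                               # the finished row becomes "previous"
    return prev[-1]


print(nw_score_two_rows("CATTCAC", "CTCGCAGC"))
```

Passing the shorter sequence as `t` keeps the two rows as small as possible.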


Linear-space Alignments

mn + ½ mn + ¼ mn + 1/8 mn + 1/16 mn + … = 2 mn
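The total above is a geometric series. As a short worked step, assuming (as in Hirschberg's divide-and-conquer scheme) that each level of recursion works on subproblems whose total area is half that of the previous level:

```latex
\[
  mn + \tfrac{1}{2}mn + \tfrac{1}{4}mn + \cdots
  \;=\; mn \sum_{k=0}^{\infty} \left(\tfrac{1}{2}\right)^{k}
  \;=\; mn \cdot \frac{1}{1 - \tfrac{1}{2}}
  \;=\; 2mn .
\]
```

This is why the linear-space technique only doubles the running time.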


Some Results

Most pairwise sequence alignment problems can be solved in O(mn) time.

Space requirement can be reduced to O(m+n), while keeping the asymptotic run-time unchanged [Hirschberg, 1988].

Highly similar sequences can be aligned in O(dn) time, where d measures the distance between the sequences [Landau, 1986].

Time complexity of the fastest known sequence alignment algorithms?

O(n²/log n) [Crochemore, Landau, Ziv-Ukelson, 2003]

For discrete scoring schemes: [Masek and Paterson, 1980]


Sub Quadratic Sequence Alignment

How many points of interest? O(n²/t)

n/t rows with n vertices each

n/t columns with n vertices each

LZ-78 compression [Crochemore, Landau and Ziv-Ukelson, 2003]

Table lookup [Masek and Paterson, 1980]


Variants of Sequence Alignment

We have seen two basic variants of sequence alignment:

• Global alignment (Needleman-Wunsch)

• Local alignment (Smith-Waterman)

We will pose and solve two problems :

• Finding the best overlap alignment

• Using an affine cost for gaps


Overlap Alignment

Consider the following question: can we find the most significant overlap between two sequences s, t?

[Figure: the two possible overlap relations, (a) and (b).]

The difference between this problem and local alignment is that here we require the alignment to extend to the endpoints of the two sequences.


End-gap free alignment

• Gaps at the start or end of the alignment are not penalized

Match: +2; mismatch and space: -1

[Figure: best global alignment (score = 1) compared with best end-gap free alignment (score = 9).]


Motivation: Shotgun assembly

• Shotgun sequencing produces a large set of partially overlapping subsequences from many copies of one unknown DNA sequence.

• Problem: use the overlapping sections to "paste" the subsequences together.

• Overlapping pairs will have a low global alignment score but a high end-space free score, because of the overlap.

• HOW CAN THIS BE SOLVED?


Algorithm

• Same as global alignment, except:

  1. Initialize with zeros (free gaps at the start)

  2. Locate the maximum in the last row/column (free gaps at the end)
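The two changes relative to global alignment can be sketched as follows. The function name `overlap_score` is hypothetical, and the default scores (+10 match, -2 mismatch, -5 gap) are assumed from the example table in these slides.

```python
def overlap_score(s, t, match=10, mismatch=-2, gap=-5):
    """End-gap free alignment: return (best score, end cell, table)."""
    n, m = len(s), len(t)
    # Change 1: first row and column are zero, so leading gaps are free.
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            w = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + w, V[i - 1][j] + gap, V[i][j - 1] + gap)
    # Change 2: the answer is the maximum over the last row and the last column,
    # so trailing gaps are free as well.
    candidates = [(V[n][j], (n, j)) for j in range(m + 1)]
    candidates += [(V[i][m], (i, m)) for i in range(n + 1)]
    best, pos = max(candidates)
    return best, pos, V


best, pos, _ = overlap_score("CATTCAG", "CTCGCAGC")
print(best, pos)
```

The row sequence `CATTCAG` is an assumption: it is the sequence that reproduces the values of the example table, whose largest border value is 38.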


+10 for match, -2 for mismatch, -5 for gap

             λ    C    T    C    G    C    A    G    C
    λ        0    0    0    0    0    0    0    0    0
    C        0   10    5   10    5   10    5    0   10
    A        0    5    8    5    8    5   20   15   10
    T        0    0   15   10    5    6   15   18   13
    T        0   -2   10   13    8    3   10   13   16
    C        0   10    5   20   15   18   13    8   23
    A        0    5    8   15   18   13   28   23   18
    G        0    0    3   10   25   20   23   38   33


Overlap Alignment

• Initialization: V[i,0] = 0, V[0,j] = 0

• Recurrence: as in global alignment

$$V[i+1,j+1] = \max \begin{cases} V[i,j] + \sigma(s[i+1], t[j+1]) \\ V[i,j+1] + \sigma(s[i+1], -) \\ V[i+1,j] + \sigma(-, t[j+1]) \end{cases}$$

• A match starts at the top or left border of the matrix and finishes on the right or bottom border.

• Score: the maximum value in the bottom row and the rightmost column of the matrix


Overlap Alignment Example

H E A G A W G H E E

0 0 0 0 0 0 0 0 0 0 0

P 0

A 0

W 0

H 0

E 0

A 0

E 0

s = PAWHEAE

t = HEAGAWGHEE

Scoring system:

Match: +4

Mismatch: -1

Indel: -5


Overlap Alignment Example

H E A G A W G H E E

0 0 0 0 0 0 0 0 0 0 0

P 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1

A 0 -1

W 0 -1

H 0 4

E 0 1

A 0 -1

E 0 -1

s = PAWHEAE

t = HEAGAWGHEE

Scoring system:

Match: +4

Mismatch: -1

Indel: -5


Overlap Alignment Example

H E A G A W G H E E

0 0 0 0 0 0 0 0 0 0 0

P 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1

A 0 -1 -2 3 -2 3 -2 -2 -2 -2 -2

W 0 -1 -2 -2 2 -2 7 2 -3 -3 -3

H 0 4 -1 -3 -3 1 2 6 6 1 -4

E 0 -1 8 3 -2 -4 0 1 5 10 5

A 0 -1 3 12 7 2 -3 -1 0 5 9

E 0 -1 3 7 11 6 1 -4 -2 4 9

s = PAWHEAE

t = HEAGAWGHEE

Scoring system:

Match: +4

Mismatch: -1

Indel: -5


Overlap Alignment Example

The best overlap is:

PAWHEAE------

---HEAGAWGHEE

Remark:

A different scoring system could lead us to a different result, such as:

---PAW-HEAE

HEAGAWGHEE-
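The overlapping region of the first alignment can be recovered by tracing back from the best border cell until the top or left border is reached. A minimal sketch, with a hypothetical function name `overlap_align` and the scoring system of this example (match +4, mismatch -1, indel -5) as defaults:

```python
def overlap_align(s, t, match=4, mismatch=-1, indel=-5):
    """Return (score, aligned part of s, aligned part of t) for the best overlap."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]    # zero borders: free leading gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            w = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + w, V[i - 1][j] + indel, V[i][j - 1] + indel)
    # Best end point on the bottom row or the rightmost column.
    ends = [(n, j) for j in range(m + 1)] + [(i, m) for i in range(n + 1)]
    i, j = max(ends, key=lambda p: V[p[0]][p[1]])
    score = V[i][j]
    a, b = [], []
    # Trace back until the top or left border, where the overlap starts.
    while i > 0 and j > 0:
        w = match if s[i - 1] == t[j - 1] else mismatch
        if V[i][j] == V[i - 1][j - 1] + w:
            a.append(s[i - 1])
            b.append(t[j - 1])
            i, j = i - 1, j - 1
        elif V[i][j] == V[i - 1][j] + indel:
            a.append(s[i - 1])
            b.append("-")
            i -= 1
        else:
            a.append("-")
            b.append(t[j - 1])
            j -= 1
    return score, "".join(reversed(a)), "".join(reversed(b))


print(overlap_align("PAWHEAE", "HEAGAWGHEE"))
```

For s = PAWHEAE and t = HEAGAWGHEE this pairs the suffix HEAE of s with the prefix HEAG of t, the overlap shown in the first alignment above. Passing different `match`, `mismatch` and `indel` values is one way to explore the remark that another scoring system may prefer a different overlap.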