Using Traveling Salesman Problem Algorithms to Determine Multiple Sequence Alignment Orders Weiwei Zhong

Post on 12-Jan-2016

Page 1: Using Traveling Salesman Problem Algorithms to Determine Multiple Sequence Alignment Orders

Using Traveling Salesman Problem Algorithms to Determine

Multiple Sequence Alignment Orders

Weiwei Zhong


Topics

• Background

• Algorithm Design

• Test Results


Background

Definitions

What is a Sequence Alignment?

Given
• 2 or more sequences
• a scoring scheme (match score, mismatch score, gap penalty)

Insert gaps in each sequence, so that
• all sequences have the same length
• the pairing score is maximal

Scoring Matrix

Simplified Scoring
• match = 2
• mismatch = -1
• gap penalty = -2

In Practice
• a full scoring matrix (one score per residue pair)
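As a rough illustration of the simplified scoring scheme, the sketch below sums the column scores of two already-gapped rows. The function name `alignment_score` and the convention that any column containing a gap costs one gap penalty are assumptions made for this example; they are not taken from the slides.

```python
MATCH, MISMATCH, GAP = 2, -1, -2  # simplified scoring scheme from the slide


def alignment_score(row_a: str, row_b: str) -> int:
    """Sum the column scores of two gapped rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += GAP        # assumption: any column with a gap costs one penalty
        elif a == b:
            total += MATCH
        else:
            total += MISMATCH
    return total


# 4 matches, 1 mismatch, 1 gap: 4*2 - 1 - 2
print(alignment_score("GAATTC", "GGA-TC"))  # 5
```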

Global vs. Local Alignments

Global: entire lengths of sequences

F G K - G K G
F G K F G K G

Local: regions of sequences

- - - F G K G K G
F G K F G K G - -

Pairwise Alignment vs. Multiple Sequence Alignment (MSA)

Pairwise: 2 sequences

F G K - G K G
F G K F G K G

MSA: more than 2 sequences

F G K - G K G
F G K F G K G
- G K Q G K G
- - K F G K G


Background

Basic Dynamic Programming

Dynamic Programming Algorithm for Pairwise Alignments

Two sequences
• GAATTC
• GGATC

Scoring scheme
• match = 2
• mismatch = -1
• gap penalty = -2

1. Initialization

        G   A   A   T   T   C
    0   0   0   0   0   0   0
G   0
G   0
A   0
T   0
C   0

Scoring scheme
• match = 2
• mismatch = -1
• gap g = -2

2. Table fill

Each cell M_{i,j} is computed from its three neighbors M_{i-1,j-1}, M_{i-1,j} and M_{i,j-1}, where c_i and c_j are the characters labeling row i and column j:

$$M_{i,j} = \max \{\, M_{i-1,j-1} + S(c_i, c_j),\; M_{i,j-1} + g,\; M_{i-1,j} + g \,\}$$

        G   A   A   T   T   C
    0   0   0   0   0   0   0
G   0   2   0  -1  -1  -1  -1
G   0   2   1  -1  -2  -2  -2
A   0   0   4   3   1  -1  -3
T   0  -1   2   3   5   3   1
C   0  -1   0   1   3   4   5
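The table-fill rule can be sketched in a few lines of Python. This is a minimal sketch, assuming the first row and first column stay at 0 as in the initialization slide; the function name `fill_table` and its parameter names are illustrative only.

```python
def fill_table(top: str, side: str, match=2, mismatch=-1, gap=-2):
    """Fill the DP table; row 0 and column 0 stay 0, as on the slide."""
    rows, cols = len(side) + 1, len(top) + 1
    M = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if side[i - 1] == top[j - 1] else mismatch
            M[i][j] = max(
                M[i - 1][j - 1] + s,  # align the two characters
                M[i][j - 1] + gap,    # gap in the side sequence
                M[i - 1][j] + gap,    # gap in the top sequence
            )
    return M


table = fill_table("GAATTC", "GGATC")
for label, row in zip(" GGATC", table):
    print(label, row)
```

For the two example sequences, the bottom-right cell of the returned table holds 5.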

3. Trace back

G A A T T C
|   |   | |
G G A - T C

[Figure: the filled table with the trace-back path marked.]
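A trace back can be sketched as a walk from the bottom-right cell toward the top-left, at each step picking a neighbor that explains the current cell's value. The sketch below is an assumption-laden illustration: the name `align`, the tie-breaking rule (diagonal first, then left, then up) and the padding of any leftover prefix with gaps are choices made here, not details given on the slide.

```python
def align(top: str, side: str, match=2, mismatch=-1, gap=-2):
    # Fill the table (first row and column are 0, as in the slides).
    M = [[0] * (len(top) + 1) for _ in range(len(side) + 1)]
    for i in range(1, len(side) + 1):
        for j in range(1, len(top) + 1):
            s = match if side[i - 1] == top[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s, M[i][j - 1] + gap, M[i - 1][j] + gap)

    # Walk back from the bottom-right cell, preferring the diagonal on ties.
    i, j = len(side), len(top)
    out_top, out_side = [], []
    while i > 0 and j > 0:
        s = match if side[i - 1] == top[j - 1] else mismatch
        if M[i][j] == M[i - 1][j - 1] + s:
            out_top.append(top[j - 1])
            out_side.append(side[i - 1])
            i, j = i - 1, j - 1
        elif M[i][j] == M[i][j - 1] + gap:
            out_top.append(top[j - 1])
            out_side.append("-")
            j -= 1
        else:
            out_top.append("-")
            out_side.append(side[i - 1])
            i -= 1
    # Any leftover prefix is padded with gaps (assumption of this sketch).
    while j > 0:
        out_top.append(top[j - 1])
        out_side.append("-")
        j -= 1
    while i > 0:
        out_top.append("-")
        out_side.append(side[i - 1])
        i -= 1
    return "".join(reversed(out_top)), "".join(reversed(out_side)), M[-1][-1]


print(align("GAATTC", "GGATC"))  # ('GAATTC', 'GGA-TC', 5)
```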

Multidimensional Dynamic Programming for MSA

• For n strings of length L each, the running time is $O(L^n)$.

• Impractical: 5-7 proteins of 200-300 residues each.
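To get a feel for the $O(L^n)$ growth, the short sketch below counts table cells only, approximating the table size as L^n and ignoring the work done per cell. The helper name `table_cells` and the chosen (n, L) pairs are illustrative assumptions.

```python
def table_cells(n_sequences: int, length: int) -> int:
    """Approximate cell count of an n-dimensional DP table (length ** n)."""
    return length ** n_sequences


for n, L in [(2, 200), (5, 200), (7, 300)]:
    print(f"n={n}, L={L}: {table_cells(n, L):.3e} cells")
```

Two sequences of 200 residues need 40,000 cells, while five such sequences already need 3.2 × 10^11.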


Topics

• Background

• Algorithm Design

• Test Results


An MSA Heuristic

Algorithm Design

Feng-Doolittle Progressive Alignment

1. Align 2 of the sequences Si, Sj

2. Align a 3rd sequence Sk to the alignment Si, Sj

3. Repeat 2 until all sequences are aligned

Running time: $O(nL^2)$

A residue is scored against a column of the existing alignment by averaging the pairwise scores. For a column containing T and A aligned against a residue S:

$$S(c_i, c_j) = \frac{S(T, S) + S(A, S)}{2}$$

Features of Feng-Doolittle Algorithm

• Once a gap, always a gap

• Early mistakes cannot be corrected

Alignment order is important

x and y aligned first, with z still to be added:

x: G A A G T T
y: G A C - T T

z: G A A C T G

Alignment of all three in which the C of y lines up with the C of z:

x: G A A G T T
y: G A - C T T
z: G A A C T G


TspMsa: First Version

Algorithm Design

Traveling Salesman Problem (TSP)

Given
• n nodes
• distances for each pair of nodes

Find a round trip that
• visits each node exactly once
• has minimal total length

• NP-complete in its decision version; finding the shortest tour is NP-hard
• Well studied

TspMsa: Algorithm Design

1. Calculate pairwise distances
2. Determine a TSP tour
3. Feng-Doolittle alignment, with the alignment order taken from the tour

Pairwise distances for 5 sequences (0-4):

       0    1    2    3    4
  0    0    1   15   51   61
  1    1    0   14   24   58
  2   15   14    0   46   67
  3   51   24   46    0   38
  4   61   58   67   38    0

[Figure: the TSP tour through nodes 0-4 and the alignment order read off the tour.]
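For a matrix this small, a shortest tour can be found by simply trying every ordering. The sketch below does exactly that on the 5 x 5 example matrix; it is a brute-force illustration only, and nothing here says which TSP solver TspMsa itself uses. The names `tour_length` and `shortest_tour` are hypothetical.

```python
from itertools import permutations

# Pairwise distance matrix from the slide (5 sequences, numbered 0-4).
D = [
    [0, 1, 15, 51, 61],
    [1, 0, 14, 24, 58],
    [15, 14, 0, 46, 67],
    [51, 24, 46, 0, 38],
    [61, 58, 67, 38, 0],
]


def tour_length(tour, dist):
    """Length of the closed tour, including the edge back to the start."""
    return sum(dist[a][b] for a, b in zip(tour, tour[1:] + tour[:1]))


def shortest_tour(dist):
    """Try every tour that starts at node 0 (only viable for a few nodes)."""
    rest = range(1, len(dist))
    return min(((0,) + p for p in permutations(rest)),
               key=lambda t: tour_length(t, dist))


best = shortest_tour(D)
print(best, tour_length(best, D))  # (0, 1, 3, 4, 2) 145
```

The same tour traversed in the opposite direction has the same length; `min` simply returns the first one it encounters. Exhaustive search grows factorially with the number of nodes, so anything beyond a handful of sequences would need a heuristic or a dedicated solver.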

Starting Point and Direction of TSP Tour

[Figure: TSP tour through the 23 sequences (0-22) of data set kinase_ref3, labeled with edge lengths, and the alignment score obtained for each choice of starting point and direction along the tour; scores range from 0.603 to 0.772.]


TspMsa: Modified Design

Algorithm Design

TspMsa: Modified Algorithm Design

1. Calculate pairwise distances
2. Determine a TSP tour
3. Align closest nodes
4. One node left? If yes, end; if no, repeat step 3

[Figure: worked example on the tour through nodes 0-4, whose edge lengths are 1, 15, 24, 38 and 67. Node groups shown after each step: 0 | 1 | 2 | 3 | 4, then 1, 0 | 2 | 3 | 4, then 1, 0 | 2, 4 | 3, then 3, 1, 0 | 2, 4, and finally 3, 1, 0, 2, 4.]

Modified Algorithm is Better

• Original TspMsa: 0.603 (worst) to 0.772 (best)

• Modified TspMsa: 0.836

Alignment order for Kinase_ref3:

5 6 7 8 10 9 0 1 4 2 3 18 17 14 15 16 11 12 13 22 21 20 19


Topics

• Background

• Algorithm Design

• Test Results


Test Results

What to Compare With?

Existing MSA Programs

Progressive (less computation time)
• multal
• multalign
• pileup
• clustalw
• poa

Iterative (better quality)
• prrp
• saga
• hmmt

Callouts on the slide: "best quality" and "Fast"

CLUSTALW

1. Calculate pairwise distances
2. Derive a guide tree by the Neighbor Joining method
   • choose 2 closest nodes i and j, derive an internal node x
   • repeat until one node is left at the center

$$r_i = \frac{\sum_k d_{ik}}{n-2}, \qquad d_{ix} = \frac{d_{ij} + r_i - r_j}{2}, \qquad d_{jx} = d_{ij} - d_{ix}$$

$$d_{xm} = \frac{d_{im} + d_{jm} - d_{ij}}{2}$$

[Figure: nodes 1-9 joined step by step into the groups 1234, 789 and 56 around a central node.]
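The Neighbor Joining formulas can be checked on a tiny example. The sketch below applies them to one join step; the function name `nj_join` and the 4-leaf distance matrix are made up for illustration (the matrix comes from a tree with leaf branches 2, 3, 4, 5 and an internal edge of length 1), and the code assumes the pair (i, j) to join has already been chosen.

```python
def nj_join(d, i, j):
    """Apply the slide's formulas when nodes i and j are joined into x.

    Returns (d_ix, d_jx, d_xm) where d_xm maps every other node m to its
    distance from the new internal node x.
    """
    n = len(d)
    r = [sum(row) / (n - 2) for row in d]          # r_i = sum_k d_ik / (n - 2)
    d_ix = (d[i][j] + r[i] - r[j]) / 2             # branch length i -> x
    d_jx = d[i][j] - d_ix                          # branch length j -> x
    d_xm = {m: (d[i][m] + d[j][m] - d[i][j]) / 2   # distance from x to node m
            for m in range(n) if m not in (i, j)}
    return d_ix, d_jx, d_xm


# Hypothetical additive distances for leaves A, B, C, D (indices 0-3).
D4 = [
    [0, 5, 7, 8],
    [5, 0, 8, 9],
    [7, 8, 0, 9],
    [8, 9, 9, 0],
]
print(nj_join(D4, 0, 1))  # (2.0, 3.0, {2: 5.0, 3: 6.0})
```

The recovered branch lengths 2 and 3 match the tree the example distances were built from, which is the property the formulas are designed to have on additive distances.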

CLUSTALW

3. Progressively align all sequences following the guide tree

• 2 gap penalty values: opening, extension

• Dynamically changes the gap penalty and the scoring matrix

• Weighted sequences

1 p e e k s a v t a l
2 g e e k a a v l a l

3 e g e w q l v l h v

Without weights: Score = [S(t,v) + S(l,v)] / 2

With weights: Score = [S(t,v)*w1*w3 + S(l,v)*w2*w3] / 2
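The two score formulas differ only in the per-sequence weights, which the sketch below makes explicit. Everything concrete in it is an assumption for illustration: the function name `column_score`, the substitution scores in `sub` and the weight values are placeholders, not CLUSTALW's actual matrices or weights.

```python
def column_score(col_a, col_b, sub, weights=None):
    """Average pairwise score between two alignment columns.

    col_a, col_b: dicts mapping sequence id -> residue in that column.
    sub: substitution scores keyed by a (residue, residue) tuple.
    weights: optional dict of sequence id -> weight (all 1 if omitted).
    """
    total = 0.0
    for id_a, res_a in col_a.items():
        for id_b, res_b in col_b.items():
            w = 1.0 if weights is None else weights[id_a] * weights[id_b]
            total += sub[(res_a, res_b)] * w
    return total / (len(col_a) * len(col_b))


sub = {("t", "v"): 0, ("l", "v"): 1}   # placeholder substitution scores
col_12 = {1: "t", 2: "l"}              # column from the aligned sequences 1 and 2
col_3 = {3: "v"}                       # residue of sequence 3
w = {1: 0.5, 2: 0.25, 3: 1.0}          # hypothetical sequence weights

print(column_score(col_12, col_3, sub))      # 0.5
print(column_score(col_12, col_3, sub, w))   # 0.125
```

With the weights left out the function reduces to the plain average, so the same helper covers both lines of the slide.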

POA

1. Convert sequences to partial order graphs

E T - - P K M I V R
E T T H - K M L V R

[Figure: the sequence E T N K drawn as a linear graph, and the alignment above drawn as a partial order graph in which shared residues are single nodes and differing residues (P versus T H, I versus L) form alternative branches.]

POA

2. Align 2 sequences

3. Align one sequence to the current group

4. Repeat 3 until all sequences are aligned

[Figure: the sequence E T N K being aligned to the partial order graph of the current group (E T, then branches P and T H, then K).]


Test Results

Quality Evaluation

BAliBASE Benchmark

• Reference 1: equidistant sequences with various levels of similarity.
  • < 25% sequence identity
  • 20-40% sequence identity
  • > 35% sequence identity

• Reference 2: closely related sequences with a highly divergent “orphan” sequence.

• Reference 3: subgroups with <25% identity between groups.

• Reference 4: sequences with N/C-terminal extensions.

• Reference 5: sequences with internal insertions.

Reference 1 Sequences with < 25% Identity

[Figure, two panels. All Test Scores: alignment scores (axis 0 to 0.8) of TspMsa, CLUSTALW and POA on test cases 1-21, grouped into short, medium and long sequences. Average Score: average score of CLUSTALW, TspMsa and POA on ref.1 (<25%).]

Reference 1 Sequences with 20-40% Identity

[Figure, two panels. All Test Scores: alignment scores (axis 0 to 1) of TspMsa, CLUSTALW and POA per test case (axis ticks 1 to 29), grouped into short, medium and long sequences. Average Score: average score of CLUSTALW, TspMsa and POA on ref.1 (20-40%).]

Reference 1 Sequences with >35% Identity

[Figure, two panels. All Test Scores: alignment scores (axis 0 to 1) of TspMsa, CLUSTALW and POA per test case (axis ticks 1 to 27), grouped into short, medium and long sequences. Average Score: average score of CLUSTALW, TspMsa and POA on ref.1 (>35%).]

Reference 2

[Figure, two panels. All Test Scores: alignment scores (axis 0 to 1) of TspMsa, CLUSTALW and POA on test cases 1-23, grouped into short, medium and long sequences. Average Score: average score of CLUSTALW, TspMsa and POA on ref.2.]

Reference 3

[Figure, two panels. All Test Scores: alignment scores (axis 0.1 to 1) of TspMsa, CLUSTALW and POA on test cases 1-12, grouped into short, medium and long sequences. Average Score: average score of CLUSTALW, TspMsa and POA on ref.3.]

Reference 4 and Reference 5

[Figure, two panels. All Test Scores: alignment scores (axis 0 to 1) of TspMsa, CLUSTALW and POA per test case (axis ticks 1 to 21), grouped into Reference 4 and Reference 5. Average Score: average score of CLUSTALW, TspMsa and POA on ref.4 and ref.5.]

Alignment Quality Comparison

Reference 1:
• < 25% identity: Similar *
• 20-40% identity: Similar *
• > 35% identity: Similar

Reference 2: Similar *

Reference 3: TspMsa better

Reference 4: CLUSTALW better

Reference 5: Similar

* CLUSTALW slightly better for short sequences.

TspMsa and POA: TspMsa better

TspMsa and CLUSTALW: comparable


Test Results

Execution Time Evaluation

Fast Mode TspMsa

Most time-consuming step: pairwise distance calculations

• Slow mode: full dynamic programming (accurate)

• Fast mode: a fast approximate method (heuristic)

Quality Impact of the Fast Mode

[Figure: alignment scores (axis 0 to 1) of CLUSTALW, TspMsa, POA, CLUSTALW-fast and TspMsa-fast on ref.1 (<25%), ref.1 (20-40%), ref.1 (>35%), ref.2, ref.3, ref.4 and ref.5.]

Execution Time Evaluation

[Figure: execution time in minutes (axis 0 to 200) versus number of sequences (100, 200, 500, 1000) for CLUSTALW, TspMsa and POA, with CLUSTALW and TspMsa in fast mode.]

Conclusions

QUALITY

Slow mode
• close to CLUSTALW (slow mode)
• better than POA

Fast mode (not as good as slow mode)
• comparable to CLUSTALW (fast mode)
• better than POA

SPEED

Fast mode
• faster than CLUSTALW (fast mode)
• comparable to POA


Acknowledgement

Dr. Robert Robinson

Dr. Russell Malmberg

Dr. Eileen Kraemer

Computer Science Department