Using Traveling Salesman Problem Algorithms to Determine Multiple Sequence Alignment Orders

Weiwei Zhong

Topics

• Background

• Algorithm Design

• Test Results

Background

Definitions

What is a Sequence Alignment?

Given
• 2 or more sequences
• a scoring scheme (match score, mismatch score, gap penalty)

Insert gaps in each sequence, so that
• all sequences have the same length
• the pairing score is maximal

Scoring Matrix

Simplified Scoring
• match = 2
• mismatch = -1
• gap penalty = -2

In Practice
• a scoring matrix
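As a small illustration of the simplified scoring scheme, the sketch below scores an already gapped pair of sequences column by column. The function name `alignment_score` and the use of `-` as the gap character are assumptions made for this example, not part of the slides.

```python
def alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Score two gapped sequences of equal length, column by column."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot consist of gaps only")
        if x == "-" or y == "-":
            total += gap        # one residue paired with a gap
        elif x == y:
            total += match      # identical residues
        else:
            total += mismatch   # different residues
    return total


# Six matching columns and one gap column.
print(alignment_score("FGK-GKG", "FGKFGKG"))
```

With the values above, every matching column adds 2 and every gap column subtracts 2, so moving a gap changes the total only when it changes which residues are paired.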

Global vs. Local Alignments

Global: entire lengths of sequences

F G K - G K G
F G K F G K G

Local: regions of sequences

- - - F G K G K G
F G K F G K G - -

Pairwise Alignment vs. Multiple Sequence Alignment (MSA)

Pairwise: 2 sequences

F G K G K G
F G K F G K G

MSA: more than 2 sequences

F G K G K G
F G K F G K G
- G K Q G K G
- - K F G K G

Background

Basic Dynamic Programming

Dynamic Programming Algorithm for Pairwise Alignments

Two sequences
• GAATTC
• GGATC

Scoring scheme
• match = 2
• mismatch = -1
• gap penalty g = -2

1. Initialization

         G   A   A   T   T   C
     0   0   0   0   0   0   0
 G   0
 G   0
 A   0
 T   0
 C   0

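2. Table fill

Each cell is computed from its upper-left, upper and left neighbors, where ci and cj are the characters heading the cell's row and column:

$M_{i,j} = \max \{\, M_{i-1,j-1} + S(c_i, c_j),\ M_{i,j-1} + g,\ M_{i-1,j} + g \,\}$

         G   A   A   T   T   C
     0   0   0   0   0   0   0
 G   0   2   0  -1  -1  -1  -1
 G   0   2   1  -1  -2  -2  -2
 A   0   0   4   3   1  -1  -3
 T   0  -1   2   3   5   3   1
 C   0  -1   0   1   3   4   5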

3. Trace back

G A A T T C
|   |   | |
G G A - T C
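The three steps (initialization, table fill, trace back) can be sketched in a few lines of Python. This is a minimal sketch under assumptions: the function name `align_pair` is hypothetical, the first row and column are set to 0 as on the initialization slide, and ties during trace back are resolved in favor of the diagonal move.

```python
def align_pair(a, b, match=2, mismatch=-1, gap=-2):
    """Fill the table for sequences a (columns) and b (rows), then trace back."""
    rows, cols = len(b) + 1, len(a) + 1
    # 1. Initialization: first row and first column stay 0.
    m = [[0] * cols for _ in range(rows)]
    # 2. Table fill.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if b[i - 1] == a[j - 1] else mismatch
            m[i][j] = max(m[i - 1][j - 1] + s,  # pair the two characters
                          m[i][j - 1] + gap,    # gap in b
                          m[i - 1][j] + gap)    # gap in a
    # 3. Trace back from the bottom-right cell.
    i, j = rows - 1, cols - 1
    top, bottom = [], []
    while i > 0 and j > 0:
        s = match if b[i - 1] == a[j - 1] else mismatch
        if m[i][j] == m[i - 1][j - 1] + s:
            top.append(a[j - 1]); bottom.append(b[i - 1]); i -= 1; j -= 1
        elif m[i][j] == m[i][j - 1] + gap:
            top.append(a[j - 1]); bottom.append("-"); j -= 1
        else:
            top.append("-"); bottom.append(b[i - 1]); i -= 1
    # Leftover characters at the border are paired with gaps.
    while j > 0:
        top.append(a[j - 1]); bottom.append("-"); j -= 1
    while i > 0:
        top.append("-"); bottom.append(b[i - 1]); i -= 1
    return m, "".join(reversed(top)), "".join(reversed(bottom))


table, top, bottom = align_pair("GAATTC", "GGATC")
print(top)
print(bottom)
```

For GAATTC and GGATC the bottom-right cell holds the alignment score, and the trace back yields the alignment shown in step 3.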

Multidimensional Dynamic Programming for MSA

• For n strings of length L each, the running time is $O(L^n)$.

• Impractical: 5-7 proteins of 200-300 residues each.

Topics

• Background

• Algorithm Design

• Test Results

An MSA Heuristic

Algorithm Design

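Feng-Doolittle Progressive Alignment

1. Align 2 of the sequences, Si and Sj

2. Align a 3rd sequence Sk to the alignment of Si and Sj

3. Repeat step 2 until all sequences are aligned

Running time: $O(nL^2)$

Scoring a column of the existing alignment against a new residue: if column ci holds T and A, and cj is S, the score is the average of the pairwise scores.

S(ci, cj) = (S(T, S) + S(A, S)) / 2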
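The averaging rule S(ci, cj) = (S(T, S) + S(A, S)) / 2 generalizes to columns of any height. The sketch below is an illustration only: `pair_score` uses the simplified match/mismatch values from the Background section rather than a real scoring matrix, and it assumes the column contains residues only (no gaps).

```python
def pair_score(a, b, match=2, mismatch=-1):
    """Simplified residue-to-residue score (stand-in for a scoring matrix)."""
    return match if a == b else mismatch


def column_score(column, residue, score=pair_score):
    """Average the pairwise scores between `residue` and every residue in `column`."""
    return sum(score(c, residue) for c in column) / len(column)


# Column holding T and A, scored against S and against T.
print(column_score("TA", "S"))
print(column_score("TA", "T"))
```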

Features of Feng-Doolittle Algorithm

• Once a gap, always a gap

• Early mistakes cannot be corrected

x and y aligned first:

x: G A A G T T
y: G A C - T T

Alignment with z included:

x: G A A G T T
y: G A - C T T
z: G A A C T G

Alignment order is important

TspMsa: First Version

Algorithm Design

Traveling Salesman Problem (TSP)

Given
• n nodes
• distances for each pair of nodes

Find a round trip that
• visits each node exactly once
• has minimal total length

• NP-complete
• Well studied

TspMsa: Algorithm Design

calculate pairwise distances → determine a TSP tour → Feng-Doolittle alignment

Pairwise distances

      0   1   2   3   4
 0    0   1  15  51  61
 1    1   0  14  24  58
 2   15  14   0  46  67
 3   51  24  46   0  38
 4   61  58  67  38   0

[Figure: the five nodes 0-4, the TSP tour through them, and the alignment order read off the tour.]
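For a distance matrix as small as the 5 x 5 example, the shortest round trip can be found by trying every permutation. The sketch below does exactly that; it is an illustration only, since the slides do not say which TSP solver TspMsa uses, and brute force is not feasible for realistic numbers of sequences. The names `DIST`, `tour_length` and `shortest_tour` are chosen for this example.

```python
from itertools import permutations

# Pairwise distances from the 5-node example.
DIST = [
    [0, 1, 15, 51, 61],
    [1, 0, 14, 24, 58],
    [15, 14, 0, 46, 67],
    [51, 24, 46, 0, 38],
    [61, 58, 67, 38, 0],
]


def tour_length(tour, dist):
    """Total length of the round trip, including the edge back to the start."""
    return sum(dist[tour[k]][tour[(k + 1) % len(tour)]] for k in range(len(tour)))


def shortest_tour(dist):
    """Try every round trip that starts at node 0 and keep the shortest."""
    candidates = ((0,) + p for p in permutations(range(1, len(dist))))
    return min(candidates, key=lambda t: tour_length(t, dist))


best = shortest_tour(DIST)
print(best, tour_length(best, DIST))
```

The tour found uses the edges of length 1, 24, 38, 67 and 15, for a total of 145.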

Starting Point and Direction of TSP Tour

[Figure: TSP tour over the 23 sequences (nodes 0-22) of data set kinase_ref3, labeled with the edge lengths and with the alignment score obtained for each starting point and direction.]

TspMsa: Modified Design

Algorithm Design

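TspMsa: Modified Algorithm Design

calculate pairwise distances → determine a TSP tour → align closest nodes → one node left? (yes: end; no: align closest nodes again)

[Figure: worked example on the 5-node tour with edge lengths 1, 15, 24, 38 and 67. The closest neighboring nodes are aligned and merged step by step, starting with nodes 1 and 0, until the single node 3, 1, 0, 2, 4 is left.]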
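The loop "align closest nodes until one node is left" can be sketched as repeated merging of the two tour neighbors joined by the shortest remaining edge. This is a sketch under assumptions: the slides do not state how distances between merged groups are defined, so the original tour edge lengths are kept unchanged here, the order of sequences inside a merged group simply follows the tour, and `merge_along_tour` is a hypothetical name. The actual alignment step is omitted; only the merge order is computed.

```python
DIST = [
    [0, 1, 15, 51, 61],
    [1, 0, 14, 24, 58],
    [15, 14, 0, 46, 67],
    [51, 24, 46, 0, 38],
    [61, 58, 67, 38, 0],
]


def merge_along_tour(tour, dist):
    """Merge tour neighbors, shortest remaining tour edge first."""
    groups = [[node] for node in tour]
    # edges[k] joins groups[k] and groups[(k + 1) % len(groups)].
    edges = [dist[tour[k]][tour[(k + 1) % len(tour)]] for k in range(len(tour))]
    steps = []
    while len(groups) > 1:
        k = min(range(len(edges)), key=edges.__getitem__)
        nxt = (k + 1) % len(groups)
        left, right = groups[k], groups[nxt]
        steps.append((edges[k], left, right))
        merged = left + right
        if nxt == 0:
            # The edge wraps around the end of the list.
            groups[0] = merged
            del groups[k]
        else:
            groups[k] = merged
            del groups[nxt]
        del edges[k]
    return steps, groups[0]


steps, final_group = merge_along_tour([0, 1, 3, 4, 2], DIST)
for length, left, right in steps:
    print(length, left, right)
```

On the 5-node example the merges use the edges of length 1, 15, 24 and 38, in that order, so the longest tour edge (67) is never used.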

Modified Algorithm is Better

Kinase_ref3

Original TspMsa: 0.603 (worst) - 0.772 (best)

Modified TspMsa: 0.836

Alignment order for Kinase_ref3: 5 6 7 8 10 9 0 1 4 2 3 18 17 14 15 16 11 12 13 22 21 20 19

Topics

• Background

• Algorithm Design

• Test Results

Test Results

What to Compare With?

Existing MSA Programs

Progressive (less computation time)
• multal
• multalign
• pileup
• clustalw
• poa

Iterative (better quality)
• prrp
• saga
• hmmt

Further labels in the figure: best quality, fast

CLUSTALW

1. Calculate pairwise distances

2. Derive a guide tree by the Neighbor Joining method
• choose 2 closest nodes i and j, derive an internal node x
• repeat until one node is left at the center

$r_i = \frac{\sum_k d_{ik}}{n-2}$

$d_{ix} = \frac{d_{ij} + r_i - r_j}{2}$

$d_{jx} = d_{ij} - d_{ix}$

$d_{xm} = \frac{d_{im} + d_{jm} - d_{ij}}{2}$

[Figure: a star tree over nodes 1-9 in which pairs of nodes are joined step by step through internal nodes.]
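One joining step of the Neighbor Joining formulas can be written out directly. The sketch below takes the pair (i, j) as given and only computes the branch lengths to the new internal node x and the distances from x to every remaining node m; how the pair is selected is left out. The function name `join_pair` and the test matrix are assumptions for this example.

```python
def join_pair(d, i, j):
    """Join nodes i and j of distance matrix d into a new internal node x.

    Returns (d_ix, d_jx, d_xm), where d_xm maps every remaining node m
    to its distance from x.
    """
    n = len(d)
    r_i = sum(d[i]) / (n - 2)
    r_j = sum(d[j]) / (n - 2)
    d_ix = (d[i][j] + r_i - r_j) / 2
    d_jx = d[i][j] - d_ix
    d_xm = {m: (d[i][m] + d[j][m] - d[i][j]) / 2
            for m in range(n) if m not in (i, j)}
    return d_ix, d_jx, d_xm


# Distances read off a small tree: leaves 0 and 1 hang off x with branch
# lengths 2 and 3, leaves 2 and 3 hang off y with branch lengths 4 and 5,
# and x and y are 1 apart.
D = [
    [0, 5, 7, 8],
    [5, 0, 8, 9],
    [7, 8, 0, 9],
    [8, 9, 9, 0],
]
print(join_pair(D, 0, 1))
```

Because the test distances come from a tree, the step recovers the branch lengths 2 and 3 exactly, and the new distances from x equal the path lengths to leaves 2 and 3.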

CLUSTALW

3. Progressively align all sequences following the guide tree

• Weighted sequences

• 2 gap penalty values: opening, extension

• Dynamically changes the gap penalty and the scoring matrix

Weighted sequences

1 p e e k s a v t a l
2 g e e k a a v l a l
3 e g e w q l v l h v

Without weights: Score = [S(t,v) + S(l,v)] / 2

With weights: Score = [S(t,v)*w1*w3 + S(l,v)*w2*w3] / 2
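The weighted score follows the same pattern as the unweighted average, with each pairwise score multiplied by the weights of the two sequences involved. The sketch below is illustrative: the weight values are made up, `simple_score` stands in for a real scoring matrix, and `weighted_column_score` is a hypothetical name.

```python
def simple_score(a, b, match=2, mismatch=-1):
    """Stand-in for a residue scoring matrix."""
    return match if a == b else mismatch


def weighted_column_score(column, weights, residue, residue_weight,
                          score=simple_score):
    """Average of S(c, residue) * w_c * w_residue over the residues c in the column."""
    total = sum(score(c, residue) * w * residue_weight
                for c, w in zip(column, weights))
    return total / len(column)


# Column with t (sequence 1) and l (sequence 2) against v (sequence 3).
w1, w2, w3 = 0.5, 1.0, 2.0  # made-up weights
print(weighted_column_score("tl", [w1, w2], "v", w3))
```

Setting all weights to 1 gives back the unweighted score.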

POA

1. Convert sequences to partial order graphs

2. Align 2 sequences

3. Align one sequence to the current group

4. Repeat 3 until all sequences are aligned

E T - - P K M I V R
E T T H - K M L V R

[Figure: partial order graphs for the sequence E T N K (a linear chain) and for the alignment above, where the paths branch at P versus T H and at I versus L.]

Test Results

Quality Evaluation

BAliBASE Benchmark

• Reference 1: equidistant sequences with various levels of similarity.
  • < 25% sequence identity
  • 20-40% sequence identity
  • > 35% sequence identity

• Reference 2: closely related sequences with a highly divergent “orphan” sequence.

• Reference 3: subgroups with <25% identity between groups.

• Reference 4: sequences with N/C-terminal extensions.

• Reference 5: sequences with internal insertions.

Reference 1 Sequences with < 25% Identity

[Figure: All Test Scores. Alignment scores (0 to 0.8) of TspMsa, CLUSTALW and POA for test cases 1-21, grouped into short, medium and long sequences.]

[Figure: Average Score. Average alignment score of CLUSTALW, TspMsa and POA on ref.1 (<25%).]

Reference 1 Sequences with 20-40% Identity

[Figure: alignment scores (0 to 1) of TspMsa, CLUSTALW and POA per test case, grouped into short, medium and long sequences.]

[Figure: average alignment score of CLUSTALW, TspMsa and POA on ref.1 (20-40%).]

Reference 1 Sequences with >35% Identity

[Figure: alignment scores (0 to 1) of TspMsa, CLUSTALW and POA per test case, grouped into short, medium and long sequences.]

[Figure: average alignment score of CLUSTALW, TspMsa and POA on ref.1 (>35%).]

Reference 2

[Figure: alignment scores (0 to 1) of TspMsa, CLUSTALW and POA for test cases 1-23, grouped into short, medium and long sequences.]

[Figure: average alignment score of CLUSTALW, TspMsa and POA on ref.2.]

Reference 3

[Figure: alignment scores (0.1 to 1) of TspMsa, CLUSTALW and POA for test cases 1-12, grouped into short, medium and long sequences.]

[Figure: average alignment score of CLUSTALW, TspMsa and POA on ref.3.]

Reference 4 and Reference 5

[Figure: alignment scores (0 to 1) of TspMsa, CLUSTALW and POA per test case, grouped into Reference 4 and Reference 5.]

[Figure: average alignment score of CLUSTALW, TspMsa and POA on ref.4 and ref.5.]

Alignment Quality Comparison

Reference 1:
• <25% identity: Similar *
• 20-40% identity: Similar *
• >35% identity: Similar

Reference 2: Similar *

Reference 3: TspMsa better

Reference 4: CLUSTALW better

Reference 5: Similar

* CLUSTALW slightly better for short sequences.

TspMsa and POA: TspMsa better

TspMsa and CLUSTALW: comparable

Test Results

Execution Time Evaluation

Fast Mode TspMsa

Most time consuming step: pairwise distance calculations

• Slow mode: full dynamic programming (accurate)

• Fast mode: a fast approximate method (heuristic)
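The slides do not say which approximate method the fast mode uses. As one possible illustration of replacing full dynamic programming with a cheaper estimate, the sketch below computes a distance from shared k-mers (short substrings of length k). The function name `kmer_distance` and the choice k = 2 are assumptions for this example, not TspMsa's actual method.

```python
from collections import Counter


def kmer_distance(a, b, k=2):
    """Distance in [0, 1]: 1 minus the fraction of shared k-mers.

    The fraction is taken relative to the sequence with fewer k-mers.
    """
    kmers_a = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    kmers_b = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    shared = sum((kmers_a & kmers_b).values())  # multiset intersection
    smaller = min(sum(kmers_a.values()), sum(kmers_b.values()))
    if smaller == 0:
        return 1.0  # a sequence shorter than k shares nothing
    return 1.0 - shared / smaller


print(kmer_distance("GAATTC", "GGATC"))
```

Counting k-mers takes time linear in the sequence lengths, whereas filling a dynamic programming table takes time proportional to the product of the lengths, which is the kind of trade-off a fast mode makes.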

Quality Impact of the Fast Mode

[Figure: average alignment scores (0 to 1) on ref.1 (<25%), ref.1 (20-40%), ref.1 (>35%), ref.2, ref.3, ref.4 and ref.5 for CLUSTALW, TspMsa, POA, CLUSTALW-fast and TspMsa-fast.]

Execution Time Evaluation

[Figure: execution time in minutes (0 to 200) against the number of sequences (100, 200, 500, 1000) for CLUSTALW, TspMsa and POA, with CLUSTALW and TspMsa in fast mode.]

Conclusions

QUALITY

Slow mode
• close to CLUSTALW (slow mode)
• better than POA

Fast mode (not as good as slow mode)
• comparable to CLUSTALW (fast mode)
• better than POA

SPEED

Fast mode
• faster than CLUSTALW (fast mode)
• comparable to POA

Acknowledgement

Dr. Robert Robinson

Dr. Russell Malmberg

Dr. Eileen Kraemer

Computer Science Department
