Using Traveling Salesman Problem Algorithms to Determine Multiple Sequence Alignment Orders

Weiwei Zhong

Topics

• Background

• Algorithm Design

• Test Results

Background

Definitions

What is a Sequence Alignment?

Given
• 2 or more sequences
• a scoring scheme (match score, mismatch score, gap penalty)

Insert gaps in each sequence, so that
• all sequences have the same length
• the pairing score is maximal

Scoring Matrix

Simplified scoring
• match = 2
• mismatch = -1
• gap penalty = -2

In practice: a scoring matrix
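A minimal sketch of how the simplified scoring above turns an already-gapped alignment into a number. The function names are illustrative only, and "-" is assumed to mark a gap; the example pair is the global alignment shown on the next slide.

```python
MATCH, MISMATCH, GAP = 2, -1, -2  # simplified scoring from the slide


def pair_score(a: str, b: str) -> int:
    """Score one alignment column; '-' marks a gap (assumed convention)."""
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH


def alignment_score(x: str, y: str) -> int:
    """Sum the column scores of two gapped sequences of equal length."""
    if len(x) != len(y):
        raise ValueError("aligned sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(x, y))


# Six matches and one gap: 6 * 2 - 2 = 10
print(alignment_score("FGK-GKG", "FGKFGKG"))
```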

Global vs. Local Alignments

Global: entire lengths of sequences

F G K - G K G
F G K F G K G

Local: regions of sequences

- - - F G K G K G
F G K F G K G - -

Pairwise Alignment vs. Multiple Sequence Alignment (MSA)

Pairwise: 2 sequences

F G K - G K G
F G K F G K G

MSA: more than 2 sequences

F G K - G K G
F G K F G K G
- G K Q G K G
- - K F G K G

Background

Basic Dynamic Programming

Dynamic Programming Algorithm for Pairwise Alignments

Two sequences
• GAATTC
• GGATC

Scoring scheme
• match = 2
• mismatch = -1
• gap penalty g = -2

1. Initialization

        G   A   A   T   T   C
    0   0   0   0   0   0   0
G   0
G   0
A   0
T   0
C   0

2. Table fill

$$M_{ij} = \max \begin{cases} M_{i-1,j-1} + S(c_i, c_j) \\ M_{i,j-1} + g \\ M_{i-1,j} + g \end{cases}$$

        G   A   A   T   T   C
    0   0   0   0   0   0   0
G   0   2   0  -1  -1  -1  -1
G   0   2   1  -1  -2  -2  -2
A   0   0   4   3   1  -1  -3
T   0  -1   2   3   5   3   1
C   0  -1   0   1   3   4   5

3. Trace back

G A A T T C
|   |   | |
G G A - T C

Alignment score: 5
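The three steps (initialization, table fill, trace back) can be sketched as follows. This is a minimal illustration, not the author's code: the function name and the tie-breaking order in the trace back (diagonal first, then left, then up) are assumptions, and the first row and column are held at 0 as on the initialization slide.

```python
def align(cols: str, rows: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Fill the table M (first row and column fixed at 0) and trace back one alignment."""
    n, m = len(rows), len(cols)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if rows[i - 1] == cols[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s,  # pair the two residues
                          M[i][j - 1] + gap,    # gap in the row sequence
                          M[i - 1][j] + gap)    # gap in the column sequence

    top, bottom = [], []
    i, j = n, m
    while i > 0 and j > 0:
        s = match if rows[i - 1] == cols[j - 1] else mismatch
        if M[i][j] == M[i - 1][j - 1] + s:
            top.append(cols[j - 1]); bottom.append(rows[i - 1]); i -= 1; j -= 1
        elif M[i][j] == M[i][j - 1] + gap:
            top.append(cols[j - 1]); bottom.append("-"); j -= 1
        else:
            top.append("-"); bottom.append(rows[i - 1]); i -= 1
    # Any leftover prefix is padded with gaps.
    while j > 0:
        top.append(cols[j - 1]); bottom.append("-"); j -= 1
    while i > 0:
        top.append("-"); bottom.append(rows[i - 1]); i -= 1
    return M, "".join(reversed(top)), "".join(reversed(bottom))


M, top, bottom = align("GAATTC", "GGATC")
for row in M:
    print(row)
print(top)
print(bottom)
```

With the slide's two sequences, the bottom-right cell of the table is 5 and the trace back yields GAATTC over GGA-TC.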

Multidimensional Dynamic Programming for MSA

• n strings of length L each, running time is $O(L^n)$.

• Impractical beyond about 5-7 proteins of 200-300 residues each.
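To give a feel for the growth, a small sketch that counts the cells of an n-dimensional table for sequences of length L. The function name and the choice of L = 250 are illustrative assumptions.

```python
def dp_cells(n_sequences: int, length: int) -> int:
    """Number of cells in an n-dimensional DP table, one axis of length + 1 per sequence."""
    return (length + 1) ** n_sequences


for n in (2, 3, 5, 7):
    print(n, f"{dp_cells(n, 250):.2e}")
```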

Topics

• Background

• Algorithm Design

• Test Results

Algorithm Design

An MSA Heuristic

Feng-Doolittle Progressive Alignment

1. Align 2 of the sequences Si, Sj
2. Align a 3rd sequence Sk to the alignment Si, Sj
3. Repeat 2 until all sequences are aligned

Scoring a residue against an aligned column: average of the pairwise scores, e.g. S(ci, cj) = (S(T, S) + S(A, S)) / 2

Running time: $O(nL^2)$

Features of Feng-Doolittle Algorithm

• Once a gap, always a gap
• Early mistakes cannot be corrected

Pairwise alignment of x and y, followed by z:

x: G A A G T T
y: G A C - T T
z: G A A C T G

Alignment with the gap in y placed to agree with z:

x: G A A G T T
y: G A - C T T
z: G A A C T G

Alignment order is important

Algorithm Design

TspMsa: First Version

Traveling Salesman Problem (TSP)

Given
• n nodes
• distances for each pair of nodes

Find a roundtrip, so that
• each node is visited exactly once
• the total length is minimal

• NP-complete
• Well studied
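As an illustration of the problem statement, here is a sketch of the nearest-neighbor heuristic, one of the simplest TSP heuristics. It is not claimed to be the method used by TspMsa, it does not guarantee a minimal tour, and the 4-node distance matrix is made up for the example.

```python
def tour_length(tour, dist):
    """Length of the round trip, including the edge back to the start."""
    return sum(dist[tour[k]][tour[(k + 1) % len(tour)]] for k in range(len(tour)))


def nearest_neighbor_tour(dist, start=0):
    """Greedy heuristic: always move to the closest unvisited node."""
    unvisited = set(range(len(dist))) - {start}
    tour = [start]
    while unvisited:
        here = tour[-1]
        nxt = min(unvisited, key=lambda node: (dist[here][node], node))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour


# Hypothetical symmetric distances between 4 nodes.
DIST = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]
tour = nearest_neighbor_tour(DIST)
print(tour, tour_length(tour, DIST))
```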

TspMsa: Algorithm Design

1. Calculate pairwise distances
2. Determine a TSP tour
3. Use the tour as the alignment order
4. Feng-Doolittle alignment

Pairwise distances for sequences 0-4:

      0   1   2   3   4
0     0   1  15  51  61
1     1   0  14  24  58
2    15  14   0  46  67
3    51  24  46   0  38
4    61  58  67  38   0
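For a matrix this small, the shortest tour can be found by exhaustive search. The sketch below uses the 5 x 5 distance matrix from this slide; the function names are illustrative, and exhaustive search is used only because n = 5 (the source does not say which TSP solver TspMsa relies on).

```python
from itertools import permutations

# Pairwise distances for sequences 0-4, as given on the slide.
DIST = [
    [0, 1, 15, 51, 61],
    [1, 0, 14, 24, 58],
    [15, 14, 0, 46, 67],
    [51, 24, 46, 0, 38],
    [61, 58, 67, 38, 0],
]


def tour_length(tour, dist):
    """Length of the round trip, including the edge back to the start."""
    return sum(dist[tour[k]][tour[(k + 1) % len(tour)]] for k in range(len(tour)))


def shortest_tour(dist):
    """Try every round trip that starts at node 0 and keep the shortest."""
    n = len(dist)
    candidates = ((0,) + p for p in permutations(range(1, n)))
    best = min(candidates, key=lambda t: tour_length(t, dist))
    return list(best), tour_length(best, dist)


tour, length = shortest_tour(DIST)
print(tour, length)
```

The result, 0, 1, 3, 4, 2 with length 145, is the same cycle as the tour 3, 1, 0, 2, 4 shown on the modified design slide, read from a different starting point and in the opposite direction.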

Starting Point and Direction of TSP Tour

[Figure: alignment scores for data set kinase_ref3 when the TSP tour is read from different starting points and in different directions]
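A tour is a cycle, so it yields a different linear alignment order for every starting point and for each of the two directions. A small sketch, with an illustrative function name, that enumerates those orders for the 5-node tour 3, 1, 0, 2, 4:

```python
def linear_orders(tour):
    """All linear orders obtained by choosing a starting point and a direction."""
    n = len(tour)
    orders = []
    for start in range(n):
        forward = [tour[(start + k) % n] for k in range(n)]
        backward = [tour[(start - k) % n] for k in range(n)]
        orders.append(forward)
        orders.append(backward)
    return orders


orders = linear_orders([3, 1, 0, 2, 4])
print(len(orders))  # n starting points x 2 directions
for order in orders:
    print(order)
```

Each of these orders can lead to a different progressive alignment, which is why the scores on this slide vary.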

Algorithm Design

TspMsa: Modified Design

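TspMsa: Modified Algorithm Design

1. Calculate pairwise distances
2. Determine a TSP tour
3. Align closest nodes; repeat until one node is left

[Figure: example on the tour 3, 1, 0, 2, 4, showing the groups 3, 1, 0 and 2, 4 and the edge lengths 67 and 38]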

Modified Algorithm is Better

Scores for kinase_ref3
• Original TspMsa: 0.603 (worst) - 0.772 (best)
• Modified TspMsa: 0.836

Alignment order for kinase_ref3:
5 6 7 8 10 9 0 1 4 2 3 18 17 14 15 16 11 12 13 22 21 20 19

Topics

• Background

• Algorithm Design

• Test Results

Test Results

What to Compare With?

Existing MSA Programs

Progressive (less computation time)
• multal
• multalign
• pileup
• clustalw (best quality)

Iterative (better quality)
• prrp
• saga

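CLUSTALW

1. Calculate pairwise distances
2. Derive a guide tree by the Neighbor Joining method
• choose 2 closest nodes, derive an internal node
• repeat until one node is left at the center

For nodes i and j joined at a new internal node x, and any other node m:

$r_i = \left(\sum_k d_{ik}\right) / (n-2)$
$d_{ix} = (d_{ij} + r_i - r_j) / 2$
$d_{jx} = d_{ij} - d_{ix}$
$d_{xm} = (d_{im} + d_{jm} - d_{ij}) / 2$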
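One join step with the four formulas above can be sketched like this. The function name and the 4-leaf distance matrix are made up for illustration. One assumption to flag: the pair is chosen by the standard Neighbor Joining criterion (smallest d_ij - r_i - r_j), which is how "closest" is usually made precise; the slide itself only says "2 closest nodes".

```python
from itertools import combinations


def nj_join_step(d, nodes):
    """Pick one pair (i, j) to join; return the pair, its branch lengths, and distances from x."""
    n = len(nodes)
    r = {i: sum(d[i][k] for k in nodes if k != i) / (n - 2) for i in nodes}
    # Assumed selection rule: standard NJ criterion, not plain minimum distance.
    i, j = min(combinations(nodes, 2), key=lambda p: d[p[0]][p[1]] - r[p[0]] - r[p[1]])
    d_ix = (d[i][j] + r[i] - r[j]) / 2
    d_jx = d[i][j] - d_ix
    d_xm = {m: (d[i][m] + d[j][m] - d[i][j]) / 2 for m in nodes if m not in (i, j)}
    return (i, j), (d_ix, d_jx), d_xm


# Hypothetical additive distances: leaves 0 and 1 hang off one internal node
# (branches 1 and 2), leaves 2 and 3 off another (branches 2 and 1), with an
# internal branch of length 3 in between.
D = [
    [0, 3, 6, 5],
    [3, 0, 7, 6],
    [6, 7, 0, 3],
    [5, 6, 3, 0],
]
pair, branches, to_others = nj_join_step(D, [0, 1, 2, 3])
print(pair, branches, to_others)
```

On this example the step joins leaves 0 and 1 with branch lengths 1 and 2, and the new node x is at distance 5 from leaf 2 and 4 from leaf 3, which matches the tree the distances were built from.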

CLUSTALW

3. Progressively align all sequences following the guide tree
• 2 gap penalty values: opening, extension
• Dynamically changes the gap penalty and the scoring matrix
• Weighted sequences

1 p e e k s a v t a l
2 g e e k a a v l a l
3 e g e w q l v l h v

Without weights: Score = [S(t,v) + S(l,v)] / 2

With weights: Score = [S(t,v)*w1*w3 + S(l,v)*w2*w3] / 2
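The two score formulas can be written as one function in which all weights default to 1. This is only a sketch: the function names and the weight values in the example are invented, and the simplified match/mismatch scoring stands in for the real scoring matrix S.

```python
def simple_score(a: str, b: str) -> int:
    """Stand-in for S(a, b): simplified scoring instead of a real scoring matrix."""
    return 2 if a == b else -1


def column_score(column, residue, weights=None, residue_weight=1.0, score=simple_score):
    """Average score of one residue against every residue in an aligned column.

    weights[k] is the weight of the sequence that contributes column[k];
    residue_weight is the weight of the sequence that contributes `residue`.
    """
    if weights is None:
        weights = [1.0] * len(column)
    total = sum(score(c, residue) * w * residue_weight for c, w in zip(column, weights))
    return total / len(column)


# Column (t, l) from sequences 1 and 2, scored against a residue l from sequence 3.
print(column_score("tl", "l"))                                        # without weights
print(column_score("tl", "l", weights=[0.5, 1.0], residue_weight=2.0))  # with weights
```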

POA

1. Convert sequences to partial order graphs
2. Align 2 sequences
3. Align one sequence to the current group
4. Repeat 3 until all sequences are aligned

E T - - P K M I V R
E T T H - K M L V R

Test Results

Quality Evaluation

BAliBASE Benchmark

• Reference 1: equidistant sequences with various levels of similarity.
  • < 25% sequence identity
  • 20-40% sequence identity
  • > 35% sequence identity

• Reference 2: closely related sequences with a highly divergent “orphan” sequence.

• Reference 3: subgroups with <25% identity between groups.

• Reference 4: sequences with N/C-terminal extensions.

• Reference 5: sequences with internal insertions.

Reference 1 Sequences with < 25% Identity

[Chart "All Test Scores": score per test case for TspMsa, CLUSTALW, and POA]

[Chart "Average Score": ref.1 (<25%), short, medium, and long sequences, CLUSTALW and TspMsa]

Reference 1 Sequences with 20-40% Identity

[Chart: score per test case for TspMsa, CLUSTALW, and POA]

[Chart: ref.1 (20-40%), short, medium, and long sequences, CLUSTALW and TspMsa]

Reference 1 Sequences with >35% Identity

[Chart: score per test case for TspMsa, CLUSTALW, and POA]

[Chart: ref.1 (>35%), short, medium, and long sequences, CLUSTALW and TspMsa]

Reference 2

[Chart: score per test case for TspMsa, CLUSTALW, and POA]

[Chart: short, medium, and long sequences, CLUSTALW and TspMsa]

Reference 3

[Chart: score per test case for TspMsa, CLUSTALW, and POA]

[Chart: short, medium, and long sequences, CLUSTALW and TspMsa]

Reference 4 and Reference 5

[Chart: score per test case in Reference 4 and Reference 5 for TspMsa, CLUSTALW, and POA]

[Chart: ref.4 and ref.5, CLUSTALW and TspMsa]

Alignment Quality Comparison

Reference 1:
• <25% identity: Similar *
• 20-40% identity: Similar *
• >35% identity: Similar

Reference 2: Similar *

Reference 3: TspMsa better

Reference 4: CLUSTALW better

Reference 5: Similar

* CLUSTALW slightly better for short sequences.

TspMsa and POA: TspMsa better

TspMsa and CLUSTALW: comparable

Test Results

Execution Time Evaluation

Fast Mode TspMsa

Most time-consuming step: pairwise distance calculations

• Slow mode: full dynamic programming (accurate)
• Fast mode: a fast approximate method (heuristic)
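The slides do not say which approximation the fast mode uses. Purely as an illustration of how a pairwise distance can be estimated without dynamic programming, here is a sketch of a k-tuple (k-mer) distance; the function name, the value of k, and the normalization are all assumptions.

```python
from collections import Counter


def kmer_distance(a: str, b: str, k: int = 2) -> float:
    """1 minus the fraction of shared k-mers, relative to the shorter k-mer list."""
    ca = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    cb = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    shared = sum((ca & cb).values())  # multiset intersection of the k-mers
    denom = min(sum(ca.values()), sum(cb.values()))
    return 1.0 - shared / denom if denom else 1.0


print(kmer_distance("GAATTC", "GGATC"))   # shares GA, AT, TC out of 4 k-mers
print(kmer_distance("GAATTC", "GAATTC"))  # identical sequences
```

Counting k-mers takes time roughly linear in the sequence lengths, compared with the quadratic table of a full pairwise alignment.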

Quality Impact of the Fast Mode

[Chart: scores on ref.1 (<25%), ref.1 (20-40%), ref.1 (>35%), ref.2, ref.3, ref.4, and ref.5 for CLUSTALW, TspMsa, POA, CLUSTALW-fast, and TspMsa-fast]

Execution Time Evaluation

[Chart: execution time against number of sequences (100, 200, 500, 1000) for CLUSTALW, TspMsa, and POA; CLUSTALW and TspMsa in fast mode]

Conclusions

QUALITY

Slow mode
• close to CLUSTALW (slow mode)
• better than POA

Fast mode (not as good as slow mode)
• comparable to CLUSTALW (fast mode)
• better than POA

EXECUTION TIME

Fast mode
• faster than CLUSTALW (fast mode)
• comparable to POA

Acknowledgement

Dr. Robert Robinson

Dr. Russell Malmberg

Dr. Eileen Kraemer

Computer Science Department
