Using Traveling Salesman Problem Algorithms to Determine
Multiple Sequence Alignment Orders
Weiwei Zhong
Topics
• Background
• Algorithm Design
• Test Results
Background
Definitions
What is a Sequence Alignment?
Given
• 2 or more sequences
• a scoring scheme (match score, mismatch score, gap penalty)
Insert gaps in each sequence, so that
• all sequences have the same length
• the pairing score is maximal
Scoring Matrix
Simplified scoring
• match = 2
• mismatch = -1
• gap penalty = -2
In practice
• scoring matrix
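As a minimal sketch of the simplified scoring (match = 2, mismatch = -1, gap penalty = -2), the Python function below scores a pairwise alignment column by column. The name `score_alignment` and the rejection of gap-only columns are assumptions made for illustration; they do not come from the slides.

```python
MATCH, MISMATCH, GAP = 2, -1, -2  # simplified scoring from the slide


def score_alignment(row_a: str, row_b: str) -> int:
    """Sum the column scores of two gapped rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            # Assumption: a column made only of gaps is not a valid column.
            raise ValueError("a column cannot consist of gaps only")
        if a == "-" or b == "-":
            total += GAP
        elif a == b:
            total += MATCH
        else:
            total += MISMATCH
    return total


# Six matching columns and one gap column: 6 * 2 - 2
print(score_alignment("FGK-GKG", "FGKFGKG"))  # prints 10
```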
Global vs. Local Alignments
Global: entire lengths of sequences
F G K - G K G
F G K F G K G
Local: regions of sequences
- - - F G K G K G
F G K F G K G - -
Pairwise Alignment vs. Multiple Sequence Alignment (MSA)
Pairwise: 2 sequences
F G K - G K G
F G K F G K G
MSA: more than 2 sequences
F G K - G K G
F G K F G K G
- G K Q G K G
- - K F G K G
Background
Basic Dynamic Programming
Dynamic Programming Algorithm for Pairwise Alignments
Two sequences
• GAATTC
• GGATC
1. Initialization
Scoring scheme
• match = 2
• mismatch = -1
• gap penalty = -2
The first row and the first column of the table are set to 0:
       G  A  A  T  T  C
    0  0  0  0  0  0  0
G   0
G   0
A   0
T   0
C   0
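2. Table fill
Scoring scheme
• match = 2
• mismatch = -1
• gap g = -2
Each cell $M_{i,j}$ is computed from its three neighbors $M_{i-1,j-1}$, $M_{i-1,j}$ and $M_{i,j-1}$, where $c_i$ and $c_j$ are the characters labeling the cell:
$M_{i,j} = \max\{\, M_{i-1,j-1} + S(c_i, c_j),\; M_{i,j-1} + g,\; M_{i-1,j} + g \,\}$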
Filled table:
       G  A  A  T  T  C
    0  0  0  0  0  0  0
G   0  2  0 -1 -1 -1 -1
G   0  2  1 -1 -2 -2 -2
A   0  0  4  3  1 -1 -3
T   0 -1  2  3  5  3  1
C   0 -1  0  1  3  4  5
3. Trace back
Starting from the bottom-right cell of the filled table (score 5), follow the cells that produced each maximum:
G A A T T C
|   |   | |
G G A - T C
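The three steps (initialization, table fill, trace back) can be sketched in Python as follows. This is a minimal illustration, not the presenter's code: the function name `align`, the zero-initialized borders (taken from the table on the slide), and the tie-breaking order in the trace back (diagonal, then left, then up) are assumptions.

```python
MATCH, MISMATCH, GAP = 2, -1, -2  # scoring scheme from the slide


def s(a: str, b: str) -> int:
    """Score for pairing two characters."""
    return MATCH if a == b else MISMATCH


def align(cols: str, rows: str):
    """Fill the DP table and trace back one best alignment."""
    n, m = len(rows), len(cols)
    # 1. Initialization: first row and first column are all 0.
    M = [[0] * (m + 1) for _ in range(n + 1)]
    # 2. Table fill: each cell is the max over its three neighbors.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = max(M[i - 1][j - 1] + s(rows[i - 1], cols[j - 1]),
                          M[i][j - 1] + GAP,
                          M[i - 1][j] + GAP)
    # 3. Trace back from the bottom-right cell.
    top, bottom = [], []
    i, j = n, m
    while i > 0 and j > 0:
        if M[i][j] == M[i - 1][j - 1] + s(rows[i - 1], cols[j - 1]):
            top.append(cols[j - 1]); bottom.append(rows[i - 1])
            i, j = i - 1, j - 1
        elif M[i][j] == M[i][j - 1] + GAP:
            top.append(cols[j - 1]); bottom.append("-")
            j -= 1
        else:
            top.append("-"); bottom.append(rows[i - 1])
            i -= 1
    # Any leftover prefix is paired with gaps.
    while j > 0:
        top.append(cols[j - 1]); bottom.append("-"); j -= 1
    while i > 0:
        top.append("-"); bottom.append(rows[i - 1]); i -= 1
    return M, "".join(reversed(top)), "".join(reversed(bottom))


table, top_row, bottom_row = align("GAATTC", "GGATC")
print(table[-1][-1])  # prints 5
print(top_row)        # prints GAATTC
print(bottom_row)     # prints GGA-TC
```

With a different tie-breaking order the trace back may return another alignment with the same score.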
Multidimensional Dynamic Programming for MSA
• For n strings of length L each, the running time is O(L^n).
• Impractical beyond about 5-7 proteins of 200-300 residues each.
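To put the exponential bound in numbers, here is a rough count of table cells, assuming n = 6 sequences of L = 250 residues (values picked from the middle of the ranges above, purely for illustration). Each cell additionally takes a maximum over 2^n - 1 predecessor cells.

```latex
L^{n} = 250^{6} \approx 2.4 \times 10^{14} \ \text{cells}, \qquad 2^{n} - 1 = 2^{6} - 1 = 63 \ \text{predecessors per cell}
```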
Topics
• Background
• Algorithm Design
• Test Results
An MSA Heuristic
Algorithm Design
1. Align 2 of the sequences Si, Sj
2. Align a 3rd sequence Sk to the alignment Si, Sj
3. Repeat 2 until all sequences are aligned
Feng-Doolittle Progressive Alignment
Running time: O(n L^2)
Scoring a column against a character
• column $c_i$ holds T and A, character $c_j$ is S
• $S(c_i, c_j) = \dfrac{S(T, S) + S(A, S)}{2}$
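The averaging rule for scoring an alignment column against a single character can be sketched as below. The names `column_score` and `simple_s` are hypothetical, the pair score reuses the simplified match = 2 / mismatch = -1 scheme as a stand-in for a real scoring matrix, and gaps inside the column are not handled.

```python
def simple_s(a: str, b: str) -> int:
    """Stand-in pair score: match = 2, mismatch = -1 (assumed)."""
    return 2 if a == b else -1


def column_score(column: str, char: str, pair_score=simple_s) -> float:
    """Average of pair_score(x, char) over the residues x in the column."""
    return sum(pair_score(x, char) for x in column) / len(column)


# Column {T, A} against S: (S(T,S) + S(A,S)) / 2 = (-1 + -1) / 2
print(column_score("TA", "S"))  # prints -1.0
# Column {T, S} against S: (-1 + 2) / 2
print(column_score("TS", "S"))  # prints 0.5
```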
Features of Feng-Doolittle Algorithm
x and y aligned first:
x: G A A G T T
y: G A C - T T
z, to be added later:
z: G A A C T G
Alignment that fits all three sequences:
x: G A A G T T
y: G A - C T T
z: G A A C T G
• Once a gap, always a gap
• Early mistakes cannot be corrected
Alignment order is important
TspMsa: First Version
Algorithm Design
Traveling Salesman Problem (TSP)
Given
• n nodes
• distances for each pair of nodes
Find a roundtrip, so that
• each node is visited exactly once
• the total length is minimal
NP-complete
Well studied
TspMsa: Algorithm Design
1. calculate pairwise distances
2. determine a TSP tour
3. Feng-Doolittle alignment
Pairwise distances (5 sequences, nodes 0-4):
      0   1   2   3   4
0     0   1  15  51  61
1     1   0  14  24  58
2    15  14   0  46  67
3    51  24  46   0  38
4    61  58  67  38   0
[Figure: nodes 0-4 joined by the TSP tour; the alignment order is read off along the tour.]
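For the 5-node distance matrix in this example, a shortest tour can be found by simply trying every ordering. The sketch below does that; it is only an illustration of the "determine a TSP tour" step, since the slides do not say which TSP solver TspMsa uses, and brute force is feasible only for a handful of nodes. The names `tour_length` and `brute_force_tsp` are hypothetical.

```python
from itertools import permutations

# Pairwise distances from the slide (nodes 0-4).
DIST = [
    [0, 1, 15, 51, 61],
    [1, 0, 14, 24, 58],
    [15, 14, 0, 46, 67],
    [51, 24, 46, 0, 38],
    [61, 58, 67, 38, 0],
]


def tour_length(tour, dist):
    """Length of the closed roundtrip visiting the nodes in the given order."""
    return sum(dist[tour[k]][tour[(k + 1) % len(tour)]]
               for k in range(len(tour)))


def brute_force_tsp(dist):
    """Try every roundtrip that starts at node 0 and keep the shortest."""
    n = len(dist)
    candidates = ((0,) + p for p in permutations(range(1, n)))
    return min(candidates, key=lambda t: tour_length(t, dist))


best = brute_force_tsp(DIST)
print(best, tour_length(best, DIST))  # prints (0, 1, 3, 4, 2) 145
```

The tour uses the edges 1, 24, 38, 67 and 15, which sum to 145.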
Starting Point and Direction of TSP Tour
Data set: kinase_ref3
[Figure: TSP tour over the 23 sequences (nodes 0-22) with edge distances. The alignment score depends on the starting point and direction of the tour and ranges from 0.603 to 0.772.]
TspMsa: Modified Design
Algorithm Design
TspMsa: Modified Algorithm Design
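1. calculate pairwise distances
2. determine a TSP tour
3. align closest nodes
4. one node left?
• yes: end
• no: repeat
[Figure: worked example on the 5-node tour with edges 1, 15, 24, 38 and 67. The closest neighboring nodes on the tour are aligned and merged into one node, starting with nodes 1 and 0 (distance 1), until the single node 3, 1, 0, 2, 4 is left.]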
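The loop "align closest nodes until one node is left" can be sketched as repeated merging of neighbors on the tour. This is a reading of the flowchart under stated assumptions, not the presenter's implementation: the distance between two neighboring groups is taken to be the original tour edge between their facing end nodes (the slides do not say whether distances are recomputed after a merge), and the actual alignment of the merged groups is left out. The name `merge_closest_neighbors` is hypothetical.

```python
# Pairwise distances from the 5-node example (nodes 0-4).
DIST = [
    [0, 1, 15, 51, 61],
    [1, 0, 14, 24, 58],
    [15, 14, 0, 46, 67],
    [51, 24, 46, 0, 38],
    [61, 58, 67, 38, 0],
]


def merge_closest_neighbors(tour, dist):
    """Repeatedly merge the two neighboring groups joined by the shortest
    remaining tour edge. Returns the list of merges and the final group."""
    groups = [[node] for node in tour]  # cyclic list of groups
    merges = []
    while len(groups) > 1:
        n = len(groups)
        # Edge k joins the end of group k to the start of group k + 1.
        k = min(range(n),
                key=lambda g: dist[groups[g][-1]][groups[(g + 1) % n][0]])
        nxt = (k + 1) % n
        a, b = groups[k], groups[nxt]
        merges.append((tuple(a), tuple(b), dist[a[-1]][b[0]]))
        merged = a + b  # here the two groups would be aligned
        if nxt > k:
            groups[k:nxt + 1] = [merged]
        else:  # wrap-around: last group merges with the first
            groups = [merged] + groups[1:k]
    return merges, groups[0]


merges, final = merge_closest_neighbors([0, 1, 3, 4, 2], DIST)
for left, right, d in merges:
    print(left, right, d)
print(final)
```

On this tour the merges use the edges 1, 15, 24 and 38 in that order, so the longest edge (67) is never used to join two groups.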
Modified Algorithm is Better
Original TspMsa: 0.603 (worst) - 0.772 (best)
Modified TspMsa: 0.836
Alignment order for Kinase_ref3:
5 6 7 8 10 9 0 1 4 2 3 18 17 14 15 16 11 12 13 22 21 20 19
Topics
• Background
• Algorithm Design
• Test Results
Test Results
What to Compare With?
Existing MSA Programs
Progressive (less computation time)
• multal
• multalign
• pileup
• clustalw
• poa
Iterative (better quality)
• prrp
• saga
• hmmt
best quality
Fast
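CLUSTALW
1. Calculate pairwise distances
2. Derive a guide tree by the Neighbor Joining method
• choose the 2 closest nodes i and j, derive an internal node x
• $r_i = \dfrac{\sum_k d_{ik}}{n-2}$
• $d_{ix} = \dfrac{d_{ij} + r_i - r_j}{2}$
• $d_{jx} = d_{ij} - d_{ix}$
• $d_{xm} = \dfrac{d_{im} + d_{jm} - d_{ij}}{2}$ for every other node m
• repeat until one node is left at the center
[Figure: a star of nodes 1-9 in which pairs of nodes are joined step by step through internal nodes until the guide tree is complete.]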
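One joining step of the Neighbor Joining formulas can be sketched as below. The function name `join_pair` and the 4-leaf example matrix are assumptions for illustration. The matrix is built from a small tree with branch lengths 2 and 3 to leaves 0 and 1, branch lengths 4 and 5 to leaves 2 and 3, and an internal branch of length 1, so the formulas should recover those lengths.

```python
def join_pair(d, i, j):
    """Join nodes i and j through a new internal node x.

    Returns the branch lengths d_ix and d_jx and the distances d_xm
    from x to every remaining node m.
    """
    n = len(d)
    # r_i = (sum over k of d_ik) / (n - 2)
    r = [sum(row) / (n - 2) for row in d]
    d_ix = (d[i][j] + r[i] - r[j]) / 2
    d_jx = d[i][j] - d_ix
    d_xm = {m: (d[i][m] + d[j][m] - d[i][j]) / 2
            for m in range(n) if m not in (i, j)}
    return d_ix, d_jx, d_xm


# Hypothetical additive distances between 4 leaves.
D = [
    [0, 5, 7, 8],
    [5, 0, 8, 9],
    [7, 8, 0, 9],
    [8, 9, 9, 0],
]

print(join_pair(D, 0, 1))  # prints (2.0, 3.0, {2: 5.0, 3: 6.0})
```

The distances from x to leaves 2 and 3 come out as 5 and 6, which is the internal branch (1) plus the leaf branches (4 and 5).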
CLUSTALW
3. Progressively align all sequences following the guide tree
• 2 gap penalty values: opening, extension
• Dynamically changes the gap penalty and the scoring matrix
• Weighted sequences
1 p e e k s a v t a l
2 g e e k a a v l a l
3 e g e w q l v l h v
Without weights: Score = [S(t,v) + S(l,v)] / 2
With weights: Score = [S(t,v)*w1*w3 + S(l,v)*w2*w3] / 2
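The weighted score can be written as a small function that reduces to the unweighted average when every weight is 1. The name `weighted_score`, the weight values and the stand-in pair score (match = 2, mismatch = -1) are assumptions for illustration; CLUSTALW derives its weights from the guide tree and uses a real scoring matrix.

```python
def simple_s(a: str, b: str) -> int:
    """Stand-in pair score: match = 2, mismatch = -1 (assumed)."""
    return 2 if a == b else -1


def weighted_score(column, column_weights, char, char_weight,
                   pair_score=simple_s):
    """[sum of S(x, char) * w_x * w_char over the column] / column size."""
    total = sum(pair_score(x, char) * w * char_weight
                for x, w in zip(column, column_weights))
    return total / len(column)


# Column {t, l} from sequences 1 and 2, scored against v from sequence 3.
# Without weights: [S(t,v) + S(l,v)] / 2
print(weighted_score("tl", (1.0, 1.0), "v", 1.0))  # prints -1.0
# With hypothetical weights w1 = 0.5, w2 = 1.0, w3 = 2.0
print(weighted_score("tl", (0.5, 1.0), "v", 2.0))  # prints -1.5
```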
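POA
1. Convert sequences to partial order graphs
E T - - P K M I V R
E T T H - K M L V R
[Figure: a single sequence (E T N K) drawn as a linear graph, and the alignment above drawn as a partial order graph that branches at P / T H and at I / L.]
2. Align 2 sequences
3. Align one sequence to the current group
[Figure: the sequence E T N K being aligned to the partial order graph.]
4. Repeat 3 until all sequences are aligned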
Test Results
Quality Evaluation
BAliBASE Benchmark
• Reference 1: equidistant sequences with various levels of similarity.
  • < 25% sequence identity
  • 20-40% sequence identity
  • > 35% sequence identity
• Reference 2: closely related sequences with a highly divergent “orphan” sequence.
• Reference 3: subgroups with <25% identity between groups.
• Reference 4: sequences with N/C-terminal extensions.
• Reference 5: sequences with internal insertions.
Reference 1 Sequences with < 25% Identity
[Figure, panel "All Test Scores": alignment scores (0-0.8) of TspMsa, CLUSTALW and POA for test cases 1-21, grouped into short, medium and long sequences.]
[Figure, panel "Average Score": average scores (0-0.8) of CLUSTALW, TspMsa and POA on ref.1 (<25%).]
Reference 1 Sequences with 20-40% Identity
[Figure, panel "All Test Scores": alignment scores (0-1) of TspMsa, CLUSTALW and POA per test case (axis ticks 1 to 29), grouped into short, medium and long sequences.]
[Figure, panel "Average Score": average scores (0-1) of CLUSTALW, TspMsa and POA on ref.1 (20-40%).]
Reference 1 Sequences with >35% Identity
[Figure, panel "All Test Scores": alignment scores (0-1) of TspMsa, CLUSTALW and POA per test case (axis ticks 1 to 27), grouped into short, medium and long sequences.]
[Figure, panel "Average Score": average scores (0-1) of CLUSTALW, TspMsa and POA on ref.1 (>35%).]
Reference 2
[Figure, panel "All Test Scores": alignment scores (0-1) of TspMsa, CLUSTALW and POA for test cases 1-23, grouped into short, medium and long sequences.]
[Figure, panel "Average Score": average scores (0-1) of CLUSTALW, TspMsa and POA on ref.2.]
Reference 3
[Figure, panel "All Test Scores": alignment scores (0.1-1) of TspMsa, CLUSTALW and POA for test cases 1-12, grouped into short, medium and long sequences.]
[Figure, panel "Average Score": average scores (0-1) of CLUSTALW, TspMsa and POA on ref.3.]
Reference 4 and Reference 5
[Figure, panel "All Test Scores": alignment scores (0-1) of TspMsa, CLUSTALW and POA per test case (axis ticks 1 to 21), grouped into Reference 4 and Reference 5.]
[Figure, panel "Average Score": average scores (0-1) of CLUSTALW, TspMsa and POA on ref.4 and ref.5.]
Alignment Quality Comparison
Reference 1:
• < 25% identity: Similar *
• 20-40% identity: Similar *
• > 35% identity: Similar
Reference 2: Similar *
Reference 3: TspMsa better
Reference 4: CLUSTALW better
Reference 5: Similar
* CLUSTALW slightly better for short sequences.
TspMsa and POA: TspMsa better
TspMsa and CLUSTALW: comparable
Test Results
Execution Time Evaluation
Fast Mode TspMsa
• Slow mode: full dynamic programming (accurate)
• Fast mode: a fast approximate method (heuristic)
Most time-consuming step: pairwise distance calculations
Quality Impact of the Fast Mode
[Figure: average alignment scores (0-1) on ref.1 (<25%), ref.1 (20-40%), ref.1 (>35%), ref.2, ref.3, ref.4 and ref.5 for CLUSTALW, TspMsa, POA, CLUSTALW-fast and TspMsa-fast.]
Execution Time Evaluation
[Figure: execution time (min, 0-200) versus number of sequences (100, 200, 500, 1000) for CLUSTALW, TspMsa and POA. CLUSTALW and TspMsa are run in fast mode.]
Conclusions
QUALITY
Slow mode
• close to CLUSTALW (slow mode)
• better than POA
Fast mode (not as good as slow mode)
• comparable to CLUSTALW (fast mode)
• better than POA
SPEED
Fast mode
• faster than CLUSTALW (fast mode)
• comparable to POA
Acknowledgement
Dr. Robert Robinson
Dr. Russell Malmberg
Dr. Eileen Kraemer
Computer Science Department