advisor: prof. r. c. t. lee speaker: l. c. chen
Post on 11-Jan-2016
1
Finding approximate palindromes in strings
Alexandre H. L. Porto and Valmir C. Barbosa
Pattern Recognition, vol. 35, pp. 2581-2591, 2002
Advisor: Prof. R. C. T. Lee Speaker: L. C. Chen
2
Definition
• S: a string of n characters.
• S[i]: the i-th character in S.
• S[i…j]: the substring of S whose first and last characters are S[i] and S[j].
• $S^R$: the reverse of S.
Example: S = abcab, $S^R$ = bacba.
3
Definition
• An even (odd) palindrome is a string of the form $S^R S$ ($S^R a S$, where a is a single character). Thus abaccaba is an even palindrome, because abac is the reverse of caba.
• S[c]: the center of the palindrome S[i…j] in S, where
$$c = i - 1 + (j - i + 1)/2.$$
Position: 1 2 3 4 5 6 7 8
S:        c b a c c a b a
S[2…7] = baccab is an even palindrome and c = 2 - 1 + (7 - 2 + 1)/2 = 4.
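The definitions on this slide can be checked with a few lines of Python. This is only an illustrative sketch: the helper names `is_even_palindrome` and `center` are not from the paper, and the center formula is the one reconstructed above for even palindromes, with 1-based inclusive indices as on the slides.

```python
def is_even_palindrome(s: str) -> bool:
    """True if s has the form S^R S, i.e. even length and mirrored halves."""
    half = len(s) // 2
    return len(s) % 2 == 0 and s[:half][::-1] == s[half:]


def center(i: int, j: int) -> int:
    """Center c of the even palindrome S[i...j] (1-based, inclusive indices)."""
    return i - 1 + (j - i + 1) // 2


S = "cbaccaba"
i, j = 2, 7
sub = S[i - 1:j]          # Python slices are 0-based and end-exclusive
print(sub, is_even_palindrome(sub), center(i, j))
```

The center is the last position of the left half, so the palindrome splits into S[i…c] and S[c+1…j].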
4
Edit distance
• In edit distance, there are three types of differences between two strings X and Y:
• Insertion: a symbol of Y is missing in X at a corresponding position.
    X: A - T
    Y: A G T
• Substitution: symbols at corresponding positions are distinct.
    X: A C C
    Y: T C C
• Deletion: a symbol of X is missing in Y at a corresponding position.
    X: G C A
    Y: G - A
5
• ED(A, B) denotes the edit distance between two strings A and B, i.e. the minimum number of substitutions, insertions and deletions of characters needed to transform B into A.
Example:
    A = a b c a b - a
    B = c b - a b b c
Insertion: 1, substitution: 2 and deletion: 1, so ED(A, B) = 4.
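As a hedged illustration of the definition above, the following sketch computes ED(A, B) with the textbook dynamic program (unit cost for each insertion, deletion and substitution). The function name `edit_distance` is chosen here for illustration and is not taken from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance between a and b, computed row by row."""
    prev = list(range(len(b) + 1))            # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        cur = [i]                             # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j - 1] + (ca != cb),   # substitution or match
                           prev[j] + 1,                # drop a character of a
                           cur[j - 1] + 1))            # drop a character of b
        prev = cur
    return prev[-1]


# The slide's example, with the alignment gaps removed.
print(edit_distance("abcaba", "cbabbc"))
```

Only two rows of the table are kept at a time, which is enough when just the final distance is needed.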
6
Approximate palindromes
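• An approximate palindrome with up to k errors: a string of the form $S_2^R S_1$ ($S_2^R a S_1$) such that ED(S1, S2) ≦ k. In other words, the right half S1 is within k edits of the reverse S2 of the left half.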
• An approximate palindrome is maximal if no other approximate palindrome for the same c and k exists having strictly greater size or the same size but strictly fewer errors.
7
• To simplify our discussion, we only discuss even approximate palindromes here.
• S: aabaabcd and k=1.
Position: 1 2 3 4 5 6 7 8
S:        a a b a a b c d
At c=3, abaa (substitute b with a) and aabaa (delete b) are even approximate palindromes, and aabaa is a maximal approximate palindrome.
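The two candidates at c=3 can be verified directly from the definition. The sketch below is illustrative only: `edit_distance` is the textbook dynamic program, and `is_even_approx_palindrome` is a hypothetical helper that takes the left and right halves around the center explicitly.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j - 1] + (ca != cb), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]


def is_even_approx_palindrome(left: str, right: str, k: int) -> bool:
    """True if left+right is an even approximate palindrome with <= k errors."""
    return edit_distance(left[::-1], right) <= k


S, c, k = "aabaabcd", 3, 1
# abaa  = S[2...3] + S[4...5]; the reversed left half "ba" needs one substitution.
print(is_even_approx_palindrome(S[1:c], S[c:5], k))
# aabaa = S[1...3] + S[4...5]; the reversed left half "baa" needs one deletion.
print(is_even_approx_palindrome(S[0:c], S[c:5], k))
```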
8
Problem
• Given a string T of size n, we want to find all maximal approximate palindromes in T with up to k errors.
• For each c, we find the largest i’ and j’ such that ED(T[c+1…i’], $T^R$[1…j’]) ≦ k, where T[c+1…i’] is a prefix of T[c+1…n] and $T^R$[1…j’] is a prefix of $T^R$[1…c], the reverse of T[1…c].
9
• Let S2 = $T^R$[1…c] (the reverse of T[1…c]) and S1 = T[c+1…n], where 1≦c≦n.
• In the dynamic programming approach, we construct an (n’+1) × (m’+1) matrix D, where $D_{i,j}$ is the minimum edit distance between S1[1…i] and S2[1…j], and the lengths of S1 and S2 are n’ and m’ respectively.
10
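• T: dbcaabac, and k=2.
• At c=3, S2 = $T^R$[1…3] = cbd and S1 = T[4…8] = aabac.
i:        0  1  2  3  4  5
S1:          a  a  b  a  c
j=0       0  1  2  3  4  5
j=1  c    1  1  2  3  4  4
j=2  b    2  2  2  2  3  4
j=3  d    3  3  3  3  3  4
↖: substitution or a match, ↑: deletion, ←: insertion.
We can find that the maximal approximate palindrome is bcaab: the entry with value ≦ 2 that covers the most characters is $D_{3,2}$ = 2, which pairs S1[1…3] = aab with S2[1…2] = cb.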
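The quadratic dynamic-programming approach for one center can be sketched as follows. This is a minimal illustration, not the paper's final algorithm: the names `edit_distance_table` and `maximal_even_approx_palindrome` are hypothetical, the table is stored as `D[j][i]` (row j for S2, column i for S1), and "maximal" is read as largest total size with ties broken by fewer errors.

```python
def edit_distance_table(s1: str, s2: str):
    """D[j][i] = edit distance between s1[:i] and s2[:j]."""
    n, m = len(s1), len(s2)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(n + 1):
        D[0][i] = i
    for j in range(m + 1):
        D[j][0] = j
    for j in range(1, m + 1):
        for i in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            D[j][i] = min(D[j - 1][i - 1] + cost,  # substitution or match
                          D[j][i - 1] + 1,         # insertion (left neighbour)
                          D[j - 1][i] + 1)         # deletion (upper neighbour)
    return D


def maximal_even_approx_palindrome(T: str, c: int, k: int) -> str:
    """Largest even approximate palindrome centered after T[c] (1-based c)."""
    s1, s2 = T[c:], T[:c][::-1]
    D = edit_distance_table(s1, s2)
    best = (0, 0, 0, 0)                            # (size, -errors, i, j)
    for j in range(len(s2) + 1):
        for i in range(len(s1) + 1):
            if D[j][i] <= k:
                best = max(best, (i + j, -D[j][i], i, j))
    _, _, i, j = best
    return T[c - j:c + i]


print(maximal_even_approx_palindrome("dbcaabac", 3, 2))
```

This costs O(n’m’) time per center, which is what the diagonal-based method on the following slides improves on.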
11
• How can we compute the table faster?
• In this paper, the method in [LV89] (L. Y. Huang) was used.
12
• We shall heavily use the concept of diagonal.
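• Diagonal d is defined as all of the $D_{i,j}$’s where d = i – j.
• The diagonal property: $D_{i,j} - D_{i-1,j-1}$ = 0 or 1. It means that on a diagonal, the values are monotonically non-decreasing. [U85]
Example with S1 = abc and S2 = bc:
i:        0  1  2  3
S1:          a  b  c
j=0       0  1  2  3
j=1  b    1  1  1  2
j=2  c    2  2  2  1
Diagonal 0 holds the values 0, 1, 2 and Diagonal 2 holds the values 2, 2.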
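The diagonal property can be checked numerically. The sketch below is illustrative: it rebuilds the small table for S1 = abc and S2 = bc, and the helpers `diagonal` and `has_diagonal_property` are hypothetical names introduced here. The table is stored as `D[j][i]`, so Diagonal d contains the cells with i - j = d.

```python
def edit_distance_table(s1: str, s2: str):
    """D[j][i] = edit distance between s1[:i] and s2[:j]."""
    n, m = len(s1), len(s2)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(n + 1):
        D[0][i] = i
    for j in range(m + 1):
        D[j][0] = j
    for j in range(1, m + 1):
        for i in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            D[j][i] = min(D[j - 1][i - 1] + cost, D[j][i - 1] + 1, D[j - 1][i] + 1)
    return D


def diagonal(D, d: int):
    """Values on Diagonal d (cells with i - j = d), from top-left to bottom-right."""
    rows, cols = len(D), len(D[0])
    return [D[j][j + d] for j in range(rows) if 0 <= j + d < cols]


def has_diagonal_property(D) -> bool:
    """True if consecutive values on every diagonal differ by 0 or 1."""
    rows, cols = len(D), len(D[0])
    for d in range(-(rows - 1), cols):
        vals = diagonal(D, d)
        if any(b - a not in (0, 1) for a, b in zip(vals, vals[1:])):
            return False
    return True


D = edit_distance_table("abc", "bc")
print(diagonal(D, 0), diagonal(D, 2), has_diagonal_property(D))
```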
13
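• Consider diagonal d=0. Let us find the largest j, if it exists, such that (i,j) is on Diagonal d (i - j = d) and $D_{i,j}$ = 0.
• Let us now label all of these locations.
S1 = gggtcta, S2 = gttc ("." marks an entry that has not been computed yet)
i:        0  1  2  3  4  5  6  7
S1:          g  g  g  t  c  t  a
j=0       0  1  2  3  4  5  6  7
j=1  g    1  0  .  .  .  .  .  .
j=2  t    2  .  .  .  .  .  .  .
j=3  t    3  .  .  .  .  .  .  .
j=4  c    4  .  .  .  .  .  .  .
On Diagonal 0, the only such location is $D_{1,1}$ = 0.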
14
• Having found the above locations (i, j) where $D_{i,j}$ = 0, we can further find the largest j, if it exists, such that (i, j) is on Diagonal d and $D_{i,j}$ = 1.
• To do this, we use the following observation: each element in Diagonal d can only influence elements in Diagonals d-1, d and d+1.
15
• Let us consider any (i, j) location on Diagonal d.
– $D_{i,j}$ can only be influenced by three neighbours:
    $D_{i-1,j-1}$ on Diagonal d (substitution),
    $D_{i-1,j}$ on Diagonal d-1 (insert),
    $D_{i,j-1}$ on Diagonal d+1 (delete).
• Thus, we conclude that we only need to consider Diagonals d-1, d and d+1 for each $D_{i,j}$.
16
• Observe the following two strings:
    T1 = A1 x …, where A1 = T1[1…i] and x = T1[i+1]
    T2 = A2 y …, where A2 = T2[1…j] and y = T2[j+1]
• If i and j are the largest i and j on their diagonal such that ED(T1[1…i], T2[1…j]) = k, and T1[i+1] ≠ T2[j+1], then ED(A1+x, A2+y) = k+1.
17
• Consider T1=abcd and T2=cbde. ED(T1[1…i], T2[1…j]) = 2. The largest such i and j are 2 and 3 respectively, and T1[i+1] = c ≠ T2[j+1] = e. Thus ED(ab+c, cbd+e) = 2+1 = 3.
    T1 = ab | c | d     (A1 = ab, x = c)
    T2 = cbd | e        (A2 = cbd, y = e)
18
• Based upon the above discussion, on a diagonal d, we can find the largest i and j such that $D_{i,j}$ = e.
• How can we find the largest row containing a value smaller than or equal to k?
• We let $L_{d,e}$ denote the largest row j such that $D_{i,j}$ is on Diagonal d (i - j = d) and $D_{i,j}$ = e ≦ k.
19
• Let $L_{d,e}$ denote the largest row j such that $D_{i,j}$ is on Diagonal d (i - j = d) and $D_{i,j}$ = e ≦ k.
• Based upon this definition, e is the edit distance between S1[1…i] and S2[1…j], where i and j are the largest such indices on Diagonal d, and S2[j+1] ≠ S1[i+1].
• At d = 0: $L_{0,0}$ = 1, $L_{0,1}$ = 2, $L_{0,2}$ = 3 and $L_{0,3}$ = 4.
S1 = gggtcta, S2 = gttc
i:        0  1  2  3  4  5  6  7
S1:          g  g  g  t  c  t  a
j=0       0  1  2  3  4  5  6  7
j=1  g    1  0  1  2  3  4  5  6
j=2  t    2  1  1  2  2  3  4  5
j=3  t    3  2  2  2  2  3  3  4
j=4  c    4  3  3  3  3  2  3  4
20
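• How can we compute the value of $L_{d,e}$?
• We define
$$\mathrm{row}_{d,e} = \max\bigl[\,L_{d,e-1}+1,\; L_{d-1,e-1},\; L_{d+1,e-1}+1\,\bigr],$$
where the three terms correspond to substitution, insertion and deletion respectively.
• $L_{d,e} = \mathrm{row}_{d,e} + t$, where t is the length of the longest common prefix of S1[d+$\mathrm{row}_{d,e}$+1…n’] and S2[$\mathrm{row}_{d,e}$+1…m’]. If t=0, it means that S1[d+$\mathrm{row}_{d,e}$+1] ≠ S2[$\mathrm{row}_{d,e}$+1].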
21
• Consider $D_{3,2}$, which lies on Diagonal d=1. The largest j on d=1 for $D_{i,j}$=1 is j=1, so $L_{1,1}$=1. In this case, d=1 and e=2: $L_{d,e-1}$=$L_{1,1}$=1, $L_{d-1,e-1}$=$L_{0,1}$=2 and $L_{d+1,e-1}$=$L_{2,1}$=0. Thus $\mathrm{row}_{1,2}$ = max($L_{1,1}$+1, $L_{0,1}$, $L_{2,1}$+1) = max(1+1, 2, 0+1) = max(2, 2, 1) = 2.
(Figure: the DP table for S1=gggtcta and S2=gttc with Diagonals d=0, d=1 and d=2 marked.)
22
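• How to compute $L_{-1,1}$? (e = 1, d = -1)
• $\mathrm{row}_{-1,1}$ = max[($L_{d,e-1}$+1), ($L_{d-1,e-1}$), ($L_{d+1,e-1}$+1)] = max[($L_{-1,0}$+1), ($L_{-2,0}$), ($L_{0,0}$+1)] = max[0+1, 0, 1+1] = max[1, 0, 2] = 2.
• Since S1[d+$\mathrm{row}_{d,e}$+1] = S1[-1+2+1] = S1[2] = g ≠ S2[$\mathrm{row}_{d,e}$+1] = S2[2+1] = S2[3] = t, $L_{-1,1}$ = $\mathrm{row}_{-1,1}$ + 0 = 2.
S1 = gggtcta, S2 = gttc ("." marks an entry that has not been computed yet)
i:        0  1  2  3  4  5  6  7
S1:          g  g  g  t  c  t  a
j=0       0  1  2  3  4  5  6  7
j=1  g    1  0  .  .  .  .  .  .
j=2  t    2  1  .  .  .  .  .  .
j=3  t    3  .  .  .  .  .  .  .
j=4  c    4  .  .  .  .  .  .  .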
23
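• How to compute $L_{1,2}$? (e = 2, d = 1)
• $\mathrm{row}_{1,2}$ = max[($L_{d,e-1}$+1), ($L_{d-1,e-1}$), ($L_{d+1,e-1}$+1)] = max[($L_{1,1}$+1), ($L_{0,1}$), ($L_{2,1}$+1)] = max[1+1, 2, 0+1] = max[2, 2, 1] = 2.
• Since the length of the longest common prefix of S1[d+$\mathrm{row}_{1,2}$+1…n’] = S1[4…7] = tcta and S2[$\mathrm{row}_{1,2}$+1…m’] = S2[3…4] = tc is 2, $L_{1,2}$ = $\mathrm{row}_{1,2}$ + 2 = 4.
S1 = gggtcta, S2 = gttc ("." marks an entry that has not been computed yet)
i:        0  1  2  3  4  5  6  7
S1:          g  g  g  t  c  t  a
j=0       0  1  2  3  4  5  6  7
j=1  g    1  0  1  .  .  .  .  .
j=2  t    2  1  1  2  .  .  .  .
j=3  t    3  2  2  2  2  .  .  .
j=4  c    4  .  .  .  .  2  .  .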
24
• $L_{d,e} = \mathrm{row}_{d,e} + t$, where t is the length of the longest common prefix of S1[d+$\mathrm{row}_{d,e}$+1…n’] and S2[$\mathrm{row}_{d,e}$+1…m’].
• How can we compute t?
In this paper, LCA (lowest common ancestor) is used.
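To make the role of t concrete, here is a deliberately simpler stand-in for the suffix-tree LCA query: a table of longest-common-prefix lengths filled in from the back, which answers each query in constant time after O(n’m’) preprocessing. This is not the paper's data structure, only a sketch of the same query; `lcp_table` is a hypothetical name and the indices are 0-based.

```python
def lcp_table(s1: str, s2: str):
    """lcp[i][j] = length of the longest common prefix of s1[i:] and s2[j:]."""
    n, m = len(s1), len(s2)
    lcp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if s1[i] == s2[j]:
                lcp[i][j] = lcp[i + 1][j + 1] + 1
    return lcp


s1, s2 = "gggtcta", "gttc"
lcp = lcp_table(s1, s2)

# Query for d = 1, row = 2: compare S1[d+row+1...] with S2[row+1...].
# In 0-based terms these are s1[d + row:] = "tcta" and s2[row:] = "tc".
d, row = 1, 2
t = lcp[d + row][row]
print(t, row + t)
```

A suffix tree with constant-time LCA queries gives the same answers with only linear preprocessing, which is why the paper uses it.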
25
• Consider two strings T1 and T2 as shown below:
    T1 = A1 S1 x …
    T2 = A2 S2 y …
If ED(A1, A2) = k and S1 = S2, then ED(A1+S1, A2+S2) = k.
26
• When we find that ED(A1, A2) = k, we want to determine the longest common prefix S of B1 and B2, if it exists.
• This paper uses LCA (lowest common ancestor) to find S.
(Figure: S1 = A1 B1 and S2 = A2 B2, where B1 = S x … and B2 = S y ….)
27
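• To find such an S, if it exists, we may concatenate S1 and S2 into a new string.
• Obviously, the suffixes S1’ and S2’ of the new string have the common prefix S.
(Figure: the concatenated string A1 S x … A2 S y …; S1’ is the suffix starting right after A1 and S2’ is the suffix starting right after A2.)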
28
• Let us concatenate S1 = gggtcta and S2 = gttc into the new string gggtctagttc.
• Consider $D_{3,2}$ on Diagonal d = 1. The substring after ggg is tctagttc = S1’. The substring after gt is tc = S2’. Note that S2’ and S1’ have a common prefix of length 2. Thus we have $D_{3,2} = D_{4,3} = D_{5,4} = 2$.
(Figure: the DP table for S1=gggtcta and S2=gttc with Diagonal d = 1 marked.)
29
• S1 = gggtcta, S2 = gttc. Let us concatenate S1 and S2 into the new string gggtctagttc, and then construct the suffix tree of it.
• The substring after ggg is tctagttc = S1’. The substring after gt is tc = S2’. Note that S2’ and S1’ have a common ancestor tc of length 2.
(Figure: the suffix tree of gggtctagttc$.)
30
Algorithm
Initialization:
    for all d, 1 ≦ d ≦ k+1, and all e with d > e: $L_{d,e}$ = -1.
    for all d, -(k+1) ≦ d ≦ -1: $L_{d,|d|-1}$ = |d|-1, $L_{d,|d|-2}$ = |d|-2.
    for all e, -1 ≦ e ≦ k: $L_{n’+1,e}$ = -1.
Find $L_{0,0}$ = the length of the longest common prefix of S1 and S2.
For e = 1 to k do
    For d = -e to e do
        $\mathrm{row}_{d,e}$ = max[($L_{d,e-1}$+1), ($L_{d-1,e-1}$), ($L_{d+1,e-1}$+1)]
        $\mathrm{row}_{d,e}$ = min($\mathrm{row}_{d,e}$, m’)
        if $\mathrm{row}_{d,e}$ < m’ and $\mathrm{row}_{d,e}$ + d < n’ then
            find t = the length of the longest common prefix of S1[d+$\mathrm{row}_{d,e}$+1…n’] and S2[$\mathrm{row}_{d,e}$+1…m’];
            $\mathrm{row}_{d,e}$ = $\mathrm{row}_{d,e}$ + t;
        $L_{d,e}$ = $\mathrm{row}_{d,e}$.
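The algorithm above can be sketched in Python as follows. Several choices here are assumptions made for illustration rather than details from the paper: missing $L$ entries are simply skipped instead of being pre-initialised with sentinel values, diagonals outside the table (d < -m’ or d > n’) are ignored, the row is also clamped to n’ - d, and the longest common prefix is found by direct comparison instead of suffix-tree LCA queries. The function names are hypothetical.

```python
def lcp(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b (naive scan)."""
    t = 0
    while t < len(a) and t < len(b) and a[t] == b[t]:
        t += 1
    return t


def furthest_rows(s1: str, s2: str, k: int):
    """L[d, e] = largest row j on Diagonal d (i - j = d) reachable with e errors."""
    n, m = len(s1), len(s2)
    L = {(0, 0): lcp(s1, s2)}
    for e in range(1, k + 1):
        for d in range(max(-e, -m), min(e, n) + 1):
            cands = []
            if (d, e - 1) in L:
                cands.append(L[d, e - 1] + 1)       # substitution
            if (d - 1, e - 1) in L:
                cands.append(L[d - 1, e - 1])       # insertion
            if (d + 1, e - 1) in L:
                cands.append(L[d + 1, e - 1] + 1)   # deletion
            row = min(max(cands), m, n - d)         # stay inside the table
            row += lcp(s1[row + d:], s2[row:])      # slide down the diagonal
            L[d, e] = row
    return L


def maximal_even_approx_palindrome(T: str, c: int, k: int) -> str:
    """Largest even approximate palindrome centered after T[c] (1-based c)."""
    s1, s2 = T[c:], T[:c][::-1]
    L = furthest_rows(s1, s2, k)
    # Size of the palindrome at row j on Diagonal d is i + j = 2j + d;
    # ties are broken in favour of fewer errors.
    _, _, d, j = max((2 * j + d, -e, d, j) for (d, e), j in L.items())
    return T[c - j:c + j + d]


L = furthest_rows("gggtcta", "gttc", 2)
print(L[0, 0], L[-1, 1], L[0, 1], L[1, 1], L[1, 2], L[2, 2])
print(maximal_even_approx_palindrome("cttggggtcta", 4, 2))
```

With constant-time longest-common-prefix queries, each center costs O(k²) table entries instead of O(n’m’); the naive `lcp` used here keeps the sketch short but does not achieve that bound.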
31
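Example:
T = cttggggtcta and k=2.
At c=4, T[1…4]=cttg, S2=$T^R$[1…4]=gttc and S1=T[5…11]=gggtcta.
("." marks an entry that has not been computed yet)
i:        0  1  2  3  4  5  6  7
S1:          g  g  g  t  c  t  a
j=0       0  1  2  3  4  5  6  7
j=1  g    1  .  .  .  .  .  .  .
j=2  t    2  .  .  .  .  .  .  .
j=3  t    3  .  .  .  .  .  .  .
j=4  c    4  .  .  .  .  .  .  .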
32
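• At d = 0, find the largest j such that S2[1…j] is equal to S1[1…i] (with i = j); then we set the value of $L_{0,0}$ = j.
• S2[1] = S1[1] = g and S2[2] = t ≠ S1[2] = g, so $L_{0,0}$ = 1.
("." marks an entry that has not been computed yet)
i:        0  1  2  3  4  5  6  7
S1:          g  g  g  t  c  t  a
j=0       0  1  2  3  4  5  6  7
j=1  g    1  0  .  .  .  .  .  .
j=2  t    2  .  .  .  .  .  .  .
j=3  t    3  .  .  .  .  .  .  .
j=4  c    4  .  .  .  .  .  .  .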
33
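• e = 1, d = -1
• $\mathrm{row}_{-1,1}$ = max[($L_{d,e-1}$+1), ($L_{d-1,e-1}$), ($L_{d+1,e-1}$+1)] = max[1, 0, 2] = 2.
• The length of the longest common prefix of ggtctagttc and tc is 0.
• $L_{-1,1}$ = 2
("." marks an entry that has not been computed yet)
i:        0  1  2  3  4  5  6  7
S1:          g  g  g  t  c  t  a
j=0       0  1  2  3  4  5  6  7
j=1  g    1  0  .  .  .  .  .  .
j=2  t    2  1  .  .  .  .  .  .
j=3  t    3  .  .  .  .  .  .  .
j=4  c    4  .  .  .  .  .  .  .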
34
The length of the LCA of ggtctagttc and tc is 0.
(Figure: the suffix tree of gggtctagttc$.)
35
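• e = 1, d = 0
• $\mathrm{row}_{0,1}$ = max[($L_{d,e-1}$+1), ($L_{d-1,e-1}$), ($L_{d+1,e-1}$+1)] = max[2, 0, 1] = 2.
• The length of the longest common prefix of gtctagttc and tc is 0.
• $L_{0,1}$ = 2
("." marks an entry that has not been computed yet)
i:        0  1  2  3  4  5  6  7
S1:          g  g  g  t  c  t  a
j=0       0  1  2  3  4  5  6  7
j=1  g    1  0  .  .  .  .  .  .
j=2  t    2  1  1  .  .  .  .  .
j=3  t    3  .  .  .  .  .  .  .
j=4  c    4  .  .  .  .  .  .  .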
36
The length of the LCA of gtctagttc and tc is 0.
(Figure: the suffix tree of gggtctagttc$.)
37
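• e = 1, d = 1
• $\mathrm{row}_{1,1}$ = max[($L_{d,e-1}$+1), ($L_{d-1,e-1}$), ($L_{d+1,e-1}$+1)] = 1.
• The length of the longest common prefix of gtctagttc and ttc is 0.
• $L_{1,1}$ = 1
("." marks an entry that has not been computed yet)
i:        0  1  2  3  4  5  6  7
S1:          g  g  g  t  c  t  a
j=0       0  1  2  3  4  5  6  7
j=1  g    1  0  1  .  .  .  .  .
j=2  t    2  1  1  .  .  .  .  .
j=3  t    3  .  .  .  .  .  .  .
j=4  c    4  .  .  .  .  .  .  .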
38
The length of the LCA of gtctagttc and ttc is 0.
(Figure: the suffix tree of gggtctagttc$.)
39
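• e = 2, d = 1
• $\mathrm{row}_{1,2}$ = max[($L_{d,e-1}$+1), ($L_{d-1,e-1}$), ($L_{d+1,e-1}$+1)] = 2.
("." marks an entry that has not been computed yet)
i:        0  1  2  3  4  5  6  7
S1:          g  g  g  t  c  t  a
j=0       0  1  2  3  4  5  6  7
j=1  g    1  0  1  .  .  .  .  .
j=2  t    2  1  1  2  .  .  .  .
j=3  t    3  2  2  2  .  .  .  .
j=4  c    4  .  .  .  .  .  .  .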
40
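• e = 2, d = 1
• We find that the longest common prefix of tc and tctagttc is tc. Here S1’ = tcta… is the part of S1 after ggg, and S2’ = tc is the part of S2 after gt.
• $L_{1,2}$ = $\mathrm{row}_{1,2}$ + 2 = 2 + 2 = 4
("." marks an entry that has not been computed yet)
i:        0  1  2  3  4  5  6  7
S1:          g  g  g  t  c  t  a
j=0       0  1  2  3  4  5  6  7
j=1  g    1  0  1  .  .  .  .  .
j=2  t    2  1  1  2  .  .  .  .
j=3  t    3  2  2  2  2  .  .  .
j=4  c    4  .  .  .  .  2  .  .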
41
The length of the LCA of tctagttc and tc is 2.
(Figure: the suffix tree of gggtctagttc$.)
42
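• e = 2, d = 2
• $\mathrm{row}_{2,2}$ = max[($L_{d,e-1}$+1), ($L_{d-1,e-1}$), ($L_{d+1,e-1}$+1)] = 1.
• We find that the length of the longest common prefix of ttc and tctagttc is 1. Here S1’ = tcta… is the part of S1 after ggg, and S2’ = ttc is the part of S2 after g.
• $L_{2,2}$ = $\mathrm{row}_{2,2}$ + 1 = 1 + 1 = 2
("." marks an entry that has not been computed yet)
i:        0  1  2  3  4  5  6  7
S1:          g  g  g  t  c  t  a
j=0       0  1  2  3  4  5  6  7
j=1  g    1  0  1  2  .  .  .  .
j=2  t    2  1  1  2  2  .  .  .
j=3  t    3  2  2  2  2  .  .  .
j=4  c    4  .  .  .  .  2  .  .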
43
The length of the LCA of ttc and tctagttc is 1.
(Figure: the suffix tree of gggtctagttc$.)
44
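T = cttggggtcta and k=2.
At c=4, T[1…4]=cttg, S2=$T^R$[1…4]=gttc and S1=T[5…11]=gggtcta.
Since $L_{1,2}$ = 4 (i = 5, j = 4), the right part is S1[1…5] = gggtc and the left part is the reverse of S2[1…4], so cttggggtc is the maximal approximate palindrome.
("." marks an entry that was not computed)
i:        0  1  2  3  4  5  6  7
S1:          g  g  g  t  c  t  a
j=0       0  1  2  3  4  5  6  7
j=1  g    1  0  1  2  .  .  .  .
j=2  t    2  1  1  2  2  .  .  .
j=3  t    3  2  2  2  2  .  .  .
j=4  c    4  .  .  .  .  2  .  .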
45
References
• [U85] E. Ukkonen, Finding approximate patterns in strings, Journal of Algorithms, Vol. 6, 1985, pp. 132-137.
• [LV89] G. Landau and U. Vishkin, Fast parallel and serial approximate string matching, Journal of Algorithms, Vol. 10, 1989, pp. 157-169.