Advisor: Prof. R. C. T. Lee Speaker: L. C. Chen

Finding approximate palindromes in strings

Alexandre H. L. Porto and Valmir C. Barbosa, Pattern Recognition, vol. 35, pp. 2581-2591, 2002


Definition

• S: a string of n characters.

• S[i]: the i-th character of S.

• S[i…j]: the substring of S whose first and last characters are S[i] and S[j].

• S^R: the reverse of S.

Example: S = abcab, S^R = bacba.

Definition

• An even (odd) palindrome is a string of the form S^R S (S^R a S), where a is a single character. For example, abaccaba is an even palindrome because abac is the reverse of caba.

• S[c] is the center of a palindrome S[i…j] in S, where

$c = i + \lceil (j - i + 1)/2 \rceil - 1$.

Position: 1 2 3 4 5 6 7 8
S:        c b a c c a b a

S[2…7] = baccab is an even palindrome with center c = 4.

Edit distance

• In edit distance, there are three types of differences between two strings X and Y:

• Insertion: a symbol of Y is missing in X at the corresponding position.

X: A － T
Y: A G T

• Substitution: the symbols at corresponding positions are distinct.

X: A C C
Y: T C C

• Deletion: a symbol of X is missing in Y at the corresponding position.

X: G C A
Y: G － A

• ED(A, B) denotes the edit distance between two strings A and B: the minimum number of substitutions, insertions and deletions of characters needed to transform B into A.

Example:

A = a b c a b － a
B = c b － a b b c

Insertion: 1, substitution: 2 and deletion: 1, so ED(A, B) = 4.
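The definition of ED can be checked with the textbook dynamic programming recurrence. The following is a minimal Python sketch, not the method of the paper; the function name `edit_distance` is chosen here for illustration only.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions between a and b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty prefix of b
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # skip the character ca
                curr[j - 1] + 1,           # skip the character cb
                prev[j - 1] + (ca != cb),  # match (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]


# The example pair A = abcaba, B = cbabbc from the slide.
print(edit_distance("abcaba", "cbabbc"))
```

Only two rows of the table are kept at a time, which is enough when just the final distance is needed.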

Approximate palindromes

• An approximate palindrome with up to k errors: a string of the form S2^R S1 (even) or S2^R a S1 (odd) such that ED(S1, S2) ≦ k. With k = 0 this is the exact form S^R S (S^R a S).

• An approximate palindrome is maximal if no other approximate palindrome for the same c and k has strictly greater size, or the same size but strictly fewer errors.

• To simplify the discussion, we only consider even approximate palindromes here.

• Example: S = aabaabcd and k = 1.

Position: 1 2 3 4 5 6 7 8
S:        a a b a a b c d

At c = 3, abaa and aabaa are even approximate palindromes, and aabaa is the maximal approximate palindrome.

• abaa: the reversed left half ba becomes aa by substituting b with a.

• aabaa: the reversed left half baa becomes aa by deleting b.
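The definition of a maximal approximate palindrome can be tested directly on small inputs by trying every pair of prefixes around a center. This is a brute-force sketch for checking examples only, not the algorithm of the paper; `edit_distance` and `brute_force_maximal` are names introduced here, and c is the 1-based center used in the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Standard dynamic programming edit distance (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def brute_force_maximal(s: str, c: int, k: int):
    """Return (palindrome, errors) for the maximal even approximate palindrome at center c."""
    left = s[:c][::-1]   # reverse of S[1..c]
    right = s[c:]        # S[c+1..n]
    best = None          # (size, -errors, text): larger size first, then fewer errors
    for j in range(len(left) + 1):
        for i in range(len(right) + 1):
            errors = edit_distance(right[:i], left[:j])
            if errors <= k:
                candidate = (i + j, -errors, s[c - j:c + i])
                if best is None or candidate[:2] > best[:2]:
                    best = candidate
    return best[2], -best[1]


print(brute_force_maximal("aabaabcd", 3, 1))
```

The empty string around the center always qualifies with 0 errors, so `best` is never `None` when the loops finish.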

Problem

• Given a string T of size n, we want to find all maximal approximate palindromes in T with up to k errors.

• For each c, we find the largest i’ and j’ such that ED(T[c+1…i’], T^R[1…j’]) ≦ k, where T^R[1…c] denotes the reverse of T[1…c] and T^R[1…j’] its first j’ characters.

• Let S2 = T^R[1…c] and S1 = T[c+1…n], where 1 ≦ c ≦ n.

• In the dynamic programming approach, we construct an (n’+1) × (m’+1) matrix D, where Di,j is the edit distance between S1[1…i] and S2[1…j], and n’ and m’ are the lengths of S1 and S2 respectively.

• Example: T = dbcaabac and k = 2.

• At c = 3, S2 = T^R[1…3] = cbd and S1 = T[4…8] = aabac.

        a  a  b  a  c
     0  1  2  3  4  5
  c  1  1  2  3  4  4
  b  2  2  2  2  3  4
  d  3  3  3  3  3  4

The largest prefixes within 2 errors are S1[1…3] = aab and S2[1…2] = cb (D3,2 = 2), so the maximal approximate palindrome is bcaab.

Arrows in the table:

• ↖: substitution or match

• ↑: deletion

• ←: insertion
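The table for this example can be rebuilt and read off mechanically. The sketch below is a plain quadratic construction, assuming the slide's layout (rows follow S2, columns follow S1); the names `dp_table` and `largest_within` are illustrative and do not come from the paper.

```python
def dp_table(s1: str, s2: str):
    """table[j][i] = edit distance between s1[:i] and s2[:j] (rows follow s2)."""
    n, m = len(s1), len(s2)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(n + 1):
        table[0][i] = i
    for j in range(1, m + 1):
        table[j][0] = j
        for i in range(1, n + 1):
            table[j][i] = min(
                table[j - 1][i - 1] + (s1[i - 1] != s2[j - 1]),  # diagonal arrow: match or substitution
                table[j - 1][i] + 1,                              # arrow from the row above
                table[j][i - 1] + 1,                              # arrow from the column to the left
            )
    return table


def largest_within(table, k: int):
    """(i, j) with table[j][i] <= k that maximises i + j, preferring fewer errors on ties."""
    candidates = [
        (i + j, -table[j][i], i, j)
        for j in range(len(table))
        for i in range(len(table[0]))
        if table[j][i] <= k
    ]
    _, _, i, j = max(candidates)
    return i, j


T, c, k = "dbcaabac", 3, 2
table = dp_table(T[c:], T[:c][::-1])
i, j = largest_within(table, k)
print(T[c - j:c + i])
```

Filling the whole table costs time proportional to n’·m’ for every center, which is what the diagonal method discussed next is meant to avoid.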

• How can we compute the table faster?

• In this paper, the method of [LV89] (L. Y. Huang) was used.

• We shall make heavy use of the concept of a diagonal.

• Diagonal d consists of all entries Di,j with d = i – j.

• The diagonal property: Di,j – Di-1,j-1 = 0 or 1. That is, along a diagonal the values never decrease and grow by at most 1 per step. [U85]
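The diagonal property is easy to check empirically. The following sketch rebuilds the table for the earlier example (S1 = aabac, S2 = cbd) and collects every difference between an entry and its upper-left neighbour; the helper names are illustrative.

```python
def dp_table(s1: str, s2: str):
    """table[j][i] = edit distance between s1[:i] and s2[:j]."""
    n, m = len(s1), len(s2)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(n + 1):
        table[0][i] = i
    for j in range(1, m + 1):
        table[j][0] = j
        for i in range(1, n + 1):
            table[j][i] = min(
                table[j - 1][i - 1] + (s1[i - 1] != s2[j - 1]),
                table[j - 1][i] + 1,
                table[j][i - 1] + 1,
            )
    return table


def diagonal_steps(table):
    """Set of all differences D[i][j] - D[i-1][j-1] along every diagonal."""
    return {
        table[j][i] - table[j - 1][i - 1]
        for j in range(1, len(table))
        for i in range(1, len(table[0]))
    }


steps = diagonal_steps(dp_table("aabac", "cbd"))
print(steps <= {0, 1})  # True when every diagonal step is 0 or 1
```

Because each diagonal is a non-decreasing sequence with steps of at most 1, it is enough to record, for each error value e, how far down the diagonal the value e reaches.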

[Figure: the matrix D with Diagonal 0 and Diagonal 2 marked.]

• Consider diagonal d=0. Let us find the largest j, if it exists, such that (i,j) is on Diagonal d (i - j = d) and Di,j = 0.

• Let us now label all of these locations.

[Figure: Diagonal 0 of the matrix D for S1 = gggtcta (i = 1…7) and S2 = gttc.]

• Having found the above locations (i, j) where Di,j = 0, we can further find the largest j, if it exists, such that (i, j) is on Diagonal d and Di,j = 1.

• To do this, we use the following observation: Each element in Diagonal d can only influence elements in Diagonals d-1, d and d+1.

• Let us consider any (i, j) location on Diagonal d.

– Di,j can only be influenced by three neighbouring entries:

– Di-1,j-1, on Diagonal d (substitution or match);

– Di-1,j, on Diagonal d-1 (insertion);

– Di,j-1, on Diagonal d+1 (deletion).

• Thus, we conclude that we only need to consider Diagonals d-1, d and d+1 for each Di,j.

• Observe the following two strings, T1 = A1 x … and T2 = A2 y …, where A1 = T1[1…i], A2 = T2[1…j], x = T1[i+1] and y = T2[j+1].

• If i and j are the largest i and j on their diagonal such that ED(T1[1…i], T2[1…j]) = k, and T1[i+1] ≠ T2[j+1], then ED(A1+x, A2+y) = k+1.

• Consider T1 = abcd and T2 = cbde, with ED(T1[1…i], T2[1…j]) = 2. The largest such i and j are 2 and 3 respectively, and T1[i+1] = c ≠ e = T2[j+1]. Thus ED(ab+c, cbd+e) = 2+1 = 3.

• Based upon the above discussion, on a diagonal d, we can find the largest i and j such that Di,j =e.

• How can we find the largest row containing a value smaller than or equal to k?

• Let Ld,e denote the largest row j such that Di,j is on Diagonal d (i – j = d) and Di,j = e ≦ k.

• Based upon this definition, e is the edit distance between S1[1…i] and S2[1…j], where i and j are the largest such indices on Diagonal d, and S2[j+1] ≠ S1[i+1].

• At d = 0: L0,0 = 1, L0,1 = 2, L0,2 = 3 and L0,3 = 4.

S1 = gggtcta (columns, i = 1…7), S2 = gttc (rows, j = 1…4)

        g  g  g  t  c  t  a
     0  1  2  3  4  5  6  7
  g  1  0  1  2  3  4  5  6
  t  2  1  1  2  2  3  4  5
  t  3  2  2  2  2  3  3  4
  c  4  3  3  3  3  2  3  4

Diagonal d = 0 contains D1,1 = 0, D2,2 = 1, D3,3 = 2 and D4,4 = 3.

• How can we compute the value of Ld,e?

• We define

$row_{d,e} = \max(L_{d,e-1} + 1,\; L_{d-1,e-1},\; L_{d+1,e-1} + 1)$,

where the three terms correspond to substitution, insertion and deletion respectively.

• Then Ld,e = rowd,e + t, where t is the length of the longest common prefix of S1[d+rowd,e+1…n’] and S2[rowd,e+1…m’]. If t = 0, it means that S1[d+rowd,e+1] ≠ S2[rowd,e+1].

• Consider D3,2, which lies on Diagonal d = 1, with e = 2. The largest j on d = 1 with Di,j = 1 is j = 1, so Ld,e-1 = L1,1 = 1. Also Ld-1,e-1 = L0,1 = 2 and, by the initialization for d ＞ e, Ld+1,e-1 = L2,1 = -1. Thus row1,2 = max(L1,1+1, L0,1, L2,1+1) = max(1+1, 2, -1+1) = max(2, 2, 0) = 2.

[Figure: the matrix D for S1 = gggtcta and S2 = gttc, with Diagonals d = 0, d = 1 and d = 2 marked.]

• How to compute L-1,1 (e = 1, d = -1)?

• row-1,1 = max(L-1,0+1, L-2,0, L0,0+1) = max(0+1, 0, 1+1) = max(1, 0, 2) = 2.

• Since S1[d+rowd,e+1] = S1[-1+2+1] = S1[2] = g ≠ S2[rowd,e+1] = S2[2+1] = S2[3] = t, we have t = 0 and L-1,1 = row-1,1 + 0 = 2.

• How to compute L1,2 (e = 2, d = 1)?

• row1,2 = max(L1,1+1, L0,1, L2,1+1) = max(1+1, 2, -1+1) = max(2, 2, 0) = 2.

• Since the longest common prefix of S1[d+row1,2+1…n’] = S1[4…7] = tcta and S2[row1,2+1…m’] = S2[3…4] = tc has length 2, L1,2 = row1,2 + 2 = 4.

• Ld,e = rowd,e + t, where t is the length of the longest common prefix of S1[d+rowd,e+1…n’] and S2[rowd,e+1…m’].

• How can we compute t?

• In this paper, the LCA (lowest common ancestor) in a suffix tree is used.

• Consider two strings T1 = A1 S1 and T2 = A2 S2.

• If ED(A1, A2) = k and S1 = S2, then ED(A1+S1, A2+S2) = k.

• Once we have found ED(A1, A2) = k, we want to determine the longest common prefix S, if it exists, of B1 and B2, the parts of T1 and T2 that follow A1 and A2.

• This paper uses the LCA (lowest common ancestor) to find S.

• To find such an S, we concatenate the two strings into a new string. The suffixes S1’ and S2’ of the new string that start right after A1 and A2 then have the common prefix S.

• Consider D3,2: the suffix after ggg is tctagttc = S1’, and the suffix after gt is tc = S2’. Note that S2’ and S1’ have a common prefix of length 2. Thus D3,2 = D4,3 = D5,4 = 2.
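The quantity t can be illustrated without a suffix tree. The sketch below swaps in a direct character-by-character comparison, which costs time proportional to t per query, whereas the paper relies on LCA queries; the function name and the 0-based offsets are assumptions made for this example.

```python
def common_prefix_length(s1: str, i: int, s2: str, j: int) -> int:
    """Length of the longest common prefix of s1[i:] and s2[j:] (0-based offsets)."""
    t = 0
    while i + t < len(s1) and j + t < len(s2) and s1[i + t] == s2[j + t]:
        t += 1
    return t


S1, S2 = "gggtcta", "gttc"
# Remainders after "ggg" in S1 and after "gt" in S2, as in the D3,2 example.
t = common_prefix_length(S1, 3, S2, 2)
print(t)
```

Comparing the two strings separately also keeps the match from running past the end of S1 into S2, which a direct comparison inside the concatenated string would have to guard against.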

• Let S1 = gggtcta and S2 = gttc. We concatenate S1 and S2 into the new string gggtctagttc$ and construct its suffix tree. The suffix after ggg is tctagttc$ = S1’ and the suffix after gt is tc$ = S2’. The lowest common ancestor of the leaves of S1’ and S2’ has the path label tc, of length 2.

[Figure: suffix tree of gggtctagttc$, with leaves including gtctagttc$, tctagttc$, ctagttc$, tagttc$ and agttc$.]

Algorithm

Initialization:

for all d, 1 ≦ d ≦ k+1, and all e with d ＞ e: Ld,e = -1.

for all d, -(k+1) ≦ d ≦ -1: Ld,|d|-1 = |d|-1 and Ld,|d|-2 = |d|-2.

for all e, -1 ≦ e ≦ k: Ln’+1,e = -1.

Find L0,0 = the length of the longest common prefix of S1 and S2.

For e = 1 to k do

    For d = -e to e do

        rowd,e = max[(Ld,e-1 + 1), (Ld-1,e-1), (Ld+1,e-1 + 1)]

        rowd,e = min(rowd,e, m’)

        if rowd,e < m’ and rowd,e + d < n’ then

            find t = the length of the longest common prefix of S1[d+rowd,e+1…n’] and S2[rowd,e+1…m’];

            rowd,e = rowd,e + t;

        Ld,e = rowd,e.
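The steps above can be sketched in Python as follows. This is a hedged illustration rather than the paper's implementation: the longest common prefix is found by a naive scan instead of a suffix tree with LCA queries, diagonals that fall outside the matrix are skipped, and the row is clamped against both string lengths. These choices, and the names `lcp`, `furthest_rows` and `maximal_even_palindrome`, are assumptions of this sketch.

```python
def lcp(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b (naive scan)."""
    t = 0
    while t < len(a) and t < len(b) and a[t] == b[t]:
        t += 1
    return t


def furthest_rows(s1: str, s2: str, k: int):
    """Return {(d, e): L[d,e]} for 0 <= e <= k, where d = i - j and j is a row of s2."""
    n, m = len(s1), len(s2)
    L = {}

    def get(d, e):
        # Computed values first, then the boundary values of the initialization.
        if (d, e) in L:
            return L[(d, e)]
        if d < 0 and e in (abs(d) - 1, abs(d) - 2):
            return e   # L[d,|d|-1] = |d|-1 and L[d,|d|-2] = |d|-2
        return -1      # d > e, or a diagonal outside the matrix

    L[(0, 0)] = lcp(s1, s2)
    for e in range(1, k + 1):
        for d in range(-e, e + 1):
            if d < -m or d > n:
                continue  # assumption: skip diagonals with no cell in the matrix
            row = max(get(d, e - 1) + 1,      # substitution
                      get(d - 1, e - 1),      # insertion
                      get(d + 1, e - 1) + 1)  # deletion
            row = min(row, m, n - d)          # stay inside both strings
            row += lcp(s1[d + row:], s2[row:])
            L[(d, e)] = row
    return L


def maximal_even_palindrome(t: str, c: int, k: int) -> str:
    """Maximal even approximate palindrome of t at 1-based center c with up to k errors."""
    s2, s1 = t[:c][::-1], t[c:]
    best_size, best_i, best_j = 0, 0, 0
    # Visit entries by increasing e so that equal sizes keep the fewest errors.
    for (d, e), row in sorted(furthest_rows(s1, s2, k).items(), key=lambda item: item[0][1]):
        size = d + 2 * row  # i + j with i = d + row and j = row
        if size > best_size:
            best_size, best_i, best_j = size, d + row, row
    return t[c - best_j:c + best_i]


print(maximal_even_palindrome("cttggggtcta", 4, 2))
```

With the naive `lcp`, the total work per center is no longer bounded by the number of (d, e) pairs; replacing it with a constant-time LCA query, as the paper does, is what gives the faster bound.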


Example:

T = cttggggtcta and k = 2.

At c = 4, T[1…4] = cttg, S2 = T^R[1…4] = gttc and S1 = T[5…11] = gggtcta.

• At d = 0, find the largest j such that S2[1…j] is equal to S1[1…j]; then we set L0,0 = j.

• S2[1] = S1[1] = g, but S2[2] = t ≠ g = S1[2], so L0,0 = 1.

• e = 1, d = -1:

row-1,1 = max(L-1,0+1, L-2,0, L0,0+1) = max(1, 0, 2) = 2.

The length of the longest common prefix of ggtctagttc and tc is 0 (their LCA in the suffix tree is the root).

L-1,1 = 2.

• e = 1, d = 0:

row0,1 = max(L0,0+1, L-1,0, L1,0+1) = max(2, 0, 0) = 2.

The length of the longest common prefix of gtctagttc and tc is 0.

L0,1 = 2.

• e = 1, d = 1:

row1,1 = max(L1,0+1, L0,0, L2,0+1) = max(0, 1, 0) = 1.

The length of the longest common prefix of gtctagttc and ttc is 0.

L1,1 = 1.

• e = 2, d = 1:

row1,2 = max(L1,1+1, L0,1, L2,1+1) = max(2, 2, 0) = 2.

The longest common prefix of S2’ = tc and S1’ = tctagttc is tc, of length 2.

L1,2 = row1,2 + 2 = 2 + 2 = 4.

• e = 2, d = 2:

row2,2 = max(L2,1+1, L1,1, L3,1+1) = max(0, 1, 0) = 1.

The length of the longest common prefix of S2’ = ttc and S1’ = tctagttc is 1.

L2,2 = row2,2 + 1 = 1 + 1 = 2.

T = cttggggtcta and k = 2.

At c = 4, T[1…4] = cttg, S2 = T^R[1…4] = gttc and S1 = T[5…11] = gggtcta.

Since L1,2 = 4, the prefixes S2[1…4] = gttc and S1[1…5] = gggtc are within 2 errors, so cttggggtc is the maximal approximate palindrome.

References

• [U85] E. Ukkonen, Finding approximate patterns in strings, Journal of Algorithms, vol. 6, 1985, pp. 132-137.

• [LV89] G. Landau and U. Vishkin, Fast parallel and serial approximate string matching, Journal of Algorithms, vol. 10, 1989, pp. 157-169.
