1
Speeding up on two string matching algorithms
Advisor: Prof. R. C. T. Lee Speaker: Kuei-hao Chen
CROCHEMORE, M., CZUMAJ, A., GASIENIEC, L., JAROMINEK, S., LECROQ, T.,
PLANDOWSKI, W. and RYTTER, W.
Algorithmica, Vol.12, 1994, pp.247-267
2
Problem Definition
• Input: A text T and a pattern P.
• Output: All occurrences of P in T.
3
Rule 1: The Suffix to Prefix Rule
• For a window to have any chance to match a pattern, there must be a suffix of the window which is equal to a prefix of the pattern.
4
Basic Ideas
• Open a window W with size |P| in the text T.
• Find the longest suffix of W that is also a prefix of the pattern P.
Case 1: The whole window W is equal to P. Match!
5
Case 2: A proper suffix of W is equal to a prefix of P. We move W so that this suffix is aligned with that prefix of P.
Case 3: If there is no such suffix, we move W with length |P|.
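The three cases above can be sketched in a few lines of Python. This is only a minimal illustration: it finds the longest suffix by brute force rather than with the suffix automaton or suffix tree used later, and the function names `longest_proper_suffix_prefix` and `window_search` are hypothetical, not taken from the paper.

```python
def longest_proper_suffix_prefix(window, pattern):
    """Length of the longest proper suffix of window that is a prefix of pattern."""
    for k in range(len(pattern) - 1, 0, -1):
        if window[-k:] == pattern[:k]:
            return k
    return 0


def window_search(text, pattern):
    """Return all start positions of pattern in text using the window shifts."""
    m, n = len(pattern), len(text)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    hits = []
    i = 0
    while i + m <= n:
        window = text[i:i + m]
        if window == pattern:            # Case 1: the whole window matches
            hits.append(i)
        # Case 2: align the longest matching suffix with the prefix of P.
        # Case 3: if no suffix matches, the shift is the full length m.
        i += m - longest_proper_suffix_prefix(window, pattern)
    return hits


print(window_search("abcabcabd", "abcabd"))
```

The shift never skips an occurrence, because any occurrence starting inside the current window would force a suffix of the window to equal a prefix of P.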
6
Preprocessing phase
• T=GCATCGGCGAGAGTATACAGTACG
• P=GCAGAGAG
• L(S): the set containing all prefixes of the pattern.
L(S) = {GCAGAGAG, GCAGAGA, GCAGAG, GCAGA, GCAG, GCA, GC, G}
We construct the suffix automaton of P.
[Figure: Suffix Automaton, with states 0 to 8]
7
Preprocessing: Construct a Suffix Tree
We build the suffix tree of $\overline{P}$, the reversal string of P.
Example: P = GCAGAGAG, $\overline{P}$ = GAGAGACG
Suffixes of $\overline{P}$: GAGAGACG, AGAGACG, GAGACG, AGACG, GACG, ACG, CG, G
[Figure: Suffix tree for $\overline{P}$ = GAGAGACG]
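As a rough stand-in for the suffix tree, the sketch below builds an uncompressed suffix trie of the reversed pattern GAGAGACG with nested dictionaries. It takes quadratic time and space, so it is not the linear-time construction assumed in the analysis. The marker `END` and the function names are hypothetical, and `END` is assumed not to occur in the alphabet.

```python
END = "$"  # hypothetical end-of-suffix marker, assumed absent from the alphabet


def build_suffix_trie(s):
    """Insert every suffix of s into a trie of nested dicts."""
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
        node[END] = True  # a suffix of s ends at this node
    return root


def is_substring(trie, word):
    """True if word can be read from the root, i.e. word is a substring of s."""
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return True


def is_suffix(trie, word):
    """True if word is read from the root and ends at a marked node."""
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node


trie = build_suffix_trie("GAGAGACG")
print(is_suffix(trie, "ACG"), is_substring(trie, "GAC"), is_suffix(trie, "GAC"))
```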
8
• Example 1
W = G C A T C G C A
P = G C A G A G A G
We want to find the longest suffix of W which is equal to a prefix of P.
$\overline{W}$ (the reversal of W) = A C G C T A C G
$\overline{P}$ (the reversal of P) = G A G A G A C G
[Figure: Suffix tree for $\overline{P}$]
We find that ACG (a prefix of $\overline{W}$, hence the reversal of a suffix of W) is a suffix of $\overline{P}$ (hence the reversal of a prefix of P).
Thus GCA, the reversal of ACG, is the longest suffix of W which is equal to a prefix of P.
9
• Example 2
W = G T A T A C A G
P = G C A G A G A G
$\overline{W}$ (the reversal of W) = G A C A T A T G
$\overline{P}$ (the reversal of P) = G A G A G A C G
[Figure: Suffix tree for $\overline{P}$]
We find that GAC is the longest prefix of $\overline{W}$ (thus its reversal CAG is the longest suffix of W) which is equal to a substring of $\overline{P}$.
But GAC is not a suffix of $\overline{P}$, and GACA is not a suffix of $\overline{P}$ either.
10
Luckily, a prefix of GACG, namely G, is also a suffix of $\overline{P}$.
G can be found by finding the lowest common ancestor of G and GACG.
[Figure: the part of the suffix tree for $\overline{P}$ that contains G and GACG]
Thus G is the longest prefix of $\overline{W}$ (suffix of W) which is equal to a suffix of $\overline{P}$ (prefix of P).
11
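Let X be the longest prefix of $\overline{W}$ (the reversed window; X read backwards is a suffix of W) which is equal to a substring of $\overline{P}$ (the reversed pattern), but not a suffix of $\overline{P}$.
Let Y be the longest prefix of X which is equal to a suffix of $\overline{P}$ (Y read backwards is a prefix of P).
Then the reversal of Y is the longest suffix of W equal to a prefix of P.
[Figure: W, P, $\overline{W}$ and $\overline{P}$ with X and Y marked]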
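The definitions of X and Y can be checked on the two examples with a brute-force sketch. It scans prefixes of the reversed window directly instead of walking a suffix tree, and the function name `x_and_y` is hypothetical. Unlike the definition above, this sketch lets X coincide with Y when X is itself a suffix of the reversed pattern, as happens in Example 1.

```python
def x_and_y(window, pattern):
    """Return (X, Y) for the reversed window and the reversed pattern.

    X: longest prefix of reversed(window) that is a substring of reversed(pattern).
    Y: longest prefix of X that is a suffix of reversed(pattern).
    """
    w_rev, p_rev = window[::-1], pattern[::-1]
    x_len = 0
    y_len = 0
    for k in range(1, len(w_rev) + 1):
        prefix = w_rev[:k]
        if prefix not in p_rev:       # no longer prefix can be a substring
            break
        x_len = k
        if p_rev.endswith(prefix):    # remember the longest prefix that is a suffix
            y_len = k
    return w_rev[:x_len], w_rev[:y_len]


print(x_and_y("GCATCGCA", "GCAGAGAG"))  # Example 1
print(x_and_y("GTATACAG", "GCAGAGAG"))  # Example 2
```

The shift for the window is then m minus the length of Y.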
12
Z is a suffix of $\overline{P}$ which can be found in the suffix tree of $\overline{P}$.
Y may not exist.
If it exists, it must be in the suffix tree of $\overline{P}$ and must have been found before X is found, because Y is a prefix of X.
[Figure: $\overline{W}$ and $\overline{P}$ with X, Y and Z marked]
13
• Preprocessing phase: the worst-case time complexity is O(m).
• Searching phase: the worst-case time complexity is O(mn).
• But it needs only $O\left(\frac{n \log_r m}{m}\right)$ time in the average case, where r is the size of the alphabet, as shown in this paper.
14
For the average case analysis of the RF algorithm, assume that the text is a random sequence over an alphabet of size r, and that m is large enough that $\log_r m \le \frac{3m}{8}$ holds.
This assumption is reasonable.
Let m=16, r=4. Then
$\log_r m = \log_4 16 = 2 \le \frac{3m}{8} = \frac{3 \times 16}{8} = 6.$
15
Theorem. The expected time of the RF algorithm is $O\left(\frac{n \log_r m}{m}\right)$.
Proof.
Note that r>1, and $\log_r m \le \frac{3m}{8}$.
For a pattern with length m, there are no more than m substrings of any given length. Thus, there are at most m substrings with length $2\log_r m$.
[Figure: a substring of length $2\log_r m$ inside the pattern of length m]
16
Let $L_i$ be the length of the shift in the ith attempt of the RF algorithm, and let $X_i$ and $Y_i$ be the X and the Y in the ith attempt respectively.
Let $S_i$ be the length of the longest prefix of $\overline{W}$ (the reversed window) which appears in $\overline{P}$ (the reversed pattern) in the ith attempt. That is, $S_i=|X_i|$.
Let $A_i=|Y_i|$, so that $A_i \le S_i$ because $Y_i$ is a prefix of $X_i$.
17
In the first attempt of the RF algorithm,
$\Pr(S_1 \ge 2\log_r m) \le \frac{m}{r^{2\log_r m}} = \frac{m}{r^{\log_r m^2}} = \frac{m}{m^2} = \frac{1}{m}$
$\Pr(S_1 < 2\log_r m) \ge 1 - \frac{1}{m}$
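The bound on the first attempt can be sanity-checked with a small Monte Carlo sketch. Here S is computed as the length of the longest suffix of the window that occurs as a substring of the pattern. Everything in the block is an illustrative assumption: the function names, the use of Python's `random` module, the lowercase alphabet (so r is at most 26), and the choice to draw a fresh random pattern and window in every trial.

```python
import math
import random


def longest_suffix_in_pattern(window, pattern):
    """Length of the longest suffix of window that is a substring of pattern."""
    s = 0
    for k in range(1, len(window) + 1):
        if window[-k:] in pattern:
            s = k
        else:
            break  # longer suffixes contain this one, so they cannot occur either
    return s


def estimate_long_match_probability(m, r, trials, seed=0):
    """Estimate Pr(S >= 2 log_r m) for random windows and patterns of length m."""
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz"[:r]
    threshold = 2 * math.log(m, r)
    hits = 0
    for _ in range(trials):
        pattern = "".join(rng.choice(alphabet) for _ in range(m))
        window = "".join(rng.choice(alphabet) for _ in range(m))
        if longest_suffix_in_pattern(window, pattern) >= threshold:
            hits += 1
    return hits / trials


estimate = estimate_long_match_probability(16, 4, 20000)
print(estimate, "should not exceed", 1 / 16)
```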
18
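$L_1 = m - A_1$
$\ge m - S_1$ (by $A_i \le S_i$)
$\ge m - 2\log_r m$ (when $S_1 \le 2\log_r m$)
$\ge m - 2 \cdot \frac{3m}{8}$ (by $\log_r m \le \frac{3m}{8}$)
$= \frac{m}{4}.$
Let us call the ith shift long if and only if $L_i \ge \frac{m}{4}$, and short otherwise. (It implies that $L_i$ is long if $S_i \le 2\log_r m$.)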
19
When at least $2\log_r m$ new symbols are read in the current attempt, then with probability at least $1-\frac{1}{m}$ at most $2\log_r m$ characters of the suffix of the window can match a substring of P, which causes a long shift.
[Figure: a suffix of length $2\log_r m$ of the window in T, and a substring of length $2\log_r m$ of P]
20
We divide all attempts into phases. Each phase ends on the first long shift. In other words, there is exactly one long shift in each phase.
[Figure: the text T divided into Phase 1 to Phase 5; each phase consists of short shifts followed by one long shift]
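The division into phases can be illustrated on a list of shift lengths. This sketch assumes that a shift counts as long when it is at least m/4; the function name `split_into_phases` and the sample shift lengths are made up for illustration.

```python
def split_into_phases(shifts, m):
    """Group consecutive shifts so that each phase ends on its first long shift."""
    phases = []
    current = []
    for shift in shifts:
        current.append(shift)
        if shift >= m / 4:           # long shift: the phase ends here
            phases.append(current)
            current = []
    if current:
        phases.append(current)       # trailing short shifts form an unfinished phase
    return phases


print(split_into_phases([1, 2, 8, 9, 1, 1, 1, 7], 16))
```

Every completed phase moves the window by at least m/4, which is what bounds the number of phases.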
21
There are two main ideas in the paper:
(1) The number of all phases is $O\left(\frac{n}{m}\right)$.
(2) We calculate the expected number of comparisons of each phase. The expected number of comparisons of each phase is $O(\log_r m)$.
We shall discuss these two ideas in the next slides.
The number of all phases is $O\left(\frac{n}{m}\right)$.
We know that the length of a long shift is $\ge \frac{m}{4}$.
Then the length of a phase $\ge$ the length of its long shift $\ge \frac{m}{4}$.
The number of all phases is $\le \frac{n}{m/4} = \frac{4n}{m} = O\left(\frac{n}{m}\right)$.
23
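Next, we calculate the expected number of comparisons of each phase.
Claim 1: Assume that $L_i$ and $L_{i+1}$ are both short. Then $\Pr\left(L_{i+2} \ge \frac{m}{4}\right) \ge \frac{1}{2}$.
That is, with probability at least $\frac{1}{2}$, $L_{i+2}$ is the end of a phase.
Proof. Suppose $L_i$ and $L_{i+1} < \frac{m}{4}$. Then the pattern is of the form $v(wv)^k sz$, where $v$, $w$, $s$, $z$ are words and $k \ge 3$.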
24
Note that $Y_i$ denotes the longest suffix of the window $W_i$ which is equal to a prefix of the pattern, where $W_i$ is the window of the text of length m in the ith attempt. Let $B_i$ be the set of new symbols to read in the ith attempt.
Note that the pattern is of the form $v(wv)^k sz$. Then $Y_i = v(wv)^k$, $|B_i| = |sz|$ and $L_i = |sz|$.
[Figure: attempts i, i+1 and i+2 on T, showing $W_i$, $W_{i+1}$, $Y_i$, $B_i$, $Y_{i+1}$, $B_{i+1}$ and the shifts $L_i$, $L_{i+1}$]
25
$L_{i+1} = |wv|$ because there exists an overlap between $Y_i$ and $Y_{i+1}$, and
$Y_{i+1}B_{i+1} = (vw)^k v sz = v(wv)^k sz.$
[Figure: attempts i, i+1 and i+2 on T, showing $W_i$, $W_{i+1}$, $Y_i$, $B_i$, $Y_{i+1}$, $B_{i+1}$ and the shifts $L_i$, $L_{i+1}$]
26
Example: T=bbcabcabcabcabcadc P=cabcabcabcadd, w=ab,v=c, s=a,z=dd.
Then , addabcc 3P .abcc 3 ki wvvY
,
When P shifts Li+1, the overlap of Yi and Yi+1 is
,addabcccaddcab 3333 szwvvvszvwP
.1 vwLi
.cab 33 vvw
c a b c a b c a b c a
c a b c a b c a b c a Li+1
b b c a b c a b c a b c a b c
Li
c a b c a b c a b c a
a dT
P
Li+1
i attempt
i+1
d
d
d
Overlap
d
d
d
c
i+1 attempt
27
11vw kwvv
''vw 11'' vwvw
.
If there exists a word such that , then because is a minimal period of .
11vw kwvv
Without loss of generality we can assume thatis a minimal period of .
''11 vwvw
Hence,
111 iLwvvw
28
Example: P=abcabcabcabcabcabcabbc, w=cabc,v=ab, s=b,z=c.
szvwvP
szwvvP k
111
6
3
bccabab
bccabcabab
w1v1 is a minimal period of P.
29
We can also assume (eventually changing wv and k) that and sz do not have a common prefix. We may therefore obtain a new fragment s1z1 such that
11vw
.11 szLzs i
A suffix of the read part of the text is of the form , and we have at least C=min(Li+1, Li) new symbols to read in the (i+2)th attempt.
Let e be a random word of length C to be read part of the text such that .
111 svw k
1zCe
30
31
Note that
If |Bi|>|Bi+1|, then , otherwise, , .
11 zCBi
1zBi 1s
i+1 attempt
z1i+2 attempt
BiYi
Yi+1 Bi+1
z1s1
BiYi
Yi+1 Bi+1
z1
1 ii BB1 ii BB
. and 11 iiii LBLB
32
We give an example when |Bi|>|Bi+1|.
T=bbbaaaaaaaacda P=aaaaaaaabc, w=a,v=a, s=a, z=bc.
ezsvw
szwv
wv
P
bc , , a,
perfixcommon a havenot do and bca
of period minimalabca
abcaaa
1111
8
7
3
a a a a a a a a
b b b a a a a a a a a a a
i
T
P Li
i+1
c
i+2
Li+1
d a
b c
a a a a a a a a b c
a a a a a a a a b c
33
We give another example when
T=bbcabcabcabcabcadc P=cabcabcabcadd, w=ab,v=c, s=a, z=dd.
ezsvw
szwv
wv
P
dd , ,ca b,
perfixcommon a havenot do and ddbcaca
of period minimaladdabcc
addabcc
1111
3
3
3
.1 ii BB
c a b c a b c a b c a
c a b c a b c a b c a Li+1
b b c a b c a b c a b c a b c
Li
c a b c a b c a b c a
a dT
P
i+1
d
d
d
d
d
d
c
i+2i
34
It is easy to see that if w1v1s1e is a substring of , then y must be either equal to pref(z1) if , or otherwise.
11111 zsvwv k
1s 111 zvwpref
i i+1 i+2
Li+1
TP Lii attempt
i+1 attempt Yi Bi
Bi+1
e111 svw
1111 svwv li+2 attempt
35
In other words, by the above condition, if , w1v1s1e would only appear to the end of P.
Therefore, e=pref(z1).
1s
i+2T
v1w1v1w1v1w1v1w1v1w1v1w1s1i+2 attempt P z1
w1v1w1v1w1s1 e
otherwise, w1v1s1e may appear to any position of P. Therefore, .111 zvwprefy
Tv1w1v1w1v1w1v1w1v1w1v1w1P z1
w1v1w1v1w1 e
36
The probability that reading e new symbols leads to a long (longer than Li+Li+1 which is less than ) substring of the pattern
is no greater than .2
1
er
e4
2m
i i+1 i+2
Li+1
TP Lii attempt
i+1 attempt Yi Bi
Bi+1
e111 svw
1111 svwv li+2 attempt
Note that . and 11111 ii LzsLwv
37
Therefore,
$\Pr\left(L_{i+2} \ge \frac{m}{4}\right) \ge \frac{1}{2}.$
$L_{i+2} \ge m - S_{i+2} \ge \frac{2m}{4} \ge \frac{m}{4}.$
38
By Claim 1, when the (k-1)th and (k-2)th shifts are both short, the kth shift is long with probability at least $\frac{1}{2}$.
It implies that the kth shift of the phase is short with probability at most $\frac{1}{2}$ for $k \ge 3$.
39
Let F be the random variable which is the number of short shifts in the phase.
What can we say about the probability distribution of F?
40
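By Claim 1, we know that when the (k-2)th and (k-1)th shifts are both short, $\Pr(\text{the } k\text{th shift is short}) \le \frac{1}{2}$.
$\Pr(F=0) \ge 1-\frac{1}{m},$
$\Pr(F=1) \le \frac{1}{m},$
$\Pr(F=2) \le \frac{1}{m},$
$\Pr(F=3) \le \frac{1}{m} \cdot \frac{1}{2},$
$\Pr(F=k) \le \frac{1}{m} \left(\frac{1}{2}\right)^{k-2}$, for $k \ge 3$.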
41
Let G be the random variable which is the number of comparisons of the phase, and let L be the number of comparisons of the long shift of the phase. Then
$G \le L + mF.$
The problem is how to find L.
42
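For the number of comparisons of the long shift of the phase, we know
$\Pr(S_i < 2\log_r m) \ge 1-\frac{1}{m}$ and $\Pr(S_i \ge 2\log_r m) \le \frac{1}{m}$.
Note that $S_i$ is the length of the substring of the pattern that is matched in $W_i$.
In expectation,
$L \le (2\log_r m + 1)\left(1-\frac{1}{m}\right) + m \cdot \frac{1}{m} \le 2\log_r m + 2.$
Hence, $G \le L + mF \le 2\log_r m + 2 + mF.$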
43
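For the expected number of comparisons of each phase, we have
$E(G) \le E(L) + m \cdot E(F)$
$\le 2\log_r m + 2 + m \sum_{k \ge 1} k \Pr(F=k)$
$\le 2\log_r m + 2 + m \left( \frac{1}{m} + \frac{2}{m} + \sum_{k \ge 3} \frac{k}{m} \left(\frac{1}{2}\right)^{k-2} \right)$
$= 2\log_r m + 2 + 3 + \sum_{k \ge 3} k \left(\frac{1}{2}\right)^{k-2}$
$= 2\log_r m + 9$
$= O(\log_r m).$
According to the above discussion, we know that there are $O\left(\frac{n}{m}\right)$ phases in the algorithm and the expected number of comparisons of each phase is $O(\log_r m)$.
Therefore, the expected time of the RF algorithm is $O\left(\frac{n \log_r m}{m}\right)$.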
45
In this paper, the authors use X to analyze the average case of the RF algorithm. Note that X is the longest suffix of W which is equal to a substring of P.
In fact, the main idea of the RF algorithm is to find Y, not X. Therefore, we may re-analyze the algorithm using the expected length of $Y_i$.
Note that the shift is $L_i = m - |Y_i| = m - A_i$. If $A_i$ is small, $L_i$ is large. We expect $A_i$ to be very small.
46
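Given a window $W_i$ of T in the ith attempt and a pattern P, the expected length $A_i$ of the longest suffix of $W_i$ equal to a prefix of P is
$A_i = 1 \cdot \frac{1}{r} + 2 \cdot \frac{1}{r^2} + \cdots + (m-1) \cdot \frac{1}{r^{m-1}} + m \cdot \frac{1}{r^m}$ …..(1)
$rA_i = 1 + 2 \cdot \frac{1}{r} + 3 \cdot \frac{1}{r^2} + \cdots + (m-1) \cdot \frac{1}{r^{m-2}} + m \cdot \frac{1}{r^{m-1}}$ …..(2)
47
(2) - (1):
$rA_i - A_i = 1 + \frac{1}{r} + \frac{1}{r^2} + \cdots + \frac{1}{r^{m-1}} - \frac{m}{r^m}$
$(r-1)A_i \le \frac{1 - \left(\frac{1}{r}\right)^m}{1 - \frac{1}{r}} \le \frac{1}{1-\frac{1}{r}} = \frac{r}{r-1}.$
48
We can deduce that
$A_i \le \frac{r}{(r-1)^2}.$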
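The bound can be checked numerically. The sketch below evaluates the finite sum of k / r^k for k = 1..m and compares it with r/(r-1)^2 for the pattern lengths and alphabet sizes used in the random experiments; the function names are hypothetical.

```python
def expected_suffix_length(m, r):
    """Finite sum of k / r**k for k = 1..m."""
    return sum(k / r**k for k in range(1, m + 1))


def upper_bound(r):
    """Closed-form bound r / (r - 1)**2."""
    return r / (r - 1) ** 2


for m, r in [(30, 4), (50, 5), (30, 10), (100, 7)]:
    print(m, r, round(expected_suffix_length(m, r), 4), round(upper_bound(r), 4))
```

For these values of m the finite sum is already indistinguishable from the bound at four decimal places, because the omitted tail is tiny.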
49
We randomly generate some texts and patterns using Knuth's random generating function in the first experiment.
| Data source | Text length | Pattern length | Alphabet size r | Total number of matched comparisons | Number of windows | Expected number of comparisons per window, r/(r-1)^2 | Average number of comparisons per window |
|---|---|---|---|---|---|---|---|
| Random | 1000 | 30 | 4 | 17 | 33 | 0.4444 | 0.515152 |
| Random | 10000 | 30 | 4 | 151 | 338 | 0.4444 | 0.446746 |
| Random | 100000 | 30 | 4 | 1316 | 3377 | 0.4444 | 0.389695 |
| Random | 1000000 | 30 | 4 | 13074 | 33769 | 0.4444 | 0.38716 |
| Random | 1000 | 50 | 5 | 6 | 20 | 0.3125 | 0.3 |
| Random | 10000 | 50 | 5 | 64 | 201 | 0.3125 | 0.318408 |
| Random | 100000 | 50 | 5 | 645 | 2012 | 0.3125 | 0.320577 |
| Random | 1000000 | 50 | 5 | 6045 | 20120 | 0.3125 | 0.300447 |
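A comparable measurement can be sketched as follows. This is not the original experimental setup: it uses Python's `random` module rather than Knuth's generator, a lowercase alphabet (so r is at most 26), a brute-force search for the longest suffix, and hypothetical function names. The exact figures will therefore differ from the table.

```python
import random


def longest_proper_suffix_prefix(window, pattern):
    """Length of the longest proper suffix of window that is a prefix of pattern."""
    for k in range(len(pattern) - 1, 0, -1):
        if window[-k:] == pattern[:k]:
            return k
    return 0


def average_suffix_length(n, m, r, seed=0):
    """Slide the window over a random text; return (average A_i, number of windows)."""
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz"[:r]
    text = "".join(rng.choice(alphabet) for _ in range(n))
    pattern = "".join(rng.choice(alphabet) for _ in range(m))
    total = 0
    windows = 0
    i = 0
    while i + m <= n:
        a = longest_proper_suffix_prefix(text[i:i + m], pattern)
        total += a
        windows += 1
        i += m - a               # the shift is m - A_i
    return total / windows, windows


average, windows = average_suffix_length(100000, 30, 4)
print(windows, average, 4 / (4 - 1) ** 2)
```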
50
| Data source | Text length | Pattern length | Alphabet size r | Total number of matched comparisons | Number of windows | Expected number of comparisons per window, r/(r-1)^2 | Average number of comparisons per window |
|---|---|---|---|---|---|---|---|
| Random | 1000 | 30 | 10 | 5 | 33 | 0.1234 | 0.151515 |
| Random | 10000 | 30 | 10 | 30 | 334 | 0.1234 | 0.08982 |
| Random | 100000 | 30 | 10 | 346 | 3344 | 0.1234 | 0.103469 |
| Random | 1000000 | 30 | 10 | 3930 | 33463 | 0.1234 | 0.117443 |
| Random | 1000 | 100 | 7 | 4 | 10 | 0.1944 | 0.4 |
| Random | 10000 | 100 | 7 | 14 | 100 | 0.1944 | 0.14 |
| Random | 100000 | 100 | 7 | 195 | 1001 | 0.1944 | 0.194805 |
| Random | 1000000 | 100 | 7 | 1865 | 10018 | 0.1944 | 0.186165 |
51
In the second experiment, we take news reports from the CNN site as T and randomly pick a word as P.
| Data source | Text length | Pattern length | Alphabet size r | Total number of matched comparisons | Number of windows | Expected number of comparisons per window, r/(r-1)^2 | Average number of comparisons per window |
|---|---|---|---|---|---|---|---|
| CNN news | 3715 | 7 | 35 | 32 | 535 | 0.0302 | 0.0598 |
| CNN news | 2222 | 14 | 40 | 2 | 158 | 0.0262 | 0.0126 |
52
In the third experiment, we take three fragments from human chromosomes as T. The pattern is taken from a part of T.
| Data source | Text length | Pattern length | Alphabet size r | Total number of matched comparisons | Number of windows | Expected number of comparisons per window, r/(r-1)^2 | Average number of comparisons per window |
|---|---|---|---|---|---|---|---|
| Human Chromosome 21 NT_011512.10 | 1627105 | 70 | 4 | 8942 | 23372 | 0.4444 | 0.3826 |
| Human Chromosome 22 NT_011515.11 | 3437231 | 70 | 4 | 24648 | 49455 | 0.4444 | 0.4984 |
| Human Chromosome X NT_033330.7 | 754004 | 70 | 4 | 5029 | 10843 | 0.4444 | 0.4638 |
53
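Distribution of the length of the longest suffix of the window which is equal to a prefix of the pattern. T and P are the lengths of the text and the pattern; columns 0 to 70 give the number of windows with that length.
| Data source | T | P | r | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 30 | 70 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Random | 1000 | 30 | 4 | 23 | 4 | 5 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Random | 10000 | 30 | 4 | 240 | 60 | 27 | 7 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Random | 100000 | 30 | 4 | 2429 | 661 | 225 | 48 | 9 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Random | 1000000 | 30 | 4 | 24650 | 6200 | 2147 | 587 | 128 | 41 | 11 | 4 | 1 | 0 | 0 | 0 | 0 |
| Random | 1000 | 50 | 5 | 16 | 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Random | 10000 | 50 | 5 | 150 | 40 | 9 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Random | 100000 | 50 | 5 | 1508 | 395 | 88 | 15 | 1 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Random | 1000000 | 50 | 5 | 15314 | 3799 | 812 | 163 | 27 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Random | 1000 | 30 | 10 | 28 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Random | 10000 | 30 | 10 | 305 | 28 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Random | 100000 | 30 | 10 | 3027 | 289 | 27 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Random | 1000000 | 30 | 10 | 29959 | 3119 | 353 | 27 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Random | 1000 | 100 | 7 | 7 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Random | 10000 | 100 | 7 | 86 | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Random | 100000 | 100 | 7 | 841 | 131 | 23 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |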
54
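Distribution of the length of the longest suffix of the window which is equal to a prefix of the pattern. T and P are the lengths of the text and the pattern; columns 0 to 70 give the number of windows with that length.
| Data source | T | P | r | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 30 | 70 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CNN news | 3715 | 7 | 35 | 503 | 32 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| CNN news | 2222 | 14 | 40 | 156 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Human Chromosome 21 NT_011512.10 | 1627105 | 70 | 4 | 16265 | 5823 | 967 | 213 | 68 | 16 | 13 | 3 | 2 | 1 | 0 | 0 | 1 |
| Human Chromosome 22 NT_011515.11 | 3437231 | 70 | 4 | 32269 | 11843 | 3990 | 891 | 295 | 111 | 45 | 6 | 2 | 1 | 1 | 0 | 1 |
| Human Chromosome X NT_033330.7 | 754004 | 70 | 4 | 7177 | 2722 | 701 | 164 | 54 | 17 | 7 | 0 | 0 | 0 | 0 | 0 | 1 |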
55
We calculate the distribution of the length of the longest suffix of the window which is equal to a prefix of the pattern in the above experiments. We find that almost all $A_i$ are smaller than 5.
Therefore, we conclude that the probability of finding a large $A_i$ is very small.