1 speeding up on two string matching algorithms advisor: prof. r. c. t. lee speaker: kuei-hao chen,...

1

Speeding Up Two String-Matching Algorithms

Advisor: Prof. R. C. T. Lee
Speaker: Kuei-hao Chen

CROCHEMORE, M., CZUMAJ, A., GASIENIEC, L., JAROMINEK, S., LECROQ, T., PLANDOWSKI, W. and RYTTER, W.

Algorithmica, Vol. 12, 1994, pp. 247-267

2

Problem Definition

• Input: A text T and a pattern P.

• Output: All occurrences of P in T.

3

Rule 1: The Suffix to Prefix Rule

• For a window to have any chance to match a pattern in some way, there must be a suffix of the window which is equal to a prefix of the pattern.

4

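Basic Ideas

• Open a window W of size |P| in the text T.

• Find the longest suffix of W that is also a prefix of the pattern P.

[Figure: the text T with a window W of length |P|, and the pattern P aligned below it.]

Case 1: The longest such suffix is W itself, so W is equal to P. Match!

Case 2: The longest such suffix is shorter than W. We move W so that this suffix is aligned with the matching prefix of P.

[Figure: the window W and the pattern P before and after the shift.]

Case 3: If there is no such suffix, we move W by length |P|.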
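The three cases above can be sketched with a brute-force search. This is only an illustration of the shifting rule, not the RF algorithm itself: it finds the suffix by direct comparison instead of a suffix automaton, and the function names `longest_proper_border` and `window_search` are hypothetical. After each attempt it shifts by |P| minus the length of the longest suffix of W, shorter than W, that is a prefix of P, which also covers the shift after a match.

```python
def longest_proper_border(window: str, pattern: str) -> int:
    """Length of the longest suffix of `window`, shorter than the window,
    that is equal to a prefix of `pattern` (found by brute force)."""
    for k in range(len(window) - 1, 0, -1):
        if window[-k:] == pattern[:k]:
            return k
    return 0


def window_search(text: str, pattern: str) -> list[int]:
    """Return the start positions of all occurrences of `pattern` in `text`."""
    m, n = len(pattern), len(text)
    hits = []
    pos = 0
    while m > 0 and pos + m <= n:
        window = text[pos:pos + m]
        if window == pattern:  # Case 1: the whole window matches
            hits.append(pos)
        # Case 2 (partial overlap) and Case 3 (no overlap, shift by m)
        pos += m - longest_proper_border(window, pattern)
    return hits


print(window_search("GCATCGCAGAGAGTATACAGTACG", "GCAGAGAG"))
```

The shift never skips an occurrence, because any occurrence starting inside the current window must begin with a suffix of the window that is a prefix of P.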

6

Preprocessing phase

• T = GCATCGCAGAGAGTATACAGTACG

• P = GCAGAGAG

• L(S): the set of all prefixes of the pattern.

L(S) = {GCAGAGAG, GCAGAGA, GCAGAG, GCAGA, GCAG, GCA, GC, G}

We construct the suffix automaton of P.

[Figure: Suffix Automaton, with states 0 to 8 and transitions labelled G, C and A.]

7

Preprocessing: construct a suffix tree of the reversal of P, written $P^R$.

Example: P = GCAGAGAG, so $P^R$ = GAGAGACG.

Suffixes of GAGAGACG: GAGAGACG, AGAGACG, GAGACG, AGACG, GACG, ACG, CG, G

[Figure: Suffix tree for $P^R$ = GAGAGACG.]
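The reason for reversing the pattern can be checked directly: the suffixes of the reversed pattern are exactly the reversed prefixes of P. The sketch below only lists the strings; it does not build a suffix tree or automaton, and the helper names `prefixes` and `suffixes` are hypothetical.

```python
def prefixes(s: str) -> list[str]:
    """All non-empty prefixes of s, longest first."""
    return [s[:k] for k in range(len(s), 0, -1)]


def suffixes(s: str) -> list[str]:
    """All non-empty suffixes of s, longest first."""
    return [s[k:] for k in range(len(s))]


pattern = "GCAGAGAG"
reversed_pattern = pattern[::-1]

# Reversing every prefix of P gives a suffix of the reversed pattern.
reversed_prefixes = {p[::-1] for p in prefixes(pattern)}

print(reversed_pattern)
print(suffixes(reversed_pattern))
print(reversed_prefixes == set(suffixes(reversed_pattern)))
```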

8

• Example 1

W = GCATCGCA, P = GCAGAGAG

We want to find the longest suffix of W which is equal to a prefix of P.

Reversing both strings gives $W^R$ = ACGCTACG and $P^R$ = GAGAGACG.

[Figure: Suffix tree for $P^R$.]

We find that ACG (a prefix of $W^R$, that is, a reversed suffix of W) is a suffix of $P^R$ (that is, a reversed prefix of P).

Thus GCA, the reversal of ACG, is the longest suffix of W which is equal to a prefix of P.

9

• Example 2

W = GTATACAG, P = GCAGAGAG

Reversing both strings gives $W^R$ = GACATATG and $P^R$ = GAGAGACG.

[Figure: Suffix tree for $P^R$.]

We find that GAC is the longest prefix of $W^R$ (thus, read backwards, the longest suffix of W) which is equal to a substring of $P^R$.

But GAC is not a suffix of $P^R$, and GACA is not a suffix of $P^R$ either.

10

Reversed window $W^R$ = GACATATG, reversed pattern $P^R$ = GAGAGACG.

Luckily, a prefix of GACG, namely G, is also a suffix of $P^R$.

G can be found by finding the lowest common ancestor of G and GACG.

[Figure: the part of the suffix tree for $P^R$ that contains the paths G and GACG.]

Thus G is the longest prefix of $W^R$ (suffix of W) which is equal to a suffix of $P^R$ (prefix of P).

11

Let X be the longest prefix of $W^R$, the reversed window (so X is a reversed suffix of W), which is equal to a substring of $P^R$, the reversed pattern, but not a suffix of $P^R$.

Let Y be the longest prefix of X (a reversed suffix of W) which is equal to a suffix of $P^R$ (a reversed prefix of P).

Then Y, read backwards, is the longest suffix of W equal to a prefix of P.

[Figure: W and P with the positions of X and Y marked.]
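The definitions of X and Y can be checked on the two examples with a brute-force sketch. It uses plain substring tests on the reversed strings instead of walking a suffix tree, and the function name `find_x_and_y` is hypothetical. Here X is taken as the longest prefix of the reversed window that occurs anywhere in the reversed pattern, so when that prefix happens to be a suffix as well (Example 1), X and Y coincide.

```python
def find_x_and_y(window: str, pattern: str) -> tuple[str, str]:
    """Return (X, Y) as strings over the reversed window and pattern."""
    w_rev, p_rev = window[::-1], pattern[::-1]

    # X: extend the prefix of the reversed window while it is a substring
    # of the reversed pattern.
    x_len = 0
    while x_len < len(w_rev) and w_rev[:x_len + 1] in p_rev:
        x_len += 1
    x = w_rev[:x_len]

    # Y: the longest prefix of X that is a suffix of the reversed pattern.
    y = ""
    for k in range(x_len, 0, -1):
        if p_rev.endswith(x[:k]):
            y = x[:k]
            break
    return x, y


print(find_x_and_y("GCATCGCA", "GCAGAGAG"))  # Example 1
print(find_x_and_y("GTATACAG", "GCAGAGAG"))  # Example 2
```

Reversing Y gives the longest suffix of W that is a prefix of P: GCA in Example 1 and G in Example 2.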

12

Z is a suffix of $P^R$ (the reversed pattern) which can be found in the suffix tree of $P^R$.

Y may not exist.

If it exists, it must be in the suffix tree of $P^R$ and must have been found before X is found, because Y is a prefix of X.

[Figure: W and P with X, Y and Z marked.]

13

• Preprocessing phase: the worst-case time complexity is O(m), where m = |P|.

• Searching phase: the worst-case time complexity is O(mn), where n = |T|.

• But it needs only $O\left(\frac{n \log_r m}{m}\right)$ time in the average case, where r is the size of the alphabet, as shown in this paper.

14

For the average-case analysis of the RF algorithm, assume that the text is a random sequence over an alphabet of size r, and that m is large enough that

$$\log_r m \le \frac{3m}{8}$$

holds.

This assumption is reasonable. For example, let m = 16 and r = 4. Then

$$\log_r m = \log_4 16 = 2 \le 6 = \frac{3 \cdot 16}{8} = \frac{3m}{8}.$$

15

Theorem. The expected time of the RF algorithm is $O\left(\frac{n \log_r m}{m}\right)$.

Proof.

Note that r > 1 and $\log_r m \le \frac{3m}{8}$.

For a pattern of length m, there are no more than m substrings of any given length. Thus, there are at most m substrings of length $2\log_r m$.

[Figure: the pattern P of length m with a substring of length $2\log_r m$ marked.]

16

Let $L_i$ be the length of the shift in the ith attempt of the RF algorithm, and let $X_i$ and $Y_i$ be the X and the Y in the ith attempt, respectively.

Let $S_i$ be the length of the longest prefix of $W^R$ (the reversed window) which appears in $P^R$ (the reversed pattern) in the ith attempt. That is, $S_i = |X_i|$.

Let $A_i = |Y_i|$, so that $A_i \le S_i$ because $Y_i$ is a prefix of $X_i$.

17

In the first attempt of the RF algorithm,

$$\Pr(S_1 \ge 2\log_r m) \le \frac{m}{r^{2\log_r m}} = \frac{m}{m^2} = \frac{1}{m},$$

$$\Pr(S_1 < 2\log_r m) \ge 1 - \frac{1}{m}.$$
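The bound for the first attempt rests on the identity $r^{2\log_r m} = m^2$: at most m substrings of P have length $2\log_r m$, and a random string of that length equals a fixed one with probability $1/m^2$. A small numeric check, as a sketch only; the function name `first_attempt_bound` and the sample values of m and r are assumptions, apart from m = 16 and r = 4, which come from the slides.

```python
import math


def first_attempt_bound(m: int, r: int) -> float:
    """Upper bound m / r**(2*log_r(m)) on Pr(S_1 >= 2*log_r(m))."""
    threshold = 2 * math.log(m, r)  # 2 * log_r(m)
    return m / r ** threshold


for m, r in [(16, 4), (64, 2), (1000, 26)]:
    # The standing assumption log_r(m) <= 3m/8 holds for these values.
    assert math.log(m, r) <= 3 * m / 8
    print(m, r, first_attempt_bound(m, r), 1 / m)
```

Up to floating-point rounding, the printed bound agrees with 1/m in every case.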

18

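$$
\begin{aligned}
L_1 &= m - A_1 \\
    &\ge m - S_1 && \text{by } A_1 \le S_1 \\
    &\ge m - 2\log_r m && \text{by } S_1 \le 2\log_r m \\
    &\ge m - 2 \cdot \frac{3m}{8} && \text{by } \log_r m \le \frac{3m}{8} \\
    &= \frac{m}{4}.
\end{aligned}
$$

Let us call the ith shift long if and only if $L_i \ge \frac{m}{4}$, and short otherwise. (This implies that $L_i$ is long if $S_i \le 2\log_r m$.)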

19

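When at least $2\log_r m$ new symbols are read in the current attempt, then with probability at least $1 - \frac{1}{m}$ at most $2\log_r m$ characters of the suffix of the window can match a substring of P, which causes a long shift.

[Figure: a suffix of length $2\log_r m$ of the window in T compared with the substrings of length $2\log_r m$ of P.]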

20

We divide all attempts into phases. Each phase ends on the first long shift. In other words, there is exactly one long shift in each phase.

[Figure: the text T divided into Phase 1 to Phase 5; each phase consists of short shifts followed by one long shift.]

21

There are two main ideas in the paper:

(1) The number of phases is $O\left(\frac{n}{m}\right)$.

(2) We calculate the expected number of comparisons in each phase, which is $O(\log_r m)$.

We shall discuss these two ideas in the following slides.

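The number of phases is $O\left(\frac{n}{m}\right)$.

We know that the length of a long shift is at least $\frac{m}{4}$.

Then $|\text{phase}| \ge |\text{long shift}| \ge \frac{m}{4}$.

The number of phases is at most $\frac{n}{m/4} = \frac{4n}{m} = O\left(\frac{n}{m}\right)$.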
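The phase decomposition can be illustrated on a list of shift lengths. The sketch below assumes the definition used in these slides (a shift is long when it is at least m/4); the function name `split_into_phases` and the sample shift sequence are made up for illustration.

```python
def split_into_phases(shifts: list[int], m: int) -> list[list[int]]:
    """Group consecutive shifts into phases, each ending on a long shift."""
    phases, current = [], []
    for shift in shifts:
        current.append(shift)
        if 4 * shift >= m:  # long shift: length at least m/4
            phases.append(current)
            current = []
    if current:  # trailing short shifts that never reached a long shift
        phases.append(current)
    return phases


m = 16
shifts = [2, 3, 12, 16, 1, 8]  # hypothetical shift lengths
phases = split_into_phases(shifts, m)
print(phases)

# Every completed phase advances the window by at least m/4, so a text of
# length n cannot contain more than 4n/m completed phases.
n = sum(shifts)
print(len(phases) <= 4 * n / m)
```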

23

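Claim 1: Assume that $L_i$ and $L_{i+1}$ are both short. Then

$$\Pr\left(L_{i+2} \ge \frac{m}{4}\right) \ge \frac{1}{2}.$$

That is, with probability at least $\frac{1}{2}$, $L_{i+2}$ is the end of a phase.

Proof. Suppose $L_i < \frac{m}{4}$ and $L_{i+1} < \frac{m}{4}$. Then the pattern is of the form $v(wv)^k sz$, where $k \ge 3$ and v, w, s, z are words.

Next, we calculate the expected number of comparisons in each phase.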

24

Note that $Y_i$ denotes the longest suffix of the window $W_i$ which is equal to a prefix of the pattern, where $W_i$ is the window of the text of length m in the ith attempt. Let $B_i$ be the set of new symbols to read in the ith attempt.

Note that the pattern is of the form $v(wv)^k sz$. Then $Y_i = v(wv)^k$, $|B_i| = |sz|$, and $L_i = |sz|$.

[Figure: attempts i, i+1 and i+2 on the text T, showing the windows $W_i$ and $W_{i+1}$, the shifts $L_i$ and $L_{i+1}$, and the segments $Y_i$, $B_i$, $Y_{i+1}$ and $B_{i+1}$.]

25

vwLi 1Let Bi+1 be because there exists an overlap between Yi and Yi+1, and

.11 szwvvvszvwBY kkii

i i+1 i+2

Li+1

TP Lii attempt

i+1 attempt Yi Bi

Yi+1 Bi+1i+2 attempt

Wi

Wi+1

26

Example: T=bbcabcabcabcabcadc P=cabcabcabcadd, w=ab,v=c, s=a,z=dd.

Then , addabcc 3P .abcc 3 ki wvvY

,

When P shifts Li+1, the overlap of Yi and Yi+1 is

,addabcccaddcab 3333 szwvvvszvwP

.1 vwLi

.cab 33 vvw

c a b c a b c a b c a

c a b c a b c a b c a Li+1

b b c a b c a b c a b c a b c

Li


a dT

P

Li+1

i attempt

i+1

d

d

d

Overlap

d

d

d

c

i+1 attempt

27

11vw kwvv

''vw 11'' vwvw

.

If there exists a word such that , then because is a minimal period of .

11vw kwvv

Without loss of generality we can assume thatis a minimal period of .

''11 vwvw

Hence,

111 iLwvvw

28

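Example: P = abcabcabcabcabcabcabbc, w = cabc, v = ab, s = b, z = c.

$$P = v(wv)^k sz = ab(cabcab)^3 bc$$

$$P = v_1(w_1v_1)^6 sz = ab(cab)^6 bc$$

With $w_1$ = c and $v_1$ = ab, $w_1v_1$ = cab is a minimal period of $v(wv)^k$, the periodic prefix of P.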
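The decompositions used in the two examples (P = cabcabcabcadd with w = ab, v = c, s = a, z = dd, and P = abcabcabcabcabcabcabbc with w = cabc, v = ab, s = b, z = c) can be verified mechanically. This is a sketch with hypothetical helper names `build_pattern` and `minimal_period`; the period is found by brute force.

```python
def build_pattern(v: str, w: str, k: int, s: str, z: str) -> str:
    """Assemble v (wv)^k s z."""
    return v + (w + v) * k + s + z


def minimal_period(word: str) -> int:
    """Smallest p >= 1 such that word[i] == word[i + p] wherever both exist."""
    for p in range(1, len(word) + 1):
        if all(word[i] == word[i + p] for i in range(len(word) - p)):
            return p
    return 0  # only reached for the empty word


p1 = build_pattern("c", "ab", 3, "a", "dd")
p2 = build_pattern("ab", "cabc", 3, "b", "c")
p2_minimal = build_pattern("ab", "c", 6, "b", "c")  # w1 = c, v1 = ab

print(p1, p2, p2 == p2_minimal)
print(minimal_period("ab" + "cabcab" * 3))  # period of v (wv)^k
```

In the first example the shift is $L_i = m - |Y_i| = |sz| = 3$, since $Y_i = v(wv)^k$ has length 10 and P has length 13.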

29

We can also assume (eventually changing wv and k) that and sz do not have a common prefix. We may therefore obtain a new fragment s1z1 such that

11vw

.11 szLzs i

A suffix of the read part of the text is of the form , and we have at least C=min(Li+1, Li) new symbols to read in the (i+2)th attempt.

Let e be a random word of length C to be read part of the text such that .

111 svw k

1zCe

30

31

Note that

If |Bi|>|Bi+1|, then , otherwise, , .

11 zCBi

1zBi 1s

i+1 attempt

z1i+2 attempt

BiYi

Yi+1 Bi+1

z1s1

BiYi

Yi+1 Bi+1

z1

1 ii BB1 ii BB

. and 11 iiii LBLB

32

We give an example when |Bi|>|Bi+1|.

T=bbbaaaaaaaacda P=aaaaaaaabc, w=a,v=a, s=a, z=bc.

ezsvw

szwv

wv

P

bc , , a,

perfixcommon a havenot do and bca

of period minimalabca

abcaaa

1111

8

7

3

a a a a a a a a

b b b a a a a a a a a a a

i

T

P Li

i+1

c

i+2

Li+1

d a

b c

a a a a a a a a b c

a a a a a a a a b c

33

We give another example when

T=bbcabcabcabcabcadc P=cabcabcabcadd, w=ab,v=c, s=a, z=dd.

ezsvw

szwv

wv

P

dd , ,ca b,

perfixcommon a havenot do and ddbcaca

of period minimaladdabcc

addabcc

1111

3

3

3

.1 ii BB


c a b c a b c a b c a Li+1

b b c a b c a b c a b c a b c

Li


a dT

P

i+1

d

d

d

d

d

d

c

i+2i

34

It is easy to see that if w1v1s1e is a substring of , then y must be either equal to pref(z1) if , or otherwise.

11111 zsvwv k

1s 111 zvwpref

i i+1 i+2

Li+1

TP Lii attempt

i+1 attempt Yi Bi

Bi+1

e111 svw

1111 svwv li+2 attempt

35

In other words, by the above condition, if , w1v1s1e would only appear to the end of P.

Therefore, e=pref(z1).

1s

i+2T

v1w1v1w1v1w1v1w1v1w1v1w1s1i+2 attempt P z1

w1v1w1v1w1s1 e

otherwise, w1v1s1e may appear to any position of P. Therefore, .111 zvwprefy

Tv1w1v1w1v1w1v1w1v1w1v1w1P z1

w1v1w1v1w1 e

36

The probability that reading e new symbols leads to a long (longer than Li+Li+1 which is less than ) substring of the pattern

is no greater than .2

1

er

e4

2m

i i+1 i+2

Li+1

TP Lii attempt

i+1 attempt Yi Bi

Bi+1

e111 svw

1111 svwv li+2 attempt

Note that . and 11111 ii LzsLwv

37

Therefore,

$$\Pr\left(L_{i+2} \ge \frac{m}{4}\right) \ge \frac{1}{2}.$$

$$L_{i+2} \ge m - S_{i+2} \ge \frac{2m}{4} \ge \frac{m}{4}.$$

38

By Claim 1, when the (k-1)th and (k-2)th shifts are both short, the kth shift is long with probability at least $\frac{1}{2}$.

It implies that the kth shift of the phase is short with probability at most $\frac{1}{2}$, for $k \ge 3$.

39

Let F be the random variable which is the number of short shifts in the phase.

What can we say about the probability distribution of F?

40

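By Claim 1, we know that when the (k-2)th and (k-1)th shifts are both short, $\Pr(k\text{th shift is short}) \le \frac{1}{2}$.

$$
\begin{aligned}
\Pr(F = 0) &\ge 1 - \frac{1}{m}, \\
\Pr(F = 1) &\le \frac{1}{m}, \\
\Pr(F = 2) &\le \frac{1}{m}, \\
\Pr(F = 3) &\le \frac{1}{2m}, \\
\Pr(F = k) &\le \frac{1}{2^{k-2} m}, \quad \text{for } k \ge 3.
\end{aligned}
$$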

41

Let G be the random variable which is the number of comparisons in the phase, and let L be the number of comparisons of the long shift of the phase. Then

$$G \le L + mF.$$

The problem is how to find L.

42

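For the number of comparisons of a long shift of the phase, we know that $\Pr(S_i < 2\log_r m) \ge 1 - \frac{1}{m}$ and $\Pr(S_i \ge 2\log_r m) \le \frac{1}{m}$.

Note that $S_i$ is the length of the substring of the pattern that is matched in $W_i$.

$$L \le 2\log_r m \left(1 - \frac{1}{m}\right) + m \cdot \frac{1}{m} \le 2\log_r m + 1.$$

Hence, $G \le L + mF \le 2\log_r m + 1 + mF$.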

43

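For the expected number of comparisons in each phase, using $\Pr(F=1) \le \frac{1}{m}$, $\Pr(F=2) \le \frac{1}{m}$ and $\Pr(F=k) \le \frac{1}{2^{k-2}m}$ for $k \ge 3$, we have

$$
\begin{aligned}
E(G) &\le \sum_{k \ge 0} \left(2\log_r m + 1 + km\right)\Pr(F = k) \\
     &= \left(2\log_r m + 1\right)\sum_{k \ge 0}\Pr(F = k) + m\sum_{k \ge 1} k\,\Pr(F = k) \\
     &\le \left(2\log_r m + 1\right) + m\left(\frac{1}{m} + \frac{2}{m} + \sum_{k \ge 3}\frac{k}{2^{k-2}m}\right) \\
     &= 2\log_r m + 4 + \sum_{k \ge 3}\frac{k}{2^{k-2}} \\
     &= 2\log_r m + 8 \\
     &= O(\log_r m).
\end{aligned}
$$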

According to the above discussion, there are $O\left(\frac{n}{m}\right)$ phases in the algorithm, and the expected number of comparisons in each phase is $O(\log_r m)$.

Therefore, the expected time of the RF algorithm is $O\left(\frac{n \log_r m}{m}\right)$.

45

In this paper, the authors use X to analyze the average case of the RF algorithm. Note that X is the longest suffix of W which is equal to a substring of P.

In fact, the main idea of the RF algorithm is to find Y, not X. Therefore, we may re-analyze the expected length of $Y_i$.

Note that the shift is $L_i = m - |Y_i| = m - A_i$. If $A_i$ is small, $L_i$ is large. We expect $A_i$ to be very small.

46

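Given a window $W_i$ of T in the ith attempt and a pattern P, the expected length of the longest suffix of $W_i$ equal to a prefix of P is

$$A_i = 1 \cdot \frac{1}{r} + 2 \cdot \frac{1}{r^2} + \cdots + (m-1) \cdot \frac{1}{r^{m-1}} + m \cdot \frac{1}{r^m} \qquad (1)$$

$$r A_i = 1 + 2 \cdot \frac{1}{r} + 3 \cdot \frac{1}{r^2} + \cdots + m \cdot \frac{1}{r^{m-1}} \qquad (2)$$

(2) - (1):

$$
\begin{aligned}
(r-1)A_i = rA_i - A_i &= 1 + \frac{1}{r} + \frac{1}{r^2} + \cdots + \frac{1}{r^{m-1}} - \frac{m}{r^m} \\
&\le \frac{1 - \left(\frac{1}{r}\right)^m}{1 - \frac{1}{r}} \\
&\le \frac{1}{1 - \frac{1}{r}} = \frac{r}{r-1}.
\end{aligned}
$$

We can deduce that

$$(r-1)A_i \le \frac{r}{r-1} \quad\Longrightarrow\quad A_i \le \frac{r}{(r-1)^2}.$$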
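The closed-form bound can be compared numerically with the finite sum in (1). A minimal sketch; the function names `expected_suffix_length` and `closed_form_bound` are hypothetical, and m = 8 is an arbitrary small value chosen so that the gap between the sum and the bound is well above floating-point rounding.

```python
def expected_suffix_length(r: int, m: int) -> float:
    """The finite sum 1/r + 2/r^2 + ... + m/r^m from equation (1)."""
    return sum(k / r ** k for k in range(1, m + 1))


def closed_form_bound(r: int) -> float:
    """The bound r / (r - 1)^2."""
    return r / (r - 1) ** 2


for r in (4, 5, 7, 10):
    print(r, expected_suffix_length(r, 8), closed_form_bound(r))
```

For r = 4, 5 and 7 the bound evaluates to about 0.4444, 0.3125 and 0.1944, the values listed as the expected number of comparisons per window in the experiments.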

49

In the first experiment, we randomly generate some texts and patterns using Knuth's random number generator.

| Data source | Text length | Pattern length | Alphabet size r | Total number of matched comparisons | Number of windows | Expected number of comparisons per window, r/(r-1)^2 | Average number of comparisons per window |
|---|---|---|---|---|---|---|---|
| Random | 1000 | 30 | 4 | 17 | 33 | 0.4444 | 0.515152 |
| Random | 10000 | 30 | 4 | 151 | 338 | 0.4444 | 0.446746 |
| Random | 100000 | 30 | 4 | 1316 | 3377 | 0.4444 | 0.389695 |
| Random | 1000000 | 30 | 4 | 13074 | 33769 | 0.4444 | 0.38716 |
| Random | 1000 | 50 | 5 | 6 | 20 | 0.3125 | 0.3 |
| Random | 10000 | 50 | 5 | 64 | 201 | 0.3125 | 0.318408 |
| Random | 100000 | 50 | 5 | 645 | 2012 | 0.3125 | 0.320577 |
| Random | 1000000 | 50 | 5 | 6045 | 20120 | 0.3125 | 0.300447 |

50

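| Data source | Text length | Pattern length | Alphabet size r | Total number of matched comparisons | Number of windows | Expected number of comparisons per window, r/(r-1)^2 | Average number of comparisons per window |
|---|---|---|---|---|---|---|---|
| Random | 1000 | 30 | 10 | 5 | 33 | 0.1234 | 0.151515 |
| Random | 10000 | 30 | 10 | 30 | 334 | 0.1234 | 0.08982 |
| Random | 100000 | 30 | 10 | 346 | 3344 | 0.1234 | 0.103469 |
| Random | 1000000 | 30 | 10 | 3930 | 33463 | 0.1234 | 0.117443 |
| Random | 1000 | 100 | 7 | 4 | 10 | 0.1944 | 0.4 |
| Random | 10000 | 100 | 7 | 14 | 100 | 0.1944 | 0.14 |
| Random | 100000 | 100 | 7 | 195 | 1001 | 0.1944 | 0.194805 |
| Random | 1000000 | 100 | 7 | 1865 | 10018 | 0.1944 | 0.186165 |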
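The last column of the random-data tables is the total number of matched comparisons divided by the number of windows. The sketch below recomputes it for a few rows copied from the first experiment; the function name `average_per_window` and the tuple layout are assumptions made for illustration.

```python
def average_per_window(total_matched: int, windows: int) -> float:
    """Average number of matched comparisons per window."""
    return total_matched / windows


# (alphabet size r, total matched comparisons, number of windows)
rows = [
    (4, 17, 33),
    (4, 151, 338),
    (5, 64, 201),
    (10, 30, 334),
    (7, 195, 1001),
]

for r, total, windows in rows:
    expected = r / (r - 1) ** 2
    print(r, round(average_per_window(total, windows), 6), round(expected, 4))
```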

51

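In the second experiment, we take news reports from the CNN site as T and randomly pick a word as P.

| Data source | Text length | Pattern length | Alphabet size r | Total number of matched comparisons | Number of windows | Expected number of comparisons per window, r/(r-1)^2 | Average number of comparisons per window |
|---|---|---|---|---|---|---|---|
| CNN news | 3715 | 7 | 35 | 32 | 535 | 0.0302 | 0.0598 |
| CNN news | 2222 | 14 | 40 | 2 | 158 | 0.0262 | 0.0126 |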

52

In the third experiment, we take three fragments from human chromosomes as T. The pattern is taken from a part of T.

| Data source | Text length | Pattern length | Alphabet size r | Total number of matched comparisons | Number of windows | Expected number of comparisons per window, r/(r-1)^2 | Average number of comparisons per window |
|---|---|---|---|---|---|---|---|
| Human Chromosome 21, NT_011512.10 | 1627105 | 70 | 4 | 8942 | 23372 | 0.4444 | 0.3826 |
| Human Chromosome 22, NT_011515.11 | 3437231 | 70 | 4 | 24648 | 49455 | 0.4444 | 0.4984 |
| Human Chromosome X, NT_033330.7 | 754004 | 70 | 4 | 5029 | 10843 | 0.4444 | 0.4638 |

53

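Distribution of the length of the longest suffix of the window which is equal to a prefix of the pattern (T = text length, P = pattern length, r = alphabet size; the columns 0 to 70 are suffix lengths):

| Data source | T | P | r | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 30 | 70 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Random | 1000 | 30 | 4 | 23 | 4 | 5 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Random | 10000 | 30 | 4 | 240 | 60 | 27 | 7 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Random | 100000 | 30 | 4 | 2429 | 661 | 225 | 48 | 9 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Random | 1000000 | 30 | 4 | 24650 | 6200 | 2147 | 587 | 128 | 41 | 11 | 4 | 1 | 0 | 0 | 0 | 0 |
| Random | 1000 | 50 | 5 | 16 | 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Random | 10000 | 50 | 5 | 150 | 40 | 9 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Random | 100000 | 50 | 5 | 1508 | 395 | 88 | 15 | 1 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Random | 1000000 | 50 | 5 | 15314 | 3799 | 812 | 163 | 27 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Random | 1000 | 30 | 10 | 28 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Random | 10000 | 30 | 10 | 305 | 28 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Random | 100000 | 30 | 10 | 3027 | 289 | 27 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Random | 1000000 | 30 | 10 | 29959 | 3119 | 353 | 27 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Random | 1000 | 100 | 7 | 7 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Random | 10000 | 100 | 7 | 86 | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Random | 100000 | 100 | 7 | 841 | 131 | 23 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |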

54

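Distribution of the length of the longest suffix of the window which is equal to a prefix of the pattern (T = text length, P = pattern length, r = alphabet size; the columns 0 to 70 are suffix lengths):

| Data source | T | P | r | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 30 | 70 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CNN news | 3715 | 7 | 35 | 503 | 32 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| CNN news | 2222 | 14 | 40 | 156 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Human Chromosome 21, NT_011512.10 | 1627105 | 70 | 4 | 16265 | 5823 | 967 | 213 | 68 | 16 | 13 | 3 | 2 | 1 | 0 | 0 | 1 |
| Human Chromosome 22, NT_011515.11 | 3437231 | 70 | 4 | 32269 | 11843 | 3990 | 891 | 295 | 111 | 45 | 6 | 2 | 1 | 1 | 0 | 1 |
| Human Chromosome X, NT_033330.7 | 754004 | 70 | 4 | 7177 | 2722 | 701 | 164 | 54 | 17 | 7 | 0 | 0 | 0 | 0 | 0 | 1 |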

55

We calculate the distribution of the length of the longest suffix of the window which is equal to a prefix of the pattern in the above experiments. We find that almost all $A_i$ are smaller than 5.

Therefore, we conclude that the probability of finding a large $A_i$ is very small.
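The claim can be checked on one row of the distribution data. The sketch below uses the Human Chromosome 21 row (counts for suffix lengths 0, 1, ..., 10, 30 and 70); the variable names are hypothetical.

```python
lengths = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30, 70]
chr21_counts = [16265, 5823, 967, 213, 68, 16, 13, 3, 2, 1, 0, 0, 1]

# Total number of windows and total number of matched characters.
windows = sum(chr21_counts)
matched = sum(length * count for length, count in zip(lengths, chr21_counts))

# Fraction of windows whose matched suffix is shorter than 5 characters.
short = sum(count for length, count in zip(lengths, chr21_counts) if length < 5)
fraction_below_5 = short / windows

print(windows, matched, round(matched / windows, 4), round(fraction_below_5, 4))
```

The totals agree with the third experiment (23372 windows, 8942 matched comparisons, average 0.3826), and more than 99% of the windows have a matched suffix shorter than 5.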

