Alpha Skip Search Algorithm

Very Fast String Matching Algorithm for Small Alphabets and Long Patterns, Christian, C., Thierry, L. and Joseph, D.P., Lecture Notes in Computer Science, Vol. 1448, 1998, pp. 55-64

Advisor: Prof. R. C. T. Lee

Reporter: Z. H. Pan



The Exact String Matching Problem: We are given a text string T of length n and a pattern string P of length m, and we want to find all occurrences of P in T.

Example:

Input:

T = AGCCTAAGCTCCTAAGTC (positions 0 to 17)

P = CCTA

There are two occurrences of P in T, starting at positions 2 and 10.

Output: 2, 10
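
As a baseline for the problem statement above, here is a minimal brute-force sketch in Python. The function name find_all_occurrences is our own, and the strings T = AGCCTAAGCTCCTAAGTC and P = CCTA are our reading of the slide's example; the sketch only illustrates the input and output of the problem, not any algorithm from the paper.

```python
def find_all_occurrences(text, pattern):
    """Return every start position s such that text[s:s+m] equals pattern."""
    n, m = len(text), len(pattern)
    # Try every alignment of the pattern against the text.
    return [s for s in range(n - m + 1) if text[s:s + m] == pattern]


if __name__ == "__main__":
    print(find_all_occurrences("AGCCTAAGCTCCTAAGTC", "CCTA"))
```

Every faster algorithm in these slides must return the same positions as this brute-force check.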


• The Alpha Skip Search Algorithm is an improvement of the Skip Search Algorithm.

• The Skip Search Algorithm uses Rule 2 (the substring matching rule) and Rule 4 (the two window rule).


Rule 2: The Substring Matching Rule

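• For any substring u in T, find the nearest occurrence of u in P that lies to the left of it. If such a u exists in P, move P so that the two u's match; otherwise, we may define a new partial window.

[Figure: before and after the shift. P is moved to the right so that the occurrence of u in P is aligned with the occurrence of u in T.]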
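
The shift described by Rule 2 can be sketched as follows. This is an illustrative Python helper under our own conventions, not code from the slides: rule2_shift and its parameter k (the position in P currently aligned with the start of u) are hypothetical names.

```python
def rule2_shift(pattern, u, k):
    """Shift that aligns u in T with the nearest occurrence of u in P to its left.

    u is a substring of T currently aligned with pattern[k:k+len(u)].
    Returns the number of positions to move P to the right, or None when no
    occurrence of u starts to the left of position k in P.
    """
    # rfind returns the highest start index of u inside pattern[0:end];
    # end = k + len(u) - 1 only admits occurrences starting before k.
    pos = pattern.rfind(u, 0, k + len(u) - 1)
    return None if pos == -1 else k - pos


if __name__ == "__main__":
    print(rule2_shift("GCAGAGAG", "AG", 6))
```

When the helper returns None, no such u exists in P, which is the case where the rule falls back to defining a new partial window.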


Rule 2-2: 1-Suffix Rule (A Special Version of Rule 2)

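• Consider the 1-suffix x, a suffix of length 1. Rule 2-2 is Rule 2 applied with u = x.

[Figure: P is moved so that an occurrence of x in P is aligned with x in T.]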


Rule 4: Two Window Rule

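[Figure: a pattern P and a text T divided into consecutive windows w1, w2, w3 and w4.]

• Windows w1 and w2: no prefix of P equals a suffix of w1, and no suffix of P equals a prefix of w2, so no occurrence of P straddles the boundary between w1 and w2.

• Windows w3 and w4: an occurrence of P spanning the two windows is found. Matched!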


The Skip Search Algorithm

• The Skip Search Algorithm uses Rule 2-2 together with Rule 4 in a very clever way.

Example:

T = GCATCGCAGAGAGTATACAGTACG (positions 0 to 23)

P = GCAGAGAG (positions 0 to 7)

The length of the pattern is m. The two windows together form a wide window of length 2m-1.


Example (continued): for each character of the alphabet, the positions at which it occurs in P = GCAGAGAG are stored in decreasing order.

A: (6,4,2)
C: (1)
G: (7,5,3,0)
T: φ

The length of the wide window is 2m-1.
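
The following Python sketch shows one way the position lists above can drive the search. It assumes the usual formulation of Skip Search, in which the text is probed at positions m-1, 2m-1, 3m-1, ... and every position of the probed character in P gives one candidate alignment. The names skip_search and buckets are ours.

```python
def skip_search(text, pattern):
    """Return all start positions of pattern in text using per-character buckets."""
    n, m = len(text), len(pattern)
    # buckets[c] lists the positions of c in the pattern, in decreasing order.
    buckets = {}
    for i, ch in enumerate(pattern):
        buckets.setdefault(ch, []).insert(0, i)

    matches = []
    # Probe one text character every m positions.
    for j in range(m - 1, n, m):
        for i in buckets.get(text[j], []):
            start = j - i  # alignment that puts pattern[i] under text[j]
            if start + m <= n and text[start:start + m] == pattern:
                matches.append(start)
    return matches


if __name__ == "__main__":
    print(skip_search("GCATCGCAGAGAGTATACAGTACG", "GCAGAGAG"))
```

On the example, the probe at position 7 reads A, whose bucket (6,4,2) yields the candidate alignments 1, 3 and 5; only alignment 5 is an occurrence. The candidates of one probe cover 2m-1 text positions, which is the wide window.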



• The Skip Search Algorithm uses a very special version of Rule 2. In it, the substring is limited to one character.

• The Alpha Skip Search Algorithm instead uses substrings whose length L may be greater than 1, together with a wide window of length 2m-L.


We assume that the size of the alphabet Σ of the text and pattern is σ. In the preprocessing phase, we first use a formula to determine L and then find all substrings of pattern P whose length is L. The positions at which these substrings are located in P are stored in a trie. In the searching phase, we use the information stored in the trie to compare text T with pattern P.


Preprocessing phase

If $\log_\sigma m > 1$, then $L = \lfloor \log_\sigma m \rfloor$, where σ is the size of the alphabet and m is the length of pattern P; otherwise L = 1.

Example:

T = aaaababbababbbbbbaabababababbac

P = ababbaba

σ = 3, m = 8

$L = \lfloor \log_\sigma m \rfloor = \lfloor \log_3 8 \rfloor = 1$

In this case σ is 3 and the length of the pattern is 8, so L is 1; that is, the substrings stored in the trie have length 1.

Trie:

a: [7,5,2,0]
b: [6,4,3,1]
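
The choice of L can be sketched in Python as below. The function name choose_factor_length is hypothetical, and the sketch assumes that the logarithm is rounded down, which is what the slide's value of 1 for σ = 3 and m = 8 implies. Integer arithmetic is used to avoid floating-point rounding in the logarithm.

```python
def choose_factor_length(sigma, m):
    """Return L = floor(log_sigma(m)), but never less than 1.

    Assumes sigma >= 2 and m >= 1.
    """
    length, power = 0, sigma
    # Count how many times sigma can be multiplied in without exceeding m.
    while power <= m:
        length += 1
        power *= sigma
    return max(length, 1)


if __name__ == "__main__":
    print(choose_factor_length(3, 8), choose_factor_length(2, 8))
```

With σ = 3 and m = 8 this gives L = 1, and with σ = 2 and m = 8 it gives L = 3.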


Example:

T = aaaababbababbbbbbaabababababba (positions 0 to 29)

P = ababbaba (positions 0 to 7)

σ = 2, m = 8

$L = \lfloor \log_\sigma m \rfloor = \lfloor \log_2 8 \rfloor = 3$

Trie of the substrings of length 3 of P (each path from the root spells a substring; the leaf holds its starting positions in P):

aba: [5,0]
bab: [4,1]
abb: [2]
bba: [3]

Each leaf of the trie stores, in decreasing order, the positions in P at which its substring occurs.



[Figure sequence: the trie is built by inserting the substrings of length 3 of P = ababbaba one at a time, from position 0 to position 5. The leaves after each insertion are listed below.]

Position 0, insert aba: aba [0]
Position 1, insert bab: aba [0], bab [1]
Position 2, insert abb: aba [0], bab [1], abb [2]
Position 3, insert bba: aba [0], bab [1], abb [2], bba [3]
Position 4, insert bab: aba [0], bab [4,1], abb [2], bba [3]
Position 5, insert aba: aba [5,0], bab [4,1], abb [2], bba [3]
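
The construction above can be sketched in Python with nested dictionaries standing in for trie nodes. The names build_trie and lookup, and the "positions" key used to mark a leaf, are our own choices for illustration and do not come from the slides.

```python
def build_trie(pattern, length):
    """Insert every substring of the given length of pattern into a trie.

    Each node is a dict mapping a character to a child node. The node reached
    after `length` characters holds a "positions" list in decreasing order.
    """
    root = {}
    for i in range(len(pattern) - length + 1):
        node = root
        for ch in pattern[i:i + length]:
            node = node.setdefault(ch, {})
        # Newest position goes first, so the list stays in decreasing order.
        node.setdefault("positions", []).insert(0, i)
    return root


def lookup(trie, factor):
    """Return the positions stored for factor, or [] if factor is not in the trie."""
    node = trie
    for ch in factor:
        node = node.get(ch)
        if node is None:
            return []
    return node.get("positions", [])


if __name__ == "__main__":
    trie = build_trie("ababbaba", 3)
    for factor in ("aba", "bab", "abb", "bba", "bbb"):
        print(factor, lookup(trie, factor))
```

Looking up a factor costs at most L character steps, independent of how many substrings the trie holds.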


Example (continued): T = aaaababbababbbbbbaabababababba, P = ababbaba, σ = 2, m = 8 and $L = \lfloor \log_2 8 \rfloor = 3$.

We use a wide window of length 2m-L. Here its length is 2m-L = 2*8-3 = 13.


Example: searching phase

T = aaaababbababbbbbbaabababababba (positions 0 to 29)

P = ababbaba (positions 0 to 7)

Trie leaves: aba [5,0], bab [4,1], abb [2], bba [3]

The substring of length L = 3 starting at position j of T is looked up in the trie, for j = m-L = 5 and then in steps of m-L+1 = 6.

• j = 5: T[5..7] = abb, which occurs in P at position [2]. P is aligned at position 5-2 = 3 and T[3..10] = ababbaba. Match!

• j = 11: T[11..13] = bbb. No bbb in P.

• j = 17: T[17..19] = aab. No aab in P.

• j = 23: T[23..25] = bab, which occurs in P at positions [4,1]. P is aligned at position 23-4 = 19 (T[19..26] = babababa) and at position 23-1 = 22 (T[22..29] = abababba). Neither alignment matches.

Output: 3
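
The whole search can be sketched in Python as follows. This is an illustrative version under stated assumptions, not the authors' implementation: a dictionary keyed by substring stands in for the trie, L is taken as the rounded-down logarithm (at least 1), and the names alpha_skip_search and table are ours.

```python
def alpha_skip_search(text, pattern, sigma):
    """Return all start positions of pattern in text (Alpha Skip Search sketch).

    Assumes sigma >= 2 and a non-empty pattern.
    """
    n, m = len(text), len(pattern)

    # L = floor(log_sigma(m)), at least 1, computed with integers.
    length, power = 0, sigma
    while power <= m:
        length += 1
        power *= sigma
    length = max(length, 1)

    # Preprocessing: positions of every length-L substring, decreasing order.
    table = {}
    for i in range(m - length + 1):
        table.setdefault(pattern[i:i + length], []).insert(0, i)

    matches = []
    shift = m - length + 1
    # Searching: read one length-L factor of the text every `shift` positions.
    for j in range(m - length, n - length + 1, shift):
        for i in table.get(text[j:j + length], []):
            start = j - i  # alignment that puts pattern[i:i+L] under text[j:j+L]
            if start + m <= n and text[start:start + m] == pattern:
                matches.append(start)
    return matches


if __name__ == "__main__":
    print(alpha_skip_search("aaaababbababbbbbbaabababababba", "ababbaba", 2))
```

Any occurrence of P contains m-L+1 consecutive starting positions for a factor of length L, and exactly one of them is read by the loop, so each occurrence is reported once.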


Time complexity:

• The preprocessing phase takes O(m) time and space.

• The searching phase takes O(mn) time in the worst case.
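
One way to see where the O(mn) bound comes from, given as an illustration rather than a proof from the slides: take T = a^n and P = a^m. Every substring of length L read from T occurs at every position of P, so each of the n-m+1 possible alignments is verified, and each verification compares m characters.

```latex
% Worst-case illustration: T = a^n, P = a^m, every alignment is an occurrence.
\[
  \underbrace{(n - m + 1)}_{\text{alignments verified}}
  \times
  \underbrace{m}_{\text{comparisons per alignment}}
  \;=\; O(mn)
\]
```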
