Post on 19-Dec-2015
1
Alpha Skip Search Algorithm
Very Fast String Matching Algorithm for Small Alphabets and Long Patterns, Charras, C., Lecroq, T. and Pehoushek, J. D., Lecture Notes in Computer Science, Vol. 1448, 1998, pp. 55-64
Advisor: Prof. R. C. T. Lee
Reporter: Z. H. Pan
2
The Exact String Matching Problem: We are given a text string T of length n and a pattern string P of length m, and we want to find all occurrences of P in T.
Example:
Input: P = CCTA and a text T of length 18 (positions 0 to 17)
[Figure: T with P aligned under each of its two occurrences]
There are two occurrences of P in T, starting at positions 2 and 10.
Output: 2, 10
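The problem statement above can be sketched as a brute-force baseline. This is only an illustration of the definition, not one of the algorithms discussed in these slides. The function name `find_all` is hypothetical, and the text used in the usage line is an assumed 18-character string chosen so that the output agrees with the example (the text in the original figure is not recoverable).

```python
def find_all(text, pattern):
    """Return every start position s such that text[s:s+m] == pattern."""
    n, m = len(text), len(pattern)
    # Try each possible alignment of the pattern against the text.
    return [s for s in range(n - m + 1) if text[s:s + m] == pattern]


# Assumed illustrative text of length 18 (not taken from the slides).
T = "AGCCTAAGCTCCTAAGCT"
P = "CCTA"
print(find_all(T, P))
```

This baseline compares P at every position of T; the rules that follow aim to skip most of those alignments.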
3
• The Alpha Skip Search Algorithm is an improvement of the Skip Search Algorithm.
• The Skip Search Algorithm uses Rule 2 (the substring matching rule) and Rule 4 (the two window rule).
4
Rule 2: The Substring Matching Rule
• For any substring u in T, find the nearest occurrence of u in P that lies to the left of it. If such a u exists in P, move P so that the two u's are aligned; otherwise, we may define a new partial window.
[Figure: T and P before and after the shift, with the two occurrences of u aligned]
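As a rough sketch of Rule 2, the shift can be computed by scanning P leftwards for the nearest occurrence of u. The function name `substring_rule_shift` and the parameter `k` (the position in P currently lying under the first character of u) are assumptions introduced for this illustration; the case where no u exists in P is simply reported as `None`.

```python
def substring_rule_shift(pattern, u, k):
    """Return how far to move the pattern to the right so that the nearest
    occurrence of u strictly to the left of position k lines up with u.

    k is the position in the pattern currently under the first character
    of u. Returns None when u does not occur to the left of k.
    """
    for i in range(k - 1, -1, -1):      # scan leftwards from k - 1
        if pattern.startswith(u, i):    # u occurs in the pattern at i
            return k - i
    return None


print(substring_rule_shift("GCAGAGAG", "AG", 6))
```

With P = GCAGAGAG and u = AG under positions 6 and 7, the nearest AG to the left starts at position 4, so the pattern moves by 2.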
5
Rule 2-2: 1-Suffix Rule (A Special Version of Rule 2)
• Consider the 1-suffix x, that is, the last character of the window in T. Rule 2 is applied with u = x.
[Figure: x in T aligned with an occurrence of x in P]
6
Rule 4: Two Window Rule
Example: P = CTTA
[Figure: two consecutive windows w1 and w2 of T]
No prefix of P = a suffix of w1.
No suffix of P = a prefix of w2.
Therefore no occurrence of P can span w1 and w2.
[Figure: two consecutive windows w3 and w4 of T, with P aligned across them]
Matched!
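The two window rule can be illustrated with a small check of whether P can lie across the boundary between two adjacent windows. This is a sketch under assumptions: the function name `occurrence_spans_windows` and the window contents in the usage lines are hypothetical, not taken from the figure.

```python
def occurrence_spans_windows(pattern, w1, w2):
    """Return True if some occurrence of pattern starts in w1 and ends in w2.

    Such an occurrence needs a proper prefix of the pattern that is a
    suffix of w1, with the remaining suffix of the pattern a prefix of w2.
    """
    m = len(pattern)
    for k in range(1, m):  # k = number of pattern characters inside w1
        if w1.endswith(pattern[:k]) and w2.startswith(pattern[k:]):
            return True
    return False


# Hypothetical windows: no prefix of CTTA ends w1, no suffix of CTTA starts w2.
print(occurrence_spans_windows("CTTA", "GGCG", "GCAT"))
# Hypothetical windows where CTTA lies across the boundary.
print(occurrence_spans_windows("CTTA", "GGCT", "TAGG"))
```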
7
The Skip Search Algorithm
• The Skip Search Algorithm uses Rule 2-2 together with Rule 4 in a very clever way.
Example:
     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
T :  G  C  A  T  C  G  C  A  G  A  G  A  G  T  A  T  A  C  A  G  T  A  C  G
     0  1  2  3  4  5  6  7
P :  G  C  A  G  A  G  A  G
The length of the pattern is m. The two windows together form a wide window of length 2m-1.
8
Example:
     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
T :  G  C  A  T  C  G  C  A  G  A  G  A  G  T  A  T  A  C  A  G  T  A  C  G
     0  1  2  3  4  5  6  7
P :  G  C  A  G  A  G  A  G
Positions of each character in P, in decreasing order:
A : (6,4,2)
C : (1)
G : (7,5,3,0)
T : φ
The length of the wide window formed by two windows is 2m-1.
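The table of character positions and its use during the search can be sketched as follows. This is a minimal illustration assuming the usual formulation of Skip Search, in which every m-th character of T is examined and each position listed for that character gives one candidate alignment; the function names `build_buckets` and `skip_search` are hypothetical.

```python
from collections import defaultdict


def build_buckets(pattern):
    """Map each character of the pattern to its positions, largest first."""
    buckets = defaultdict(list)
    for i in range(len(pattern) - 1, -1, -1):
        buckets[pattern[i]].append(i)
    return dict(buckets)


def skip_search(text, pattern):
    """Return the start positions of all occurrences of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    buckets = build_buckets(pattern)
    matches = []
    # Examine text[m-1], text[2m-1], ...: every occurrence covers exactly one.
    for j in range(m - 1, n, m):
        for i in buckets.get(text[j], []):
            start = j - i  # align pattern[i] with text[j]
            if start + m <= n and text.startswith(pattern, start):
                matches.append(start)
    return matches


T = "GCATCGCAGAGAGTATACAGTACG"
P = "GCAGAGAG"
print(build_buckets(P))
print(skip_search(T, P))
```

On the example, the first examined character is T[7] = A, whose positions (6,4,2) give the candidate starts 1, 3 and 5. Each candidate lies inside the wide window T[0..14] of length 2m-1 = 15 centred on T[7].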
11
• The Skip Search Algorithm uses a very special version of Rule 2, in which the substring is limited to one character.
• The Alpha Skip Search Algorithm instead uses substrings whose length L may be greater than 1, together with a wide window of length 2m-L.
12
We assume that the alphabet Σ of the text and the pattern has size σ. In the preprocessing phase, we first use a formula to determine L and then find all substrings of the pattern P whose length is L. The positions at which these substrings occur in P are stored in a trie. In the searching phase, we use the information stored in the trie to compare the text T with the pattern P.
13
Preprocessing phase
If log_σ m > 1, L = ⌊log_σ m⌋, where σ is the size of the alphabet and m is the length of the pattern P; otherwise L = 1.
Example:
T = aaaababbababbbbbbaabababababbac
P = ababbaba
σ = 3, m = 8
L = ⌊log_σ m⌋ = ⌊log_3 8⌋ = 1
In this case σ is 3 and the length of the pattern is 8, so L is 1; that is, the substrings stored in the trie have length 1.
Trie:
a : [7,5,2,0]
b : [6,4,3,1]
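The choice of L can be sketched with integer arithmetic, which avoids rounding problems with floating-point logarithms. The function name `choose_l` is hypothetical, and taking the floor of log_σ m is an assumption that agrees with both examples in these slides (σ = 3, m = 8 gives 1; σ = 2, m = 8 gives 3).

```python
def choose_l(sigma, m):
    """Return the largest L >= 1 with sigma**L <= m, or 1 if there is none."""
    if sigma < 2:
        raise ValueError("the alphabet must contain at least two symbols")
    length = 1
    # Grow L while the next power of sigma still does not exceed m.
    while sigma ** (length + 1) <= m:
        length += 1
    return length


print(choose_l(3, 8))
print(choose_l(2, 8))
```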
14
Example:
     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
T :  a  a  a  a  b  a  b  b  a  b  a  b  b  b  b  b  b  a  a  b  a  b  a  b  a  b  a  b  b  a
     0  1  2  3  4  5  6  7
P :  a  b  a  b  b  a  b  a
σ = 2, m = 8
L = log_σ m = log_2 8 = 3
Substrings of P of length 3 and the positions stored at their leaves:
aba : [5,0]
bab : [4,1]
abb : [2]
bba : [3]
Each leaf of the trie stores, in decreasing order, the positions in P at which its substring starts.
15
Example:
     0  1  2  3  4  5  6  7
P :  a  b  a  b  b  a  b  a
Trie:
root
  a
    b
      a  [5,0]
      b  [2]
  b
    a
      b  [4,1]
    b
      a  [3]
16
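     0  1  2  3  4  5  6  7
P :  a  b  a  b  b  a  b  a
The substrings of length 3 are inserted into the trie from left to right:
Position 0, aba : aba [0]
Position 1, bab : aba [0], bab [1]
Position 2, abb : aba [0], bab [1], abb [2]
Position 3, bba : aba [0], bab [1], abb [2], bba [3]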
17
     0  1  2  3  4  5  6  7
P :  a  b  a  b  b  a  b  a
Position 4, bab : aba [0], bab [4,1], abb [2], bba [3]
Position 5, aba : aba [5,0], bab [4,1], abb [2], bba [3]
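The construction steps above can be sketched with a trie made of nested dictionaries. This is one possible representation, not necessarily the one used by the authors; the function name `build_trie` and the leaf key `"positions"` are assumptions made for this illustration.

```python
LEAF = "positions"  # hypothetical key under which a leaf keeps its list


def build_trie(pattern, length):
    """Insert every substring of the given length into a nested-dict trie.

    Each leaf keeps the start positions of its substring in decreasing
    order, so the most recently inserted position comes first.
    """
    root = {}
    for i in range(len(pattern) - length + 1):
        node = root
        for ch in pattern[i:i + length]:
            node = node.setdefault(ch, {})      # descend, creating nodes
        node.setdefault(LEAF, []).insert(0, i)  # newest position in front
    return root


trie = build_trie("ababbaba", 3)
print(trie["a"]["b"]["a"][LEAF])
print(trie["b"]["a"]["b"][LEAF])
```

Inserting positions from left to right and placing each new one at the front of the list gives the decreasing order shown at the leaves.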
18
Example:
     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
T :  a  a  a  a  b  a  b  b  a  b  a  b  b  b  b  b  b  a  a  b  a  b  a  b  a  b  a  b  b  a
     0  1  2  3  4  5  6  7
P :  a  b  a  b  b  a  b  a
σ = 2, m = 8
L = log_σ m = log_2 8 = 3
We use a wide window of length 2m-L. Here its length is 2m-L = 2*8-3 = 13.
19
Example:
T = aaaababbababbbbbbaabababababba (positions 0 to 29)
P = ababbaba (positions 0 to 7)
Trie leaves: aba [5,0], bab [4,1], abb [2], bba [3]
The first substring examined is T[5..7] = abb. It occurs in P at position 2, so P is aligned with T[3..10].
T[3..10] = ababbaba = P. Match!
20
The examined substring moves m-L+1 = 6 positions to the right each time.
T[11..13] = bbb. No bbb in P, so no alignment is tried.
T[17..19] = aab. No aab in P, so no alignment is tried.
21
T[23..25] = bab. It occurs in P at positions 4 and 1, so P is aligned with T[19..26] and then with T[22..29].
T[19..26] = babababa and T[22..29] = abababba. Neither is equal to P.
The search ends with one occurrence of P in T, at position 3.
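The searching phase traced in this example can be sketched end to end as follows. This is a minimal illustration under assumptions: the names `build_trie`, `lookup` and `alpha_skip_search` are hypothetical, the trie is a nested dictionary, L is taken as the floor of log_σ m, and the examined substrings start at positions m-L, 2m-2L+1, and so on, which is consistent with the substrings abb, bbb and aab appearing in the example.

```python
LEAF = "positions"  # hypothetical key under which a leaf keeps its list


def build_trie(pattern, length):
    """Trie of all substrings of the given length, positions decreasing."""
    root = {}
    for i in range(len(pattern) - length + 1):
        node = root
        for ch in pattern[i:i + length]:
            node = node.setdefault(ch, {})
        node.setdefault(LEAF, []).insert(0, i)
    return root


def lookup(trie, factor):
    """Return the positions stored for factor, or [] if it is not in P."""
    node = trie
    for ch in factor:
        node = node.get(ch)
        if node is None:
            return []
    return node.get(LEAF, [])


def alpha_skip_search(text, pattern, sigma):
    """Return the start positions of all occurrences of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    length = 1
    while sigma ** (length + 1) <= m:  # L = floor(log_sigma m), at least 1
        length += 1
    trie = build_trie(pattern, length)
    matches = []
    # Examine one substring of length L every m - L + 1 positions.
    for j in range(m - length, n - length + 1, m - length + 1):
        for i in lookup(trie, text[j:j + length]):
            start = j - i  # align P[i..i+L-1] with T[j..j+L-1]
            if text.startswith(pattern, start):
                matches.append(start)
    return matches


T = "aaaababbababbbbbbaabababababba"
P = "ababbaba"
print(alpha_skip_search(T, P, 2))
```

Each occurrence of P contains exactly one examined substring, so every occurrence is reported once and none is missed.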
22
Time complexity:
The preprocessing phase takes O(m) time and space. The searching phase takes O(mn) time in the worst case.