Multiple Pattern Matching Algorithms on Collage System
T. Kida, T. Matsumoto, M. Takeda, A. Shinohara, and S. Arikawa
Department of Informatics, Kyushu University
Contents
• Compressed pattern matching
• Collage system
• Extension of Output function for multiple patterns
• Multi-pattern version of BM type algorithm
• Parallel complexity
• Conclusion
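Compressed Pattern Matching
[Figure: the original text is compressed into a compressed text. One route decompresses the compressed text and runs a pattern matching machine on the original text; the other runs a compressed pattern matching machine directly on the compressed text.]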
Works on This Study
Compression method | Compressed pattern matching algorithms
Run-length | Eilam-Tzoreff & Vishkin (1988)
Run-length (two dim) | Amir et al. (1992, 1997); Amir & Benson (1992)
LZ77 family | Farach & Thorup (1995); Gąsieniec et al. (1996); Klein & Shapira (2000)
LZ78 family | Amir et al. (1996); Kida et al. (1998, 1999); Navarro & Tarhio (2000); Kärkkäinen et al. (2000)
LZ family | Navarro et al. (1999)
Straight-line programs | Karpinski et al. (1997); Miyazaki et al. (1997); Hirao et al. (2000)
Huffman | Fukamachi et al. (1998); Klein & Shapira (2001); Miyazaki et al. (1998)
Finite state encoding | Takeda (1997)
Word based encoding | Moura et al. (1998)
Pattern substitution | Manber (1994); Shibata et al. (1998)
Antidictionary based | Shibata et al. (1999)
Works on This Study
[Figure: previously, a separate algorithm was designed for each compression method (word-based, LZ78, LZ77). With the collage system, one algorithm for texts represented by a collage system covers the word-based, LZ78, and LZ77 methods.]
Collage system: a unifying framework for compressed pattern matching. T. Kida et al. (1999), SPIRE 1999
Collage System
D :
X1 = a ;
X2 = b ;
X3 = X1 ・ X2 ; (Concatenation)
X4 = X2 ・ X1 ; (Concatenation)
X5 = ( X3 )^3 ; (Repetition)
X6 = [ 3 ]X5 ; (Truncation)
S : X3 X6 X4 X5 X2 X3 X1 X5 X4 X2
Token strings: X3.u = ab, X4.u = ba, X5.u = ababab, X6.u = bab
Text represented by 〈D, S〉: abbabbaabababbabaabababbab
Notations and Definitions
• A collage system is a pair 〈D, S〉
• D : a set of assignments of tokens
– X1 = expr1 ; X2 = expr2 ; ・・・ ; Xn = exprn ;
where each exprk has one of the following forms:
  • a for a ∈ Σ ∪ {ε} (primitive assignment),
  • Xi ・ Xj for i, j < k (concatenation),
  • ( Xi )^j for i < k and an integer j (j times repetition),
  • [ j ]Xi for i < k and an integer j (prefix truncation),
  • Xi^[ j ] for i < k and an integer j (suffix truncation)
– ||D|| = n : the number of tokens defined in D
– X.u : the string represented by a token X
• S : a sequence of tokens defined in D
– Xi1 Xi2 ・・・ Xil (each Xij is a token defined in D)
– |S| = l : the number of tokens in S
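The definitions above can be made concrete with a small decoder. The following is a minimal sketch, not the authors' implementation: the tuple encoding of expressions and the function names `expand` and `decode` are assumptions made for illustration. It expands tokens naively, so it shows only what 〈D, S〉 denotes; it is not a way to search without decompressing.

```python
# Hypothetical tuple encoding of the assignments in D (not from the slides):
#   ("char", a)         primitive assignment, a in the alphabet or ""
#   ("cat", Xi, Xj)     concatenation
#   ("rep", Xi, j)      j times repetition
#   ("ptrunc", j, Xi)   [j]Xi  : prefix truncation, drops the first j characters
#   ("strunc", Xi, j)   Xi^[j] : suffix truncation, drops the last j characters
D = {
    "X1": ("char", "a"),
    "X2": ("char", "b"),
    "X3": ("cat", "X1", "X2"),
    "X4": ("cat", "X2", "X1"),
    "X5": ("rep", "X3", 3),
    "X6": ("ptrunc", 3, "X5"),
}
S = ["X3", "X6", "X4", "X5", "X2", "X3", "X1", "X5", "X4", "X2"]


def expand(D, token, memo=None):
    """Return X.u for one token by naive expansion."""
    if memo is None:
        memo = {}
    if token in memo:
        return memo[token]
    expr = D[token]
    kind = expr[0]
    if kind == "char":
        s = expr[1]
    elif kind == "cat":
        s = expand(D, expr[1], memo) + expand(D, expr[2], memo)
    elif kind == "rep":
        s = expand(D, expr[1], memo) * expr[2]
    elif kind == "ptrunc":
        s = expand(D, expr[2], memo)[expr[1]:]
    elif kind == "strunc":
        t = expand(D, expr[1], memo)
        s = t[:max(len(t) - expr[2], 0)]
    else:
        raise ValueError(f"unknown expression kind: {kind}")
    memo[token] = s
    return s


def decode(D, S):
    """Return the text represented by the collage system <D, S>."""
    memo = {}
    return "".join(expand(D, X, memo) for X in S)
```

With the example system from the earlier slide, `decode(D, S)` yields the 26-character text `abbabbaabababbabaabababbab`, and `expand(D, "X6")` yields `bab`.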
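Height of D
D :
X1 = a ;
X2 = b ;
X3 = X1 ・ X2 ;
X4 = X2 ・ X1 ;
X5 = ( X3 )^3 ;
X6 = [ 3 ]X5 ;
X7 = X6 ・ X4 ;
[Figure: the syntax tree of X7. X7 has children X6 and X4; X6 has child X5; X5 has child X3; X3 has children X1 and X2; X4 has children X2 and X1.]
height(X7) = 4
height(D) = max{height(X) | X ∈ F(D)}
F(D) is the set of tokens defined in D.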
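As an illustration of the height definition, here is a short sketch that computes height(X) over the syntax tree. It assumes a primitive assignment has height 0, which is consistent with height(X7) = 4 on the slide; the tuple encoding and the names `operands`, `height`, and `height_of_system` are hypothetical.

```python
# Same hypothetical tuple encoding as a collage-system decoder would use.
D = {
    "X1": ("char", "a"),
    "X2": ("char", "b"),
    "X3": ("cat", "X1", "X2"),
    "X4": ("cat", "X2", "X1"),
    "X5": ("rep", "X3", 3),
    "X6": ("ptrunc", 3, "X5"),
    "X7": ("cat", "X6", "X4"),
}


def operands(expr):
    """Return the tokens referenced by one expression."""
    kind = expr[0]
    if kind == "char":
        return []
    if kind == "cat":
        return [expr[1], expr[2]]
    if kind in ("rep", "strunc"):
        return [expr[1]]
    if kind == "ptrunc":
        return [expr[2]]
    raise ValueError(f"unknown expression kind: {kind}")


def height(D, token, memo=None):
    """Height of the syntax tree rooted at token (assumed 0 for primitives)."""
    if memo is None:
        memo = {}
    if token not in memo:
        ops = operands(D[token])
        memo[token] = 1 + max(height(D, t, memo) for t in ops) if ops else 0
    return memo[token]


def height_of_system(D):
    """height(D) = max{height(X) | X in F(D)}."""
    memo = {}
    return max(height(D, X, memo) for X in D)
```

For the dictionary above, `height(D, "X7")` is 4 along the path X7, X6, X5, X3, X1.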
Example of Collage System (LZSS [gzip])
D :
X1 = a1 ; X2 = a2 ; ・・・ ; Xq = aq ;
Xq+1 = (( [i1]Xl(1) Xl(1)+1 ・・・ Xr(1) )^m1 )^[ j1 ] b1 ;
Xq+2 = (( [i2]Xl(2) Xl(2)+1 ・・・ Xr(2) )^m2 )^[ j2 ] b2 ;
・・・
Xq+n = (( [in]Xl(n) Xl(n)+1 ・・・ Xr(n) )^mn )^[ jn ] bn ;
S : Xq+1 Xq+2 ・・・ Xq+n
where Σ = {a1, ..., aq}, bk ∈ Σ, and 0 ≤ ik, jk, mk
Pattern Matching on Collage System
[Figure: the KMP automaton for π = ababb, with states 0 to 5 and its goto and failure functions, run over the text represented by S : Xi1 Xi2 Xi3 Xi4.]
original text: a b a b a b b a
state: 0 1 2 3 4 3 4 5 1
Jump( 4, Xi4 ) = 1
Output( 4, Xi4 ) = {3}
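The slide's values can be reproduced by simulating the KMP automaton directly. This is a naive reference sketch under stated assumptions: Jump(q, X) is taken to be the state reached from q after reading X.u, and Output(q, X) the set of lengths |v| of non-empty prefixes v of X.u after which the automaton is in its final state. It scans X.u character by character, so it is not the efficient construction of the paper, and all function names are hypothetical. Xi4.u = abba is inferred from the text and the state sequence shown above.

```python
def kmp_failure(p):
    """fail[q] = length of the longest proper border of p[:q]."""
    fail = [0] * (len(p) + 1)
    k = 0
    for q in range(2, len(p) + 1):
        while k > 0 and p[q - 1] != p[k]:
            k = fail[k]
        if p[q - 1] == p[k]:
            k += 1
        fail[q] = k
    return fail


def step(p, fail, q, c):
    """One transition of the KMP automaton from state q on character c."""
    while q > 0 and (q == len(p) or p[q] != c):
        q = fail[q]
    if q < len(p) and p[q] == c:
        q += 1
    return q


def jump_and_output(p, q, xu):
    """Naive Jump(q, X) and Output(q, X), computed by scanning X.u."""
    fail = kmp_failure(p)
    out = set()
    for i, c in enumerate(xu, start=1):
        q = step(p, fail, q, c)
        if q == len(p):
            out.add(i)  # an occurrence of p ends after i characters of X.u
    return q, out


def state_sequence(p, text):
    """States visited while reading text from state 0 (initial state omitted)."""
    fail = kmp_failure(p)
    q, seq = 0, []
    for c in text:
        q = step(p, fail, q, c)
        seq.append(q)
    return seq
```

Under these assumptions, `jump_and_output("ababb", 4, "abba")` returns `(1, {3})`, matching Jump(4, Xi4) = 1 and Output(4, Xi4) = {3}.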
Pattern Matching on Collage System
• With truncation (LZ77, LZSS): O( (||D||+|S|) ・ height(D) + |π|^2 + r ) time
• No truncation (LZ78, LZW, Sequitur, BPE): O( ||D|| + |S| + |π|^2 + r ) time
• O( ||D|| + |π|^2 ) space
r is the number of pattern occurrences
Extension of Output function for multiple patterns
Basic Idea
• Simulate the move of the Aho-Corasick (AC) pattern matching machine
[Figure: the AC machine for Π = {aba, ababb, abca, bb}, with states 0 to 9, its goto function, its failure function, and the outputs {aba}, {ababb, bb}, {abca}, {bb}.]
Jump( q, X ) = δAC( q, X.u )
Output( q, X ) = { 〈|v|, π〉 | π ∈ o( δAC( q, v ) ), v is a prefix of X.u }
( δAC is the transition function of the AC machine )
( o is the output function of the AC machine )
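The two definitions can be checked against a plain Aho-Corasick machine. The sketch below is an assumption-laden reference version, not the paper's algorithm: it builds the machine with dictionaries, reads X.u character by character, and considers only non-empty prefixes v. The names `build_ac`, `delta`, and `jump_and_output` are hypothetical, and state numbers depend on insertion order, so they need not match the figure.

```python
from collections import deque


def build_ac(patterns):
    """Build goto, failure, and output tables of an Aho-Corasick machine."""
    goto, out = [{}], [set()]
    for p in patterns:
        q = 0
        for c in p:
            if c not in goto[q]:
                goto.append({})
                out.append(set())
                goto[q][c] = len(goto) - 1
            q = goto[q][c]
        out[q].add(p)
    fail = [0] * len(goto)
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:
        q = queue.popleft()
        for c, nxt in goto[q].items():
            f = fail[q]
            while f and c not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(c, 0)
            out[nxt] |= out[fail[nxt]]  # inherit outputs along the failure link
            queue.append(nxt)
    return goto, fail, out


def delta(ac, q, c):
    """Transition function of the AC machine for a single character."""
    goto, fail, _ = ac
    while q and c not in goto[q]:
        q = fail[q]
    return goto[q].get(c, 0)


def jump_and_output(ac, q, xu):
    """Naive Jump(q, X) and Output(q, X) as a set of (|v|, pattern) pairs."""
    _, _, out = ac
    found = set()
    for i, c in enumerate(xu, start=1):
        q = delta(ac, q, c)
        found |= {(i, p) for p in out[q]}
    return q, found


ac = build_ac(["aba", "ababb", "abca", "bb"])
```

Reading `ababb` from the initial state reports `aba` after 3 characters and both `ababb` and `bb` after 5, which agrees with the output sets {aba} and {ababb, bb} in the figure.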
Enumeration of Output(q, X)
• Enumerate Output( q, X ) → Enumerate Occ( Π, X.u )
• Enumerate for each case of X, e.g. enumerate Occ*( Π, Y.u Z.u ) for X = YZ
[Figure: pattern occurrences across the boundary between Y.u and Z.u, comparing the single pattern case (period?) with the multiple pattern case.]
Enumeration of Occ*(Π, xy)
• O(m^2) time and space preprocessing
• Π = {abcabc, cabb, abca}
[Figure: a table indexed by pairs (px, py), with axes labelled "Suffixes of" and "Prefixes of"; the sample strings shown are abca, ca, a and bc, bca, bcabc. Each entry lists pattern numbers (for example 1, 3, or 1 and 3) or nil.]
m is the total length of the patterns in Π
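To make the target of this enumeration concrete, here is a brute-force sketch. It assumes Occ*(Π, xy) denotes the pattern occurrences in xy that cover the boundary between x and y, that is, they start inside x and end inside y. It does not use the O(m^2) table described above, and the name `occ_star` is hypothetical.

```python
def occ_star(patterns, x, y):
    """Return {(end position in xy, pattern)} for occurrences crossing the x|y boundary."""
    text = x + y
    found = set()
    for p in patterns:
        start = text.find(p)
        while start != -1:
            end = start + len(p)
            if start < len(x) < end:  # starts inside x and ends inside y
                found.add((end, p))
            start = text.find(p, start + 1)
    return found


PATTERNS = ["abcabc", "cabb", "abca"]
```

For instance, with x = `a` and y = `bcabc`, both `abcabc` and `abca` cross the boundary, while with x = `abca` and y = `bc` only `abcabc` does.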
Enumeration of Occ(Π, (Y.u)^k)
• Reduce to the single pattern case
– If Y.u Y.u is a substring of a pattern in Π,
  • Add a list of the patterns that occur in X.u while covering (Y.u)^2.
– The number of substrings that are squares is O(m), so the lists take O(m^2) space.
[Figure: a generalized suffix trie (GST) in which the node for (Y.u)^2 carries the list {π1, π3, π6}, shown against X = Y^k. (Y.u)^2 is a substring of π1, and |Y.u| is a period of π1 (same for π3, π6).]
m is the total length of the patterns in Π
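The condition attached to each list in the figure can be tested directly. The following sketch is an illustration only: it checks, for a given string y standing in for Y.u, which patterns contain y·y as a substring and have |y| as a period, and it uses brute-force search to show that such a pattern then recurs at every shift of |y| inside y^k. The function names are hypothetical and no suffix trie is built.

```python
def has_period(s, p):
    """True if p is a period of s, i.e. s[i] == s[i + p] wherever both are defined."""
    return 0 < p <= len(s) and all(s[i] == s[i + p] for i in range(len(s) - p))


def patterns_covering_square(patterns, y):
    """Patterns that contain y*2 as a substring and have |y| as a period."""
    square = y + y
    return [p for p in patterns if square in p and has_period(p, len(y))]


def find_all(text, p):
    """All start positions of p in text (overlapping occurrences included)."""
    positions, i = [], text.find(p)
    while i != -1:
        positions.append(i)
        i = text.find(p, i + 1)
    return positions
```

With y = `abc`, the pattern `abcabc` qualifies, and its occurrences in `abc` repeated four times start at 0, 3, and 6, one per shift of |y|. A pattern such as `xabcabc` contains the square but does not have period 3, so it is excluded.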
Our Results
Theorem. The multiple pattern matching problem for a text represented by a collage system 〈D, S〉 can be solved in O( ( ||D|| + |S| ) ・ height(D) + m^2 + r ) time, using O( ||D|| + m^2 ) space.
If D contains no truncation operation, it can be solved in O( ||D|| + |S| + m^2 + r ) time.
m is the total length of the patterns in Π; r is the number of pattern occurrences.
Multi-pattern version of BM type algorithm
Boyer-Moore type algorithm
• A Boyer-Moore type algorithm for compressed pattern matching, CPM2000
– Y. Shibata, T. Matsumoto, M. Takeda, A. Shinohara, and S. Arikawa
– O( (||D||+|S|) ・ height(D) + |π| ・ |S| + |π|^2 + r ) time
– O( ||D|| + |π|^2 ) space
– If no truncation, O( ||D|| + |S| + |π| ・ |S| + |π|^2 + r ) time
Theorem. The BM type algorithm for multiple pattern searching on collage system runs in
– O( (||D||+|S|) ・ height(D) + m ・ |S| + m^2 + r ) time
– O( ||D|| + m^2 ) space
– If no truncation, O( ||D|| + |S| + m ・ |S| + m^2 + r ) time
r is the number of pattern occurrences; m is the total length of the patterns in Π.
Boyer-Moore Type Algorithm
[Figure: the token sequence S = ・・・ Xi1 Xi2 Xi3 Xi4 Xi5 Xi6 Xi7 ・・・ over the original text ・・・CTTAATTAAGCCTGCTAAGCAT・・・, marking the pattern occurrences and the shift by Δ.]
1. Enumerate Occ( Π, S[i].u )
2. Enumerate Occ*( Π, q S[i].u )
3. Calculate the maximal safe shift Δ
  • Calculate Shift( lpps( S[i+1].u ), S[i] )
  • Calculate the smallest k s.t.
    Shift( lpps( S[i+1].u ), S[i] ) ≤ Σ_{j=0..k} |S[i+j].u| − |lpps( S[i].u )|
4. i := i + Δ
(Callouts on the slide: "Same way as AC type"; "O(m)".)
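Calculate Shift(lpps(S[i+1].u), S[i])
rightmost_occπ(w) = min{ l > 0 | π[ m − l − |w| : m − l ] = w, or π[ 1 : m − l ] is a suffix of w }
[Figure: the pattern π aligned against the text, with w matched either inside π at distance l from its right end or with a prefix of π matching a suffix of w.]
rightmost_occ(w) = min{ rightmost_occπ(w) | π ∈ Π }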
Calculate Shift(lpps(S[i+1].u), S[i])
Shift( lpps( S[i+1].u ), X ) = rightmost_occ( X.u ・ lpps( S[i+1].u ) )
O( ||D|| ・ height(D) + m^2 ) time and O( ||D|| + m^2 ) space
[Figure: an example around the token S[i] in which Shift = 3.]
Shift( lpps( S[i+1].u ), S[i] ) ≤ Σ_{j=0..k} |S[i+j].u| − |lpps( S[i].u )|
Experimental Result
AlphaStation XP1000 (Alpha21264: 667MHz), Tru64 UNIX V4.0F
Medline (English text), 60.3 Mbyte
[Figure: CPU time (seconds, axis from 0.0 to 0.8) against pattern length (5 to 30) for four methods:
– search of uncompressed texts with the KMP method,
– search of uncompressed texts with Agrep,
– search of texts compressed by BPE with the AC type algorithm,
– search of texts compressed by BPE with the BM type algorithm.]
* Agrep is a search tool developed by Wu and Manber.
* BPE: Byte Pair Encoding
* A single pattern was given as input.
Parallel complexity of compressed pattern matching
Problem to consider
• Instance: A regular collage system 〈D, S〉 and a set Π = {π1, π2, ..., πs} of patterns.
  (A regular collage system contains no truncation and no repetition.)
• Question: Is there any pattern πj that occurs in the text T represented by 〈D, S〉?
The problem is in LogCFL.
LogCFL ⊆ NC^2, so it can be efficiently parallelized!
* LogCFL is the class of problems logspace-reducible to a context-free language.
Idea of the Proof
• Using the lemma of I. Sudborough
– LogCFL = AuxPDA( log n, n^O(1) ), i.e. using a log n space worktape in n^O(1) time
* AuxPDA is an auxiliary pushdown automaton: a nondeterministic Turing machine with a pushdown store. The space of the pushdown store is not bounded.
• Show that such an AuxPDA M accepts an input string if and only if there is some pattern that occurs in the text represented by 〈D, S〉.
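AuxPDA M
[Figure: the machine M with its input tape ¢ π1 # π2 # ・・・ # πs & Xi1 Xi2 ・・・ Xin $, a worktape, and a pushdown store. For a token Xik and a pattern πj, the test on Occ( πj, Xik.u ) is carried out by checking Xik.u[ l ] = πj[ t ] for positions l and t.]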
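The elementary test Xik.u[l] = πj[t] can be carried out without expanding the token. The sketch below is a sequential illustration only and is not the AuxPDA construction: it precomputes the lengths |X.u|, which can need far more than logarithmic space, whereas M works with a log n worktape and a pushdown store. It assumes a regular collage system (primitive assignments and concatenation only), a hypothetical tuple encoding listed in definition order, and 1-indexed positions.

```python
# Hypothetical encoding of a regular collage system, in definition order:
#   ("char", a)      primitive assignment
#   ("cat", Xi, Xj)  concatenation
D = {
    "X1": ("char", "a"),
    "X2": ("char", "b"),
    "X3": ("cat", "X1", "X2"),   # ab
    "X4": ("cat", "X2", "X1"),   # ba
    "X5": ("cat", "X3", "X4"),   # abba
    "X6": ("cat", "X5", "X3"),   # abbaab
}


def lengths(D):
    """|X.u| for every token, relying on D being listed in definition order."""
    n = {}
    for name, expr in D.items():
        n[name] = len(expr[1]) if expr[0] == "char" else n[expr[1]] + n[expr[2]]
    return n


def char_at(D, n, token, l):
    """Return X.u[l] (1-indexed) by walking down the syntax tree of the token."""
    if not 1 <= l <= n[token]:
        raise IndexError("position outside X.u")
    while D[token][0] != "char":
        _, left, right = D[token]
        if l <= n[left]:
            token = left
        else:
            l -= n[left]
            token = right
    return D[token][1]


def occurs_in_token(D, n, token, pattern):
    """Naive check whether pattern occurs in X.u using only char_at comparisons."""
    m = len(pattern)
    return any(
        all(char_at(D, n, token, start + t) == pattern[t] for t in range(m))
        for start in range(1, n[token] - m + 2)
    )
```

Here `X6.u` is `abbaab`, so `baa` occurs in it and `bab` does not; each comparison follows a single root-to-leaf path in D.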
Conclusion
• Collage system is a formal system
– Texts compressed by various compression methods can be expressed by collage systems.
• Two types of algorithms for multiple pattern matching on collage system
– AC type
  • O( ( ||D|| + |S| ) ・ height(D) + m^2 + r ) time
  • O( ||D|| + m^2 ) space
– BM type
  • O() time and O() space
• Compressed pattern matching can be efficiently parallelized in principle.
– For regular collage systems
– Not yet for general collage systems