Multiple Pattern Matching Algorithms on Collage System

T. Kida, T. Matsumoto, M. Takeda, A. Shinohara, and S. Arikawa

Department of Informatics, Kyushu University

Contents

• Compressed pattern matching

• Collage system

• Extension of Output function for multiple patterns

• Multi-pattern version of BM type algorithm

• Parallel complexity

• Conclusion

Compressed Pattern Matching

[Figure: two ways to search a compressed text. Conventional route: compressed text → decompress → original text → pattern matching machine. Direct route: compressed text → compressed pattern matching machine.]

Works on This Study

Compression method        Compressed pattern matching algorithms
Run-length                Eilam-Tzoreff & Vishkin (1988)
Run-length (two dim)      Amir et al. (1992, 1997); Amir & Benson (1992)
LZ77 family               Farach & Thorup (1995); Gąsieniec et al. (1996); Klein & Shapira (2000)
LZ78 family               Amir et al. (1996); Kida et al. (1998, 1999); Navarro & Tarhio (2000); Kärkkäinen et al. (2000)
LZ family                 Navarro et al. (1999)
Straight-line programs    Karpinski et al. (1997); Miyazaki et al. (1997); Hirao et al. (2000)
Huffman                   Fukamachi et al. (1998); Klein & Shapira (2001); Miyazaki et al. (1998)
Finite state encoding     Takeda (1997)
Word based encoding       Moura et al. (1998)
Pattern substitution      Manber (1994); Shibata et al. (1998)
Antidictionary based      Shibata et al. (1999)

Works on This Study

Previous
– Algorithm for word-based method → Word-based
– Algorithm for LZ78 → LZ78
– Algorithm for LZ77 → LZ77

Algorithm for texts represented by collage system → Collage System
– Word-based, LZ78, and LZ77 are all expressed as collage systems

Collage System: A unifying framework for compressed pattern matching.
T. Kida et al. (1999), SPIRE 1999

Collage System

D :
  X1 = a ;
  X2 = b ;
  X3 = X1 ・ X2 ;     (Concatenation)
  X4 = X2 ・ X1 ;     (Concatenation)
  X5 = ( X3 )^3 ;     (Repetition)
  X6 = [ 3 ]X5 ;      (Truncation)

S : X3 X6 X4 X5 X2 X3 X1 X5 X4 X2

Represented text: abbabbaabababbabaabababbab
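The example above can be checked mechanically. The following is a minimal Python sketch, not taken from the slides: the tuple encoding of the assignments and the names `D`, `S`, and `expand` are illustrative assumptions. It takes `[j]X` to mean "remove the first j characters" and `X[j]` to mean "remove the last j characters", which is consistent with the 26-character text shown on the slide.

```python
from functools import lru_cache

# Hypothetical encoding of the example dictionary D (illustrative only).
D = {
    1: ("prim", "a"),
    2: ("prim", "b"),
    3: ("cat", 1, 2),       # X3 = X1 . X2
    4: ("cat", 2, 1),       # X4 = X2 . X1
    5: ("rep", 3, 3),       # X5 = (X3)^3
    6: ("ptrunc", 5, 3),    # X6 = [3]X5, drop the first 3 characters
}
S = [3, 6, 4, 5, 2, 3, 1, 5, 4, 2]


@lru_cache(maxsize=None)
def expand(k):
    """Return X_k.u, the string represented by token X_k."""
    op = D[k]
    if op[0] == "prim":
        return op[1]
    if op[0] == "cat":
        return expand(op[1]) + expand(op[2])
    if op[0] == "rep":
        return expand(op[1]) * op[2]
    if op[0] == "ptrunc":               # prefix truncation [j]X
        return expand(op[1])[op[2]:]
    if op[0] == "strunc":               # suffix truncation X[j]
        u = expand(op[1])
        return u[:len(u) - op[2]]
    raise ValueError(f"unknown operation: {op[0]}")


text = "".join(expand(k) for k in S)
print(text)
```

Full expansion is shown only to make the semantics concrete; compressed pattern matching exists precisely to avoid materializing this string.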

Notations and Definitions

• Collage system is a pair 〈D, S〉
• D : a set of assignments of tokens
  – X1 = expr1 ; X2 = expr2 ; ・・・ ; Xn = exprn ;
    where each exprk is any of the form
    • a for a ∈ Σ ∪ {ε}  (primitive assignment)
    • Xi ・ Xj for i, j < k  (concatenation)
    • ( Xi )^j for i < k and an integer j  (j times repetition)
    • [ j ]Xi for i < k and an integer j  (prefix truncation)
    • Xi[ j ] for i < k and an integer j  (suffix truncation)
  – ||D|| = n : the number of tokens defined in D
  – X.u : the string represented by a token X
• S : a sequence of tokens defined in D
  – Xi1 Xi2 ・・・ Xil (each Xij is a token defined in D)
  – |S| = l : the number of tokens in S

Height of D

D :
  X1 = a ;
  X2 = b ;
  X3 = X1 ・ X2 ;
  X4 = X2 ・ X1 ;
  X5 = ( X3 )^3 ;
  X6 = [ 3 ]X5 ;
  X7 = X6 ・ X4 ;

[Figure: derivation tree of X7. X7 has children X6 and X4; X6 has child X5; X5 has child X3; X3 has children X1 and X2; X4 has children X2 and X1.]

height(X7) = 4

height(D) = max{ height(X) | X ∈ F(D) }

F(D) is the set of tokens defined in D.

Example of Collage System (LZSS [gzip])

D :
  X_1 = a_1 ; X_2 = a_2 ; ・・・ ; X_q = a_q ;
  X_{q+1} = ( ( [i_1] X_{l(1)} X_{l(1)+1} ・・・ X_{r(1)} )^{m_1} )[j_1] b_1 ;
  X_{q+2} = ( ( [i_2] X_{l(2)} X_{l(2)+1} ・・・ X_{r(2)} )^{m_2} )[j_2] b_2 ;
    ・・・
  X_{q+n} = ( ( [i_n] X_{l(n)} X_{l(n)+1} ・・・ X_{r(n)} )^{m_n} )[j_n] b_n ;

S : X_{q+1} X_{q+2} ・・・ X_{q+n}

where Σ = {a_1, ..., a_q}, b_k ∈ Σ, and 0 ≤ i_k, j_k, m_k

Pattern Matching on Collage System

[Figure: KMP automaton for π = ababb, states 0 to 5, with goto function and failure function.]

original text: abababba
S : Xi1 Xi2 Xi3 Xi4
state: 0 1 2 3 4 3 4 5 1

Jump( 4, Xi4 ) = 1
Output( 4, Xi4 ) = {3}
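The two functions on this slide can be illustrated with a small Python sketch. Everything here is an assumption for illustration: the automaton construction is a textbook deterministic KMP automaton, the function names are hypothetical, and Xi4.u is taken to be `abba`, the only choice consistent with Jump(4, Xi4) = 1 and Output(4, Xi4) = {3} on the text abababba. The sketch scans X.u character by character, which only demonstrates the definitions; the algorithm's efficiency comes from obtaining these values per token without such a scan.

```python
def kmp_automaton(pattern, alphabet):
    """Deterministic KMP automaton: delta[state][char] -> next state."""
    m = len(pattern)
    delta = [{c: 0 for c in alphabet} for _ in range(m + 1)]
    delta[0][pattern[0]] = 1
    x = 0  # restart state: where the automaton is after reading pattern[1:j]
    for j in range(1, m + 1):
        for c in alphabet:
            delta[j][c] = delta[x][c]          # mismatch: behave like state x
        if j < m:
            delta[j][pattern[j]] = j + 1       # match: advance
            x = delta[x][pattern[j]]
    return delta


def jump(delta, q, u):
    """State reached from q after reading the whole string u."""
    for c in u:
        q = delta[q][c]
    return q


def output(delta, q, u, m):
    """Lengths |v| of prefixes v of u after which the pattern has just occurred."""
    hits = set()
    for i, c in enumerate(u, start=1):
        q = delta[q][c]
        if q == m:
            hits.add(i)
    return hits


pattern = "ababb"
delta = kmp_automaton(pattern, "ab")
print(jump(delta, 4, "abba"), output(delta, 4, "abba", len(pattern)))
```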

Pattern Matching on Collage System

General case (with truncation; e.g. LZ77, LZSS)
– O( (||D|| + |S|) ・ height(D) + |π|^2 + r ) time
– O( ||D|| + |π|^2 ) space

No truncation (e.g. LZ78, LZW, BPE, Sequitur)
– O( ||D|| + |S| + |π|^2 + r ) time

r is the number of pattern occurrences

Extension of Output function for multiple patterns

Basic Idea

• Simulate the move of Aho-Corasick pattern matching machine

[Figure: AC machine for Π = {aba, ababb, abca, bb}, states 0 to 9, showing the goto function, the failure function, and the outputs {aba}, {ababb, bb}, {abca}, {bb}.]

Jump( q, X ) = δ_AC( q, X.u )

Output( q, X ) = { 〈|v|, π〉 | v is a non-empty prefix of X.u and π ∈ o( δ_AC( q, v ) ) }

( δ_AC is the transition function of the AC machine )

( o is the output function of the AC machine )
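To make the multi-pattern Jump and Output concrete, here is a minimal Python sketch of a standard Aho-Corasick machine for the pattern set {aba, ababb, abca, bb}. It is an illustrative assumption, not the authors' implementation: state numbers follow insertion order and need not match the figure, all function names are hypothetical, and X.u is scanned character by character purely to show what the two functions return.

```python
from collections import deque


def build_ac(patterns):
    """Aho-Corasick machine: goto (a trie), failure links, and output sets."""
    goto, fail, out = [{}], [0], [set()]
    for p in patterns:
        s = 0
        for c in p:
            if c not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        out[s].add(p)
    queue = deque(goto[0].values())        # depth-1 states fail to the root
    while queue:
        s = queue.popleft()
        for c, t in goto[s].items():
            f = fail[s]
            while f and c not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(c, 0)
            out[t] |= out[fail[t]]         # inherit outputs along the failure link
            queue.append(t)
    return goto, fail, out


def step(machine, q, c):
    goto, fail, _ = machine
    while q and c not in goto[q]:
        q = fail[q]
    return goto[q].get(c, 0)


def jump(machine, q, u):
    """State reached from q after reading u."""
    for c in u:
        q = step(machine, q, c)
    return q


def output(machine, q, u):
    """Pairs (|v|, pattern) for every non-empty prefix v of u ending an occurrence."""
    result = set()
    for i, c in enumerate(u, start=1):
        q = step(machine, q, c)
        result |= {(i, p) for p in machine[2][q]}
    return result


ac = build_ac(["aba", "ababb", "abca", "bb"])
print(sorted(output(ac, 0, "ababb")))
```

Starting from a non-initial state works the same way: after reading `ab`, the token string `ca` completes the pattern abca at its second character.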

Enumeration of Output(q, X)

• Enumerate Output( q, X ) → Enumerate Occ( Π, X.u )
• Enumerate for each case of X
  e.g. Enumerate Occ*( Π, Y.uZ.u ) for X = YZ

[Figure: occurrences spanning the boundary between Y.u and Z.u, comparing the single pattern case with the multiple pattern case; label "Period ?".]

Enumeration of Occ*(Π, xy)

• O(m^2) time and space preprocessing
• Example: Π = {abcabc, cabb, abca}

[Figure: a two-dimensional table indexed by pairs (px, py) drawn from the prefixes and suffixes of the patterns in Π. Each entry holds occurrence information for that pair, or nil.]

m is the total length of the patterns in Π
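The quantity being enumerated can be illustrated with a brute-force Python sketch. It rests on an assumption the slide does not spell out: Occ* is read here as the set of occurrences in xy that start inside x and end inside y, i.e. that cross the boundary. The function name is hypothetical, and the direct scan stands in for the O(m^2) preprocessed table, which this sketch does not implement.

```python
def occ_star(patterns, x, y):
    """Occurrences (start, pattern) in x + y that begin in x and end in y."""
    text = x + y
    hits = []
    for p in patterns:
        # Crossing the boundary means start < |x| < start + |p|.
        first = max(0, len(x) - len(p) + 1)
        for start in range(first, len(x)):
            if text.startswith(p, start):
                hits.append((start, p))
    return sorted(hits)


patterns = ["abcabc", "cabb", "abca"]
print(occ_star(patterns, "abca", "bc"))
print(occ_star(patterns, "ca", "bb"))
```

In the first call, abca also occurs in the text but lies entirely inside x, so it is not reported as a boundary-crossing occurrence.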

Enumeration of Occ(Π, (Y.u)^k)

• Reduce to the single pattern case
  – If Y.uY.u is a substring of a pattern in Π,
    • Add a list of the patterns that occur in X.u with covering (Y.u)^2.
• The number of substrings that are squares is O(m), so the lists take O(m^2) space.

[Figure: generalized suffix trie (GST) for Π, with X = Y^k. The node for (Y.u)^2 carries the list {1, 3, 6}: (Y.u)^2 is a substring of π1, and |Y.u| is a period of π1 (same for π3, π6).]

m is the total length of the patterns in Π

Our Results

Theorem
The multiple pattern matching problem for a text represented by a collage system 〈D, S〉 can be solved in
O( ( ||D|| + |S| ) ・ height(D) + m^2 + r ) time,
using O( ||D|| + m^2 ) space.

If D contains no truncation operation, it can be solved in O( ||D|| + |S| + m^2 + r ) time.

m is the total length of the patterns in Π
r is the number of pattern occurrences

Multi-pattern version of BM type algorithm

Boyer-Moore type algorithm

• A Boyer-Moore type algorithm for compressed pattern matching, CPM2000
  – Y. Shibata, T. Matsumoto, M. Takeda, A. Shinohara, and S. Arikawa
  – O( (||D|| + |S|) ・ height(D) + |π| ・ |S| + |π|^2 + r ) time
  – O( ||D|| + |π|^2 ) space
  – If no truncation, O( ||D|| + |S| + |π| ・ |S| + |π|^2 + r ) time

Theorem
The BM type algorithm for multiple pattern searching on collage system runs in
  – O( (||D|| + |S|) ・ height(D) + m ・ |S| + m^2 + r ) time
  – O( ||D|| + m^2 ) space
  – If no truncation, O( ||D|| + |S| + m ・ |S| + m^2 + r ) time

r is the number of pattern occurrences
m is the total length of the patterns in Π

Boyer-Moore Type Algorithm

[Figure: S = ・・・ Xi1 Xi2 Xi3 Xi4 Xi5 Xi6 Xi7 ・・・ drawn over the original text ・・・CTTAATTAAGCCTGCTAAGCAT・・・, marking the pattern occurrences and the shift by Δ.]

1. Enumerate Occ( Π, S[i].u )
2. Enumerate Occ*( Π, qS[i].u )
3. Calculate the maximal safe shift Δ
   • Calculate Shift( lpps( S[i+1].u ), S[i] )
   • Calculate the smallest k s.t.
     Shift( lpps( S[i+1].u ), S[i] ) ≤ Σ_{j=0}^{k} |S[i+j].u| − |lpps( S[i].u )|
4. i := i + k

Annotations on the slide: "Same way as AC type"; "O(m)".

Calculate Shift(lpps(S[i+1].u), S[i])

For a single pattern π:

rightmost_occ_π(w) = min{ l > 0 | π[ |π| − l − |w| + 1 : |π| − l ] = w, or π[ 1 : |π| − l ] is a suffix of w }

[Figure: the pattern aligned against the text and moved by l, so that either an occurrence of w inside the pattern, or a prefix of the pattern that is a suffix of w, lines up with w.]

For a set of patterns Π:

rightmost_occ_Π(w) = min{ rightmost_occ_π(w) | π ∈ Π }
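The definition can be transcribed almost literally into code. The Python sketch below is an assumed reading of the garbled formula: for one pattern it returns the smallest l > 0 such that either w occurs in the pattern ending l characters before the pattern's end, or the pattern's prefix of length |π| − l is a suffix of w; for a pattern set it takes the minimum over the patterns. Function names are hypothetical and indices are 0-based, unlike the 1-based slide notation.

```python
def rightmost_occ(pattern, w):
    """Smallest l > 0 satisfying either condition of the definition."""
    m = len(pattern)
    for l in range(1, m + 1):
        start = m - l - len(w)
        # Case 1: w occurs inside the pattern, ending l characters before its end.
        if start >= 0 and pattern[start:m - l] == w:
            return l
        # Case 2: the pattern prefix of length m - l is a suffix of w.
        if w.endswith(pattern[:m - l]):
            return l
    return m  # only reached for an empty pattern


def rightmost_occ_multi(patterns, w):
    """Minimum of the single-pattern values over the whole pattern set."""
    return min(rightmost_occ(p, w) for p in patterns)


print(rightmost_occ("ababb", "ab"))
print(rightmost_occ_multi(["aba", "ababb", "abca", "bb"], "bb"))
```

Because the empty prefix is a suffix of every w, the loop always stops by l = |π|, so a shift never exceeds the pattern length.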

Calculate Shift(lpps(S[i+1].u), S[i])

Shift( lpps( S[i+1].u ), X ) = rightmost_occ( X.u ・ lpps( S[i+1].u ) )

O( ||D|| ・ height(D) + m^2 ) time and O( ||D|| + m^2 ) space

[Figure: example on S[i]; labels "Shift" and "= 3".]

Shift( lpps( S[i+1].u ), S[i] ) ≤ Σ_{j=0}^{k} |S[i+j].u| − |lpps( S[i].u )|

Experimental Result

AlphaStation XP1000 (Alpha 21264: 667 MHz), Tru64 UNIX V4.0F
Medline (English text), 60.3 Mbyte

[Figure: CPU time (second), from 0.0 to 0.8, plotted against pattern length, from 5 to 30, for four methods:
– Search for uncompressed texts with KMP method.
– Search for uncompressed texts with Agrep.
– Search for texts compressed by BPE with AC type algorithm.
– Search for texts compressed by BPE with BM type algorithm.]

* Agrep is a search tool developed by Wu and Manber.
* BPE: Byte Pair Encoding
* A single pattern was given as input.

Parallel complexity of compressed pattern matching

Problem to consider

• Instance: A regular collage system 〈D, S〉 and a set Π = {π1, π2, ..., πs} of patterns.
  (regular: contains no truncation and repetition)
• Question: Is there any pattern πj that occurs in the text T represented by 〈D, S〉?

This problem is in LogCFL.

Can be efficiently parallelized! LogCFL ⊆ NC^2

* LogCFL is the class of problems logspace-reducible to a context-free language.

Idea of the Proof

• Using the lemma of I. Sudborough
  – LogCFL = AuxPDA( log n, n^O(1) )
    • Nondeterministic Turing machine
    • Using log n space worktape in n^O(1) time
    • The space of pushdown store is not bounded

* AuxPDA is an auxiliary pushdown automaton.

Show that such an AuxPDA M accepts an input string if and only if there is some pattern that occurs in the text represented by 〈D, S〉.

AuxPDA M

[Figure: the AuxPDA M. Input tape: ¢ π1 # π2 # ・・・ # πs & Xi1 Xi2 ・・・ Xin $. Other labels: worktape contents "$ 100000....", the pushdown store, the token Xik, the set Occ( πj, Xik.u ), and the test "Xik.u[ l ] = πj[ t ] ?".]

Conclusion

• Collage system is a formal system
  – Texts compressed by various compression methods can be expressed by collage system.
• Two types of algorithm for multiple pattern matching on collage system
  – AC type
    • O( ( ||D|| + |S| ) ・ height(D) + m^2 + r ) time
    • O( ||D|| + m^2 ) space
  – BM type
    • O() time and O() space
• Compressed pattern matching can be efficiently parallelized in principle.
  – For regular collage systems
  – Not yet for general collage systems