
Shift-And Approach to Pattern Matching

in LZW Compressed Text

Takuya KIDA

Department of Informatics, Kyushu University, Japan

Masayuki TAKEDA

Ayumi SHINOHARA

Setsuo ARIKAWA

<2/32>

E-mail

Address book

Schedule

Dictionary

Phone numbers

Memo

Electronic book

Database

Storage capacity is limited! I want to store as much information as possible! I want to do pattern matching as fast as possible!

Motivation

...Yes! Data compression!

...but a suffix trie is very large...

<3/32>

Our goal

[Figure: the usual approach decompresses the compressed text into the original text and runs a pattern matching machine on it. Our goal is a new machine that performs pattern matching directly on the compressed text.]

<4/32>

Previous research

year  researchers                        compression method
1988  Eliam-Tsoreff and Vishkin          run-length
1992  Amir, Landau, and Vishkin          two-dimensional run-length
1992  Amir and Benson                    two-dimensional run-length
1994  Amir, Benson, and Farach           two-dimensional run-length
1994  Manber                             original compression scheme
1995  Farach and Thorup                  LZ77
1996  Amir, Benson, and Farach           LZW
1996  Gasieniec, et al.                  LZ77
1997  Karpinski, Rytter, and Shinohara   straight-line programs
1997  Miyazaki, Shinohara, and Takeda    straight-line programs
1997  Takeda                             finite state encoding
1998  Shibata                            byte pair encoding
1998  Fukamachi, Shinohara, and Takeda   Huffman encoding
1998  Kida, et al.                       LZW (AC automaton, DCC'98)

<5/32>

Recent research

year  researchers                                  compression method
1999  Kida, Takeda, Shinohara, and Arikawa         LZW (Shift-And algorithm)
1999  Shibata, et al.                              byte pair encoding
1999  Kida, et al.                                 dictionary based methods (collage system)
1999  Navarro and Raffinot                         LZ family
1999  Shibata, Takeda, Shinohara, and Arikawa      antidictionaries
1998  de Moura, Navarro, Ziviani, and Baeza-Yates  word based encoding

Venue labels on the slide: CPM'99 (three entries), SPIRE'99 (one entry).

<6/32>

Our main results

The new algorithm scans an LZW compressed text in O(n+r) time using O(|D|) space and reports all occurrences of the pattern, after O(m+|Σ|) time and O(|Σ|) space preprocessing.

The algorithm is about 1.3 times faster than our previous one, which simulates the AC automaton.

The algorithm is about 1.5 times faster than decompression followed by a simple search using the Shift-And algorithm.

|D| : size of the dictionary trie
n : compressed text length
m : pattern length
r : number of pattern occurrences
|Σ| : alphabet size

Lempel-Ziv-Welch Compression

how to compress and decompress

<8/32>

Lempel-Ziv-Welch (LZW) compression

Original text:    a b ab ab ba b c aba bc abab
Compressed text:  1 2 4  4  5  2 3 6   9  11

Dictionary trie (node number: string spelled from the root 0):
  1: a    2: b    3: c
  4: ab   5: ba   6: aba   7: abb   8: bab
  9: bc   10: ca  11: abab  12: bca

For example, the code 6 stands for the string aba.

Size of the dictionary trie: O(|D|) = O(n)

<9/32>

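How to compress a text

Original text:    a b ab ab ba b c aba bc abab
Compressed text:  1 2 4 4 5 2 3 6 9 11

[Animation: the dictionary trie starts with the nodes 1 (a), 2 (b), 3 (c) under the root 0. Each output code adds one node; the nodes 4 to 12 are created along edges labeled b, a, a, b, b, c, a, b, a, in this order.]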
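The compression step on this slide can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function name `lzw_compress`, the `(parent, character) -> child` map used for the trie, and the convention that the alphabet symbols receive the codes 1, 2, 3 in order are assumptions chosen to match the node numbers of the example.

```python
def lzw_compress(text, alphabet):
    """Return the LZW codes of text. Node 0 is the trie root; codes start at 1."""
    # Dictionary trie stored as a map (parent node, character) -> child node.
    trie = {(0, a): k for k, a in enumerate(alphabet, start=1)}
    next_code = len(alphabet) + 1
    codes = []
    node = 0
    for ch in text:
        if (node, ch) in trie:
            # Keep walking down the trie: the longest match continues.
            node = trie[(node, ch)]
        else:
            # Emit the longest match and add one new node below it.
            codes.append(node)
            trie[(node, ch)] = next_code
            next_code += 1
            node = trie[(0, ch)]
    if node:
        codes.append(node)
    return codes


print(lzw_compress("abababbabcababcabab", "abc"))
# prints [1, 2, 4, 4, 5, 2, 3, 6, 9, 11]
```

The printed codes agree with the compressed text of the example, and the nodes 4 to 12 are created in the order shown in the dictionary trie.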

<10/32>

How to decompress a compressed text

Compressed text:  1 2 4 4 5 2 3 6 9 11
Original text:    a b ab ab ba b c aba bc abab

[Animation: the same dictionary trie (nodes 0 to 12) is rebuilt while the codes are read. Annotations on the slide: O(n) time, O(N) time.]
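Decompression rebuilds the same dictionary while reading the codes. The sketch below is illustrative only: the name `lzw_decompress` is an assumption, and for brevity it stores every dictionary string explicitly instead of keeping a trie with parent pointers, so it does not reflect the space bound discussed in the slides.

```python
def lzw_decompress(codes, alphabet):
    """Rebuild the original text from LZW codes (alphabet symbols are codes 1, 2, ...)."""
    strings = {k: a for k, a in enumerate(alphabet, start=1)}
    next_code = len(alphabet) + 1
    out = []
    prev = None
    for code in codes:
        if code in strings:
            cur = strings[code]
        else:
            # The code names the entry that is created in this very step:
            # it must be the previous string plus its own first character.
            cur = prev + prev[0]
        if prev is not None:
            # New dictionary entry: previous string + first character of the current one.
            strings[next_code] = prev + cur[0]
            next_code += 1
        out.append(cur)
        prev = cur
    return "".join(out)


print(lzw_decompress([1, 2, 4, 4, 5, 2, 3, 6, 9, 11], "abc"))
# prints abababbabcababcabab
```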

Compressed Pattern Matching in LZW Compressed Text

with Shift-And approach

<12/32>

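Shift-And approach to pattern matching

(Baeza-Yates and Gonnet [1992], Wu and Manber [1992])

pattern: aabac
text:    aabaacaabacab

mask bits:  a: 11010   b: 00100   c: 00001

State after each text character (bit i is 1 when the length-i prefix of the pattern ends at that character):

  a  10000
  a  11000
  b  00100
  a  10010
  a  11000
  c  00000
  a  10000
  a  11000
  b  00100
  a  10010
  c  00001   Pattern was found!
  a  10000
  b  00000

One transition, reading a in the state 10000: the bit shift gives 01000, OR with 10000 gives 11000, and AND (&) with the mask bits 11010 of a gives 11000.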
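The example above can be reproduced with a short Python sketch of the Shift-And method on uncompressed text. The function name `shift_and` is an assumption, and so is the bit layout: the least significant bit stands for pattern position 1, so the bit strings on the slide appear mirrored inside the integers.

```python
def shift_and(text, pattern):
    """Return the 1-based end positions of all occurrences of pattern in text."""
    m = len(pattern)
    # Mask bits: bit i-1 of mask[a] is set when Pattern[i] = a.
    mask = {}
    for i, ch in enumerate(pattern):
        mask[ch] = mask.get(ch, 0) | (1 << i)
    accept = 1 << (m - 1)          # bit of pattern position m
    state = 0
    ends = []
    for pos, ch in enumerate(text, start=1):
        # Bit shift, OR with position 1, AND with the mask bits of the character.
        state = ((state << 1) | 1) & mask.get(ch, 0)
        if state & accept:
            ends.append(pos)
    return ends


print(shift_and("aabaacaabacab", "aabac"))
# prints [11]
```

The single occurrence ends at text position 11, which is where the state 00001 appears in the table.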

<13/32>

Properties of Shift-And approach

Simple, and very fast when the pattern length m is not greater than the word length of typical computers (32 or 64).

Assuming m ≤ 32 (or 64) and that bit-shift operations and bitwise logical operations on integers can be performed in constant time, it runs in O(n) time on a text of length n.

This method has many variations:
  generalized pattern matching
  pattern matching with k mismatches
  pattern matching for multiple patterns

<14/32>

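Basic idea of our algorithm

pattern: aabac
text:    aabaacaabacab

mask bits:  a: 11010   b: 00100   c: 00001

[Figure: the Shift-And states 10000, 11000, 00100, 10010, 11000, 00000, 10000, 11000, 00100, 10010, 00001, 10000, 00000 over the text, with the text split into the strings that the codes of the compressed text stand for. Instead of visiting every character, the state jumps ("Jump!") across each string in one step.]

Can each jump be done in O(1) time?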

<15/32>

Basic idea of our algorithm

[Figure: the same example (pattern aabac, text aabaacaabacab, mask bits a: 11010, b: 00100, c: 00001). When the state jumps across a whole string, a pattern occurrence that ends inside the string (state 00001, "Pattern was found!") is passed over.]

We need a mechanism for reporting all pattern occurrences.

<16/32>

Technical details

Lemma 1 (Realization of 'Jump'). The state transition function can be built in O(|D|+m) time using O(|D|) space so that it returns its value in O(1) time.

Lemma 2 (Realization of 'Output'). The procedure that enumerates the pattern occurrences can be built in O(|D|+m) time using O(|D|) space so that it runs in O(r) time.

|D| : size of the dictionary trie
m : pattern length
r : number of pattern occurrences

<17/32>

Overview of the algorithm

Input. A pattern P and an LZW compressed text u1, u2, …, un.
Output. All occurrences of the pattern.

Construct mask bits from P.
Initialize the dictionary trie, M̂, U, and V;
l := 0; S := ∅;
for i := 1 to n do begin
    for each d ∈ Output(S, ui) do
        report 'pattern occurs at position l+d';
    S := f̂(S, ui);     /* Jump the state! */
    l := l + |ui|;      /* increment the offset */
    Update the dictionary trie, M̂, U, and V;
end
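The overview can be turned into a compact Python sketch. It is one possible reading of the pseudocode, not the authors' C implementation, and it makes several assumptions: the alphabet symbols are the codes 1, 2, ... in order; bit 0 of an integer stands for pattern position 1; U(u) is stored mirrored (position j as bit m-j-1) so that (m ⊖ S) ∩ U(u) becomes a plain AND; V(u) is enumerated through a hypothetical `vlink` pointer to the deepest ancestor whose string ends with the pattern; and the trie update for code u(i-1) is performed at the start of step i, when the first character of ui is known.

```python
def search_lzw(codes, pattern, alphabet):
    """Return 1-based end positions of pattern in the text that the LZW codes encode."""
    m = len(pattern)
    full, accept = (1 << m) - 1, 1 << (m - 1)
    mask = {}
    for i, ch in enumerate(pattern):
        mask[ch] = mask.get(ch, 0) | (1 << i)

    # Per-node tables of the dictionary trie; node 0 is the root (empty string).
    parent, length, first = [0], [0], [None]
    m_hat = [full]   # M^(u) = f^({1..m}, u)
    U = [0]          # U(u), mirrored: position j is bit m - j - 1
    vlink = [0]      # deepest node on the root path whose string ends with the pattern

    def add_node(p, a):
        v = len(length)
        parent.append(p)
        length.append(length[p] + 1)
        first.append(first[p] if p else a)
        mh = ((m_hat[p] << 1) | 1) & mask.get(a, 0)   # M^(ua) = f(M^(u), a)
        m_hat.append(mh)
        hit = bool(mh & accept)
        short = length[v] < m
        U.append(U[p] | (1 << (m - length[v] - 1)) if hit and short else U[p])
        vlink.append(v if hit and not short else vlink[p])

    for a in alphabet:
        add_node(0, a)

    S, offset, prev, result = 0, 0, 0, []
    for code in codes:
        if prev:
            # Dictionary update: previous string + first character of this one.
            # If the code is the node being created, that character is first[prev].
            add_node(prev, first[code] if code < len(length) else first[prev])
        hits = []
        x = S & U[code]                      # (m ⊖ S) ∩ U(u)
        while x:
            low = x & -x
            hits.append(m - low.bit_length())
            x ^= low
        w = vlink[code]                      # V(u)
        while w:
            hits.append(length[w])
            w = vlink[parent[w]]
        result.extend(offset + d for d in sorted(hits))
        k = min(length[code], m)             # Jump the state
        S = (((S << k) | ((1 << k) - 1)) & full) & m_hat[code]
        offset += length[code]
        prev = code
    return result


print(search_lzw([1, 1, 2, 4, 3, 4, 6, 8, 2], "aabac", "abc"))
# prints [11]
```

The code sequence 1 1 2 4 3 4 6 8 2 is the LZW encoding of the text aabaacaabacab over the alphabet {a, b, c}, so the reported end position 11 matches the Shift-And example. Python integers are unbounded, so the constant-time claim only carries over to a language with machine words when m does not exceed the word length.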

Detail of our Algorithm

Realization of Jump and Output

<19/32>

Detail of 'Jump'

For a ∈ Σ and S ⊆ {1, …, m}:

    f(S, a) := ((S ⊕ 1) ∪ {1}) ∩ M(a)
    M(a) := { 1 ≤ i ≤ m | Pattern[i] = a }

where S ⊕ k = { i + k | i ∈ S }. On bit vectors, ⊕ 1 is a bit shift, ∪ is OR, and ∩ is AND.

Example (pattern aabac): the mask bits are a: 11010, b: 00100, c: 00001, that is, M(a) = {1,2,4}, M(b) = {3}, M(c) = {5}. The state S = {1,3} is the bit vector 10100.

<20/32>

Detail of 'Jump'

    f(S, a) := ((S ⊕ 1) ∪ {1}) ∩ M(a)
    M(a) := { 1 ≤ i ≤ m | Pattern[i] = a }

For a ∈ Σ, u ∈ Σ*, and S ⊆ {1, …, m}, define f̂ recursively:

    f̂(S, ε) := S
    f̂(S, ua) := f(f̂(S, u), a)

Then, with M̂(u) := f̂({1, …, m}, u),

    f̂(S, u) = ((S ⊕ |u|) ∪ {1, …, |u|}) ∩ M̂(u)

which can be evaluated in O(1) time.
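The closed form for the jump can be checked against the recursive definition with a small Python sketch. All names here (`make_masks`, `f_hat_stepwise`, `f_hat_jump`, `bits`) are hypothetical, bit 0 of an integer is assumed to stand for pattern position 1, and M̂(u) is recomputed on the spot for illustration, whereas the algorithm stores it at the trie node of u.

```python
def make_masks(pattern):
    """mask[a] has bit i-1 set when Pattern[i] = a."""
    mask = {}
    for i, ch in enumerate(pattern):
        mask[ch] = mask.get(ch, 0) | (1 << i)
    return mask


def f(S, a, mask):
    """One Shift-And step: f(S, a) = ((S ⊕ 1) ∪ {1}) ∩ M(a)."""
    return ((S << 1) | 1) & mask.get(a, 0)


def f_hat_stepwise(S, u, mask):
    """Recursive definition: apply f once per character of u."""
    for a in u:
        S = f(S, a, mask)
    return S


def f_hat_jump(S, u, m, mask):
    """Closed form: ((S ⊕ |u|) ∪ {1..|u|}) ∩ M^(u), restricted to positions 1..m."""
    full = (1 << m) - 1
    k = min(len(u), m)                        # shifting by m or more clears S anyway
    m_hat = f_hat_stepwise(full, u, mask)     # M^(u) = f^({1..m}, u)
    return (((S << k) | ((1 << k) - 1)) & full) & m_hat


def bits(S, m):
    """Render a state in slide order (position 1 leftmost)."""
    return "".join("1" if (S >> i) & 1 else "0" for i in range(m))


mask = make_masks("aabac")
print(bits(f_hat_jump(0b00001, "aba", 5, mask), 5))
# prints 10010
```

Starting from the state 10000 and jumping over the string aba gives 10010, the same state that four single Shift-And steps over a, a, b, a produce.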

<21/32>

Move of f̂(S, u)

pattern: aabac
text:    a | aba | acaabac | ab

Jump over u = aba from the state S = 10000, with M̂(aba) = 10010:

    S ⊕ 3                = 00010
    ∪ {1, 2, 3} (111)    = 11110
    ∩ M̂(u) (&)          = 10010

<22/32>

Move of f̂(S, u)

pattern: aabac
text:    a | aba | acaabac | ab

Jump over u = acaabac from the state S = 10010, with M̂(acaabac) = 00001:

    S ⊕ 7                = 00000
    ∪ {1, …, 5}          = 11111
    ∩ M̂(u) (&)          = 00001

<23/32>

How to calculate M̂(u)

    M̂(ua) = f̂({1, …, m}, ua)
           = f(f̂({1, …, m}, u), a)
           = f(M̂(u), a)
           = ((M̂(u) ⊕ 1) ∪ {1}) ∩ M(a)

[Figure: in the dictionary trie D, the node ua is the child of the node u along the edge a, so M̂(ua) is obtained from M̂(u) in O(1) time.]

total: O(|D|) time and space

<24/32>

How to enumerate the occurrences

    Output(S, u) := { 1 ≤ j ≤ |u| | m ∈ f̂(S, u[1..j]) }

Output takes a state from 2^{1, …, m} and a string from D.

[Figure: the state S is drawn as the length-i prefix of the pattern for the largest i ∈ S, followed by the string u. Pattern occurrences end at positions 2 and 11 of u, so Output(S, u) = {2, 11}.]

<25/32>

Realization of Output(S, u)

Two subsets U and V:

    U(u) := { 1 ≤ i ≤ |u| | i < m and u[1..i] = Pattern[m−i+1..m] }
    V(u) := { 1 ≤ i ≤ |u| | i ≥ m and u[i−m+1..i] = Pattern }

    Output(S, u) = ((m ⊖ S) ∩ U(u)) ∪ V(u)

where m ⊖ S = { m − i | i ∈ S }. The first part, (m ⊖ S) ∩ U(u), depends on S; V(u) is independent of S.
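The decomposition of Output into U and V can be compared with the definition on small inputs. The sketch below uses Python sets of positions instead of bit vectors for readability; the function names are assumptions, and U and V are recomputed from scratch here, whereas the algorithm maintains them on the dictionary trie.

```python
def output_by_definition(S, u, pattern):
    """{ j : m ∈ f^(S, u[1..j]) }, obtained by running Shift-And steps over u."""
    m = len(pattern)
    S = set(S)
    found = set()
    for j, a in enumerate(u, start=1):
        # f(S, a) = ((S ⊕ 1) ∪ {1}) ∩ M(a), written with sets of positions.
        S = {i for i in {s + 1 for s in S} | {1} if i <= m and pattern[i - 1] == a}
        if m in S:
            found.add(j)
    return found


def output_by_U_V(S, u, pattern):
    """((m ⊖ S) ∩ U(u)) ∪ V(u), with U and V taken directly from their definitions."""
    m = len(pattern)
    # U(u): proper-length prefixes of u that are suffixes of the pattern.
    U = {i for i in range(1, min(len(u), m - 1) + 1) if u[:i] == pattern[m - i:]}
    # V(u): end positions of full pattern occurrences inside u.
    V = {i for i in range(m, len(u) + 1) if u[i - m:i] == pattern}
    m_minus_S = {m - i for i in S}
    return (m_minus_S & U) | V


print(sorted(output_by_U_V({1, 2}, "bacaabac", "aabac")))
# prints [3, 8]
```

With S = {1, 2} (the text so far ends with aa) and u = bacaabac, the occurrence ending at position 3 straddles the boundary and is found through U, while the one ending at position 8 lies inside u and is found through V.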

<26/32>

How to calculate U(u) and V(u)

    if |ua| < m and m ∈ M̂(ua) then U(ua) := U(u) ∪ {|ua|}
    else U(ua) := U(u);

[Figure: in the dictionary trie D, the node ua is the child of the node u along the edge a, so U(ua) and V(ua) are obtained from U(u) and V(u) in O(1) time.]

V(u) can be handled in the same way as in [DCC'98].

total: O(|D|) time and space

-- Is this really practical? --

But... is it really fast?

Uhmm....

<28/32>

Experimental Comparisons

◆ Method 1: decompress the compressed text, then search it with Shift-And.

◆ Method 2: search the compressed text with our previous algorithm (DCC'98).

◆ Method 3: search the compressed text with our new algorithm.

<29/32>

Experimental Comparisons

Original text: "The Brown corpus", 6.8 Mbytes
Compressed text: 3.4 Mbytes, produced by compress (UNIX command)
Language: C (with gcc compiler)
Machine: Sun SPARCstation 20 with remote disk storage
File transfer rate: 0.96 Mbyte/sec

<30/32>

Experimental results

Method                            CPU time (s)   elapsed time (s)
Shift-And with decompression      7.52           8.16
Our previous algorithm (DCC'98)   6.57           7.31
New algorithm                     5.15           6.05

Elapsed time = CPU time + file I/O time.

The new algorithm is about 1.3 times faster than our previous algorithm and about 1.5 times faster than Shift-And with decompression.

<31/32>

Experimental results

Method                            CPU time (s)   elapsed time (s)
Shift-And in original text        3.09           9.36
Shift-And with decompression      7.52           8.16
Our previous algorithm (DCC'98)   6.57           7.31
New algorithm                     5.15           6.05

<32/32>

Conclusion

The proposed algorithm scans an LZW compressed text in O(n+r) time using O(|D|) space and reports all occurrences of the pattern, after O(m+|Σ|) time and O(|Σ|) space preprocessing.

We implemented the algorithm and showed that it is approximately 1.3 times faster than our previous algorithm.

Our new algorithm has several extensions:
  generalized pattern matching
  pattern matching with k mismatches
  pattern matching for multiple patterns