Backward Nondeterministic DAWG Matching Algorithm

Speaker: L. C. Chen

Advisor: Prof. R. C. T. Lee

A Bit-parallel Approach to Suffix Automata: Fast Extended String Matching, Navarro, G. and Raffinot, M., Lecture Notes in Computer Science, Vol. 1448, 1998, pp. 14-33

Problem Definition:

Input : A text T and a pattern P.

Output : All the locations where P matches T.
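As a baseline for the problem definition, here is a minimal brute-force sketch that reports every location. The function name `naive_search` and the choice of 0-based start indices are assumptions made for illustration; the slides do not fix an indexing convention.

```python
def naive_search(text: str, pattern: str) -> list[int]:
    """Return every 0-based start index where pattern occurs in text.

    Overlapping occurrences are reported too.
    """
    m = len(pattern)
    return [i for i in range(len(text) - m + 1) if text[i:i + m] == pattern]


print(naive_search("ABCABCABA", "CAB"))  # [2, 5]
```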

This algorithm uses Rule 1, the Suffix to Prefix Rule:

For a window to have any chance of matching the pattern, there must be a suffix of the window which is equal to a prefix of the pattern.

[Figure: a suffix of the window in T aligned with a prefix of P]

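Find the longest suffix U of the window which is equal to some prefix of P. Shift the pattern so that its prefix U is aligned with this suffix of the window:

[Figure: P shifted along T so that its prefix U lines up with the suffix U of the window]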

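Example

T = GCATCGACAGACTATACAGTACG

P = GACGGATCA

The window is “TCGACAGAC” (the 4th to the 12th characters of T).

∵ The longest suffix of the window which is equal to a prefix of P is “GAC”, slide the window by 9 - 3 = 6.

After the shift, P is aligned with the “GAC” at the end of the old window:

T = GCATCGACAGACTATACAGTACG

P = GACGGATCA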
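The shift in this example can be checked with a small sketch. It compares suffixes and prefixes directly instead of using the automaton of the paper, and the helper name `longest_suffix_prefix` is an assumption made for illustration. A full match (U equal to the whole pattern) would give a shift of 0 and has to be handled separately by the caller.

```python
def longest_suffix_prefix(window: str, pattern: str) -> str:
    """Return the longest suffix of window that is also a prefix of pattern."""
    for k in range(min(len(window), len(pattern)), 0, -1):
        if window[-k:] == pattern[:k]:
            return window[-k:]
    return ""


T = "GCATCGACAGACTATACAGTACG"
P = "GACGGATCA"
window = T[3:3 + len(P)]            # the window of the example
U = longest_suffix_prefix(window, P)
shift = len(P) - len(U)             # align the prefix U of P with the suffix U
print(window, U, shift)             # TCGACAGAC GAC 6
```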

We give an example to show how this algorithm finds the longest suffix of the window which is equal to a prefix of P.

Text : ABDDCCDBADEGGGGJJ

Pattern : BADADCEAD

Example:

We want to find the longest suffix of “BDDCCDBAD” which is also a prefix of the pattern.

Text : ABDDCCDBADEGGGGJJ

Pattern : BADADCEAD

Example:

First, we read the last character of the window, “D”.

Text : ABDDCCDBADEGGGGJJ

Pattern : BADADCEAD

Example:

We find all the occurrences of “D” in the pattern.

Text : ABDDCCDBADEGGGGJJ

Pattern : BADADCEAD

Example:

We read the next character “A”.

We check whether the character to the left of each occurrence of “D” in the pattern is “A”.

Text : ABDDCCDBADEGGGGJJ

Pattern : BADADCEAD

Example:

Thus, we find all the occurrences of “AD” in the pattern.

Text : ABDDCCDBADEGGGGJJ

Pattern : BADADCEAD

Example:

We read the next character “B”.

We check whether the character to the left of each occurrence of “AD” in the pattern is “B”.

Text : ABDDCCDBADEGGGGJJ

Pattern : BADADCEAD

Example:

We find that the substring “BAD” is in the pattern. Note that “BAD” is also a prefix of P.

Text : ABDDCCDBADEGGGGJJ

Pattern : BADADCEAD

Example:

We read the next character “D”.

We cannot find a character “D” to the left of the substring “BAD” in the pattern. We report that “BAD” is the longest suffix of “BDDCCDBAD” which is equal to a prefix of P.
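The walk-through above can be sketched without any bit tricks by keeping the set of places where the suffix read so far occurs in the pattern. This is only an illustration of the idea; the function name `backward_scan` and the use of a Python set of 0-based start indices are assumptions, not the data structure used in the paper.

```python
def backward_scan(window: str, pattern: str) -> tuple[str, str]:
    """Scan the window from right to left.

    Returns a pair:
      (longest suffix of window that is a substring of pattern,
       longest suffix of window that is a prefix of pattern).
    """
    # Each entry is a 0-based index in pattern where the suffix read so far
    # starts. The empty suffix "starts" just after every pattern position.
    starts = set(range(1, len(pattern) + 1))
    longest_substring = ""
    longest_prefix = ""
    for j in range(len(window) - 1, -1, -1):
        c = window[j]
        # Keep only the occurrences whose left neighbour in pattern is c.
        starts = {s - 1 for s in starts if s > 0 and pattern[s - 1] == c}
        if not starts:
            break                      # window[j:] is not a substring of pattern
        longest_substring = window[j:]
        if 0 in starts:
            longest_prefix = window[j:]  # this occurrence begins the pattern
    return longest_substring, longest_prefix


print(backward_scan("BDDCCDBAD", "BADADCEAD"))  # ('BAD', 'BAD')
```

The scan stops as soon as the set becomes empty, which is the point where the slides report that no “D” can be found to the left of “BAD”.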

Text : ABDDCCDDADEGGGGJJ

Pattern : ACDADCEAD

Another example:

We want to find the longest suffix of “BDDCCDDAD” which is also a substring of the pattern.

Text : ABDDCCDDADEGGGGJJ

Pattern : ACDADCEAD

First, we find all the occurrences of “D” in the pattern.

Another example:

Text : ABDDCCDDADEGGGGJJ

Pattern : ACDADCEAD

Then we find all the occurrences of “AD” in the pattern. The first “D” of the pattern is preceded by “C”, not “A”, so it is a mismatch.

Another example:

Text : ABDDCCDDADEGGGGJJ

Pattern : ACDADCEAD

We find all the occurrences of “DAD” in the pattern. The last “AD” of the pattern is preceded by “E”, not “D”, so it is a mismatch.

Another example:

Text : ABDDCCDDADEGGGGJJ

Pattern : ACDADCEAD

We look for the occurrences of “DDAD” in the pattern. The only “DAD” in the pattern is preceded by “C”, not “D”, so it is a mismatch.

Another example:

Text : ABDDCCDDADEGGGGJJ

Pattern : ACDADCEAD

There is no substring “DDAD” in the pattern.

None of the suffixes found in the pattern (“D”, “AD”, “DAD”) is a prefix of P, so no suffix of “BDDCCDDAD” is equal to a prefix of P.

Another example:

The idea explained above is the main idea of this algorithm. Next, we will use the bit-parallel method to implement this algorithm.

We use bits to store the positions of each character in P.

Example:

P: CABBCAD

For character “A”, we store A: 0100010

For character “B”, we store B: 0011000

For character “C”, we store C: 1000100

For character “D”, we store D: 0000001

For the characters that do not exist in P, we store *: 0000000
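The table above can be built with a few lines of code. This is a minimal sketch; the function name `build_masks` and the use of a Python dictionary are assumptions made for illustration. The leftmost pattern character is mapped to the most significant bit so that the printed bit strings read in the same order as the pattern.

```python
def build_masks(pattern: str) -> dict[str, int]:
    """Map each character of pattern to a bit mask of its positions."""
    m = len(pattern)
    masks: dict[str, int] = {}
    for i, c in enumerate(pattern):
        # Position i (0-based, from the left) is stored in bit m - 1 - i.
        masks[c] = masks.get(c, 0) | (1 << (m - 1 - i))
    return masks


P = "CABBCAD"
masks = build_masks(P)
for c in sorted(masks):
    print(c, format(masks[c], f"0{len(P)}b"))
# A 0100010
# B 0011000
# C 1000100
# D 0000001
```

A character that does not occur in P has no entry; `masks.get(c, 0)` then plays the role of the all-zero row `*`.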

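Here, we explain how to use bit-parallelism to find the substrings of the pattern which are equal to a suffix of the window.

Text: ABCABCABA, ∑ = {A,B,C,D}

Pattern: CABBCAD

A: 0100010 B: 0011000 C: 1000100 D: 0000001 other: 0000000

The window is “ABCABCA”, the first 7 characters of the text.

We use a mask D to record where the suffix read so far occurs in the pattern. (The mask D is not the table entry of the character “D”.) Initially, all bits are set:

D: 1111111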

Text: ABCABCABA, ∑ = {A,B,C,D}

Pattern: CABBCAD

A: 0100010 B: 0011000 C: 1000100 D: 0000001 other: 0000000

We read “A”, the last character of the window.

D: 1111111

A: 0100010

And: 0100010

D = 0100010 << 1 = 1000100

<<1: left shift by one bit.

Text: ABCABCABA, ∑ = {A,B,C,D}

Pattern: CABBCAD

A: 0100010 B: 0011000 C: 1000100 D: 0000001 other: 0000000

We read the next character “C”.

D: 1000100

C: 1000100

And: 1000100

The leftmost bit is 1, so we know “CA” is a suffix of the window which is equal to a prefix of the pattern.

D = 1000100 << 1 = 0001000 (the bit shifted out on the left is discarded)

Text: ABCABCABA, ∑ = {A,B,C,D}

Pattern: CABBCAD

A: 0100010 B: 0011000 C: 1000100 D: 0000001 other: 0000000

We read the next character “B”.

D: 0001000

B: 0011000

And: 0001000

We know “BCA” is a substring of the pattern. The leftmost bit is 0, so “BCA” is not a prefix of the pattern.

D = 0001000 << 1 = 0010000

Text: ABCABCABA, ∑ = {A,B,C,D}

Pattern: CABBCAD

A: 0100010 B: 0011000 C: 1000100 D: 0000001 other: 0000000

We read the next character “A”.

D: 0010000

A: 0100010

And: 0000000

There is no substring “ABCA” in the pattern.

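Text: ABCABCABA, ∑ = {A,B,C,D}

Pattern: CABBCAD

“BCA” is the longest suffix of the window which is a substring of the pattern. Its suffix “CA” is the longest suffix of the window which is equal to a prefix of the pattern.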
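The steps of this example can be replayed with a short sketch. The function name `scan_window` and the returned trace are assumptions made for illustration. The mask is kept to the width of the pattern with `& full`, which corresponds to discarding the bit shifted out on the left.

```python
def scan_window(window: str, pattern: str) -> tuple[list[str], int]:
    """Read window right to left with bit masks.

    Returns (the AND result after each character as a bit string,
             the length of the longest suffix of window that is a prefix
             of pattern).
    """
    m = len(pattern)
    full = (1 << m) - 1          # m one-bits: the initial mask D
    top = 1 << (m - 1)           # leftmost bit: marks a prefix of pattern
    masks: dict[str, int] = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << (m - 1 - i))

    d = full
    trace: list[str] = []
    longest_prefix = 0
    for read, c in enumerate(reversed(window), start=1):
        d &= masks.get(c, 0)     # keep occurrences that extend to the left by c
        trace.append(format(d, f"0{m}b"))
        if d == 0:
            break                # the suffix read so far is not in pattern
        if d & top:
            longest_prefix = read
        d = (d << 1) & full      # move every occurrence one position left
    return trace, longest_prefix


trace, k = scan_window("ABCABCA", "CABBCAD")
print(trace)  # ['0100010', '1000100', '0001000', '0000000']
print(k)      # 2, i.e. "CA"
```

If the leftmost bit is set after all m characters have been read, the whole window matches the pattern; the sketch then returns m as the prefix length.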

We take another example:

Text: ABCABCCBA, ∑ = {A,B,C,D}

Pattern: ACBCCBD

Example:

Text: ABCABCCBA, ∑ = {A,B,C,D}

Pattern: ACBCCBD

First, we build:

A: 1000000 B: 0010010 C: 0101100 D: 0000001 others: 0000000

Example:

Text: ABCABCCBA, ∑ = {A,B,C,D}

Pattern: ACBCCBD

A: 1000000 B: 0010010 C: 0101100 D: 0000001 others: 0000000

Initially, D: 1111111

Example:

Text: ABCABCCBA, ∑ = {A,B,C,D}

Pattern: ACBCCBD

A: 1000000 B: 0010010 C: 0101100 D: 0000001 others: 0000000

We read “C”, the last character of the window “ABCABCC”.

D: 1111111

C: 0101100

And: 0101100

Where there is a “1”, there is a substring “C” in Pattern.

We set D = 0101100 << 1 = 1011000

Example:

Text: ABCABCCBA, ∑ = {A,B,C,D}

Pattern: ACBCCBD

A: 1000000 B: 0010010 C: 0101100 D: 0000001 others: 0000000

We read the next character “C”.

D: 1011000

C: 0101100

And: 0001000

Where there is a “1”, there is a substring “CC” in Pattern.

We set D = 0001000 << 1 = 0010000

Example:

Text: ABCABCCBA, ∑ = {A,B,C,D}

Pattern: ACBCCBD

A: 1000000 B: 0010010 C: 0101100 D: 0000001 others: 0000000

We read the next character “B”.

D: 0010000

B: 0010010

And: 0010000

Where there is a “1”, there is a substring “BCC” in Pattern.

We set D = 0010000 << 1 = 0100000

Example:

Text: ABCABCCBA, ∑ = {A,B,C,D}

Pattern: ACBCCBD

A: 1000000 B: 0010010 C: 0101100 D: 0000001 others: 0000000

We read the next character “A”.

D: 0100000

A: 1000000

And: 0000000

There is no substring “ABCC” in Pattern.

The leftmost bit was never 1, so no suffix of the window is equal to a prefix of the pattern.

Time Complexity:

If the length of the text is n and the length of the pattern is m, the time complexity of this algorithm is O(mn) in the worst case.
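Putting the window scan and the shift together gives the following sketch of the whole search. It follows the steps shown in the slides; the function name `bndm_search`, the variable names and the 0-based result indices are assumptions made for illustration. Python integers have no fixed width, so the sketch does not need the pattern to fit in one machine word, which a fixed-width implementation would have to ensure.

```python
def bndm_search(text: str, pattern: str) -> list[int]:
    """Return the 0-based start index of every occurrence of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    full = (1 << m) - 1
    top = 1 << (m - 1)
    masks: dict[str, int] = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << (m - 1 - i))

    matches: list[int] = []
    pos = 0                          # start of the current window
    while pos <= n - m:
        j = m                        # window characters not yet read
        last = m                     # shift to apply after this window
        d = full
        while d != 0:
            d &= masks.get(text[pos + j - 1], 0)
            j -= 1
            if d & top:              # the suffix read so far is a prefix of pattern
                if j > 0:
                    last = j         # remember it as the next window start
                else:
                    matches.append(pos)  # the whole window equals pattern
                    break
            d = (d << 1) & full
        pos += last
    return matches


print(bndm_search("ABCABCABA", "CAB"))  # [2, 5]
```

The worst case is reached, for example, with a pattern such as “AAAA” in a text consisting only of “A”: every window is read completely (m steps) and the shift is only 1, which gives about m·n steps in total.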

References

• [BG92] A new approach to text searching, Baeza-Yates, R. and Gonnet, G. H., Communications of the ACM, Vol. 35, 1992, pp. 74-82.

• [BEH89] Average sizes of suffix trees and dawgs, Blumer, A., Ehrenfeucht, A. and Haussler, D., Discrete Applied Mathematics, Vol. 24, 1989, pp. 37-45.

• [BM77] A fast string searching algorithm, Boyer, R. S. and Moore, J. S., Communications of the ACM, Vol. 20, 1977, pp. 762-772.

• [GM98] A Bit-Parallel Approach to Suffix Automata: Fast Extended String Matching, Navarro, G. and Raffinot, M., In Proceedings of the 9th Annual Symposium on Combinatorial Pattern Matching, Lecture Notes in Computer Science 1448, Springer-Verlag, Berlin, 1998, pp. 14-33.