bbm 202 - algorithms - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · bbm 202 -...

86
BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course slides are adapted from the slides prepared by R. Sedgewick and K. Wayne of Princeton University. 1

Upload: others

Post on 07-Sep-2019

8 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

BBM 202 - ALGORITHMS

SUBSTRING SEARCH

DEPT. OF COMPUTER ENGINEERING

Acknowledgement:ThecourseslidesareadaptedfromtheslidespreparedbyR.SedgewickandK.WayneofPrincetonUniversity.

1

Page 2: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

TODAY‣ Substring search‣ Brute force‣ Knuth-Morris-Pratt‣ Boyer-Moore‣ Rabin-Karp

2

Page 3: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

3

Substring search

Goal. Find pattern of length M in a text of length N.

typically N >> M

Substring search

N E E D L E

I N A H A Y S T A C K N E E D L E I N A

match

pattern

text

3

Page 4: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

4

Substring search applications

Goal. Find pattern of length M in a text of length N.

Computer forensics.  Search memory or disk for signatures, e.g., all URLs or RSA keys that the user has entered.

typically N >> M

http://citp.princeton.edu/memory

Substring search

N E E D L E

I N A H A Y S T A C K N E E D L E I N A

match

pattern

text

4

Page 5: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

5

Substring search applications

Goal. Find pattern of length M in a text of length N.

Identify patterns indicative of spam. • PROFITS

• L0SE WE1GHT

• There is no catch.

• This is a one-time mailing.

• This message is sent in compliance with spam regulations.

typically N >> M

Substring search

N E E D L E

I N A H A Y S T A C K N E E D L E I N A

match

pattern

text

5

Page 6: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

6

Substring search applications

Electronic surveillance.Need to monitor all

internet traffic.

(security)

No way!

(privacy)

Well, we’re mainly

interested in

“ATTACK AT DAWN”

OK. Build a

machine that just

looks for that.

“ATTACK AT DAWN”

substring searchmachine

found

6

Page 7: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

7

Substring search applications

Screen scraping. Extract relevant data from web page.

Ex. Find string delimited by <b> and </b> after first occurrence ofpattern Last Trade:.

http://finance.yahoo.com/q?s=goog

... <tr> <td class= "yfnc_tablehead1" width= "48%"> Last Trade: </td> <td class= "yfnc_tabledata1"> <big><b>452.92</b></big> </td></tr> <td class= "yfnc_tablehead1" width= "48%"> Trade Time: </td> <td class= "yfnc_tabledata1"> ...

7

Page 8: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

8

Screen scraping: Java implementation

Java library. The indexOf() method in Java's string library returns the index of the first occurrence of a given string, starting at a given offset.

public class StockQuote { public static void main(String[] args) { String name = "http://finance.yahoo.com/q?s="; In in = new In(name + args[0]); String text = in.readAll(); int start = text.indexOf("Last Trade:", 0); int from = text.indexOf("<b>", start); int to = text.indexOf("</b>", from); String price = text.substring(from + 3, to); StdOut.println(price); } }

% java StockQuote goog 582.93

% java StockQuote msft 24.84

8

Page 9: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

SUBSTRING SEARCH

‣ Brute force‣ Knuth-Morris-Pratt‣ Boyer-Moore‣ Rabin-Karp

9

Page 10: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Check for pattern starting at each text position.

10

Brute-force substring search

Brute-force substring search

i j i+j 0 1 2 3 4 5 6 7 8 9 10

A B A C A D A B R A C

0 2 2 A B R A 1 0 1 A B R A 2 1 3 A B R A 3 0 3 A B R A 4 1 5 A B R A 5 0 5 A B R A 6 4 10 A B R A

entries in gray arefor reference only

entries in blackmatch the text

return i when j is M

entries in red aremismatches

txt

pat

match

10

Page 11: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Check for pattern starting at each text position.

public static int search(String pat, String txt){ int M = pat.length(); int N = txt.length(); for (int i = 0; i <= N - M; i++) { int j; for (j = 0; j < M; j++) if (txt.charAt(i+j) != pat.charAt(j)) break; if (j == M) return i; } return N; }

11

Brute-force substring search: Java implementation

index in text where pattern starts

not found

i j i + j 0 1 2 3 4 5 6 7 8 9 1 0

A B A C A D A B R A C

4 3 7 A D A C R

5 0 5 A D A C R

11

Page 12: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Brute-force algorithm can be slow if text and pattern are repetitive.

Worst case. ~ M N char compares.

12

Brute-force substring search: worst case

Brute-force substring search (worst case)

i j i+j 0 1 2 3 4 5 6 7 8 9

A A A A A A A A A B

0 4 4 A A A A B 1 4 5 A A A A B 2 4 6 A A A A B 3 4 7 A A A A B 4 4 8 A A A A B 5 5 10 A A A A B

txt

pat

Brute-force substring search

i j i+j 0 1 2 3 4 5 6 7 8 9 10

A B A C A D A B R A C

0 2 2 A B R A 1 0 1 A B R A 2 1 3 A B R A 3 0 3 A B R A 4 1 5 A B R A 5 0 5 A B R A 6 4 10 A B R A

entries in gray arefor reference only

entries in blackmatch the text

return i when j is M

entries in red aremismatches

txt

pat

match

12

Page 13: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

In many applications, we want to avoid backup in text stream.

• Treat input as stream of data.

• Abstract model: standard input. Brute-force algorithm needs backup for every mismatch.

Approach 1. Maintain buffer of last M characters.

Approach 2. Stay tuned.

A A A A A A A A A A A A A A A A A A A A A B

A A A A A B

Backup

13

“ATTACK AT DAWN”

substring search

machine

found

A A A A A A A A A A A A A A A A A A A A A B

A A A A A B

matched charsmismatch

shift pattern right one position

backup

13

Page 14: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Same sequence of char compares as previous implementation.

• i points to end of sequence of already-matched chars in text.

• j stores number of already-matched chars (end of sequence in pattern).

public static int search(String pat, String txt) { int i, N = txt.length(); int j, M = pat.length(); for (i = 0, j = 0; i < N && j < M; i++) { if (txt.charAt(i) == pat.charAt(j)) j++; else { i -= j; j = 0; } } if (j == M) return i - M; else return N; }

14

Brute-force substring search: alternate implementation

backup

i j 0 1 2 3 4 5 6 7 8 9 1 0

A B A C A D A B R A C

7 3 A D A C R

5 0 A D A C R

14

Page 15: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

15

Algorithmic challenges in substring search

Brute-force is not always good enough.Theoretical challenge. Linear-time guarantee.

Practical challenge. Avoid backup in text stream. often no room or time to save text

fundamental algorithmic problem

Now is the time for all people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for many good people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for a lot of good people to come to the aid of their party. Now is the time for all of the good people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for each good person to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for all good Republicans to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for many or all good people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for all good Democrats to come to the aid of their party. Now is the time for all people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for many good people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for a lot of good people to come to the aid of their party. Now is the time for all of the good people to come to the aid of their party. Now is the time for all good people to come to the aid of their attack at dawn party. Now is the time for each person to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for all good Republicans to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for many or all good people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for all good Democrats to come to the aid of their party.

15

Page 16: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

SUBSTRING SEARCH

‣ Brute force‣ Knuth-Morris-Pratt‣ Boyer-Moore‣ Rabin-Karp

16

Page 17: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Knuth-Morris-Pratt substring search

Intuition. Suppose we are searching in text for pattern BAAAAAAAAA.

• Suppose we match 5 chars in pattern, with mismatch on 6th char.

• We know previous 6 chars in text are BAAAAB.

• Don't need to back up text pointer!

Knuth-Morris-Pratt algorithm. Clever method to always avoid backup. (!)

17

Text pointer backup in substring searching

A B A A A A B A A A A A A A A A

B A A A A A A A A A B A A A A A A A A A B A A A A A A A A A B A A A A A A A A A B A A A A A A A A A B A A A A A A A A A

B A A A A A A A A A

i

after mismatchon sixth char

but no backupis needed

brute-force backsup to try this

and this

and this

and this

and this

pattern

text

assuming { A, B } alphabet

17

Page 18: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

DFA is abstract string-searching machine.

• Finite number of states (including start and halt).

• Exactly one transition for each char in alphabet.

• Accept if sequence of transitions leads to halt state.

Deterministic finite state automaton (DFA)

18Constructing the DFA for KMP substring search for A B A B A C

0 1 2 3 4 5 6A B A A

B,C

A

CB,CC

B

AB,C A

C

0 1 2 3 4 5 A B A B A C 1 1 3 1 5 1 0 2 0 4 0 4 0 0 0 0 0 6

dfa[][j]ABC

X

pat.charAt(j)j

B

Constructing the DFA for KMP substring search for A B A B A C

0 1 2 3 4 5 6A B A A

B,C

A

CB,CC

B

AB,C A

C

0 1 2 3 4 5 A B A B A C 1 1 3 1 5 1 0 2 0 4 0 4 0 0 0 0 0 6

dfa[][j]ABC

X

pat.charAt(j)j

B

internal representation

graphical representation

If in state j reading char c:

if j is 6 halt and accept

•else move to state dfa[c][j]

18

Page 19: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

DFA simulation

19

1 1 3 1 5 1

0 2 0 4 0 4

0 0 0 0 0 6

A B A B A C

0 1 2 3 4 5

A

B

C

A A B A C A A B A B A C A A

10 32 4 65BA BA CA

B

A

A

B, C

B, C

B, C

C

A

C

pat.charAt(j)

dfa[][j]

19

Page 20: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

10 32 4 65

DFA simulation

20

10 32 4 65BA BA CA

B

A

A

B, C

B, C

B, C

C

1 1 3 1 5 1

0 2 0 4 0 4

0 0 0 0 0 6

A B A B A C

0 1 2 3 4 5

A

B

C

A

A A B A C A A B A B A C A A

C

pat.charAt(j)

dfa[][j]

20

Page 21: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

10 32 4 65

DFA simulation

21

1 32 4 65BA BA CA

B

A

A

CB, C

B, C

B, C

C

A

A A B A C A A B A B A C A A

1 1 3 1 5 1

0 2 0 4 0 4

0 0 0 0 0 6

A B A B A C

0 1 2 3 4 5

A

B

C

pat.charAt(j)

dfa[][j]

21

Page 22: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

10 32 4 65

DFA simulation

22

1 32 4 65BA BA CA

B

A

A

B, C

B, C

B, C

C

A

A A B A C A A B A B A C A A

C

1 1 3 1 5 1

0 2 0 4 0 4

0 0 0 0 0 6

A B A B A C

0 1 2 3 4 5

A

B

C

pat.charAt(j)

dfa[][j]

22

Page 23: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

10 32 4 65

DFA simulation

23

32 4 65BA BA CA

B

A

A

B, C

B, C

B, C

C

A

A A B A C A A B A B A C A A

C

1 1 3 1 5 1

0 2 0 4 0 4

0 0 0 0 0 6

A B A B A C

0 1 2 3 4 5

A

B

C

pat.charAt(j)

dfa[][j]

23

Page 24: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

10 32 4 65

DFA simulation

24

3 4 65BA BA CA

B

A

A

B, C

B, C

B, C

C

A

A A B A C A A B A B A C A A

C

1 1 3 1 5 1

0 2 0 4 0 4

0 0 0 0 0 6

A B A B A C

0 1 2 3 4 5

A

B

C

pat.charAt(j)

dfa[][j]

24

Page 25: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

10 32 4 65

DFA simulation

25

4 65BA BA CA

B

A

A

B, C

B, C

B, C

C

A

A A B A C A A B A B A C A A

C

1 1 3 1 5 1

0 2 0 4 0 4

0 0 0 0 0 6

A B A B A C

0 1 2 3 4 5

A

B

C

pat.charAt(j)

dfa[][j]

25

Page 26: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

1 32 4 65

DFA simulation

26

4 65BA BA CA

B

A

A

B, C

B, C

B, C

C

A A B A C A A B A B A C A A

0

A

C

1 1 3 1 5 1

0 2 0 4 0 4

0 0 0 0 0 6

A B A B A C

0 1 2 3 4 5

A

B

C

pat.charAt(j)

dfa[][j]

26

Page 27: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

1 32 4 65

DFA simulation

27

4 65A BA CA

B

A

A

B, C

B, C

B, C

C

A A B A C A A B A B A C A A

0

A

BC

1 1 3 1 5 1

0 2 0 4 0 4

0 0 0 0 0 6

A B A B A C

0 1 2 3 4 5

A

B

C

pat.charAt(j)

dfa[][j]

27

Page 28: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

10 32 4 65

DFA simulation

28

32 4 65BA BA CA

B

A

A

B, C

B, C

B, C

C

A

A A B A C A A B A B A C A A

C

1 1 3 1 5 1

0 2 0 4 0 4

0 0 0 0 0 6

A B A B A C

0 1 2 3 4 5

A

B

C

pat.charAt(j)

dfa[][j]

28

Page 29: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

10 32 4 65

DFA simulation

29

3 4 65BA BA CA

B

A

A

B, C

B, C

B, C

C

A

A A B A C A A B A B A C A A

C

1 1 3 1 5 1

0 2 0 4 0 4

0 0 0 0 0 6

A B A B A C

0 1 2 3 4 5

A

B

C

pat.charAt(j)

dfa[][j]

29

Page 30: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

10 32 4 65

DFA simulation

30

4 65BA BA CA

B

A

A

B, C

B, C

B, C

C

A

A A B A C A A B A B A C A A

C

1 1 3 1 5 1

0 2 0 4 0 4

0 0 0 0 0 6

A B A B A C

0 1 2 3 4 5

A

B

C

pat.charAt(j)

dfa[][j]

30

Page 31: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

10 32 4 65

DFA simulation

31

65BA BA CA

B

A

A

B, C

B, C

B, C

C

A

A A B A C A A B A B A C A A

C

1 1 3 1 5 1

0 2 0 4 0 4

0 0 0 0 0 6

A B A B A C

0 1 2 3 4 5

A

B

C

pat.charAt(j)

dfa[][j]

31

Page 32: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

10 32 4 65

DFA simulation

32

6BA BA CA

B

A

A

B, C

B, C

B, C

C

A

A A B A C A A B A B A C A A

C

substring found

1 1 3 1 5 1

0 2 0 4 0 4

0 0 0 0 0 6

A B A B A C

0 1 2 3 4 5

A

B

C

pat.charAt(j)

dfa[][j]

32

Page 33: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Q. What is interpretation of DFA state after reading in txt[i]?A. State = number of characters in pattern that have been matched.Ex. DFA is in state 3 after reading in txt[0..6].

0 1 2 3 4 5 6 7 8

B C B A A B A C A

Interpretation of Knuth-Morris-Pratt DFA

33

txt

0 1 2 3 4 5

A B A B A Cpat

suffix of text[0..6] prefix of pat[]

10 32 4 65BA BA CA

B

A

A

B, C

B, C

B, C

C

A

C

i

length of longest prefix of pat[] that is a suffix of txt[0..i]

33

Page 34: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Knuth-Morris-Pratt substring search: Java implementation

Key differences from brute-force implementation.

• Need to precompute dfa[][] from pattern.

• Text pointer i never decrements.

Running time.• Simulate DFA on text: at most N character accesses.

• Build DFA: how to do efficiently? [warning: tricky algorithm ahead]

34

public int search(String txt) { int i, j, N = txt.length(); for (i = 0, j = 0; i < N && j < M; i++) j = dfa[txt.charAt(i)][j]; if (j == M) return i - M; else return N; }

no backup

34

Page 35: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Knuth-Morris-Pratt substring search: Java implementation

Key differences from brute-force implementation.

• Need to precompute dfa[][] from pattern.

• Text pointer i never decrements.

• Could use input stream.

35

public int search(In in) { int i, j; for (i = 0, j = 0; !in.isEmpty() && j < M; i++) j = dfa[in.readChar()][j]; if (j == M) return i - M; else return NOT_FOUND; }

Constructing the DFA for KMP substring search for A B A B A C

0 1 2 3 4 5 6A B A A

B,C

A

CB,CC

B

AB,C A

C

0 1 2 3 4 5 A B A B A C 1 1 3 1 5 1 0 2 0 4 0 4 0 0 0 0 0 6

dfa[][j]ABC

X

pat.charAt(j)j

B

no backup

35

Page 36: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Include one state for each character in pattern (plus accept state).

Knuth-Morris-Pratt construction

36

4 65

A B A B A C

0 1 2 3 4 5

A

B

C

3210

Constructing the DFA for KMP substring search for A B A B A C

pat.charAt(j)

dfa[][j]

36

Page 37: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Match transition. If in state j and next char c == pat.charAt(j), go to j+1.

Knuth-Morris-Pratt construction

37

4 65BA BA CA

1 3 5

2 4

6

A B A B A C

0 1 2 3 4 5

A

B

C

3210

Constructing the DFA for KMP substring search for A B A B A C

first j characters of pattern

have already been matched

now first j+1 characters ofpattern have been matched

next char matches

pat.charAt(j)

dfa[][j]

37

Page 38: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Mismatch transition: back up if c != pat.charAt(j).

Knuth-Morris-Pratt construction

38

0 4 65BA BA CA

B, C

1 3 5

0 2 4

0 6

A B A B A C

0 1 2 3 4 5

A

B

C

321

Constructing the DFA for KMP substring search for A B A B A C

j

pat.charAt(j)

dfa[][j]

38

Page 39: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Mismatch transition: back up if c != pat.charAt(j).

Knuth-Morris-Pratt construction

39

10 4 65BA BA CA

B, C

1 1 3 5

0 2 4

0 0 6

A B A B A C

0 1 2 3 4 5

A

B

C

A

C32

Constructing the DFA for KMP substring search for A B A B A C

j

pat.charAt(j)

dfa[][j]

39

Page 40: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Mismatch transition: back up if c != pat.charAt(j).

Knuth-Morris-Pratt construction

40

10 2 4 65BA BA CA

B, C

B, C

1 1 3 5

0 2 0 4

0 0 0 6

A B A B A C

0 1 2 3 4 5

A

B

C

A

C3

Constructing the DFA for KMP substring search for A B A B A C

j

pat.charAt(j)

dfa[][j]

40

Page 41: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Mismatch transition: back up if c != pat.charAt(j).

Knuth-Morris-Pratt construction

41

10 32 4 65BA BA CA

A

B, C

B, C

C

1 1 3 1 5

0 2 0 4

0 0 0 0 6

A B A B A C

0 1 2 3 4 5

A

B

C

A

C

Constructing the DFA for KMP substring search for A B A B A C

j

pat.charAt(j)

dfa[][j]

41

Page 42: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Mismatch transition: back up if c != pat.charAt(j).

Knuth-Morris-Pratt construction

42

10 32 4 65BA BA CA

A

B, C

B, C

B, C

C

1 1 3 1 5

0 2 0 4 0

0 0 0 0 0 6

A B A B A C

0 1 2 3 4 5

A

B

C

A

C

Constructing the DFA for KMP substring search for A B A B A C

j

pat.charAt(j)

dfa[][j]

42

Page 43: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Mismatch transition: back up if c != pat.charAt(j).

Knuth-Morris-Pratt construction

43

10 32 4 65BA BA CA

B

A

A

B, C

B, C

B, C

C

1 1 3 1 5 1

0 2 0 4 0 4

0 0 0 0 0 6

A B A B A C

0 1 2 3 4 5

A

B

C

A

C

Constructing the DFA for KMP substring search for A B A B A C

j

pat.charAt(j)

dfa[][j]

43

Page 44: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Knuth-Morris-Pratt construction

44

1 1 3 1 5 1

0 2 0 4 0 4

0 0 0 0 0 6

A B A B A C

0 1 2 3 4 5

A

B

C

Constructing the DFA for KMP substring search for A B A B A C

10 32 4 65BA BA CA

B

A

A

B, C

B, C

B, C

C

A

C

pat.charAt(j)

dfa[][j]

44

Page 45: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Include one state for each character in pattern (plus accept state).

How to build DFA from pattern?

45

10 32 4 65

A B A B A C

0 1 2 3 4 5pat.charAt(j)

A

B

C

dfa[][j]

45

Page 46: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Match transition. If in state j and next char c == pat.charAt(j), go to j+1.

How to build DFA from pattern?

46

10 32 4 65BA BA CA

1 3 5

2 4

6

A B A B A C

0 1 2 3 4 5pat.charAt(j)

A

B

C

dfa[][j]

first j characters of pattern

have already been matched

now first j+1 characters ofpattern have been matched

next char matches

46

Page 47: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Mismatch transition. If in state j and next char c != pat.charAt(j), then the last j-1 characters of input are pat[1..j-1], followed by c.

To compute dfa[c][j]: Simulate pat[1..j-1] on DFA and take transition c.Running time. Seems to require j steps.

Ex. dfa['A'][5] = 1; dfa['B'][5] = 4

How to build DFA from pattern?

47

simulate BABA;

take transition 'A'

= dfa['A'][3]

simulate BABA;

take transition 'B'

= dfa['B'][3]

still under construction (!)

10 32 4 65BA A CA

B

A

B, C

B, C

B, C

C

A

C

Bpat.charAt(j) BA

2 5

A

0 1 3 4

CA

j

j

3 B

A

B

A

simulation

of BABA

47

Page 48: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Mismatch transition. If in state j and next char c != pat.charAt(j), then the last j-1 characters of input are pat[1..j-1], followed by c.

To compute dfa[c][j]: Simulate pat[1..j-1] on DFA and take transition c.Running time. Takes only constant time if we maintain state X.

Ex. dfa['A'][5] = 1; dfa['B'][5] = 4;

How to build DFA from pattern?

48

from state X,

take transition 'A' = dfa['A'][X]

from state X,

take transition 'B' = dfa['B'][X]

state X

10 32 4 65BA BA CA

B

A

A

B, C

B, C

B, C

C

A

C

j

B BA

2 5

A

0 1 3 4

CA

X

from state X, take transition 'C'

= dfa['C'][X]

•X'= 0

B

A

48

Page 49: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Include one state for each character in pattern (plus accept state).

Knuth-Morris-Pratt construction (in linear time)

49

4 65

A B A B A C

0 1 2 3 4 5

A

B

C

3210

Constructing the DFA for KMP substring search for A B A B A C

pat.charAt(j)

dfa[][j]

49

Page 50: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Match transition. For each state j, dfa[pat.charAt(j)][j] = j+1.

Knuth-Morris-Pratt construction (in linear time)

50

4 65BA BA CA

1 3 5

2 4

6

A B A B A C

0 1 2 3 4 5

A

B

C

3210

Constructing the DFA for KMP substring search for A B A B A C

first j characters of pattern

have already been matched

now first j+1 characters ofpattern have been matched

pat.charAt(j)

dfa[][j]

50

Page 51: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Mismatch transition. For state 0 and char c != pat.charAt(j), set dfa[c][0] = 0.

Knuth-Morris-Pratt construction (in linear time)

51

0 4 65BA BA CA

B, C

1 3 5

0 2 4

0 6

A B A B A C

0 1 2 3 4 5

A

B

C

321

Constructing the DFA for KMP substring search for A B A B A C

j

pat.charAt(j)

dfa[][j]

51

Page 52: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Mismatch transition. For each state j and char c != pat.charAt(j), setdfa[c][j] = dfa[c][X]; then update X = dfa[pat.charAt(j)][X].

Knuth-Morris-Pratt construction (in linear time)

52

10 4 65BA BA CA

B, C

1 3 5

0 2 4

0 6

A B A B A C

0 1 2 3 4 5

A

B

C

A

C32

Constructing the DFA for KMP substring search for A B A B A C

X = simulation of empty string

j

X

1

0

pat.charAt(j)

dfa[][j]

52

Page 53: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Mismatch transition. For each state j and char c != pat.charAt(j), setdfa[c][j] = dfa[c][X]; then update X = dfa[pat.charAt(j)][X].

Knuth-Morris-Pratt construction (in linear time)

53

10 2 4 65BA BA CA

B, C

B, C

1 1 3 5

0 2 4

0 0 6

A B A B A C

0 1 2 3 4 5

A

B

C

A

C3

Constructing the DFA for KMP substring search for A B A B A C

X = simulation of B

j

X

0

0

pat.charAt(j)

dfa[][j]

53

Page 54: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Mismatch transition. For each state j and char c != pat.charAt(j), setdfa[c][j] = dfa[c][X]; then update X = dfa[pat.charAt(j)][X].

2

0

1

Knuth-Morris-Pratt construction (in linear time)

54

10 32 4 65BA BA CA

A

B, C

B, C

C

1 3 5

0 0 4

0 0 6

A B A B A C

0 1 2 3 4 5

A

B

C

A

C

Constructing the DFA for KMP substring search for A B A B A C

X = simulation of B A

j

X

2

0

1

pat.charAt(j)

dfa[][j]

54

Page 55: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Mismatch transition. For each state j and char c != pat.charAt(j), setdfa[c][j] = dfa[c][X]; then update X = dfa[pat.charAt(j)][X].

Knuth-Morris-Pratt construction (in linear time)

55

10 32 4 65BA BA CA

A

B, C

B, C

B, C

C

1 1 3 1 5

0 2 0 4

0 0 0 0 6

A B A B A C

0 1 2 3 4 5

A

B

C

A

C

Constructing the DFA for KMP substring search for A B A B A C

X = simulation of B A B

j

X

0

0

pat.charAt(j)

dfa[][j]

55

Page 56: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Mismatch transition. For each state j and char c != pat.charAt(j), setdfa[c][j] = dfa[c][X]; then update X = dfa[pat.charAt(j)][X].

Knuth-Morris-Pratt construction (in linear time)

56

10 32 4 65BA BA CA

B

A

A

B, C

B, C

B, C

C

1 1 3 1 5

0 2 0 4 0

0 0 0 0 0 6

A B A B A C

0 1 2 3 4 5

A

B

C

A

C

Constructing the DFA for KMP substring search for A B A B A C

X = simulation of B A B A

j

X

1

4

pat.charAt(j)

dfa[][j]

56

Page 57: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Mismatch transition. For each state j and char c != pat.charAt(j), setdfa[c][j] = dfa[c][X]; then update X = dfa[pat.charAt(j)][X].

Knuth-Morris-Pratt construction (in linear time)

57

10 32 4 6BA BA CA

B

A

A

B, C

B, C

B, C

C

1 1 3 1 5 1

0 2 0 4 0 4

0 0 0 0 0 6

A B A B A C

0 1 2 3 4 5pat.charAt(j)

A

B

C

dfa[][j]

A

C

Constructing the DFA for KMP substring search for A B A B A C

X = simulation of B A B A C

j

X

5

57

Page 58: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Knuth-Morris-Pratt construction (in linear time)

58

1 1 3 1 5 1

0 2 0 4 0 4

0 0 0 0 0 6

A B A B A C

0 1 2 3 4 5pat.charAt(j)

A

B

C

dfa[][j]

Constructing the DFA for KMP substring search for A B A B A C

10 32 4 65BA BA CA

B

A

A

B, C

B, C

B, C

C

A

C

58

Page 59: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Constructing the DFA for KMP substring search: Java implementation

For each state j:

• Copy dfa[][X] to dfa[][j] for mismatch case.

• Set dfa[pat.charAt(j)][j] to j+1 for match case.

• Update X.

Running time. M character accesses (but space proportional to R M).

59

public KMP(String pat) { this.pat = pat; M = pat.length(); dfa = new int[R][M]; dfa[pat.charAt(0)][0] = 1; for (int X = 0, j = 1; j < M; j++) { for (int c = 0; c < R; c++) dfa[c][j] = dfa[c][X]; dfa[pat.charAt(j)][j] = j+1; X = dfa[pat.charAt(j)][X]; } }

copy mismatch cases

set match case

update restart state

59

Page 60: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Proposition. KMP substring search accesses no more than M + N charsto search for a pattern of length M in a text of length N.

Pf. Each pattern char accessed once when constructing the DFA;each text char accessed once (in the worst case) when simulating the DFA.Proposition. KMP constructs dfa[][] in time and space proportional to R M. Larger alphabets. Improved version of KMP constructs nfa[] in time and space proportional to M.

60

KMP substring search analysis

NFA corresponding to the string A B A B A C

0 1 2 3 4 5 6A B A A C

0 1 2 3 4 5 A B A B A C 0 0 0 0 0 3

next[j]pat.charAt(j)

j

graphical representation

internal representation

mismatch transition(back up at least one state)

B

KMP NFA for ABABAC

60

Page 61: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

61

Knuth-Morris-Pratt: brief history

• Independently discovered by two theoreticians and a hacker.- Knuth: inspired by esoteric theorem, discovered linear-time algorithm

- Pratt: made running time independent of alphabet size

- Morris: built a text editor for the CDC 6400 computer

• Theory meets practice.

Don Knuth Vaughan PrattJim Morris

SIAM J. COMPUT.Vol. 6, No. 2, June 1977

FAST PATTERN MATCHING IN STRINGS*

DONALD E. KNUTHf, JAMES H. MORRIS, JR.:l: AND VAUGHAN R. PRATT

Abstract. An algorithm is presented which finds all occurrences of one. given string withinanother, in running time proportional to the sum of the lengths of the strings. The constant ofproportionality is low enough to make this algorithm of practical use, and the procedure can also beextended to deal with some more general pattern-matching problems. A theoretical application of thealgorithm shows that the set of concatenations of even palindromes, i.e., the language {can}*, can berecognized in linear time. Other algorithms which run even faster on the average are also considered.

Key words, pattern, string, text-editing, pattern-matching, trie memory, searching, period of astring, palindrome, optimum algorithm, Fibonacci string, regular expression

Text-editing programs are often required to search through a string ofcharacters looking for instances of a given "pattern" string; we wish to find allpositions, or perhaps only the leftmost position, in which the pattern occurs as acontiguous substring of the text. For example, c a e n a r y contains the patterne n, but we do not regard c a n a r y as a substring.

The obvious way to search for a matching pattern is to try searching at everystarting position of the text, abandoning the search as soon as an incorrectcharacter is found. But this approach can be very inefficient, for example when weare looking for an occurrence of aaaaaaab in aaaaaaaaaaaaaab.When the pattern is a"b and the text is a2"b, we will find ourselves making (n + 1)comparisons of characters. Furthermore, the traditional approach involves"backing up" the input text as we go through it, and this can add annoyingcomplications when we consider the buffering operations that are frequentlyinvolved.

In this paper we describe a pattern-matching algorithm which finds alloccurrences of a pattern of length rn within a text of length n in O(rn + n) units oftime, without "backing up" the input text. The algorithm needs only O(m)locations of internal memory if the text is read from an external file, and onlyO(log m) units of time elapse between consecutive single-character inputs. All ofthe constants of proportionality implied by these "O" formulas are independentof the alphabet size.

* Received by the editors August 29, 1974, and in revised form April 7, 1976.t Computer Science Department, Stanford University, Stanford, California 94305. The work of

this author was supported in part by the National Science Foundation under Grant GJ 36473X and bythe Office of Naval Research under Contract NR 044-402.

Xerox Palo Alto Research Center, Palo Alto, California 94304. The work of this author wassupported in part by the National Science Foundation under Grant GP 7635 at the University ofCalifornia, Berkeley.

Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Mas-sachusetts 02139. The work of this author was supported in part by the National Science Foundationunder Grant GP-6945 at University of California, Berkeley, and under Grant GJ-992 at StanfordUniversity.

323

61

Page 62: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

SUBSTRING SEARCH

‣ Brute force‣ Knuth-Morris-Pratt‣ Boyer-Moore‣ Rabin-Karp

62

Page 63: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Boyer Moore Intuition

• Scan the text with a window of M chars (length of pattern)

• Case 1: Scan Window is exactly on top of the searched pattern

- Starting from one end check if all characters are equal. (We must check!)

• Case 2: Scan Window starts after the pattern starts.

63

Text

Scan Window (M)

Pattern in Text (M)

63

Page 64: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Boyer Moore Intuition (2)

• Case 3: Scan Window starts before the pattern starts

• Case 4: Independent

• In case 4, simply shift window M characters

• Avoid Case 2

• Convert Case 3 to Case 1, by shifting appropriately

64

64

Page 65: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Boyer Moore Intuition (3)

• If we can recognise the character in the scan window end-point, we can find how many characters to shift.

• So, for example D is the 4th character, we must shift window 4 characters so that they overlap.

65

A B C D E F G H

d = 4

A B C D E F G H

d = 4

65

Page 66: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Boyer Moore Intuition (4)

• A potential problem, the character in the text can repeat.

• For example, pattern = XXAXX and the text is A X A X A X A X X A X X A X A X A X A X

• Solution: be conservative, choose the instance with the least Shift (so we cannot miss the others).

66

66

Page 67: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Boyer Moore Intuition (5)

• A X A X A X A X X A X X A X A X A X A X Search: XXAXX

• So, for the example when it is A at the endpoint we must shift for 2

characters.

• text: AAAAX we have a mismatch in last A, now we must shift only once, so that we can check the configuration where the A we found moves to middle.

• text: AAYXX we have a mismatch in Y, now we must shift 3 times as we know that the last 2 characters are in pattern and they can be repeating in the first 3 characters.

67

67

Page 68: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Intuition.

• Scan characters in pattern from right to left.

• Can skip as many as M text chars when finding one not in the pattern.- First we check the character in index pattern.length()-1

- It is N which is not E, so we know that first 5 characters is not a match. Shift text 5 characters

- S != E so shift 5, E == E so we can check for the pattern.length()-2, L!=N, skip 4.

Boyer-Moore: mismatched character heuristic

68

Mismatched character heuristic for right-to-left (Boyer-Moore) substring search

i j 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 F I N D I N A H A Y S T A C K N E E D L E I N A 0 5 N E E D L E 5 5 N E E D L E11 4 N E E D L E15 0 N E E D L E return i = 15

pattern

text

68

Page 69: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Boyer-Moore: mismatched character heuristic

Q. How much to skip?Case 1. Mismatch character not in pattern.

69

. . . . . . T L E . . . . . . N E E D L E

txt

pat

mismatch character 'T' not in pattern: increment i one character beyond 'T'

i

. . . . . . T L E . . . . . . N E E D L E

txt

pat

i

before

after

69

Page 70: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Boyer-Moore: mismatched character heuristic

Q. How much to skip?

Case 2a. Mismatch character in pattern.

70

. . . . . . N L E . . . . . . N E E D L E

txt

pat

mismatch character 'N' in pattern: align text 'N' with rightmost pattern 'N'

i

. . . . . . N L E . . . . . . N E E D L E

txt

pat

i

before

after

70

Page 71: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Boyer-Moore: mismatched character heuristic

Q. How much to skip?

Case 2b. Mismatch character in pattern (but heuristic no help).

71

. . . . . . E L E . . . . . . N E E D L E

txt

pat

before

mismatch character 'E' in pattern: align text 'E' with rightmost pattern 'E' ?

i

. . . . . . E L E . . . . . . N E E D L E

txt

pat

aligned with rightmost E?

i

71

Page 72: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Boyer-Moore: mismatched character heuristic

Q. How much to skip?

Case 2b. Mismatch character in pattern (but heuristic no help).

72

. . . . . . E L E . . . . . . N E E D L E

txt

pat

mismatch character 'E' in pattern: increment i by 1

i

. . . . . . E L E . . . . . . N E E D L E

txt

pat

i

before

after

72

Page 73: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Boyer-Moore: mismatched character heuristic

Q. How much to skip?

A. Precompute index of rightmost occurrence of character c in pattern

(-1 if character not in pattern).

73

right = new int[R]; for (int c = 0; c < R; c++) right[c] = -1; for (int j = 0; j < M; j++) right[pat.charAt(j)] = j;

Boyer-Moore skip table computation

c right[c]

N E E D L E 0 1 2 3 4 5A -1 -1 -1 -1 -1 -1 -1 -1B -1 -1 -1 -1 -1 -1 -1 -1C -1 -1 -1 -1 -1 -1 -1 -1D -1 -1 -1 -1 3 3 3 3E -1 -1 1 2 2 2 5 5... -1L -1 -1 -1 -1 -1 4 4 4M -1 -1 -1 -1 -1 -1 -1 -1N -1 0 0 0 0 0 0 0... -1

73

Page 74: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Boyer-Moore: Java implementation

74

public int search(String txt) { int N = txt.length(); int M = pat.length(); int skip; for (int i = 0; i <= N-M; i += skip) { skip = 0; for (int j = M-1; j >= 0; j--) { if (pat.charAt(j) != txt.charAt(i+j)) { skip = Math.max(1, j - right[txt.charAt(i+j)]); break; } } if (skip == 0) return i; } return N; }

compute skip value

match

in case other term is nonpositive

74

Page 75: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Another Example

75

A X A X A X A X X X A X A X X X X A A A

SEARCH FOR: XXXX

If the window scan points to an unrecognised character, we can skip past

that character. For this example, for the initial step we first match X at

the end, when check for previous character (A) which is not in the string

we skip 3 steps. The X at the end, we matched can still be the first

character of the pattern, so we do not skip that.

75

Page 76: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Property. Substring search with the Boyer-Moore mismatched character

heuristic takes about ~ N / M character compares to search for a pattern of

length M in a text of length N.

Worst-case. Can be as bad as ~ M N.

Boyer-Moore variant. Can improve worst case to ~ 3 N by adding aKMP-like rule to guard against repetitive patterns.

Boyer-Moore: analysis

76

sublinear!

Boyer-Moore-Horspool substring search (worst case)

i skip 0 1 2 3 4 5 6 7 8 9

B B B B B B B B B B

0 0 A B B B B 1 1 A B B B B 2 1 A B B B B 3 1 A B B B B 4 1 A B B B B 5 1 A B B B B

txt

pat

76

Page 77: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

SUBSTRING SEARCH

‣ Brute force‣ Knuth-Morris-Pratt‣ Boyer-Moore‣ Rabin-Karp

77

Page 78: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Rabin-Karp fingerprint search

Basic idea = modular hashing.

• Compute a hash of pattern characters 0 to M - 1.

• For each i, compute a hash of text characters i to M + i - 1.

• If pattern hash = text substring hash, check for a match.

78

Basis for Rabin-Karp substring search

txt.charAt(i)i 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 3 1 4 1 5 9 2 6 5 3 5 8 9 7 9 3

0 3 1 4 1 5 % 997 = 508

1 1 4 1 5 9 % 997 = 201

2 4 1 5 9 2 % 997 = 715

3 1 5 9 2 6 % 997 = 971

4 5 9 2 6 5 % 997 = 442

5 9 2 6 5 3 % 997 = 929

6 2 6 5 3 5 % 997 = 613

pat.charAt(i)i 0 1 2 3 4

2 6 5 3 5 % 997 = 613

return i = 6

match

78

Page 79: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Modular hash function. Using the notation ti for txt.charAt(i), we wish to compute

Intuition. M-digit, base-R integer, modulo Q.

Horner's method. Linear-time method to evaluate degree-M polynomial.

Efficiently computing the hash function

79

// Compute hash for M-digit key private long hash(String key, int M) { long h = 0; for (int j = 0; j < M; j++) h = (R * h + key.charAt(j)) % Q; return h; }

•xi = ti R M-1 + ti+1 R M-2 + … + ti+M-1 R 0 (mod Q)

Computing the hash value for the pattern with Horner’s method

pat.charAt() i 0 1 2 3 4 2 6 5 3 5

0 2 % 997 = 2

1 2 6 % 997 = (2*10 + 6) % 997 = 26

2 2 6 5 % 997 = (26*10 + 5) % 997 = 265

3 2 6 5 3 % 997 = (265*10 + 3) % 997 = 659

4 2 6 5 3 5 % 997 = (659*10 + 5) % 997 = 613

QR

79

Page 80: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Challenge. How to efficiently compute xi+1 given that we know xi.Key property. Can update hash function in constant time!

Efficiently computing the hash function

80

•xi = ti R M–1 + ti+1 R M–2 + … + ti+M–1 R0

•xi+1 = ti+1 R M–1 + ti+2 R M–2 + … + ti+M R0

•xi+1 = ( xi – t i R M–1 ) R + t i +M

Key computation in Rabin-Karp substring search(move right one position in the text)

i ... 2 3 4 5 6 7 ... 1 4 1 5 9 2 6 5 4 1 5 9 2 6 5 4 1 5 9 2 - 4 0 0 0 0 1 5 9 2 * 1 0 1 5 9 2 0 + 6 1 5 9 2 6

current value

subtract leading digit

multiply by radix

add new trailing digit

new value

current valuenew value

text

current

value

subtract leading digit

add new trailing digit

multiply

by radix(can precompute RM–2)

80

Page 81: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Rabin-Karp substring search example

81

Rabin-Karp substring search example

i 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 3 1 4 1 5 9 2 6 5 3 5 8 9 7 9 3

0 3 % 997 = 3

1 3 1 % 997 = (3*10 + 1) % 997 = 31

2 3 1 4 % 997 = (31*10 + 4) % 997 = 314

3 3 1 4 1 % 997 = (314*10 + 1) % 997 = 150

4 3 1 4 1 5 % 997 = (150*10 + 5) % 997 = 508

5 1 4 1 5 9 % 997 = ((508 + 3*(997 - 30))*10 + 9) % 997 = 201

6 4 1 5 9 2 % 997 = ((201 + 1*(997 - 30))*10 + 2) % 997 = 715

7 1 5 9 2 6 % 997 = ((715 + 4*(997 - 30))*10 + 6) % 997 = 971

8 5 9 2 6 5 % 997 = ((971 + 1*(997 - 30))*10 + 5) % 997 = 442

9 9 2 6 5 3 % 997 = ((442 + 5*(997 - 30))*10 + 3) % 997 = 929

10 2 6 5 3 5 % 997 = ((929 + 9*(997 - 30))*10 + 5) % 997 = 613

Q

RM R

return i-M+1 = 6

match

81

Page 82: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Rabin-Karp: Java implementation

82

public class RabinKarp { private long patHash; // pattern hash value private int M; // pattern length private long Q; // modulus private int R; // radix private long RM; // R^(M-1) % Q

public RabinKarp(String pat) { M = pat.length(); R = 256; Q = longRandomPrime();

RM = 1; for (int i = 1; i <= M-1; i++) RM = (R * RM) % Q; patHash = hash(pat, M); }

private long hash(String key, int M) { /* as before */ }

public int search(String txt) { /* see next slide */ } }

precompute RM – 1 (mod Q)

a large prime

(but avoid overflow)

82

Page 83: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Rabin-Karp: Java implementation (continued)

Monte Carlo version. Return match if hash match.Las Vegas version. Check for substring match if hash match;continue search if false collision.

83

public int search(String txt) { int N = txt.length(); int txtHash = hash(txt, M); if (patHash == txtHash) return 0; for (int i = M; i < N; i++) { txtHash = (txtHash + Q - RM*txt.charAt(i-M) % Q) % Q; txtHash = (txtHash*R + txt.charAt(i)) % Q; if (patHash == txtHash) return i - M + 1; } return N; }

check for hash collision using rolling hash function

83

Page 84: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Rabin-Karp analysis

Theory. If Q is a sufficiently large random prime (about M N 2), then the probability of a false collision is about 1 / N.

Practice. Choose Q to be a large prime (but not so large as to cause

overflow). Under reasonable assumptions, probability of a collision is about 1 / Q.

Monte Carlo version.

• Always runs in linear time.

• Extremely likely to return correct answer (but not always!).

Las Vegas version.• Always returns correct answer.

• Extremely likely to run in linear time (but worst case is M N).

84

84

Page 85: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Rabin-Karp fingerprint search

Advantages.

• Extends to 2d patterns.

• Extends to finding multiple patterns.

Disadvantages.

• Arithmetic ops slower than char compares.

• Las Vegas version requires backup.

• Poor worst-case guarantee.

85

85

Page 86: BBM 202 - ALGORITHMS - web.cs.hacettepe.edu.trbbm202/slides/16-substring-search.pdf · BBM 202 - ALGORITHMS SUBSTRING SEARCH DEPT. OF COMPUTER ENGINEERING Acknowledgement: The course

Cost of searching for an M-character pattern in an N-character text.

86

Substring search cost summary

Rabin-Karp substring search is known as a fingerprint search because it uses a small amount of information to represent a (potentially very large) pattern. Then it looks for this fingerprint (the hash value) in the text. The algorithm is efficient because the fingerprints can be efficiently computed and compared.

Summary The table at the bottom of the page summarizes the algorithms that we have discussed for substring search. As is often the case when we have several algo-rithms for the same task, each of them has attractive features. Brute force search is easy to implement and works well in typical cases (Java’s indexOf() method in String uses brute-force search); Knuth-Morris-Pratt is guaranteed linear-time with no backup in the input; Boyer-Moore is sublinear (by a factor of M) in typical situations; and Rabin-Karp is linear. Each also has drawbacks: brute-force might require time proportional to MN; Knuth-Morris-Pratt and Boyer-Moore use extra space; and Rabin-Karp has a relatively long inner loop (several arithmetic operations, as opposed to character com-pares in the other methods. These characteristics are summarized in the table below.

algorithm versionoperation count backup

in input? correct? extra spaceguarantee typical

brute force — M N 1.1 N yes yes 1

Knuth-Morris-Pratt

full DFA (Algorithm 5.6 ) 2 N 1.1 N no yes MR

mismatch transitions only 3 N 1.1 N no yes M

Boyer-Moore

full algorithm 3 N N / M yes yes R

mismatched char heuristic only

(Algorithm 5.7 )M N N / M yes yes R

Rabin-Karp†Monte Carlo

(Algorithm 5.8 ) 7 N 7 N no yes † 1

Las Vegas 7 N † 7 N no † yes 1

† probabilisitic guarantee, with uniform hash function

Cost summary for substring-search implementations

6795.3 � Substring Search

�������������� ������������� �

Rabin-Karp substring search is known as a fingerprint search because it uses a small amount of information to represent a (potentially very large) pattern. Then it looks for this fingerprint (the hash value) in the text. The algorithm is efficient because the fingerprints can be efficiently computed and compared.

Summary The table at the bottom of the page summarizes the algorithms that we have discussed for substring search. As is often the case when we have several algo-rithms for the same task, each of them has attractive features. Brute force search is easy to implement and works well in typical cases (Java’s indexOf() method in String uses brute-force search); Knuth-Morris-Pratt is guaranteed linear-time with no backup in the input; Boyer-Moore is sublinear (by a factor of M) in typical situations; and Rabin-Karp is linear. Each also has drawbacks: brute-force might require time proportional to MN; Knuth-Morris-Pratt and Boyer-Moore use extra space; and Rabin-Karp has a relatively long inner loop (several arithmetic operations, as opposed to character com-pares in the other methods. These characteristics are summarized in the table below.

algorithm versionoperation count backup

in input? correct? extra spaceguarantee typical

brute force — M N 1.1 N yes yes 1

Knuth-Morris-Pratt

full DFA (Algorithm 5.6 ) 2 N 1.1 N no yes MR

mismatch transitions only 3 N 1.1 N no yes M

Boyer-Moore

full algorithm 3 N N / M yes yes R

mismatched char heuristic only

(Algorithm 5.7 )M N N / M yes yes R

Rabin-Karp†Monte Carlo

(Algorithm 5.8 ) 7 N 7 N no yes † 1

Las Vegas 7 N † 7 N no † yes 1

† probabilisitic guarantee, with uniform hash function

Cost summary for substring-search implementations

6795.3 � Substring Search

�������������� ������������� �

86