UMass Lowell Computer Science 91.503 Analysis of Algorithms, Prof. Karen Daniels, Fall, 2008. Tuesday, 12/2/08. String Matching Algorithms, Chapter 32

Page 1: UMass Lowell Computer Science 91.503 Analysis of Algorithms Prof. Karen Daniels Fall, 2008

UMass Lowell Computer Science 91.503

Analysis of Algorithms Prof. Karen Daniels

Fall, 2008

Tuesday, 12/2/08

String Matching Algorithms

Chapter 32

Page 2:

Chapter Dependencies

Ch 32 String Matching

Automata

You’re responsible for material in Sections 32.1-32.4 of this chapter.

Page 3:

String Matching Algorithms

Motivation & Basics

Page 4:

String Matching Problem

source: 91.503 textbook Cormen et al.

Motivations: text-editing, pattern matching in DNA sequences

Text: array $T[1 \ldots n]$. Pattern: array $P[1 \ldots m]$, with $m \le n$.

Array Element: Character from finite alphabet $\Sigma$

Pattern $P$ occurs with shift $s$ in $T$ if $P[1 \ldots m] = T[s+1 \ldots s+m]$, where $0 \le s \le n - m$.

[textbook item shown: 32.1]

Page 5:

String Matching Algorithms

Naive Algorithm: Worst-case running time in $O((n-m+1)m)$

Rabin-Karp: Worst-case running time in $O((n-m+1)m)$. Better than this on average and in practice.

Finite Automaton-Based: Worst-case running time in $O(n + m|\Sigma|)$

Knuth-Morris-Pratt: Worst-case running time in $O(n + m)$

Page 6:

Notation & Terminology

$\Sigma^*$ = set of all finite-length strings formed using characters from alphabet $\Sigma$

Empty string: $\varepsilon$
$|x|$ = length of string x
w is a prefix of x: $w \sqsubset x$
w is a suffix of x: $w \sqsupset x$
prefix, suffix are transitive

$ab \sqsubset abcca$

$cca \sqsupset abcca$
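
The prefix and suffix relations above can be checked directly; the following is a minimal Python sketch using the standard `str.startswith` and `str.endswith` methods. The function names `is_prefix` and `is_suffix` are illustrative and do not come from the slides.

```python
def is_prefix(w, x):
    """True when w is a prefix of x (the empty string is a prefix of every string)."""
    return x.startswith(w)


def is_suffix(w, x):
    """True when w is a suffix of x (the empty string is a suffix of every string)."""
    return x.endswith(w)


if __name__ == "__main__":
    # The two examples from the slide.
    print(is_prefix("ab", "abcca"))
    print(is_suffix("cca", "abcca"))
```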

Page 7:

Overlapping Suffix Lemma

source: 91.503 textbook Cormen et al.

[textbook items shown: 32.1, 32.3]

Page 8:

String Matching Algorithms

Naive Algorithm

Page 9:

Naive String Matching

source: 91.503 textbook Cormen et al.

worst-case running time is in $\Theta((n-m+1)m)$

[textbook item shown: 32.4]
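
A minimal Python sketch of naive string matching, assuming 0-based indexing (the slides index T and P from 1). The function name `naive_string_matcher` is chosen here for illustration. It tries each of the n - m + 1 shifts and compares up to m characters per shift, which is where the (n - m + 1)m worst-case bound comes from.

```python
def naive_string_matcher(text, pattern):
    """Return all 0-based shifts s with text[s:s+m] == pattern."""
    n, m = len(text), len(pattern)
    shifts = []
    # There are n - m + 1 candidate shifts; none exist when m > n.
    for s in range(n - m + 1):
        # The slice comparison inspects at most m characters.
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts


if __name__ == "__main__":
    print(naive_string_matcher("abcabaabcabac", "abaa"))
```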

Page 10:

String Matching Algorithms

Rabin-Karp

Page 11:

Rabin-Karp Algorithm

source: 91.503 textbook Cormen et al.

Assume each character is digit in radix-d notation (e.g. d=10)

p = decimal value of pattern
$t_s$ = decimal value of substring T[s+1..s+m] for s = 0,1,...,n-m

Strategy:
compute p in O(m) time (which is in O(n))
compute all $t_i$ values in total of O(n) time
find all valid shifts s in O(n) time by comparing p with each $t_s$

Compute p in O(m) time using Horner’s rule: $p = P[m] + d(P[m-1] + d(P[m-2] + \cdots + d(P[2] + dP[1])))$

Compute $t_0$ similarly from T[1..m] in O(m) time

Compute remaining $t_i$’s in O(n-m) time: $t_{s+1} = d(t_s - d^{m-1}T[s+1]) + T[s+m+1]$
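
The Horner's-rule evaluation and the rolling update of the window value can be sketched in Python as follows. This is an illustration under stated assumptions: the text consists of the characters '0' through '9' with d = 10, indexing is 0-based, and the names `horner_value` and `window_values` are hypothetical. No modulus is applied here, so the values are exact integers.

```python
def horner_value(digits, d=10):
    """Evaluate a digit string in radix d with Horner's rule, in O(len(digits)) steps."""
    value = 0
    for ch in digits:
        # int(ch) assumes the characters are '0'..'9'.
        value = d * value + int(ch)
    return value


def window_values(text, m, d=10):
    """Return [t_0, t_1, ..., t_{n-m}], the values of every length-m window of text."""
    n = len(text)
    if m == 0 or m > n:
        return []
    high = d ** (m - 1)            # weight of the high-order digit of a window
    t = horner_value(text[:m], d)  # t_0, computed in O(m)
    values = [t]
    for s in range(n - m):
        # Drop the high-order digit text[s], shift left, append text[s + m].
        t = d * (t - high * int(text[s])) + int(text[s + m])
        values.append(t)
    return values


if __name__ == "__main__":
    print(horner_value("31415"))
    print(window_values("2359023141", 5))
```

For the digit string "2359023141" and m = 5, the first three window values are 23590, 35902 and 59023; each is obtained from the previous one with a constant number of arithmetic operations.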

Page 12:

Rabin-Karp Algorithm

source: 91.503 textbook Cormen et al.

$p$, $t_s$ may be large, so use mod

[textbook item shown: 32.5]

Page 13:

Rabin-Karp Algorithm (continued)

source: 91.503 textbook Cormen et al.

p = 31415

spurious hit

$t_{s+1} = d(t_s - d^{m-1}T[s+1]) + T[s+m+1]$

$d^{m-1} \bmod q$

Page 14:

Rabin-Karp Algorithm (continued)

source: 91.503 textbook Cormen et al.

Page 15:

Rabin-Karp Algorithm (continued)

source: 91.503 textbook Cormen et al.

worst-case running time is in $\Theta((n-m+1)m)$

Annotations:
d is radix. q is modulus.
Preprocessing: $\Theta(m)$, which is in $O(n)$
high-order digit position for m-digit window
$\Theta(m)$
$\Theta(m)$
$\Theta((n-m+1)m)$
Matching loop invariant: when line 10 executed, $t_s = T[s+1..s+m] \bmod q$
rule out spurious hit
Try all possible shifts

Page 16:

Rabin-Karp Algorithm (continued)

source: 91.503 textbook Cormen et al.

average-case running time is in $O(n+m)$

Assume reducing mod q is like random mapping from $\Sigma^*$ to $Z_q$ ($\Sigma^*$ = set of all finite-length strings formed from $\Sigma$)

Estimate (chance that $t_s \equiv p \pmod q$) = 1/q
\# spurious hits is in O(n/q)

Expected matching time = O(n) + O(m(v + n/q)) (v = # valid shifts)
The O(n) term covers preprocessing + $t_s$ updates; the O(m(v + n/q)) term covers explicit matching comparisons.

If v is in O(1) and q >= m, the expected matching time is in O(n).

Annotations:
d is radix. q is modulus.
Preprocessing: $\Theta(m)$, which is in $O(n)$
high-order digit position for m-digit window
$\Theta(m)$
$\Theta(m)$
$\Theta((n-m+1)m)$
Matching loop invariant: when line 10 executed, $t_s = T[s+1..s+m] \bmod q$
rule out spurious hit
Try all possible shifts
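
A minimal Python sketch of Rabin-Karp matching with a modulus, under these assumptions: the text and pattern are strings of the characters '0' through '9', d = 10 is the radix, q = 13 is an arbitrarily chosen small prime modulus, and shifts are reported 0-based. The function name `rabin_karp_matcher` and the variable `h` (for d^(m-1) mod q) are naming choices made for this illustration. Every fingerprint hit is verified by an explicit comparison, which is what rules out spurious hits.

```python
def rabin_karp_matcher(text, pattern, d=10, q=13):
    """Return all 0-based shifts at which pattern occurs in text (digit strings)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)  # d^(m-1) mod q: weight of the high-order digit
    p = 0                 # pattern value mod q
    t = 0                 # current window value mod q
    # Preprocessing: Horner's rule mod q on the pattern and the first window.
    for i in range(m):
        p = (d * p + int(pattern[i])) % q
        t = (d * t + int(text[i])) % q
    shifts = []
    # Try all possible shifts s = 0 .. n - m.
    for s in range(n - m + 1):
        # Equal fingerprints may be a spurious hit, so compare explicitly.
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:
            # Rolling update; Python's % keeps the result in 0..q-1.
            t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
    return shifts


if __name__ == "__main__":
    print(rabin_karp_matcher("2359023141526739921", "31415"))
```

With q = 13, the pattern 31415 has fingerprint 7. The window 67399 in the sample text also has fingerprint 7 (67399 = 13 * 5184 + 7), so it is a spurious hit that the explicit comparison rejects; the only valid shift is 6.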

Page 17:

String Matching Algorithms

Finite Automata

Page 18:

Finite Automata

source: 91.503 textbook Cormen et al.

Strategy: Build automaton for pattern, then examine each text character once.

worst-case running time is in $\Theta(n)$ + automaton creation time

[textbook item shown: 32.6]

Page 19:

Finite Automata

source: 91.503 textbook Cormen et al.

Page 20:

String-Matching Automaton

source: 91.503 textbook Cormen et al.

Pattern = P = ababaca

Automaton accepts strings ending in P

[textbook item shown: 32.7]

Page 21:

String-Matching Automaton

source: 91.503 textbook Cormen et al.

Suffix Function for P:

$\sigma(x)$ = length of longest prefix of P that is a suffix of x

$\sigma(x) = \max\{k : P_k \sqsupset x\}$

Automaton’s operational invariant at each step: keeps track of longest pattern prefix that is a suffix of what has been read so far

[textbook items shown: 32.3, 32.4]
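
The suffix function can be computed directly from its definition by trying prefix lengths from longest to shortest. The following Python sketch does that; the name `suffix_function` is illustrative, and this brute-force form is meant to mirror the definition rather than to be efficient.

```python
def suffix_function(pattern, x):
    """Return the length of the longest prefix of pattern that is a suffix of x."""
    # A prefix longer than x cannot be a suffix of x, so start at min(|P|, |x|).
    for k in range(min(len(pattern), len(x)), -1, -1):
        # k = 0 always succeeds: the empty prefix is a suffix of every string.
        if x.endswith(pattern[:k]):
            return k


if __name__ == "__main__":
    for x in ["", "ccaca", "ccab"]:
        print(repr(x), suffix_function("ab", x))
```

For the pattern "ab", the values are 0 for the empty string, 1 for "ccaca" and 2 for "ccab".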

Page 22:

String-Matching Automaton

source: 91.503 textbook Cormen et al.

Simulate behavior of string-matching automaton that finds occurrences of pattern P of length m in T[1..n]

worst-case running time of matching is in $\Theta(n)$, assuming automaton has already been created...
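
A minimal Python sketch of the matching step, assuming the transition function is already available as a dictionary `delta` keyed by `(state, character)` pairs. That representation, the 0-based shifts, and the name `finite_automaton_matcher` are assumptions made for this illustration. Each text character is examined exactly once.

```python
def finite_automaton_matcher(text, delta, m):
    """Run the automaton over text; return 0-based shifts where state m is reached."""
    q = 0  # start state: no pattern characters matched yet
    shifts = []
    for i, ch in enumerate(text):
        q = delta[(q, ch)]  # one table lookup per text character
        if q == m:
            # The last m characters read equal the pattern.
            shifts.append(i - m + 1)
    return shifts


if __name__ == "__main__":
    # Hand-built transition table for the pattern "ab" over the alphabet {a, b}.
    delta_ab = {
        (0, "a"): 1, (0, "b"): 0,
        (1, "a"): 1, (1, "b"): 2,
        (2, "a"): 1, (2, "b"): 0,
    }
    print(finite_automaton_matcher("abab", delta_ab, 2))
```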

Page 23:

String-Matching Automaton (continued)

source: 91.503 textbook Cormen et al.

Correctness of matching procedure...

[textbook items shown: 32.4, 32.3]

32.3: $\sigma(xa) = \sigma(P_{\sigma(x)}a)$, to be proved next…

Page 24:

String-Matching Automaton (continued)

source: 91.503 textbook Cormen et al.

Correctness of matching procedure...

[textbook items shown: 32.2, 32.8]

Page 25:

String-Matching Automaton (continued)

source: 91.503 textbook Cormen et al.

Correctness of matching procedure...

[textbook items shown: 32.3, 32.9, 32.2, 32.1]

Page 26:

String-Matching Automaton (continued)

source: 91.503 textbook Cormen et al.

Correctness of matching procedure...

[textbook items shown: 32.4, 32.3]

32.3: $\sigma(xa) = \sigma(P_{\sigma(x)}a)$

Page 27:

String-Matching Automaton (continued)

source: 91.503 textbook Cormen et al.

worst-case running time of automaton creation is in $O(m^3 |\Sigma|)$

worst-case running time of entire string-matching strategy is in $O(m |\Sigma|) + \Theta(n)$ (automaton creation time + pattern matching time)

can be improved to: $O(m |\Sigma|)$
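
A direct construction of the transition function can be sketched in Python as below. This is the straightforward cubic-time approach, not the improved one: for each state q and character a it searches downward for the largest k such that the length-k prefix of the pattern is a suffix of the first q pattern characters followed by a. The dictionary representation and the name `compute_transition_function` are assumptions made for this illustration.

```python
def compute_transition_function(pattern, alphabet):
    """Return delta as a dict mapping (state, character) to the next state."""
    m = len(pattern)
    delta = {}
    for q in range(m + 1):           # m + 1 states
        for a in alphabet:           # |alphabet| characters per state
            candidate = pattern[:q] + a
            k = min(m, q + 1)        # the next state can be at most q + 1
            # Up to m + 1 values of k, each suffix test costing O(m).
            while not candidate.endswith(pattern[:k]):
                k -= 1               # stops at k = 0 at the latest
            delta[(q, a)] = k
    return delta


if __name__ == "__main__":
    delta = compute_transition_function("ababaca", "abc")
    print(delta[(5, "b")], delta[(5, "c")], delta[(7, "b")])
```

For the pattern ababaca over the alphabet {a, b, c}, reading b in state 5 falls back to state 4, reading c in state 5 advances to state 6, and reading b in the accepting state 7 leads to state 2.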

Page 28:

String Matching Algorithms

Knuth-Morris-Pratt

Page 29:

Knuth-Morris-Pratt Overview

Achieve $\Theta(n+m)$ time by shortening automaton preprocessing time below $O(m |\Sigma|)$

Approach:
don’t precompute automaton’s transition function
calculate enough transition data “on-the-fly”
obtain data via “alphabet-independent” pattern preprocessing
pattern preprocessing compares pattern against shifts of itself

Page 30:

Knuth-Morris-Pratt Algorithm

source: 91.503 textbook Cormen et al.

determine how pattern matches against itself

[textbook item shown: 32.10]

Page 31:

Knuth-Morris-Pratt Algorithm

source: 91.503 textbook Cormen et al.

Prefix function $\pi$ shows how pattern matches against itself

Equivalently, what is largest k < q such that $P_k \sqsupset P_q$?

$\pi(q) = \max\{k : k < q \text{ and } P_k \sqsupset P_q\}$

$\pi(q)$ is length of longest prefix of P that is a proper suffix of $P_q$

Example: [textbook item shown: 32.5]
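
A minimal Python sketch of computing the prefix function. To stay close to the 1-based notation on the slides, the returned list `pi` is indexed by prefix length q = 1..m, with `pi[0]` unused; the pattern string itself is indexed from 0, so P[q] on the slides corresponds to `pattern[q - 1]`. The name `compute_prefix_function` is chosen here for illustration.

```python
def compute_prefix_function(pattern):
    """Return pi where pi[q] is the length of the longest prefix of pattern
    that is a proper suffix of pattern[:q], for q = 1..m (pi[0] is unused)."""
    m = len(pattern)
    pi = [0] * (m + 1)
    k = 0  # length of the currently matched prefix
    for q in range(2, m + 1):
        # Fall back through shorter prefixes until the next character matches.
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = pi[k]
        if pattern[k] == pattern[q - 1]:
            k += 1
        pi[q] = k
    return pi


if __name__ == "__main__":
    print(compute_prefix_function("ababaca")[1:])
```

For the pattern ababaca, the values for q = 1..7 are 0, 0, 1, 2, 3, 0, 1.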

Page 32:

Knuth-Morris-Pratt Algorithm

source: 91.503 textbook Cormen et al.

Annotations:
$\Theta(m)$, which is in $O(n)$, using amortized analysis
\# characters matched
scan text left-to-right
next character does not match
next character matches
Is all of P matched?
Look for next match
$\Theta(m+n)$
using amortized analysis
$\Theta(n)$
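
The matching loop described by the annotations above can be sketched in Python as follows. This is an illustrative sketch assuming 0-based shifts; the prefix function is computed inline so the block is self-contained, and the name `kmp_matcher` is a choice made here. The variable `q` is the number of pattern characters currently matched.

```python
def kmp_matcher(text, pattern):
    """Return all 0-based shifts at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    # Prefix function: pi[q] for q = 1..m, pi[0] unused.
    pi = [0] * (m + 1)
    k = 0
    for j in range(2, m + 1):
        while k > 0 and pattern[k] != pattern[j - 1]:
            k = pi[k]
        if pattern[k] == pattern[j - 1]:
            k += 1
        pi[j] = k
    shifts = []
    q = 0                                    # number of characters matched
    for i in range(n):                       # scan text left-to-right
        while q > 0 and pattern[q] != text[i]:
            q = pi[q]                        # next character does not match
        if pattern[q] == text[i]:
            q += 1                           # next character matches
        if q == m:                           # is all of P matched?
            shifts.append(i - m + 1)
            q = pi[q]                        # look for next match
    return shifts


if __name__ == "__main__":
    print(kmp_matcher("abababacaba", "ababaca"))
```

The text index `i` never moves backward; only `q` falls back through the prefix function, which is the property the amortized analysis relies on.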

Page 33:

Knuth-Morris-Pratt Algorithm

Amortized Analysis

source: 91.503 textbook Cormen et al.

Potential Method: potential $\Phi = k$, where k = current state of algorithm

$\Theta(m)$, which is in $O(n)$

initial potential value

potential decreases

Potential is never negative since $\pi(k) \ge 0$ for all k

potential increases by $\le 1$ in each execution of for loop body

amortized cost of loop body is in $O(1)$

$\Theta(m)$ loop iterations

Page 34:

Knuth-Morris-Pratt Algorithm

source: 91.503 textbook Cormen et al.

Correctness...

Page 35:

Knuth-Morris-Pratt Algorithm

source: 91.503 textbook Cormen et al.

Correctness...

[textbook items shown: 32.5, 32.1, 32.6]

Page 36:

Knuth-Morris-Pratt Algorithm

source: 91.503 textbook Cormen et al.

Correctness...

[textbook items shown: 32.11, 32.5]

Page 37:

Knuth-Morris-Pratt Algorithm

source: 91.503 textbook Cormen et al.

Correctness...

[textbook items shown: 32.6, 32.5, 32.7]