UMass Lowell Computer Science 91.503
Analysis of Algorithms Prof. Karen Daniels
Fall, 2008
Tuesday, 12/2/08
String Matching Algorithms, Chapter 32
Chapter Dependencies
Ch 32 String Matching depends on: Automata
You’re responsible for material in Sections 32.1-32.4 of this chapter.
String Matching Algorithms
Motivation & Basics
String Matching Problem
source: 91.503 textbook, Cormen et al.
Motivations: text editing, pattern matching in DNA sequences
Text: array T[1..n]
Pattern: array P[1..m], with $m \le n$
Array element: character from finite alphabet $\Sigma$
Pattern P occurs with shift s in T if P[1..m] = T[s+1..s+m], where $0 \le s \le n - m$
[Textbook item shown on slide: 32.1]
String Matching Algorithms
Naive Algorithm: worst-case running time in $O((n-m+1)m)$
Rabin-Karp: worst-case running time in $O((n-m+1)m)$; better than this on average and in practice
Finite Automaton-Based: worst-case running time in $O(n + m|\Sigma|)$
Knuth-Morris-Pratt: worst-case running time in $O(n + m)$
Notation & Terminology
$\Sigma^*$ = set of all finite-length strings formed using characters from alphabet $\Sigma$
Empty string: $\varepsilon$
|x| = length of string x
w is a prefix of x: $w \sqsubset x$ (example: $ab \sqsubset abcca$)
w is a suffix of x: $w \sqsupset x$ (example: $cca \sqsupset abcca$)
The prefix and suffix relations are transitive.
Overlapping Suffix Lemma
source: 91.503 textbook, Cormen et al.
[Textbook items shown on slide: 32.1, 32.3]
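The prefix and suffix relations above correspond to Python's `str.startswith` and `str.endswith`. The sketch below checks the two slide examples and then spot-checks the Overlapping Suffix Lemma as the textbook states it: if x and y are both suffixes of z, the shorter one is a suffix of the longer one, and equal lengths force x = y. The helper names `is_prefix`, `is_suffix`, and `check_overlapping_suffix` are illustrative and do not come from the slides.

```python
def is_prefix(w: str, x: str) -> bool:
    """True when w is a prefix of x (the empty string is a prefix of every string)."""
    return x.startswith(w)


def is_suffix(w: str, x: str) -> bool:
    """True when w is a suffix of x (the empty string is a suffix of every string)."""
    return x.endswith(w)


def check_overlapping_suffix(x: str, y: str, z: str) -> bool:
    """Check the lemma's conclusion for one triple in which x and y are suffixes of z."""
    if not (is_suffix(x, z) and is_suffix(y, z)):
        raise ValueError("both x and y must be suffixes of z")
    if len(x) < len(y):
        return is_suffix(x, y)   # shorter suffix must be a suffix of the longer one
    if len(x) > len(y):
        return is_suffix(y, x)
    return x == y                # equal lengths force equality


print(is_prefix("ab", "abcca"))    # slide example for the prefix relation
print(is_suffix("cca", "abcca"))   # slide example for the suffix relation
print(check_overlapping_suffix("ca", "cca", "abcca"))
```

A spot check like this only exercises individual triples; it is not a proof of the lemma.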
String Matching Algorithms
Naive Algorithm
Naive String Matching
source: 91.503 textbook, Cormen et al.
Worst-case running time is in $\Theta((n-m+1)m)$
[Textbook item shown on slide: 32.4]
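A minimal sketch of the naive approach, assuming 0-based Python strings instead of the 1-based arrays used on the slides. It tries each of the n - m + 1 shifts and compares up to m characters per shift, which is where the $(n-m+1)m$ bound comes from. The function name and the sample text are illustrative choices.

```python
def naive_string_matcher(text: str, pattern: str) -> list[int]:
    """Return every 0-based shift s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):           # the n - m + 1 candidate shifts
        if text[s:s + m] == pattern:     # up to m character comparisons
            shifts.append(s)
    return shifts


print(naive_string_matcher("abcabaabcabac", "abaa"))
```

The worst case arises for inputs such as a text of all `a` characters and a pattern of all `a` characters, where every shift requires all m comparisons.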
String Matching Algorithms
Rabin-Karp
Rabin-Karp Algorithm
source: 91.503 textbook, Cormen et al.
Assume each character is a digit in radix-d notation (e.g. d = 10)
p = decimal value of pattern
$t_s$ = decimal value of substring T[s+1..s+m] for s = 0, 1, ..., n-m
Strategy:
compute p in O(m) time (which is in O(n))
compute all $t_i$ values in total of O(n) time
find all valid shifts s in O(n) time by comparing p with each $t_s$
Compute p in O(m) time using Horner’s rule:
$p = P[m] + d(P[m-1] + d(P[m-2] + \cdots + d(P[2] + d\,P[1]) \cdots))$
Compute $t_0$ similarly from T[1..m] in O(m) time
Compute remaining $t_i$’s in O(n-m) time:
$t_{s+1} = d(t_s - d^{m-1} T[s+1]) + T[s+m+1]$
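The two formulas above can be sketched as follows, assuming decimal digit strings (d = 10) and Python's arbitrary-precision integers, so no modulus is needed yet. `horner_value` and `window_values` are hypothetical helper names; indices are 0-based, so `text[s]` plays the role of T[s+1].

```python
def horner_value(digits: str, d: int = 10) -> int:
    """Value of a digit string in radix d, computed with Horner's rule."""
    value = 0
    for ch in digits:
        value = d * value + int(ch)
    return value


def window_values(text: str, m: int, d: int = 10) -> list[int]:
    """Return t_0, t_1, ..., t_{n-m}, each obtained from the previous one in O(1)."""
    n = len(text)
    if m == 0 or m > n:
        return []
    high = d ** (m - 1)                  # weight of the high-order digit
    t = horner_value(text[:m], d)        # t_0, computed in O(m)
    values = [t]
    for s in range(n - m):
        # remove high-order digit, shift left one position, append new low-order digit
        t = d * (t - high * int(text[s])) + int(text[s + m])
        values.append(t)
    return values


print(horner_value("31415"))
print(window_values("2359023141", 5))
```

Each update costs a constant number of arithmetic operations, which gives the O(n-m) total for the remaining windows, provided the numbers fit in a machine word. That proviso is what motivates working modulo q.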
Rabin-Karp Algorithm
source: 91.503 textbook, Cormen et al.
p and $t_s$ may be large, so work modulo q
[Textbook item shown on slide: 32.5]
Rabin-Karp Algorithm (continued)
source: 91.503 textbook, Cormen et al.
[Figure: worked example with p = 31415, including a spurious hit. The update $t_{s+1} = d(t_s - d^{m-1} T[s+1]) + T[s+m+1]$ is carried out using the precomputed value $d^{m-1} \bmod q$.]
Rabin-Karp Algorithm (continued)
source: 91.503 textbook, Cormen et al.
Rabin-Karp Algorithm (continued)
source: 91.503 textbook, Cormen et al.
Worst-case running time is in $\Theta((n-m+1)m)$
Annotations on the pseudocode:
d is radix, q is modulus
high-order digit position for m-digit window
Preprocessing
Try all possible shifts
Matching loop invariant: when line 10 is executed, $t_s = T[s+1..s+m] \bmod q$
rule out spurious hit
Running-time annotations: $\Theta(m)$, which is in $O(n)$; $\Theta(m)$; $\Theta(m)$; $\Theta((n-m+1)m)$
Rabin-Karp Algorithm (continued)
source: 91.503 textbook, Cormen et al.
Average-case running time is in $\Theta(n+m)$
Assume reducing mod q is like a random mapping from $\Sigma^*$ to $\mathbb{Z}_q$, where $\Sigma^*$ is the set of all finite-length strings formed from $\Sigma$
Estimate: (chance that $t_s \equiv p \pmod q$) = 1/q, so # spurious hits is in O(n/q)
Expected matching time = O(n) + O(m(v + n/q)), where v = # valid shifts
The O(n) term covers preprocessing + $t_s$ updates; the O(m(v + n/q)) term covers explicit matching comparisons
If v is in O(1) and q >= m, then m(v + n/q) is in O(m + n), so the expected matching time is in O(n + m), which is O(n) since $m \le n$
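A minimal sketch of the full Rabin-Karp procedure with a modulus, assuming decimal digit strings and the illustrative defaults d = 10 and q = 13. The structure follows the annotations above (preprocessing, try all shifts, rule out spurious hits), but the function name, defaults, and 0-based indexing are choices made for this sketch, not the textbook's pseudocode.

```python
def rabin_karp_matcher(text: str, pattern: str, d: int = 10, q: int = 13) -> list[int]:
    """Return all 0-based valid shifts of pattern in text (both are digit strings)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)                 # d^(m-1) mod q: high-order digit weight
    p = t = 0
    for i in range(m):                   # preprocessing, Horner's rule mod q
        p = (d * p + int(pattern[i])) % q
        t = (d * t + int(text[i])) % q
    shifts = []
    for s in range(n - m + 1):           # try all possible shifts
        # equal residues are only a candidate; the explicit comparison
        # rules out spurious hits
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:                    # roll the window: t_s -> t_{s+1}
            t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
    return shifts


print(rabin_karp_matcher("2359023141526739921", "31415"))
```

In this sample text the window 67399 has the same residue mod 13 as 31415, so it is a spurious hit that the explicit comparison rejects. Python's `%` returns a non-negative result for a positive modulus, so no extra correction is needed after the subtraction.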
String Matching Algorithms
Finite Automata
Finite Automata
source: 91.503 textbook, Cormen et al.
Strategy: Build automaton for pattern, then examine each text character once.
Worst-case running time is in $\Theta(n)$ + automaton creation time
[Textbook item shown on slide: 32.6]
Finite Automata
source: 91.503 textbook, Cormen et al.
String-Matching Automaton
source: 91.503 textbook, Cormen et al.
Pattern = P = ababaca
Automaton accepts strings ending in P
[Textbook item shown on slide: 32.7]
String-Matching Automaton
source: 91.503 textbook, Cormen et al.
Suffix function for P:
$\sigma(x)$ = length of longest prefix of P that is a suffix of x
$\sigma(x) = \max\{k : P_k \sqsupset x\}$
Automaton’s operational invariant at each step: keeps track of longest pattern prefix that is a suffix of what has been read so far
[Textbook items shown on slide: 32.3, 32.4]
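The suffix function can be evaluated directly from its definition by trying candidate prefix lengths from longest to shortest. This is a deliberately simple sketch, not an efficient implementation; the name `suffix_function` and the sample inputs are illustrative.

```python
def suffix_function(pattern: str, x: str) -> int:
    """sigma(x): length of the longest prefix of pattern that is a suffix of x."""
    for k in range(min(len(pattern), len(x)), -1, -1):
        if x.endswith(pattern[:k]):      # is P_k a suffix of x?
            return k
    return 0  # not reached: k = 0 (the empty prefix) always matches


P = "ababaca"
print(suffix_function(P, "abab"))
print(suffix_function(P, "ababab"))
print(suffix_function(P, "abac"))
```

The operational invariant says that after reading a text prefix x, the automaton's state equals `suffix_function(P, x)`.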
String-Matching Automaton
source: 91.503 textbook, Cormen et al.
Simulate behavior of string-matching automaton that finds occurrences of pattern P of length m in T[1..n]
Worst-case running time of matching is in $\Theta(n)$, assuming automaton has already been created...
String-Matching Automaton (continued)
source: 91.503 textbook, Cormen et al.
Correctness of matching procedure...
[Textbook items shown on slide: 32.4, 32.3]
32.3: $\sigma(xa) = \sigma(P_{\sigma(x)} a)$ (to be proved next…)
String-Matching Automaton (continued)
source: 91.503 textbook, Cormen et al.
Correctness of matching procedure...
[Textbook items shown on slide: 32.2, 32.8]
String-Matching Automaton (continued)
source: 91.503 textbook, Cormen et al.
Correctness of matching procedure...
[Textbook items shown on slide: 32.3, 32.9, 32.2, 32.1]
String-Matching Automaton (continued)
source: 91.503 textbook, Cormen et al.
Correctness of matching procedure...
[Textbook items shown on slide: 32.4, 32.3]
32.3: $\sigma(xa) = \sigma(P_{\sigma(x)} a)$
String-Matching Automaton (continued)
source: 91.503 textbook, Cormen et al.
Worst-case running time of automaton creation is in $O(m^3 |\Sigma|)$
Automaton creation time can be improved to $O(m |\Sigma|)$
Worst-case running time of entire string-matching strategy is then in $O(m |\Sigma|) + \Theta(n)$ (automaton creation time + pattern matching time)
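A sketch of the whole automaton-based strategy under the definitions above: the transition table is built straight from the suffix function, $\delta(q, a) = \sigma(P_q a)$, using the simple cubic-time construction (not the improved $O(m|\Sigma|)$ one), and the matcher then reads each text character once. Function names, the dictionary-based table, and the fallback to state 0 for characters outside the alphabet are assumptions of this sketch; the alphabet is assumed to contain every pattern character.

```python
def compute_transition_function(pattern: str, alphabet: str) -> list[dict[str, int]]:
    """delta[q][a] = length of the longest prefix of pattern that is a suffix of P_q + a."""
    m = len(pattern)
    delta = []
    for q in range(m + 1):               # states 0..m
        row = {}
        for a in alphabet:
            candidate = pattern[:q] + a
            k = min(m, q + 1)            # largest prefix length worth trying
            while not candidate.endswith(pattern[:k]):
                k -= 1                   # terminates at k = 0 (empty prefix)
            row[a] = k
        delta.append(row)
    return delta


def finite_automaton_matcher(text: str, delta: list[dict[str, int]], m: int) -> list[int]:
    """Scan text once; report each 0-based shift at which the accepting state m is reached."""
    q = 0
    shifts = []
    for i, ch in enumerate(text):
        q = delta[q].get(ch, 0)          # characters outside the alphabet reset to state 0
        if q == m:
            shifts.append(i - m + 1)
    return shifts


P = "ababaca"
delta = compute_transition_function(P, "abc")
print(finite_automaton_matcher("abababacaba", delta, len(P)))
```

The construction loops over m + 1 states and $|\Sigma|$ characters, tries up to m + 1 values of k, and each `endswith` test costs up to m comparisons, which matches the $O(m^3|\Sigma|)$ bound on the slide.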
String Matching Algorithms
Knuth-Morris-Pratt
Knuth-Morris-Pratt Overview
Achieve $\Theta(n+m)$ time by shortening automaton preprocessing time below $O(m|\Sigma|)$
Approach:
don’t precompute automaton’s transition function
calculate enough transition data “on-the-fly”
obtain data via “alphabet-independent” pattern preprocessing
pattern preprocessing compares pattern against shifts of itself
Knuth-Morris-Pratt Algorithm
source: 91.503 textbook, Cormen et al.
determine how pattern matches against itself
[Textbook item shown on slide: 32.10]
Knuth-Morris-Pratt Algorithm
source: 91.503 textbook, Cormen et al.
Prefix function $\pi$ shows how pattern matches against itself
$\pi(q)$ is length of longest prefix of P that is a proper suffix of $P_q$
Equivalently, what is largest k < q such that $P_k \sqsupset P_q$?
$\pi(q) = \max\{k : k < q \text{ and } P_k \sqsupset P_q\}$
Example: [shown on slide; textbook item 32.5]
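A sketch of computing the prefix function for the pattern used earlier in these slides, ababaca. The list is 0-based, so `pi[q - 1]` holds $\pi(q)$; the function name and this indexing convention are choices made for the sketch.

```python
def compute_prefix_function(pattern: str) -> list[int]:
    """pi[q - 1] = length of the longest prefix of pattern that is a proper suffix of P_q."""
    m = len(pattern)
    pi = [0] * m
    k = 0                                # number of characters currently matched
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]                # fall back to the next shorter candidate prefix
        if pattern[k] == pattern[q]:
            k += 1                       # extend the matched prefix by one character
        pi[q] = k
    return pi


print(compute_prefix_function("ababaca"))
```

For example, $\pi(5) = 3$ because aba is the longest prefix of ababaca that is also a proper suffix of ababa.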
Knuth-Morris-Pratt Algorithm
source: 91.503 textbook, Cormen et al.
Annotations on the pseudocode:
Prefix-function computation: $\Theta(m)$, which is in $O(n)$, using amortized analysis
# characters matched
scan text left-to-right
next character does not match
next character matches
Is all of P matched?
Look for next match
Matching: $\Theta(n)$, using amortized analysis
Total: $\Theta(m+n)$
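The annotated steps can be sketched as below, with comments echoing the slide annotations. This is a 0-based Python rendering under my own naming, not the textbook's pseudocode; the prefix-function helper is included so the block runs on its own.

```python
def compute_prefix_function(pattern: str) -> list[int]:
    """pi[q - 1] = length of the longest prefix of pattern that is a proper suffix of P_q."""
    pi = [0] * len(pattern)
    k = 0
    for q in range(1, len(pattern)):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


def kmp_matcher(text: str, pattern: str) -> list[int]:
    """Return all 0-based shifts at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    pi = compute_prefix_function(pattern)
    shifts = []
    q = 0                                        # number of characters matched
    for i in range(n):                           # scan text left-to-right
        while q > 0 and pattern[q] != text[i]:
            q = pi[q - 1]                        # next character does not match
        if pattern[q] == text[i]:
            q += 1                               # next character matches
        if q == m:                               # is all of P matched?
            shifts.append(i - m + 1)
            q = pi[q - 1]                        # look for next match
    return shifts


print(kmp_matcher("abababacaba", "ababaca"))
```

No transition table over the alphabet is built; the only preprocessing is the length-m prefix-function array, which is what removes the $|\Sigma|$ factor.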
Knuth-Morris-Pratt Algorithm
Amortized Analysis
Potential Method: potential = k, where k = current state of algorithm
source: 91.503 textbook, Cormen et al.
Running time of prefix-function computation: $\Theta(m)$, which is in $O(n)$
initial potential value (k = 0)
potential decreases (each fallback k = $\pi(k)$ lowers k, since $\pi(k) < k$)
Potential is never negative since $\pi(k) \ge 0$ for all k
potential increases by <= 1 in each execution of for loop body
amortized cost of loop body is in $O(1)$
$\Theta(m)$ loop iterations
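The potential argument can be observed empirically by counting how often k goes up and down during the prefix-function computation. In this instrumented sketch (the counters and the function name are additions for illustration), the number of fallback steps can never exceed the number of increments, and increments happen at most once per loop iteration, so the total work is linear in m.

```python
def prefix_function_with_counts(pattern: str) -> tuple[list[int], int, int]:
    """Return (pi, increases, decreases), counting the changes to the potential k."""
    m = len(pattern)
    pi = [0] * m
    k = 0                                # initial potential value
    increases = decreases = 0
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]                # potential decreases by at least 1
            decreases += 1
        if pattern[k] == pattern[q]:
            k += 1                       # potential increases by at most 1 per iteration
            increases += 1
        pi[q] = k
    return pi, increases, decreases


pi, ups, downs = prefix_function_with_counts("ababaca")
print(pi, ups, downs)
```

Because k starts at 0 and never goes negative, every decrease must be paid for by an earlier increase, which is the accounting behind the O(1) amortized cost of the loop body.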
Knuth-Morris-Pratt Algorithm
source: 91.503 textbook, Cormen et al.
Correctness...
Knuth-Morris-Pratt Algorithm
source: 91.503 textbook, Cormen et al.
Correctness...
[Textbook items shown on slide: 32.5, 32.1, 32.6]
Knuth-Morris-Pratt Algorithm
source: 91.503 textbook, Cormen et al.
Correctness...
[Textbook items shown on slide: 32.11, 32.5]
Knuth-Morris-Pratt Algorithm
source: 91.503 textbook, Cormen et al.
Correctness...
[Textbook items shown on slide: 32.6, 32.5, 32.7]