CSC 212 – Data Structures
Lecture 36: Pattern Matching
Suffixes and Prefixes
Example string: “I am the Lizard King!”
Prefixes: “I”, “I ”, “I a”, “I am”, …, “I am the Lizard Kin”, “I am the Lizard King”, “I am the Lizard King!”
Suffixes: “!”, “g!”, “ng!”, “ing!”, …, “am the Lizard King!”, “ am the Lizard King!”, “I am the Lizard King!”
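
The prefix and suffix lists above can be generated mechanically. The following is a minimal Python sketch; the function names `prefixes` and `suffixes` are illustrative and do not come from the lecture.

```python
def prefixes(s):
    """Non-empty prefixes of s, shortest first."""
    return [s[:k] for k in range(1, len(s) + 1)]


def suffixes(s):
    """Non-empty suffixes of s, shortest first."""
    return [s[-k:] for k in range(1, len(s) + 1)]


text = "I am the Lizard King!"
print(prefixes(text)[:4])   # ['I', 'I ', 'I a', 'I am']
print(suffixes(text)[:4])   # ['!', 'g!', 'ng!', 'ing!']
```

The last entry of each list is the whole string. A proper prefix or suffix excludes that last entry.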
KMP Algorithm
Asymptotically optimal algorithm
    Means we cannot do better in big-Oh terms
Compares from left to right
    So like BruteForce, not Boyer-Moore
    But shifts pattern intelligently
Relies on a Key Insight™
    Preprocess pattern to avoid redundant comparisons
    Always go forward; never, ever look back
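
For contrast, here is a rough Python sketch of the left-to-right BruteForce matcher that KMP improves on. The name `brute_force_match` and the comparison counter are assumptions added for illustration. After a mismatch it slides the pattern by one and starts over, re-reading text characters it has already seen.

```python
def brute_force_match(text, pattern):
    """Return (index of first match or -1, number of character comparisons)."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for start in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1
            if text[start + j] != pattern[j]:
                break          # mismatch: give up on this alignment
            j += 1
        if j == m:             # every character of the pattern matched
            return start, comparisons
    return -1, comparisons


print(brute_force_match("aaab", "ab"))   # (2, 6)
```

In the call above, the text has only 4 characters but 6 comparisons are made, because the matcher looks back at characters it has already compared.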
The KMP Algorithm
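[Figure: text ". . a b a a b x . . . . ." with pattern "a b a a b a" aligned beneath it; the mismatch occurs at pattern rank j against text character x. The pattern is drawn a second time, shifted right. Labels: "Do not repeat these comparisons", "Need to resume comparing here", "Shifting P here ensures these two entries match".]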
KMP Failure Function
Assume P[j] ≠ T[k]
    Need the rank in P to compare next against T[k]
    E.g., how should we shift P after a miss?
Uses the failure function, F(j-1)
    One value defined for each rank in P
    Specifies the rank in P at which comparisons must restart
Computing Failure Function
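For rank j, find the longest proper prefix of P[0…j] that is also a suffix of P[0…j]
    F(j) is the length of that prefix
For speed, store the failure function in an array
    Unlike Boyer-Moore, works with infinite alphabets
Takes at most 2m loop iterations, so O(2m) = O(m) time
The algorithm that computes the failure function is similar to KMP matching itself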
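
To make the definition concrete, here is a deliberately slow Python sketch that computes the failure function directly from the definition, trying every proper prefix length. The name `failure_by_definition` is hypothetical. This is not the O(m) algorithm from the lecture; it only serves as a reference to check results against.

```python
def failure_by_definition(pattern):
    """F[j] = length of the longest proper prefix of pattern[0..j]
    that is also a suffix of pattern[0..j]. Brute force, not O(m)."""
    failure = []
    for j in range(len(pattern)):
        piece = pattern[: j + 1]
        best = 0
        for k in range(1, len(piece)):      # proper: k < len(piece)
            if piece[:k] == piece[-k:]:
                best = k                    # k only grows, so the last hit is the longest
        failure.append(best)
    return failure


print(failure_by_definition("abaaba"))   # [0, 0, 1, 1, 2, 3]
```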
Computing Failure Function
Algorithm KMPFailureFunction(String P)
    F[0] ← 0
    i ← 1
    j ← 0
    while i < P.length()
        if P[i] = P[j]          // So, P[0…j] = P[i - j…i]
            F[i] ← j + 1        // Record the length of this prefix/suffix
            i ← i + 1           // Advance a character and see if still matches
            j ← j + 1
        else if j > 0           // No match, need to restart our computation
            j ← F[j - 1]        // Skip over longest prefix that is also a suffix
        else
            F[i] ← 0            // No proper prefix of P[0…i] is a suffix of P[0…i]
            i ← i + 1           // Move to the next character
    return F
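
The pseudocode above translates almost line for line into Python. This is a sketch, not the course's reference implementation; the name `kmp_failure_function` and the choice to return an empty list for an empty pattern are assumptions.

```python
def kmp_failure_function(pattern):
    """Return the KMP failure function of pattern as a list of ints."""
    m = len(pattern)
    failure = [0] * m            # failure[0] is always 0
    i, j = 1, 0
    while i < m:
        if pattern[i] == pattern[j]:
            failure[i] = j + 1   # prefix of length j + 1 is also a suffix
            i += 1
            j += 1
        elif j > 0:
            j = failure[j - 1]   # fall back to a shorter candidate prefix
        else:
            failure[i] = 0       # nothing matches; move on
            i += 1
    return failure


print(kmp_failure_function("abaaba"))   # [0, 0, 1, 1, 2, 3]
```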
KMP Failure Function
j      0  1  2  3  4  5
P[j]   a  b  a  a  b  a
F(j)   0  0  1  1  2  3
The KMP Algorithm
Algorithm KMPMatch(String T, String P)
    F ← KMPFailureFunction(P)
    i ← 0
    j ← 0
    while i < T.length()
        if P[j] = T[i]          // So, P[0…j] = T[i - j…i]
            if j = P.length() - 1
                return i - j    // Matched all of P; report where the match starts
            i ← i + 1           // Advance and see if still a match
            j ← j + 1
        else if j > 0           // No match, but a prefix of P[0…j-1] matches
            j ← F[j - 1]        // So skip past longest prefix that is a suffix
        else
            i ← i + 1           // Nothing to reuse, move to the next character
    return -1                   // Reached the end of T without matching P
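
A Python sketch of the matcher, following the pseudocode above. Two details are assumptions rather than part of the lecture: the function returns -1 when there is no match, and an empty pattern is treated as matching at index 0. The failure function is repeated here so the block runs on its own.

```python
def kmp_failure_function(pattern):
    """Failure function: failure[j] = longest proper prefix/suffix length of pattern[0..j]."""
    failure = [0] * len(pattern)
    i, j = 1, 0
    while i < len(pattern):
        if pattern[i] == pattern[j]:
            failure[i] = j + 1
            i += 1
            j += 1
        elif j > 0:
            j = failure[j - 1]
        else:
            i += 1               # failure[i] is already 0
    return failure


def kmp_match(text, pattern):
    """Return the index where the first match of pattern starts in text, or -1."""
    if not pattern:
        return 0                 # assumption: the empty pattern matches at index 0
    failure = kmp_failure_function(pattern)
    i = j = 0
    while i < len(text):
        if pattern[j] == text[i]:
            if j == len(pattern) - 1:
                return i - j     # whole pattern matched
            i += 1
            j += 1
        elif j > 0:
            j = failure[j - 1]   # reuse the prefix that already matches; i stays put
        else:
            i += 1
    return -1


print(kmp_match("I am the Lizard King!", "Lizard"))   # 9
```

Note that `i` never decreases: the text is read strictly left to right, which is the "never look back" property.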
Example
[Figure: KMPMatch trace for P = a b a c a b on T = a b a c a a b a c c a b a c a b a a b b. P is drawn at five successive alignments beneath T, and the character comparisons are numbered 1 through 19 in the order they are made.]
j      0  1  2  3  4  5
P[j]   a  b  a  c  a  b
F(j)   0  0  1  0  1  2
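
The numbered comparisons in the example can be reproduced with an instrumented matcher. This Python sketch assumes the text in the figure is "abacaabaccabacabaabb" and that the pattern is non-empty; the name `kmp_trace` and the log format are illustrative only.

```python
def kmp_trace(text, pattern):
    """Run KMP on a non-empty pattern.
    Return (match index or -1, log), where log holds one (i, j, matched)
    tuple per character comparison, in the order the comparisons happen."""
    m = len(pattern)
    failure = [0] * m
    i, j = 1, 0
    while i < m:                       # build the failure function
        if pattern[i] == pattern[j]:
            failure[i] = j + 1
            i += 1
            j += 1
        elif j > 0:
            j = failure[j - 1]
        else:
            i += 1

    log = []
    i = j = 0
    while i < len(text):
        matched = pattern[j] == text[i]
        log.append((i, j, matched))
        if matched:
            if j == m - 1:
                return i - j, log
            i += 1
            j += 1
        elif j > 0:
            j = failure[j - 1]
        else:
            i += 1
    return -1, log


index, log = kmp_trace("abacaabaccabacabaabb", "abacab")
print(index, len(log))                 # 10 19
# Comparison numbers (1-based) that were mismatches:
print([k + 1 for k, (_, _, ok) in enumerate(log) if not ok])   # [6, 7, 12, 13]
```

Under that assumption about the text, the match starts at index 10 after 19 comparisons, which agrees with the numbering in the figure.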
The KMP Algorithm
In each pass of the KMPMatch loop, one of:
    P[j] = T[i]: i increases by 1
    P[j] ≠ T[i] and j > 0: P is shifted right by at least 1
    P[j] ≠ T[i] and j = 0: i increases by 1
Neither i nor the shift of P (i - j) ever decreases, and neither can exceed n
So at most 2n iterations of the loop
    KMPMatch takes O(2n) = O(n) time
    KMPFailureFunction needs O(m) time
    Thus, the algorithm runs in O(m + n) time
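
The iteration bounds can be checked empirically. The Python sketch below counts loop iterations in both phases and tests every text and pattern over the alphabet {a, b} up to a small length. The names `kmp_loop_counts` and `bounds_hold` and the size limits are assumptions chosen for illustration; passing the check is evidence for the bound, not a proof.

```python
from itertools import product


def kmp_loop_counts(text, pattern):
    """Return (failure-function loop iterations, match loop iterations)
    for a non-empty pattern."""
    m = len(pattern)
    failure = [0] * m
    f_iters = 0
    i, j = 1, 0
    while i < m:
        f_iters += 1
        if pattern[i] == pattern[j]:
            failure[i] = j + 1
            i += 1
            j += 1
        elif j > 0:
            j = failure[j - 1]
        else:
            i += 1

    m_iters = 0
    i = j = 0
    while i < len(text):
        m_iters += 1
        if pattern[j] == text[i]:
            if j == m - 1:
                break                  # match found
            i += 1
            j += 1
        elif j > 0:
            j = failure[j - 1]
        else:
            i += 1
    return f_iters, m_iters


def bounds_hold(max_n=8, max_m=4):
    """Check f_iters <= 2m and m_iters <= 2n for all strings over {a, b}."""
    for n in range(1, max_n + 1):
        for m in range(1, max_m + 1):
            for t in product("ab", repeat=n):
                for p in product("ab", repeat=m):
                    f_iters, m_iters = kmp_loop_counts("".join(t), "".join(p))
                    if f_iters > 2 * m or m_iters > 2 * n:
                        return False
    return True


print(bounds_hold())   # True
```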
Your Turn
Get back into groups and do the activity
Before Next Lecture…
Finish up assignments
Start thinking about questions for the Final