JASS04 - Sequential Pattern Matching, Tobias Reichl

Joint Advanced Student School 2004

Complexity Analysis of String Algorithms

Sequential Pattern Matching: Analysis of Knuth-Morris-Pratt type algorithms

using the Subadditive Ergodic Theorem

19 April 2023

Overview

1. Pattern Matching
   • Sequential Algorithms
   • Knuth-Morris-Pratt Algorithm

2. Probabilistic tools
   • Subadditive Ergodic Theorem
   • Martingales and Azuma's Inequality

3. Analysis of KMP Algorithms
   • Properties of KMP
   • Establishing subadditivity
   • Analysis

Pattern Matching

• Text $t_1^n$, pattern $p_1^m$

• Comparison: $M(l,k) = 1$ if $t_l$ is compared to $p_k$, and $M(l,k) = 0$ otherwise

• Alignment Position: a text position AP with $M(AP + (k-1), k) = 1$ for some k.

Example: pattern p = abcde, text t = xxxxxabxxxabcxxxabcde

[Figure: the pattern aligned under the text, marking a pattern-text comparison M(l,k) = 1 and an alignment position AP.]

Sequential Algorithms - Definition

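i. Semi-sequential: APs are non-decreasing.

ii. Strongly semi-sequential: (i), and the comparisons $M(l_i, k_i)$ define non-decreasing text positions $l_i$.

iii. Sequential: (i), and $M(l,k) = 1 \Rightarrow t_{l-(k-1)}^{l-1} = p_1^{k-1}$. Text is compared only if following a prefix of the pattern.

iv. Strongly sequential: (i), (ii) and (iii)

Example: pattern abcde, text xxxxxabxxxabcxxxabcde

[Figure: the pattern aligned under the text; a text character is compared only where a prefix of the pattern has already been matched.]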

Example: Naive / brute force algorithm

• Every text position is alignment position.

• Text is scanned until...
  – pattern is found: then done.
  – mismatch occurs: then shift by one and retry.

• Sequential algorithm.

[Figure: pattern abcde aligned under text xxxxxabxxxabcxxxabcde; after each mismatch the pattern is shifted by +1.]
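The brute-force procedure above can be sketched in a few lines. This is only an illustration: the function name `naive_search` and the 0-based indexing are assumptions of this sketch (the slides use 1-based positions), and the counter simply tallies every pair (l, k) with M(l,k) = 1.

```python
def naive_search(text, pattern):
    """Brute-force search (hypothetical helper, 0-based indices).

    Returns (index of first match or -1, number of comparisons,
    list of alignment positions tried).
    """
    n, m = len(text), len(pattern)
    comparisons = 0
    alignment_positions = []
    for ap in range(n - m + 1):           # every candidate position is an AP
        alignment_positions.append(ap)
        k = 0
        while k < m:
            comparisons += 1              # one text-pattern comparison
            if text[ap + k] != pattern[k]:
                break                     # mismatch: shift by one and retry
            k += 1
        if k == m:                        # pattern found: done
            return ap, comparisons, alignment_positions
    return -1, comparisons, alignment_positions


pos, c, aps = naive_search("xxxxxabxxxabcxxxabcde", "abcde")
print(pos, c, len(aps))  # 16 26 17
```

On the slide's example the pattern is found at 0-based index 16 after 26 comparisons, with 17 alignment positions tried.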

Knuth-Morris-Pratt type algorithms (1)

• Idea (Morris-Pratt): Disregard APs already known not to be followed by a prefix of p.

• Knowledge:
  – Already processed pattern
  – Pre-processing of p.

• Strongly sequential algorithm.

[Figure: the pattern aligned under text xxxxxabxxxabcxxxabcde; after a mismatch the pattern is shifted by +S.]

Knuth-Morris-Pratt type algorithms (2)

• Morris-Pratt: $S = \min\{k;\ \min\{s > 0 : p_{1+s}^{k-1} = p_1^{k-1-s}\}\}$

• Knuth-Morris-Pratt: $S = \min\{k;\ \min\{s : p_{1+s}^{k-1} = p_1^{k-1-s} \text{ and } p_k \neq p_{k-s}\}\}$

(KMP also skips mismatching letters)

[Figure: the Morris-Pratt shift and the Knuth-Morris-Pratt shift, each illustrated on text xxxxxabxxxabcxxxabcde.]
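The two shift rules can be evaluated directly from their definitions. The sketch below is a brute-force reading of the formulas, not the usual linear-time preprocessing. It assumes that k is the 1-based pattern position at which the mismatch occurred (so p_1..p_{k-1} matched), and the function names `mp_shift` and `kmp_shift` are hypothetical.

```python
def mp_shift(p, k):
    """Morris-Pratt shift after a mismatch at 1-based pattern position k."""
    for s in range(1, k):
        # p_{1+s}^{k-1} == p_1^{k-1-s}, written with 0-based slices
        if p[s:k - 1] == p[:k - 1 - s]:
            return s
    return k


def kmp_shift(p, k):
    """Knuth-Morris-Pratt shift: additionally require p_k != p_{k-s}."""
    for s in range(1, k):
        if p[s:k - 1] == p[:k - 1 - s] and p[k - 1] != p[k - 1 - s]:
            return s
    return k


# Pattern "abcab", mismatch at p_5 = 'b' after matching "abca":
print(mp_shift("abcab", 5), kmp_shift("abcab", 5))  # 3 4
```

For the pattern "abcab" the Morris-Pratt shift of 3 would re-align p_2 = 'b' with the text character that just mismatched against p_5 = 'b', which is bound to fail again. The Knuth-Morris-Pratt condition rules this out and shifts by 4.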

Pattern Matching - Complexity

• Overall complexity: $c_{r,s}(t, p) = \sum_{l \in [r,s],\ k \in [1,m]} M(l, k)$, with $c_n := c_{1,n}$

• Pattern or text is a realization of a random sequence: $C_n$

• Question: complexity of KMP?
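To make the quantity being analysed concrete, the following sketch counts text-pattern comparisons for a Morris-Pratt style search. It is an illustration under stated assumptions: it uses the standard border (failure) table, stops at the first match, uses 0-based indices, and the names `border_table` and `mp_count_comparisons` are hypothetical.

```python
def border_table(p):
    """b[k] = length of the longest proper border of p[:k]; b[0] = -1."""
    b = [0] * (len(p) + 1)
    b[0] = -1
    j = -1
    for k in range(len(p)):
        while j >= 0 and p[k] != p[j]:
            j = b[j]
        j += 1
        b[k + 1] = j
    return b


def mp_count_comparisons(text, pattern):
    """Return (index of first match or -1, number of comparisons)."""
    b = border_table(pattern)
    m = len(pattern)
    comparisons = 0
    j = 0                          # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while True:
            comparisons += 1       # one comparison of text[i] with pattern[j]
            if ch == pattern[j]:
                j += 1
                break
            if j == 0:
                break              # mismatch at the first pattern character
            j = b[j]               # shift the pattern, keep the text position
        if j == m:
            return i - m + 1, comparisons
    return -1, comparisons


text = "xxxxxabxxxabcxxxabcde"
pos, c = mp_count_comparisons(text, "abcde")
print(pos, c, c <= 2 * len(text))  # 16 23 True
```

On the running example this uses 23 comparisons, below the well-known 2n worst-case bound for this kind of search.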

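Subadditivity – Deterministic Sequence

Fekete (1923)

• Subadditivity: $x_{m+n} \le x_m + x_n \ \Rightarrow\ \lim_{n \to \infty} \frac{x_n}{n} = \inf_{m \ge 1} \frac{x_m}{m}$

• Superadditivity: $x_{m+n} \ge x_m + x_n \ \Rightarrow\ \lim_{n \to \infty} \frac{x_n}{n} = \sup_{m \ge 1} \frac{x_m}{m}$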
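A small numeric illustration of Fekete's statement, using an example sequence chosen for this sketch (it is not taken from the slides): x_n = n + sqrt(n) is subadditive because sqrt(m+n) <= sqrt(m) + sqrt(n), and x_n / n decreases towards its infimum 1.

```python
import math


def x(n):
    """Example subadditive sequence (an assumption of this sketch)."""
    return n + math.sqrt(n)


# Check x_{a+b} <= x_a + x_b on a small range (tolerance for rounding).
assert all(x(a + b) <= x(a) + x(b) + 1e-12
           for a in range(1, 200) for b in range(1, 200))

# The ratios x_n / n approach inf_m x_m / m = 1.
for n in (1, 10, 100, 10_000, 1_000_000):
    print(n, x(n) / n)
```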

Example: Longest Common Subsequence

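• Definition: $L_{1,n} = \max\{K : X_{i_k} = Y_{j_k} \text{ for } 1 \le k \le K, \text{ where } 1 \le i_1 < i_2 < \dots < i_K \le n \text{ and } 1 \le j_1 < j_2 < \dots < j_K \le n\}$

• Superadditive: $L_{1,n} \ge L_{1,m} + L_{m,n}$

• Hence: $\lim_{n \to \infty} \frac{a_n}{n} = \sup_{m \ge 1} \frac{E[L_m]}{m} \overset{?}{=} 0.8284$, where $a_n = E[L_{1,n}]$

(Conjectured by Steele in 1982)

Example:

Strings abcdeabcdfabcab and ababcafbcdabcde: LCS "abcabcdabc" (10)

Split into abcdeabc / ababcafb and dfabcab / cdabcde: LCS "abcab" (5), "dabc" (4)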
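The superadditivity in this example can be checked with the textbook dynamic program for the LCS length. This is a sketch: `lcs_length` is a hypothetical helper, and the split point 8 is read off the slide's example strings.

```python
def lcs_length(x, y):
    """Length of a longest common subsequence (row-by-row dynamic program)."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


X, Y = "abcdeabcdfabcab", "ababcafbcdabcde"
split = 8
whole = lcs_length(X, Y)
parts = lcs_length(X[:split], Y[:split]) + lcs_length(X[split:], Y[split:])
print(whole, parts)
assert whole >= parts   # superadditivity: L_{1,n} >= L_{1,m} + L_{m,n}
```

Concatenating common subsequences of the two halves always gives a common subsequence of the whole strings, which is why the inequality cannot fail.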

Subadditivity – "Almost subadditive"

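De Bruijn and Erdős (1952)

• $c_n$: positive and non-decreasing sequence with $\sum_{k=1}^{\infty} \frac{c_k}{k^2} < \infty$

• "Almost subadditive": $x_{m+n} \le x_m + x_n + c_{m+n}$

• Then $\lim_{n \to \infty} \frac{x_n}{n}$ still exists, although it need not equal $\inf_{m \ge 1} \frac{x_m}{m}$.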

Subadditive Ergodic Theorem

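Kingman (1976), Liggett (1985)

i. $X_{0,n} \le X_{0,m} + X_{m,n}$

ii. $\{X_{nk,(n+1)k},\ n \ge 1\}$ is a stationary sequence for each $k$

iii. The distribution of $\{X_{m,m+k},\ k \ge 1\}$ does not depend on m

iv. $E[X_{0,1}] < \infty$ and $E[X_{0,n}] \ge c_0 n$, where $c_0 > -\infty$

Then:

$\lim_{n \to \infty} \frac{E[X_{0,n}]}{n} = \inf_{m \ge 1} \frac{E[X_{0,m}]}{m} =: \gamma$

$\lim_{n \to \infty} \frac{X_{0,n}}{n} = \gamma$ (a.s.)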

Almost Subadditive Ergodic Theorem

Derriennic (1983)

• Subadditivity can be relaxed to $X_{0,n} \le X_{0,m} + X_{m,n} + A_n$ with $\lim_{n \to \infty} \frac{E[A_n]}{n} = 0$

• Then, too: $\lim_{n \to \infty} \frac{X_{0,n}}{n} = \gamma$ (a.s.)

Martingales

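• A sequence $Y_n = f(X_1, \dots, X_n)$, $n \ge 0$, is a martingale with respect to the filtration $F_n = \sigma(X_0, \dots, X_n)$ if for all $n \ge 0$: $E|Y_n| < \infty$ and $E[Y_{n+1} \mid X_0, X_1, \dots, X_n] = E[Y_{n+1} \mid F_n] = Y_n$

• $E[Y_{n+1} \mid F_n]$ defines a random variable depending on the knowledge contained in $X_1, \dots, X_n$.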

Martingale Differences

• The martingale difference is defined as $D_n = Y_n - Y_{n-1}$, so that: $Y_n = Y_0 + \sum_{i=1}^{n} D_i$

• Observe: $E[D_{n+1} \mid F_n] = E[Y_{n+1} \mid F_n] - E[Y_n \mid F_n] = Y_n - Y_n = 0$

Azuma's Inequality (1)

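• Let $Y_n = f_n(X_1, \dots, X_n)$ be a martingale, so that $E[Y_n \mid F_n] = Y_n$ and $E[Y_n \mid F_0] = E[Y_n]$

• Define the martingale difference as $D_i = E[Y_n \mid F_i] - E[Y_n \mid F_{i-1}]$

(The mean of the same element but depending on different knowledge)

• Observe: $\sum_{i=1}^{n} D_i = E[Y_n \mid F_n] - E[Y_n \mid F_0] = Y_n - E[Y_n]$

(Deviation from the mean)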

Hoeffding's Inequality

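• Let $\{Y_n\}_{n \ge 0}$ be a martingale

• Let there exist constants $c_n$ such that $|Y_n - Y_{n-1}| = |D_n| \le c_n$

• Then: $\Pr\{|Y_n - Y_0| \ge x\} = \Pr\left\{\left|\sum_{i=1}^{n} D_i\right| \ge x\right\} \le 2 \exp\left(-\frac{x^2}{2 \sum_{i=1}^{n} c_i^2}\right)$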

Azuma's Inequality (2)

• Summary:
  – If $D_i$ is bounded, we know how to assess the deviation from the mean.
  – So now we need a bound on $D_i$.

• Trick: Let $\hat{X}_i$ be an independent copy of $X_i$.

• Then: $E[f_n(X_1, \dots, X_i, \dots, X_n) \mid F_{i-1}] = E[f_n(X_1, \dots, \hat{X}_i, \dots, X_n) \mid F_i]$

Azuma's Inequality (3)

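• Hence:

$D_i = E[f_n(X_1, \dots, X_i, \dots, X_n) \mid F_i] - E[f_n(X_1, \dots, X_i, \dots, X_n) \mid F_{i-1}]$

$\phantom{D_i} = E[f_n(X_1, \dots, X_i, \dots, X_n) \mid F_i] - E[f_n(X_1, \dots, \hat{X}_i, \dots, X_n) \mid F_i]$

• And we can postulate: $|D_i| \le c_i$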

Azuma's Inequality (4)

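• Let $Y_n = f_n(X_1, \dots, X_n)$ be a martingale

• If there exist constants $c_i$ such that $|f_n(X_1, \dots, X_i, \dots, X_n) - f_n(X_1, \dots, \hat{X}_i, \dots, X_n)| \le c_i$, where $\hat{X}_i$ is an independent copy of $X_i$

• Then: $\Pr\{|Y_n - E[Y_n]| \ge x\} = \Pr\{|f_n(X_1, \dots, X_i, \dots, X_n) - E[f_n(X_1, \dots, \hat{X}_i, \dots, X_n)]| \ge x\} \le 2 \exp\left(-\frac{x^2}{2 \sum_{i=1}^{n} c_i^2}\right)$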
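A quick numeric sanity check of this bound, as a sketch under assumptions chosen for illustration: f_n is taken to be the sum of n independent fair ±1 variables, so replacing one X_i by an independent copy changes the sum by at most c_i = 2. The helper names `azuma_bound` and `empirical_tail` are hypothetical.

```python
import math
import random


def azuma_bound(x, c):
    """Right-hand side 2*exp(-x^2 / (2*sum c_i^2)), capped at 1."""
    return min(1.0, 2.0 * math.exp(-x * x / (2.0 * sum(ci * ci for ci in c))))


def empirical_tail(n, x, trials, seed=0):
    """Empirical frequency of |S_n - E[S_n]| >= x for a sum of n fair +-1 steps."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        s = sum(rng.choice((-1, 1)) for _ in range(n))   # E[S_n] = 0
        if abs(s) >= x:
            hits += 1
    return hits / trials


n, x = 100, 30
bound = azuma_bound(x, [2.0] * n)       # c_i = 2 for every coordinate
freq = empirical_tail(n, x, trials=2000)
print(freq, bound)
assert freq <= bound
```

The bound is far from tight in this toy case, which is expected: it holds for any function with bounded differences, not just for sums.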

KMP: Unavoidable alignment positions

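• A position $i$ in the text is called an unavoidable AP if for any r, l with $r \le i$ and $l \ge i + m$ it is an AP when the algorithm is run on $t_r^l$.

• KMP-like algorithms have the same set of unavoidable alignment positions $U = \bigcup_{l=1}^{n} \{U_l\}$, where $U_l = \min\{\min_{1 \le k \le l}\{t_k^l \preceq p\},\ l + 1\}$ and $t_k^l \preceq p$ means that $t_k^l$ is a prefix of p.

• Example: pattern abcde, text xxxxxabxxxabcxxxabcde

[Figure: the pattern aligned under the text, with a text position l and the corresponding unavoidable position $U_l$ marked.]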
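The set U can be computed by brute force directly from the definition of U_l. The sketch below assumes that the inner minimum ranges over the start positions k for which t_k^l is a prefix of p; positions are 1-based as on the slide, and the function name `unavoidable_positions` is hypothetical.

```python
def unavoidable_positions(text, pattern):
    """Return the sorted set U = {U_l : 1 <= l <= n}, with 1-based positions."""
    positions = set()
    for l in range(1, len(text) + 1):
        u = l + 1                          # default: no t_k^l is a prefix of p
        for k in range(1, l + 1):          # smallest k wins
            if pattern.startswith(text[k - 1:l]):
                u = k
                break
        positions.add(u)
    return sorted(positions)


print(unavoidable_positions("xxxxxabxxxabcxxxabcde", "abcde"))
# [2, 3, 4, 5, 6, 9, 10, 11, 15, 16, 17]
```

In the example, positions 7 and 8 are skipped because the text there continues a partial match "ab" that started at position 6, and similarly for 12 to 14 and 18 onwards.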

Pattern Matching: l-convergence

• An algorithm is l-convergent if there exists an increasing sequence of unavoidable alignment positions $\{U_i\}_{i=1}^{n}$ satisfying $U_{i+1} - U_i \le l$

• l-convergence indicates the maximum size of the "jumps" for an algorithm.

KMP: Establishing m-convergence

• Let AP be an alignment position

• Define: $l = AP + m$. Since $|p| = m$: $l - (m - 1) \le U_l \le l$

• Hence: $U_l - AP \le m$, and so KMP-like algorithms are m-convergent.

KMP: Establishing subadditivity (1)

• If $c_n$ (number of comparisons) is subadditive we can prove linear complexity of KMP-like algorithms.

• We have to show: $c_n$ is (almost) subadditive: $c_{1,n} \le c_{1,r} + c_{r,n} + a$

• Approach: An l-convergent sequential algorithm satisfies: $|c_{1,n} - (c_{1,r} + c_{r,n})| \le m^2 + lm$

KMP: Establishing subadditivity (2)

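• Proof:
  – $U_r$: the smallest unavoidable AP greater than r.
  – We split $c_{1,n} - (c_{1,r} + c_{r,n})$ into $c_{1,n} - (c_{1,r} + c_{U_r,n})$ and $c_{r,n} - c_{U_r,n}$:

$c_{1,n} - (c_{1,r} + c_{r,n}) = [c_{1,n} - (c_{1,r} + c_{U_r,n})] - [c_{r,n} - c_{U_r,n}]$

[Figure: text positions r and $U_r$, with the ranges covered by $c_{1,n}$, $c_{1,r}$, $c_{r,n}$ and $c_{U_r,n}$.]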

KMP: Establishing subadditivity (3)

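• Comparisons done after r with AP before r: $S_1 = \sum_{AP < r} \sum_{i \ge r} M(i, i - AP + 1) \le m^2$

• Comparisons with AP between r and $U_r$: $S_2 = \sum_{r \le AP < U_r} \sum_{i \le m} M(AP + (i - 1), i) \le lm$

• No more than m comparisons can be saved at $U_r$.

[Figure: alignments around r and $U_r$; the comparisons counted in $S_1$ and $S_2$ are marked as contributing to $c_{1,n}$ and $c_{1,r}$, to $c_{1,n}$ only, or to $c_{1,n}$ and $c_{U_r,n}$.]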

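KMP: Establishing subadditivity (4)

• Comparisons with AP between r and $U_r$: $S_3 = \sum_{AP = r}^{U_r} \sum_{i \ge 1} M(AP + (i - 1), i) \le lm$

• No more than m comparisons can be saved at $U_r$.

[Figure: alignments around r and $U_r$; the comparisons counted in $S_3$ are marked as contributing to $c_{r,n}$ only, or to $c_{r,n}$ and $c_{U_r,n}$.]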

KMP: Establishing subadditivity (5)

• So we are able to bound, using the bounds on $S_1$, $S_2$ and $S_3$: $|c_{1,n} - (c_{1,r} + c_{r,n})| \le m^2 + lm$

• We have shown: $c_n$ is (almost) subadditive: $c_{1,n} \le c_{1,r} + c_{r,n} + a$

• Now we are able to apply the Subadditive Ergodic Theorem.

KMP: Different Modeling Assumptions

• Deterministic Model: Text and pattern are non-random.

• Semi-Random Model: Text is a realization of a stationary and ergodic sequence, pattern is given.

• Stationary Model: Both text and pattern are realizations of a stationary and ergodic sequence.

KMP: Applying the Subadditive Ergodic Theorem

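• We have shown: $c_n$ is (almost) subadditive

• Deterministic Model: $\lim_{n \to \infty} \frac{\max_t c_n(t, p)}{n} = \alpha_1(p)$

• Semi-Random Model: $\lim_{n \to \infty} \frac{C_n(p)}{n} = \alpha_2(p)$ (a.s.) and $\lim_{n \to \infty} \frac{E_t[C_n(p)]}{n} = \alpha_2(p)$

• Stationary Model: $\lim_{n \to \infty} \frac{E_{t,p}[C_n]}{n} = \alpha_3$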

KMP: Applying Azuma's Inequality

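• $C_n$ satisfies: $|C_n(T_1, \dots, T_i, \dots, T_n) - C_n(T_1, \dots, \hat{T}_i, \dots, T_n)| \le 2m^2$, where $\hat{T}_i$ is an independent copy of $T_i$.

• So, using Azuma's Inequality: $\Pr\{|C_n - n\alpha| \ge \varepsilon n\} \le 2 \exp\left(-\frac{\varepsilon^2 n}{2 (2m^2)^2} (1 + o(1))\right)$

• $C_n$ is concentrated around its mean: $E[C_n] = n\alpha (1 + o(1))$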
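As a worked step, the exponent follows from plugging the bounded-difference constant $2m^2$ into Azuma's inequality in the form $2\exp(-x^2 / (2\sum_i c_i^2))$ used earlier. The derivation below is a sketch for deviations from the exact mean $E[C_n]$; the symbol $\delta$ for a target failure probability is an assumption introduced here.

```latex
\begin{align*}
c_i &= 2m^2, \qquad \sum_{i=1}^{n} c_i^2 = 4 m^4 n, \qquad x = \varepsilon n, \\
\Pr\{|C_n - E[C_n]| \ge \varepsilon n\}
  &\le 2 \exp\left(-\frac{(\varepsilon n)^2}{2 \cdot 4 m^4 n}\right)
   = 2 \exp\left(-\frac{\varepsilon^2 n}{8 m^4}\right), \\
2 \exp\left(-\frac{\varepsilon^2 n}{8 m^4}\right) \le \delta
  &\iff n \ge \frac{8 m^4}{\varepsilon^2} \ln\frac{2}{\delta}.
\end{align*}
```

For a fixed pattern length m the bound therefore decays exponentially in the text length n, while the $m^4$ factor shows that it weakens quickly for longer patterns.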

Conclusion

• Using the Subadditive Ergodic Theorem we can show that there exists a linearity constant for the worst case and for the average case, respectively: KMP has linear complexity.

• The Subadditive Ergodic Theorem proves the existence of this constant but says nothing about how to compute it.

• Using Azuma's Inequality we can show that the number of comparisons is well concentrated around its mean.