
  • Randomized Algorithms CS648

    Lecture 5 • Application of Fingerprinting Technique

    • 1-dimensional Pattern matching

    • Union bound

    • Preparation for a memorable lecture on 23 January.

    1

    A powerful tool

  • FINGERPRINTING APPLICATION 2

    Pattern matching

    2

  • Text 𝑻[0…𝑛 − 1]:  100101100110001101111010101110101010111010000101

    Pattern 𝑷[0…𝑚 − 1]:  011110101011101

    Pattern 𝑷 is said to appear in Text 𝑻 at location 𝑘 if
    𝑻[𝑖 + 𝑘] = 𝑷[𝑖] for all 0 ≤ 𝑖 < 𝑚.
    (In the example above, 𝑷 appears in 𝑻 at location 𝑘 = 16, i.e., starting at the 17th character of 𝑻.)

    Problem: Given a Text 𝑻[0…𝑛 − 1] and a Pattern 𝑷[0…𝑚 − 1], does 𝑷 appear anywhere in 𝑻 ?

    Deterministic algorithms

    • Trivial algorithm: O(𝑚𝑛) time

    • Knuth-Morris-Pratt algorithm: O(𝑚 + 𝑛) time

    Randomized Monte Carlo algorithm

    • O(𝑚 + 𝑛) time, and error probability < 1/𝑛^𝑐

    3
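The trivial O(𝑚𝑛) algorithm mentioned above can be sketched in a few lines (an illustrative sketch; the function names are mine, not the course's):

```python
def appears_at(T, P, k):
    """Check whether pattern P appears in text T at location k (O(m) time)."""
    return all(T[k + i] == P[i] for i in range(len(P)))

def trivial_match(T, P):
    """Try every location k: O(mn) time in the worst case."""
    n, m = len(T), len(P)
    return [k for k in range(n - m + 1) if appears_at(T, P, k)]

T = "100101100110001101111010101110101010111010000101"
P = "011110101011101"
print(trivial_match(T, P))        # the single occurrence, at k = 16
```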

  • Motivation

    • Simplicity, real-time implementation, streaming environment

    • Extension to 2 dimensions: searching for an 𝑚⨯𝑚 pattern inside an 𝑛⨯𝑛 text, for which this technique gives an O(𝒏^𝟐 + 𝒎^𝟐) time algorithm

    • Converting the Monte Carlo algorithm into a Las Vegas algorithm

    4

  • RANDOMIZED ALGORITHM FOR PATTERN MATCHING

    5

  • Checking if 𝑷 appears in Text 𝑻 at location 𝒌

    Text 𝑻[0…𝑛 − 1]:  100101100110001101111010101010101010111010000101

    Pattern 𝑷[0…𝑚 − 1]:  0111101110110101

    Observation: An O(𝑚) time algorithm is obvious.

    Question: How to do this task in O(1) time ?

    Answer: Have a fingerprint.

    Question: What properties should the fingerprint possess?

    • Small size

    • Efficiently computable

    6

  • Checking if 𝑷 appears in Text 𝑻 at location 𝒌

    Text 𝑻[0…𝑛 − 1]:  100101100110001101111010101010101010111010000101

    Pattern 𝑷[0…𝑚 − 1]:  0111101110110101

    Idea (inspiration from the last lecture): visualize each sequence of bits as a long number.

    𝑵_𝑃 = Σ_{𝑖=0}^{𝑚−1} 2^{𝑚−1−𝑖} ∙ 𝑷[𝑖]

    𝑵_𝑇(𝑘) = Σ_{𝑖=0}^{𝑚−1} 2^{𝑚−1−𝑖} ∙ 𝑻[𝑖 + 𝑘]

    Let 𝒒 be a prime number selected randomly uniformly from [2, 𝑡], and let
    𝒇_𝑇(𝑘) = 𝑵_𝑇(𝑘) mod 𝒒 and 𝒇_𝑃 = 𝑵_𝑃 mod 𝒒.

    If 𝒇_𝑇(𝑘) = 𝒇_𝑃, then conclude that 𝑷 appears at 𝑘.

    Error occurs if 𝒒 is one of the prime factors of (𝑵_𝑃 − 𝑵_𝑇(𝑘)).

    Error probability at location 𝑘 ≤ 𝑚/𝜋(𝑡), since a number with 𝑚 bits has fewer than 𝑚 prime factors.

    Fingerprint size to get error probability ≤ 𝑛^{−𝑐}: O(log 𝑚 + log 𝑛) bits.

    Question: How much time does it take to compute 𝒇_𝑇(𝑘) ? Θ(𝑚), since 𝑵_𝑇(𝑘) has 𝑚 bits.

    7

  • Checking if 𝑷 appears in Text 𝑻 at location 𝒌

    Text 𝑻[0…𝑛 − 1]:  100101100110001101111010101010101010111010000101

    Pattern 𝑷[0…𝑚 − 1]:  0111101110110101

    𝑵_𝑃 = Σ_{𝑖=0}^{𝑚−1} 2^{𝑚−1−𝑖} ∙ 𝑷[𝑖]

    𝑵_𝑇(𝑘) = Σ_{𝑖=0}^{𝑚−1} 2^{𝑚−1−𝑖} ∙ 𝑻[𝑖 + 𝑘]

    Question: Any relation between 𝑵_𝑇(𝑘) and 𝑵_𝑇(𝑘 + 1) ?

    𝑵_𝑇(𝑘 + 1) = ( 𝑵_𝑇(𝑘) − 2^{𝑚−1} ∙ 𝑻[𝑘] ) ∙ 2 + 𝑻[𝑘 + 𝑚]

    Question: Any relation between 𝒇_𝑇(𝑘) and 𝒇_𝑇(𝑘 + 1) ?

    𝑵_𝑇(𝑘 + 1) mod 𝒒 = ( ( 𝑵_𝑇(𝑘) − 2^{𝑚−1} ∙ 𝑻[𝑘] ) ∙ 2 + 𝑻[𝑘 + 𝑚] ) mod 𝒒

    𝒇_𝑇(𝑘 + 1) = ( ( 𝒇_𝑇(𝑘) − 2^{𝑚−1} ∙ 𝑻[𝑘] ) ∙ 2 + 𝑻[𝑘 + 𝑚] ) mod 𝒒

    𝒇_𝑇(𝑘 + 1) = ( ( 𝒇_𝑇(𝑘) − (2^{𝑚−1} mod 𝒒) ∙ 𝑻[𝑘] ) ∙ 2 + 𝑻[𝑘 + 𝑚] ) mod 𝒒

    Note that (2^{𝑚−1} mod 𝒒) < 𝒒, so this update works with numbers of only O(log 𝒒) bits.

    8
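The fingerprints and the sliding relation above combine into the complete Monte Carlo matcher. A minimal Python sketch, assuming a simple rejection-sampling helper for picking the random prime from [2, 𝑡] (the helper names are mine):

```python
import random

def is_prime(x):
    """Trial-division primality test (fine for small x)."""
    if x < 2:
        return False
    d = 2
    while d * d <= x:
        if x % d == 0:
            return False
        d += 1
    return True

def random_prime(t):
    """A random prime from [2, t] by rejection sampling."""
    while True:
        q = random.randint(2, t)
        if is_prime(q):
            return q

def fingerprint_match(T, P, t):
    """Monte Carlo pattern matching over bit strings T and P.
    Reports every k with f_T(k) == f_P: it never misses a true
    occurrence, but may report false positives."""
    n, m = len(T), len(P)
    q = random_prime(t)
    top = pow(2, m - 1, q)                   # 2^(m-1) mod q
    fP = fT = 0
    for i in range(m):                       # Horner: N_P mod q and N_T(0) mod q
        fP = (fP * 2 + int(P[i])) % q
        fT = (fT * 2 + int(T[i])) % q
    hits = []
    for k in range(n - m + 1):
        if fT == fP:
            hits.append(k)
        if k + m < n:                        # f_T(k+1) from f_T(k), O(1) time
            fT = ((fT - top * int(T[k])) * 2 + int(T[k + m])) % q
    return hits

T = "100101100110001101111010101010101010111010000101"
P = "0111101110110101"
print(fingerprint_match(T, P, t=10**6))
```

A false positive at a location occurs only when 𝒒 happens to divide 𝑵_𝑃 − 𝑵_𝑇(𝑘); true occurrences are always reported.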

  • Fingerprint function: how good is it ?

    Text 𝑻[0…𝑛 − 1]:  100101100110001101111010101010101010111010000101

    Pattern 𝑷[0…𝑚 − 1]:  0111101110110101

    𝒇_𝑃 = ( Σ_{𝑖=0}^{𝑚−1} 2^{𝑚−1−𝑖} ∙ 𝑷[𝑖] ) mod 𝒒

    𝒇_𝑇(𝑘) = ( Σ_{𝑖=0}^{𝑚−1} 2^{𝑚−1−𝑖} ∙ 𝑻[𝑖 + 𝑘] ) mod 𝒒

    The fingerprint function

    • Occupies log 𝒒 bits.

    • Computing it takes O(log 𝒒) bit operations, i.e., O(1) time in the word-RAM model of computation.

    • Error probability for any particular location is 𝑚/𝜋(𝑡) < 1/𝑛^𝑐 for 𝑡 = 4𝑛^𝑐 𝑚 log 𝑛𝑚.

    Question: What is the error probability of the algorithm ?

    Homework (to be done in the next class as well).

    9

  • Randomized Algorithms discussed till now

    • Randomized algorithm for Approximate Median: randomly select a sample

    • Randomized Quick Sort: randomly select the pivots

    • Freivalds' algorithm for Matrix Product Verification: randomly select a vector

    • Randomized algorithm for Equality of two files: randomly select a prime number

    • Randomized algorithm for Pattern Matching: randomly select a prime number

    10

  • Randomized Algorithms

    How does one go about designing a randomized algorithm ?

    11

  • Randomized Algorithms

    Some random idea is required to design a randomized algorithm.

    Ponder over it …

    12

  • RANDOMIZED QUICK SORT

    13

  • Randomized Quick Sort

    Visualize the elements of A[𝟏 … 𝒏] arranged in increasing order of values.
    Call the pivot a good pivot if its position in this order lies between 𝒏/𝟒 and 𝟑𝒏/𝟒.

    14

  • Randomized Quick Sort

    Observation: There are many elements in A that are good pivots.

    Is it possible to select one good pivot efficiently ?

    (not possible deterministically)

    We select the pivot element randomly uniformly:
    a randomly selected element is a good pivot with probability 1/2.

    15
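The random pivot selection can be sketched as follows (a minimal illustration, not the course's reference implementation):

```python
import random

def randomized_quick_sort(A):
    """Quick sort with a uniformly random pivot. With probability 1/2
    the pivot is good, i.e. both recursive parts have size <= 3n/4."""
    if len(A) <= 1:
        return list(A)
    pivot = random.choice(A)
    smaller = [x for x in A if x < pivot]
    equal = [x for x in A if x == pivot]
    larger = [x for x in A if x > pivot]
    return randomized_quick_sort(smaller) + equal + randomized_quick_sort(larger)

print(randomized_quick_sort([5, 3, 8, 1, 9, 2, 7]))   # [1, 2, 3, 5, 7, 8, 9]
```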

  • RANDOMIZED ALGORITHM FOR APPROXIMATE MEDIAN

    16

  • Randomized Algorithm for Approximate median

    A random sample captures the essence of the original population.

    17

  • Randomized Algorithm for Approximate median

    Idea: Is it possible to select a small sample of elements whose median approximates the median of the entire set ?

    (not possible deterministically)

    The median of a uniformly random sample will be an approximate median.

    18

    A random sample captures the essence of the original population.
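The sampling idea can be sketched as follows (the √𝑛 sample size is an illustrative choice of mine, not a parameter fixed by the slides):

```python
import random

def approximate_median(A, sample_size=None):
    """Median of a uniformly random sample of A (drawn with replacement).
    With high probability its rank in A is close to n/2. The sqrt(n)
    sample size is an illustrative choice, not fixed by the lecture."""
    if sample_size is None:
        sample_size = max(1, int(len(A) ** 0.5))
    sample = sorted(random.choice(A) for _ in range(sample_size))
    return sample[len(sample) // 2]

A = list(range(10000))
print(approximate_median(A))     # very likely falls in the middle half of A
```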

  • Randomized Algorithms

    An idea based on insight into the problem

    Difficult/impossible to exploit the idea deterministically

    Randomization to materialize the idea

    A randomized algorithm

    19

  • FREIVALDS' TECHNIQUE APPLICATION:

    MATRIX PRODUCT VERIFICATION

    Homework:

    What is the key idea ?

    How does randomization help to materialize it ?

    20
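For reference while doing the homework, Freivalds' verification itself can be sketched as follows (my sketch, using random 0/1 vectors):

```python
import random

def freivalds_verify(A, B, C, trials=20):
    """Freivalds' check of whether A·B == C for n x n matrices.
    Each trial picks a random 0/1 vector r and compares A(Br) with Cr
    in O(n^2) time. If A·B != C, one trial errs with probability <= 1/2."""
    n = len(A)

    def mat_vec(M, v):
        return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]

    for _ in range(trials):
        r = [random.randint(0, 1) for _ in range(n)]
        if mat_vec(A, mat_vec(B, r)) != mat_vec(C, r):
            return False        # a witness found: certainly A·B != C
    return True                 # all trials passed: A·B == C with high probability

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(freivalds_verify(A, B, [[19, 22], [43, 50]]))   # correct product
print(freivalds_verify(A, B, [[19, 22], [43, 51]]))   # wrong in one entry
```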

  • THE UNION THEOREM

    21

  • Probability tool (union theorem)

    Suppose there is an event 𝜀 defined over a probability space (𝛀, P) such that

    𝜀 = ∪_𝑖 𝜀_𝑖

    Question: How is P(𝜀) related to the P(𝜀_𝑖)'s ?

    P(𝜀) ≤ Σ_𝑖 P(𝜀_𝑖)

    If P(𝜀_𝑖) is the same for each of 𝑛 events 𝜀_𝑖, then

    P(𝜀) ≤ 𝑛 P(𝜀_𝑖)

    22

  • Probability tool (union theorem)

    Question: Where to use the union theorem ?

    Suppose there is an event 𝜀 defined over a probability space (𝛀, P).

    Aim: to get an upper bound on P(𝜀).

    If it is difficult to calculate P(𝜀) directly,
    try to express 𝜀 as a union of 𝑛 events 𝜀_𝑖 (usually similar/same) such that

    • it is easy to calculate/bound P(𝜀_𝑖)

    Then you may bound P(𝜀) using the following inequality:

    P(𝜀) ≤ Σ_𝑖 P(𝜀_𝑖)

    23
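A quick numeric illustration of the union bound (my example, not from the slides): let 𝜀_𝑖 be the event that the 𝑖th of 𝑛 dice shows a six, so 𝜀 = ∪_𝑖 𝜀_𝑖 is the event that some die shows a six:

```python
from fractions import Fraction

# epsilon_i: the i-th of n dice shows a six (probability 1/6 each).
# epsilon = union of the epsilon_i: some die shows a six.
# Union theorem: P(epsilon) <= n * P(epsilon_i) = n/6.
for n in (1, 2, 3, 6, 12):
    exact = 1 - Fraction(5, 6) ** n     # P(union), computed directly
    bound = Fraction(n, 6)              # the union bound
    assert exact <= bound               # the bound always holds (it may exceed 1)
    print(n, float(exact), float(bound))
```

The bound is tight when the events are nearly disjoint (small 𝑛 here) and loose when they overlap heavily.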

  • APPLICATIONS OF THE UNION THEOREM

    24

  • Balls into Bins

    Ball-bin Experiment: There are 𝑚 balls (numbered 1 … 𝑚) and 𝑛 bins (numbered 1 … 𝑛).

    Each ball falls into a bin randomly uniformly and independently of other balls.

    Used in:

    • Hashing

    • Load balancing in a distributed environment

    25

  • Balls into Bins

    Ball-bin Experiment: There are 𝑚 balls and 𝑛 bins.

    Each ball falls into a bin randomly uniformly and independently of other balls.

    Theorem: For the case when 𝑚 = 𝑛,
    with very high probability, the maximum load is O(log 𝑛).

    26
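The theorem is easy to check empirically (a quick simulation of mine; the threshold 2𝑒 log 𝑛 matches the choice 𝑐 = 2𝑒 made later in the calculation):

```python
import random
from math import e, log2

def max_load(n):
    """Throw n balls into n bins uniformly at random; return the fullest bin's load."""
    bins = [0] * n
    for _ in range(n):
        bins[random.randrange(n)] += 1
    return max(bins)

n = 4096
observed = max(max_load(n) for _ in range(20))
threshold = 2 * e * log2(n)     # c log n with c = 2e
print(observed, threshold)      # the observed load sits far below the bound
```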

  • Balls into Bins

    Event 𝜀: There is some bin having at least 𝑐 log 𝑛 balls.

    27

  • Balls into Bins

    Event 𝜀: There is some bin having at least 𝑐 log 𝑛 balls.

    Event 𝜀_𝑗: the 𝑗th bin has at least 𝑐 log 𝑛 balls (the same situation seen from the perspective of the 𝑗th bin).

    Question: What is the relation between 𝜀 and the 𝜀_𝑗 ?

    Answer: 𝜀 = ∪_𝑗 𝜀_𝑗, so by the union theorem P(𝜀) ≤ Σ_𝑗 P(𝜀_𝑗).

    28

  • Balls into Bins

    Event 𝜀: There is some bin having at least 𝑐 log 𝑛 balls.

    Event 𝜀_𝑗: the 𝑗th bin has at least 𝑐 log 𝑛 balls.

    Since P(𝜀) ≤ Σ_𝑗 P(𝜀_𝑗),

    Observation: In order to show P(𝜀) < 𝑛^{−4}, it suffices to show P(𝜀_𝑗) < 𝑛^{−5}.

    29

  • AIM: TO SHOW P(𝜀_𝑗) < 𝑛^{−5}

    That is: P(𝑗th bin has at least 𝒄 𝐥𝐨𝐠 𝑛 balls) < 𝑛^{−5}

    30

  • Calculating P(𝜀_𝑗)

    P(𝜀_𝑗) = Σ_{𝑖 = 𝑐 log 𝑛}^{𝑛} P(𝑗th bin has exactly 𝑖 balls)

           = Σ_{𝑖 = 𝑐 log 𝑛}^{𝑛} C(𝑛, 𝑖) ∙ (1/𝑛)^𝑖 ∙ (1 − 1/𝑛)^{𝑛−𝑖}

           ≤ Σ_{𝑖 = 𝑐 log 𝑛}^{𝑛} C(𝑛, 𝑖) ∙ (1/𝑛)^𝑖

           = Σ_{𝑖 = 𝑐 log 𝑛}^{𝑛} [ 𝑛(𝑛−1)(𝑛−2)⋯(𝑛−𝑖+1) / 𝑖! ] ∙ (1/𝑛)^𝑖

           ≤ Σ_{𝑖 = 𝑐 log 𝑛}^{𝑛} 1/𝑖!

           ≤ (1/2) Σ_{𝑖 = 𝑐 log 𝑛}^{𝑛} (𝑒/𝑖)^𝑖        [using Stirling's formula 𝑖! ≈ (𝑖/𝑒)^𝑖 √(2π𝑖) ]

           ≤ (1/2) Σ_{𝑖 = 𝑐 log 𝑛}^{𝑛} (𝑒/(𝑐 log 𝑛))^𝑖

           = (1/2) Σ_{𝑖 = 2𝑒 log 𝑛}^{𝑛} (𝑒/(2𝑒 log 𝑛))^𝑖        [choosing 𝑐 = 2𝑒]

           ≤ (1/2) Σ_{𝑖 = 2𝑒 log 𝑛}^{∞} (1/2)^𝑖

           = (1/2)^{2𝑒 log 𝑛}

           ≤ 𝑛^{−2𝑒} ≤ 𝑛^{−5}

    31
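The final bound can be sanity-checked with exact rational arithmetic (a check of mine, not part of the lecture; 𝑛 = 256 and taking log base 2 are illustrative choices):

```python
from fractions import Fraction
from math import ceil, comb, e, log2

def tail(n, threshold):
    """Exact P(a fixed bin receives at least `threshold` of n balls),
    i.e. the Binomial(n, 1/n) upper tail, in exact rational arithmetic."""
    p = Fraction(1, n)
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(threshold, n + 1))

n = 256
t = ceil(2 * e * log2(n))           # c log n with c = 2e, log taken base 2
assert tail(n, t) < Fraction(1, n**5)
print(float(tail(n, t)), float(Fraction(1, n**5)))
```

The exact tail is many orders of magnitude below 𝑛^{−5}, so the bound proved above is far from tight.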

  • Balls into Bins

    Theorem:
    If 𝑛 balls are thrown randomly uniformly and independently into 𝑛 bins,
    then with probability at least 1 − 𝑛^{−4}, the maximum load of any bin will be O(log 𝑛) balls.

    Homework exercise:
    With a slightly more careful calculation, it can be shown that the maximum load will be O((log 𝑛)/log log 𝑛).

    32

  • APPLICATION 2 OF THE UNION THEOREM

    Randomized Quick sort:

    The secret of its popularity

    33

  • What makes Quick sort popular ?

    The reliability of quick sort (no. of repetitions = 1000):

    No. of times run time          𝒏 = 100   𝒏 = 1000   𝒏 = 𝟏𝟎^𝟒   𝒏 = 𝟏𝟎^𝟓   𝒏 = 𝟏𝟎^𝟔
    exceeds the average by
      𝟏𝟎%                              190         49         22         10          3
      𝟐𝟎%                               28         17         12          3          0
      𝟓𝟎%                                2          1          1          0          0
      𝟏𝟎𝟎%                               0          0          0          0          0

    Inference: As 𝒏 increases, the chances of deviation from the average case decrease sharply.
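An experiment of this kind can be reproduced with a comparison-counting randomized quicksort (my sketch; a smaller repetition count to keep it fast, with comparisons counted as one per non-pivot element per partition step):

```python
import random

def quicksort_comparisons(A):
    """Randomized quicksort on distinct elements; returns the number of
    comparisons, counting each non-pivot element once per partition step."""
    if len(A) <= 1:
        return 0
    pivot = random.choice(A)
    smaller = [x for x in A if x < pivot]
    larger = [x for x in A if x > pivot]
    count = len(A) - 1      # every other element is compared with the pivot
    return count + quicksort_comparisons(smaller) + quicksort_comparisons(larger)

n, reps = 1000, 200
counts = [quicksort_comparisons(list(range(n))) for _ in range(reps)]
avg = sum(counts) / reps
for pct in (10, 20, 50, 100):
    exceeded = sum(c > avg * (1 + pct / 100) for c in counts)
    print(f"exceeds average by {pct}%: {exceeded} of {reps} runs")
```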

  • What makes Quick sort popular ?

    Theorem [Colin McDiarmid, 1991]:
    Prob.(the run time exceeds the average by 𝒙%) = 𝒏^{−(𝒙/𝟏𝟎𝟎) 𝐥𝐧 𝐥𝐧 𝒏}

    Question: What is the probability that the run time exceeds double the average for 𝒏 = 𝟏𝟎^𝟔 ?
    Answer: < 𝟏𝟎^{−𝟏𝟓}

    This result is quite complex and involves sophisticated tools of probability, so it is not worth discussing in this course at this stage.
    But we can get a similar result using elementary probability tools.

  • Concentration of Randomized Quick Sort

    𝐗 : random variable for the number of comparisons during Randomized Quick Sort of an array A[𝟏 … 𝒏]

    We know: E[𝐗] = 2𝑛 log_𝑒 𝑛 − O(𝑛)

    Our aim: P(𝐗 > 𝑐 𝑛 log_𝑒 𝑛) < 𝑛^{−𝑑}

    For any constant 𝑑, we can find a constant 𝑐 such that the above inequality holds.

    We shall show that P(𝐗 > 8𝑛 log_{4/3} 𝑛) < 𝑛^{−7}.

    36

  • Concentration of Randomized Quick Sort

    Theorem: The probability that Randomized Quick sort performs more than
    8𝑛 log_{4/3} 𝑛 comparisons is less than 𝑛^{−7}.

    Tools needed:

    1. Union theorem

    2. P(less than 𝑡 HEADS during 8𝑡 tosses of a fair coin) ≤ (3/4)^{8𝑡}

    3. The right perspective

    37

    Spend at least one hour on your own trying to prove the above theorem with an open mind. If you do so,
    you will be able to appreciate the beauty of its proof, which we shall discuss on 24th Jan.
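Tool 2 in the list above can be verified exactly for small 𝑡 (a quick check of mine, not part of the lecture):

```python
from fractions import Fraction
from math import comb

def p_fewer_than_t_heads(t):
    """Exact P(fewer than t HEADS in 8t tosses of a fair coin)."""
    tosses = 8 * t
    return sum(comb(tosses, i) for i in range(t)) * Fraction(1, 2 ** tosses)

for t in (1, 2, 5, 10):
    assert p_fewer_than_t_heads(t) <= Fraction(3, 4) ** (8 * t)
print("bound holds for t = 1, 2, 5, 10")
```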