
  • Randomized Algorithms CS648

    Lecture 5 • Application of Fingerprinting Technique

    • 1-dimensional Pattern matching

    • Union bound

    • Preparation for a memorable lecture on 23 January.

    1

    A powerful tool

  • FINGERPRINTING APPLICATION 2

    Pattern matching

    2

  • Text 𝑻[0…𝑛 − 1]:  100101100110001101111010101110101010111010000101

    Pattern 𝑷[0…𝑚 − 1]:  011110101011101

    Pattern 𝑷 is said to appear in Text 𝑻 at location 𝑘 if
    𝑻[𝑖 + 𝑘] = 𝑷[𝑖] for all 0 ≤ 𝑖 < 𝑚.
    (In the example above, 𝑷 appears in 𝑻 at location 𝑘 = 16, i.e., starting at the 17th character of 𝑻.)

    Problem: Given a Text 𝑻[0…𝑛 − 1] and a Pattern 𝑷[0…𝑚 − 1], does 𝑷 appear anywhere in 𝑻 ?

    Deterministic algorithms

    • Trivial algorithm: O(𝑚𝑛) time

    • Knuth-Morris-Pratt algorithm: O(𝑚 + 𝑛) time

    Randomized Monte Carlo algorithm

    • O(𝑚 + 𝑛) time, and error probability < 1/𝑛^𝑐

    3
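The trivial O(𝑚𝑛) algorithm mentioned above can be sketched in a few lines (an illustrative sketch; the function names are mine, not the course's):

```python
def appears_at(T, P, k):
    """Check whether pattern P appears in text T at location k (O(m) time)."""
    return all(T[k + i] == P[i] for i in range(len(P)))

def trivial_match(T, P):
    """Try every location k: O(mn) time in the worst case."""
    n, m = len(T), len(P)
    return [k for k in range(n - m + 1) if appears_at(T, P, k)]

T = "100101100110001101111010101110101010111010000101"
P = "011110101011101"
print(trivial_match(T, P))        # the single occurrence, at k = 16
```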

  • Motivation

    • Simplicity, real-time implementation, streaming environment

    • Extension to 2 dimensions: searching for an 𝑚⨯𝑚 pattern inside an 𝑛⨯𝑛 text, for which this technique gives an O(𝒏^𝟐 + 𝒎^𝟐) time algorithm

    • Converting the Monte Carlo algorithm into a Las Vegas algorithm

    4

  • RANDOMIZED ALGORITHM FOR PATTERN MATCHING

    5

  • Checking if 𝑷 appears in Text 𝑻 at location 𝒌

    Text 𝑻[0…𝑛 − 1]:  100101100110001101111010101010101010111010000101

    Pattern 𝑷[0…𝑚 − 1]:  0111101110110101

    Observation: An O(𝑚) time algorithm is obvious.

    Question: How to do this task in O(1) time ?

    Answer: Have a fingerprint.

    Question: What properties should the fingerprint possess?

    • Small size

    • Efficiently computable

    6

  • Checking if 𝑷 appears in Text 𝑻 at location 𝒌

    Text 𝑻[0…𝑛 − 1]:  100101100110001101111010101010101010111010000101

    Pattern 𝑷[0…𝑚 − 1]:  0111101110110101

    Idea (inspiration from the last lecture): visualize each sequence of bits as a long number.

    𝑵_𝑃 = Σ_{𝑖=0}^{𝑚−1} 2^{𝑚−1−𝑖} ∙ 𝑷[𝑖]

    𝑵_𝑇(𝑘) = Σ_{𝑖=0}^{𝑚−1} 2^{𝑚−1−𝑖} ∙ 𝑻[𝑖 + 𝑘]

    Let 𝒒 be a prime number selected randomly uniformly from [2, 𝑡], and let
    𝒇_𝑇(𝑘) = 𝑵_𝑇(𝑘) mod 𝒒 and 𝒇_𝑃 = 𝑵_𝑃 mod 𝒒.

    If 𝒇_𝑇(𝑘) = 𝒇_𝑃, then conclude that 𝑷 appears at 𝑘.

    Error occurs if 𝒒 is one of the prime factors of (𝑵_𝑃 − 𝑵_𝑇(𝑘)).

    Error probability at location 𝑘 ≤ 𝑚/𝜋(𝑡), since a number with 𝑚 bits has fewer than 𝑚 prime factors.

    Fingerprint size to get error probability ≤ 𝑛^{−𝑐}: O(log 𝑚 + log 𝑛) bits.

    Question: How much time does it take to compute 𝒇_𝑇(𝑘) ? Θ(𝑚), since 𝑵_𝑇(𝑘) has 𝑚 bits.

    7

  • Checking if 𝑷 appears in Text 𝑻 at location 𝒌

    Text 𝑻[0…𝑛 − 1]:  100101100110001101111010101010101010111010000101

    Pattern 𝑷[0…𝑚 − 1]:  0111101110110101

    𝑵_𝑃 = Σ_{𝑖=0}^{𝑚−1} 2^{𝑚−1−𝑖} ∙ 𝑷[𝑖]

    𝑵_𝑇(𝑘) = Σ_{𝑖=0}^{𝑚−1} 2^{𝑚−1−𝑖} ∙ 𝑻[𝑖 + 𝑘]

    Question: Any relation between 𝑵_𝑇(𝑘) and 𝑵_𝑇(𝑘 + 1) ?

    𝑵_𝑇(𝑘 + 1) = ( 𝑵_𝑇(𝑘) − 2^{𝑚−1} ∙ 𝑻[𝑘] ) ∙ 2 + 𝑻[𝑘 + 𝑚]

    Question: Any relation between 𝒇_𝑇(𝑘) and 𝒇_𝑇(𝑘 + 1) ?

    𝑵_𝑇(𝑘 + 1) mod 𝒒 = ( ( 𝑵_𝑇(𝑘) − 2^{𝑚−1} ∙ 𝑻[𝑘] ) ∙ 2 + 𝑻[𝑘 + 𝑚] ) mod 𝒒

    𝒇_𝑇(𝑘 + 1) = ( ( 𝒇_𝑇(𝑘) − 2^{𝑚−1} ∙ 𝑻[𝑘] ) ∙ 2 + 𝑻[𝑘 + 𝑚] ) mod 𝒒

    𝒇_𝑇(𝑘 + 1) = ( ( 𝒇_𝑇(𝑘) − (2^{𝑚−1} mod 𝒒) ∙ 𝑻[𝑘] ) ∙ 2 + 𝑻[𝑘 + 𝑚] ) mod 𝒒

    Note that (2^{𝑚−1} mod 𝒒) < 𝒒, so this update works with numbers of only O(log 𝒒) bits.

    8
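The fingerprints and the sliding relation above combine into the complete Monte Carlo matcher. A minimal Python sketch, assuming a simple rejection-sampling helper for picking the random prime from [2, 𝑡] (the helper names are mine):

```python
import random

def is_prime(x):
    """Trial-division primality test (fine for small x)."""
    if x < 2:
        return False
    d = 2
    while d * d <= x:
        if x % d == 0:
            return False
        d += 1
    return True

def random_prime(t):
    """A random prime from [2, t] by rejection sampling."""
    while True:
        q = random.randint(2, t)
        if is_prime(q):
            return q

def fingerprint_match(T, P, t):
    """Monte Carlo pattern matching over bit strings T and P.
    Reports every k with f_T(k) == f_P: it never misses a true
    occurrence, but may report false positives."""
    n, m = len(T), len(P)
    q = random_prime(t)
    top = pow(2, m - 1, q)                   # 2^(m-1) mod q
    fP = fT = 0
    for i in range(m):                       # Horner: N_P mod q and N_T(0) mod q
        fP = (fP * 2 + int(P[i])) % q
        fT = (fT * 2 + int(T[i])) % q
    hits = []
    for k in range(n - m + 1):
        if fT == fP:
            hits.append(k)
        if k + m < n:                        # f_T(k+1) from f_T(k), O(1) time
            fT = ((fT - top * int(T[k])) * 2 + int(T[k + m])) % q
    return hits

T = "100101100110001101111010101010101010111010000101"
P = "0111101110110101"
print(fingerprint_match(T, P, t=10**6))
```

A false positive at a location occurs only when 𝒒 happens to divide 𝑵_𝑃 − 𝑵_𝑇(𝑘); true occurrences are always reported.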

  • Fingerprint function: how good is it ?

    Text 𝑻[0…𝑛 − 1]:  100101100110001101111010101010101010111010000101

    Pattern 𝑷[0…𝑚 − 1]:  0111101110110101

    𝒇_𝑃 = ( Σ_{𝑖=0}^{𝑚−1} 2^{𝑚−1−𝑖} ∙ 𝑷[𝑖] ) mod 𝒒

    𝒇_𝑇(𝑘) = ( Σ_{𝑖=0}^{𝑚−1} 2^{𝑚−1−𝑖} ∙ 𝑻[𝑖 + 𝑘] ) mod 𝒒

    The fingerprint function

    • Occupies log 𝒒 bits.

    • Computing it takes O(log 𝒒) bit operations, i.e., O(1) time in the word-RAM model of computation.

    • Error probability for any particular location is 𝑚/𝜋(𝑡) < 1/𝑛^𝑐 for 𝑡 = 4𝑛^𝑐 𝑚 log 𝑛𝑚.

    Question: What is the error probability of the algorithm ?

    Homework (to be done in the next class as well).

    9

  • Randomized Algorithms discussed till now

    • Randomized algorithm for Approximate Median: randomly select a sample

    • Randomized Quick Sort: randomly select the pivots

    • Freivalds' algorithm for Matrix Product Verification: randomly select a vector

    • Randomized algorithm for Equality of two files: randomly select a prime number

    • Randomized algorithm for Pattern Matching: randomly select a prime number

    10

  • Randomized Algorithms

    How does one go about designing a randomized algorithm ?

    11

  • Randomized Algorithms

    Some random idea is required to design a randomized algorithm.

    Ponder over it …

    12

  • RANDOMIZED QUICK SORT

    13

  • Randomized Quick Sort

    Visualize the elements of A[𝟏 … 𝒏] arranged in increasing order of values.
    Call the pivot a good pivot if its position in this order lies between 𝒏/𝟒 and 𝟑𝒏/𝟒.

    14

  • Randomized Quick Sort

    Observation: There are many elements in A that are good pivots.

    Is it possible to select one good pivot efficiently ?

    (not possible deterministically)

    We select the pivot element randomly uniformly:
    a randomly selected element is a good pivot with probability 1/2.

    15
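The random pivot selection can be sketched as follows (a minimal illustration, not the course's reference implementation):

```python
import random

def randomized_quick_sort(A):
    """Quick sort with a uniformly random pivot. With probability 1/2
    the pivot is good, i.e. both recursive parts have size <= 3n/4."""
    if len(A) <= 1:
        return list(A)
    pivot = random.choice(A)
    smaller = [x for x in A if x < pivot]
    equal = [x for x in A if x == pivot]
    larger = [x for x in A if x > pivot]
    return randomized_quick_sort(smaller) + equal + randomized_quick_sort(larger)

print(randomized_quick_sort([5, 3, 8, 1, 9, 2, 7]))   # [1, 2, 3, 5, 7, 8, 9]
```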

  • RANDOMIZED ALGORITHM FOR APPROXIMATE MEDIAN

    16

  • Randomized Algorithm for Approximate median

    A random sample captures the essence of the original population.

    17

  • Randomized Algorithm for Approximate median

    Idea: Is it possible to select a small sample of elements whose median approximates the median of the entire set ?

    (not possible deterministically)

    The median of a uniformly random sample will be an approximate median.

    18

    A random sample captures the essence of the original population.
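The sampling idea can be sketched as follows (the √𝑛 sample size is an illustrative choice of mine, not a parameter fixed by the slides):

```python
import random

def approximate_median(A, sample_size=None):
    """Median of a uniformly random sample of A (drawn with replacement).
    With high probability its rank in A is close to n/2. The sqrt(n)
    sample size is an illustrative choice, not fixed by the lecture."""
    if sample_size is None:
        sample_size = max(1, int(len(A) ** 0.5))
    sample = sorted(random.choice(A) for _ in range(sample_size))
    return sample[len(sample) // 2]

A = list(range(10000))
print(approximate_median(A))     # very likely falls in the middle half of A
```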

  • Randomized Algorithms

    An idea based on insight into the problem

    Difficult/impossible to exploit the idea deterministically

    Randomization to materialize the idea

    A randomized algorithm

    19

  • FREIVALDS' TECHNIQUE APPLICATION:

    MATRIX PRODUCT VERIFICATION

    Homework:

    What is the key idea ?

    How does randomization help to materialize it ?

    20
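For reference while doing the homework, Freivalds' verification itself can be sketched as follows (my sketch, using random 0/1 vectors):

```python
import random

def freivalds_verify(A, B, C, trials=20):
    """Freivalds' check of whether A·B == C for n x n matrices.
    Each trial picks a random 0/1 vector r and compares A(Br) with Cr
    in O(n^2) time. If A·B != C, one trial errs with probability <= 1/2."""
    n = len(A)

    def mat_vec(M, v):
        return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]

    for _ in range(trials):
        r = [random.randint(0, 1) for _ in range(n)]
        if mat_vec(A, mat_vec(B, r)) != mat_vec(C, r):
            return False        # a witness found: certainly A·B != C
    return True                 # all trials passed: A·B == C with high probability

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(freivalds_verify(A, B, [[19, 22], [43, 50]]))   # correct product
print(freivalds_verify(A, B, [[19, 22], [43, 51]]))   # wrong in one entry
```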

  • THE UNION THEOREM

    21

  • Probability tool (union theorem)

    Suppose there is an event 𝜀 defined over a probability space (𝛀, P) such that

    𝜀 = ∪_𝑖 𝜀_𝑖

    Question: How is P(𝜀) related to the P(𝜀_𝑖)'s ?

    P(𝜀) ≤ Σ_𝑖 P(𝜀_𝑖)

    If P(𝜀_𝑖) is the same for each of 𝑛 events 𝜀_𝑖, then

    P(𝜀) ≤ 𝑛 P(𝜀_𝑖)

    22

  • Probability tool (union theorem)

    Question: Where to use the union theorem ?

    Suppose there is an event 𝜀 defined over a probability space (𝛀, P).

    Aim: to get an upper bound on P(𝜀).

    If it is difficult to calculate P(𝜀) directly,
    try to express 𝜀 as a union of 𝑛 events 𝜀_𝑖 (usually similar/same) such that

    • it is easy to calculate/bound P(𝜀_𝑖)

    Then you may bound P(𝜀) using the following inequality:

    P(𝜀) ≤ Σ_𝑖 P(𝜀_𝑖)

    23
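A quick numeric illustration of the union bound (my example, not from the slides): let 𝜀_𝑖 be the event that the 𝑖th of 𝑛 dice shows a six, so 𝜀 = ∪_𝑖 𝜀_𝑖 is the event that some die shows a six:

```python
from fractions import Fraction

# epsilon_i: the i-th of n dice shows a six (probability 1/6 each).
# epsilon = union of the epsilon_i: some die shows a six.
# Union theorem: P(epsilon) <= n * P(epsilon_i) = n/6.
for n in (1, 2, 3, 6, 12):
    exact = 1 - Fraction(5, 6) ** n     # P(union), computed directly
    bound = Fraction(n, 6)              # the union bound
    assert exact <= bound               # the bound always holds (it may exceed 1)
    print(n, float(exact), float(bound))
```

The bound is tight when the events are nearly disjoint (small 𝑛 here) and loose when they overlap heavily.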

  • APPLICATIONS OF THE UNION THEOREM

    24

  • Balls into Bins

    Ball-bin Experiment: There are 𝑚 balls (numbered 1 … 𝑚) and 𝑛 bins (numbered 1 … 𝑛).

    Each ball falls into a bin randomly uniformly and independently of other balls.

    Used in:

    • Hashing

    • Load balancing in a distributed environment

    25

  • Balls into Bins

    Ball-bin Experiment: There are 𝑚 balls and 𝑛 bins.

    Each ball falls into a bin randomly uniformly and independently of other balls.

    Theorem: For the case when 𝑚 = 𝑛,
    with very high probability, the maximum load is O(log 𝑛).

    26
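The theorem is easy to check empirically (a quick simulation of mine; the threshold 2𝑒 log 𝑛 matches the choice 𝑐 = 2𝑒 made later in the calculation):

```python
import random
from math import e, log2

def max_load(n):
    """Throw n balls into n bins uniformly at random; return the fullest bin's load."""
    bins = [0] * n
    for _ in range(n):
        bins[random.randrange(n)] += 1
    return max(bins)

n = 4096
observed = max(max_load(n) for _ in range(20))
threshold = 2 * e * log2(n)     # c log n with c = 2e
print(observed, threshold)      # the observed load sits far below the bound
```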

  • Balls into Bins

    Event 𝜀: There is some bin having at least 𝑐 log 𝑛 balls.

    27

  • Balls into Bins

    Event 𝜀: There is some bin having at least 𝑐 log 𝑛 balls.

    Event 𝜀_𝑗: the 𝑗th bin has at least 𝑐 log 𝑛 balls (the same situation seen from the perspective of the 𝑗th bin).

    Question: What is the relation between 𝜀 and the 𝜀_𝑗 ?

    Answer: 𝜀 = ∪_𝑗 𝜀_𝑗, so by the union theorem P(𝜀) ≤ Σ_𝑗 P(𝜀_𝑗).

    28

  • Balls into Bins

    Event 𝜀: There is some bin having at least 𝑐 log 𝑛 balls.

    Event 𝜀_𝑗: the 𝑗th bin has at least 𝑐 log 𝑛 balls.

    Since P(𝜀) ≤ Σ_𝑗 P(𝜀_𝑗),

    Observation: In order to show P(𝜀) < 𝑛^{−4}, it suffices to show P(𝜀_𝑗) < 𝑛^{−5}.

    29

  • AIM: TO SHOW P(𝜀_𝑗) < 𝑛^{−5}

    That is: P(𝑗th bin has at least 𝒄 𝐥𝐨𝐠 𝑛 balls) < 𝑛^{−5}

    30

  • Calculating P(𝜀_𝑗)

    P(𝜀_𝑗) = Σ_{𝑖 = 𝑐 log 𝑛}^{𝑛} P(𝑗th bin has exactly 𝑖 balls)

           = Σ_{𝑖 = 𝑐 log 𝑛}^{𝑛} C(𝑛, 𝑖) ∙ (1/𝑛)^𝑖 ∙ (1 − 1/𝑛)^{𝑛−𝑖}

           ≤ Σ_{𝑖 = 𝑐 log 𝑛}^{𝑛} C(𝑛, 𝑖) ∙ (1/𝑛)^𝑖

           = Σ_{𝑖 = 𝑐 log 𝑛}^{𝑛} [ 𝑛(𝑛−1)(𝑛−2)⋯(𝑛−𝑖+1) / 𝑖! ] ∙ (1/𝑛)^𝑖

           ≤ Σ_{𝑖 = 𝑐 log 𝑛}^{𝑛} 1/𝑖!

           ≤ (1/2) Σ_{𝑖 = 𝑐 log 𝑛}^{𝑛} (𝑒/𝑖)^𝑖        [using Stirling's formula 𝑖! ≈ (𝑖/𝑒)^𝑖 √(2π𝑖) ]

           ≤ (1/2) Σ_{𝑖 = 𝑐 log 𝑛}^{𝑛} (𝑒/(𝑐 log 𝑛))^𝑖

           = (1/2) Σ_{𝑖 = 2𝑒 log 𝑛}^{𝑛} (𝑒/(2𝑒 log 𝑛))^𝑖        [choosing 𝑐 = 2𝑒]

           ≤ (1/2) Σ_{𝑖 = 2𝑒 log 𝑛}^{∞} (1/2)^𝑖

           = (1/2)^{2𝑒 log 𝑛}

           ≤ 𝑛^{−2𝑒} ≤ 𝑛^{−5}

    31
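The final bound can be sanity-checked with exact rational arithmetic (a check of mine, not part of the lecture; 𝑛 = 256 and taking log base 2 are illustrative choices):

```python
from fractions import Fraction
from math import ceil, comb, e, log2

def tail(n, threshold):
    """Exact P(a fixed bin receives at least `threshold` of n balls),
    i.e. the Binomial(n, 1/n) upper tail, in exact rational arithmetic."""
    p = Fraction(1, n)
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(threshold, n + 1))

n = 256
t = ceil(2 * e * log2(n))           # c log n with c = 2e, log taken base 2
assert tail(n, t) < Fraction(1, n**5)
print(float(tail(n, t)), float(Fraction(1, n**5)))
```

The exact tail is many orders of magnitude below 𝑛^{−5}, so the bound proved above is far from tight.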

  • Balls into Bins

    Theorem:
    If 𝑛 balls are thrown randomly uniformly and independently into 𝑛 bins,
    then with probability at least 1 − 𝑛^{−4}, the maximum load of any bin will be O(log 𝑛) balls.

    Homework exercise:
    With a slightly more careful calculation, it can be shown that the maximum load will be O((log 𝑛)/log log 𝑛).

    32

  • APPLICATION 2 OF THE UNION THEOREM

    Randomized Quick sort:

    The secret of its popularity

    33

  • What makes Quick sort popular ?

    The reliability of quick sort (no. of repetitions = 1000):

    No. of times run time          𝒏 = 100   𝒏 = 1000   𝒏 = 𝟏𝟎^𝟒   𝒏 = 𝟏𝟎^𝟓   𝒏 = 𝟏𝟎^𝟔
    exceeds the average by
      𝟏𝟎%                              190         49         22         10          3
      𝟐𝟎%                               28         17         12          3          0
      𝟓𝟎%                                2          1          1          0          0
      𝟏𝟎𝟎%                               0          0          0          0          0

    Inference: As 𝒏 increases, the chances of deviation from the average case decrease sharply.
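An experiment of this kind can be reproduced with a comparison-counting randomized quicksort (my sketch; a smaller repetition count to keep it fast, with comparisons counted as one per non-pivot element per partition step):

```python
import random

def quicksort_comparisons(A):
    """Randomized quicksort on distinct elements; returns the number of
    comparisons, counting each non-pivot element once per partition step."""
    if len(A) <= 1:
        return 0
    pivot = random.choice(A)
    smaller = [x for x in A if x < pivot]
    larger = [x for x in A if x > pivot]
    count = len(A) - 1      # every other element is compared with the pivot
    return count + quicksort_comparisons(smaller) + quicksort_comparisons(larger)

n, reps = 1000, 200
counts = [quicksort_comparisons(list(range(n))) for _ in range(reps)]
avg = sum(counts) / reps
for pct in (10, 20, 50, 100):
    exceeded = sum(c > avg * (1 + pct / 100) for c in counts)
    print(f"exceeds average by {pct}%: {exceeded} of {reps} runs")
```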

  • What makes Quick sort popular ?

    Theorem [Colin McDiarmid, 1991]:
    Prob.(the run time exceeds the average by 𝒙%) = 𝒏^{−(𝒙/𝟏𝟎𝟎) 𝐥𝐧 𝐥𝐧 𝒏}

    Question: What is the probability that the run time exceeds double the average for 𝒏 = 𝟏𝟎^𝟔 ?
    Answer: < 𝟏𝟎^{−𝟏𝟓}

    This result is quite complex and involves sophisticated tools of probability, so it is not worth discussing in this course at this stage.
    But we can get a similar result using elementary probability tools.

  • Concentration of Randomized Quick Sort

    𝐗 : random variable for the number of comparisons during Randomized Quick Sort of an array A[𝟏 … 𝒏]

    We know: E[𝐗] = 2𝑛 log_𝑒 𝑛 − O(𝑛)

    Our aim: P(𝐗 > 𝑐 𝑛 log_𝑒 𝑛) < 𝑛^{−𝑑}

    For any constant 𝑑, we can find a constant 𝑐 such that the above inequality holds.

    We shall show that P(𝐗 > 8𝑛 log_{4/3} 𝑛) < 𝑛^{−7}.

    36

  • Concentration of Randomized Quick Sort

    Theorem: The probability that Randomized Quick sort performs more than
    8𝑛 log_{4/3} 𝑛 comparisons is less than 𝑛^{−7}.

    Tools needed:

    1. Union theorem

    2. P(less than 𝑡 HEADS during 8𝑡 tosses of a fair coin) ≤ (3/4)^{8𝑡}

    3. The right perspective

    37

    Spend at least one hour on your own trying to prove the above theorem with an open mind. If you do so,
    you will be able to appreciate the beauty of its proof, which we shall discuss on 24th Jan.
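Tool 2 in the list above can be verified exactly for small 𝑡 (a quick check of mine, not part of the lecture):

```python
from fractions import Fraction
from math import comb

def p_fewer_than_t_heads(t):
    """Exact P(fewer than t HEADS in 8t tosses of a fair coin)."""
    tosses = 8 * t
    return sum(comb(tosses, i) for i in range(t)) * Fraction(1, 2 ** tosses)

for t in (1, 2, 5, 10):
    assert p_fewer_than_t_heads(t) <= Fraction(3, 4) ** (8 * t)
print("bound holds for t = 1, 2, 5, 10")
```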