

5. Balls, Bins, and Random Graphs

The Birthday Paradox
Balls into Bins
The Poisson Distribution
The Poisson Approximation
Hashing
Random Graphs

5 Balls, Bins, and Random Graphs

• Let us throw balls randomly into bins, each ball lands in a bin chosen independently and uniformly at random (I+U@R)
• We use the techniques we have developed previously to analyze this process and develop a new approach based on what is known as the Poisson approximation

8-Feb-17 MAT-72306 RandAl, Spring 2017 189


5.1. Example: The Birthday Paradox

• Is it more likely that some two people in the room share the same birthday or that no two people in the room share the same birthday?
• We assume that the birthday of each person is a random day from a 365-day year, each chosen I+U@R
• We assume that a person's birthday is equally likely to be any day of the year, we avoid leap years, and we ignore the possibility of twins

• Let there be 30 people
• Thirty days must be chosen from the 365; there are C(365, 30) ways to do this
• These 30 days can be assigned to the people in any of the 30! possible orders
• Hence there are C(365, 30) · 30! configurations where no two people share the same birthday, out of the 365^30 ways the birthdays could occur
• Thus, the probability is

C(365, 30) · 30! / 365^30


• We can also consider one person at a time
– The first person has a birthday
– The probability that the second person has a different birthday is (1 − 1/365)
– The probability that the third person then has a birthday different from the first two, given that they have different birthdays, is (1 − 2/365)
– Continuing on, the probability that the kth person has a different birthday than the first k − 1, assuming that they have different birthdays, is (1 − (k − 1)/365)

• So the probability that 30 people all have different birthdays is the product of these terms:

(1 − 1/365)(1 − 2/365) ··· (1 − 29/365)

• This product is 0.2937, so when 30 people are in the room there is more than a 70% chance that two share the same birthday
• A similar calculation shows that only 23 people need to be in the room before it is more likely than not that two people share a birthday
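The product above is easy to check numerically. A minimal Python sketch (not part of the original slides) that computes the all-distinct probability and finds the 23-person threshold:

```python
def all_distinct_prob(k, n=365):
    """Probability that k people all have distinct birthdays among n
    equally likely days: the product of (1 - j/n) for j = 1..k-1."""
    p = 1.0
    for j in range(1, k):
        p *= 1 - j / n
    return p

# With 30 people, the probability of all-distinct birthdays is about 0.2937,
# so a shared birthday has probability above 70%.
print(round(all_distinct_prob(30), 4))  # 0.2937

# Smallest group size at which a shared birthday is more likely than not.
threshold = min(k for k in range(2, 366) if all_distinct_prob(k) < 0.5)
print(threshold)  # 23
```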


• More generally, if there are m people and n possible birthdays then the probability that all m have different birthdays is

(1 − 1/n)(1 − 2/n) ··· (1 − (m − 1)/n)

• Using that 1 − k/n ≈ e^{−k/n} when k is small compared to n, we see that if m is small compared to n then

∏_{j=1}^{m−1} (1 − j/n) ≈ ∏_{j=1}^{m−1} e^{−j/n} = exp(−m(m − 1)/2n) ≈ e^{−m²/2n}

• Hence the value for m at which the probability that m people all have different birthdays is 1/2 is approximately given by the equation

m²/2n = ln 2,

or m = √(2n ln 2)
• For n = 365, this approximation gives m ≈ 22.49, matching the exact calculation quite well
• Mars has n = 687 days, need m ≈ 30.86 aliens
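The m = √(2n ln 2) approximation is a one-liner to verify (an illustrative sketch; the 22.49 and 30.86 figures come from the slide):

```python
import math

def half_point(n):
    """Approximate group size at which the all-distinct probability
    drops to 1/2: sqrt(2 * n * ln 2)."""
    return math.sqrt(2 * n * math.log(2))

print(round(half_point(365), 2))  # 22.49 (Earth)
print(round(half_point(687), 2))  # 30.86 (Mars)
```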


• The following simple arguments give loose bounds and good intuition
• Let us consider each person one at a time, and let E_k be the event that the kth person's birthday does not match any of the birthdays of the first k − 1 people
• Then the probability that the first k people fail to have distinct birthdays is

Pr(¬E_1 ∪ ¬E_2 ∪ ··· ∪ ¬E_k) ≤ Σ_{i=1}^{k} Pr(¬E_i) ≤ Σ_{i=1}^{k} (i − 1)/n = k(k − 1)/2n

• If k ≤ √n, this Pr is < 1/2, so with ⌊√n⌋ people the Pr is > 1/2 that all birthdays will be distinct

• Now assume that the first √n people all have distinct birthdays
• Each person after that has probability at least √n/n = 1/√n of having the same birthday as one of these first √n people
• Hence the Pr that the next √n people all have different birthdays than the first √n is at most

(1 − 1/√n)^{√n} < 1/e < 1/2

• Hence, once there are 2√n people, the Pr is at most 1/e that all birthdays will be distinct


5.2. Balls into Bins

• m balls are thrown into n bins, with the location of each ball chosen I+U@R from the n possibilities
• The question behind the birthday paradox is whether or not there is a bin with two balls
• How many of the bins are empty?
• How many balls are in the fullest bin?
• Many of these questions have applications to the design and analysis of algorithms

• Birthday paradox: place m balls randomly into n bins then, for some m = Ω(√n), at least one of the bins is likely to have more than one ball in it
• Another interesting question concerns the max number of balls in a bin, or the maximum load
• Let us consider the case where m = n, so that the number of balls equals the number of bins and the average load is 1
• Of course the maximum load is at most n, but it is very unlikely that all n balls land in the same bin
• We seek an upper bound that holds with probability tending to 1 as n grows large


• We can show that the maximum load is more than 3 ln n / ln ln n with probability at most 1/n for sufficiently large n, via a direct calculation and a union bound
• This is a very loose bound; although the maximum load is in fact Θ(ln n / ln ln n) with probability close to 1, the constant factor 3 is chosen to simplify the argument and could be reduced with more care

Lemma 5.1: When n balls are thrown I+U@R into n bins, the probability that the maximum load is more than 3 ln n / ln ln n is at most 1/n for n sufficiently large
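Lemma 5.1 is easy to probe by simulation. This illustrative sketch (the seed and n = 10,000 are arbitrary choices, not from the slides) throws n balls into n bins and compares the observed maximum load to the 3 ln n / ln ln n bound:

```python
import math
import random

def max_load(n, rng):
    """Throw n balls into n bins I+U@R and return the maximum bin load."""
    counts = [0] * n
    for _ in range(n):
        counts[rng.randrange(n)] += 1
    return max(counts)

rng = random.Random(42)
n = 10_000
bound = 3 * math.log(n) / math.log(math.log(n))
load = max_load(n, rng)
# The observed maximum load is typically far below this (loose) bound.
print(load, round(bound, 2))
assert load <= bound
```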

Proof: The probability that bin 1 receives at least M balls is at most

C(n, M) · (1/n)^M.

This follows from a union bound; there are C(n, M) distinct sets of M balls, and for any set of M balls the probability that all land in bin 1 is (1/n)^M. We now use the inequalities

C(n, M) · (1/n)^M ≤ 1/M! ≤ (e/M)^M


• The second inequality is a consequence of the following general bound on factorials: since

M^M / M! < Σ_{i≥0} M^i / i! = e^M,

we have M! > (M/e)^M
• Applying a union bound again allows us to find that, for M ≥ 3 ln n / ln ln n, the probability that any bin receives at least M balls is bounded above by

n · (e/M)^M ≤ n · (e ln ln n / (3 ln n))^{3 ln n / ln ln n}

≤ n · (ln ln n / ln n)^{3 ln n / ln ln n}
= e^{ln n} · e^{(3 ln n / ln ln n)(ln ln ln n − ln ln n)}
= e^{−2 ln n + 3 ln n (ln ln ln n) / ln ln n}
≤ 1/n

for n sufficiently large. ∎


5.2.2. Application: Bucket Sort

• Bucket sort breaks the Ω(n log n) lower bound for standard comparison-based sorting, under certain assumptions on the input
• We want to sort a set of n = 2^m integers chosen I+U@R from the range [0, 2^k), where k ≥ m
• Using Bucket sort, we can sort the numbers in expected time O(n)
• Expectation is over the choice of the random input; Bucket sort is a deterministic algorithm

• Bucket sort works in two stages
• First we place the elements into n buckets
• The jth bucket holds all elements whose first m binary digits correspond to the number j
• E.g., if n = 2^10, bucket 3 contains all elements whose first 10 binary digits are 0000000011
• When j < ℓ, the elements of the jth bucket all come before those in the ℓth bucket in the sorted order
• Assuming that each element can be placed in the appropriate bucket in O(1) time, this stage requires only O(n) time


• Because the elements to be sorted are chosen uniformly, the number of elements that land in a specific bucket follows a binomial distribution B(n, 1/n)
• Buckets can be implemented using linked lists
• In the second stage, each bucket is sorted using any standard quadratic time algorithm (e.g., Bubblesort or Insertion sort)
• Concatenating the sorted lists from each bucket in order gives the sorted order for the elements
• It remains to show that the expected time spent in the second stage is only O(n)
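The two stages above can be sketched in a few lines of Python (illustrative, not the slides' code; bucket index taken from the top m bits of each key, insertion sort within buckets, and the test data sizes are arbitrary):

```python
import random

def bucket_sort(xs, k):
    """Sort n = 2**m integers drawn from [0, 2**k), k >= m.
    Bucket j holds the keys whose top m binary digits equal j."""
    n = len(xs)
    m = n.bit_length() - 1              # n = 2**m
    buckets = [[] for _ in range(n)]
    for x in xs:                        # first stage: O(n)
        buckets[x >> (k - m)].append(x)
    out = []
    for b in buckets:                   # second stage: quadratic per bucket
        for i in range(1, len(b)):      # insertion sort within the bucket
            x, j = b[i], i - 1
            while j >= 0 and b[j] > x:
                b[j + 1] = b[j]
                j -= 1
            b[j + 1] = x
        out.extend(b)
    return out

rng = random.Random(0)
data = [rng.randrange(2 ** 20) for _ in range(2 ** 10)]  # n = 2^10, k = 20
assert bucket_sort(data, 20) == sorted(data)
```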

• The result relies on our assumption regarding the input distribution
• Under the uniform distribution, Bucket sort falls naturally into the balls-and-bins model:
– the elements are balls, buckets are bins, and each ball falls uniformly at random into a bin
• Let X_j be the number of elements that land in the jth bucket
• The time to sort the jth bucket is then at most c·X_j² for some constant c


• The expected time spent sorting is at most

E[ Σ_{j=1}^{n} c·X_j² ] = c Σ_{j=1}^{n} E[X_j²] = c·n·E[X_1²]

• The second equality follows from symmetry: E[X_j²] is the same for all buckets
• Since X_1 ~ B(n, 1/n), using earlier results yields

E[X_1²] = n(n − 1)·(1/n²) + 1 = 2 − 1/n < 2

• Hence the total expected time spent in the second stage is at most 2cn, so Bucket sort runs in expected linear time

5.3. The Poisson Distribution

• We now consider the probability that a given bin is empty in the m balls and n bins model, as well as the expected number of empty bins
• For the first bin to be empty, it must be missed by all m balls
• Since each ball hits the first bin with probability 1/n, the probability the first bin remains empty is

(1 − 1/n)^m ≈ e^{−m/n}


• Symmetry: the probability is the same for all bins
• If X_j is a RV that is 1 when the jth bin is empty and 0 otherwise, then E[X_j] = (1 − 1/n)^m
• Let X represent the number of empty bins
• Then, by the linearity of expectations,

E[X] = Σ_{j=1}^{n} E[X_j] = n(1 − 1/n)^m ≈ n·e^{−m/n}

• Thus, the expected fraction of empty bins is approximately e^{−m/n}
• This approximation is very good even for moderately sized values of m and n
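How good the e^{−m/n} approximation is can be checked directly from the exact expectation (1 − 1/n)^m; a small sketch with the arbitrary choice m = n = 1000:

```python
import math

def empty_fraction_exact(m, n):
    """Exact expected fraction of empty bins: (1 - 1/n)**m."""
    return (1 - 1 / n) ** m

m = n = 1000
exact = empty_fraction_exact(m, n)
approx = math.exp(-m / n)   # e^{-m/n}, which is 1/e here
# Already at m = n = 1000 the two agree to about three decimal places.
print(round(exact, 4), round(approx, 4))
```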

• Generalize to find the expected fraction of bins with r balls for any constant r
• The probability that a given bin has r balls is

C(m, r) (1/n)^r (1 − 1/n)^{m−r} = (1/r!) · (m(m − 1) ··· (m − r + 1) / n^r) · (1 − 1/n)^{m−r}

• When m and n are large compared to r, the second factor on the RHS is approx. (m/n)^r, and the third factor is approx. e^{−m/n}


• Hence the probability p_r that a given bin has r balls is approximately

p_r ≈ e^{−m/n} (m/n)^r / r!

and the expected number of bins with exactly r balls is approximately n·p_r

Definition 5.1: A discrete Poisson random variable X with parameter μ is given by the following probability distribution on j = 0, 1, 2, …:

Pr(X = j) = e^{−μ} μ^j / j!

• The expectation of this random variable is μ:

E[X] = Σ_{j≥0} j·Pr(X = j) = Σ_{j≥1} j·e^{−μ} μ^j / j!
     = μ Σ_{j≥1} e^{−μ} μ^{j−1} / (j − 1)!
     = μ Σ_{i≥0} e^{−μ} μ^i / i!
     = μ

• Because the probabilities e^{−μ} μ^i / i! sum to 1


• In the context of throwing m balls into n bins, the distribution of the number of balls in a bin is approximately Poisson with μ = m/n, which is exactly the average number of balls per bin, as one might expect

Lemma 5.2: The sum of a finite number of independent Poisson random variables is a Poisson random variable.

Lemma 5.3: The MGF of a Poisson RV X with parameter μ is

M_X(t) = e^{μ(e^t − 1)}.

Proof: For any t,

E[e^{tX}] = Σ_{k≥0} e^{tk} · e^{−μ} μ^k / k! = e^{−μ} Σ_{k≥0} (μ e^t)^k / k! = e^{μ(e^t − 1)}. ∎


• Differentiating yields:

M′_X(t) = μ e^t · e^{μ(e^t − 1)}
M″_X(t) = μ e^t (μ e^t + 1) · e^{μ(e^t − 1)}

• Setting t = 0 gives

E[X] = M′_X(0) = μ
E[X²] = M″_X(0) = μ(μ + 1) = μ² + μ

Var[X] = E[X²] − (E[X])² = μ² + μ − μ² = μ

• Given two independent Poisson RVs X and Y with means μ₁ and μ₂, apply Theorem 4.3 to prove

M_{X+Y}(t) = M_X(t) · M_Y(t) = e^{(μ₁ + μ₂)(e^t − 1)}

• This is the MGF of a Poisson RV with mean μ₁ + μ₂
• By Theorem 4.2, the MGF uniquely defines the distribution, and hence the sum X + Y is a Poisson RV with mean μ₁ + μ₂
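The closure of the Poisson family under sums can also be verified numerically by convolving two Poisson pmfs and comparing to the Poisson pmf with the summed mean (a sketch; the means 2 and 3 and the truncation at 60 terms are arbitrary):

```python
import math

def poisson_pmf(mu, j):
    """Pr(X = j) for X ~ Poisson(mu): e^{-mu} mu^j / j!."""
    return math.exp(-mu) * mu ** j / math.factorial(j)

mu1, mu2, N = 2.0, 3.0, 60
# pmf of X + Y by convolving the pmfs of X ~ Poisson(2) and Y ~ Poisson(3)
conv = [sum(poisson_pmf(mu1, i) * poisson_pmf(mu2, k - i) for i in range(k + 1))
        for k in range(N)]
# pmf of Poisson(5) directly
direct = [poisson_pmf(mu1 + mu2, k) for k in range(N)]
# The two agree up to floating-point error.
print(max(abs(a - b) for a, b in zip(conv, direct)))
```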


Theorem 5.4: Let X be a Poisson RV with parameter μ.
1. If x > μ, then

Pr(X ≥ x) ≤ e^{−μ} (eμ/x)^x;

2. If x < μ, then

Pr(X ≤ x) ≤ e^{−μ} (eμ/x)^x.

Proof: For any t > 0 and x > μ,

Pr(X ≥ x) = Pr(e^{tX} ≥ e^{tx}) ≤ E[e^{tX}] / e^{tx}.

Plugging in the expression for the MGF of the Poisson distribution, we have

Pr(X ≥ x) ≤ e^{μ(e^t − 1) − tx}.

Choosing t = ln(x/μ) > 0 gives

Pr(X ≥ x) ≤ e^{x − μ − x ln(x/μ)} = e^{−μ} (eμ/x)^x

The proof of 2 is similar. ∎
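The tail bound of Theorem 5.4 can be sanity-checked against the exact Poisson tail; a sketch with the arbitrary illustrative values μ = 10 and x = 20 (the truncation at 150 terms keeps the factorials within float range and loses a negligible amount of mass):

```python
import math

def poisson_tail_ge(mu, x, terms=150):
    """Pr(X >= x) for X ~ Poisson(mu), summed up to a fixed truncation."""
    return sum(math.exp(-mu) * mu ** j / math.factorial(j)
               for j in range(x, terms))

def chernoff_upper(mu, x):
    """Theorem 5.4 bound for x > mu: e^{-mu} (e*mu/x)^x."""
    return math.exp(-mu) * (math.e * mu / x) ** x

mu, x = 10, 20
exact = poisson_tail_ge(mu, x)
bound = chernoff_upper(mu, x)
# The exact tail probability sits below the Chernoff-style bound.
print(exact, bound)
```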


5.3.1. Limit of the Binomial Distribution

• The Poisson distribution is the limit distribution of the binomial distribution with parameters n and p, when n is large and p is small

Theorem 5.5: Let X_n ~ B(n, p), where p is a function of n and lim_{n→∞} np = λ is a constant that is independent of n. Then, for any fixed k,

lim_{n→∞} Pr(X_n = k) = e^{−λ} λ^k / k!.

• This theorem directly applies to the balls-and-bins scenario
• Consider the situation where there are m balls and n bins, where m is a function of n and lim_{n→∞} m/n = λ
• Let X_n be the number of balls in a specific bin
• Then X_n ~ B(m, 1/n)
• Theorem 5.5 thus applies and says that

lim_{n→∞} Pr(X_n = k) = e^{−λ} λ^k / k!,

matching the earlier approximation e^{−m/n} (m/n)^k / k!
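The convergence in Theorem 5.5 is visible numerically: the exact B(n, λ/n) probabilities approach the Poisson(λ) pmf as n grows. A sketch with λ = 1 (i.e. m = n balls and bins; the choices of n and the range of k are arbitrary):

```python
import math

def binom_pmf(n, p, k):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(lam, k):
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 1.0
# Largest pointwise gap over k = 0..5 shrinks as n grows.
for n in (10, 100, 1000):
    gap = max(abs(binom_pmf(n, lam / n, k) - poisson_pmf(lam, k))
              for k in range(6))
    print(n, gap)
```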


• Consider the # of spelling or grammatical mistakes in a book
• Model such mistakes s.t. each word is likely to have an error with some very small probability
• The # of errors is a binomial RV with large n and small p and can be treated as a Poisson RV
• As another example, consider the # of chocolate chips inside a chocolate chip cookie
• Model by splitting the volume of the cookie into a large # of small disjoint compartments, so that a chip lands in each with some probability
• Now the # of chips in a cookie roughly follows a Poisson distribution

5.5. Application: Hashing

• Consider a password checker, which prevents people from using easily cracked passwords by keeping a dictionary of unacceptable ones
• The application would check if the requested password is unacceptable
• A checker could store the unacceptable passwords alphabetically and do a binary search on the dictionary to check a proposed password
• A binary search would require Θ(log m) time for m words


5.5.1. Chain Hashing

• Another possibility is to place the words into bins and search the appropriate bin for the word
• Words in a bin are represented by a linked list
• The placement of words into bins is accomplished by using a hash function
• A hash function f from a universe U into a range [0, n − 1] can be thought of as a way of placing items from the universe into n bins

• Here the universe U consists of possible password strings
• The collection of bins is called a hash table
• This approach to hashing is called chain hashing
• Using a hash table turns the dictionary problem into a balls-and-bins problem
• If our dictionary of unacceptable passwords consists of m words and the range of the hash function is [0, n − 1], then we can model the distribution of words in bins with the same distribution as m balls placed randomly in n bins
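A minimal chain-hashing dictionary along these lines (an illustrative sketch, not the slides' code: Python's built-in hash stands in for the idealized random hash function, plain lists stand in for the linked lists, and the sample words and n = 8 bins are arbitrary):

```python
class ChainHashSet:
    """Hash table with chaining: n bins, each holding a chain of words."""

    def __init__(self, n):
        self.n = n
        self.bins = [[] for _ in range(n)]

    def _bin(self, word):
        # Stand-in for a hash function assumed to behave like a random map
        return hash(word) % self.n

    def add(self, word):
        chain = self.bins[self._bin(word)]
        if word not in chain:
            chain.append(word)

    def __contains__(self, word):
        # Search cost is proportional to the length of one chain
        return word in self.bins[self._bin(word)]

banned = ChainHashSet(n=8)
for w in ["password", "123456", "qwerty"]:
    banned.add(w)
print("qwerty" in banned, "hunter2" in banned)  # True False
```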


• It is a strong assumption to presume that a hash function maps words into bins in a fashion that appears random, so that the location of each word is independent and identically distributed (i.i.d.)
• We assume that
– for each x, the probability that f(x) = j is 1/n (for 0 ≤ j ≤ n − 1) and
– the values of f(x) for each x are independent of each other
• This does not mean that every evaluation of f(x) yields a different random answer
• The value of f(x) is fixed for all time; it is just equally likely to take on any value in the range

• Consider the search time when there are n bins and m words
• To search for an item, we first hash it to find the bin that it lies in and then search sequentially through the linked list for it
• If we search for a word that is not in our dictionary, the expected number of words in the bin the word hashes to is m/n
• If we search for a word that is in our dictionary, the expected number of other words in that word's bin is (m − 1)/n, so the expected number of words in the bin is 1 + (m − 1)/n


• If we choose n = m bins for our hash table, then the expected number of words we must search through in a bin is constant
• If the hashing takes constant time, then the total expected time for the search is constant
• The maximum time to search for a word, however, is proportional to the maximum number of words in a bin
• We have shown that when n = m this maximum load is Θ(ln n / ln ln n) with probability close to 1, and hence w.h.p. this is the maximum search time in such a hash table

• While this is still faster than the Θ(log m) time required for standard binary search, it is much slower than the average, which can be a drawback for many applications
• Another drawback of chain hashing can be wasted space
• If we use n bins for n items, several of the bins will be empty, potentially leading to wasted space
• The space wasted can be traded off against the search time by making the average number of words per bin larger than 1


5.5.2. Hashing: Bit Strings

• Now save space instead of time
• Consider, again, the problem of keeping a dictionary of unsuitable passwords
• Assume that a password is restricted to be eight ASCII characters, which requires 64 bits (8 bytes) to represent
• Suppose we use a hash function to map each word into a 32-bit string
• This string is a short fingerprint for the word

• We keep the fingerprints in a sorted list
• To check if a proposed password is unacceptable, we calculate its fingerprint and look for it on the list, say by a binary search
• If the fingerprint is on the list, we declare the password unacceptable
• In this case, our password checker may not give the correct answer!
• It is possible that an acceptable password is rejected because its fingerprint matches the fingerprint of an unacceptable password


• Hence there is some chance that hashing will yield a false positive: it may falsely declare a match when there is not an actual match
• The fingerprints do not uniquely identify the associated word
• This is the only type of mistake this algorithm can make
• Allowing false positives means our algorithm is overly conservative, which is probably acceptable
• Letting easily cracked passwords through, however, would probably not be acceptable

• Place in a more general context: describe as an approximate set membership problem
• Suppose we have a set S = {s₁, s₂, …, s_m} of m elements from a large universe U
• We want to be able to quickly answer queries of the form "Is x an element of S?"
• We would also like the representation to take as little space as possible
• To save space, we are willing to allow occasional mistakes in the form of false positives
• Here the unallowable passwords correspond to our set S


• How large should the range of the hash function used to create the fingerprints be?
• How many bits should be in a fingerprint?
• Obviously, we want to choose the number of bits that gives an acceptable probability for a false positive match
• The probability that an acceptable password has a fingerprint that is different from any specific unallowable password in S is 1 − 1/2^b
• If the set S has size m, then the probability of a false positive for an acceptable password is

1 − (1 − 1/2^b)^m ≥ 1 − e^{−m/2^b}

• If we want this probability of a false positive to be less than a constant c, we need

e^{−m/2^b} ≥ 1 − c,

which implies that

b ≥ log₂( m / ln(1/(1 − c)) )

• I.e., we need b = Ω(log₂ m) bits
• If we, however, use b = 2 log₂ m bits, then the probability of a false positive falls to

1 − (1 − 1/m²)^m < 1/m

• If we have 2^16 = 65,536 words, then using 32 bits yields a FP Pr of just less than 1/65,536
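The arithmetic for b = 2 log₂ m fingerprint bits can be checked directly; this sketch uses log1p/expm1 for numerical stability at these tiny probabilities:

```python
import math

def false_positive_prob(m, b):
    """Pr that an acceptable word matches some of m random b-bit
    fingerprints: 1 - (1 - 2**-b)**m, computed stably."""
    return -math.expm1(m * math.log1p(-2.0 ** -b))

m = 2 ** 16                   # 65,536 dictionary words
b = 2 * int(math.log2(m))     # 32 fingerprint bits
fp = false_positive_prob(m, b)
# Just under 1/m = 1/65,536, as claimed on the slide.
print(fp, 1 / m)
```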


5.6. Random Graphs

• There are many NP-hard computational problems defined on graphs: Hamiltonian cycle, independent set, vertex cover, …
• Are these problems hard for most inputs or just for a relatively small fraction of all graphs?
• Random graph models provide a probabilistic setting for studying such questions
• Most of the work on random graphs has focused on two closely related models, G_{n,p} and G_{n,N}

5.6.1. Random Graph Models

• In G_{n,p}, we consider all undirected graphs on n distinct vertices v₁, v₂, …, v_n
• A graph G with a given set of k edges has probability p^k (1 − p)^{C(n,2) − k}
• One way to generate a random graph in G_{n,p} is to consider each of the C(n, 2) possible edges in some order and then independently add each edge to the graph with probability p


• The expected number of edges in the graph is therefore p·C(n, 2), and each vertex has expected degree p(n − 1)
• In the G_{n,N} model, we consider all undirected graphs on n vertices with exactly N edges
• There are C(C(n, 2), N) possible graphs, each selected with equal probability
• One way to generate a graph uniformly from the graphs in G_{n,N} is to start with a graph with no edges

• Choose one of the C(n, 2) possible edges uniformly at random and add it to the edges in the graph
• Now choose one of the remaining C(n, 2) − 1 possible edges I+U@R and add it to the graph
• Continue similarly until there are N edges
• The G_{n,p} and G_{n,N} models are related:
– When N = C(n, 2)·p, the number of edges in a random graph in G_{n,p} is concentrated around N, and conditioned on a graph from G_{n,p} having N edges, that graph is uniform over all the graphs from G_{n,N}
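Both generation procedures are short to write down. An illustrative sketch (the seed, n = 200, and p = 0.1 are arbitrary choices; shuffling the edge list and taking a prefix is equivalent to repeatedly drawing a fresh edge I+U@R):

```python
import itertools
import random

def gnp(n, p, rng):
    """G_{n,p}: include each of the C(n,2) possible edges
    independently with probability p."""
    return [e for e in itertools.combinations(range(n), 2)
            if rng.random() < p]

def gnN(n, N, rng):
    """G_{n,N}: add N distinct edges, each chosen I+U@R among those left."""
    edges = list(itertools.combinations(range(n), 2))
    rng.shuffle(edges)
    return edges[:N]

rng = random.Random(7)
n, p = 200, 0.1
g1 = gnp(n, p, rng)
expected = p * n * (n - 1) / 2   # p * C(n,2) = 1990 expected edges
print(len(g1), expected)

g2 = gnN(n, N=1990, rng=rng)
assert len(g2) == 1990
```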


• There are many similarities between random graphs and the balls-and-bins models
• Throwing edges into the graph as in the G_{n,N} model is like throwing balls into bins
• However, since each edge has two endpoints, each edge is like throwing two balls at once into two different bins
• The pairing adds a rich structure that does not exist in the balls-and-bins model
• Yet we can often utilize the relation between the two models to simplify analysis in random graph models