birthday paradox – a simulation of shared birthday experiments


Upload: datalore

Post on 10-Apr-2015


DESCRIPTION

Overall Comment: "While the paper is well written, it tried to fit too much trivial detail in at the expense of material which was demoted to the appendices. The paper should be able to stand on its own without the appendices; the appendices simply add to/support the paper. Code and extracted Javadoc very impressive."

TRANSCRIPT

Page 1: BIRTHDAY PARADOX – A SIMULATION OF SHARED BIRTHDAY EXPERIMENTS

BIRTHDAY PARADOX –

A SIMULATION OF SHARED BIRTHDAY EXPERIMENTS

Stephen Hogan (50217631)

Postgraduate Diploma in IT (Evening)

Dublin City University School of Computing

Abstract

The birthday paradox¹ is one of the best-known problems in probability. It asks: “How many randomly chosen people are needed to achieve at least a 50% probability that some pair will both have been born on the same day?” Since the chance of any two particular persons having the same birthday is remote, many of us would expect this number to be rather large. It turns out that this is not the case, hence the paradox.

A simulation was conducted in which, over multiple trials, the birthdays of the people in a group of a given size are compared. The probability is estimated as the number of successful trials (those containing a shared birthday) as a proportion of the total number of trials performed.

As the theoretical value is already known (23), it served as the target result for the algorithm implemented in Java, and the simulation reproduced it.

1. INTRODUCTION

Premise, dating from 1938²: in a room of just 23 people, there is a probability of roughly 50% (about 50.7%) that two of them share a birthday (ignoring years of birth). In a room of 75 people, the chance of two people having matching birthdays is over 99.9%. This is one particular case of exponential (Saliusian) sets, where duplicates are allowed. Exponents are not intuitive, which is why linear thinking leads us to an incorrect estimate.

Theoretical mathematical models and proofs have been derived in previous works and are outside the scope of this paper. The objective of this paper is not to put those formulae into action; rather, it is to demonstrate the result by example: a simulation generates birthdates for groups of people and analyses them for the probability stated in the premise above.

The following two sections, Background and Method, share these assumptions:

- There are only 365 days in a year, i.e. leap years are ignored. This also means ignoring the rule that suppresses the leap day in years divisible by 100 but not by 400.
- Birth years are ignored.
- Birthdays are equally distributed throughout the year, i.e. influences such as seasonality are not factored in. In real life, birthday distributions are not uniform: not all dates are equally likely.
- The date of one person's birthday does not affect the date of another person's birthday, i.e. twins, triplets, etc. are not considered.

¹ This is not a paradox in the literal sense; it just highlights the fact that people expect the value to be much larger.

² American Mathematical Monthly, 1938: Zoe Emily Schnabel, "The estimation of the total fish population of a lake", under the name of capture-recapture statistics.

2. BACKGROUND

Mathematical Model

A basic rule of probability is that the probability that an event will happen and the probability that the event will not happen always sum to 1. In other words, the chance that something either happens or does not happen is always 100%.

If we can work out the probability that no two people will have the same birthday, we can use this rule to find the probability that at least two people will share a birthday:

P(event happens) + P(event does not happen) = 1

→ P(two people share birthdays) + P(no two people share birthdays) = 1

→ P(two people share birthdays) = 1 - P(no two people share birthdays)

The formula for the probability that n people all have different birthdays (month and day) is³:

\[ P(\text{no two people share birthdays}) = \frac{365!}{(365-n)! \cdot 365^{n}} \qquad (1) \]

Therefore, the probability that at least two of them share the same birthday

is:

\[ P(\text{two people share birthdays}) = 1 - \frac{365!}{(365-n)! \cdot 365^{n}} \qquad (2) \]

With (2) graphed in Figure 1, it can be seen that the curve first reaches a probability of 50% at a group size of 23 people:

Figure 1: A graph showing the approximate probability of at least two people sharing a birthday amongst a certain number of people⁴.
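As a numerical cross-check on equation (2), the following minimal Java sketch evaluates the same probability as a running product rather than through factorials, since 365! overflows ordinary numeric types. The class and method names (BirthdayTheory, sharedProbability, smallestGroup) are illustrative assumptions and are not taken from the paper's source code.

```java
/** Sketch only: evaluates equation (2) numerically for a 365-day year. */
final class BirthdayTheory {
    static final int DAYS = 365;

    /** Probability that at least two of n people share a birthday. */
    static double sharedProbability(int n) {
        if (n > DAYS) {
            return 1.0; // more people than days: a match is certain
        }
        double allDifferent = 1.0;
        for (int i = 0; i < n; i++) {
            // The (i+1)-th person must avoid the i birthdays already taken.
            allDifferent *= (double) (DAYS - i) / DAYS;
        }
        return 1.0 - allDifferent;
    }

    /** Smallest group size whose probability reaches the threshold (threshold <= 1). */
    static int smallestGroup(double threshold) {
        int n = 1;
        while (sharedProbability(n) < threshold) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        int n = smallestGroup(0.5);
        System.out.println("Smallest group size: " + n);
        System.out.println("Probability at that size: " + sharedProbability(n));
    }
}
```

Under the stated assumptions, smallestGroup(0.5) evaluates to 23, with a probability of roughly 0.507 at that group size.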

³ Ask Dr. Math FAQ: The Birthday Problem (http://mathforum.org/dr.math/faq/faq.birthdayprob.html)

⁴ Wikipedia: Birthday Problem (http://en.wikipedia.org/wiki/Birthday_paradox)

Comment [JM1]: Keep it formal and to the point.

Comment [JM2]: Use a reference rather than a footnote.

Comment [JM3]: A reference would be welcome here.

Comment [JM4]: This could be compressed by simply giving a single equation and pointing the reader to a reference where more detail is given.

Comment [JM5]: Mixed font size.


3. METHOD

Programming Methodology

The simulation was written in the Java language⁵. A text file that lists multiple simulation runs, each with the following parameters, serves as input to the program (the defaults outlined here are geared to solving the problem in question):

- Number of matches to be checked (default: K = 2)
- Number of trials (various large values)
- Starting group size (default: N = 2)
- Group size increment (default: 1)
- Terminating probability (default: 0.5)

Assumptions:

- The group size will never be greater than 1,000.
- If the starting value of N is less than K, the simulation starts with N = K. The reason is simple: it is impossible to find three (i.e. K) matching birthdays in a group of two (i.e. N) persons.
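One way to read a record of the input file is sketched below. It assumes that each record is a single line of five whitespace-separated fields in the order listed above; the class name TrialConfig and its field names are hypothetical, not taken from the paper's source code. The sketch also applies both assumptions: it rejects group sizes above 1,000 and raises a starting N that is below K.

```java
/** Hypothetical holder for one input record; the field order is an assumption. */
final class TrialConfig {
    static final int MAX_GROUP_SIZE = 1000; // "never greater than 1,000"

    final int matches;                  // K
    final int trials;
    final int startGroupSize;           // N, raised to K when N < K
    final int increment;
    final double terminatingProbability;

    private TrialConfig(int matches, int trials, int startGroupSize,
                        int increment, double terminatingProbability) {
        this.matches = matches;
        this.trials = trials;
        this.startGroupSize = startGroupSize;
        this.increment = increment;
        this.terminatingProbability = terminatingProbability;
    }

    /** Parses a line such as "2 10000 2 1 0.5". */
    static TrialConfig parse(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length != 5) {
            throw new IllegalArgumentException("Expected 5 fields but got: " + line);
        }
        int k = Integer.parseInt(fields[0]);
        int trials = Integer.parseInt(fields[1]);
        int n = Integer.parseInt(fields[2]);
        int increment = Integer.parseInt(fields[3]);
        double p = Double.parseDouble(fields[4]);
        if (n > MAX_GROUP_SIZE || k > MAX_GROUP_SIZE) {
            throw new IllegalArgumentException("Group size exceeds " + MAX_GROUP_SIZE);
        }
        // K matches cannot be found among fewer than K people, so start at N = K.
        return new TrialConfig(k, trials, Math.max(n, k), increment, p);
    }

    public static void main(String[] args) {
        TrialConfig config = parse("3 10000 2 1 0.5");
        System.out.println("Starting group size: " + config.startGroupSize);
    }
}
```

Parsing into an immutable object keeps the N = K adjustment in one place, so the simulation loop never sees an unusable starting size.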

Beginning with a group size of N = 2 people, we initialise an array with the random birthdays of those N people (the random number generator is seeded with the current time). We compare every pairwise combination of people in the group (number of matches K = 2) and check whether any two persons have the same birthday.

This is repeated number of trials times, each time with a different random group of N people. If the proportion of trials in which two persons have the same birthday exceeds the terminating probability (0.5), the simulation terminates. Otherwise, N is incremented by the group size increment (1), and the full set of trials is repeated with randomised groups of (starting group size = 2) + (group size increment = 1) = 3 people, and so on.
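The procedure just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's BirthdaySimulation.java: the names are hypothetical, the generator is passed in so that a fixed seed can give reproducible runs (the paper seeds with the current time), and a cap of 1,000 on the group size guards the loop.

```java
import java.util.Random;

/** Sketch of the pairwise (K = 2) simulation loop; all names are hypothetical. */
final class PairwiseSimulation {
    static final int DAYS = 365;
    static final int MAX_GROUP_SIZE = 1000;

    /** Compares every pair of people and reports whether any two birthdays match. */
    static boolean hasSharedBirthday(int[] birthdays) {
        for (int i = 0; i < birthdays.length; i++) {
            for (int j = i + 1; j < birthdays.length; j++) {
                if (birthdays[i] == birthdays[j]) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Proportion of trials in which a group of the given size contains a match. */
    static double estimate(int groupSize, int trials, Random rng) {
        int successes = 0;
        int[] birthdays = new int[groupSize];
        for (int t = 0; t < trials; t++) {
            for (int i = 0; i < groupSize; i++) {
                birthdays[i] = 1 + rng.nextInt(DAYS); // day of year, 1..365
            }
            if (hasSharedBirthday(birthdays)) {
                successes++;
            }
        }
        return (double) successes / trials;
    }

    /**
     * Grows the group until the estimate exceeds the target probability.
     * Returns a value above MAX_GROUP_SIZE if the target is never exceeded.
     */
    static int firstGroupSizeExceeding(double target, int start, int increment,
                                       int trials, Random rng) {
        if (increment < 1) {
            throw new IllegalArgumentException("increment must be at least 1");
        }
        int n = start;
        while (n <= MAX_GROUP_SIZE && estimate(n, trials, rng) <= target) {
            n += increment;
        }
        return n;
    }

    public static void main(String[] args) {
        Random rng = new Random(System.currentTimeMillis());
        System.out.println(firstGroupSizeExceeding(0.5, 2, 1, 20_000, rng));
    }
}
```

With a small number of trials the estimate near N = 22 and N = 23 is noisy, so the reported group size can occasionally differ from 23; larger trial counts make the result stable.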

Reading the controlling values from a file makes the algorithm flexible: we can evaluate the number of people required for various terminating probabilities, check for birthdays shared by more than two persons, and vary the number of trials. A review of the Java source code in Appendix 2 should reveal other variations.

For each trial, random birthdays are generated and placed into an array (here we use the Julian Date format, i.e. the day of the year, 1…365). The birthdays are sorted and then iterated through to find equal values in consecutive elements of the array, which denotes a success. Once the trials have completed, the probability is evaluated as⁶:

currProbability = numSameBirthday / numTrials
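A single trial of the sort-and-scan check, generalised to K matching birthdays, might look like the sketch below. It substitutes the standard library's Arrays.sort for the paper's own QuickSort.java, and the names SortedTrial, hasMatch and estimate are assumptions. Note the cast to double in the final division: with two int operands, Java's / operator would truncate the proportion to 0 or 1.

```java
import java.util.Arrays;
import java.util.Random;

/** Sketch of the sort-based check; Arrays.sort stands in for the paper's QuickSort. */
final class SortedTrial {
    static final int DAYS = 365;

    /** True if at least k people in the group share one birthday (k >= 2). */
    static boolean hasMatch(int[] birthdays, int k) {
        if (k < 2) {
            throw new IllegalArgumentException("k must be at least 2");
        }
        int[] sorted = birthdays.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int run = 1; // length of the current run of equal consecutive days
        for (int i = 1; i < sorted.length; i++) {
            run = (sorted[i] == sorted[i - 1]) ? run + 1 : 1;
            if (run >= k) {
                return true;
            }
        }
        return false;
    }

    /** currProbability = numSameBirthday / numTrials, computed in floating point. */
    static double estimate(int groupSize, int k, int numTrials, Random rng) {
        int numSameBirthday = 0;
        int[] birthdays = new int[groupSize];
        for (int t = 0; t < numTrials; t++) {
            for (int i = 0; i < groupSize; i++) {
                birthdays[i] = 1 + rng.nextInt(DAYS); // day of year, 1..365
            }
            if (hasMatch(birthdays, k)) {
                numSameBirthday++;
            }
        }
        return (double) numSameBirthday / numTrials;
    }

    public static void main(String[] args) {
        System.out.println(estimate(23, 2, 100_000, new Random()));
    }
}
```

Sorting turns the search for K equal birthdays into a single pass over neighbouring elements, which is what allows the same code to serve K = 2 and larger K alike.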

4. RESULTS & DISCUSSION

Because the program reads an input file, multiple scenarios can be generated. Table 1 shows sample data from the input file (in the format given in Section 3):

Matches (K)   Trials    Starting group size (N)   Increment   Terminating probability
2             10000     2                         1           0.5
2             20000     2                         1           0.5
2             30000     2                         1           0.5
2             50000     2                         1           0.5
2             700000    2                         1           0.5
2             1000000   2                         1           0.5
2             2000000   2                         1           0.5
2             5000000   2                         1           0.5

Table 1: Input File sample data.

⁵ Adapted from [2] http://www.comp.nus.edu.sg/~cs1101cl/labs_sem2_0405/lab3/oddweek/paradox.c

⁶ See Appendix 1 – Pseudocode.

The last record in this table (the eighth, labelled Simulation 8) is graphed in Figure 2. Graphs prepared for every record in the table (Appendix 4) reveal, interestingly, that time performance dips in proportion to the probability curve as the group size tends to 23 people, irrespective of the number of trials:

[Chart "Simulation 8": series Probability (axis 0.00000 to 0.60000) and Time (ms) (axis 0 to 25000), plotted against No. People (1 to 22)]

Figure 2: Simulation 8

5. CONCLUSIONS & RECOMMENDATIONS

This paper reports on a simulated empirical study of the Birthday Paradox. The findings suggest a strong similarity between the theoretical and the simulated probabilities. Specifically, for every number of trials tested, at least 23 people (with previously unknown birthdays) are needed in a room to achieve at least a 50% probability of a shared birthday.

The input file also allows matches among more than two people (K > 2), a group size increment greater than 1, and other terminating probabilities.

A further recommendation would be to refine the sorting algorithm, or to adopt a faster sorting mechanism, especially for large N and K. Parallelisation of Quicksort, for example, would be ideal, as the partitions can be sorted independently without synchronisation. Java threading would be one approach to this.

Another approach to solving this problem would be to base it on collisions: as each person enters the room, check whether their birthday matches that of any person already present. Only an array of 365 elements would be needed, and a random date would be generated for each person only until either the entire group is exhausted or a match is found. The author decided against this approach after initially selecting it, as it would not be possible to measure time performance as fluidly.
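The collision-based alternative can be sketched in a few lines. This is an illustration of the idea only, with hypothetical names (CollisionTrial, trialHasCollision, estimate); it is not code from the paper, which did not adopt this approach.

```java
import java.util.Random;

/** Sketch of the collision approach: one flag per day of the year, no sorting. */
final class CollisionTrial {
    static final int DAYS = 365;

    /** People "enter the room" one at a time; stop at the first repeated birthday. */
    static boolean trialHasCollision(int groupSize, Random rng) {
        boolean[] taken = new boolean[DAYS];
        for (int person = 0; person < groupSize; person++) {
            int day = rng.nextInt(DAYS); // index 0..364 stands for day 1..365
            if (taken[day]) {
                return true; // someone already in the room has this birthday
            }
            taken[day] = true;
        }
        return false; // group exhausted without a match
    }

    /** Proportion of trials that contain at least one shared birthday. */
    static double estimate(int groupSize, int trials, Random rng) {
        int successes = 0;
        for (int t = 0; t < trials; t++) {
            if (trialHasCollision(groupSize, rng)) {
                successes++;
            }
        }
        return (double) successes / trials;
    }

    public static void main(String[] args) {
        System.out.println(estimate(23, 100_000, new Random()));
    }
}
```

Each trial costs at most one array lookup per person and ends early on a match, so no sort is needed; the trade-off noted above is that this variant handles only the pairwise case as written.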

6. REFERENCES

[1] Birthday Paradox. Wikipedia.com (http://en.wikipedia.org/wiki/Birthday_paradox)

[2] CS1101C Lab 3 – Birthday Paradox. National University of Singapore, School of Computing (http://www.comp.nus.edu.sg/~cs1101cl/labs_sem2_0405/lab3/oddweek/)

[3] How to Generate Random numbers. About.com (http://java.about.com/od/javautil/a/randomnumbers.htm)

[4] Quick Sort Implementation with median-of-three partitioning and cutoff for small arrays. Java-Tips.org (http://www.java-tips.org/java-se-tips/java.lang/quick-sort-implementation-with-median-of-three-partitioning-and-cutoff-for-small-a.html)

Comment [JM6]: Why not use algebraic notation as in (1) and (2)? Also: this equation is not numbered.

Comment [JM7]: Explain the entries in the table.

Comment [JM8]: Why capitalise?

Comment [JM9]: You need to explain what Simulation 8 is.

Comment [JM10]: We


APPENDIX 1 – PSEUDOCODE

SET variables
    numTrials = any large number
    currGroupSize = 2
    groupIncrement = 1
    currProbability = 0.0
    terminatingProbability = 0.5

DO
{
    SET variable numSameBirthday = 0    (reset for every group size)

    DO FOR EACH trial FROM 1 TO numTrials
    {
        SET RANDOM number (1 TO 365) FOR EACH ELEMENT FROM birthday[1] TO birthday[currGroupSize]

        Sort birthday[] into ascending order

        Check for match between consecutive elements: IF TRUE THEN numSameBirthday = numSameBirthday + 1
    } END FOR

    currProbability = numSameBirthday / numTrials

    IF currProbability <= terminatingProbability THEN currGroupSize = currGroupSize + groupIncrement

} WHILE currProbability <= terminatingProbability

Comment [JM11]: Complexity of this could have been discussed in the paper.


APPENDIX 2 – SOURCE CODE

BirthdaySimulation.java


Birthday.java


QuickSort.java


APPENDIX 3 – JAVADOC

BirthdaySimulation.java


Birthday.java


QuickSort.java


APPENDIX 4 – CHARTING RESULTS

[Charts "Simulation 1" to "Simulation 8". Each chart plots two series, Probability (axis 0.00000 to 0.60000) and Time (ms), against No. People (1 to 22). The upper limit of the Time (ms) axis in each chart is:]

Simulation 1: 35 ms
Simulation 2: 80 ms
Simulation 3: 120 ms
Simulation 4: 180 ms
Simulation 5: 3000 ms
Simulation 6: 6000 ms
Simulation 7: 9000 ms
Simulation 8: 25000 ms