Copyright (c) Bani K. Mallick 1

STAT 651

Lecture #15


Topics in Lecture #15

Some basic probability

The binomial distribution

Inference about a single population proportion


Book Sections Covered in Lecture #15

Chapters 4.7-4.8

Chapter 10.2


Lecture 14 Review: Nonparametric Methods

Replace each observation by its rank in the pooled data

Do the usual ANOVA F-test

Kruskal-Wallis


Lecture 14 Review: Nonparametric Methods

Once you have decided that the populations are different in their means, there is no version of an LSD

You simply have to do each comparison in turn

This is a bit of a pain in SPSS, because you physically must do each 2-population comparison, defining the groups as you go


Categorical Data

Not all experiments are based on numerical outcomes

We will deal with categorical outcomes, i.e., outcomes where each individual falls into a category

The simplest categorical variable is binary:

Success or failure

Male or female


Categorical Data

For example, consider flipping a fair coin, and let

X = 0 means “tails”

X = 1 means “heads”


Categorical Data

The fraction of the population who are “successes” will be denoted by the Greek symbol $\pi$

Note that because it is a Greek symbol, it represents something to do with a population

For coin flipping, if you flipped all the fair coins in the world (the population), the fraction of the times they turn up heads equals $\pi$ = 1/2


Categorical Data

The fraction of the population who are “successes” will be denoted by the Greek symbol $\pi$

The fraction of the sample of size n who are “successes” is going to be denoted by $\hat{\pi}$

We want to relate $\hat{\pi}$ to $\pi$

Let X = number of successes in the sample. The fraction $\hat{\pi}$ = (# successes)/n = X / n


Categorical Data

Suppose you flip a coin 10 times, and get 6 heads.

The proportion of heads = 0.60

The percentage of heads = 60%


Categorical Data

The number of successes X in n experiments, each with probability of success $\pi$, is called a binomial random variable

There is a formula for this:

$$\Pr(X = k) = \Pr(\hat{\pi} = k/n) = \frac{n!}{k!\,(n-k)!}\,\pi^{k}(1-\pi)^{n-k}$$

0! = 1, 1! = 1, 2! = 2 x 1 = 2, 3! = 3 x 2 x 1 = 6, 4! = 4 x 3 x 2 x 1 = 24, etc.


Categorical Data

0! = 1, 1! = 1, 2! = 2 x 1 = 2, 3! = 3 x 2 x 1 = 6, 4! = 4 x 3 x 2 x 1 = 24, etc.

The idea is to relate the sample fraction to the population fraction using this formula

Key Point: if we knew $\pi$, then we could entirely characterize the fraction of experiments that have k successes

$$\Pr(X = k) = \Pr(\hat{\pi} = k/n) = \frac{n!}{k!\,(n-k)!}\,\pi^{k}(1-\pi)^{n-k}$$


Categorical Data

The probability that the coin lands on heads will be denoted by the Greek symbol $\pi$

Suppose you flip a coin 2 times, and count the number of heads.

So here, X = number of heads that arise when you flip a coin 2 times

X takes on the values 0, 1 and 2

$\hat{\pi}$ takes on the values 0/2, 1/2, 2/2


Categorical Data: What the binomial formula does

The experiment results in 4 equally likely outcomes: each occurs ¼ of the time

                    Tails on toss #1    Heads on toss #1
Tails on toss #2           ¼                   ¼
Heads on toss #2           ¼                   ¼


Categorical Data

Heads = “success”:

                    Tails on toss #1    Heads on toss #1
Tails on toss #2           ¼                   ¼
Heads on toss #2           ¼                   ¼

$\Pr(X = 0) = \Pr(\hat{\pi} = 0/2) = 1/4$

$\Pr(X = 1) = \Pr(\hat{\pi} = 1/2) = 1/2$

$\Pr(X = 2) = \Pr(\hat{\pi} = 2/2) = 1/4$

The binomial formula can be used to give these results without thinking


Categorical Data

0! = 1, 1! = 1, 2! = 2 x 1 = 2, 3! = 3 x 2 x 1 = 6, 4! = 4 x 3 x 2 x 1 = 24, etc.

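n = 2, k = 1, k! = 1, n! = 2, (n-k)! = 1

$\pi^{k} = .5$ and $(1-\pi)^{n-k} = .5$

$$\Pr(X = k) = \Pr(\hat{\pi} = k/n) = \frac{n!}{k!\,(n-k)!}\,\pi^{k}(1-\pi)^{n-k} = \frac{2}{1 \times 1} \times .5 \times .5 = \frac{1}{2}$$

The binomial formula gives the answer ½, which we know to be correct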
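As a quick check of the coin example, here is a minimal Python sketch. The helper name `binomial_pmf` is ours, not something from the lecture or from SPSS. It evaluates the binomial formula for 2 flips of a fair coin and compares the answers with a brute-force count over the 4 equally likely outcomes.

```python
from fractions import Fraction
from itertools import product
from math import comb


def binomial_pmf(k, n, pi):
    """Pr(X = k) for n trials with success probability pi (illustrative helper)."""
    # comb(n, k) is n! / (k! (n-k)!)
    return comb(n, k) * pi**k * (1 - pi) ** (n - k)


n, pi = 2, Fraction(1, 2)

# Binomial formula for k = 0, 1, 2 heads
formula = [binomial_pmf(k, n, pi) for k in range(n + 1)]

# Brute force: list all 2**n equally likely outcomes (0 = tails, 1 = heads)
outcomes = list(product([0, 1], repeat=n))
counted = [
    Fraction(sum(1 for o in outcomes if sum(o) == k), len(outcomes))
    for k in range(n + 1)
]

print(formula)
print(counted)
```

Both lists hold the same three fractions (1/4, 1/2, 1/4), matching the table of tosses.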


Categorical Data

Roll a fair die

First die:    1    2    3    4    5    6

Every outcome is equally likely, so what are the probabilities?


Categorical Data

Roll a fair die

First die:      1     2     3     4     5     6
Probability:   1/6   1/6   1/6   1/6   1/6   1/6

What is the chance of rolling a 1 or a 2? 1/6 + 1/6 = 2/6 = 1/3


Categorical Data

Now roll two fair dice

(6 x 6 grid of outcomes: first die 1 to 6 across, second die 1 to 6 down)



Categorical Data

Roll two fair dice

                          First die
                1      2      3      4      5      6
Second     1   1/36   1/36   1/36   1/36   1/36   1/36
die        2   1/36   1/36   1/36   1/36   1/36   1/36
           3   1/36   1/36   1/36   1/36   1/36   1/36
           4   1/36   1/36   1/36   1/36   1/36   1/36
           5   1/36   1/36   1/36   1/36   1/36   1/36
           6   1/36   1/36   1/36   1/36   1/36   1/36

Define a success as rolling a 1 or a 2. What is the chance of two successes? 4/36 = 1/9 (the 4 cells where both dice show a 1 or a 2)


Categorical Data

Roll two fair dice

                          First die
                1      2      3      4      5      6
Second     1   1/36   1/36   1/36   1/36   1/36   1/36
die        2   1/36   1/36   1/36   1/36   1/36   1/36
           3   1/36   1/36   1/36   1/36   1/36   1/36
           4   1/36   1/36   1/36   1/36   1/36   1/36
           5   1/36   1/36   1/36   1/36   1/36   1/36
           6   1/36   1/36   1/36   1/36   1/36   1/36

Define a success as rolling a 1 or a 2. What is the chance of two failures? 16/36 = 4/9 (the 16 cells where both dice show a 3, 4, 5 or 6)


Categorical Data

So, a success occurs when you roll a 1 or a 2

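Pr(success on a single die) = 2/6 = 1/3 = $\pi$

Pr(2 successes) = 1/3 x 1/3 = 1/9

Use the binomial formula: Pr(X = k) when n = 2 and k = 2

k! = 2, n! = 2, (n-k)! = 0! = 1,

$\pi^{k} = 1/9$ and $(1-\pi)^{n-k} = 1$

$$\Pr(X = k) = \Pr(\hat{\pi} = k/n) = \frac{n!}{k!\,(n-k)!}\,\pi^{k}(1-\pi)^{n-k} = \frac{2}{2 \times 1} \times \frac{1}{9} \times 1 = \frac{1}{9}$$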


Categorical Data

In other words, the binomial formula works in these simple cases, where we can draw nice tables

Now think of rolling 4 dice, and ask for the chance that exactly 3 of the 4 dice show a 1 or a 2

Too big a table: need a formula
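To see the formula take over from the table, here is a small Python sketch. The function names `prob_by_formula` and `prob_by_table` are ours, chosen for illustration. It evaluates the binomial formula with $\pi$ = 1/3 for the two-dice questions and for the 4-dice question, then checks the 4-dice answer by letting the computer build the 6 x 6 x 6 x 6 table that is too big to draw.

```python
from fractions import Fraction
from itertools import product
from math import comb


def prob_by_formula(k, n, pi):
    """Binomial formula: chance of exactly k successes in n trials."""
    return comb(n, k) * pi**k * (1 - pi) ** (n - k)


def prob_by_table(k, n):
    """Count rolls of n fair dice in which exactly k dice show a 1 or a 2."""
    rolls = product(range(1, 7), repeat=n)
    hits = sum(1 for roll in rolls if sum(face <= 2 for face in roll) == k)
    return Fraction(hits, 6**n)


pi = Fraction(1, 3)  # success = rolling a 1 or a 2

two_successes = prob_by_formula(2, 2, pi)   # two dice, both successes
two_failures = prob_by_formula(0, 2, pi)    # two dice, no successes
three_of_four = prob_by_formula(3, 4, pi)   # four dice, exactly 3 successes

print(two_successes, two_failures, three_of_four)
print(prob_by_table(3, 4))
```

The 4-dice table has 1296 cells, 128 of which have exactly 3 successes, so both routes give 128/1296 = 8/81, a bit under 10%.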


Categorical Data

Does it matter what you call a “success” and what you call a “failure”?

No, as long as you keep track

For example, in a class experiment many years ago, men were asked whether they preferred to wear boxers or briefs

This is binary, because there are only 2 outcomes

“success” = ?????


Categorical Data

Binary experiments have sampling variability, just like sample means, etc.

Experiment: “success” = being under 5’10” in height

First 6 men with SSN < 5

First 6 men with SSN > 5

Note how the number of “successes” was not the same! (I might have to do this a few times)


Categorical Data

The sample fraction is a random variable

This means that if I do the experiment over and over, I will get different values.

These different values have a standard deviation.


Categorical Data

The sample fraction has a standard error

Its standard error is

$$\sigma_{\hat{\pi}} = \sqrt{\frac{\pi(1-\pi)}{n}}$$

Note how if you have a bigger sample, the standard error decreases

The standard error is biggest when $\pi$ = 0.50.


Categorical Data

The sample fraction has a standard error

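Its standard error is

$$\sigma_{\hat{\pi}} = \sqrt{\frac{\pi(1-\pi)}{n}}$$

The estimated standard error based on the sample is

$$\hat{\sigma}_{\hat{\pi}} = \sqrt{\frac{\hat{\pi}(1-\hat{\pi})}{n}}$$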
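The two claims about the standard error (it shrinks as n grows, and it is biggest at 0.50) are easy to check numerically. The sketch below is only an illustration; the function name `standard_error` and the particular values of n and of the success fraction are our choices, not values from the lecture.

```python
from math import sqrt


def standard_error(pi, n):
    """Standard error of the sample fraction: sqrt(pi * (1 - pi) / n)."""
    return sqrt(pi * (1 - pi) / n)


# Bigger samples give smaller standard errors (success fraction held at 0.50)
for n in (25, 100, 400):
    print(n, round(standard_error(0.50, n), 4))

# For fixed n, the standard error peaks at a success fraction of 0.50
grid = [0.1, 0.3, 0.5, 0.7, 0.9]
for pi in grid:
    print(pi, round(standard_error(pi, 100), 4))

largest = max(grid, key=lambda pi: standard_error(pi, 100))
```

Plugging the sample fraction in place of the population fraction gives the estimated standard error, e.g. `standard_error(0.30, 25)` is about 0.0917.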


Categorical Data

It is possible to make confidence intervals for the population fraction if the number of successes > 5, and the number of failures > 5

If this is not satisfied, consult a statistician

Under these conditions, the Central Limit Theorem says that the sample fraction is approximately normally distributed (in repeated experiments)


Categorical Data

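A $(1-\alpha)100\%$ CI for the population fraction $\pi$ is

$$\hat{\pi} \pm z_{\alpha/2}\,\hat{\sigma}_{\hat{\pi}}, \qquad \hat{\sigma}_{\hat{\pi}} = \sqrt{\frac{\hat{\pi}(1-\hat{\pi})}{n}}$$

$z_{\alpha/2}$ is found by looking up $1 - \alpha/2$ in Table 1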


Categorical Data

Often, you will only know the sample proportion/percentage and the sample size

Computing the confidence interval for the population proportion: two ways

By hand

By SPSS (this is a pain if you do not have the data entered already)

Because you may need to do this by hand, I will make you do this.


Categorical Data


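95% CI, $z_{\alpha/2}$ = 1.96

n = 25, $\hat{\pi}$ = 0.30

$$\hat{\sigma}_{\hat{\pi}} = \sqrt{\frac{\hat{\pi}(1-\hat{\pi})}{n}} = \sqrt{\frac{.3(1-.3)}{25}} = 0.09165$$

$$\hat{\pi} \pm z_{\alpha/2}\,\hat{\sigma}_{\hat{\pi}} = 0.30 \pm 1.96 \times 0.09165$$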


Categorical Data


Interpretation?

$$\hat{\pi} \pm z_{\alpha/2}\,\hat{\sigma}_{\hat{\pi}} = 0.30 \pm 1.96 \times 0.09165 = 0.30 \pm 0.18 = [0.12, 0.48]$$


Categorical Data


Interpretation? The proportion of successes in the population is from 0.12 to 0.48 (12% to 48%) with 95% confidence

$$\hat{\pi} \pm z_{\alpha/2}\,\hat{\sigma}_{\hat{\pi}} = 0.30 \pm 1.96 \times 0.09165 = 0.30 \pm 0.18 = [0.12, 0.48]$$
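The by-hand calculation can be mirrored in a few lines of Python. This is a sketch under the normal approximation used above; the function name `proportion_ci` is ours, and the z value is taken from Python's standard library instead of Table 1.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(p_hat, n, confidence=0.95):
    """Normal-approximation CI for a population fraction (illustrative helper)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)    # about 1.96 for a 95% CI
    se_hat = sqrt(p_hat * (1 - p_hat) / n)     # estimated standard error
    margin = z * se_hat
    return p_hat - margin, p_hat + margin


lower, upper = proportion_ci(0.30, 25)
print(round(lower, 2), round(upper, 2))  # 0.12 0.48
```

Remember that the approximation is only meant for samples with more than 5 successes and more than 5 failures.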


Categorical Data

You can use SPSS as long as the number of successes and the number of failures both exceed 5

To get the confidence intervals, you first have to define a numeric version of your variable that classifies whether an observation is a success or failure.

You then compute the 1-sample confidence interval from “Descriptives”, then “Explore” (demo)


Categorical Data

If you set up your data in SPSS, the “mean” will be the proportion/fraction/percentage of 1’s

Data = 0 1 1 1 0 0 0 1 0 0

n = 10

Mean = 4/10 = .40

$\hat{\pi}$ = .40
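The same bookkeeping works outside SPSS. A minimal Python sketch using the 10 data values from this slide (the variable names are ours):

```python
# 1 = success, 0 = failure, as coded on the slide
data = [0, 1, 1, 1, 0, 0, 0, 1, 0, 0]

n = len(data)            # sample size
successes = sum(data)    # adding up 0/1 values counts the 1's
p_hat = successes / n    # the mean of the 0/1 values is the sample fraction

print(n, successes, p_hat)
```

This is why coding the two categories as 0 and 1 lets any "mean" routine report the sample proportion.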


Boxers versus briefs for males

Case Processing Summary

                                              Cases
                                 Valid         Missing         Total
                               N   Percent    N   Percent    N   Percent
Boxers or Briefs Preference   188  100.0%     0     .0%     188  100.0%

In this output, boxers = 1 and briefs = 0


Boxers versus briefs for males: what % prefer boxers? In the sample, 46.81%. In the population???

Descriptives: Boxers or Briefs Preference

                                          Statistic    Std. Error
Mean                                        .4681      3.649E-02
95% Confidence        Lower Bound           .3961
Interval for Mean     Upper Bound           .5401
5% Trimmed Mean                             .4645
Median                                      .0000
Variance                                    .250
Std. Deviation                              .5003
Minimum                                     .00
Maximum                                    1.00
Range                                      1.00
Interquartile Range                        1.0000
Skewness                                    .129         .177
Kurtosis                                  -2.005         .353

In this output, boxers = 1 and briefs = 0. The proportion of 1’s is the mean


Boxers versus briefs for males: what % prefer boxers? Between 39.61% and 54.01%

Descriptives: Gender = Male, Numeric Boxers (0 = Briefs, 1 = Boxers)

                                          Statistic    Std. Error
Mean                                        .4681      3.649E-02
95% Confidence        Lower Bound           .3961
Interval for Mean     Upper Bound           .5401
5% Trimmed Mean                             .4645
Median                                      .0000
Variance                                    .250
Std. Deviation                              .5003
Minimum                                     .00
Maximum                                    1.00
Range                                      1.00
Interquartile Range                        1.0000
Skewness                                    .129         .177
Kurtosis                                  -2.005         .353


Boxers versus briefs

In the sample, 46.81% of the men preferred boxers to briefs: 53.19% preferred briefs.

Between 39.61% and 54.01% of men prefer boxers to briefs (95% CI)

Is there enough evidence to conclude that men generally prefer briefs?


Boxers versus briefs

In the sample, 46.81% of the men preferred boxers to briefs: 53.19% preferred briefs.

Between 39.61% and 54.01% of men prefer boxers to briefs (95% CI)

Is there enough evidence to conclude that men generally prefer briefs?

No, since 50% is in the CI! This means that it is plausible (at 95% confidence) that 50% prefer boxers and 50% prefer briefs, i.e., $\pi$ = 0.50.
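As a cross-check, the by-hand formula can be applied to the boxers data, using only the sample fraction (.4681) and the sample size (188) from the SPSS output. This is a sketch, and the variable names are ours. The endpoints come out slightly different from SPSS's; as far as we can tell, that is because “Explore” builds a t-based interval for the mean of the 0/1 variable instead of using the z formula.

```python
from math import sqrt
from statistics import NormalDist

n = 188          # valid cases in the SPSS output
p_hat = 0.4681   # sample fraction preferring boxers (the "Mean" row)

z = NormalDist().inv_cdf(0.975)           # about 1.96 for a 95% CI
se_hat = sqrt(p_hat * (1 - p_hat) / n)    # estimated standard error
lower, upper = p_hat - z * se_hat, p_hat + z * se_hat

# Is an even 50/50 split inside the interval?
contains_half = lower <= 0.50 <= upper

print(round(lower, 4), round(upper, 4), contains_half)
```

Either way the conclusion is the same: 0.50 lies inside the interval, so the data do not show that men generally prefer briefs.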


Sample Size Calculations

The standard error of the sample fraction is

$$\sigma_{\hat{\pi}} = \sqrt{\frac{\pi(1-\pi)}{n}}$$

If you want a $(1-\alpha)100\%$ CI to be $\hat{\pi} \pm E$, you should set

$$E = z_{\alpha/2}\sqrt{\frac{\pi(1-\pi)}{n}}$$

This means that

$$E = z_{\alpha/2}\sqrt{\frac{\pi(1-\pi)}{n}} \quad\Longrightarrow\quad n = z_{\alpha/2}^{2}\,\frac{\pi(1-\pi)}{E^{2}}$$



The small problem is that you do not know $\pi$. You have two choices:

Make a guess for $\pi$

Set $\pi$ = 0.50 and calculate (most conservative, since it results in the largest sample size)

Most polling operations make the latter choice, since it is most conservative

$$n = z_{\alpha/2}^{2}\,\frac{\pi(1-\pi)}{E^{2}}$$


Sample Size Calculations: Examples

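$$n = z_{\alpha/2}^{2}\,\frac{\pi(1-\pi)}{E^{2}}$$

Set E = 0.04, 95% CI, you guess that $\pi$ = 0.30:

$$n = 1.96^{2}\,\frac{.3(1-.3)}{.04^{2}} \approx 504$$

(The unrounded value is 504.21, so round up to 505 if the margin must be no larger than E.)

You have no good guess, so set $\pi$ = 0.50:

$$n = 1.96^{2}\,\frac{.5(1-.5)}{.04^{2}} = 600.25, \text{ so take } n = 601$$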
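The sample size formula is simple enough to wrap in a small Python function. This is a sketch; the name `sample_size` and its defaults (z = 1.96 for 95% confidence, a guess of 0.50 when nothing better is known) are our choices. It returns both the unrounded value and the value rounded up, since a fractional observation is not possible and rounding down would make the margin slightly larger than E.

```python
from math import ceil


def sample_size(margin, pi_guess=0.50, z=1.96):
    """Return (unrounded n, n rounded up) for a CI of half-width `margin`."""
    n_exact = z**2 * pi_guess * (1 - pi_guess) / margin**2
    return n_exact, ceil(n_exact)


# E = 0.04, 95% CI, guessing the population fraction is 0.30
print(sample_size(0.04, pi_guess=0.30))

# E = 0.04, 95% CI, no good guess: use the conservative 0.50
print(sample_size(0.04))
```

The first call gives about 504.2, which rounds up to 505; the second gives about 600.25, which rounds up to 601.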
