8/8/2019 Lecture on Sampling Distributions

1/31

Sampling Distributions

Stat 515 Lecture


12/18/2010 2

Inching Towards Inference

Recall that one of our main goals is to make inference about the unknown parameters of the population or the distribution, such as the mean $\mu$, the standard deviation $\sigma$, or some other summary measures such as the median, etc.

We now have possible models for the population,

which are provided by the probability distributions

(Binomial, Poisson, Normal, Uniform, others).

We also know how to compute sample statistics such as the sample mean, sample standard deviation, and others, with these sample statistics to be used for making inference about the parameters.


Sampling as a Random Experiment

To understand the notion of a sampling distribution of

a sample statistic, it is important to realize that the

process of taking a sample from a population could

be viewed as a random experiment.

To illustrate this idea, consider a population taking three values, 2, 4, and 5, according to the following probability distribution.

Probability Function: p(2) = .4, p(4) = .5, p(5) = .1

You may imagine that 40% of all the values in the population equal 2, 50% equal 4, and 10% equal 5.


The Population

[Figure: chart of the population, with regions labeled 2s, 4s, and 5s.]


Characteristics of the Population

For this population, we have the parameters:

$\mu = (2)(.4) + (4)(.5) + (5)(.1) = .8 + 2 + .5 = 3.3$

$\sigma^2 = (2 - 3.3)^2(.4) + (4 - 3.3)^2(.5) + (5 - 3.3)^2(.1) = 1.21$

$\sigma = (1.21)^{1/2} = 1.1$

Its shape is given by a bar graph:

[Figure: bar graph of the probability function; horizontal axis ticks at 2, 3, 4, 5, vertical axis from 0 to 0.6.]
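The parameter calculations on this slide can be checked numerically. The following is a minimal sketch in Python (the variable names are illustrative choices, not part of the lecture); it applies the definitions of the mean and variance of a discrete distribution to p(2) = .4, p(4) = .5, p(5) = .1.

```python
import math

# Population distribution from the slide: value -> probability.
population = {2: 0.4, 4: 0.5, 5: 0.1}

# Mean: sum of value * probability.
mu = sum(y * p for y, p in population.items())

# Variance: probability-weighted squared deviations from the mean.
sigma2 = sum((y - mu) ** 2 * p for y, p in population.items())

# Standard deviation: square root of the variance.
sigma = math.sqrt(sigma2)

print(f"mean = {mu:.2f}, variance = {sigma2:.2f}, sd = {sigma:.2f}")
# prints: mean = 3.30, variance = 1.21, sd = 1.10
```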


Possible Outcomes of Sampling Process

Now, consider the sampling process of taking n = 2 observations (with replacement) from this population or distribution. Below is a table of possibilities.

Possible Sample   Probability of Sample   Sample Mean   Sample Variance
(2, 2)            (.4)(.4) = .16          2             0
(2, 4)            (.4)(.5) = .20          3             2
(2, 5)            (.4)(.1) = .04          3.5           4.5
(4, 2)            (.5)(.4) = .20          3             2
(4, 4)            (.5)(.5) = .25          4             0
(4, 5)            (.5)(.1) = .05          4.5           .5
(5, 2)            (.1)(.4) = .04          3.5           4.5
(5, 4)            (.1)(.5) = .05          4.5           .5
(5, 5)            (.1)(.1) = .01          5             0
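The table of possible samples can be reproduced by brute-force enumeration. This is a sketch only, assuming Python's standard `itertools` and `statistics` modules; the names `rows` and `prob` are illustrative. Each ordered pair is one elementary outcome, and its probability is the product of the probabilities of its two observations.

```python
from itertools import product
from statistics import mean, variance

# Population distribution from the slides: value -> probability.
population = {2: 0.4, 4: 0.5, 5: 0.1}

rows = []
# All ordered samples of size 2 drawn with replacement: 3 * 3 = 9 of them.
for sample in product(population, repeat=2):
    prob = 1.0
    for y in sample:
        prob *= population[y]  # independent draws, so probabilities multiply
    # variance() uses the n - 1 divisor, matching the sample variance S^2.
    rows.append((sample, prob, mean(sample), variance(sample)))

for sample, prob, xbar, s2 in rows:
    print(sample, f"{prob:.2f}", xbar, s2)
```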


Some Points about the Preceding Table

Since we are sampling with replacement, to obtain the probability of each possible sample, we simply multiply the probabilities of each of the observations (think of a tree diagram!).

The 9 possible samples represent the elementary events of the experiment of taking a sample of size 2 from the population or distribution.

The sample mean ($\bar{X}$) is obtained the usual way.

The sample variance is computed the usual way. For example, for the second sample, we have

$S^2 = [(2-3)^2 + (4-3)^2]/(2-1) = [1 + 1]/1 = 2$


Sample Statistics as Random Variables

Since the sample mean and the sample variance are numerical characteristics of each of the possible samples, they can be viewed as random variables in this sampling experiment.

Therefore, we could obtain the probability distributions of the sample mean and sample variance.

These probability distributions are called sampling distributions.

Thus we will have the sampling distribution of the sample mean, as well as that of the sample variance.


Sampling Distribution of the Sample Mean

From the earlier table, we could construct the probability distribution of the sample mean, now called the sampling distribution of the sample mean. This is given by the following table.

$\bar{X}$   $P(\bar{X})$        $\bar{X} \cdot P(\bar{X})$   $(\bar{X} - 3.3)^2 P(\bar{X})$
2           .16                 0.32                         .2704
3           .20 + .20 = .40     1.20                         .0360
3.5         .04 + .04 = .08     0.28                         .0032
4           .25                 1.00                         .1225
4.5         .05 + .05 = .10     0.45                         .1440
5           .01                 0.05                         .0289
Sums        1.00                3.3                          .6050
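The sampling distribution table, including its column sums, can be rebuilt by grouping the nine possible samples by their sample mean. The sketch below assumes Python's standard library and uses exact fractions so that the sums come out without rounding error; the names `dist`, `mean_xbar`, and `var_xbar` are illustrative.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Population distribution as exact fractions: value -> probability.
population = {2: Fraction(4, 10), 4: Fraction(5, 10), 5: Fraction(1, 10)}
n = 2  # sample size

# Accumulate P(Xbar = x) over all ordered samples of size n.
dist = defaultdict(Fraction)
for sample in product(population, repeat=n):
    prob = Fraction(1)
    for y in sample:
        prob *= population[y]
    dist[Fraction(sum(sample), n)] += prob

# Mean and variance of the sampling distribution (the column sums).
mean_xbar = sum(x * p for x, p in dist.items())
var_xbar = sum((x - mean_xbar) ** 2 * p for x, p in dist.items())

for x in sorted(dist):
    print(float(x), float(dist[x]))
print("mean:", float(mean_xbar), "variance:", float(var_xbar))
```

The exact results are 33/10 for the mean and 121/200 for the variance, which agree with the sums 3.3 and .6050 in the table.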


Graph of the Sampling Distribution of

the Sample Mean

Note that it has become more concentrated near the

population mean of 3.3, compared to the original

distribution.

[Figure: graph of the sampling distribution of the sample mean, with Xbar on the horizontal axis.]


Parameters of the Sampling Distribution

Because the sampling distribution is just like any other probability distribution, we are also able to obtain its mean, variance, and standard deviation.

Thus, for the sampling distribution of the sample mean, we find the mean to be 3.3, which coincides with the original population mean; while the variance of the sampling distribution of the sample mean turns out to be equal to .605, which is equal to (1.21)/2, the population variance divided by the sample size.

The standard deviation of the sample mean, now called the standard error (SE), is $(.605)^{1/2} = .7778$.


Recapitulation

Sampling from a probability distribution or population

could be viewed as a random experiment, and the

elementary outcomes are the possible samples.

Sample statistics, such as the sample mean, could

be viewed as random variables, and as such have

their associated probability distributions, which are

called sampling distributions.

The sampling distribution also has a mean.

And it also has a variance.

The standard deviation of the sampling distribution is

called the standard error (SE).


Sampling Distribution of the Sample Mean

The mean of the sampling distribution of the sample

mean equals the population mean.

The variance of the sampling distribution of the

sample mean equals the population variance divided

by the sample size.

These two characteristics are always true for the

sampling distribution of the sample mean when

sampling with replacement.
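A short derivation shows why these two facts hold. It is a sketch under the assumption that the observations $X_1, \ldots, X_n$ are independent (as they are when sampling with replacement) and that each has the population mean $\mu$ and variance $\sigma^2$.

```latex
\begin{align*}
E(\bar{X}) &= E\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
            = \frac{1}{n}\sum_{i=1}^{n} E(X_i)
            = \frac{1}{n}(n\mu) = \mu, \\
\operatorname{Var}(\bar{X}) &= \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
            = \frac{1}{n^2}\sum_{i=1}^{n} \operatorname{Var}(X_i)
            = \frac{1}{n^2}(n\sigma^2) = \frac{\sigma^2}{n}.
\end{align*}
```

Independence is what allows the variance of the sum to be written as the sum of the variances in the second line.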


Obtaining Sampling Distributions

In the example considered, we obtained the sampling

distribution of the sample mean by enumerating all

the possible samples that could arise.

However, such a method is not feasible if the sample size is large. For instance, if n = 10, then there will be a total of $(3)(3)\cdots(3) = 3^{10} = 59049$ possible samples, and complete enumeration by hand is no longer practical.

How do we obtain sampling distributions?


Some Methods for Obtaining Sampling Distributions of Statistics

Complete enumeration, if possible.

Computer simulation, or the Monte Carlo method. In this method the computer generates many, many samples, and then constructs the probability histogram of the values of the statistic of interest. This will provide an empirical approximation.

Using theoretical results such as, for instance, when sampling from a Bernoulli population the number of successes is binomially distributed.

Using theoretical approximations such as the Central Limit Theorem or the de Moivre approximation.


Illustrating the Monte Carlo Method

We illustrate the use of the simulation or Monte Carlo method by approximating the sampling distribution of the sample mean based on n = 10 observations from the population considered earlier, which has:

p(2) = .4, p(4) = .5, p(5) = .1

We generate 500 samples of size n = 10 from this population, and for each sample we compute the sample mean.

This simulation was done using Minitab.
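The same kind of simulation can be sketched in Python rather than Minitab. This is not the lecture's actual run: the seed (515) is an arbitrary assumption, so the individual samples and the resulting summary values will differ from those reported in the slides, though they should land close to the same theoretical values.

```python
import random
import statistics

values = [2, 4, 5]
probs = [0.4, 0.5, 0.1]
n = 10             # observations per sample
num_samples = 500  # number of simulated samples

rng = random.Random(515)  # arbitrary seed (an assumption), for reproducibility

sample_means = []
for _ in range(num_samples):
    # Draw n observations with replacement according to the probabilities.
    sample = rng.choices(values, weights=probs, k=n)
    sample_means.append(statistics.mean(sample))

# Summaries of the simulated sampling distribution of the sample mean.
mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)
print(mean_of_means, sd_of_means)
```

A relative frequency histogram of `sample_means` then serves as the empirical approximation to the sampling distribution.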


First 10 of the 500 Generated Samples

The table below shows the first 10 samples of size n = 10 that were generated from the population. Also included are their corresponding sample means. The worksheet columns y and p(y) hold the population distribution: p(2) = 0.4, p(4) = 0.5, p(5) = 0.1.

x1   x2   x3   x4   x5   x6   x7   x8   x9   x10   Sample Mean
4    2    2    5    4    2    4    2    2    4     3.1
4    4    2    2    5    4    4    4    2    2     3.3
4    2    2    2    2    2    4    4    4    5     3.1
2    2    4    2    2    2    2    2    2    2     2.2
2    5    2    4    4    4    2    2    5    2     3.2
4    4    4    2    2    4    2    2    2    4     3.0
4    4    4    4    2    2    5    2    2    4     3.3
2    2    2    4    2    2    4    2    2    2     2.4
2    4    2    2    4    5    5    2    2    2     3.0
2    5    4    4    2    2    4    4    4    2     3.3


Relative Frequency Histogram of the 500

Sample Means

[Figure: relative frequency histogram (vertical axis: relative frequency, in %). Caption: Simulated Sampling Distribution of the Sample Mean Based on 10 Observations when Sampling from the Population p(2) = .4, p(4) = .5, and p(5) = .1]


Points to Ponder

This relative frequency histogram of the simulated

sample means serves as an approximation to the

sampling distribution of the sample mean when n =

10 and when sampling from the given population.

Notice that the values of the sample means are now clustered around the population mean of 3.3, and furthermore, the shape of the histogram is almost bell-shaped.

The histogram also shows that the chance of getting a sample of size n = 10 whose sample mean is less than 2.5 or greater than 4.5 is rather small.


When the mean of the 500 sample means is computed, it turns out to be 3.3094. [Their median is exactly 3.30!] Recall that the population mean is 3.30.

The standard deviation of the 500 sample means turns out to be 0.3497. Recall that the population standard deviation is $(1.21)^{1/2} = 1.1$, so

$\frac{\sigma}{\sqrt{n}} = \frac{1.1}{\sqrt{10}} = \frac{1.1}{3.1622} = .3478$


We therefore note that the mean of the simulated sample means is very close to the population mean, and the standard deviation of the simulated sample means is also very close to the population standard deviation divided by the square root of the sample size.

Indeed, we always have the theoretical results:

Mean of $\bar{X}$ = $\mu_{\bar{X}} = \mu$

Std. Error of $\bar{X}$ = $\sigma_{\bar{X}} = \sigma/\sqrt{n}$


An Important Result About the Sampling

Distribution of the Sample Mean

When the population being sampled is a normal population with mean $\mu$ and standard deviation $\sigma$, then the sampling distribution of the sample mean is also normal with mean $\mu$ and standard error of $\sigma/n^{1/2}$, for any sample size n.

When the population is not normal, however, then the sampling distribution of the sample mean need not be normal. But we have:


Central Limit Theorem

If a random sample of size n is taken from a population or distribution with mean $\mu$ and standard deviation $\sigma$, and if the sample size is large (n > 30), then the sampling distribution of the sample mean is approximately normal with mean $\mu$ and standard deviation (or standard error) of $\sigma/n^{1/2}$. That is,

$\bar{X}$ is approx. $N\left(\mu, \frac{\sigma^2}{n}\right)$


Uses of the Central Limit Theorem

Because of this approximation, when computing probabilities associated with the sample mean, we can use the approximation given below, which uses the standard normal distribution.

$P(a \le \bar{X} \le b) \approx P\left(\frac{a - \mu}{\sigma/\sqrt{n}} \le Z \le \frac{b - \mu}{\sigma/\sqrt{n}}\right)$

Note: $Z \sim N(0,1)$, the standard normal variable.


Applications of the CLT

Situation 1: Suppose we take a sample of size n = 30 from the population described by the probability function p(2) = 0.4, p(4) = 0.5, p(5) = 0.1. This is the population we were using earlier.

Question 1: We seek the approximate probability that the sample mean is between 3.1 and 3.5.

Question 2: Find the approximate probability that the sample mean is less than 2.6.
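The two questions can be worked with the normal approximation from the Central Limit Theorem. The sketch below assumes the population is the one used earlier (p(2) = .4, p(4) = .5, p(5) = .1, so the population mean is 3.3 and the standard deviation is 1.1) and uses `statistics.NormalDist` from Python's standard library; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 3.3, 1.1, 30  # population mean, population sd, sample size

# CLT approximation: the sample mean is roughly Normal(mu, sigma / sqrt(n)).
xbar_dist = NormalDist(mu, sigma / sqrt(n))

# Question 1: P(3.1 < Xbar < 3.5)
p_between = xbar_dist.cdf(3.5) - xbar_dist.cdf(3.1)

# Question 2: P(Xbar < 2.6)
p_below = xbar_dist.cdf(2.6)

print(f"P(3.1 < Xbar < 3.5) is about {p_between:.4f}")
print(f"P(Xbar < 2.6) is about {p_below:.6f}")
```

The first probability comes out near 0.68, since 3.1 and 3.5 lie about one standard error on either side of 3.3; the second is well below 0.001, since 2.6 is roughly 3.5 standard errors below the mean.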


Applications continued

Situation 2: The systolic blood pressure population data set has mean $\mu$ = 114.58 and standard deviation of $\sigma$ = 14.06. Its distribution is not normal as it is right-skewed. Suppose we take a random sample of n = 50 people, and obtain the sample mean of their systolic blood pressures.

Question 1: What is the approximate probability that this sample mean will exceed 120?


Continued ...

Question 2: What would be the value of A such that the probability that the sample mean of the systolic blood pressures of a sample of size 50 is greater than A is 0.95?
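Both blood pressure questions can be approached with the same normal approximation, using the stated population mean of 114.58 and standard deviation of 14.06. This is a sketch assuming Python's `statistics.NormalDist`; for Question 2 it uses the fact that P(sample mean > A) = 0.95 is equivalent to A being the 5th percentile of the approximating normal distribution.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 114.58, 14.06, 50  # values given in Situation 2

# CLT approximation for the sample mean of n = 50 blood pressures.
xbar_dist = NormalDist(mu, sigma / sqrt(n))

# Question 1: P(Xbar > 120)
p_exceeds_120 = 1 - xbar_dist.cdf(120)

# Question 2: A with P(Xbar > A) = 0.95, i.e. P(Xbar <= A) = 0.05.
a = xbar_dist.inv_cdf(0.05)

print(f"P(Xbar > 120) is about {p_exceeds_120:.4f}")
print(f"A is about {a:.2f}")
```

Under these assumptions the first probability is roughly 0.003 and A is roughly 111.3.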


Sampling a Bernoulli Population

A Bernoulli population is one where there are only two possible values or outcomes, called a Success, denoted by the value X = 1, and a Failure, denoted by the value X = 0. The probability of a Success is denoted by p.

For such a population we have:

Mean = $\mu$ = p;

Variance = $\sigma^2$ = p(1-p).
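These two formulas follow directly from the definitions of the mean and variance of a discrete distribution. As a brief worked check, writing X for a single Bernoulli observation that equals 1 with probability p and 0 with probability 1 - p:

```latex
\begin{align*}
\mu &= (0)(1-p) + (1)(p) = p, \\
\sigma^2 &= (0 - p)^2(1-p) + (1 - p)^2(p)
          = p(1-p)\,[\,p + (1-p)\,]
          = p(1-p).
\end{align*}
```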

Consider now taking a sample of size n from this population and letting $\hat{p}$ equal the proportion of successes in the sample. That is,


Sample Proportion

$\hat{p} = \frac{\text{Number of Successes}}{n}$

Because the Bernoulli observations are either 0 or 1 (with 1 representing success), the sample proportion could be defined via:

$\hat{p} = \frac{1}{n}\sum_{i=1}^{n} X_i = \bar{X}$
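A tiny numerical illustration of this identity, using a made-up sample of ten 0/1 observations (the data are hypothetical and chosen only for the example):

```python
from statistics import mean

# Hypothetical Bernoulli sample: 1 = success, 0 = failure.
observations = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]

n = len(observations)
number_of_successes = sum(observations)  # the 1s count the successes

p_hat = number_of_successes / n  # sample proportion
x_bar = mean(observations)       # sample mean of the 0/1 values

print(p_hat, x_bar)  # both equal 0.4
```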


Sampling Distribution of the Sample

Proportion

Since the sample proportion is the sample mean of the observations from a Bernoulli population, by the Central Limit Theorem, it follows that the sampling distribution of the sample proportion, when the sample size is large (that is, n > 30), is approximately normal with mean of p and SE of $[p(1-p)/n]^{1/2}$. That is, with q = 1 - p,

$\hat{p}$ is approx. $N\left(\mu_{\hat{p}} = p,\ \sigma^2_{\hat{p}} = \frac{pq}{n}\right)$


An Application

Situation: One of the ways most Americans relieve stress is to reward themselves with sweets. According to one study, 46% admit to overeating sweet foods when stressed. Suppose that the 46% figure is correct and we take a random sample of size n = 100 Americans and ask them if they overeat sweets when they are stressed out.

Question 1: What is the probability that the proportion who overeat sweets in this sample exceeds 0.50?
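This question can be approached with the normal approximation for the sample proportion, taking p = 0.46 and n = 100 from the situation described. The sketch assumes Python's `statistics.NormalDist`, and it applies the plain approximation with mean p and standard error [p(1-p)/n]^(1/2), with no further refinements.

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.46, 100  # assumed true proportion and sample size

# Standard error of the sample proportion: sqrt(p(1 - p) / n).
se = sqrt(p * (1 - p) / n)

# CLT approximation: the sample proportion is roughly Normal(p, se).
p_hat_dist = NormalDist(p, se)

# P(sample proportion > 0.50)
prob_exceeds_half = 1 - p_hat_dist.cdf(0.50)

print(f"SE is about {se:.4f}")
print(f"P(p-hat > 0.50) is about {prob_exceeds_half:.3f}")
```

Here the standard error is about 0.05, so 0.50 lies about 0.8 standard errors above 0.46, and the approximate probability is roughly 0.21.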
