ST3239: Survey Methodology by Wang ZHOU


Post on 07-Jun-2020


Chapter 1

Elements of the sampling problem

1.1 Introduction

Often we are interested in some characteristic of a finite population, e.g. the average income of last year’s graduates from NUS. Since the population is usually very large, we would like to say something (i.e. make inferences) about the population by collecting and analysing only a part of that population. The principles and methods of collecting and analysing data from a finite population form a branch of statistics known as the Sample Survey Method. The theory involved is called Sampling Theory. Sample surveys are widely used in many areas such as agriculture, education, industry, social affairs and medicine.

1.2 Some technical terms

1. An element is an object on which a measurement is taken.

2. A population is a collection of elements about which we require information.

3. Population characteristic: this is the aspect of the population we wish to measure, e.g. the average income of last year’s graduates from NUS, or the total wheat yield of all farmers in a certain country.

4. Sampling units are nonoverlapping collections of elements from the population. Sampling units may be the individual members of the population, or they may be a coarser subdivision of the population, e.g. a household, which may contain more than one individual member.

5. A frame is a list of sampling units, e.g., telephone directory.

6. A sample is a collection of sampling units drawn from a frame or frames.


1.3 Why sample?

If a sample is equal to the population, then we have a census, which contains all the information one wants. However, a census is rarely conducted, for several reasons:

• cost, (money is limited)

• time, (time is limited)

• destructive (testing a product can be destructive, e.g. light bulbs),

• accessibility (non-response can be a serious issue).

In those cases, sampling is the only alternative.

1.4 How to select the sample: the design of the sample survey

The procedure for selecting the sample is called the sample survey design. The general aim of a sample survey is to draw samples which are “representative” of the whole population. Broadly speaking, we can classify sampling schemes into two categories: probability sampling and some other sampling schemes.

1. Probability sampling is a sampling scheme whereby the possible samples are enumerated and each has a non-zero probability of being selected. With probability built into the design, we can make statements such as “our estimate is unbiased and we are 95% confident that it is within 2 percentage points of the true proportion”. In this course, we shall concentrate only on probability sampling.

2. Some other sampling schemes

a) ‘volunteer sampling’: TV telephone polls, medical volunteers for research.

b) ‘subjective sampling’: we choose samples that we consider to be typical or “representative” of the population.

c) ‘quota sampling’: one keeps sampling until a certain quota is filled.

All these sampling procedures provide some information about the population, but it is hard to deduce the nature of the population from such studies, as the samples are very subjective and often very biased. Furthermore, it is hard to measure the precision of these estimates.

1.5 How to design a questionnaire and plan a survey

This can be the most important and perhaps the most difficult part of the survey sampling problem. We shall come back to this point in more detail later.


Chapter 2

Simple random sampling

Definition: If a sample of size n is drawn from a population of size N in such a way that every possible sample of size n has the same probability of being selected, the sampling procedure is called simple random sampling. The sample thus obtained is called a simple random sample. Simple random sampling is often written as s.r.s. for short and is the simplest sampling procedure.

2.1 How to draw a simple random sample

Suppose that the population of size \(N\) has values

\[ \{u_1, u_2, \cdots, u_N\}. \]

If we draw \(n\) (distinct) items without replacement from the population, there are altogether \(\binom{N}{n}\) different ways of doing it. So if we assign probability \(1/\binom{N}{n}\) to each of the different samples, then each sample thus obtained is a simple random sample. We denote this sample by

\[ \{y_1, y_2, \cdots, y_n\}. \]

Remark: In our previous statistics course, we always used upper-case letters like X, Y etc. to denote random variables and lower-case letters like x, y etc. to represent fixed values. However, in a sample survey course, by convention, we use lower-case letters like \(y_1, y_2\) etc. to denote random variables.

Theorem 2.1.1 For simple random sampling, we have

\[ P(y_1 = u_{i_1}, y_2 = u_{i_2}, \cdots, y_n = u_{i_n}) = \frac{1}{N} \cdot \frac{1}{N-1} \cdots \frac{1}{N-n+1} = \frac{(N-n)!}{N!}, \]

where \(i_1, i_2, \cdots, i_n\) are mutually different.


Proof. By the definition of s.r.s., the probability of obtaining the sample \(\{u_{i_1}, u_{i_2}, \cdots, u_{i_n}\}\) (where the order is not important) is \(1/\binom{N}{n}\). There are \(n!\) ways of ordering \(\{u_{i_1}, u_{i_2}, \cdots, u_{i_n}\}\). Therefore,

\[ P(y_1 = u_{i_1}, y_2 = u_{i_2}, \cdots, y_n = u_{i_n}) = \frac{1}{\binom{N}{n}\, n!} = \frac{(N-n)!\, n!}{N!\, n!} = \frac{(N-n)!}{N!}. \]

Remark: Recall that the total number of all possible samples is \(\binom{N}{n}\), which could be very large if \(N\) and \(n\) are large. Therefore, getting a simple random sample by first listing all possible samples and then drawing one at random would not be practical. An easier way to get a simple random sample is simply to draw \(n\) values at random without replacement from the \(N\) population values. That is, we first draw one value at random from the \(N\) population values, then draw another value at random from the remaining \(N-1\) population values, and so on, until we get a sample of \(n\) (different) values.

Theorem 2.1.2 A sample obtained by drawing \(n\) values successively without replacement from the \(N\) population values is a simple random sample.

Proof. Suppose that our sample obtained by drawing \(n\) values without replacement from the \(N\) population values is

\[ \{a_1, a_2, \cdots, a_n\}, \]

where the order is not important. Let \(\{a_{i_1}, a_{i_2}, \cdots, a_{i_n}\}\) be any permutation of \(\{a_1, a_2, \cdots, a_n\}\). Since the sample is drawn without replacement, we have

\[ P(y_1 = a_{i_1}, \cdots, y_n = a_{i_n}) = \frac{1}{N} \cdot \frac{1}{N-1} \cdots \frac{1}{N-n+1} = \frac{(N-n)!}{N!}. \]

Hence, the probability of obtaining the sample \(\{a_1, \cdots, a_n\}\) (where the order is not important) is

\[ \sum_{\text{all } (i_1, \cdots, i_n)} P(y_1 = a_{i_1}, \cdots, y_n = a_{i_n}) = \sum_{\text{all } (i_1, \cdots, i_n)} \frac{(N-n)!}{N!} = n! \times \frac{(N-n)!}{N!} = \frac{1}{\binom{N}{n}}. \]

The theorem is thus proved by the definition of simple random sampling.
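The successive-draw procedure of Theorem 2.1.2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `draw_srs`, the four-letter population and the number of repetitions are all hypothetical choices, not part of the notes. Repeating the draw many times, each of the six unordered samples of size 2 should appear with relative frequency close to 1/6.

```python
import random
from collections import Counter

def draw_srs(population, n, rng=random):
    """Draw n values one at a time, without replacement (hypothetical helper)."""
    remaining = list(population)
    sample = []
    for _ in range(n):
        # Each value still in the population is equally likely to be drawn next.
        k = rng.randrange(len(remaining))
        sample.append(remaining.pop(k))
    return sample

rng = random.Random(0)  # fixed seed so the run is reproducible
reps = 60000
counts = Counter(frozenset(draw_srs("abcd", 2, rng)) for _ in range(reps))

# Relative frequency of each unordered sample; all should be near 1/6.
for s, c in sorted(counts.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(s), round(c / reps, 3))
```

In practice the standard library call `random.sample(population, n)` draws without replacement directly; the explicit loop is used here only to mirror the wording of the theorem.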


Two special cases will be used later when n = 1, and n = 2.

Theorem 2.1.3 For any \(i, j = 1, \ldots, n\) and \(s, t = 1, \ldots, N\),

(i) \(P(y_i = u_s) = 1/N\).

(ii) \(P(y_i = u_s, y_j = u_t) = \dfrac{1}{N(N-1)}\), for \(i \neq j\), \(s \neq t\).

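Proof.

\[
\begin{aligned}
P(y_k = u_j) &= \sum_{\text{all } (i_1, \cdots, i_n) \text{ with } i_k = j} P(y_1 = u_{i_1}, \cdots, y_k = u_{i_k}, \cdots, y_n = u_{i_n}) \\
&= \frac{(N-n)!}{N!} \times \binom{N-1}{n-1}(n-1)! = \frac{(N-n)!}{N!} \times \frac{(N-1)!}{(N-n)!} = \frac{1}{N}.
\end{aligned}
\]

\[
\begin{aligned}
P(y_k = u_s, y_j = u_t) &= \sum_{\text{all } (i_1, \cdots, i_n) \text{ with } i_k = s,\ i_j = t} P(y_1 = u_{i_1}, \cdots, y_n = u_{i_n}) \\
&= \frac{(N-n)!}{N!} \times \binom{N-2}{n-2}(n-2)! = \frac{(N-n)!}{N!} \times \frac{(N-2)!}{(N-n)!} = \frac{1}{N(N-1)}.
\end{aligned}
\]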

Example 1. A population contains {a, b, c, d}. We wish to draw a s.r.s. of size 2. List all possible samples and find the probability of drawing {b, d}.

Solution. Possible samples of size 2 are

{a, b}, {a, c}, {a, d}, {b, c}, {b, d}, {c, d}.

The probability of drawing {b, d} is 1/6.
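For a population this small the enumeration can be checked mechanically. The sketch below lists the unordered samples with `itertools.combinations` and, as a side check of Theorem 2.1.3, also counts ordered outcomes; the variable names are illustrative assumptions, not notation from the notes.

```python
from fractions import Fraction
from itertools import combinations, permutations
from math import comb

population = ["a", "b", "c", "d"]
N, n = len(population), 2

# Unordered samples: each has probability 1 / C(N, n).
samples = list(combinations(population, n))
p_sample = Fraction(1, comb(N, n))

# Ordered outcomes (y1, y2): N!/(N-n)! of them, all equally likely.
ordered = list(permutations(population, n))
p_ordered = Fraction(1, len(ordered))

# Theorem 2.1.3 (i): P(y1 = b) should equal 1/N.
p_first_is_b = sum(p_ordered for o in ordered if o[0] == "b")
# Theorem 2.1.3 (ii): P(y1 = b, y2 = d) should equal 1/(N(N-1)).
p_b_then_d = sum(p_ordered for o in ordered if o == ("b", "d"))

print(samples)
print(p_sample, p_first_is_b, p_b_then_d)
```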


2.2 Estimation of population mean and total

2.2.1 Estimation of population mean

Suppose that the population of size \(N\) has values \(\{u_1, u_2, \cdots, u_N\}\). We can define

1) the population mean

\[ \mu = \frac{u_1 + u_2 + \cdots + u_N}{N} = \frac{1}{N} \sum_{i=1}^{N} u_i, \]

2) the population variance

\[ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (u_i - \mu)^2. \]

We wish to estimate the quantities \(\mu\) and \(\sigma^2\) and to study the accuracy of their estimators. Suppose that a simple random sample of size \(n\) is drawn, resulting in \(\{y_1, y_2, \cdots, y_n\}\). Then an obvious estimator for \(\mu\) is the sample mean:

\[ \hat{\mu} = \bar{y} = \sum_{i=1}^{n} y_i / n. \]

Theorem 2.2.1

(i) \(E(y_i) = \mu\), \(\mathrm{Var}(y_i) = \sigma^2\).

(ii) \(\mathrm{Cov}(y_i, y_j) = -\dfrac{\sigma^2}{N-1}\), for \(i \neq j\).

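Proof. (i). By Theorem 2.1.3 (i),

\[ E(y_i) = \sum_{k=1}^{N} u_k P(y_i = u_k) = \sum_{k=1}^{N} u_k \frac{1}{N} = \mu. \]

\[ \mathrm{Var}(y_i) = \sum_{k=1}^{N} (u_k - \mu)^2 P(y_i = u_k) = \sum_{k=1}^{N} (u_k - \mu)^2 \frac{1}{N} = \sigma^2. \]

(ii). By definition, \(\mathrm{Cov}(y_i, y_j) = E(y_i y_j) - E(y_i)E(y_j) = E(y_i y_j) - \mu^2\). Now, by Theorem 2.1.3 (ii),

\[
\begin{aligned}
E(y_i y_j) &= \sum_{\text{all } s \neq t} u_s u_t P(y_i = u_s, y_j = u_t) = \sum_{\text{all } s \neq t} u_s u_t \frac{1}{N(N-1)} \\
&= \frac{1}{N(N-1)} \left[ \sum_{\text{all } s, t} u_s u_t - \sum_{s = t} u_s u_t \right] \\
&= \frac{1}{N(N-1)} \left[ \left( \sum_{s=1}^{N} u_s \right) \left( \sum_{t=1}^{N} u_t \right) - \sum_{s=1}^{N} u_s^2 \right] \\
&= \frac{1}{N(N-1)} \left[ (N\mu)^2 - \left( \sum_{s=1}^{N} (u_s - \mu)^2 + N\mu^2 \right) \right] \\
&= \frac{1}{N(N-1)} \left[ (N\mu)^2 - N\sigma^2 - N\mu^2 \right] = -\frac{\sigma^2}{N-1} + \mu^2.
\end{aligned}
\]

Thus, \(\mathrm{Cov}(y_i, y_j) = E(y_i y_j) - \mu^2 = -\dfrac{\sigma^2}{N-1}\).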


Theorem 2.2.2

\[ E(\bar{y}) = \mu, \qquad \mathrm{Var}(\bar{y}) = \frac{\sigma^2}{n} \left( \frac{N-n}{N-1} \right). \]

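Proof. Note \(\bar{y} = \frac{1}{n}(y_1 + \cdots + y_n)\). So

\[ E(\bar{y}) = \frac{1}{n}(E y_1 + \cdots + E y_n) = \frac{1}{n}(n\mu) = \mu. \]

Now

\[
\begin{aligned}
\mathrm{Var}(\bar{y}) &= \frac{1}{n^2} \mathrm{Cov}\left( \sum_{i=1}^{n} y_i, \sum_{j=1}^{n} y_j \right) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \mathrm{Cov}(y_i, y_j) \\
&= \frac{1}{n^2} \left[ \sum_{i \neq j} \mathrm{Cov}(y_i, y_j) + \sum_{i = j} \mathrm{Cov}(y_i, y_j) \right] \\
&= \frac{1}{n^2} \left[ \sum_{i \neq j} \left( -\frac{\sigma^2}{N-1} \right) + \sum_{i=1}^{n} \mathrm{Var}(y_i) \right] \\
&= \frac{1}{n^2} \left( n(n-1) \left( -\frac{\sigma^2}{N-1} \right) + n\sigma^2 \right) \\
&= \frac{\sigma^2}{n} \left( (n-1) \left( -\frac{1}{N-1} \right) + 1 \right) \\
&= \frac{\sigma^2}{n} \left( \frac{N-n}{N-1} \right).
\end{aligned}
\]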

Remark: From Theorem 2.2.2, we see that \(\bar{y}\) is an unbiased estimator for \(\mu\). Also, as \(n\) increases towards \(N\), \(\mathrm{Var}(\bar{y})\) decreases to 0. This implies that \(\bar{y}\) becomes a more accurate estimator for \(\mu\) as \(n\) gets larger. In particular, when \(n = N\), we have a census and \(\mathrm{Var}(\bar{y}) = 0\).

Remark: In our previous statistics course, we usually sampled \(\{y_1, y_2, \cdots, y_n\}\) from the population with replacement. Therefore, \(\{y_1, y_2, \cdots, y_n\}\) are independent and identically distributed (i.i.d.). Recall that we then have results like

\[ E_{iid}(\bar{y}) = \mu, \qquad \mathrm{Var}_{iid}(\bar{y}) = \frac{\sigma^2}{n}. \]

Notice that \(\mathrm{Var}_{iid}(\bar{y})\) is different from \(\mathrm{Var}(\bar{y})\) in Theorem 2.2.2. In fact, for \(n > 1\),

\[ \mathrm{Var}(\bar{y}) = \frac{\sigma^2}{n} \left( \frac{N-n}{N-1} \right) < \frac{\sigma^2}{n} = \mathrm{Var}_{iid}(\bar{y}). \]

Thus, for the same sample size \(n\), sampling without replacement produces a less variable estimator of \(\mu\). Why?
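Theorem 2.2.2 can be checked exactly on a toy population by averaging over every possible sample. The following sketch assumes a made-up population of five values and a hypothetical helper `srs_mean_moments`; exact rational arithmetic is used so that the comparison with the formula is not blurred by rounding.

```python
from fractions import Fraction
from itertools import combinations

def srs_mean_moments(u, n):
    """Exact E(ybar) and Var(ybar) over all equally likely samples of size n."""
    means = [Fraction(sum(s), n) for s in combinations(u, n)]
    e = sum(means) / len(means)
    var = sum((m - e) ** 2 for m in means) / len(means)
    return e, var

u = [1, 3, 4, 8, 9]  # hypothetical population values
N, n = len(u), 3
mu = Fraction(sum(u), N)
sigma2 = sum((x - mu) ** 2 for x in u) / N

e, var = srs_mean_moments(u, n)
formula = sigma2 / n * Fraction(N - n, N - 1)  # Theorem 2.2.2
var_iid = sigma2 / n                           # with-replacement variance

print(e == mu, var == formula, var < var_iid)
```

The factor (N − n)/(N − 1) is what separates the two variances: it equals 1 when n = 1 and shrinks to 0 as n approaches N.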


Summary

1. How to draw a simple random sample? (purpose, method) Simple random sampling is the basic survey methodology.

2. After getting a s.r.s., how to describe the population, or how to analyze the data? Estimation of the population mean. (Sample mean.)

Estimation of \(\sigma^2\) and \(\mathrm{Var}(\bar{y})\)

The population variance \(\sigma^2\) is usually unknown. Now define

\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2 = \frac{1}{n-1} \left( \sum_{i=1}^{n} y_i^2 - n(\bar{y})^2 \right). \]

Example. When a few data points are repeated in a data set, the results are often arrayed in a frequency table. For example, a quiz given to 25 students was graded on a 4-point scale 0, 1, 2, 3, with 3 being a perfect score. Here are the results:

Score (X)   Frequency (F)   Proportion (P)
3           16              0.64
2           4               0.16
1           2               0.08
0           3               0.12

(a). Calculate the average score by using frequencies.

(b). Calculate the average score by using proportions.

(c). Calculate the standard deviation.

Solution

If the above 25 students constitute a random sample, then \(s^2 = \frac{n}{n-1} \times 1.0976 = 1.1433\).
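The quantities asked for in (a) to (c) can be computed directly from the table. The sketch below is one possible way to do it, with variable names chosen for illustration; it treats the figure 1.0976 as the variance computed with divisor n, which is the reading that makes the rescaling by n/(n − 1) reproduce 1.1433.

```python
from math import sqrt

scores = [3, 2, 1, 0]
freqs = [16, 4, 2, 3]
n = sum(freqs)                    # 25 students
props = [f / n for f in freqs]    # 0.64, 0.16, 0.08, 0.12

# (a) average from frequencies, (b) average from proportions
mean_by_freq = sum(x * f for x, f in zip(scores, freqs)) / n
mean_by_prop = sum(x * p for x, p in zip(scores, props))

# (c) variance and standard deviation with divisor n
var_div_n = sum(f * (x - mean_by_freq) ** 2 for x, f in zip(scores, freqs)) / n
sd_div_n = sqrt(var_div_n)

# Sample variance s^2 with divisor n - 1
s2 = n / (n - 1) * var_div_n

print(mean_by_freq, mean_by_prop, var_div_n, sd_div_n, s2)
```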

Let us look at some properties of \(s^2\). Is it unbiased?

Theorem 2.2.3

\[ E(s^2) = \frac{N}{N-1} \sigma^2. \]


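Proof.

\[
\begin{aligned}
E s^2 &= \frac{1}{n-1} \left( \sum_{i=1}^{n} E y_i^2 - n E(\bar{y})^2 \right) \\
&= \frac{1}{n-1} \left( \sum_{i=1}^{n} \left[ \mathrm{Var}(y_i) + (E y_i)^2 \right] - n \left[ \mathrm{Var}(\bar{y}) + (E \bar{y})^2 \right] \right) \\
&= \frac{1}{n-1} \left( n \left[ \sigma^2 + \mu^2 \right] - n \left[ \frac{\sigma^2}{n} \frac{N-n}{N-1} + \mu^2 \right] \right) \\
&= \frac{n\sigma^2}{n-1} \left[ 1 - \frac{1}{n} \left( \frac{N-n}{N-1} \right) \right] = \frac{n\sigma^2}{n-1} \left( \frac{nN - n - (N-n)}{n(N-1)} \right) \\
&= \frac{N\sigma^2}{N-1}.
\end{aligned}
\]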
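As a sanity check on Theorem 2.2.3, the expectation of the sample variance can be computed exactly by averaging over all samples of a small population. The population values and the helper name `expected_s2` below are hypothetical; only the formula being compared against comes from the notes.

```python
from fractions import Fraction
from itertools import combinations

def expected_s2(u, n):
    """Exact E(s^2) under simple random sampling of size n from u."""
    values = []
    for s in combinations(u, n):
        ybar = Fraction(sum(s), n)
        values.append(sum((y - ybar) ** 2 for y in s) / (n - 1))
    return sum(values) / len(values)

u = [2, 5, 5, 7, 11, 12]  # hypothetical population values
N = len(u)
mu = Fraction(sum(u), N)
sigma2 = sum((x - mu) ** 2 for x in u) / N

# Theorem 2.2.3: E(s^2) = N / (N - 1) * sigma^2, whatever the sample size.
checks = [expected_s2(u, n) == Fraction(N, N - 1) * sigma2 for n in range(2, N + 1)]
print(checks)
```

So the sample variance slightly overestimates the population variance as defined here (divisor N), by the factor N/(N − 1).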

The next theorem is an easy consequence of the last theorem.

Theorem 2.2.4 \(\hat{\sigma}^2 := \frac{N-1}{N} s^2\) is an unbiased estimator of \(\sigma^2\), i.e.

\[ E\left( \frac{N-1}{N} s^2 \right) = \sigma^2. \]

We shall define

\(f = \dfrac{n}{N}\) to be the sampling fraction,

\(1 - f = 1 - \dfrac{n}{N}\) to be the finite population correction (abbreviated fpc).

Then we have the following theorem.

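Theorem 2.2.5 An unbiased estimator for \(\mathrm{Var}(\bar{y})\) is

\[ \widehat{\mathrm{Var}}(\bar{y}) = \frac{s^2}{n}(1 - f). \]

Proof.

\[ E\, \widehat{\mathrm{Var}}(\bar{y}) = \frac{E s^2}{n}(1 - f) = \frac{N\sigma^2}{n(N-1)} \left( 1 - \frac{n}{N} \right) = \frac{\sigma^2}{n} \left( \frac{N-n}{N-1} \right) = \mathrm{Var}(\bar{y}). \]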

Confidence intervals for µ

It can be shown that the sample average \(\bar{y}\) under simple random sampling is approximately normally distributed provided \(n\) is large (≥ 30, say) and \(f = n/N\) is not too close to 0 or 1.


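Central limit theorem: If \(n \to \infty\) such that \(n/N \to \lambda \in (0, 1)\), then

\[ \frac{\bar{y} - \mu}{\sqrt{\mathrm{Var}(\bar{y})}} \sim N(0, 1) \text{ approximately.} \]

If \(\mathrm{Var}(\bar{y})\) is replaced by its estimator \(\widehat{\mathrm{Var}}(\bar{y})\), we still have

\[ \frac{\bar{y} - \mu}{\sqrt{\widehat{\mathrm{Var}}(\bar{y})}} \sim_{approx.} N(0, 1), \quad \text{as } n/N \to \lambda > 0. \]

Thus,

\[ 1 - \alpha \approx P\left( \left| \frac{\bar{y} - \mu}{\sqrt{\widehat{\mathrm{Var}}(\bar{y})}} \right| \leq z_{\alpha/2} \right) = P\left( \bar{y} - z_{\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\bar{y})} \leq \mu \leq \bar{y} + z_{\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\bar{y})} \right). \]

Therefore, an approximate \((1 - \alpha)\) confidence interval for \(\mu\) is

\[ \bar{y} \mp z_{\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\bar{y})} = \bar{y} \mp z_{\alpha/2} \frac{s}{\sqrt{n}} \sqrt{1 - f}. \]

\(B := z_{\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\bar{y})}\) is called the bound on the error of estimation.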

Example. Suppose that a s.r.s. of size \(n = 200\) is taken from a population of size \(N = 1000\), resulting in \(\bar{y} = 94\) and \(s^2 = 400\). Find a 95% C.I. for \(\mu\).

Solution

\[ 94 \mp 1.96 \times \frac{20}{\sqrt{200}} \sqrt{1 - 1/5} = 94 \mp 2.479. \]
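The interval in this example is easy to reproduce numerically. The function below is a minimal sketch of the confidence interval formula for the mean; its name and signature are assumptions made for illustration, and the critical value z is passed in by hand rather than looked up.

```python
from math import sqrt

def srs_mean_ci(ybar, s2, n, N, z=1.96):
    """Approximate CI for the population mean under simple random sampling.

    Returns (lower, upper, bound), where bound = z * sqrt(s2 / n * (1 - f)).
    """
    f = n / N                          # sampling fraction
    bound = z * sqrt(s2 / n * (1 - f)) # bound on the error of estimation
    return ybar - bound, ybar + bound, bound

lower, upper, bound = srs_mean_ci(ybar=94, s2=400, n=200, N=1000)
print(round(bound, 3), round(lower, 3), round(upper, 3))
```

Dropping the factor (1 − f) would widen the bound to about 2.77, which shows how much the finite population correction matters when a fifth of the population is sampled.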

Example. A simple random sample of \(n = 100\) water meters within a community is monitored to estimate the average daily water consumption per household over a specified dry spell. The sample mean and variance are found to be \(\bar{y} = 12.5\) and \(s^2 = 1252\). If we assume that there are \(N = 10,000\) households within the community, estimate \(\mu\), the true average daily consumption, and find a 95% confidence interval for \(\mu\).

Solution


2.3 Selecting the sample size for estimating population means

population mean

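We have seen that \(\mathrm{Var}(\bar{y}) = \frac{\sigma^2}{n} \left( \frac{N-n}{N-1} \right)\). So the bigger the sample size \(n\) is (but \(\leq N\)), the more accurate our estimate \(\bar{y}\) is. It is of interest to find the minimum \(n\) such that our estimate is within an error bound \(B\) with a certain probability \(1 - \alpha\), say,

\[ P(|\bar{y} - \mu| < B) \approx 1 - \alpha, \]

i.e.,

\[ P\left( \frac{|\bar{y} - \mu|}{\sqrt{\mathrm{Var}(\bar{y})}} < \frac{B}{\sqrt{\mathrm{Var}(\bar{y})}} \right) \approx 1 - \alpha. \]

By the central limit theorem,

\[
\begin{aligned}
\frac{B}{\sqrt{\mathrm{Var}(\bar{y})}} = \frac{B}{\sqrt{\frac{\sigma^2}{n} \left( \frac{N-n}{N-1} \right)}} \approx z_{\alpha/2}
&\iff \frac{\sigma^2}{n} \left( \frac{N-n}{N-1} \right) = \frac{B^2}{z_{\alpha/2}^2} = D \\
&\iff \frac{N}{n} - 1 = \frac{(N-1)D}{\sigma^2} \\
&\iff \frac{N}{n} = 1 + \frac{(N-1)D}{\sigma^2} = \frac{(N-1)D + \sigma^2}{\sigma^2}.
\end{aligned}
\]

Thus,

\[ n \approx \frac{N\sigma^2}{(N-1)D + \sigma^2}, \quad \text{where } D = \frac{B^2}{z_{\alpha/2}^2}. \]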

Remark 1: if \(\alpha = 5\%\), then \(z_{\alpha/2} = 1.96 \approx 2\), so \(D \approx \frac{B^2}{4}\). This coincides with the formula in the textbook (page 93).

Remark 2: the above formula requires knowledge of the population variance \(\sigma^2\), which is typically unknown in practice. However, we can approximate \(\sigma^2\) by the following methods:

1) from pilot studies

2) from previous surveys

3) other studies.


e.g. Suppose that a total of 1500 students are to graduate next year. Determine the sample size \(n\) needed to ensure that the sample average starting salary is within $40 of the population average with probability at least 0.9. From previous studies, we know that the standard deviation of the starting salary is approximately $400.

Solution.

\[ n = \frac{1500 \times 400^2}{1499 \times 40^2 / 1.645^2 + 400^2} = 229.37 \approx 230. \]

e.g. Example 4.5 (p.94, 5th edition). The average amount of money \(\mu\) for a hospital’s accounts receivable must be estimated. Although no prior data are available to estimate the population variance \(\sigma^2\), it is known that most accounts lie within a $100 range. There are 1000 open accounts. Find the sample size needed to estimate \(\mu\) with a bound on the error of estimation of $3 with probability 0.95.

Remark. The solution depends on how one interprets “most accounts”: whether it means 70%, 90%, 95% or 99% of all accounts.

Solution. We need an estimate of \(\sigma^2\). For the normal distribution \(N(0, \sigma^2)\), we have \(P(|N(0, \sigma^2)| \leq 1.96\sigma) = P(|N(0, 1)| \leq 1.96) = 95\%\) and \(P(|N(0, \sigma^2)| \leq 3\sigma) = P(|N(0, 1)| \leq 3) = 99.73\%\). So 95% of accounts lie within a \(4\sigma\) range and 99.73% of accounts lie within a \(6\sigma\) range.

Here \(B = 3\) and \(N = 1000\). If “most” means 95%, we take \(2 \times (2\sigma) = 100\), so \(\sigma = 25\). Then

\[ n = 210.76 \approx 211. \]

If “most” means 99.73%, we take \(2 \times (3\sigma) = 100\), so \(\sigma = 50/3\). Then

\[ n \approx 107. \]
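The sample sizes in the last two examples follow from the formula n ≈ Nσ²/((N − 1)D + σ²) with D = B²/z². A small Python sketch makes the arithmetic reproducible; the function name is an illustrative assumption, and the critical values 1.645 and 1.96 are entered by hand exactly as in the worked solutions.

```python
from math import ceil

def srs_sample_size_mean(N, sigma2, B, z):
    """Sample size so that the mean is within B of mu with the given z value."""
    D = B**2 / z**2
    return N * sigma2 / ((N - 1) * D + sigma2)

# Starting salaries: N = 1500, sigma = 400, B = 40, probability 0.9 (z = 1.645)
n_salary = srs_sample_size_mean(1500, 400**2, 40, 1.645)

# Hospital accounts: N = 1000, B = 3, probability 0.95 (z = 1.96)
n_95 = srs_sample_size_mean(1000, 25**2, 3, 1.96)        # "most" = 4 sigma range
n_3s = srs_sample_size_mean(1000, (50 / 3)**2, 3, 1.96)  # "most" = 6 sigma range

# Round up, since n must be an integer that still meets the bound.
print(ceil(n_salary), ceil(n_95), ceil(n_3s))
```

The two readings of “most” change the required sample size by roughly a factor of two, which is why the guess for σ deserves as much care as the choice of B.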


2.3.1 A quick summary on estimation of population mean

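The population mean is defined to be

\[ \mu = \frac{1}{N}(u_1 + u_2 + \cdots + u_N). \]

Suppose a simple random sample is \(\{y_1, \ldots, y_n\}\).

1) Estimators of the population mean \(\mu\) and variance \(\sigma^2\) are

\[ \hat{\mu} = \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i, \qquad s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2. \]

2) The mean and variance of \(\bar{y}\) are

\[ E\bar{y} = \mu, \qquad \mathrm{Var}(\bar{y}) = \frac{\sigma^2}{n} \left( \frac{N-n}{N-1} \right). \]

3) An estimator of the variance of \(\bar{y}\) is

\[ \widehat{\mathrm{Var}}(\bar{y}) = \frac{s^2}{n}(1 - f), \quad \text{where } f = n/N. \]

4) An approximate \((1 - \alpha)\) confidence interval for \(\mu\) is

\[ \bar{y} \mp z_{\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\bar{y})} = \bar{y} \mp z_{\alpha/2} \frac{s}{\sqrt{n}} \sqrt{1 - f}. \]

5) Minimum sample size \(n\) needed to have an error bound \(B\) with probability \(1 - \alpha\):

\[ n \approx \frac{N\sigma^2}{(N-1)D + \sigma^2}, \quad \text{where } D = \frac{B^2}{z_{\alpha/2}^2}. \]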


2.3.2 Estimation of population total

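The population total is defined to be

\[ \tau = u_1 + u_2 + \cdots + u_N = N\mu. \]

Suppose a simple random sample is \(\{y_1, \ldots, y_n\}\).

1) An estimator of the population total \(\tau\) is

\[ \hat{\tau} = N\bar{y}. \]

2) The mean and variance of \(\hat{\tau}\) are

\[ E\hat{\tau} = \tau, \qquad \mathrm{Var}(\hat{\tau}) = N^2 \frac{\sigma^2}{n} \left( \frac{N-n}{N-1} \right). \]

3) An estimator of the variance of \(\hat{\tau}\) is

\[ \widehat{\mathrm{Var}}(\hat{\tau}) = \widehat{\mathrm{Var}}(N\bar{y}) = N^2 \frac{s^2}{n}(1 - f). \]

Central limit theorem: If \(n \to \infty\) such that \(n/N \to \lambda \in (0, 1)\), then

\[ \frac{\hat{\tau} - \tau}{\sqrt{\mathrm{Var}(\hat{\tau})}} \sim N(0, 1) \text{ approximately.} \]

If \(\mathrm{Var}(\hat{\tau})\) is replaced by its estimator \(\widehat{\mathrm{Var}}(\hat{\tau})\), we still have

\[ \frac{\hat{\tau} - \tau}{\sqrt{\widehat{\mathrm{Var}}(\hat{\tau})}} \sim_{approx.} N(0, 1), \quad \text{as } n/N \to \lambda > 0. \]

Thus,

\[ 1 - \alpha \approx P\left( \left| \frac{\hat{\tau} - \tau}{\sqrt{\widehat{\mathrm{Var}}(\hat{\tau})}} \right| \leq z_{\alpha/2} \right) = P\left( \hat{\tau} - z_{\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\hat{\tau})} \leq \tau \leq \hat{\tau} + z_{\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\hat{\tau})} \right). \]

Therefore, an approximate \((1 - \alpha)\) confidence interval for \(\tau\) is

\[ \hat{\tau} \mp z_{\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\hat{\tau})} = \hat{\tau} \mp z_{\alpha/2} N \frac{s}{\sqrt{n}} \sqrt{1 - f}. \]

\(B := z_{\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\hat{\tau})} = N z_{\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\bar{y})}\) is called the bound on the error of estimation.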


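4) An approximate \((1 - \alpha)\) confidence interval for \(\tau\) is

\[ \hat{\tau} \mp z_{\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\hat{\tau})} = \hat{\tau} \mp z_{\alpha/2} N \frac{s}{\sqrt{n}} \sqrt{1 - f} = N \left( \bar{y} \mp z_{\alpha/2} \frac{s}{\sqrt{n}} \sqrt{1 - f} \right). \]

5) Minimum sample size \(n\) needed to have an error bound \(B\) with probability \(1 - \alpha\):

\[ n \approx \frac{N\sigma^2}{(N-1)D + \sigma^2}, \quad \text{where } D = \frac{B^2}{N^2 z_{\alpha/2}^2}. \]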

Example 4.6. (Page 95 of the textbook). An investigator is interested in estimating the total weight gain in 0 to 4 weeks for \(N = 1000\) chicks fed on a new ration. Obviously, to weigh each bird would be time-consuming and tedious. Therefore, determine the number of chicks to be sampled in this study in order to estimate \(\tau\) within a bound on the error of estimation equal to 1000 grams with probability 95%. Many similar studies on chick nutrition have been run in the past. Using data from these studies, the investigator found that \(\sigma^2\), the population variance, was approximately 36.00 (grams)\(^2\). Determine the required sample size.

Solution
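The notes leave this solution blank; the following is only a sketch of how the sample-size formula for a total, with D = B²/(N² z²), could be applied to the numbers given. The function name is hypothetical, and the choice between z = 1.96 and the rounded value z = 2 (as in Remark 1) is an assumption that changes the answer slightly.

```python
from math import ceil

def srs_sample_size_total(N, sigma2, B, z):
    """Sample size so that N * ybar is within B of the total tau."""
    D = B**2 / (N**2 * z**2)
    return N * sigma2 / ((N - 1) * D + sigma2)

N, sigma2, B = 1000, 36.0, 1000.0

n_exact_z = srs_sample_size_total(N, sigma2, B, z=1.96)
n_round_z = srs_sample_size_total(N, sigma2, B, z=2.0)

# Round up to the next whole chick.
print(ceil(n_exact_z), ceil(n_round_z))
```

Under these assumptions the formula gives n ≈ 121.6 with z = 1.96 and n ≈ 126.0 with z = 2, that is, 122 or 126 chicks depending on the critical value used.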


2.4 Estimation of population proportion

Suppose we are interested in the proportion \(p\) of the population with a specified characteristic. Let

\[ y_i = \begin{cases} 1 & \text{if the } i\text{th element has the characteristic,} \\ 0 & \text{if not.} \end{cases} \]

It is easy to see that \(E(y_i) = E(y_i^2) = p\) (Why?). Therefore, we have

\[ \mu = E(y_i) = p, \]

\[ \sigma^2 = \mathrm{Var}(y_i) = p - p^2 = pq, \quad \text{where } q = 1 - p. \]

The total number of elements in the sample of size \(n\) possessing the specified characteristic is \(\sum_{i=1}^{n} y_i\). Therefore,

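1. An estimator of the population proportion \(p\) is

\[ \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} = \hat{p}, \text{ say.} \]

And an estimator of the population variance \(\sigma^2 = pq\) is

\[
\begin{aligned}
s^2 &= \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2 = \frac{1}{n-1} \left( \sum_{i=1}^{n} y_i^2 - n(\bar{y})^2 \right) \\
&= \frac{1}{n-1} \left( \sum_{i=1}^{n} y_i - n\hat{p}^2 \right) = \frac{1}{n-1} \left( n\hat{p} - n\hat{p}^2 \right) \\
&= \frac{n}{n-1} \hat{p}\hat{q}, \quad \text{where } \hat{q} = 1 - \hat{p}.
\end{aligned}
\]

From Theorems 2.2.2 and 2.2.3, we have

\[ E(\hat{p}) = p, \]

\[ E(s^2) = \frac{N}{N-1} \sigma^2 = \frac{N}{N-1} pq. \qquad (4.1) \]

2. Again, from Theorem 2.2.2, the variance of \(\hat{p}\) is

\[ \mathrm{Var}(\hat{p}) = \frac{\sigma^2}{n} \left( \frac{N-n}{N-1} \right) = \frac{pq}{n} \left( \frac{N-n}{N-1} \right). \]

3. From equation (4.1) and Theorem 2.2.5, an estimator of the variance of \(\hat{p}\) is

\[ \widehat{\mathrm{Var}}(\hat{p}) = \frac{s^2}{n}(1 - f) = \frac{\hat{p}\hat{q}}{n-1}(1 - f). \]

4. An approximate \((1 - \alpha)\) confidence interval for \(p\) is

\[ \hat{p} \mp z_{\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\hat{p})} = \hat{p} \mp z_{\alpha/2} \frac{\sqrt{\hat{p}\hat{q}}}{\sqrt{n-1}} \sqrt{1 - f}. \]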


5. The minimum sample size \(n\) required to estimate \(p\) such that our estimate \(\hat{p}\) is within an error bound \(B\) with probability \(1 - \alpha\) is

\[ n \approx \frac{Npq}{(N-1)D + pq}, \quad \text{where } D = \frac{B^2}{z_{\alpha/2}^2}. \]

Note that the right hand side is an increasing function of \(\sigma^2 = pq\).

a) \(p\) is often unknown, so we can replace it by some estimate (from a previous study, a pilot study, etc.).

b) If we don’t have an estimate of \(p\), we can replace it by \(p = 1/2\), thus \(pq = 1/4\). This is the largest possible value of \(pq\), so the resulting sample size is conservative.

e.g. Suppose that a small town has a population of \(N = 800\) people. Let \(p\) = the proportion of people with blood type A.

(1). What sample size \(n\) must be drawn in order to estimate \(p\) to within 0.04 with probability 0.95?

(2). Suppose that we know no more than 10% of the population have blood type A. Find \(n\) again as in (1). Comment on the difference between (1) and (2).

(3). A simple random sample of size \(n = 200\) is taken and it is found that 7% of the sample has blood type A. Find a 90% confidence interval for \(p\).

Solution. \(N = 800\), \(\alpha = 0.05\), \(B = 0.04\).

(1). Taking \(p = 1/2\) in the formula, we get \(n = 344\).

(2). \(p \leq 0.10\), so \(\sigma^2 = pq \leq 0.09\). A simple calculation yields \(n = 171\).

(3). With \(z_{0.05} = 1.645\), the interval is (0.044, 0.096).
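The three answers can be reproduced with two short functions, one for the sample-size formula in item 5 and one for the confidence interval in item 4. Function names and signatures here are illustrative assumptions; the critical values 1.96 and 1.645 are supplied by hand.

```python
from math import ceil, sqrt

def sample_size_proportion(N, p, B, z):
    """Sample size so that p_hat is within B of p; p is a planning value."""
    pq = p * (1 - p)
    D = B**2 / z**2
    return N * pq / ((N - 1) * D + pq)

def proportion_ci(p_hat, n, N, z):
    """Approximate CI using Var_hat(p_hat) = p_hat*q_hat/(n-1) * (1 - n/N)."""
    var_hat = p_hat * (1 - p_hat) / (n - 1) * (1 - n / N)
    bound = z * sqrt(var_hat)
    return p_hat - bound, p_hat + bound

n1 = sample_size_proportion(800, 0.5, 0.04, 1.96)   # (1): no prior knowledge
n2 = sample_size_proportion(800, 0.1, 0.04, 1.96)   # (2): p at most 10%
lo, hi = proportion_ci(0.07, 200, 800, 1.645)       # (3): 90% interval

print(ceil(n1), ceil(n2), round(lo, 3), round(hi, 3))
```

Knowing that p is at most 10% roughly halves the required sample size, because pq drops from its worst-case value 0.25 to 0.09.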

Example. A simple random sample of \(n = 40\) college students was interviewed to determine the proportion of students in favor of converting from the semester to the quarter system. 25 students answered affirmatively. Estimate \(p\), the proportion of students on campus in favor of the change. (Assume \(N = 2000\).) Find a 95% confidence interval for \(p\).


Solution

2.5 Comparing estimates

Suppose \(x_1, \cdots, x_m\) is a random sample from a population with mean \(\mu_x\) and \(y_1, \cdots, y_n\) is a random sample from a population with mean \(\mu_y\). We are interested in the difference of means \(\mu_y - \mu_x\), which can be estimated without bias by \(\bar{y} - \bar{x}\), as

\[ E(\bar{y} - \bar{x}) = \mu_y - \mu_x. \]

Further,

\[ \mathrm{Var}(\bar{y} - \bar{x}) = \mathrm{Var}(\bar{y}) + \mathrm{Var}(\bar{x}) - 2\,\mathrm{Cov}(\bar{y}, \bar{x}). \]

Remark: If the two samples \(x_1, \cdots, x_m\) and \(y_1, \cdots, y_n\) are independent, then \(\mathrm{Cov}(\bar{y}, \bar{x}) = 0\). However, a more interesting case is when the two samples are dependent, which will be illustrated in the following example.

A dependent example

Suppose an opinion poll asks \(n\) people the question “Do you favor abortion?” The opinions given are

YES, NO, NO OPINION.

Let the proportions of people who answer ‘YES’, ‘NO’, ‘NO OPINION’ be \(p_1\), \(p_2\) and \(p_3\), respectively. In particular, we are interested in comparing \(p_1\) and \(p_2\) by looking at \(p_1 - p_2\). Clearly, the corresponding sample proportions are dependent, since if one is high, the other is likely to be low.

Let \(\hat{p}_1\), \(\hat{p}_2\) and \(\hat{p}_3\) be the three respective sample proportions amongst the sample of size \(n\). Then \(X = n\hat{p}_1\), \(Y = n\hat{p}_2\) and \(Z = n\hat{p}_3\) follow a multinomial distribution with parameters \((n, p_1, p_2, p_3)\). That is,

\[ P(X = x, Y = y, Z = z) = \binom{n}{x, y, z} p_1^x p_2^y p_3^z = \frac{n!}{x!\, y!\, z!} p_1^x p_2^y p_3^z. \]

Please note that

\[ \sum_{x, y, z \geq 0,\ x + y + z = n} \frac{n!}{x!\, y!\, z!} p_1^x p_2^y p_3^z = 1. \]


Question: What is the distribution of \(X\)? (Hint: Classify the people into “Yes” and “Not Yes”.)

Theorem 2.5.1

\[ E(X) = np_1, \quad E(Y) = np_2, \quad E(Z) = np_3, \]

\[ \mathrm{Var}(X) = np_1 q_1, \quad \mathrm{Var}(Y) = np_2 q_2, \]

\[ \mathrm{Cov}(X, Y) = -np_1 p_2. \]

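Proof. \(X\) = number of people saying “YES” \(\sim \mathrm{Bin}(n, p_1)\). So \(EX = np_1\), \(\mathrm{Var}(X) = np_1 q_1\).

Now \(\mathrm{Cov}(X, Y) = E(XY) - (EX)(EY) = E(XY) - n^2 p_1 p_2\). But

\[
\begin{aligned}
E(XY) &= \sum_{x, y \geq 0,\ x + y \leq n} xy\, P(X = x, Y = y) \\
&= \sum_{x, y \geq 1,\ x + y \leq n} xy\, P(X = x, Y = y, Z = n - x - y) \\
&= \sum_{x, y \geq 1,\ x + y \leq n} xy\, \frac{n!}{x!\, y!\, (n - x - y)!} p_1^x p_2^y p_3^{n - x - y} \\
&= \sum_{x, y \geq 1,\ x + y \leq n} \frac{n!}{(x-1)!\, (y-1)!\, (n - x - y)!} p_1^x p_2^y p_3^{n - x - y} \\
&= n(n-1) p_1 p_2 \sum_{x-1,\, y-1 \geq 0,\ (x-1) + (y-1) \leq n-2} \frac{(n-2)!}{(x-1)!\, (y-1)!\, ((n-2) - (x-1) - (y-1))!} p_1^{x-1} p_2^{y-1} p_3^{(n-2) - (x-1) - (y-1)} \\
&= n(n-1) p_1 p_2 \sum_{x_1, y_1 \geq 0,\ x_1 + y_1 \leq n-2} \frac{(n-2)!}{x_1!\, y_1!\, ((n-2) - x_1 - y_1)!} p_1^{x_1} p_2^{y_1} p_3^{(n-2) - x_1 - y_1} \\
&= n(n-1) p_1 p_2 = n^2 p_1 p_2 - n p_1 p_2.
\end{aligned}
\]

Therefore, \(\mathrm{Cov}(X, Y) = E(XY) - n^2 p_1 p_2 = -n p_1 p_2\).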
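The covariance formula can be confirmed by brute force: sum over the whole multinomial probability table and compare with −n p₁p₂. The sketch below uses exact fractions; the helper name and the particular values of n, p₁ and p₂ are arbitrary choices for illustration.

```python
from fractions import Fraction
from math import factorial

def multinomial_moments(n, p1, p2):
    """Exact E(X), E(Y) and Cov(X, Y) for a trinomial (n, p1, p2, 1 - p1 - p2)."""
    p3 = 1 - p1 - p2
    ex = ey = exy = Fraction(0)
    for x in range(n + 1):
        for y in range(n + 1 - x):
            z = n - x - y
            coef = factorial(n) // (factorial(x) * factorial(y) * factorial(z))
            prob = coef * p1**x * p2**y * p3**z
            ex += x * prob
            ey += y * prob
            exy += x * y * prob
    return ex, ey, exy - ex * ey

n = 6
p1, p2 = Fraction(1, 5), Fraction(3, 10)   # hypothetical cell probabilities
ex, ey, cov = multinomial_moments(n, p1, p2)

print(ex == n * p1, ey == n * p2, cov == -n * p1 * p2)
```

The negative sign reflects the fixed total: every respondent who says “YES” is one fewer who can say “NO”.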

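Theorem 2.5.2

\[ E(\hat{p}_1) = p_1, \quad E(\hat{p}_2) = p_2, \]

\[ \mathrm{Var}(\hat{p}_1) = p_1 q_1 / n, \quad \mathrm{Var}(\hat{p}_2) = p_2 q_2 / n, \]

\[ \mathrm{Cov}(\hat{p}_1, \hat{p}_2) = -p_1 p_2 / n. \]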


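Proof. Note that \(\hat{p}_1 = X/n\) and \(\hat{p}_2 = Y/n\). Apply the last theorem.

From the last theorem, we have

\[ \mathrm{Var}(\hat{p}_1 - \hat{p}_2) = \mathrm{Var}(\hat{p}_1) + \mathrm{Var}(\hat{p}_2) - 2\,\mathrm{Cov}(\hat{p}_1, \hat{p}_2) = \frac{p_1 q_1}{n} + \frac{p_2 q_2}{n} + \frac{2 p_1 p_2}{n}. \]

One estimator of \(\mathrm{Var}(\hat{p}_1 - \hat{p}_2)\) is

\[ \widehat{\mathrm{Var}}(\hat{p}_1 - \hat{p}_2) = \frac{\hat{p}_1 \hat{q}_1}{n} + \frac{\hat{p}_2 \hat{q}_2}{n} + \frac{2 \hat{p}_1 \hat{p}_2}{n}. \]

Is it unbiased? No! An unbiased estimator of the variance of \(\hat{p}_1\) is \(\widehat{\mathrm{Var}}(\hat{p}_1) = \frac{\hat{p}_1 \hat{q}_1}{n-1}(1 - f)\). Also, \(E \hat{p}_1 \hat{p}_2 = E(XY)/n^2 = p_1 p_2 (1 - 1/n)\) implies that an unbiased estimator of \(p_1 p_2\) is \(\hat{p}_1 \hat{p}_2 (1 - 1/n)^{-1}\). So

\[ \widehat{\mathrm{Var}}(\hat{p}_1) + \widehat{\mathrm{Var}}(\hat{p}_2) + 2 n^{-1} \hat{p}_1 \hat{p}_2 (1 - 1/n)^{-1} \]

is an unbiased estimator of \(\mathrm{Var}(\hat{p}_1 - \hat{p}_2)\). But it is easier to use

\[ \widehat{\mathrm{Var}}(\hat{p}_1 - \hat{p}_2) = \frac{\hat{p}_1 \hat{q}_1}{n} + \frac{\hat{p}_2 \hat{q}_2}{n} + \frac{2 \hat{p}_1 \hat{p}_2}{n}. \]

Therefore, an approximate \((1 - \alpha)\) confidence interval for \(p_1 - p_2\) is

\[ (\hat{p}_1 - \hat{p}_2) \mp z_{\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\hat{p}_1 - \hat{p}_2)} = (\hat{p}_1 - \hat{p}_2) \mp z_{\alpha/2} \sqrt{\frac{\hat{p}_1 \hat{q}_1}{n} + \frac{\hat{p}_2 \hat{q}_2}{n} + \frac{2 \hat{p}_1 \hat{p}_2}{n}}. \]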

e.g. (From the textbook.) Should smoking be banned from the workplace? A Time/Yankelovich poll of 800 adult Americans carried out on April 6-7, 1994 gave the following results.

                  Nonsmokers   Smokers
Banned            44%          8%
Special areas     52%          80%
No restrictions   3%           11%

Based on a sample of 600 nonsmokers and 200 smokers, estimate and construct a 95% C.I. for

(1) the true difference between the proportions choosing “Banned” between nonsmokers and smokers;

(2) the true difference between the proportions among nonsmokers choosing between “Banned” and “Special areas”.


Solution

A. The proportions choosing “Banned” are independent of each other; a high value of one does not force a low value of the other. Thus, an appropriate estimate of this difference is

\[ 0.44 - 0.08 \pm 2 \sqrt{\frac{0.44 \times 0.56}{600} + \frac{0.08 \times 0.92}{200}} = 0.36 \pm 0.06. \]

B. The proportion of nonsmokers choosing “Special areas” is dependent on the proportion choosing “Banned”; if the latter is large, the former must be small. These are multinomial proportions. Thus, an appropriate estimate of this difference is

\[ 0.52 - 0.44 \pm 2 \sqrt{\frac{0.44 \times 0.56}{600} + \frac{0.52 \times 0.48}{600} + 2 \times \frac{0.44 \times 0.52}{600}} = 0.08 \pm 0.08. \]

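Example. The major league baseball season in the US came to an abrupt end in the middle of 1994. In a poll of 600 adult Americans, 29% blamed the players for the strike, 34% blamed the owners, and the rest held various other opinions. Does the evidence suggest that the true proportions who blame the players and the owners, respectively, are really different?

\(p_1\): proportion of Americans who blamed the players.

\(p_2\): proportion of Americans who blamed the owners.

\[
\begin{aligned}
\widehat{\mathrm{Var}}(\hat{p}_1 - \hat{p}_2) &= \frac{\hat{p}_1 \hat{q}_1}{n} + \frac{\hat{p}_2 \hat{q}_2}{n} + \frac{2 \hat{p}_1 \hat{p}_2}{n} \\
&= \frac{0.29 \times 0.71}{600} + \frac{0.34 \times 0.66}{600} + \frac{2 \times 0.29 \times 0.34}{600} = 1.0458 \times 10^{-3}.
\end{aligned}
\]

So an approximate 95% C.I. for \(p_1 - p_2\) is

\[
\begin{aligned}
0.29 - 0.34 \pm z_{0.025} \sqrt{\widehat{\mathrm{Var}}(\hat{p}_1 - \hat{p}_2)} &= -0.05 \pm 1.96 \times 0.03234 \\
&= (-0.11339, 0.01339).
\end{aligned}
\]

Since this interval contains 0, the poll does not provide evidence at the 5% level that the two proportions differ.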
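The arithmetic in the smoking and baseball examples can be reproduced with two small helpers, one for proportions from independent samples and one for two multinomial proportions from the same sample. The function names are illustrative assumptions; the multipliers 2 and 1.96 are taken from the respective worked solutions.

```python
from math import sqrt

def diff_ci_independent(p1, n1, p2, n2, z):
    """CI for p1 - p2 when the two proportions come from independent samples."""
    var_hat = p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2
    bound = z * sqrt(var_hat)
    return (p1 - p2) - bound, (p1 - p2) + bound, var_hat

def diff_ci_same_sample(p1, p2, n, z):
    """CI for p1 - p2 when both are multinomial proportions from one sample."""
    # The covariance -p1*p2/n enters with a minus sign, hence the + 2*p1*p2 term.
    var_hat = (p1 * (1 - p1) + p2 * (1 - p2) + 2 * p1 * p2) / n
    bound = z * sqrt(var_hat)
    return (p1 - p2) - bound, (p1 - p2) + bound, var_hat

# Smoking poll, part A: "Banned" among nonsmokers vs smokers (independent).
a_lo, a_hi, _ = diff_ci_independent(0.44, 600, 0.08, 200, z=2)
# Smoking poll, part B: "Special areas" vs "Banned" among nonsmokers (dependent).
b_lo, b_hi, _ = diff_ci_same_sample(0.52, 0.44, 600, z=2)
# Baseball poll: players vs owners (dependent).
c_lo, c_hi, c_var = diff_ci_same_sample(0.29, 0.34, 600, z=1.96)

print(round(a_lo, 2), round(a_hi, 2))
print(round(b_lo, 2), round(b_hi, 2))
print(round(c_lo, 5), round(c_hi, 5), c_lo < 0 < c_hi)
```

Treating the baseball proportions as independent would drop the 2p̂₁p̂₂/n term and give an interval that is too narrow, so the choice between the two helpers matters for the conclusion.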
