
10 Hypothesis testing

10.1 Introduction

In this chapter we will study hypothesis testing for a population parameter. There are other types of hypothesis testing in statistics.

We will always have two hypotheses: the null hypothesis, H0, and the alternative hypothesis, Ha. Depending on the data, we will either reject the null hypothesis and accept the alternative hypothesis or not reject the null hypothesis. We start with an example.

Example: In an election between two candidates it takes 50% or more of the votes to win. McSally is one of the candidates. We think she is going to lose and want to test this hypothesis. Let p be the fraction of the voters that will vote for McSally. Our hypotheses are

H0 : p = 0.5 (1)

Ha : p < 0.5 (2)

Note that we take the null hypothesis to be that p = 0.5. We take a poll with 15 people and see how many say they will vote for her. How do we decide between the two hypotheses?

Example: A drug company has a drug, call it A, for lowering blood pressure. They have just developed a new drug, B, that they think is better. They want to test it to decide if they should quit selling drug A and start selling drug B. Let µA be the average amount a patient's blood pressure is lowered by drug A, and let µB be the average amount a patient's blood pressure is lowered by drug B. Our hypotheses are

H0 : µA = µB (3)

Ha : µA > µB (4)

Again, note that the null hypothesis involves an equality.

Hypothesis testing is like a court trial (Wikipedia). The null hypothesis is that the defendant is not guilty. The alternative is that he or she is guilty. The evidence (data) must reach a certain level (beyond a reasonable doubt) for us to reject the null hypothesis in favor of the alternative.


10.2 Elements of a statistical test

Our hypothesis test involves the following elements:

1. Null hypothesis

2. Alternative hypothesis

3. A test statistic

4. Rejection region

We will always take the null hypothesis to be of the form

H0 : θ = θ0 (5)

where θ0 is a known number. The alternative hypothesis can be of three forms. The two-sided alternative is

Ha : θ ≠ θ0    (6)

There are two possible one-sided alternatives:

Ha : θ > θ0 (7)

or

Ha : θ < θ0 (8)

The test statistic is (like all statistics) a function of the random sample. The rejection region is the set of values of the test statistic for which we reject the null hypothesis and so conclude the alternative hypothesis holds. If the test statistic does not fall in the rejection region, we do not reject the null hypothesis. However, we do not accept the null hypothesis. We just conclude that our data does not support the conclusion that the null hypothesis is false.

Example: Return to the election example. We have already stated the hypotheses. We poll n people and let Yn be the number of them that say they will vote for McSally. The test statistic is Yn. If Yn is small enough we should reject H0 and accept Ha. So the rejection region should be of the form Yn ≤ k. What should k be?

Example: Return to the drug example. We take a bunch of patients, randomly divide them into two groups and give one group drug A and the other group drug B. We let YA be the average reduction in the blood pressure in group A and YB the average reduction in the blood pressure in group B. Our test statistic is YA − YB. If it is significantly bigger than 0 we should reject H0 and accept Ha. So the rejection region should be of the form YA − YB ≥ k. If H0 is true, then there is still some probability that YA − YB will be positive. So we should not take the rejection region to be just YA − YB > 0. Obviously k should be positive, but how large should it be?

There is always some chance that the random sample we get is “atypical” and so the conclusion we draw based on it is wrong. There are two possible types of errors.

Definition 1. If H0 is true and we reject it, this is called a type I error. We let α be the probability of a type I error. α is called the level of the test. If Ha is true and we accept H0, this is called a type II error. We let β be the probability of a type II error.

Note that if H0 is true, then we know the value of θ. So we can compute the probability the test statistic falls in the rejection region, i.e., we can compute α. It will just be a number. Of course it depends on the rejection region. But if we know that Ha is true, then we know something about θ, but we don't know the actual value of θ. So when we compute β the probability will depend on θ. So β is a function of θ.

Example: Return to the election example. Suppose we sample 15 people and we take the rejection region to be Y ≤ 2. What is α?

α = P(Y ≤ 2 | p = 0.5) = Σ_{y=0}^{2} (15 choose y) (0.5)^y (1 − 0.5)^(15−y) = 0.00369    (9)

The value of β depends on p. Suppose p = 0.3. Then we have

β = P(Y > 2 | p = 0.3) = Σ_{y=3}^{15} (15 choose y) (0.3)^y (1 − 0.3)^(15−y) = 0.873    (10)

This is not good. We are almost certain to make a type II error even if p is 0.3. To make β smaller we need to make the rejection region bigger. If we do this, then α will increase. So there is a trade-off between α and β. A small rejection region makes α smaller but β larger. A larger rejection region makes α larger but β smaller. To do better overall we need to make the sample size larger. More on this later.
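These binomial error probabilities are easy to check numerically. A minimal R sketch (pbinom is the binomial CDF; the numbers n = 15, p = 0.5 and p = 0.3 are from the example above):

    # alpha = P(Y <= 2) when H0 is true (p = 0.5, n = 15)
    alpha <- pbinom(2, size = 15, prob = 0.5)         # 0.00369
    # beta at p = 0.3: P(Y > 2) = 1 - P(Y <= 2)
    beta <- 1 - pbinom(2, size = 15, prob = 0.3)      # 0.873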

Which is worse - a type I or type II error? That depends very much on the particular problem. In the election example, concluding she will win when she will not is comparable to concluding she will lose when she will in fact win. But consider this example. When a drug company starts testing a new drug they may start with tests just to see if the drug has harmful side effects. To be extreme, suppose they want to test if the drug is actually fatal. Let p be the probability that the drug kills a patient. Take

H0 : p = 0 (11)

Ha : p > 0 (12)

Suppose Ha is true and we mistakenly accept H0 (a type II error) and conclude the drug is safe. This is really bad. The company will kill people. On the other hand, if H0 is true and we mistakenly reject H0 (a type I error), then we will mistakenly conclude the drug is dangerous when it is in fact safe. So we will probably abandon the drug and the company may lose all the money it might have made from the new drug.

10.3 Common large sample tests

Review: one population mean, one population proportion, difference of two population means, difference of two population proportions.

Suppose our hypothesis involves a population mean µ. (So θ is µ.) We have a point estimator for µ, namely the sample mean Y. We could use Y as the test statistic. The mean of Y is µ and its variance is σ²/n. If the sample size is large, then the CLT says that

(Y − µ)/√(σ²/n)    (13)

is approximately a standard normal. Note that this involves the unknown parameter µ, so this is not a valid statistic.

Now suppose our hypotheses are

H0 : µ = µ0 (14)

Ha : µ > µ0 (15)


where µ0 is known. We define

Z = (Y − µ0)/√(σ²/n)    (16)

Note that we used µ0, not µ. So Z is a valid statistic. It does not depend on the unknown µ. If the null hypothesis is true, then the distribution of Z is approximately standard normal. We should reject H0 if Y is significantly larger than µ0, i.e., if Z is significantly larger than 0. We can finally quantify what “significantly larger” should mean since we know the distribution of Z. If H0 is true the values of Z are usually between −2 and 2, so a reasonable choice for the rejection region would be to reject H0 if Z > 2. Note that if the null hypothesis is false, then Z does not have a standard normal distribution since the mean of Y is not µ0.

The rejection region is of the form Z > zc. The probability of a type I error is the probability that Z > zc when the null hypothesis is true. This is just P(Z > zc). So if we have a desired value of α, this determines the cutoff zc. It should just be zα where P(Z > zα) = α. Note that the rejection region Z > zα is the same as

Y > µ0 + (σ/√n) zα    (17)

Example: An assembly line makes widgets. They claim that the number of defective widgets per day is on average 15. We suspect the number is higher than this. We randomly pick 36 days and see how many defective widgets were made on each of those 36 days. The sample mean is 17.0 and the sample variance is 9.0. Test the company's claim with significance level α = 0.05.

Let µ be the average number of defective widgets per day. We take our hypotheses to be

H0 : µ = 15 (18)

Ha : µ > 15 (19)

Our test statistic is

Z = (Y − µ0)/(σ/√n)    (20)


With a significance level of 0.05, z0.05 = 1.645. So our rejection region is Z > 1.645. We have µ0 = 15 and n = 36. We don't know σ² so we approximate it by the sample variance. So

Z = (Y − 15)/(√9/√36)    (21)

In our test we got Y = 17. So Z works out to 4. This is in the rejection region, so we reject H0 and conclude the company is understating the number of defectives.
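For reference, a short R sketch of this large-sample test (the data values are the ones quoted above; qnorm gives the normal quantile):

    ybar <- 17; mu0 <- 15; s2 <- 9; n <- 36
    z <- (ybar - mu0) / sqrt(s2 / n)    # 4
    z_crit <- qnorm(0.95)               # 1.645
    z > z_crit                          # TRUE, so reject H0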

End of lecture on Thurs, 3/22

Example: Comparing visual reaction times of men vs. women. (Reference: Int J Appl Basic Med Res. 2015 May-Aug; 5(2): 124–127.) Suppose we want to test if the reaction times of males and females are different. We will use a significance level of α = 0.05. Subjects watch a screen and when a red dot appears they have to hit the space bar. The study had 60 men and 60 women. The units are milliseconds.

Males: mean 239.70, standard deviation 13.04.
Females: mean 255.50, standard deviation 19.92.
We let µm and µf be the average visual reaction times for males and for females.

H0 : µm = µf (22)

Ha : µm ≠ µf    (23)

The estimator for µm − µf is Ym − Yf. If the null hypothesis is true, then the mean of this estimator is 0. Its variance is σm²/nm + σf²/nf. So we take our test statistic to be

Z = (Ym − Yf)/√(σm²/nm + σf²/nf)    (24)

Note that we now have a two-sided alternative. So we should reject H0 if we get an unusually large value of Z or an unusually small (negative) value. So the rejection region should be of the form |Z| > zc. So given a desired level α, we want to choose zc so that P(|Z| ≥ zc) = α. So zc is zα/2, which is 1.96. So we reject H0 and conclude that the reaction times are different if Z > 1.96 or Z < −1.96. For our data

Z = (239.70 − 255.50)/√((13.04)²/60 + (19.92)²/60) = −5.14    (25)

which is well inside the rejection region. So we conclude there is a difference in reaction times. Note that since we do not know the exact values of σm² and σf², we had to approximate them with the corresponding sample variances in the above.
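A sketch of the same calculation in R, plugging the sample standard deviations in for σm and σf as described above:

    m_mean <- 239.70; f_mean <- 255.50
    m_sd <- 13.04; f_sd <- 19.92; n_m <- 60; n_f <- 60
    z <- (m_mean - f_mean) / sqrt(m_sd^2 / n_m + f_sd^2 / n_f)   # -5.14
    abs(z) > qnorm(0.975)                                        # TRUE, so reject H0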

Suppose we are doing hypothesis testing for two population proportions. So our test statistic is

Z = (p̂A − p̂B)/√(p̂A(1 − p̂A)/nA + p̂B(1 − p̂B)/nB)    (26)

Note that we are estimating pA by p̂A and pB by p̂B in the denominator. If the null hypothesis is of the form H0 : pA = pB, then when H0 is true the two populations have the same p. So it is better to use a pooled estimator for this common parameter p.

Z = (p̂A − p̂B)/√(p̂(1 − p̂)/nA + p̂(1 − p̂)/nB)    (27)

= (p̂A − p̂B)/√(p̂(1 − p̂)(1/nA + 1/nB))    (28)

with

p̂ = (p̂A nA + p̂B nB)/(nA + nB)    (29)

Example: Drug Gemfibrozil lowers bad cholesterol and so hopefully reduces heart attacks. This was a 5-year experiment. Some patients get the drug, some get a placebo. The control group has 2030 subjects and 84 had a heart attack during the 5-year period. The group taking the drug has 2051 subjects and 56 had a heart attack during the 5-year period. Test at the 5% significance level if the drug reduces heart attacks. Let 1 be the control group, 2 the drug group.

H0 : p1 = p2 (30)

Ha : p1 > p2 (31)


We are doing a one-sided alternative and zα = 1.645, so we will reject H0 if Z > 1.645. Pooling,

p̂ = (56 + 84)/(2051 + 2030) = 0.0343    (32)

So

Z = (p̂1 − p̂2)/√(p̂(1 − p̂)(1/n1 + 1/n2)) = 2.47    (33)

So we reject H0. The data supports the conclusion that the drug works.
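A sketch of the pooled two-proportion calculation in R, using the counts from the study:

    x1 <- 84; n1 <- 2030    # control group heart attacks
    x2 <- 56; n2 <- 2051    # drug group heart attacks
    p1 <- x1 / n1; p2 <- x2 / n2
    p_pool <- (x1 + x2) / (n1 + n2)                                  # 0.0343
    z <- (p1 - p2) / sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))     # 2.47
    z > qnorm(0.95)                                                  # TRUE, so reject H0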

Summary: For these four scenarios (one population mean, difference between two population means, one population proportion, difference between two population proportions) we do the following for a level α test. The null hypothesis is H0 : θ = θ0. The test statistic is

Z = (θ̂ − θ0)/σθ̂    (34)

Upper one-sided alternative (Ha : θ > θ0): reject if Z > zα.
Lower one-sided alternative (Ha : θ < θ0): reject if Z < −zα.
Two-sided alternative (Ha : θ ≠ θ0): reject if |Z| > zα/2, i.e., Z > zα/2 or Z < −zα/2.

Which alternative hypothesis?
There are three possible forms of the alternative hypothesis. Which one should be used? The answer depends on the problem/experiment. However, it should not depend on the data. You should decide what Ha is before you see the data.

If taking the null hypothesis to be H0 : µ ≤ µ0 would lead to the same conclusions as H0 : µ = µ0, then the alternative should be µ > µ0. If taking the null hypothesis to be H0 : µ ≥ µ0 would lead to the same conclusions as H0 : µ = µ0, then the alternative should be µ < µ0.

Consider the example of an assembly line making widgets, some of which are defective. The factory says the number of defective widgets per day is on average 15. We think they are doing a worse job than this and the number is actually higher. The null hypothesis is H0 : µ = 15. If we reject H0 then we should conclude that µ > 15. If we had taken H0 to be µ ≤ 15, then when we reject H0 we would still conclude that µ > 15. So the alternative should be Ha : µ > 15.

Now consider the same assembly line making widgets, but now suppose we think they are doing a better job than what they claim, i.e., the average number of defectives is less than 15. The null hypothesis is still µ = 15. Now rejecting the null hypothesis should mean we conclude the average number of defectives is less than 15. If we had taken H0 : µ ≥ 15, then rejecting the null hypothesis would still lead to the conclusion that µ < 15. So we would test against Ha : µ < 15.

Finally, we could think that they just made this number up or that they don't know how to do a proper experiment to estimate this number. So we might want to test against the alternative µ ≠ 15.

Consider the blood pressure drug example. In that example we only cared if the new drug was better, i.e., lowered blood pressure more than the old drug. If it lowered it less, our decision would be the same as if it lowered it the same amount, i.e., stop development of the new drug. But now suppose we are not trying to find a better drug, we are just doing research to understand how the existing drug works. Drug B is a modification of the old drug A which may or may not change its efficacy. So we would test against the alternative Ha : µA ≠ µB.

To illustrate why you should decide what Ha is before you see the data, consider the following example for the blood pressure medications. Suppose we use the test statistic

Z = (YA − YB)/σ_{YA−YB}    (35)

We want to take α = 5%. We are testing if the drugs have different efficacies, so we take H0 : µA = µB and Ha : µA ≠ µB. This is a two-tailed test, so we reject H0 if |Z| > zα/2 = 1.96. Now suppose our data give Z = 1.83. Then we do not reject H0. Suppose instead that we looked at our data before deciding on Ha. We might be tempted to say that it looks like if there is a difference then it is drug A that is better. So we take Ha : µA > µB. Then we would reject if Z > zα = 1.645. Since Z = 1.83, we reject H0 and conclude (possibly incorrectly) that drug A is better.


10.4 Calculating probability of type II error and finding the sample size for Z tests

What is the probability of a type II error? Z does not have the standard normal distribution now, but Y is still normal if the sample size is large. So we can still compute β. It depends on µ, so we write it as β(µ). We start with an example.

Example: Return to the defective widget example. We have a sample of size 36. Our test statistic is

Z = (Y − µ0)/(σ/√n) = (Y − 15)/(1/2)    (36)

With a significance level of 0.05, z0.05 = 1.645. So our rejection region is Z > 1.645. We are interested in type II errors now, so we want to consider what happens when Ha is true. When this happens Z is not standard normal. So we express the rejection region in terms of Y: Z > 1.645 is equivalent to Y > 15 + 1.645/2 = 15.8225. So we accept H0 when Y ≤ 15.8225. Suppose µ = 16. Then

β(16) = P(Y ≤ 15.8225) = P((Y − 16)/(1/2) ≤ (15.8225 − 16)/(1/2))    (37)

= P(Z ≤ −0.355) = 0.361    (38)

If µ = 17,

β(17) = P(Y ≤ 15.8225) = P((Y − 17)/(1/2) ≤ (15.8225 − 17)/(1/2))    (39)

= P(Z ≤ −2.355) = 0.0093    (40)

If µ = 18,

β(18) = P(Y ≤ 15.8225) = P((Y − 18)/(1/2) ≤ (15.8225 − 18)/(1/2))    (41)

= P(Z ≤ −4.355) = 0.000007    (42)
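These three values of β can be computed in one line in R; the cutoff 15.8225 and σ/√n = 1/2 are from the test above:

    mu <- c(16, 17, 18)
    beta <- pnorm((15.8225 - mu) / 0.5)    # 0.361, 0.0093, 0.000007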

End of lecture on Tues, 3/27


Consider how the distribution of Z changes. (Picture of sliding normal.)
Recall that we can decrease β by changing the rejection region. If we enlarge the rejection region then β will decrease. But α will then increase. Suppose we want to keep α at 0.05. Then we can decrease β by increasing the sample size.
Example: We continue with the widget example. We use our existing data to estimate σ. So σ ≈ S = 3. Suppose we want to find the sample size that will make β(16) = 0.05. Now consider a sample of size n. The rejection region is Z > 1.645 and

Z = (Y − 15)/(σ/√n)    (43)

So the rejection region is Y ≥ 15 + 1.645 σ/√n. Now

β(16) = P(Y ≤ 15 + 1.645 σ/√n) = P((Y − 16)/(σ/√n) ≤ ???)    (44)

We can find a general formula for the sample size when we are given a desired α and β(µ). We consider the case where the alternative hypothesis is µ > µ0. So

β(µ) = Pµ(Y ≤ µ0 + (σ/√n) zα)    (45)

= Pµ((Y − µ)/(σ/√n) ≤ (µ0 − µ + (σ/√n) zα)/(σ/√n))    (46)

= Pµ((Y − µ)/(σ/√n) ≤ (µ0 − µ)/(σ/√n) + zα)    (47)

We have put a subscript µ on P to remind ourselves that it depends on µ, since the alternative hypothesis is true when we consider a type II error. If Ha is true, then we do not know µ but it is greater than µ0. So (µ0 − µ)/(σ/√n) is negative. With the sample size fixed, as µ increases, this quantity gets more negative and β(µ) decreases. The larger the sample size is, the faster it decreases. If we want level α and type II error probability β at a particular alternative value µa, the formula for the sample size is

n = (zα + zβ)² σ² / (µa − µ0)²    (48)
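As a quick sketch, here is formula (48) applied to the widget example (σ ≈ 3, µ0 = 15) with µa = 16 and α = β = 0.05; these particular target values are just an illustration:

    sigma <- 3; mu0 <- 15; mua <- 16
    alpha <- 0.05; beta <- 0.05
    n <- (qnorm(1 - alpha) + qnorm(1 - beta))^2 * sigma^2 / (mua - mu0)^2
    ceiling(n)    # 98 days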


10.5 Hypothesis testing vs. confidence intervals

Not many notes here.
The punchline is that in a two-sided test with significance level α, we reject the null hypothesis H0 : θ = θ0 if and only if θ0 is outside the confidence interval with confidence level 1 − α.
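As an illustration of this duality, a sketch in R using the widget numbers from the earlier example, now with a two-sided alternative at α = 0.05:

    ybar <- 17; mu0 <- 15; se <- 0.5            # se = sqrt(9/36)
    ci <- ybar + c(-1, 1) * qnorm(0.975) * se   # (16.02, 17.98)
    # mu0 = 15 is outside the 95% confidence interval,
    # so the two-sided test at level 0.05 rejects H0: mu = 15.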

10.6 p-values

Suppose we are doing a large sample test which uses a test statistic Z that is approximately standard normal. We are doing a two-sided alternative and α = 0.05. We reject if |Z| ≥ 1.96. Now compare two different outcomes of the experiment. In one outcome our data gives Z = 4.2. In the other it gives Z = 2.1. In both cases we reject the null hypothesis. But this does not fully reflect what our data tell us. In the first scenario the value of Z is well inside the rejection region while in the second it is close to the boundary. We could convey more information by actually reporting the value of Z that we got. However, we will eventually look at tests with statistics that have other distributions. So we would like a way to report the result that does not involve the distribution of the test statistic. That is what p-values do.

Definition 2. For a given set of data, the p-value is the smallest value of the level α which would lead to us rejecting the null hypothesis.

Another way to say this is that if we get a value z0 for the test statistic, then p is the probability of a value of the test statistic that would give even stronger evidence to reject H0 than Z = z0. We spell this out for the three possible Ha. Suppose that our data gives Z = z0. If we have an upper-tailed test (Ha : θ > θ0) and the rejection region is Z ≥ k, then

p = P (Z ≥ z0|H0 is true) (49)

If we have a lower-tailed test (Ha : θ < θ0) and the rejection region is Z ≤ k, then

p = P (Z ≤ z0|H0 is true) (50)

If we have a two-tailed test (Ha : θ ≠ θ0) and the rejection region is |Z| ≥ k, then

p = P (|Z| ≥ |z0||H0 is true) (51)


Example: Suppose we are testing

H0 : µ = 22 (52)

Ha : µ < 22 (53)

and we get a Z of −1.53. Then p = P(Z ≤ −1.53) = 0.06309. So we would reject H0 if α = 10%, but we would not reject H0 if α = 5%. What if we got Z = +0.53? Then p = P(Z ≤ 0.53) = 0.702. This is a huge p-value. We would not reject H0 for any reasonable α. Note that a positive value of Z means the sample mean was actually larger than 22, so this certainly does not support the alternative that µ < 22.

Example: The assembly line makes widgets. We were doing a one-sided test:

H0 : µ = 15 (54)

Ha : µ > 15 (55)

For our sample of 36 days we found a sample mean of 17.0 and a sample variance of 9.0. So the test statistic was Z = 4 and p = P(Z ≥ 4) = 3.17 × 10^(−5).
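In R these normal p-values are one-liners (pnorm is the standard normal CDF); the z values are from the two examples above:

    pnorm(-1.53)     # 0.063, lower-tailed test, Ha: mu < 22
    1 - pnorm(4)     # 3.17e-05, upper-tailed test, Ha: mu > 15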

Example: Comparing visual reaction times of men vs. women. We were testing if their average reaction times were different.

H0 : µm = µf (56)

Ha : µm ≠ µf    (57)

The study had 60 men and 60 women.
Males: mean 239.70, standard deviation 13.04.
Females: mean 255.50, standard deviation 19.92.
Our test statistic was

Z = (Ym − Yf)/√(σm²/nm + σf²/nf) = −5.14    (58)

We are doing a two-sided Ha, so p = P(|Z| ≥ 5.14) ≈ 0.

Example: Drug Gemfibrozil to reduce heart attack risk. Let 1 be the control group, 2 the drug group.

H0 : p1 = p2 (59)

Ha : p1 > p2 (60)


Z = (p̂1 − p̂2)/√(p̂(1 − p̂)(1/n1 + 1/n2))    (61)

We reject H0 if Z is large. For our data we got Z = 2.475. So p = P(Z ≥ 2.475) = 0.00666.

End of lecture on Thurs, 3/29

10.7 Comments

10.8 “Small” sample testing

Suppose we want to test a hypothesis concerning the mean of a population. As before Y is a natural statistic to look at. If the sample size is not large, then it need not be normal. Furthermore, the approximation of replacing σ by S is not justified. In this section we assume that the population is normal, so Y is normal. But the replacement of σ by S is still not justified. Recall that

(Y − µ)/(s/√n)    (62)

has a t-distribution with n − 1 degrees of freedom. As before we consider tests where the null hypothesis is H0 : µ = µ0 with µ0 known and the alternative is one of Ha : µ < µ0, Ha : µ > µ0, Ha : µ ≠ µ0. We take the test statistic to be

T = (Y − µ0)/(s/√n)    (63)

Note that we have µ0 here, not the unknown µ. So if the null hypothesis is true then T has a t-distribution, but if it is not true it does not.

Example: (from the book) A new gunpowder manufacturer claims the muzzle velocity for it is 3000 ft/sec. We want to test the claim that it is this high with α = 2.5%. We test 8 shells and find an average velocity of 2959 ft/sec with a standard deviation of 39.1 ft/sec.

H0 : µ = 3000 (64)

Ha : µ < 3000 (65)


R says that qt(0.025, 7) = −2.364. So we should reject the null if T < −2.364. For the test statistic we find

T = (Y − 3000)/(s/√n) = (2959 − 3000)/(39.1/√8) = −2.966    (66)

So we reject the null and conclude that the manufacturer is wrong. The average muzzle velocity is less than 3000. The p-value is P(T ≤ −2.966) = pt(−2.966, 7) = 0.0105.
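A sketch of this small-sample t test in R (qt and pt are the t quantile and CDF used above):

    ybar <- 2959; mu0 <- 3000; s <- 39.1; n <- 8
    t_stat <- (ybar - mu0) / (s / sqrt(n))   # -2.966
    qt(0.025, df = n - 1)                    # -2.364, rejection cutoff
    pt(t_stat, df = n - 1)                   # 0.0105, the p-value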

Now suppose we have two populations with means µ1 and µ2. We want to test a hypothesis involving µ1 − µ2. For large samples, we could assume Y1 − Y2 was normal and we could replace σ1 by S1 and σ2 by S2. We now consider small samples, but add the assumption that the populations are normal and they have the same variance σ². In this case we estimate this common variance by the pooled estimator

Sp² = [(n1 − 1)S1² + (n2 − 1)S2²] / (n1 + n2 − 2)    (67)

In this case

[Y1 − Y2 − (µ1 − µ2)] / [Sp √(1/n1 + 1/n2)]    (68)

has a t distribution with n1 + n2 − 2 d.f. We assume the null hypothesis is H0 : µ1 = µ2. Then we take our test statistic to be

T = (Y1 − Y2 − 0) / [Sp √(1/n1 + 1/n2)]    (69)

Example: Does adding a calcium supplement lower your blood pressure? Take 21 subjects. 10 take the supplement (group 1) and 11 take a placebo (group 2) for 12 weeks. We measure their BP before and after the 12 weeks and find the decrease in BP. We test at the α = 10% level. There are 10 + 11 − 2 = 19 d.f. And R says qt(0.1, 19) = −1.328. So we should reject H0 if T > 1.328.

For group 1 the average decrease was 5.000 with S = 8.743. For group 2 the average decrease was −0.273 with S = 5.901. We find SE = 3.287 and T = 1.604. So we reject H0 and conclude the supplement does help lower BP. The p-value is P(T > 1.604) = 0.0620. So if we had tested at the α = 5% level we would not have rejected H0.

End of lecture on Tues, 4/3

10.9 Tests involving the variance

We now consider tests involving the population variance. We start with a single population with variance σ² and consider testing H0 : σ² = σ0² against one of the alternatives

Ha : σ² > σ0²,   Ha : σ² < σ0²,   Ha : σ² ≠ σ0²    (70)

The natural statistic to look at is S². We assume that the population is normal. Recall that in this case,

(n − 1)S²/σ²    (71)

has a χ² distribution with n − 1 df. We define our test statistic to be

χ² = (n − 1)S²/σ0²    (72)

Note that we use the null hypothesis value in this definition. So if the null hypothesis is true, then χ² will have a χ² distribution.

Note that the χ² distribution is not symmetric. So in a two-tailed test our rejection region is not symmetric. Let χ²_α be the number such that P(χ² ≥ χ²_α) = α. The rejection regions are

Ha : σ² > σ0²    (73)
RR : χ² > χ²_α    (74)

Ha : σ² < σ0²    (75)
RR : χ² < χ²_{1−α}    (76)

Ha : σ² ≠ σ0²    (77)
RR : χ² < χ²_{1−α/2} or χ² > χ²_{α/2}    (78)


Example: A company produces pipes. It is important that the lengths be very nearly the same, i.e., the variance in the lengths is small. They claim that the standard deviation of the length is at most 1.2 cm. In a sample of 25 pipes we find a sample standard deviation of 1.5 cm. Test the company's claim at the 5% significance level.

H0 : σ = 1.2 (79)

Ha : σ > 1.2 (80)

R says that qchisq(0.95, 24) = 36.42. So we will reject H0 if χ² > 36.42. For our data the value of the test statistic is

χ² = (n − 1)S²/σ0² = 24(1.5)²/(1.2)² = 37.5    (81)

So we reject H0. The data provides evidence that the company's claim is not correct. The p-value is p = 1 − pchisq(37.5, 24) = 0.0390.
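A sketch of the same variance test in R:

    s2 <- 1.5^2; sigma0_2 <- 1.2^2; n <- 25
    chi2 <- (n - 1) * s2 / sigma0_2       # 37.5
    qchisq(0.95, df = n - 1)              # 36.42, rejection cutoff
    1 - pchisq(chi2, df = n - 1)          # 0.039, the p-value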

Example: A manufacturer of hard hats tests them by applying a large force to the top of the helmet and seeing how much force is transmitted to the head. They claim that at most 800 lbs of force is transmitted on average and the standard deviation is 40 lbs. We want to test if the value of 40 for the standard deviation is correct. We will use α = 5%.

H0 : σ = 40 (82)

Ha : σ ≠ 40    (83)

The test statistic is

χ² = (n − 1)S²/σ0²    (84)

The rejection region is χ² < 23.65 or χ² > 58.12. For our data χ² = 57.336, so we do not reject H0. Since this is a two-tailed test, the p-value is

p = 2P(χ² ≥ 57.336) = 2 × 0.029 = 0.058    (85)
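The sample size is not stated above; the cutoffs 23.65 and 58.12 are the 2.5% and 97.5% points of a χ² distribution with 39 degrees of freedom, so the sketch below assumes n = 40:

    n <- 40                                # assumed; it reproduces the quoted cutoffs
    qchisq(c(0.025, 0.975), df = n - 1)    # 23.65, 58.12
    chi2 <- 57.336
    2 * (1 - pchisq(chi2, df = n - 1))     # about 0.058, the two-tailed p-value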

Now suppose we have two normal populations and we want to test if they have the same variance. So the null hypothesis is H0 : σ1² = σ2². The three possible alternative hypotheses are

Ha : σ1² > σ2²,   Ha : σ1² < σ2²,   Ha : σ1² ≠ σ2²    (86)

We review the definition of the F-distribution.


Definition 3. (F-distribution) Let W1 and W2 be independent RVs with χ² distributions with ν1 and ν2 degrees of freedom. Define

F = (W1/ν1)/(W2/ν2)    (87)

Then the distribution of F is called the F-distribution with ν1 numerator degrees of freedom and ν2 denominator degrees of freedom.

If we have two normal populations with variances σ1² and σ2², and we take random samples from each one with sizes n1 and n2 and sample variances S1² and S2², then we know that (ni − 1)Si²/σi² have χ² distributions. So the following has an F-distribution:

(S1²/σ1²)/(S2²/σ2²)    (88)

If the null hypothesis is true, then this simplifies to S1²/S2². So we define our test statistic to be

F = S1²/S2²    (89)

Under the null hypothesis the distribution of F is the F-distribution with n1 − 1 numerator degrees of freedom and n2 − 1 denominator degrees of freedom.

Let Fα be the number such that P(F ≥ Fα) = α. Note that it depends on n1 and n2. For the three possible alternative hypotheses our rejection region (RR) is

Ha : σ1² > σ2²    (90)
RR : F > Fα    (91)

Ha : σ1² < σ2²    (92)
RR : F < F_{1−α}    (93)

Ha : σ1² ≠ σ2²    (94)
RR : F < F_{1−α/2} or F > F_{α/2}    (95)

We can compute values of Fα using R. The order of arguments is numerator df, then denominator df. For example, qf(0.95, n, m) will give F_{0.05} for n numerator df and m denominator df.


Example: A psychologist was interested in exploring whether or not male and female college students have different driving behaviors. The particular statistical question she framed was as follows:

Is the mean fastest speed driven by male college students different than the mean fastest speed driven by female college students?

The psychologist conducted a survey of a random sample of n = 34 male college students and a random sample of m = 29 female college students. We take population 1 to be the female population and population 2 to be the male population. The data is

Y1 = 90.9, S1 = 12.2    (96)

Y2 = 105.5, S2 = 20.1    (97)

We want to test at the α = 5% level if the variances of the two populations are the same. We are doing a two-tailed test. R tells us that qf(0.025, 28, 33) = 0.47869 and qf(0.975, 28, 33) = 2.04407. So we will reject H0 if F > 2.04407 or F < 0.47869. For our data F = (12.2)²/(20.1)² = 0.368. So we reject H0 and conclude the variances are not equal.
To find the p-value, remember this is a two-tailed test. So

p = 2P (F < 0.368) = 2× 0.004250 = 0.008500 (99)
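A sketch of the F computation in R (qf and pf take numerator df then denominator df, as noted above):

    s1 <- 12.2; s2 <- 20.1; n1 <- 29; n2 <- 34   # population 1 = females, 2 = males
    f <- s1^2 / s2^2                             # 0.368
    qf(c(0.025, 0.975), n1 - 1, n2 - 1)          # 0.479, 2.044
    2 * pf(f, n1 - 1, n2 - 1)                    # 0.0085, the two-tailed p-value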

If F has an F distribution, then 1/F will also have an F distribution but with the numbers of degrees of freedom switched. So in our example, instead of computing qf(0.975, 28, 33) = 2.04407, we could have used 1/qf(0.025, 33, 28) = 2.044073. This was a big deal when we had to use tables, not a big deal now.

10.10 Power of a test and the Neyman-Pearson Lemma

Consider a test involving a parameter θ and suppose the null hypothesis is H0 : θ = θ0 and the alternative is the two-sided Ha : θ ≠ θ0. The power of a test is closely related to the probability of a type II error. Recall that a type II error is accepting H0 when it is not true. The probability of this depends on the actual value of the parameter θ. So we have been denoting it by

β(θ) = P (accept H0|θ) (100)


where θ is not θ0. The power is just 1− β(θ):

Definition 4.

power(θ) = P (reject H0|θ) (101)

The power when θ = θ0 is the probability we reject H0 when it is in fact true. This is just α, the probability of a type I error. So the power at θ = θ0 is α. Typically the power will be a continuous function of θ. So it will still be close to α when θ is close to θ0. Typically it will approach 1 as θ moves away from θ0.

(Picture of typical power function, Ha : θ ≠ θ0)

Now suppose we are testing with a one-sided alternative. Consider first the alternative Ha : θ > θ0. When θ = θ0 the power will again be α. As θ increases from θ0, the probability we reject H0 increases and so the power increases, approaching 1 as θ gets farther away from θ0. On the other side, as θ decreases from θ0 the probability we reject H0 will be even smaller than α. So the graph of the power function looks like:

(Picture of typical power function, Ha : θ > θ0)

If the alternative is Ha : θ < θ0, the graph looks like

(Picture of typical power function, Ha : θ < θ0)

Next we define simple and composite hypotheses. Suppose the population pdf has just one unknown parameter θ. Under the null hypothesis H0 : θ = θ0, the population distribution is completely determined. However, under the alternative hypothesis it is not. The null hypothesis is an example of a simple hypothesis; the alternative is an example of a composite hypothesis.

Definition 5. A hypothesis is a simple hypothesis if it completely specifies thedistribution of the population. Otherwise it is called a composite hypothesis.

Until now our alternative hypothesis has always been of the form θ ≠ θ0, θ < θ0, or θ > θ0. Now we will also consider alternative hypotheses of the form Ha : θ = θa where θa is not equal to θ0 and is known. In this case the alternative hypothesis is simple.


Lemma 1. (the Neyman-Pearson lemma) Suppose we want to test the null hypothesis H0 : θ = θ0 versus the alternative Ha : θ = θa using a random sample Y1, Y2, · · · , Yn from a population which depends on a parameter θ. Given a value of α, the test that maximizes the power at θa is the test with rejection region

L(y1, · · · , yn|θ0) / L(y1, · · · , yn|θa) < k    (102)

where the constant k is chosen so that the probability of a type I error is α. Such a test is called the most powerful α-level test for H0 vs. Ha.

The theorem does not say anything when the alternative hypothesis is composite. But in some situations it can. Suppose the alternative is Ha : θ > θ0 and suppose that when we find the rejection region in the theorem, it does not depend on the value of θa. It only depends on α. Then the test is the most powerful test for the composite alternative hypothesis Ha : θ > θ0. We say that the test is the uniformly most powerful test for H0 : θ = θ0 vs Ha : θ > θ0.

Example: Suppose the population is normal with unknown mean µ but known variance σ². So the likelihood function is

L(y1, · · · , yn|µ) = (2πσ²)^(−n/2) exp(−Σi (yi − µ)²/(2σ²))    (103)

So

L(y1, · · · , yn|µ0) / L(y1, · · · , yn|µa) = exp[−Σi (yi − µ0)²/(2σ²) + Σi (yi − µa)²/(2σ²)]    (104)

So the rejection region is

−Σi (yi − µ0)²/(2σ²) + Σi (yi − µa)²/(2σ²) < ln k    (105)

which can be rewritten as

Σi yi (µ0 − µa) < (n/2)(µ0² − µa²) + σ² ln k    (106)

If µa < µ0, this is equivalent to Y < c where the constant c depends on k, µ0, µa, n and σ². If µa > µ0, this is equivalent to Y > c. The constant k is determined by the requirement that the probability of a type I error should be α. So we might as well forget about k and just solve for c. It is determined by

P(Y > c | µ = µ0) = α    (107)

As we have seen before this gives c = µ0 + zα σ/√n. It does not depend on the value of µa. So the rejection region does not depend on the value of µa. So in this case the test is the uniformly most powerful test.
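A sketch of this most powerful test in R, with illustrative values µ0 = 0, σ = 1, n = 25, α = 0.05 (these numbers are hypothetical, chosen only to show the calculation):

    mu0 <- 0; sigma <- 1; n <- 25; alpha <- 0.05
    c_cut <- mu0 + qnorm(1 - alpha) * sigma / sqrt(n)   # reject H0 if ybar > c_cut
    # power at a particular alternative mu_a > mu0, say mu_a = 0.5
    mu_a <- 0.5
    power <- 1 - pnorm((c_cut - mu_a) / (sigma / sqrt(n)))   # about 0.80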

Example: Suppose we want to test a population proportion. So the population pdf is just

f(y|p) = p^y (1 − p)^(1−y)    (108)

where y can only be 0 or 1. So the likelihood function is a binomial distribution:

L(y1, · · · , yn|p) = p^(Σi yi) (1 − p)^(n − Σi yi) = (1 − p)^n [p/(1 − p)]^(Σi yi)    (109)

where all the sums on i are from 1 to n. So

L(y1, · · · , yn|p0) / L(y1, · · · , yn|pa) = [(1 − p0)/(1 − pa)]^n [p0(1 − pa)/(pa(1 − p0))]^(Σi yi)    (110)

So the rejection region is

[p0(1 − pa)/(pa(1 − p0))]^(Σi yi) < k′    (111)

where k′ depends on k and p0, pa. Taking the logarithm, this is equivalent to

(Σi yi) ln[p0(1 − pa)/(pa(1 − p0))] < ln k′    (112)

A little algebra shows p0 > pa if and only if

p0(1 − pa)/(pa(1 − p0)) > 1    (113)

So we see that if p0 > pa then the rejection region is of the form Y < c. The value of k is determined by α, but as in the previous example we might as well forget k and just find c by the requirement that P(Y < c) = α. As before this leads to c = p0 − zα √(p0(1 − p0)/n). If p0 < pa then the rejection region is of the form Y > c, and c = p0 + zα √(p0(1 − p0)/n). In both cases we find that the rejection region does not depend on the value of pa. So the test is the uniformly most powerful test.

Sufficient statistic: Suppose there is a sufficient statistic U for θ. So by the factorization theorem

L(y1, · · · , yn|θ) = g(u, θ)h(y1, · · · , yn) (114)

Since h does not depend on θ, this gives

L(y1, · · · , yn|θ0) / L(y1, · · · , yn|θa) = g(u, θ0)/g(u, θa)    (115)

So when there is a sufficient statistic, the rejection region for the test from the Neyman-Pearson lemma depends on the random sample only through the sufficient statistic.

End of lecture on Tues, 4/17

Proof of Neyman-Pearson lemma
We need a little notation. Let d(y1, · · · , yn) be the function which is 1 if the test in the Neyman-Pearson lemma says we should reject H0 and is 0 if it does not. So

d(y1, · · · , yn) = 1 if L(y1, · · · , yn|θ0) < k L(y1, · · · , yn|θa),
d(y1, · · · , yn) = 0 if L(y1, · · · , yn|θ0) ≥ k L(y1, · · · , yn|θa)    (116)

Suppose we have another test with the same α. Let d′(y1, · · · , yn) be the function which is 1 if this test says we should reject H0 and is 0 if it does not. The power of the Neyman-Pearson test at θ = θa is

P(d(y1, · · · , yn) = 1 | θ = θa) = ∫ d(y1, · · · , yn) L(y1, · · · , yn|θa) dy    (117)

where the integral is over Rn and dy is shorthand for dy1 · · · dyn. The power of the other test is

P(d′(y1, · · · , yn) = 1 | θ = θa) = ∫ d′(y1, · · · , yn) L(y1, · · · , yn|θa) dy    (118)


We claim that

[d(y1, · · · , yn)− d′(y1, · · · , yn)][kL(y1, · · · , yn|θa)− L(y1, · · · , yn|θ0)] ≥ 0

Note that both d and d′ only take on the values 0 and 1. So if d > d′ we must have d = 1. So in this case kL(y1, · · · , yn|θa) − L(y1, · · · , yn|θ0) > 0. This verifies the claim in the case that d > d′. If d < d′ we must have d = 0. So in this case kL(y1, · · · , yn|θa) − L(y1, · · · , yn|θ0) ≤ 0. This verifies the claim in the other case. So the claim is proved. Now integrate the claim over Rn. Note that

∫ [d(y1, · · · , yn) − d′(y1, · · · , yn)] L(y1, · · · , yn|θ0) dy = α − α = 0    (119)

So we get

k ∫ d(y1, · · · , yn) L(y1, · · · , yn|θa) dy ≥ k ∫ d′(y1, · · · , yn) L(y1, · · · , yn|θa) dy

By equations 117 and 118 this says that the power of the test from the Neyman-Pearson lemma is at least as large as the power from the other test. This completes the proof.

Example: Population has Poisson distribution with parameter λ. So

f(y|λ) = e^(−λ) λ^y / y!,   y = 0, 1, 2, · · ·    (120)

We want to test H0 : λ = λ0 vs. Ha : λ = λa. We will do the case of λa > λ0.

L(y1, · · · , yn|λ) = e^(−nλ) λ^(Σi yi) / Πi yi!    (121)

Since the parameter is λ we will denote the test statistic by Λ in this example.

Λ = e^(n(λa − λ0)) (λ0/λa)^(Σi yi)    (122)

The rejection region is then given by Λ < k, which we rewrite as

(λ0/λa)^(Σi yi) < k′    (123)


where k′ is .... Next we take the log and note that since λa > λ0, ln(λ0/λa) < 0. So our rejection region can be written simply as y > c. As always, c is chosen to make the probability of a type I error be α. If the sample size is large, then Y is approximately normal. When H0 is true, Y has mean λ0 and variance λ0/n. So standardizing,

P(Y > c | λ = λ0) = P(Z ≥ (c − λ0)/√(λ0/n))    (124)

So (c − λ0)/√(λ0/n) = zα. So our RR becomes

y ≥ λ0 + zα √(λ0/n)    (125)

Note that this rejection region does not depend on λa except for the assumption that λa > λ0. So if the alternative hypothesis is Ha : λ > λ0, then this rejection region gives a uniformly most powerful test.
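A sketch of this rejection region in R with hypothetical numbers λ0 = 5, n = 50, α = 0.05 (chosen only to illustrate the formula; they are not from the notes):

    lambda0 <- 5; n <- 50; alpha <- 0.05
    cutoff <- lambda0 + qnorm(1 - alpha) * sqrt(lambda0 / n)   # reject H0 if ybar >= cutoff
    cutoff    # about 5.52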

If λa < λ0 we find that the RR is of the form

y ≤ λ0 − zα √(λ0/n)    (126)

This does not depend on λa, other than the fact that λa < λ0, so if the alternative hypothesis is Ha : λ < λ0, then we get a uniformly most powerful test.

If we want to test with a two-sided alternative Ha : λ ≠ λ0, then there will not be a uniformly most powerful test. For two-sided tests there usually do not exist uniformly most powerful tests.

10.11 Likelihood ratio tests

In this section we use the likelihood ratio to develop a very general test for hypotheses. We allow any number of parameters θ1, · · · , θn. We denote them by Θ. So Θ takes values in Rn. Let Ω0 and Ωa be subsets of Rn. The hypotheses are

H0 : Θ ∈ Ω0,    (127)

Ha : Θ ∈ Ωa    (128)

The only constraint on Ω0 and Ωa is that they be disjoint. These will be composite hypotheses unless Ω0 or Ωa just consists of a single point. We let Ω = Ω0 ∪ Ωa. To keep the notation simple, we will denote the likelihood function L(y1, · · · , yn|Θ) by just L(Θ).

Definition 6. The likelihood ratio test for level α is defined as follows. The test statistic is

λ = max_{Θ∈Ω0} L(Θ) / max_{Θ∈Ω} L(Θ)    (129)

The rejection region is of the form λ < k where the constant k is chosen so that

max_{Θ∈Ω0} P(reject H0 | Θ) = α    (130)

Example: Consider a normal population with variance 1 and unknown mean µ. So

f(y|µ) = (1/√(2π)) exp(−(y − µ)²/2)    (131)

We want to test

H0 : µ = µ0, (132)

Ha : µ > µ0 (133)

So Ω0 just consists of the single point µ0 and Ωa is (µ0, ∞). And we have Ω = [µ0, ∞). The likelihood function is

L(µ) = (2π)^(−n/2) exp(−(1/2) Σi (yi − µ)²)    (134)

Finding the maximum of L(µ) over Ω0 is trivial. It is just L(µ0). Finding the maximum of L(µ) over Ω takes a little calculus. As we often do, the algebra is a bit simpler if we look at ln L(µ):

ln L(µ) = −(n/2) ln(2π) − (1/2) Σi (yi − µ)²    (135)

So

(d/dµ) ln L(µ) = Σi (yi − µ) = n(y − µ)    (136)


So there is one critical point at µ = y. Note however that this value of µ can be outside Ω. So the max occurs at µ = y if y ≥ µ0, and at µ = µ0 if y < µ0. Note that since the alternative is µ > µ0, if we get a sample with y < µ0, any reasonable test would not reject the null hypothesis. If y ≥ µ0, then

λ = L(µ0)/L(y) = exp(−(1/2) Σi (yi − µ0)² + (1/2) Σi (yi − y)²)    (137)

= exp(nµ0 y − (1/2)n y² − (1/2)n µ0²) = exp(−(1/2) n (y − µ0)²)    (138)

So the rejection region λ < k is equivalent to |y − µ0| > c for some constant c. Remember that we are doing the case of y ≥ µ0. So this is equivalent to y ≥ µ0 + c. The constant c is determined by requiring the probability of a type I error to be α.

Example (continued): We continue the example above but now consider a composite null hypothesis:

H0 : µ ≤ µ0, (139)

Ha : µ > µ0 (140)

We need to find the max of L(µ) over µ ≤ µ0. There is one critical point at µ = y. So if y < µ0 the max is L(y) and if y ≥ µ0 the max is L(µ0). First consider what happens if y < µ0. Then the max in the numerator is at µ = y and the max in the denominator is at the same value. So the likelihood ratio will be 1, which is not in the rejection region, so we do not reject H0. So from now on we just look at the case that y ≥ µ0. So the max in the numerator is L(µ0). Now the computation goes just as in the previous example.

End of lecture on Thurs, 4/19

Example (book): Normal with σ² and µ both unknown.

H0 : µ = µ0, (141)

Ha : µ > µ0 (142)

So

f(y|µ, σ) = (1/(σ√(2π))) exp(−(y − µ)²/(2σ²))    (143)

L(y1, · · · , yn) = (2πσ²)^(−n/2) exp(−Σi (yi − µ)²/(2σ²))    (144)

First we find the max over Ω0. This means µ is fixed at µ0 but σ² can be any positive number. So we need to maximize L as a function of σ². Some calculus shows the max occurs at

σ̂0² = (1/n) Σi (yi − µ0)²    (145)

Thus

max_{Ω0} L(µ, σ²)    (146)

= (2πσ̂0²)^(−n/2) exp(−Σi (yi − µ0)²/(2σ̂0²)) = (2π)^(−n/2) (σ̂0²)^(−n/2) e^(−n/2)    (147)

Next we need to maximize L over Ω. So µ ≥ µ0 and σ² can be any positive number. The max over σ² goes as before. It occurs at

σ̂² = (1/n) Σi (yi − µ)²    (148)

Now maximize over µ ≥ µ0. Taking the derivative with respect to µ of ln L, we find one critical point at µ = y. But as before this may be outside of [µ0, ∞). When it is outside, the max occurs at µ = µ0. So we find that the max is at µ̂ where µ̂ = y if y ≥ µ0 and µ̂ = µ0 if y < µ0. Thus we find

max_Ω L(µ, σ²) = (2π)^(−n/2) (σ̂²)^(−n/2) e^(−n/2)    (149)

where σ̂² is evaluated at µ = µ̂.

So the likelihood ratio is

λ = max_{Θ∈Ω0} L(Θ) / max_{Θ∈Ω} L(Θ)    (150)

= (σ̂0²)^(−n/2) / (σ̂²)^(−n/2)    (151)

= [Σi (yi − y)² / Σi (yi − µ0)²]^(n/2) if y ≥ µ0, and = 1 if y < µ0    (152)


The rejection region is λ < k. We will always have k < 1, so the second case in the above does not matter. So we can rewrite λ < k as

Σi (yi − y)² / Σi (yi − µ0)² < k′    (153)

where k′ = k^(2/n). Recall that

s² = (1/(n − 1)) Σi (yi − y)²    (154)

And

Σi (yi − µ0)² = Σi (yi − y + y − µ0)² = Σi (yi − y)² + n(y − µ0)²    (155)

After some algebra we find that we can write the rejection region as

(y − µ0)/(s/√n) ≥ c    (156)

Then we find c to make the probability of a type I error be α.
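Here is a sketch of that algebra. Using (155) to rewrite the denominator of (153), the rejection region becomes

    Σi (yi − y)² / [Σi (yi − y)² + n(y − µ0)²] < k′

which is equivalent to

    n(y − µ0)² / Σi (yi − y)² > 1/k′ − 1,   i.e.,   (y − µ0)² / (s²/n) > (n − 1)(1/k′ − 1).

Since we are in the case y ≥ µ0, taking square roots gives (y − µ0)/(s/√n) ≥ c with c = √((n − 1)(1/k′ − 1)).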

In order to carry out a likelihood ratio test we need to be able to find k. In our examples so far, the test statistic λ was relatively simple and we could do this explicitly. This need not be the case, as the following example shows.

Example: Two plants manufacture widgets. We look at the number of defects they make each day. We assume the distribution of the number of defects follows a Poisson distribution with parameter θ1 for plant 1 and θ2 for plant 2. We want to test whether the two defect rates are equal with a significance level of α = 1%. We randomly choose 100 days for each of the plants and observe how many defects occur on each of those days. For plant 1 we find a total of 2072 defects from the 100 days. For plant 2 we find a total of 2265 defects from the 100 days.

Note that we have two populations here. We use x1, · · · , x100 to denote the random sample from population 1 and y1, · · · , y100 the random sample from population 2. To keep the notation under control we will denote these 100-tuples by just x and y. The likelihood function is

L(x, y|θ1, θ2) = (1/k) θ1^(Σi xi) e^(−nθ1) θ2^(Σi yi) e^(−nθ2)    (157)


where

k = Πi xi! Πi yi!    (158)

The hypotheses are

H0 : θ1 = θ2, (159)

Ha : θ1 ≠ θ2    (160)

So

Ω0 = {(θ, θ) : θ > 0}    (161)

Ωa = {(θ1, θ2) : θ1, θ2 > 0, θ1 ≠ θ2}    (162)

To maximize the likelihood over Ω0, we need to compute

max_θ (1/k) θ^(Σi xi + Σi yi) e^(−2nθ)    (164)

The max occurs at

θ̂ = (Σi xi + Σi yi)/(2n)    (165)

To keep the notation under control, let

x = (1/n) Σi xi,   y = (1/n) Σi yi    (166)

So

θ̂ = (x + y)/2    (167)

and

max_{Ω0} L = (1/k) θ̂^(nx + ny) e^(−2nθ̂)    (168)

To maximize the likelihood over Ω, we need to compute

max_{θ1,θ2} (1/k) θ1^(nx) e^(−nθ1) θ2^(ny) e^(−nθ2)    (169)


The max occurs at

θ̂1 = x,   θ̂2 = y    (170)

and

max_Ω L = (1/k) θ̂1^(nx) e^(−nθ̂1) θ̂2^(ny) e^(−nθ̂2)    (171)

Note that nθ̂1 + nθ̂2 = 2nθ̂, so the exponential factors cancel in the ratio. So

λ = max_{Ω0} L / max_Ω L = θ̂^(nx + ny) / [θ̂1^(nx) θ̂2^(ny)]    (172)

The rejection region is λ < k where k is chosen to make

max_{Ω0} P(λ < k) = α    (173)

However, λ is complicated and we have no hope of computing its distribution explicitly. So we cannot find k explicitly.

The following theorem says that for large samples the distribution of λ is approximately related to the χ² distribution.

Theorem 1. Let r0 be the number of free parameters in Ω0, r the number of free parameters in Ω. Suppose that r > r0. Under certain regularity conditions, the distribution of −2 ln λ is approximately a χ² distribution with r − r0 degrees of freedom if the sample size is large.

In the likelihood ratio test we reject the null hypothesis if λ < k. This is equivalent to −2 ln λ > −2 ln k. So the rejection region with significance level α will be −2 ln λ > χ²_α.

Example continued: In our example r = 2 and r0 = 1. For plant 1 we had a total of 2072 defects, for plant 2 a total of 2265 defects. So

x = 2072/100,   y = 2265/100,   θ̂ = (2072 + 2265)/200    (174)

which yields −2 ln(λ) = 9.527. We have χ²_{0.01} = qchisq(0.99, 1) = 6.635. So we reject the null hypothesis and conclude the defect rates are different for the two factories.


End of lecture on Tues, 4/24

Example: This is one of the problems on the last homework set. We just start it. You will finish it for the homework. This is problem 10.105 in the book. There are four political wards in a city and we want to compare the fraction of voters favoring candidate A in each of the wards. We randomly poll 200 voters in each ward. In ward 1 we find 76 favor A, in ward 2 we find 53 favor A, in ward 3 we find 59 favor A, and in ward 4 we find 48 favor A. We want to test if the percentages favoring A in the four wards are all the same with a significance level of 5%.

Let x1, · · · , x200 be the sample from ward 1. Each xi is 0 if the ith voter does not favor A, 1 if the voter does favor A. We denote x1, · · · , x200 by just x. We let y, z, w be the random samples from wards 2, 3, 4. The likelihood is

L(x, y, z, w|p1, p2, p3, p4) = p1^(nx) (1 − p1)^(n(1−x))    (175)

× p2^(ny) (1 − p2)^(n(1−y)) p3^(nz) (1 − p3)^(n(1−z)) p4^(nw) (1 − p4)^(n(1−w))    (176)

where

x = (1/n) Σi xi,   y = (1/n) Σi yi,   z = (1/n) Σi zi,   w = (1/n) Σi wi    (177)

wi (177)

32