
21 Testing Problems

Testing is certainly another fundamental problem of inference. It is interestingly different from estimation. The accuracy measures are fundamentally different, and the appropriate asymptotic theory is also different. Much of the theory of testing has revolved in one way or another around the Neyman-Pearson lemma, which has led to many ancillary developments in other problems in mathematical statistics. Testing has led to the useful idea of local alternatives, and, like maximum likelihood in estimation, likelihood ratio, Wald, and Rao score tests have earned the status of default methods, with a neat and quite unified asymptotic theory. For all these reasons, a treatment of testing is essential. We discuss the asymptotic theory of likelihood ratio, Wald, and Rao score tests in this chapter. Principal references for this chapter are Bickel and Doksum (2001), Ferguson (1996), and Sen and Singer (1993). Many other specific references are given in the sections.

21.1 Likelihood Ratio Tests

The likelihood ratio test is a general omnibus test applicable, in principle, in most finite-dimensional parametric problems. Thus, let $X^{(n)} = (X_1, \ldots, X_n)$ be the observed data with joint distribution $P_\theta^{(n)} \ll \mu_n$, $\theta \in \Theta$, and density $f_\theta(x^{(n)}) = dP_\theta^{(n)}/d\mu_n$. Here, $\mu_n$ is some appropriate $\sigma$-finite measure on $\mathcal{X}_n$, which we assume to be a subset of a Euclidean space. For testing $H_0 : \theta \in \Theta_0$ vs. $H_1 : \theta \in \Theta - \Theta_0$, the likelihood ratio test (LRT) rejects $H_0$ for small values of

$$\Lambda_n = \frac{\sup_{\theta \in \Theta_0} f_\theta(x^{(n)})}{\sup_{\theta \in \Theta} f_\theta(x^{(n)})}.$$

The motivation for $\Lambda_n$ comes from two sources:

(a) In the case where $H_0$ and $H_1$ are each simple, a most powerful (MP) test is found from $\Lambda_n$ by the Neyman-Pearson lemma.

(b) The intuitive explanation that for small values of $\Lambda_n$ we can better match the observed data with some value of $\theta$ outside of $\Theta_0$.

LRTs are useful because they are omnibus tests and because otherwise optimal tests for a given sample size $n$ are generally hard to find outside of the exponential family. However, the LRT is not a universal test. There are important examples where the LRT simply cannot be used, because the null distribution of the LRT statistic depends on nuisance parameters. Also, the exact distribution of the LRT statistic is very difficult or impossible to find in many problems. Thus, asymptotics become really important. But the asymptotics of the LRT may be nonstandard under nonstandard conditions.

Although we only discuss the case of a parametric model with a fixed finite-dimensional parameter space, LRTs and their useful modifications have been studied when the number of parameters grows with the sample size $n$, and in nonparametric problems. See, e.g., Portnoy (1988), Fan, Hung and Wong (2000), and Fan and Zhang (2001).

We start with a series of examples, which illustrate various important aspects of the likelihood ratio method.

21.2 Examples

Example 21.1. Let $X_1, X_2, \ldots, X_n \stackrel{iid}{\sim} N(\mu, \sigma^2)$ and consider testing

$$H_0 : \mu = 0 \ \text{vs.} \ H_1 : \mu \neq 0.$$

Let $\theta = (\mu, \sigma^2)$. Then,

$$\Lambda_n = \frac{\sup_{\theta \in \Theta_0} (1/\sigma^n) \exp\left(-\frac{1}{2\sigma^2} \sum_i (X_i - \mu)^2\right)}{\sup_{\theta \in \Theta} (1/\sigma^n) \exp\left(-\frac{1}{2\sigma^2} \sum_i (X_i - \mu)^2\right)} = \left(\frac{\sum_i (X_i - \bar{X}_n)^2}{\sum_i X_i^2}\right)^{n/2}$$

by an elementary calculation of the mles of $\theta$ under $H_0$ and in the general parameter space. By another elementary calculation, $\Lambda_n < c$ is seen to be equivalent to $t_n^2 > k$, where

$$t_n = \frac{\sqrt{n}\,\bar{X}_n}{\sqrt{\frac{1}{n-1} \sum_i (X_i - \bar{X}_n)^2}}$$

is the t-statistic. In other words, the t-test is the LRT. Also, observe that

$$t_n^2 = \frac{n \bar{X}_n^2}{\frac{1}{n-1} \sum_i (X_i - \bar{X}_n)^2} = \frac{\sum_i X_i^2 - \sum_i (X_i - \bar{X}_n)^2}{\frac{1}{n-1} \sum_i (X_i - \bar{X}_n)^2} = \frac{(n-1) \sum_i X_i^2}{\sum_i (X_i - \bar{X}_n)^2} - (n-1) = (n-1)\Lambda_n^{-2/n} - (n-1).$$

This implies

$$\Lambda_n = \left(\frac{n-1}{t_n^2 + n - 1}\right)^{n/2} \Rightarrow \log \Lambda_n = \frac{n}{2} \log \frac{n-1}{t_n^2 + n - 1} \Rightarrow -2 \log \Lambda_n = n \log\left(1 + \frac{t_n^2}{n-1}\right) = n\left(\frac{t_n^2}{n-1} + o_p\left(\frac{t_n^2}{n-1}\right)\right) \stackrel{\mathcal{L}}{\longrightarrow} \chi^2_1$$

under $H_0$, since $t_n \stackrel{\mathcal{L}}{\longrightarrow} N(0,1)$ under $H_0$.
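As a quick numerical sanity check of this example, the following minimal sketch (Python, assuming numpy and scipy are available; the sample size, seed, and replication count are arbitrary choices) simulates the null distribution of $-2\log\Lambda_n$ and compares its rejection rate against the $\chi^2_1$ cutoff.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps = 50, 20000

# Simulate -2 log Lambda_n = n log(1 + t_n^2/(n-1)) under H0: mu = 0.
x = rng.normal(0.0, 2.0, size=(reps, n))          # any sigma works under H0
t = np.sqrt(n) * x.mean(axis=1) / x.std(axis=1, ddof=1)
neg2loglam = n * np.log1p(t**2 / (n - 1))

# Compare the empirical tail probability with the chi-square(1) approximation.
cutoff = stats.chi2.ppf(0.95, df=1)
print("empirical P(-2 log Lambda > cutoff):", (neg2loglam > cutoff).mean())
print("nominal level:", 0.05)
```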

Example 21.2. Consider a multinomial distribution $MN(n, p_1, \cdots, p_k)$. Consider testing

$$H_0 : p_1 = p_2 = \cdots = p_k \ \text{vs.} \ H_1 : H_0 \text{ is not true}.$$

Let $n_1, \cdots, n_k$ denote the observed cell frequencies. Then, by an elementary calculation,

$$\Lambda_n = \prod_{i=1}^k \left(\frac{n}{k n_i}\right)^{n_i} \Rightarrow -\log \Lambda_n = n \log\frac{k}{n} + \sum_{i=1}^k n_i \log n_i.$$

The exact distribution of this is a messy discrete object, and so asymptotics will be useful. We illustrate the asymptotic distribution for $k = 2$. In this case,

$$-\log \Lambda_n = n \log 2 - n \log n + n_1 \log n_1 + (n - n_1) \log(n - n_1).$$

Let $Z_n = (n_1 - n/2)/\sqrt{n/4}$. Then,

$$-\log \Lambda_n = n \log 2 - n \log n + \frac{n + \sqrt{n} Z_n}{2} \log \frac{n + \sqrt{n} Z_n}{2} + \frac{n - \sqrt{n} Z_n}{2} \log \frac{n - \sqrt{n} Z_n}{2}$$
$$= -n \log n + \frac{n + \sqrt{n} Z_n}{2} \log(n + \sqrt{n} Z_n) + \frac{n - \sqrt{n} Z_n}{2} \log(n - \sqrt{n} Z_n)$$
$$= \frac{n + \sqrt{n} Z_n}{2} \log\left(1 + \frac{Z_n}{\sqrt{n}}\right) + \frac{n - \sqrt{n} Z_n}{2} \log\left(1 - \frac{Z_n}{\sqrt{n}}\right)$$
$$= \frac{n + \sqrt{n} Z_n}{2} \left(\frac{Z_n}{\sqrt{n}} - \frac{Z_n^2}{2n} + o_p\left(\frac{Z_n^2}{n}\right)\right) + \frac{n - \sqrt{n} Z_n}{2} \left(-\frac{Z_n}{\sqrt{n}} - \frac{Z_n^2}{2n} + o_p\left(\frac{Z_n^2}{n}\right)\right)$$
$$= \frac{Z_n^2}{2} + o_p(1).$$

Hence $-2 \log \Lambda_n \stackrel{\mathcal{L}}{\longrightarrow} \chi^2_1$ under $H_0$, as $Z_n \stackrel{\mathcal{L}}{\longrightarrow} N(0,1)$ under $H_0$.

Remark: The popular test in this problem is the Pearson chi-square test, which rejects $H_0$ for large values of

$$\sum_{i=1}^k \frac{(n_i - n/k)^2}{n/k}.$$

Interestingly, this test statistic has the same asymptotic null distribution as the LRT statistic, namely $\chi^2_{k-1}$ ($\chi^2_1$ in the case $k = 2$ illustrated above).
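The closeness of the two statistics is easy to see numerically. Here is a small sketch (Python; the cell count $k$, sample size, and replication count are arbitrary choices) that computes both statistics on simulated multinomial data under $H_0$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
k, n, reps = 4, 200, 10000
counts = rng.multinomial(n, [1.0 / k] * k, size=reps)

# LRT: -2 log Lambda_n = 2 * sum_i n_i log(n_i / (n/k)), with 0 log 0 = 0.
with np.errstate(divide="ignore", invalid="ignore"):
    terms = counts * np.log(counts / (n / k))
lrt = 2 * np.nan_to_num(terms).sum(axis=1)

# Pearson chi-square statistic.
pearson = ((counts - n / k) ** 2 / (n / k)).sum(axis=1)

cutoff = stats.chi2.ppf(0.95, df=k - 1)
print("LRT rejection rate:    ", (lrt > cutoff).mean())
print("Pearson rejection rate:", (pearson > cutoff).mean())
```

Both rejection rates come out close to the nominal 0.05, reflecting the common $\chi^2_{k-1}$ limit.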

Example 21.3. This example shows that the chi-square asymptotics of the LRT statistic fail when the parameters have constraints or there is some other boundary phenomenon.

Let $X_1, X_2, \ldots, X_n \stackrel{iid}{\sim} N_2(\theta, I)$, where the parameter space is restricted to $\Theta = \{\theta = (\theta_1, \theta_2) : \theta_1 \geq 0, \theta_2 \geq 0\}$. Consider testing

$$H_0 : \theta = 0 \ \text{vs.} \ H_1 : H_0 \text{ is not true}.$$

We write the $n$ realizations as $x_i = (x_{1i}, x_{2i})^T$ and denote the sample average by $\bar{x} = (\bar{x}_1, \bar{x}_2)$. The mle of $\theta$ is given by

$$\hat{\theta} = \begin{pmatrix} \bar{x}_1 \vee 0 \\ \bar{x}_2 \vee 0 \end{pmatrix}.$$

Therefore, by an elementary calculation,

$$\Lambda_n = \frac{\exp\left(-\sum_i x_{1i}^2/2 - \sum_i x_{2i}^2/2\right)}{\exp\left(-\sum_i (x_{1i} - \bar{x}_1 \vee 0)^2/2 - \sum_i (x_{2i} - \bar{x}_2 \vee 0)^2/2\right)}.$$

Case 1: $\bar{x}_1 \leq 0, \bar{x}_2 \leq 0$. In this case $\Lambda_n = 1$ and $-2 \log \Lambda_n = 0$.

Case 2: $\bar{x}_1 > 0, \bar{x}_2 \leq 0$. In this case $-2 \log \Lambda_n = n \bar{x}_1^2$.

Case 3: $\bar{x}_1 \leq 0, \bar{x}_2 > 0$. In this case $-2 \log \Lambda_n = n \bar{x}_2^2$.

Case 4: $\bar{x}_1 > 0, \bar{x}_2 > 0$. In this case $-2 \log \Lambda_n = n \bar{x}_1^2 + n \bar{x}_2^2$.

Now, under $H_0$, each of the above four cases has probability $1/4$ of occurring. Therefore, under $H_0$,

$$-2 \log \Lambda_n \stackrel{\mathcal{L}}{\longrightarrow} \frac{1}{4} \delta_{\{0\}} + \frac{1}{2} \chi^2_1 + \frac{1}{4} \chi^2_2,$$

in the sense of the mixture distribution being its weak limit.
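A direct simulation makes the mixture limit visible. Under $H_0$ the statistic equals $(\sqrt{n}\bar{x}_1 \vee 0)^2 + (\sqrt{n}\bar{x}_2 \vee 0)^2$ exactly, with $\sqrt{n}\bar{x}_1, \sqrt{n}\bar{x}_2$ independent standard normals; the sketch below (Python; sample size and replication count are arbitrary) checks the atom at $0$ and a tail probability against the mixture.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, reps = 100, 20000

# Bivariate normal data with identity covariance under H0: theta = 0.
xbar = rng.normal(0.0, 1.0, size=(reps, 2, n)).mean(axis=2)
stat = (n * np.maximum(xbar, 0.0) ** 2).sum(axis=1)   # -2 log Lambda_n

print("P(stat == 0):", (stat == 0).mean())            # about 1/4
c = 3.0
mix_tail = 0.5 * stats.chi2.sf(c, 1) + 0.25 * stats.chi2.sf(c, 2)
print("empirical P(stat > c):", (stat > c).mean())
print("mixture P(stat > c):  ", mix_tail)
```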

Example 21.4. In bio-equivalence trials, a brand-name drug and a generic are compared with regard to some important clinical variable, such as average drug concentration in blood over a 24-hour time period. By testing for a difference, the problem is reduced to a single variable, often assumed to be normal. Formally, one has $X_1, X_2, \ldots, X_n \stackrel{iid}{\sim} N(\mu, \sigma^2)$, and the bio-equivalence hypothesis is

$$H_0 : |\mu| \leq \varepsilon \ \text{for some specified } \varepsilon > 0.$$

Inference is always significantly harder if known constraints on the parameters are enforced. Casella and Strawderman (1980) is a standard introduction to the normal mean problem with restrictions on the mean; Robertson, Wright and Dykstra (1988) is an almost encyclopedic exposition.

Coming back to our example, to derive the LRT we need the restricted mle and the unrestricted mle. We assume here that $\sigma^2$ is known. Then the mle under $H_0$ is

$$\hat{\mu}_{H_0} = \begin{cases} \bar{X} & \text{if } |\bar{X}| \leq \varepsilon \\ \varepsilon\, \mathrm{sign}(\bar{X}) & \text{if } |\bar{X}| > \varepsilon. \end{cases}$$

Consequently, the LRT statistic $\Lambda_n$ satisfies

$$-2 \log \Lambda_n = \begin{cases} 0 & \text{if } |\bar{X}| \leq \varepsilon \\ \frac{n}{\sigma^2}(\bar{X} - \varepsilon)^2 & \text{if } \bar{X} > \varepsilon \\ \frac{n}{\sigma^2}(\bar{X} + \varepsilon)^2 & \text{if } \bar{X} < -\varepsilon. \end{cases}$$

Take a fixed $\mu$. Then,

$$P_\mu(|\bar{X}| \leq \varepsilon) = \Phi\left(\frac{\sqrt{n}(\varepsilon - \mu)}{\sigma}\right) + \Phi\left(\frac{\sqrt{n}(\varepsilon + \mu)}{\sigma}\right) - 1,$$

$$P_\mu(\bar{X} < -\varepsilon) = \Phi\left(\frac{-\sqrt{n}(\varepsilon + \mu)}{\sigma}\right),$$

and

$$P_\mu(\bar{X} > \varepsilon) = 1 - \Phi\left(\frac{\sqrt{n}(\varepsilon - \mu)}{\sigma}\right).$$

From these expressions we get:

Case 1: If $-\varepsilon < \mu < \varepsilon$, then $P_\mu(|\bar{X}| \leq \varepsilon) \to 1$, and so $-2 \log \Lambda_n \stackrel{\mathcal{L}}{\longrightarrow} \delta_{\{0\}}$.

Case 2.a: If $\mu = -\varepsilon$, then each of $P_\mu(|\bar{X}| \leq \varepsilon)$ and $P_\mu(\bar{X} < -\varepsilon)$ converges to $1/2$. In this case, $-2 \log \Lambda_n \stackrel{\mathcal{L}}{\longrightarrow} \frac{1}{2}\delta_{\{0\}} + \frac{1}{2}\chi^2_1$, i.e., the mixture of a point mass and a chi-square distribution.

Case 2.b: Similarly, if $\mu = \varepsilon$, then again $-2 \log \Lambda_n \stackrel{\mathcal{L}}{\longrightarrow} \frac{1}{2}\delta_{\{0\}} + \frac{1}{2}\chi^2_1$.

Case 3: If $\mu > \varepsilon$, then $P_\mu(|\bar{X}| \leq \varepsilon) \to 0$ and $P_\mu(\bar{X} > \varepsilon) \to 1$. In this case, with probability tending to one, $-2 \log \Lambda_n$ equals $\frac{n}{\sigma^2}(\bar{X} - \varepsilon)^2$, which is distributed as $NC\chi^2(1, n(\mu - \varepsilon)^2/\sigma^2)$, a noncentral chi-square distribution whose noncentrality parameter grows with $n$.

Case 4: Likewise, if $\mu < -\varepsilon$, then $-2 \log \Lambda_n$ behaves like an $NC\chi^2(1, n(\mu + \varepsilon)^2/\sigma^2)$ variable.

So, unfortunately, even the null asymptotic distribution of $-2 \log \Lambda_n$ depends on the exact value of $\mu$.
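To see the boundary mixture of Case 2.b concretely, the following sketch (Python; $\varepsilon$, $\sigma$, $n$, and the replication count are arbitrary choices, and $\bar{X}$ is simulated directly rather than from raw data) simulates $-2\log\Lambda_n$ at $\mu = \varepsilon$ and compares its positive part with $\chi^2_1$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, reps, eps, sigma = 200, 20000, 0.5, 1.0

# Simulate the sample mean directly at the boundary mu = eps.
xbar = rng.normal(eps, sigma / np.sqrt(n), size=reps)
stat = np.where(np.abs(xbar) <= eps, 0.0,
                n / sigma**2 * (np.abs(xbar) - eps) ** 2)

print("P(stat == 0):", (stat == 0).mean())            # about 1/2
pos = stat[stat > 0]
print("mean of positive part:", pos.mean())           # about E[chi2_1] = 1
print("chi2_1 95th pct:", stats.chi2.ppf(0.95, 1),
      "empirical 95th pct of positive part:", np.quantile(pos, 0.95))
```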

Example 21.5. Consider the Behrens-Fisher problem with

$$X_1, X_2, \ldots, X_m \stackrel{iid}{\sim} N(\mu_1, \sigma_1^2), \qquad Y_1, Y_2, \ldots, Y_n \stackrel{iid}{\sim} N(\mu_2, \sigma_2^2),$$

where all $m + n$ observations are independent. We want to test

$$H_0 : \mu_1 = \mu_2 \ \text{vs.} \ H_1 : \mu_1 \neq \mu_2.$$

Let $\hat{\mu}, \hat{\sigma}_1^2, \hat{\sigma}_2^2$ denote the restricted mles of $\mu, \sigma_1^2$, and $\sigma_2^2$, respectively, under $H_0$. They satisfy the equations

$$\hat{\sigma}_1^2 = (\bar{X} - \hat{\mu})^2 + \frac{1}{m} \sum_i (X_i - \bar{X})^2,$$
$$\hat{\sigma}_2^2 = (\bar{Y} - \hat{\mu})^2 + \frac{1}{n} \sum_j (Y_j - \bar{Y})^2,$$
$$0 = \frac{m(\bar{X} - \hat{\mu})}{\hat{\sigma}_1^2} + \frac{n(\bar{Y} - \hat{\mu})}{\hat{\sigma}_2^2}.$$

This gives

$$\Lambda_{m,n} = \left(\frac{\frac{1}{m}\sum (X_i - \bar{X})^2}{\frac{1}{m}\sum (X_i - \bar{X})^2 + (\bar{X} - \hat{\mu})^2}\right)^{m/2} \left(\frac{\frac{1}{n}\sum (Y_j - \bar{Y})^2}{\frac{1}{n}\sum (Y_j - \bar{Y})^2 + (\bar{Y} - \hat{\mu})^2}\right)^{n/2}.$$

However, the LRT fails as a matter of practical use because its distribution under $H_0$ depends on the nuisance parameter $\sigma_1^2/\sigma_2^2$, which is unknown.

Example 21.6. In point estimation, mles in nonregular problems behave fundamentally differently with respect to their asymptotic distributions. For example, limit distributions of mles are not normal (see Chapter 16). It is interesting, therefore, that limit distributions of $-2 \log \Lambda_n$ in nonregular cases still follow the chi-square recipe, but with different degrees of freedom.

As an example, suppose $X_1, X_2, \ldots, X_n \stackrel{iid}{\sim} U[\mu - \sigma, \mu + \sigma]$, and suppose we want to test $H_0 : \mu = 0$. Let $W = X_{(n)} - X_{(1)}$ and $U = \max(X_{(n)}, -X_{(1)})$. Under $H_0$, $U$ is complete and sufficient, and $W/U$ is an ancillary statistic. Therefore, by Basu's theorem (Basu (1955)), under $H_0$, $U$ and $W/U$ are independent. This will be useful to us shortly. Now, by a straightforward calculation,

$$\Lambda_n = \left(\frac{W}{2U}\right)^n \Rightarrow -2 \log \Lambda_n = 2n(\log 2U - \log W).$$

Therefore, under $H_0$,

$$E e^{it(-2\log\Lambda_n)} = E e^{2nit(\log 2U - \log W)} = \frac{E e^{2nit(-\log W)}}{E e^{2nit(-\log 2U)}}$$

by the independence of $U$ and $W/U$. This works out, on doing the calculation, to

$$E e^{it(-2\log\Lambda_n)} = \frac{n-1}{n(1 - 2it) - 1} \to \frac{1}{1 - 2it}.$$

Since $(1 - 2it)^{-1}$ is the characteristic function of the $\chi^2_2$ distribution, it follows that $-2 \log \Lambda_n \stackrel{\mathcal{L}}{\longrightarrow} \chi^2_2$.

Thus, in spite of the nonregularity, $-2 \log \Lambda_n$ is asymptotically chi-square, but we gain a degree of freedom! This phenomenon holds more generally in nonregular cases.
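The gain of a degree of freedom is easy to check by simulation. A short sketch (Python; $\mu = 0$, $\sigma = 1$, and the simulation sizes are arbitrary choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, reps = 100, 20000

x = rng.uniform(-1.0, 1.0, size=(reps, n))   # U[mu - sigma, mu + sigma] with mu = 0, sigma = 1
w = x.max(axis=1) - x.min(axis=1)            # the range W
u = np.maximum(x.max(axis=1), -x.min(axis=1))
stat = 2 * n * (np.log(2 * u) - np.log(w))   # -2 log Lambda_n

cutoff = stats.chi2.ppf(0.95, df=2)          # chi-square with TWO degrees of freedom
print("empirical rejection rate:", (stat > cutoff).mean())   # close to 0.05
```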


Example 21.7. Consider the problem of testing for equality of two Poisson means. Thus, let $X_1, X_2, \ldots, X_m \stackrel{iid}{\sim} \text{Poi}(\mu_1)$ and $Y_1, Y_2, \ldots, Y_n \stackrel{iid}{\sim} \text{Poi}(\mu_2)$, where the $X_i$'s and $Y_j$'s are independent. Suppose we wish to test $H_0 : \mu_1 = \mu_2$. We assume $m, n \to \infty$ in such a way that

$$\frac{m}{m+n} \to \lambda \ \text{with } 0 < \lambda < 1.$$

The restricted mle for $\mu = \mu_1 = \mu_2$ is

$$\frac{\sum_i X_i + \sum_j Y_j}{m+n} = \frac{m\bar{X} + n\bar{Y}}{m+n}.$$

Therefore, on an easy calculation,

$$\Lambda_{m,n} = \frac{\left((m\bar{X} + n\bar{Y})/(m+n)\right)^{m\bar{X} + n\bar{Y}}}{\bar{X}^{m\bar{X}}\, \bar{Y}^{n\bar{Y}}}$$

$$\Rightarrow -\log \Lambda_{m,n} = m\bar{X} \log \bar{X} + n\bar{Y} \log \bar{Y} - (m\bar{X} + n\bar{Y}) \log\left(\frac{m}{m+n}\bar{X} + \frac{n}{m+n}\bar{Y}\right).$$

The asymptotic distribution of $-\log \Lambda_{m,n}$ can be found by a two-term Taylor expansion, using the multivariate delta theorem. This would be a more direct derivation, but we will instead present a derivation based on an asymptotic technique that is useful in many problems. Define

$$Z_1 = Z_{1,m} = \frac{\sqrt{m}(\bar{X} - \mu)}{\sqrt{\mu}}, \qquad Z_2 = Z_{2,n} = \frac{\sqrt{n}(\bar{Y} - \mu)}{\sqrt{\mu}},$$

where $\mu = \mu_1 = \mu_2$ is the common value of $\mu_1$ and $\mu_2$ under $H_0$. Substituting into the expression for $\log \Lambda_{m,n}$, we have

$$-\log \Lambda_{m,n} = (m\mu + \sqrt{m\mu}\, Z_1)\left[\log\left(1 + \frac{Z_1}{\sqrt{m\mu}}\right) + \log \mu\right] + (n\mu + \sqrt{n\mu}\, Z_2)\left[\log\left(1 + \frac{Z_2}{\sqrt{n\mu}}\right) + \log \mu\right]$$
$$- \left((m+n)\mu + \sqrt{m\mu}\, Z_1 + \sqrt{n\mu}\, Z_2\right)\left[\log\left(1 + \frac{m}{m+n}\frac{Z_1}{\sqrt{m\mu}} + \frac{n}{m+n}\frac{Z_2}{\sqrt{n\mu}}\right) + \log \mu\right]$$

$$= (m\mu + \sqrt{m\mu}\, Z_1)\left(\frac{Z_1}{\sqrt{m\mu}} - \frac{Z_1^2}{2m\mu} + O_p(m^{-3/2}) + \log \mu\right) + (n\mu + \sqrt{n\mu}\, Z_2)\left(\frac{Z_2}{\sqrt{n\mu}} - \frac{Z_2^2}{2n\mu} + O_p(n^{-3/2}) + \log \mu\right)$$
$$- \left((m+n)\mu + \sqrt{m\mu}\, Z_1 + \sqrt{n\mu}\, Z_2\right)\left(\frac{m}{m+n}\frac{Z_1}{\sqrt{m\mu}} + \frac{n}{m+n}\frac{Z_2}{\sqrt{n\mu}} - \frac{m}{(m+n)^2}\frac{Z_1^2}{2\mu} - \frac{n}{(m+n)^2}\frac{Z_2^2}{2\mu} - \frac{\sqrt{mn}}{(m+n)^2}\frac{Z_1 Z_2}{\mu} + O_p(\min(m,n)^{-3/2}) + \log \mu\right).$$

On further algebra, breaking down the terms and on cancellation,

$$-\log \Lambda_{m,n} = Z_1^2\left(\frac{1}{2} - \frac{m}{2(m+n)}\right) + Z_2^2\left(\frac{1}{2} - \frac{n}{2(m+n)}\right) - \frac{\sqrt{mn}}{m+n}\, Z_1 Z_2 + o_p(1).$$

Assuming that $m/(m+n) \to \lambda$, it follows that

$$-\log \Lambda_{m,n} = \frac{1}{2}\left(\sqrt{\frac{n}{m+n}}\, Z_1 - \sqrt{\frac{m}{m+n}}\, Z_2\right)^2 + o_p(1) \stackrel{\mathcal{L}}{\longrightarrow} \frac{1}{2}\left(\sqrt{1-\lambda}\, N_1 - \sqrt{\lambda}\, N_2\right)^2,$$

where $N_1, N_2$ are independent $N(0,1)$. Therefore, $-2 \log \Lambda_{m,n} \stackrel{\mathcal{L}}{\longrightarrow} \chi^2_1$, since $\sqrt{1-\lambda}\, N_1 - \sqrt{\lambda}\, N_2 \sim N(0,1)$.
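A numerical check of this example (a Python sketch; $\mu$, $m$, $n$, and the replication count are arbitrary choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
m, n, mu, reps = 80, 120, 3.0, 20000

xbar = rng.poisson(mu, size=(reps, m)).mean(axis=1)
ybar = rng.poisson(mu, size=(reps, n)).mean(axis=1)
pooled = (m * xbar + n * ybar) / (m + n)

# -2 log Lambda_{m,n}; for this mu, xbar, ybar, pooled are positive with
# overwhelming probability, so the logarithms are safe here.
stat = 2 * (m * xbar * np.log(xbar) + n * ybar * np.log(ybar)
            - (m * xbar + n * ybar) * np.log(pooled))

cutoff = stats.chi2.ppf(0.95, df=1)
print("empirical rejection rate:", (stat > cutoff).mean())   # approx 0.05
```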

Example 21.8. This example gives a general formula for the LRT statistic for testing about the natural parameter vector in the $q$-dimensional multiparameter exponential family. What makes the example special is a link of the LRT to the Kullback-Leibler divergence measure in the exponential family.

Suppose $X_1, X_2, \ldots, X_n \stackrel{iid}{\sim} p(x|\theta) = e^{\theta' T(x)} c(\theta) h(x)$ (with respect to a dominating measure $\mu$). Suppose we want to test $H_0 : \theta \in \Theta_0$ vs. $H_1 : \theta \in \Theta - \Theta_0$, where $\Theta$ is the full natural parameter space. Then, by definition,

$$-2 \log \Lambda_n = -2 \log \frac{\sup_{\theta \in \Theta_0} c(\theta)^n \exp\left(\theta' \sum_i T(x_i)\right)}{\sup_{\theta \in \Theta} c(\theta)^n \exp\left(\theta' \sum_i T(x_i)\right)}$$
$$= -2 \sup_{\theta_0 \in \Theta_0} \left[n \log c(\theta_0) + n\theta_0' \bar{T}\right] + 2 \sup_{\theta \in \Theta} \left[n \log c(\theta) + n\theta' \bar{T}\right]$$
$$= 2n \sup_{\theta \in \Theta} \inf_{\theta_0 \in \Theta_0} \left[(\theta - \theta_0)' \bar{T} + \log c(\theta) - \log c(\theta_0)\right]$$

on a simple rearrangement of the terms. Recall now (see Chapter 2) that the Kullback-Leibler divergence $K(f, g)$ is defined to be

$$K(f, g) = E_f \log \frac{f}{g} = \int \log \frac{f(x)}{g(x)}\, f(x)\, \mu(dx).$$

It follows that

$$K(p_\theta, p_{\theta_0}) = (\theta - \theta_0)' E_\theta T(X) + \log c(\theta) - \log c(\theta_0)$$

for any $\theta, \theta_0 \in \Theta$. We will write $K(\theta, \theta_0)$ for $K(p_\theta, p_{\theta_0})$. Recall also that in the general exponential family, the unrestricted mle $\hat{\theta}$ of $\theta$ exists for all large $n$, and furthermore, $\hat{\theta}$ is a moment estimate given by $E_{\theta = \hat{\theta}}(T) = \bar{T}$. Consequently,

$$-2 \log \Lambda_n = 2n \sup_{\theta \in \Theta} \inf_{\theta_0 \in \Theta_0} \left[(\hat{\theta} - \theta_0)' E_{\hat{\theta}}(T) + \log c(\hat{\theta}) - \log c(\theta_0) + (\theta - \hat{\theta})' E_{\hat{\theta}}(T) + \log c(\theta) - \log c(\hat{\theta})\right]$$
$$= 2n \sup_{\theta \in \Theta} \inf_{\theta_0 \in \Theta_0} \left[K(\hat{\theta}, \theta_0) + (\theta - \hat{\theta})' E_{\hat{\theta}}(T) + \log c(\theta) - \log c(\hat{\theta})\right]$$
$$= 2n \inf_{\theta_0 \in \Theta_0} K(\hat{\theta}, \theta_0) + 2n \sup_{\theta \in \Theta} \left[(\theta - \hat{\theta})' \bar{T} + \log c(\theta) - \log c(\hat{\theta})\right].$$

The second term above vanishes, as the supremum is attained at $\theta = \hat{\theta}$. Thus we ultimately have the identity

$$-2 \log \Lambda_n = 2n \inf_{\theta_0 \in \Theta_0} K(\hat{\theta}, \theta_0).$$

The connection is beautiful. The Kullback-Leibler divergence being a measure of distance, $\inf_{\theta_0 \in \Theta_0} K(\hat{\theta}, \theta_0)$ quantifies the disagreement between the null and the estimated value of the true parameter $\theta$, when estimated by the mle. The formula says that when the disagreement is small, one should accept $H_0$.

The link to the K-L divergence is special to the exponential family; for example, if

$$p_\theta(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(x - \theta)^2},$$

then, for any $\theta, \theta_0$, $K(\theta, \theta_0) = (\theta - \theta_0)^2/2$ by a direct calculation. Therefore,

$$-2 \log \Lambda_n = 2n \inf_{\theta_0 \in \Theta_0} \frac{1}{2}(\hat{\theta} - \theta_0)^2 = n \inf_{\theta_0 \in \Theta_0} (\bar{X} - \theta_0)^2.$$

If $H_0$ is simple, i.e., $H_0 : \theta = \theta_0$, then the above expression simplifies to $-2 \log \Lambda_n = n(\bar{X} - \theta_0)^2$, just the familiar chi-square statistic.
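The identity $-2\log\Lambda_n = 2n \inf_{\theta_0} K(\hat{\theta}, \theta_0)$ is easy to verify numerically in a one-parameter case. The sketch below (Python; a Poisson model with a simple null, all numbers arbitrary) computes the LRT statistic both directly and through the KL divergence, using the standard fact that for Poisson, $K(\mu, \mu_0) = \mu \log(\mu/\mu_0) - \mu + \mu_0$.

```python
import numpy as np

rng = np.random.default_rng(6)
n, mu0 = 40, 2.0
x = rng.poisson(mu0, size=n)
xbar = x.mean()   # the mle; assume xbar > 0

# Direct LRT statistic for H0: mu = mu0 in the Poisson model.
direct = 2 * (x.sum() * np.log(xbar / mu0) - n * (xbar - mu0))

# Via the KL identity: -2 log Lambda_n = 2 n K(mle, mu0),
# where K(mu, mu0) = mu log(mu/mu0) - mu + mu0 for Poisson.
kl = xbar * np.log(xbar / mu0) - xbar + mu0
via_kl = 2 * n * kl

print(direct, via_kl)   # the two numbers agree
```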


21.3 Asymptotic Theory of Likelihood Ratio Test Statistics

Suppose $X_1, X_2, \ldots, X_n \stackrel{iid}{\sim} f(x|\theta)$, densities with respect to some dominating measure $\mu$. We present the asymptotic theory of the LRT statistic for two types of null hypothesis.

Suppose $\theta \in \Theta \subseteq \mathbb{R}^q$ for some $0 < q < \infty$, and suppose that, as an affine subspace of $\mathbb{R}^q$, $\dim \Theta = q$. One type of null hypothesis we consider is $H_0 : \theta = \theta_0$ (specified). A second type of null hypothesis is that, for some $0 < r < q$ and linearly independent functions $h_1(\theta), \ldots, h_r(\theta)$,

$$h_1(\theta) = h_2(\theta) = \cdots = h_r(\theta) = 0.$$

In each case, under regularity conditions to be stated below, the LRT statistic is asymptotically a central chi-square under $H_0$ with a fixed degree of freedom, including the case where $H_0$ is composite.

Example 21.9. Suppose $X_{11}, \ldots, X_{1n_1} \stackrel{iid}{\sim} \text{Poisson}(\theta_1)$, $X_{21}, \ldots, X_{2n_2} \stackrel{iid}{\sim} \text{Poisson}(\theta_2)$, and $X_{31}, \ldots, X_{3n_3} \stackrel{iid}{\sim} \text{Poisson}(\theta_3)$, where $0 < \theta_1, \theta_2, \theta_3 < \infty$, and all observations are independent. An example of $H_0$ of the first type is $H_0 : \theta_1 = 1, \theta_2 = 2, \theta_3 = 1$. An example of $H_0$ of the second type is $H_0 : \theta_1 = \theta_2 = \theta_3$; the functions $h_1, h_2$ may be chosen as $h_1(\theta) = \theta_1 - \theta_2$, $h_2(\theta) = \theta_2 - \theta_3$.

In the first case, $-2 \log \Lambda_n$ is asymptotically $\chi^2_3$ under $H_0$, and in the second case it is asymptotically $\chi^2_2$ under each $\theta \in H_0$. This is a special example of a theorem originally stated by Wilks (1938); also see Lawley (1956). The degree of freedom of the asymptotic chi-square distribution under $H_0$ is the number of independent constraints specified by $H_0$; it is useful to remember this as a general rule.

Theorem 21.1. Suppose $X_1, X_2, \ldots, X_n \stackrel{iid}{\sim} f_\theta(x) = f(x|\theta) = \frac{dP_\theta}{d\mu}$ for some dominating measure $\mu$. Suppose $\theta \in \Theta$, an affine subspace of $\mathbb{R}^q$ of dimension $q$. Suppose that the conditions required for asymptotic normality of any strongly consistent sequence of roots $\hat{\theta}_n$ of the likelihood equation hold (see Chapter 16). Consider the problem $H_0 : \theta = \theta_0$ against $H_1 : \theta \neq \theta_0$. Let

$$\Lambda_n = \frac{\prod_{i=1}^n f(x_i|\theta_0)}{l(\hat{\theta}_n)},$$

where $l(\cdot)$ denotes the likelihood function. Then $-2 \log \Lambda_n \stackrel{\mathcal{L}}{\longrightarrow} \chi^2_q$ under $H_0$.

The proof can be seen in Ferguson (1996), Bickel and Doksum (2001), or Sen and Singer (1993).

Remark: As we have seen in our illustrative examples earlier in this chapter, the theorem can fail if the regularity conditions fail to hold. For instance, if the null value is a boundary point in a constrained parameter space, then the result will fail.

Remark: This theorem is used in the following way. To use the LRT with an exact level $\alpha$, we need to find $c_{n,\alpha}$ such that $P_{H_0}(\Lambda_n < c_{n,\alpha}) = \alpha$. Generally, $\Lambda_n$ is so complicated as a statistic that one cannot find $c_{n,\alpha}$ exactly. Instead, we use the test that rejects $H_0$ when

$$-2 \log \Lambda_n > \chi^2_{\alpha,q},$$

where $\chi^2_{\alpha,q}$ denotes the upper $\alpha$th percentile of the $\chi^2_q$ distribution. The theorem above implies that $P(-2 \log \Lambda_n > \chi^2_{\alpha,q}) \to \alpha$, and so $P(-2 \log \Lambda_n > \chi^2_{\alpha,q}) - P(\Lambda_n < c_{n,\alpha}) \to 0$ as $n \to \infty$.

Under the second type of null hypothesis, the same result holds with the degree of freedom being $r$. Precisely, the following holds.

Theorem 21.2. Assume all the regularity conditions of the previous theorem. Define $H_{q \times r} = H(\theta) = \left(\frac{\partial}{\partial \theta_i} h_j(\theta)\right)_{1 \leq i \leq q,\, 1 \leq j \leq r}$; it is assumed that the required partial derivatives exist. Suppose that for each $\theta \in \Theta_0$, $H$ has full column rank. Define

$$\Lambda_n = \frac{\sup_{\theta : h_j(\theta) = 0,\, 1 \leq j \leq r}\, l(\theta)}{l(\hat{\theta}_n)}.$$

Then $-2 \log \Lambda_n \stackrel{\mathcal{L}}{\longrightarrow} \chi^2_r$ under each $\theta$ in $H_0$.

A proof can be seen in Sen and Singer (1993).

21.4 Distribution under Alternatives

To find the power of the test that rejects $H_0$ when $-2 \log \Lambda_n > \chi^2_{\alpha,d}$ for some $d$, one would need to know the distribution of $-2 \log \Lambda_n$ at the particular value $\theta = \theta_1$ at which we want to know the power. But the distribution of $-2 \log \Lambda_n$ under $\theta_1$ for a fixed $n$ is also generally impossible to find. So we may appeal to asymptotics.

However, there cannot be a nondegenerate limit distribution on $[0, \infty)$ for $-2 \log \Lambda_n$ under a fixed $\theta_1$ in the alternative. The following simple example illustrates this difficulty.

Example 21.10. Suppose $X_1, X_2, \ldots, X_n \stackrel{iid}{\sim} N(\mu, \sigma^2)$, with $\mu, \sigma^2$ both unknown, and we wish to test $H_0 : \mu = 0$ vs. $H_1 : \mu \neq 0$. We saw earlier that

$$\Lambda_n = \left(\frac{\sum (X_i - \bar{X})^2}{\sum X_i^2}\right)^{n/2} = \left(\frac{\sum (X_i - \bar{X})^2}{\sum (X_i - \bar{X})^2 + n\bar{X}^2}\right)^{n/2}$$

$$\Rightarrow -2 \log \Lambda_n = n \log\left(1 + \frac{n\bar{X}^2}{\sum (X_i - \bar{X})^2}\right) = n \log\left(1 + \frac{\bar{X}^2}{\frac{1}{n}\sum (X_i - \bar{X})^2}\right).$$

Consider now a value $\mu \neq 0$. Then $\bar{X}^2 \stackrel{a.s.}{\longrightarrow} \mu^2\, (> 0)$ and $\frac{1}{n}\sum (X_i - \bar{X})^2 \stackrel{a.s.}{\longrightarrow} \sigma^2$. Therefore, clearly $-2 \log \Lambda_n \stackrel{a.s.}{\longrightarrow} \infty$ under each fixed $\mu \neq 0$. There cannot be a bona fide limit distribution for $-2 \log \Lambda_n$ under a fixed alternative $\mu$.

However, if we let $\mu$ depend on $n$ and take $\mu_n = \frac{\Delta}{\sqrt{n}}$ for some fixed but arbitrary $\Delta$, $0 < \Delta < \infty$, then $-2 \log \Lambda_n$ still has a bona fide limit distribution under the sequence of alternatives $\mu_n$.

Sometimes alternatives of the form $\mu_n = \frac{\Delta}{\sqrt{n}}$ are motivated by arguing that one wants the test to be powerful at values of $\mu$ close to the null value. Such a property would correspond to a sensitive test. These alternatives are called Pitman alternatives.

The following result holds; we present the case $h_i(\theta) = \theta_i$ for notational simplicity.

Theorem 21.3. Assume the same regularity conditions as in the previous theorems. Let $\theta_{q \times 1} = (\theta_1, \eta)'$, where $\theta_1$ is $r \times 1$ and $\eta$ is $(q - r) \times 1$, a vector of nuisance parameters. Let $H_0 : \theta_1 = \theta_{10}$ (specified) and let $\theta_n = (\theta_{10} + \frac{\Delta}{\sqrt{n}}, \eta)$. Let $I(\theta)$ be the Fisher information matrix, with $I_{ij}(\theta) = -E\left(\frac{\partial^2}{\partial \theta_i \partial \theta_j} \log f(X|\theta)\right)$. Denote $V(\theta) = I^{-1}(\theta)$, and let $V_r(\theta)$ be the upper $(r \times r)$ principal submatrix of $V(\theta)$. Then

$$-2 \log \Lambda_n \stackrel{\mathcal{L}}{\longrightarrow}_{P_{\theta_n}} NC\chi^2\left(r, \Delta' V_r^{-1}(\theta_{10}, \eta) \Delta\right),$$

where $NC\chi^2(r, \delta)$ denotes the noncentral $\chi^2$ distribution with $r$ degrees of freedom and noncentrality parameter $\delta$.

Remark: The theorem is essentially proved in Sen and Singer (1993). The theorem can be restated in an obvious way for the testing problem $H_0 : g(\theta) = (g_1(\theta), \ldots, g_r(\theta)) = 0$.

Example 21.11. Let $X_1, \ldots, X_n$ be iid $N(\mu, \sigma^2)$. Let $\theta = (\mu, \sigma^2) = (\theta_1, \eta)$, where $\theta_1 = \mu$ and $\eta = \sigma^2$. Thus $q = 2$ and $r = 1$. Suppose we want to test $H_0 : \mu = \mu_0$ vs. $H_1 : \mu \neq \mu_0$. Consider alternatives $\theta_n = (\mu_0 + \frac{\Delta}{\sqrt{n}}, \eta)$. By a familiar calculation,

$$I(\theta) = \begin{pmatrix} \frac{1}{\eta} & 0 \\ 0 & \frac{1}{2\eta^2} \end{pmatrix} \Rightarrow V(\theta) = \begin{pmatrix} \eta & 0 \\ 0 & 2\eta^2 \end{pmatrix} \ \text{and} \ V_r(\theta) = \eta.$$

Therefore, by the above theorem, $-2 \log \Lambda_n \stackrel{\mathcal{L}}{\longrightarrow}_{P_{\theta_n}} NC\chi^2(1, \frac{\Delta^2}{\eta}) = NC\chi^2(1, \frac{\Delta^2}{\sigma^2})$.

Remark: One practical use of this theorem is in approximating the power of a test that rejects $H_0$ if $-2 \log \Lambda_n > c$. Suppose we are interested in approximating the power of the test when $\theta_1 = \theta_{10} + \varepsilon$. Formally, set $\varepsilon = \frac{\Delta}{\sqrt{n}} \Rightarrow \Delta = \varepsilon\sqrt{n}$. Use $NC\chi^2\left(1, \frac{n\varepsilon^2}{V_r(\theta_{10}, \eta)}\right)$ as an approximation to the distribution of $-2 \log \Lambda_n$ at $\theta = (\theta_{10} + \varepsilon, \eta)$. Thus the power will be approximated as $P\left(NC\chi^2\left(1, \frac{n\varepsilon^2}{V_r(\theta_{10}, \eta)}\right) > c\right)$.
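In the setting of Example 21.11, this power approximation is essentially one line with scipy's noncentral chi-square; in the sketch below (Python), $\varepsilon$, $\sigma^2$, $n$, and the level are arbitrary illustrative choices.

```python
from scipy import stats

n, eps, sigma2, alpha = 50, 0.4, 1.0, 0.05
c = stats.chi2.ppf(1 - alpha, df=1)          # cutoff for -2 log Lambda_n

# Approximate power at mu = mu0 + eps: P(NCchi2(1, n eps^2 / sigma^2) > c).
nc = n * eps**2 / sigma2
approx_power = stats.ncx2.sf(c, df=1, nc=nc)
print("approximate power:", approx_power)
```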

21.5 Bartlett Correction

We saw that, under regularity conditions, $-2 \log \Lambda_n$ has asymptotically a $\chi^2(r)$ distribution under $H_0$ for some $r$. Certainly, $-2 \log \Lambda_n$ is not exactly distributed as $\chi^2(r)$. In fact, even its expectation is not $r$. It turns out that $E(-2 \log \Lambda_n)$ under $H_0$ admits an expansion of the form $r(1 + \frac{a}{n} + o(n^{-1}))$. Consequently, $E\left(\frac{-2 \log \Lambda_n}{1 + \frac{a}{n} + R_n}\right) = r$, where $R_n$ is the remainder term in the expansion of $E(-2 \log \Lambda_n)$. Moreover, $E\left(\frac{-2 \log \Lambda_n}{1 + \frac{a}{n}}\right) = r + O(n^{-2})$. Thus we gain higher-order accuracy in the mean by rescaling $-2 \log \Lambda_n$ to $\frac{-2 \log \Lambda_n}{1 + \frac{a}{n}}$. For the absolutely continuous case, the higher-order accuracy even carries over to the $\chi^2$ approximation itself, i.e.,

$$P\left(\frac{-2 \log \Lambda_n}{1 + \frac{a}{n}} \leq c\right) = P(\chi^2(r) \leq c) + O(n^{-2});$$

due to the rescaling, the $\frac{1}{n}$ term that would otherwise be present has vanished. This type of rescaling is known as the Bartlett correction. See Bartlett (1937), Wilks (1938), McCullagh and Cox (1986), and Barndorff-Nielsen and Hall (1988) for detailed treatments of Bartlett corrections in general circumstances.
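The effect of the correction can be seen by simulation. In Example 21.1 (normal mean, $\sigma^2$ unknown), $-2 \log \Lambda_n = n \log(1 + t_n^2/(n-1))$, with $r = 1$. In the sketch below (Python; $n$ and the replication count are arbitrary), the rescaling factor $1 + a/n$ is estimated by the simulated mean of the statistic rather than computed analytically, so this is an empirical stand-in for the exact Bartlett constant, not its derivation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, reps = 15, 100000

x = rng.normal(size=(reps, n))
t2 = n * x.mean(axis=1) ** 2 / x.var(axis=1, ddof=1)
stat = n * np.log1p(t2 / (n - 1))             # -2 log Lambda_n, r = 1

b = stat.mean()                               # estimates 1 + a/n (since r = 1)
c = stats.chi2.ppf(0.95, df=1)
print("uncorrected tail:", (stat > c).mean())
print("corrected tail:  ", (stat / b > c).mean())   # closer to 0.05
```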

21.6 The Wald and Rao Score Tests

Competitors to the LRT are available in the literature. They are general and can be applied to a wide selection of problems. Typically, the three procedures are asymptotically first-order equivalent. See Wald (1943) and Rao (1948) for the first introduction of these procedures.

We define the Wald and the score statistics first. We have not stated these definitions under the weakest possible conditions. Also note that establishing the asymptotics of these statistics will require more conditions than are needed merely to define them.

Definition 21.1. Let $X_1, X_2, \ldots, X_n$ be iid $f(x|\theta) = \frac{dP_\theta}{d\mu}$, $\theta \in \Theta \subseteq \mathbb{R}^q$, $\mu$ a $\sigma$-finite measure. Suppose the Fisher information matrix $I(\theta)$ exists at all $\theta$. Consider $H_0 : \theta = \theta_0$. Let $\hat{\theta}_n$ be the MLE of $\theta$. Define $Q_W = n(\hat{\theta}_n - \theta_0)' I(\hat{\theta}_n)(\hat{\theta}_n - \theta_0)$; this is the Wald test statistic for $H_0 : \theta = \theta_0$.

Definition 21.2. Let $X_1, X_2, \ldots, X_n$ be iid $f(x|\theta) = \frac{dP_\theta}{d\mu}$, $\theta \in \Theta \subseteq \mathbb{R}^q$, $\mu$ a $\sigma$-finite measure. Consider testing $H_0 : \theta = \theta_0$. Assume that $f(x|\theta)$ is partially differentiable with respect to each coordinate of $\theta$ for every $x, \theta$, and that the Fisher information matrix exists and is invertible at $\theta = \theta_0$. Let $U_n(\theta) = \sum_{i=1}^n \frac{\partial}{\partial \theta} \log f(x_i|\theta)$. Define $Q_S = \frac{1}{n} U_n'(\theta_0) I^{-1}(\theta_0) U_n(\theta_0)$; $Q_S$ is the Rao score test statistic for $H_0 : \theta = \theta_0$.

Remark: As we have seen in Chapter 16, the MLE $\hat{\theta}_n$ is asymptotically multivariate normal under the Cramér-Rao regularity conditions. Therefore, $n(\hat{\theta}_n - \theta_0)' I(\hat{\theta}_n)(\hat{\theta}_n - \theta_0)$ is asymptotically a central chi-square, provided $I(\theta)$ is smooth at $\theta_0$. The Wald test rejects $H_0$ when $Q_W = n(\hat{\theta}_n - \theta_0)' I(\hat{\theta}_n)(\hat{\theta}_n - \theta_0)$ is larger than a chi-square percentile.

On the other hand, under simple moment conditions on the score function $\frac{\partial}{\partial \theta} \log f(x_i|\theta)$, by the CLT $\frac{U_n(\theta)}{\sqrt{n}}$ is asymptotically multivariate normal, and therefore the Rao score statistic $Q_S = \frac{1}{n} U_n'(\theta_0) I^{-1}(\theta_0) U_n(\theta_0)$ is asymptotically a chi-square. The score test rejects $H_0$ when $Q_S$ is larger than a chi-square percentile. Notice that in this case of a simple $H_0$, the $\chi^2$ approximation for $Q_S$ holds under fewer assumptions than would be required for $Q_W$ or $-2 \log \Lambda_n$. Note also that to evaluate $Q_S$, computation of $\hat{\theta}_n$ is not necessary, which can be a major advantage in some applications. Here are the asymptotic chi-square results for these two statistics. A version of the two parts of the theorem below is proved in Serfling (1980) and in Sen and Singer (1993); also see van der Vaart (1998).

Theorem 21.4.

(a) If $f(x|\theta)$ can be differentiated twice under the integral sign, $I(\theta_0)$ exists and is invertible, and $\{x : f(x|\theta) > 0\}$ is independent of $\theta$, then under $H_0 : \theta = \theta_0$, $Q_S \stackrel{\mathcal{L}}{\longrightarrow} \chi^2_q$, where $q$ is the dimension of $\Theta$.

(b) Assume the Cramér-Rao regularity conditions for the asymptotic multivariate normality of the MLE, and assume that the Fisher information matrix is continuous in a neighborhood of $\theta_0$. Then, under $H_0 : \theta = \theta_0$, $Q_W \stackrel{\mathcal{L}}{\longrightarrow} \chi^2_q$, with $q$ as above.

Remark: $Q_W$ and $Q_S$ can be defined for composite nulls of the form $H_0 : g(\theta) = (g_1(\theta), \ldots, g_r(\theta)) = 0$, and once again, under appropriate conditions, $Q_W$ and $Q_S$ are asymptotically $\chi^2_r$ under any $\theta \in H_0$. However, now the definition of $Q_S$ requires calculation of the restricted MLE $\hat{\theta}_{n,H_0}$ under $H_0$. Again, consult Sen and Singer (1993) for a proof.

Comparisons among the LRT, the Wald test, and Rao's score test have been made using higher-order asymptotics; see Mukerjee and Reid (2001), in particular.
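For a concrete feel for the three statistics, here is a sketch (Python; a Poisson model with a simple null, all numbers arbitrary) computing $-2\log\Lambda_n$, $Q_W$, and $Q_S$ on the same data, using the standard facts that for Poisson($\theta$), $I(\theta) = 1/\theta$ and $U_n(\theta) = (\sum_i x_i - n\theta)/\theta$.

```python
import numpy as np

rng = np.random.default_rng(8)
n, theta0 = 60, 4.0
x = rng.poisson(theta0, size=n)
xbar = x.mean()   # the mle; assume xbar > 0

# Poisson(theta): I(theta) = 1/theta, U_n(theta) = (sum x - n theta)/theta.
lrt = 2 * n * (xbar * np.log(xbar / theta0) - xbar + theta0)
qw = n * (xbar - theta0) ** 2 / xbar      # Wald: information evaluated at the mle
qs = n * (xbar - theta0) ** 2 / theta0    # Rao score: no mle needed in general

print(lrt, qw, qs)   # all three are close for moderate n under H0
```

Note how the score statistic uses only the null value $\theta_0$, which is exactly the computational advantage mentioned above.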

21.7 Likelihood Ratio Confidence Intervals

The usual duality between testing and confidence intervals says that the acceptance region of a test with size $\alpha$ can be inverted to give a confidence set of coverage probability at least $1 - \alpha$. In other words, suppose $A(\theta_0)$ is the acceptance region of a size $\alpha$ test for $H_0 : \theta = \theta_0$, and define $S(x) = \{\theta_0 : x \in A(\theta_0)\}$. Then $P_{\theta_0}(S(x) \ni \theta_0) \geq 1 - \alpha$, and hence $S(x)$ is a $100(1-\alpha)\%$ confidence set for $\theta$. This method is called the inversion of a test. In particular, each of the LRT, the Wald test, and the Rao score test can be inverted to construct confidence sets that have asymptotically a $100(1-\alpha)\%$ coverage probability.

The confidence sets constructed from the LRT, the Wald test, and the score test are respectively called the likelihood ratio, Wald, and score confidence sets. Of these, the Wald and the score confidence sets are ellipsoids because of how the corresponding test statistics are defined. The likelihood ratio confidence set is typically more complicated. Here is an example.

Example 21.12. Suppose $X_i \stackrel{iid}{\sim} \text{Bin}(1, p)$, $1 \leq i \leq n$. For testing $H_0 : p = p_0$ vs. $H_1 : p \neq p_0$, the LRT statistic is

$$\Lambda_n = \frac{p_0^x (1 - p_0)^{n-x}}{\sup_p\, p^x (1 - p)^{n-x}}, \quad \text{where } x = \sum_{i=1}^n X_i,$$

$$= \frac{p_0^x (1 - p_0)^{n-x}}{\left(\frac{x}{n}\right)^x \left(1 - \frac{x}{n}\right)^{n-x}} = \frac{p_0^x (1 - p_0)^{n-x}\, n^n}{x^x (n - x)^{n-x}}.$$

Thus the LR confidence set is of the form $S_{LR}(x) := \{p_0 : \Lambda_n \geq k\} = \{p_0 : p_0^x (1 - p_0)^{n-x} \geq k^*\} = \{p_0 : x \log p_0 + (n - x) \log(1 - p_0) \geq \log k^*\}$.

The function $x \log p_0 + (n - x) \log(1 - p_0)$ is concave in $p_0$, and therefore $S_{LR}(x)$ is an interval. The interval is of the form $[0, u]$ or $[l, 1]$ or $[l, u]$ for $0 < l, u < 1$. However, $l, u$ cannot be written in closed form; see Brown, Cai, and DasGupta (2001) for asymptotic expansions for them.

Next, $Q_W = n(\hat{p} - p_0)^2 I(\hat{p})$, where $\hat{p} = \frac{x}{n}$ is the MLE of $p$ and $I(p) = \frac{1}{p(1-p)}$ is the Fisher information function. Therefore,

$$Q_W = \frac{n\left(\frac{x}{n} - p_0\right)^2}{\hat{p}(1 - \hat{p})}.$$

The Wald confidence interval is

$$S_W = \left\{p_0 : \frac{n(\hat{p} - p_0)^2}{\hat{p}(1 - \hat{p})} \leq \chi^2_\alpha\right\} = \left\{p_0 : |\hat{p} - p_0| \leq \frac{\chi_\alpha}{\sqrt{n}}\sqrt{\hat{p}(1 - \hat{p})}\right\} = \left[\hat{p} - \frac{\chi_\alpha}{\sqrt{n}}\sqrt{\hat{p}(1 - \hat{p})},\ \hat{p} + \frac{\chi_\alpha}{\sqrt{n}}\sqrt{\hat{p}(1 - \hat{p})}\right],$$

where $\chi^2_\alpha$ denotes the upper $\alpha$th percentile of the $\chi^2_1$ distribution and $\chi_\alpha$ its square root. This is the textbook confidence interval for $p$.

For the score test statistic, we need $U_n(p) = \sum_{i=1}^n \frac{\partial}{\partial p} \log f(x_i|p) = \sum_{i=1}^n \frac{\partial}{\partial p}\left[x_i \log p + (1 - x_i)\log(1 - p)\right] = \frac{x - np}{p(1-p)}$. Therefore, $Q_S = \frac{1}{n} \frac{(x - np_0)^2}{p_0(1 - p_0)}$, and the score confidence interval is

$$S_S = \left\{p_0 : \frac{(x - np_0)^2}{np_0(1 - p_0)} \leq \chi^2_\alpha\right\} = \left\{p_0 : (x - np_0)^2 \leq n\chi^2_\alpha\, p_0(1 - p_0)\right\} = \left\{p_0 : p_0^2(n^2 + n\chi^2_\alpha) - p_0(2nx + n\chi^2_\alpha) + x^2 \leq 0\right\} = [l_S, u_S],$$

where $l_S, u_S$ are the roots of the quadratic equation $p_0^2(n^2 + n\chi^2_\alpha) - p_0(2nx + n\chi^2_\alpha) + x^2 = 0$.

These intervals all have the property

$$\lim_{n \to \infty} P_{p_0}(S \ni p_0) = 1 - \alpha.$$

See Brown, Cai, and DasGupta (2001) for a comparison of their exact coverage properties; they show that the performance of the Wald confidence interval is extremely poor, and that the score and likelihood ratio intervals perform much better.
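A small sketch (Python; the data values are arbitrary) computes all three intervals of Example 21.12. Since the likelihood ratio endpoints have no closed form, they are located here by a dense grid search for where the log-likelihood falls $\chi^2_\alpha/2$ below its maximum; a root-finder would also work.

```python
import numpy as np
from scipy import stats

n, x, alpha = 50, 12, 0.05
phat = x / n
chi2a = stats.chi2.ppf(1 - alpha, df=1)

# Wald interval.
half = np.sqrt(chi2a) * np.sqrt(phat * (1 - phat) / n)
wald = (phat - half, phat + half)

# Score interval: roots of p0^2 (n^2 + n chi2a) - p0 (2 n x + n chi2a) + x^2 = 0.
a, b, c = n**2 + n * chi2a, -(2 * n * x + n * chi2a), x**2
score = tuple(np.sort(np.roots([a, b, c])))

# Likelihood ratio interval, by a grid search on the log-likelihood drop.
p0 = np.linspace(1e-6, 1 - 1e-6, 100001)
loglik = x * np.log(p0) + (n - x) * np.log1p(-p0)
keep = p0[2 * (loglik.max() - loglik) <= chi2a]
lr = (keep.min(), keep.max())

print("Wald: ", wald)
print("Score:", score)
print("LR:   ", lr)
```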


21.8 Exercises

For each of problems 1 to 18 below, derive the LRT statistic and find its limiting distribution under the null, if a meaningful one exists.

Exercise 21.1. $X_1, X_2, \ldots, X_n \stackrel{iid}{\sim} \text{Ber}(p)$, $H_0 : p = \frac{1}{2}$ vs. $H_1 : p \neq \frac{1}{2}$. Is the LRT statistic a monotone function of $|\sum_i X_i - \frac{n}{2}|$?

Exercise 21.2. $X_1, X_2, \ldots, X_n \stackrel{iid}{\sim} N(\mu, 1)$, $H_0 : a \leq \mu \leq b$, $H_1$: not $H_0$.

Exercise 21.3. * $X_1, X_2, \ldots, X_n \stackrel{iid}{\sim} N(\mu, 1)$, $H_0 : \mu$ is an integer, $H_1 : \mu$ is not an integer.

Exercise 21.4. * $X_1, X_2, \ldots, X_n \stackrel{iid}{\sim} N(\mu, 1)$, $H_0 : \mu$ is rational, $H_1 : \mu$ is irrational.

Exercise 21.5. $X_1, X_2, \ldots, X_n \stackrel{iid}{\sim} \text{Ber}(p_1)$, $Y_1, Y_2, \ldots, Y_m \stackrel{iid}{\sim} \text{Ber}(p_2)$, $H_0 : p_1 = p_2$, $H_1 : p_1 \neq p_2$; assume all $m + n$ observations are independent.

Exercise 21.6. $X_1, X_2, \ldots, X_n \stackrel{iid}{\sim} \text{Ber}(p_1)$, $Y_1, Y_2, \ldots, Y_m \stackrel{iid}{\sim} \text{Ber}(p_2)$; $H_0 : p_1 = p_2 = \frac{1}{2}$, $H_1$: not $H_0$; all observations are independent.

Exercise 21.7. $X_1, X_2, \ldots, X_n \stackrel{iid}{\sim} \text{Poi}(\mu)$, $Y_1, Y_2, \ldots, Y_m \stackrel{iid}{\sim} \text{Poi}(\lambda)$, $H_0 : \mu = \lambda = 1$, $H_1$: not $H_0$; all observations are independent.

Exercise 21.8. * $X_1, X_2, \ldots, X_n \stackrel{iid}{\sim} N(\mu_1, \sigma_1^2)$, $Y_1, Y_2, \ldots, Y_m \stackrel{iid}{\sim} N(\mu_2, \sigma_2^2)$, $H_0 : \mu_1 = \mu_2, \sigma_1 = \sigma_2$, $H_1$: not $H_0$ (i.e., test that the two normal distributions are identical); again, all observations are independent.

Exercise 21.9. $X_1, X_2, \ldots, X_n \stackrel{iid}{\sim} N(\mu, \sigma^2)$, $H_0 : \mu = 0, \sigma = 1$, $H_1$: not $H_0$.

Exercise 21.10. $X_1, X_2, \ldots, X_n \stackrel{iid}{\sim} BVN(\mu_1, \mu_2, \sigma_1, \sigma_2, \rho)$, $H_0 : \rho = 0$, $H_1 : \rho \neq 0$.

Exercise 21.11. $X_1, X_2, \ldots, X_n \stackrel{iid}{\sim} MVN(\mu, I)$, $H_0 : \mu = 0$, $H_1 : \mu \neq 0$.

Exercise 21.12. * $X_1, X_2, \ldots, X_n \stackrel{iid}{\sim} MVN(\mu, \Sigma)$, $H_0 : \mu = 0$, $H_1 : \mu \neq 0$.

Exercise 21.13. $X_1, X_2, \ldots, X_n \stackrel{iid}{\sim} MVN(\mu, \Sigma)$, $H_0 : \Sigma = I$, $H_1 : \Sigma \neq I$.

Exercise 21.14. * $X_{ij} \stackrel{indep.}{\sim} U[0, \theta_i]$, $j = 1, 2, \ldots, n$, $i = 1, 2, \ldots, k$, $H_0$: the $\theta_i$ are equal, $H_1$: not $H_0$.

Remark: Notice the EXACT chi-square distribution.

Exercise 21.15. $X_{ij} \stackrel{indep.}{\sim} N(\mu_i, \sigma^2)$, $H_0$: the $\mu_i$ are equal, $H_1$: not $H_0$.

Remark: Notice that you get the usual F test for ANOVA.

Exercise 21.16. * $X_1, X_2, \ldots, X_n \stackrel{iid}{\sim} c(\alpha) \exp(-|x|^\alpha)$, where $c(\alpha)$ is the normalizing constant, $H_0 : \alpha = 2$, $H_1 : \alpha \neq 2$.

Exercise 21.17. $(X_i, Y_i)$, $i = 1, 2, \ldots, n \stackrel{iid}{\sim} BVN(\mu_1, \mu_2, \sigma_1, \sigma_2, \rho)$, $H_0 : \mu_1 = \mu_2$, $H_1 : \mu_1 \neq \mu_2$.

Remark: This is the paired t test.

Exercise 21.18. $X_1, X_2, \ldots, X_n \stackrel{iid}{\sim} N(\mu_1, \sigma_1^2)$, $Y_1, Y_2, \ldots, Y_m \stackrel{iid}{\sim} N(\mu_2, \sigma_2^2)$, $H_0 : \sigma_1 = \sigma_2$, $H_1 : \sigma_1 \neq \sigma_2$; assume all observations are independent.

Exercise 21.19. * Suppose $X_1, X_2, \ldots, X_n$ are iid from the two-component mixture $pN(0, 1) + (1 - p)N(\mu, 1)$, where $p, \mu$ are unknown. Consider testing $H_0 : \mu = 0$ vs. $H_1 : \mu \neq 0$. What is the LRT statistic? Simulate the value of the LRT by gradually increasing $n$ and observe its slight decreasing trend as $n$ increases.

Remark: It has been shown by John Hartigan that in this case $-2 \log \Lambda_n \to \infty$.

Exercise 21.20. * Suppose $X_1, X_2, \ldots, X_n$ are iid from the two-component mixture $pN(\mu_1, 1) + (1 - p)N(\mu_2, 1)$, and suppose we know that $|\mu_i| \leq 1$, $i = 1, 2$. Consider testing $H_0 : \mu_1 = \mu_2$ vs. $H_1 : \mu_1 \neq \mu_2$, when $p$, $0 < p < 1$, is unknown. What is the LRT statistic? Simulate the distribution of the LRT under the null by gradually increasing $n$.

Remark: The asymptotic null distribution is nonstandard, but known.

Exercise 21.21. * For each of Exercises 1, 6, 7, 9, 10, 11, simulate the exact distribution of $-2 \log \Lambda_n$ for $n = 10, 20, 40, 100$ under the null, and compare its visual match to the limiting null distribution.

Exercise 21.22. * For each of Exercises 1, 5, 6, 7, 9, simulate the exact power of the LRT at selected alternatives, and compare it to the approximation to the power obtained from the limiting nonnull distribution, as indicated in the text.

Exercise 21.23. * For each of Exercises 1, 6, 7, 9, compute the constant $a$ of the Bartlett correction of the likelihood ratio, as indicated in the text.

Exercise 21.24. * Let $X_1, X_2, \ldots, X_n \stackrel{iid}{\sim} N(\mu, \sigma^2)$. Plot the Wald and the Rao score confidence sets for $\theta = (\mu, \sigma)$ for a simulated data set. Observe the visual similarity of the two sets by gradually increasing $n$.

Exercise 21.25. Let $X_1, X_2, \ldots, X_n \stackrel{iid}{\sim} \text{Poi}(\mu)$. Compute the likelihood ratio, Wald, and Rao score confidence intervals for $\mu$ for a simulated data set. Observe the similarity of the limits of the intervals for large $n$.

Exercise 21.26. * Let $X_1, X_2, \ldots, X_n \stackrel{iid}{\sim} \text{Poi}(\mu)$. Derive a two-term expansion for the expected lengths of the Wald and the Rao score intervals for $\mu$. Plot them for selected $n$ and check whether either one is usually shorter than the other.

Exercise 21.27. Consider the linear regression model $E(Y_i) = \beta_0 + \beta_1 x_i$, $i = 1, 2, \ldots, n$, with the $Y_i$ mutually independent and distributed as $N(\beta_0 + \beta_1 x_i, \sigma^2)$, $\sigma^2$ unknown. Find the likelihood ratio confidence set for $(\beta_0, \beta_1)$.

21.9 References

Barndorff-Nielsen, O. and Hall, P. (1988). On the level error after Bartlett adjustment of the likelihood ratio statistic, Biometrika, 75, 374-378.

Bartlett, M. (1937). Properties of sufficiency and statistical tests, Proc. Royal Soc. London, Ser. A, 160, 268-282.

Bickel, P.J. and Doksum, K. (2001). Mathematical Statistics: Basic Ideas and Selected Topics, Prentice Hall, Upper Saddle River, NJ.

Brown, L., Cai, T., and DasGupta, A. (2001). Interval estimation for a binomial proportion, Statist. Sc., 16, 2, 101-133.

Casella, G. and Strawderman, W.E. (1980). Estimating a bounded normal mean, Ann. Stat., 9, 4, 870-878.

Fan, J., Hung, H., and Wong, W. (2000). Geometric understanding of likelihood ratio statistics, JASA, 95, 451, 836-841.

Fan, J. and Zhang, C. (2001). Generalized likelihood ratio statistics and Wilks' phenomenon, Ann. Stat., 29, 1, 153-193.

Ferguson, T.S. (1996). A Course in Large Sample Theory, Chapman and Hall, London.

Lawley, D.N. (1956). A general method for approximating the distribution of likelihood ratio criteria, Biometrika, 43, 295-303.

McCullagh, P. and Cox, D. (1986). Invariants and likelihood ratio statistics, Ann. Stat., 14, 4, 1419-1430.

Mukerjee, R. and Reid, N. (2001). Comparison of test statistics via expected lengths of associated confidence intervals, Jour. Stat. Planning and Inf., 97, 1, 141-151.

Portnoy, S. (1988). Asymptotic behavior of likelihood methods for exponential families when the number of parameters tends to infinity, Ann. Stat., 16, 356-366.

Rao, C.R. (1948). Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation, Proc. Cambridge Philos. Soc., 44, 50-57.

Robertson, T., Wright, F.T. and Dykstra, R.L. (1988). Order Restricted Statistical Inference, John Wiley, New York.

Sen, P.K. and Singer, J. (1993). Large Sample Methods in Statistics: An Introduction with Applications, Chapman and Hall, New York.

Serfling, R. (1980). Approximation Theorems of Mathematical Statistics, Wiley, New York.

van der Vaart, A. (1998). Asymptotic Statistics, Cambridge University Press, Cambridge.

Wald, A. (1943). Tests of statistical hypotheses concerning several parameters when the number of observations is large, Trans. Amer. Math. Soc., 54, 426-482.

Wilks, S. (1938). The large sample distribution of the likelihood ratio for testing composite hypotheses, Ann. Math. Statist., 9, 60-62.