
Chapter 5

Bayes Methods and Elementary Decision Theory

1. Elementary Decision Theory

2. Structure of the risk body: the finite case

3. The finite case: relations between Bayes, minimax, and admissibility

4. Posterior distributions

5. Finding Bayes rules

6. Finding Minimax rules

7. Admissibility and Inadmissibility

8. Asymptotic theory of Bayes estimators


Chapter 5

Bayes Methods and Elementary Decision Theory

1 Elementary Decision Theory

Notation 1.1. Let

• $\Theta \equiv$ the states of nature.

• $\mathcal{A} \equiv$ the action space.

• $\mathcal{X} \equiv$ the sample space of a random variable $X$ with distribution $P_\theta$.

• $P : \mathcal{X} \times \Theta \to [0,1]$; $P_\theta(X = x) =$ probability of observing $X = x$ when $\theta$ is true.

• $L : \Theta \times \mathcal{A} \to \mathbb{R}^+$; a loss function.

• $d : \mathcal{A} \times \mathcal{X} \to [0,1]$, $d(a, x) = d(a|x) =$ probability of action $a$ when $X = x$ (a decision function).

• $\mathcal{D} \equiv \{$all decision functions$\}$.

The risk for the rule $d \in \mathcal{D}$ when $\theta \in \Theta$ is true is
\[
R(\theta, d) = E_\theta L(\theta, d(\cdot|X))
= \int_{\mathcal{X}} \int_{\mathcal{A}} L(\theta, a)\, d(da|X = x)\, P_\theta(dx)
= \sum_{i=1}^k L(\theta, a_i) \left\{ \sum_{j=1}^m d(a_i, x_j) P_\theta(X = x_j) \right\}
\]
in the discrete case. We will call $R : \Theta \times \mathcal{D} \to \mathbb{R}^+$ the risk function.

A decision rule $d$ is inadmissible if there is a rule $d'$ such that $R(\theta, d') \le R(\theta, d)$ for all $\theta$ and $R(\theta, d') < R(\theta, d)$ for some $\theta$. A decision rule $d$ is admissible if it is not inadmissible.

Example 1.1 Hypothesis testing. $\mathcal{A} = \{0, 1\}$, $\Theta = \Theta_0 \cup \Theta_1$,
\[
L(\theta, 0) = l_0 1_{\Theta_1}(\theta), \qquad L(\theta, 1) = l_1 1_{\Theta_0}(\theta).
\]
Since $\mathcal{A}$ contains just two points, $d(1|x) = 1 - d(0|x) \equiv \varphi(x)$; and then
\[
P_\theta(\text{accept } H_0) = E_\theta d(0|X) = E_\theta(1 - \varphi(X)), \qquad
P_\theta(\text{reject } H_0) = E_\theta d(1|X) = E_\theta \varphi(X),
\]
and
\[
R(\theta, d) = l_1 E_\theta \varphi(X) 1_{\Theta_0}(\theta) + l_0 E_\theta(1 - \varphi(X)) 1_{\Theta_1}(\theta).
\]
The classical Neyman–Pearson hypothesis testing philosophy bounds $l_1 \alpha \equiv \sup_{\theta \in \Theta_0} R(\theta, d)$ and tries to minimize
\[
l_0 (1 - \beta_d(\theta)) \equiv R(\theta, d)
\]
for each $\theta \in \Theta_1$.

Example 1.2 Estimation. $\mathcal{A} = \Theta$; typically $\Theta = \mathbb{R}^s$ for some $s$ (but sometimes it is useful to take $\Theta$ to be a more general metric space $(\Theta, d)$). A typical loss function (often used for mathematical simplicity more than anything else) is
\[
L(\theta, a) = K|\theta - a|^2 \quad \text{for some } K.
\]
Then the risk of a rule $d$ is:
\begin{align*}
R(\theta, d) &= K E_\theta |\theta - d(\cdot|X)|^2 \\
&= K E_\theta |\theta - d(X)|^2 \quad \text{for a non-randomized rule } d \\
&= K \{ Var_\theta(d(X)) + [\text{bias}_\theta(d(X))]^2 \} \quad \text{when } s = 1.
\end{align*}

Definition 1.1 A decision rule $d_M$ is minimax if
\[
\inf_{d \in \mathcal{D}} \sup_{\theta \in \Theta} R(\theta, d) = \sup_{\theta \in \Theta} R(\theta, d_M).
\]

Definition 1.2 A probability distribution $\Lambda$ over $\Theta$ is called a prior distribution.

Definition 1.3 For any given prior $\Lambda$ and $d \in \mathcal{D}$, the Bayes risk of $d$ with respect to $\Lambda$ is
\[
R(\Lambda, d) = \int_\Theta R(\theta, d)\, d\Lambda(\theta)
= \sum_{i=1}^l R(\theta_i, d)\, \lambda_i \quad \text{if } \Theta = \{\theta_1, \ldots, \theta_l\}.
\]

Definition 1.4 A Bayes decision rule with respect to $\Lambda$, $d_\Lambda$, is any rule satisfying
\[
R(\Lambda, d_\Lambda) = \inf_{d \in \mathcal{D}} R(\Lambda, d) \equiv \text{Bayes risk}.
\]

Example 1.3 $\Theta = \{1, 2\}$.

Urn 1: 10 red balls, 20 blue balls, 70 green balls.
Urn 2: 40 red balls, 40 blue balls, 20 green balls.

One ball is drawn from one of the two urns. Problem: decide which urn the ball came from if the losses $L(\theta, a)$ are given by:

θ\a     1    2
1       0   10
2       6    0

Let $d = (d_R, d_B, d_G)$ with $d_x =$ probability of choosing urn 1 if color $X = x$ is observed. Then
\[
R(1, d) = 10 P_1(\text{action } 2) = 10\{.1(1 - d_R) + .2(1 - d_B) + .7(1 - d_G)\}
\]
and, similarly,
\[
R(2, d) = 6\{.4 d_R + .4 d_B + .2 d_G\}.
\]
If the prior distribution on the urns is given by $\lambda_1 = \lambda$ and $\lambda_2 = 1 - \lambda$, then the Bayes risk is
\[
R(\Lambda, d) = 10\lambda + (2.4 - 3.4\lambda) d_R + (2.4 - 4.4\lambda) d_B + (1.2 - 8.2\lambda) d_G.
\]
This is minimized by choosing $d_x = 1$ if its coefficient is negative, $0$ if its coefficient is positive. For example, if $\lambda = 1/2$, then the Bayes risk with respect to $\lambda$ equals
\[
5 + .7 d_R + .2 d_B - 2.9 d_G,
\]
which is minimized by $d_R = d_B = 0$, $d_G = 1$; i.e. the Bayes rule $d_\Lambda$ with respect to $\lambda = 1/2$ is $d_\Lambda = (0, 0, 1)$. Note that the Bayes rule is in fact a non-randomized rule. This gives us the Bayes risk for $\lambda = 1/2$ as $R(1/2, d_\Lambda) = 2.1$.

I claim that the minimax rule is $d_M = (0, 9/22, 1)$, which is a randomization of the two non-randomized rules $d_2 = (0, 0, 1)$ and $d_7 = (0, 1, 1)$. This is easily confirmed by computing $R(1, d_M) = 240/110 = R(2, d_M)$. Here is a table giving all of the nonrandomized rules and their risks:

X        d1   d2   d3   d4   d5   d6   d7   d8
R         0    0    0    1    1    1    0    1
B         0    0    1    0    1    0    1    1
G         0    1    0    0    0    1    1    1
R(1,d)   10    3    8    9    7    2    1    0
R(2,d)    0  1.2  2.4  2.4  4.8  3.6  3.6    6
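The computations in this example are easy to check mechanically. Here is a small Python sketch using exact rational arithmetic; the urn compositions and losses are those given above:

```python
# Risk and Bayes-risk computations for Example 1.3 (two urns), with exact fractions.
from fractions import Fraction as F

p1 = {"R": F(1, 10), "B": F(2, 10), "G": F(7, 10)}   # urn 1 color probabilities
p2 = {"R": F(4, 10), "B": F(4, 10), "G": F(2, 10)}   # urn 2 color probabilities

def risks(d):
    """d maps color -> probability of choosing urn 1; returns (R(1,d), R(2,d))."""
    r1 = 10 * sum(p1[c] * (1 - d[c]) for c in "RBG")  # loss 10 for action 2 when theta = 1
    r2 = 6 * sum(p2[c] * d[c] for c in "RBG")         # loss 6 for action 1 when theta = 2
    return r1, r2

def bayes_risk(lam, d):
    """Bayes risk lam * R(1,d) + (1 - lam) * R(2,d)."""
    r1, r2 = risks(d)
    return lam * r1 + (1 - lam) * r2

d_bayes = {"R": F(0), "B": F(0), "G": F(1)}           # Bayes rule for lambda = 1/2
d_minimax = {"R": F(0), "B": F(9, 22), "G": F(1)}     # claimed minimax rule

assert bayes_risk(F(1, 2), d_bayes) == F(21, 10)      # Bayes risk 2.1
assert risks(d_minimax) == (F(24, 11), F(24, 11))     # equal risks 240/110 = 24/11
```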


2 Structure of the Risk Body: the finite case

Suppose that
\[
\Theta = \{\theta_1, \ldots, \theta_l\}, \qquad \mathcal{X} = \{x_1, \ldots, x_m\}, \qquad \mathcal{A} = \{a_1, \ldots, a_k\}.
\]

Definition 2.1 A set $A \subset \mathbb{R}^l$ is convex if, for all $\lambda \in [0,1]$ and $x, y \in A$, $\lambda x + (1 - \lambda)y \in A$.

Lemma 2.1 Let $\lambda = (\lambda_1, \ldots, \lambda_t)$ be a probability distribution (so $\lambda_i \ge 0$ for $i = 1, \ldots, t$, and $\sum_{i=1}^t \lambda_i = 1$). Then for any $d_1, \ldots, d_t \in \mathcal{D}$, $\sum_{i=1}^t \lambda_i d_i \in \mathcal{D}$.

Proof. Set $d(a_l, x_j) = \sum_{i=1}^t \lambda_i d_i(a_l, x_j)$. Then $d(a_l, x_j) \ge 0$ and
\[
\sum_{l=1}^k d(a_l, x_j) = \sum_{i=1}^t \lambda_i \sum_{l=1}^k d_i(a_l, x_j) = \sum_{i=1}^t \lambda_i = 1. \qquad \Box
\]

We call
\[
\mathcal{R} = \{(R(\theta_1, d), \ldots, R(\theta_l, d)) \in \mathbb{R}_+^l : d \in \mathcal{D}\}
\]
the risk body.

Theorem 2.1 The risk body is convex.

Proof. Let $d_1, d_2 \in \mathcal{D}$ denote the rules corresponding to $R_i = (R(\theta_1, d_i), \ldots, R(\theta_l, d_i))$, $i = 1, 2$, and set $d = \lambda d_1 + (1 - \lambda) d_2 \in \mathcal{D}$ by the lemma. Then
\[
\lambda R(\theta_l, d_1) + (1 - \lambda) R(\theta_l, d_2)
= \sum_{i=1}^k \sum_{j=1}^m L(\theta_l, a_i) \{\lambda d_1(a_i, x_j) + (1 - \lambda) d_2(a_i, x_j)\}\, p_{\theta_l}(x_j)
= R(\theta_l, d),
\]
so $d \in \mathcal{D}$ has risk point $\lambda R_1 + (1 - \lambda) R_2 \in \mathcal{R}$. $\Box$

Theorem 2.2 Every $d \in \mathcal{D}$ may be expressed as a convex combination of the nonrandomized rules.

Proof. Any $d \in \mathcal{D}$ may be represented as a $k \times m$ matrix of nonnegative real numbers whose columns add to 1. The nonrandomized rules are those whose entries are 0's and 1's. Set $d(a_i, x_j) = d_{ij}$, $d = (d_{ij})$, $1 \le i \le k$, $1 \le j \le m$.

There are $k^m$ nonrandomized decision rules: call them $\delta^{(r)} = (\delta^{(r)}_{ij})$, for $1 \le r \le k^m$. Given $d \in \mathcal{D}$, we want to find $\lambda_r \ge 0$ with $\sum_{r=1}^{k^m} \lambda_r = 1$ so that
\[
\sum_{r=1}^{k^m} \lambda_r \delta^{(r)} = d.
\]
Claim:
\[
\lambda_r \equiv \prod_{j=1}^m \prod_{i=1}^k d_{ij}^{\delta^{(r)}_{ij}} = \prod_{j=1}^m \left\{ \sum_{i=1}^k d_{ij} \delta^{(r)}_{ij} \right\}
\]
works. (This is easily proved by induction on $m$.) $\Box$

Remark 2.1 The space of decision rules defined above, $\mathcal{D}$, with $d \in \mathcal{D}$ being a probability distribution over actions given that $X = x$, corresponds to the behavioral decision rules as discussed by Ferguson (1967), page 25. This differs from Ferguson's collection $\mathcal{D}^*$, the randomized decision functions, which are probability distributions over the non-randomized rules.
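The weights in the proof of Theorem 2.2 can also be verified by brute force for a small decision problem. The following sketch (with an arbitrary illustrative rule $d$, $k = 2$ actions, $m = 3$ sample points) enumerates all $k^m$ nonrandomized rules and checks that the claimed $\lambda_r$ reconstruct $d$:

```python
# Verify Theorem 2.2: lambda_r = prod_j d_{delta_r(j), j} gives a convex
# combination of nonrandomized rules equal to the behavioral rule d.
from itertools import product
from fractions import Fraction as F

k, m = 2, 3
# d[i][j] = probability of action i given x_j; columns sum to 1 (illustrative values)
d = [[F(1, 4), F(1, 2), F(0)],
     [F(3, 4), F(1, 2), F(1)]]

recon = [[F(0)] * m for _ in range(k)]
total = F(0)
for delta in product(range(k), repeat=m):   # delta picks one action per x_j
    lam = F(1)
    for j in range(m):
        lam *= d[delta[j]][j]               # lambda_r = prod_j d_{delta(j), j}
    total += lam
    for j in range(m):
        recon[delta[j]][j] += lam           # accumulate sum_r lambda_r delta^{(r)}

assert total == 1      # the weights form a probability distribution
assert recon == d      # and they reconstruct d exactly
```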


3 The finite case: relations between Bayes, minimax, and admissibility

This section continues our examination of the special, but illuminating, case of a finite set $\Theta$. In this case we can prove a number of results about Bayes and minimax rules and connections between them which carry over to more general problems with some appropriate reformulation.

Theorem 3.1 A minimax rule always exists.

Proof. If $0 \notin \mathcal{R}$, then for some $c > 0$ the cube with all sides of length $c$ in the positive quadrant does not intersect $\mathcal{R}$. Let $c$ increase until the cube intersects $\mathcal{R}$; call this number $c_0$. Any decision rule with a risk point which intersects the $c_0$ cube is minimax. (Note that the equation for the boundary of a cube is $\max_i R(\theta_i, d) = $ constant.) $\Box$

Theorem 3.2 If $\lambda = (\lambda_1, \ldots, \lambda_l)$ is a prior, then Bayes decision rules have risk points on the hyperplane $\{x \in \mathbb{R}^l : \sum_i \lambda_i x_i = c^*\}$ where
\[
c^* = \inf\Big\{c \ge 0 : \text{the plane determined by } \sum_i \lambda_i x_i = c \text{ intersects } \mathcal{R}\Big\}.
\]

Proof. The Bayes risk is $R(\Lambda, d) = \sum_{i=1}^l \lambda_i R(\theta_i, d)$, and we want to minimize it over $d \in \mathcal{D}$. $\Box$

Lemma 3.1 An admissible rule is a Bayes rule for some prior $\lambda$.

Proof. See the picture! $\Box$

Note that not every Bayes rule is admissible.

Theorem 3.3 Suppose that $d_0$ is Bayes with respect to $\lambda = (\lambda_1, \ldots, \lambda_l)$ and $\lambda_i > 0$, $i = 1, \ldots, l$. Then $d_0$ is admissible.

Proof. Suppose that $d_0$ is not admissible; then there is a rule $d$ better than $d_0$: i.e.
\[
R(\theta_i, d) \le R(\theta_i, d_0) \quad \text{for all } i
\]
with $<$ for some $i$. Since $\lambda_i > 0$ for $i = 1, \ldots, l$,
\[
R(\Lambda, d) = \sum_{i=1}^l \lambda_i R(\theta_i, d) < \sum_{i=1}^l \lambda_i R(\theta_i, d_0) = R(\Lambda, d_0),
\]
contradicting the fact that $d_0$ is Bayes with respect to $\lambda$. $\Box$

Theorem 3.4 If $d_B \in \mathcal{D}$ is Bayes for $\lambda$ and it has constant risk, then $d_B$ is minimax.

Proof. Let $r_0$ be the constant risk. Assume $d_B$ is not minimax and let $d_M$ be the minimax rule (which exists). Then
\[
R(\theta_i, d_M) \le \max_{1 \le j \le l} R(\theta_j, d_M) < \max_{1 \le j \le l} R(\theta_j, d_B) = r_0 \tag{a}
\]
and
\[
\min_{d \in \mathcal{D}} R(\lambda, d) = R(\lambda, d_B) = \sum_i \lambda_i R(\theta_i, d_B) = r_0. \tag{b}
\]
But (a) yields
\[
R(\lambda, d_M) = \sum_i \lambda_i R(\theta_i, d_M) < \sum_i \lambda_i r_0 = r_0, \tag{c}
\]
which contradicts (b). $\Box$

Example 3.1 We return to Example 1.3, and consider it from the perspective of Theorem 3.4. When the prior is $\lambda = (\lambda, 1 - \lambda)$ with $\lambda = 6/11$, we find that the Bayes risk is given by
\begin{align*}
R(\Lambda, d) &= 10\lambda + (2.4 - 3.4\lambda) d_R + (2.4 - 4.4\lambda) d_B + (1.2 - 8.2\lambda) d_G \\
&= 10 \cdot \frac{6}{11} + \frac{6}{11}\, d_R + 0 \cdot d_B - \frac{36}{11}\, d_G,
\end{align*}
so all the rules $d_\Lambda = (0, d_B, 1)$ are Bayes with respect to $\lambda = 6/11$. Can we find one with constant risk? The risks of these Bayes rules $d_\Lambda$ are given by
\[
R(1, d_\Lambda) = 1 + 2(1 - d_B), \qquad R(2, d_\Lambda) = 0 + 2.4\, d_B + 1.2,
\]
and these are equal if $d_B$ satisfies $3 - 2 d_B = 1.2 + 2.4 d_B$, and hence $d_B = 9/22$. For this particular Bayes rule, $d_\Lambda = (0, 9/22, 1)$, and the risks are given by $R(1, d_\Lambda) = 24/11 = R(2, d_\Lambda)$. Thus the hypotheses of Theorem 3.4 are satisfied, and we conclude that $d_\Lambda = (0, 9/22, 1)$ is a minimax rule.
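A quick rational-arithmetic sketch of this example (the urn probabilities and losses are those of Example 1.3):

```python
# Example 3.1: at lambda = 6/11 every rule (0, d_B, 1) is Bayes, and
# d_B = 9/22 equalizes the two risks, giving a minimax rule by Theorem 3.4.
from fractions import Fraction as F

lam = F(6, 11)

def bayes_risk(dR, dB, dG):
    r1 = 10 * (F(1, 10) * (1 - dR) + F(2, 10) * (1 - dB) + F(7, 10) * (1 - dG))
    r2 = 6 * (F(4, 10) * dR + F(4, 10) * dB + F(2, 10) * dG)
    return lam * r1 + (1 - lam) * r2

# The Bayes risk does not depend on d_B (its coefficient vanishes at lambda = 6/11):
assert bayes_risk(0, F(0), 1) == bayes_risk(0, F(1, 2), 1) == bayes_risk(0, F(1), 1)

# d_B = 9/22 equalizes R(1, d) = 3 - 2 d_B and R(2, d) = 1.2 + 2.4 d_B at 24/11:
dB = F(9, 22)
assert 3 - 2 * dB == F(12, 10) + F(24, 10) * dB == F(24, 11)
```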


4 Posterior Distributions

Now we will examine Bayes rules more generally.

Definition 4.1 If $\theta \sim \Lambda$, a prior distribution over $\Theta$, then the conditional distribution of $\theta$ given $X$, $P(\theta \in A | X)$, $A \in \mathcal{B}(\Theta)$, is called the posterior distribution of $\theta$. Write $\Lambda(\theta|x) = P(\vartheta \le \theta | X = x)$ for the conditional distribution function if $\Theta$ is Euclidean.

If $\Lambda$ has density $\pi$ with respect to $\nu$ and $P_\theta$ has density $p_\theta$ with respect to $\mu$, then
\[
\pi(\theta|x) = \frac{p(x|\theta)\, \pi(\theta)}{\int_\Theta p(x|\theta')\, \pi(\theta')\, d\nu(\theta')} = \frac{p(x|\theta)\, \pi(\theta)}{p(x)}
\]
is the posterior density of $\theta$ given $X = x$.

Definition 4.2 Suppose $X$ has density $p(x|\theta)$, and $\pi = \pi(\cdot)$ is a prior density, $\pi \in \mathcal{P}_\Lambda$. If the posterior density $\pi(\cdot|x)$ has the same form (i.e. $\pi(\cdot|x) \in \mathcal{P}_\Lambda$ for a.e. $x$), then $\pi$ is said to be a conjugate prior for $p(\cdot|\theta)$.

Example 4.1 A. (Poisson – Gamma). Suppose that $(X | \theta = \theta) \sim \text{Poisson}(\theta)$, $\theta \sim \Gamma(\alpha, \beta)$. Then
\begin{align*}
\pi(\theta|x) &= \frac{p(x|\theta)\, \pi(\theta)}{p(x)} \propto p(x|\theta)\, \pi(\theta)
= e^{-\theta} \frac{\theta^x}{x!} \cdot \frac{\beta^\alpha}{\Gamma(\alpha)} \theta^{\alpha - 1} e^{-\beta\theta} \\
&= \frac{\beta^\alpha}{x!\, \Gamma(\alpha)}\, \theta^{\alpha + x - 1} e^{-(\beta + 1)\theta},
\end{align*}
so $(\theta | X = x) \sim \Gamma(\alpha + x, \beta + 1)$, and gamma is a conjugate prior for Poisson.

B. (Normal mean – normal). (Example 1.3, Lehmann TPE, page 243). Suppose that $(X | \theta = \theta) \sim N(\theta, \sigma^2)$, $\theta \sim N(\mu, \tau^2)$. Then
\[
(\theta | X = x) \sim N\left( \frac{\mu/\tau^2 + x/\sigma^2}{1/\tau^2 + 1/\sigma^2},\ \frac{1}{1/\tau^2 + 1/\sigma^2} \right).
\]
Note that if $X$ is replaced by $X_1, \ldots, X_n$ i.i.d. $N(\theta, \sigma^2)$, then by sufficiency,
\[
(\theta | \bar{X} = \bar{x}) \sim N\left( \frac{n/\sigma^2}{1/\tau^2 + n/\sigma^2}\, \bar{x} + \frac{1/\tau^2}{1/\tau^2 + n/\sigma^2}\, \mu,\ \frac{1}{1/\tau^2 + n/\sigma^2} \right).
\]

Example 4.2 If $(X | \theta = \theta) \sim \text{Binomial}(n, \theta)$, $\theta \sim \text{Beta}(\alpha, \beta)$, then
\begin{align*}
\pi(\theta|x) &= \frac{p(x|\theta)\, \pi(\theta)}{p(x)} \propto p(x|\theta)\, \pi(\theta)
= \binom{n}{x} \theta^x (1 - \theta)^{n - x}\, \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \\
&= \theta^{\alpha + x - 1} (1 - \theta)^{\beta + n - x - 1} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \binom{n}{x},
\end{align*}
so $(\theta | X = x) \sim \text{Beta}(\alpha + x, \beta + n - x)$.
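The Beta posterior can be checked against brute-force normalization of $p(x|\theta)\pi(\theta)$. A minimal sketch with illustrative values $\alpha = 2$, $\beta = 3$, $n = 10$, $x = 7$ (the grid integration is only for the check, not part of the theory):

```python
# Beta-Binomial conjugacy check: the posterior mean from Beta(alpha + x,
# beta + n - x) should match numeric normalization of the unnormalized posterior.

def beta_binom_posterior_mean(alpha, beta, n, x):
    # conjugacy: (theta | X = x) ~ Beta(alpha + x, beta + n - x)
    a, b = alpha + x, beta + n - x
    return a / (a + b)

def numeric_posterior_mean(alpha, beta, n, x, grid=20000):
    # Riemann-sum normalization of theta^x (1-theta)^(n-x) * prior density kernel
    num = den = 0.0
    for i in range(1, grid):
        t = i / grid
        w = t**x * (1 - t)**(n - x) * t**(alpha - 1) * (1 - t)**(beta - 1)
        num += t * w
        den += w
    return num / den

m1 = beta_binom_posterior_mean(2.0, 3.0, 10, 7)   # (2 + 7) / (2 + 3 + 10) = 0.6
m2 = numeric_posterior_mean(2.0, 3.0, 10, 7)
assert abs(m1 - m2) < 1e-4
```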


Example 4.3 (Multinomial – Dirichlet). Suppose that $(X | \theta = \theta) \sim \text{Multinomial}_k(n, \theta)$ and $\theta \sim \text{Dirichlet}(\alpha)$; i.e.
\[
\pi(\theta) = \frac{\Gamma(\alpha_1 + \cdots + \alpha_k)}{\prod_{i=1}^k \Gamma(\alpha_i)}\, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1}
\]
for $\theta_i \ge 0$, $\sum_{i=1}^k \theta_i = 1$, $\alpha_i > 0$. Then $(\theta | X = x) \sim \text{Dirichlet}(\alpha + x)$,
\[
E(\theta) = \frac{\alpha}{\sum_{i=1}^k \alpha_i},
\]
and
\[
d_\Lambda(X) = \frac{\alpha + X}{\sum \alpha_i + n}
= \frac{\sum \alpha_i}{\sum \alpha_i + n} \cdot \frac{\alpha}{\sum \alpha_i} + \frac{n}{\sum \alpha_i + n} \cdot \frac{X}{n}.
\]


5 Finding Bayes Rules

For convex loss functions we can restrict attention to non-randomized rules; see Corollary 1.7.9, TPE, page 48. The following theorem gives a useful recipe for finding Bayes rules.

Theorem 5.1 Suppose that $\theta \sim \Lambda$, $(X | \theta = \theta) \sim P_\theta$, and $L(\theta, a) \ge 0$ for all $\theta \in \Theta$, $a \in \mathcal{A}$. If
(i) there exists a rule $d_0$ with finite risk, and
(ii) for a.e. $x$ there exists $d_\Lambda(x)$ minimizing
\[
E\{L(\theta, d(\cdot|x)) | X = x\} = \int_\Theta L(\theta, d(\cdot|x))\, \pi(\theta|x)\, d\nu(\theta),
\]
then $d_\Lambda(\cdot|x)$ is a Bayes rule.

Proof. For any rule $d$ with finite risk,
\[
E\{L(\theta, d(X)) | X\} \ge E\{L(\theta, d_\Lambda(X)) | X\} \quad \text{a.s.} \tag{a}
\]
by (ii). Hence
\begin{align*}
R(\Lambda, d) &= E E\{L(\theta, d(X)) | X\} = E L(\theta, d(X)) \tag{b} \\
&\ge E E\{L(\theta, d_\Lambda(X)) | X\} = E L(\theta, d_\Lambda(X)) = R(\Lambda, d_\Lambda). \qquad \Box
\end{align*}

Corollary 1 (Estimation with weighted squared-error loss). If $\Theta = \mathcal{A} = \mathbb{R}$ and $L(\theta, a) = K(\theta)|\theta - a|^2$, then
\[
d_\Lambda(X) = \frac{E\{K(\theta)\theta | X\}}{E\{K(\theta) | X\}}
= \frac{\int_\Theta \theta K(\theta)\, d\Lambda(\theta|X)}{\int_\Theta K(\theta)\, d\Lambda(\theta|X)}.
\]
When $K(\theta) = 1$, then
\[
d_\Lambda(X) = \int_\Theta \theta\, d\Lambda(\theta|X) = E\{\theta | X\} \equiv \text{the posterior mean}.
\]

Proof. For an arbitrary (nonrandomized) rule $d \in \mathcal{D}$,
\begin{align*}
\int K(\theta)|\theta - d(x)|^2\, d\Lambda(\theta|x)
&= \int K(\theta)|\theta - d_\Lambda(x) + d_\Lambda(x) - d(x)|^2\, d\Lambda(\theta|x) \\
&= \int K(\theta)|\theta - d_\Lambda(x)|^2\, d\Lambda(\theta|x) \\
&\quad + 2(d_\Lambda(x) - d(x)) \int K(\theta)\{\theta - d_\Lambda(x)\}\, d\Lambda(\theta|x) \\
&\quad + (d_\Lambda(x) - d(x))^2 \int K(\theta)\, d\Lambda(\theta|x) \\
&\ge \int K(\theta)|\theta - d_\Lambda(x)|^2\, d\Lambda(\theta|x)
\end{align*}
with equality if $d(x) = d_\Lambda(x)$. $\Box$


Corollary 2 (Estimation with $L_1$ loss). If $\Theta = \mathcal{A} = \mathbb{R}$ and $L(\theta, a) = |\theta - a|$, then
\[
d_\Lambda(x) = \text{any median of } \Lambda(\theta|x).
\]

Corollary 3 (Testing with 0–1 loss). If $\mathcal{A} = \{0, 1\}$, $\Theta = \Theta_0 + \Theta_1$ (in the sense of disjoint union of sets), and $L(\theta, a_i) = l_i 1_{\Theta_i^c}(\theta)$, $i = 0, 1$, then any rule of the form
\[
d_\Lambda(x) =
\begin{cases}
1 & \text{if } P(\theta \in \Theta_1 | X = x) > (l_1/l_0)\, P(\theta \in \Theta_0 | X = x), \\
\gamma(x) & \text{if } P(\theta \in \Theta_1 | X = x) = (l_1/l_0)\, P(\theta \in \Theta_0 | X = x), \\
0 & \text{if } P(\theta \in \Theta_1 | X = x) < (l_1/l_0)\, P(\theta \in \Theta_0 | X = x)
\end{cases}
\]
is Bayes with respect to $\Lambda$. Note that this reduces to a test of the Neyman–Pearson form when $\Theta_i = \{\theta_i\}$, $i = 0, 1$.

Proof. Let $\varphi(x) = d(1|x)$. Then
\begin{align*}
E\{L(\theta, \varphi(x)) | X = x\}
&= \int_\Theta L(\theta, \varphi(x))\, d\Lambda(\theta|x) \\
&= \int_\Theta \{l_1 \varphi(x) 1_{\Theta_0}(\theta) + l_0 (1 - \varphi(x)) 1_{\Theta_1}(\theta)\}\, d\Lambda(\theta|x) \\
&= l_1 \varphi(x) P(\theta \in \Theta_0 | X = x) + l_0 (1 - \varphi(x)) P(\theta \in \Theta_1 | X = x) \\
&= l_0 P(\theta \in \Theta_1 | X = x) + \varphi(x)\{l_1 P(\theta \in \Theta_0 | X = x) - l_0 P(\theta \in \Theta_1 | X = x)\},
\end{align*}
which is minimized by any rule of the form $d_\Lambda$. $\Box$

Corollary 4 (Testing with linear loss). If $\Theta = \mathbb{R}$, $\mathcal{A} = \{0, 1\}$, $\Theta_0 = (-\infty, \theta_0]$, $\Theta_1 = (\theta_0, \infty)$, and
\[
L(\theta, 0) = (\theta - \theta_0) 1_{\Theta_1}(\theta); \qquad L(\theta, 1) = (\theta_0 - \theta) 1_{\Theta_0}(\theta);
\]
then
\[
d_\Lambda(x) =
\begin{cases}
1 & \text{if } E(\theta | X = x) > \theta_0, \\
\gamma(x) & \text{if } E(\theta | X = x) = \theta_0, \\
0 & \text{if } E(\theta | X = x) < \theta_0
\end{cases}
\]
is Bayes with respect to $\Lambda$.

Proof. Again, by Theorem 5.1 it suffices to minimize
\begin{align*}
E\{L(\theta, \varphi(x)) | X = x\}
&= \int_\Theta \{\varphi(x)(\theta_0 - \theta) 1_{(-\infty, \theta_0]}(\theta) + (1 - \varphi(x))(\theta - \theta_0) 1_{(\theta_0, \infty)}(\theta)\}\, d\Lambda(\theta|x) \\
&= \int_\Theta (\theta_0 - \theta) 1_{(-\infty, \theta_0]}(\theta)\, d\Lambda(\theta|x) + (1 - \varphi(x))\{E(\theta | X = x) - \theta_0\},
\end{align*}
which is minimized for each fixed $x$ by any rule of the form $d_\Lambda$. $\Box$


Example 5.1 Suppose $X \sim \text{Binomial}(n, \theta)$ with $\theta \sim \text{Beta}(\alpha, \beta)$. Thus $(\theta | X = x) \sim \text{Beta}(\alpha + x, \beta + n - x)$, so that
\[
E(\theta) = \frac{\alpha}{\alpha + \beta},
\]
and hence for $L(\theta, a) = (\theta - a)^2$ the Bayes rule is
\[
d_\Lambda(X) = E\{\theta | X\} = \frac{\alpha + X}{\alpha + \beta + n}
= \frac{\alpha + \beta}{\alpha + \beta + n} \cdot \frac{\alpha}{\alpha + \beta} + \frac{n}{\alpha + \beta + n} \cdot \frac{X}{n}.
\]
If the loss is $L(\theta, a) = (\theta - a)^2 / \{\theta(1 - \theta)\}$ instead of squared-error loss, then, with $B(\alpha, \beta) \equiv \Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha + \beta)$,
\[
E\{\theta K(\theta) | X\} = \frac{B(\alpha + X, \beta + n - X - 1)}{B(\alpha + X, \beta + n - X)}, \qquad
E\{K(\theta) | X\} = \frac{B(\alpha + X - 1, \beta + n - X - 1)}{B(\alpha + X, \beta + n - X)},
\]
and hence the Bayes rule with respect to $\Lambda$ for this loss function is
\begin{align*}
d_\Lambda(X) &= \frac{B(\alpha + X, \beta + n - X - 1)}{B(\alpha + X - 1, \beta + n - X - 1)} \\
&= \gamma_n(\alpha, \beta)\, \frac{\alpha - 1}{\alpha + \beta - 2} + (1 - \gamma_n(\alpha, \beta))\, \frac{X}{n}
\end{align*}
where $\gamma_n(\alpha, \beta) = (\alpha + \beta - 2)/(\alpha + \beta + n - 2)$. Note that when $\alpha = \beta = 1$, the Bayes estimator for this loss function becomes the familiar maximum likelihood estimator $\hat{p} = X/n$.
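A small sketch checking the two forms of the Bayes rule for this weighted loss, and the reduction to $X/n$ at $\alpha = \beta = 1$ (the parameter values are illustrative):

```python
# Example 5.1 with loss (theta - a)^2 / {theta (1 - theta)}:
# d(X) = (alpha + X - 1) / (alpha + beta + n - 2), written two ways.
from fractions import Fraction as F

def d_weighted(alpha, beta, n, x):
    return F(alpha + x - 1, alpha + beta + n - 2)

def d_convex(alpha, beta, n, x):
    if alpha + beta == 2:
        return F(x, n)                       # gamma_n = 0: the rule is just X/n
    g = F(alpha + beta - 2, alpha + beta + n - 2)   # gamma_n(alpha, beta)
    return g * F(alpha - 1, alpha + beta - 2) + (1 - g) * F(x, n)

n, x = 20, 7
assert d_weighted(3, 5, n, x) == d_convex(3, 5, n, x) == F(9, 26)
assert d_weighted(1, 1, n, x) == F(x, n)     # alpha = beta = 1 gives the MLE X/n
```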

Example 5.2 Suppose that $(X | \theta) \sim \text{Multinomial}_k(n, \theta)$, and $\theta \sim \text{Dirichlet}(\alpha)$. Then
\[
E(\theta) = \frac{\alpha}{\sum \alpha_i},
\]
and for squared-error loss the Bayes rule is
\[
d_\Lambda(X) = E(\theta | X) = \frac{\alpha + X}{\sum \alpha_i + n}.
\]

Example 5.3 Normal with normal prior. If $(X | \theta = \theta) \sim N(\theta, \sigma^2)$, $\theta \sim N(\mu, \tau^2)$, then
\[
(\theta | X) \sim N\left( \frac{1/\tau^2}{1/\tau^2 + 1/\sigma^2}\, \mu + \frac{1/\sigma^2}{1/\tau^2 + 1/\sigma^2}\, X,\ \frac{1}{1/\tau^2 + 1/\sigma^2} \right).
\]
Consequently
\[
E(\theta) = \mu,
\]
and, for squared-error loss, the Bayes rule is
\[
d_\Lambda(X) = E(\theta | X) = \frac{1/\tau^2}{1/\tau^2 + 1/\sigma^2}\, \mu + \frac{1/\sigma^2}{1/\tau^2 + 1/\sigma^2}\, X.
\]
This remains true if $L(\theta, a) = \rho(\theta - a)$ where $\rho$ is convex and even (see e.g. Lehmann, TPE, page 244).


The following example of the general set-up is sometimes referred to as a "classification problem". Suppose that $\Theta = \{\theta_1, \ldots, \theta_k\}$ and $\mathcal{A} = \{a_1, \ldots, a_k\} = \Theta$. Suppose that $X \sim P_i \equiv P_{\theta_i}$, $i = 1, \ldots, k$, when $\theta_i$ is the true state of nature, where $P_i$ has density $p_i$ with respect to a $\sigma$-finite dominating measure $\mu$. A simple loss function is given by
\[
L(\theta_i, a_j) = 1\{i \ne j\}, \quad i = 1, \ldots, k, \quad j = 1, \ldots, k.
\]
Given a ("multiple") decision rule $d = (d(1|X), \ldots, d(k|X))$, the risk function is
\[
R(\theta_i, d) = \sum_{j=1}^k L(\theta_i, a_j) E_{\theta_i}\{d(j|X)\} = \sum_{j \ne i} E_{\theta_i}\{d(j|X)\}
= 1 - E_{\theta_i}\{d(i|X)\}.
\]
Suppose that $\lambda = (\lambda_1, \ldots, \lambda_k)$ is a prior distribution on $\Theta = \{\theta_1, \ldots, \theta_k\}$. Then the Bayes risk is
\[
R(\lambda, d) = 1 - \sum_{i=1}^k \lambda_i E_{\theta_i}\{d(i|X)\}
= \text{probability of misclassification using } d
\]
when the distribution from which $X$ is drawn is chosen according to $\lambda$.

Here is a theorem characterizing the class of Bayes rules in this setting.

Theorem 5.2 Any rule $d$ for which
\[
d(i|X) = 0 \quad \text{whenever} \quad \lambda_i p_i(X) < \max_j \lambda_j p_j(X)
\]
for $i = 1, \ldots, k$ is Bayes with respect to $\lambda$. Equivalently, since $\sum_{i=1}^k d(i|X) = 1$ for all $X$,
\[
d(i|X) = 1 \quad \text{if} \quad \lambda_i p_i(X) > \lambda_j p_j(X) \ \text{for all } j \ne i.
\]

Proof. Let $d'$ be any other rule. We want to show that
\[
R(\lambda, d') - R(\lambda, d) \ge 0
\]
where $d$ is any rule of the form given. But
\begin{align*}
R(\lambda, d') - R(\lambda, d)
&= \sum_{i=1}^k \lambda_i \int d(i|x) p_i(x)\, d\mu(x) - \sum_{j=1}^k \lambda_j \int d'(j|x) p_j(x)\, d\mu(x) \\
&= \sum_{i=1}^k \sum_{j=1}^k \int d(i|x) d'(j|x) \{\lambda_i p_i(x) - \lambda_j p_j(x)\}\, d\mu(x) \\
&\ge 0,
\end{align*}
since whenever $\lambda_i p_i(x) < \lambda_j p_j(x)$ it follows that $d(i|x) = 0$. $\Box$

Finally, here is a cautionary example.


Example 5.4 (Ritov–Wasserman). Suppose that $(X_1, R_1, Y_1), \ldots, (X_n, R_n, Y_n)$ are i.i.d. with a distribution described as follows. Let $\theta = (\theta_1, \ldots, \theta_B) \in [0,1]^B \equiv \Theta$ where $B$ is large, e.g. $10^{10}$. Let $\xi = (\xi_1, \ldots, \xi_B)$ be a vector of known numbers with $0 < \delta \le \xi_j \le 1 - \delta < 1$ for $j = 1, \ldots, B$. Furthermore, suppose that:

(i) $X_i \sim \text{Uniform}\{1, \ldots, B\}$.

(ii) $R_i \sim \text{Bernoulli}(\xi_{X_i})$.

(iii) If $R_i = 1$, $Y_i \sim \text{Bernoulli}(\theta_{X_i})$; if $R_i = 0$, $Y_i$ is missing (i.e. not observed).

Our goal is to estimate
\[
\psi = \psi(\theta) = P_\theta(Y_1 = 1) = \sum_{j=1}^B P(Y_1 = 1 | X_1 = j) P(X_1 = j) = \frac{1}{B} \sum_{j=1}^B \theta_j.
\]
Now the likelihood contribution of $(X_i, R_i, Y_i)$ is
\[
f(X_i) f(R_i | X_i) f(Y_i | X_i, R_i)
= \frac{1}{B}\, \xi_{X_i}^{R_i} (1 - \xi_{X_i})^{1 - R_i}\, \theta_{X_i}^{Y_i R_i} (1 - \theta_{X_i})^{(1 - Y_i) R_i},
\]
and hence the likelihood for $\theta$ is
\[
L_n(\theta) = \prod_{i=1}^n \frac{1}{B}\, \xi_{X_i}^{R_i} (1 - \xi_{X_i})^{1 - R_i}\, \theta_{X_i}^{Y_i R_i} (1 - \theta_{X_i})^{(1 - Y_i) R_i}
\propto \prod_{i=1}^n \theta_{X_i}^{Y_i R_i} (1 - \theta_{X_i})^{(1 - Y_i) R_i}.
\]
Thus
\[
l_n(\theta) = \sum_{i=1}^n \{Y_i R_i \log \theta_{X_i} + (1 - Y_i) R_i \log(1 - \theta_{X_i})\}
= \sum_{j=1}^B n_j \log \theta_j + \sum_{j=1}^B m_j \log(1 - \theta_j)
\]
where
\[
n_j = \#\{i : Y_i = 1, R_i = 1, X_i = j\}, \qquad m_j = \#\{i : Y_i = 0, R_i = 1, X_i = j\}.
\]
Note that $n_j = m_j = 0$ for most $j$ since $B \gg n$. Thus the MLE for most $\theta_j$ is not defined. Furthermore, for most $\theta_j$ the posterior distribution is the prior distribution (especially if the prior is a product distribution on $\Theta = [0,1]^B$). Thus both MLE and Bayes estimation fail.

Here is a purely frequentist solution: the Horvitz–Thompson estimator of $\psi$ is
\[
\hat{\psi}_n = \frac{1}{n} \sum_{i=1}^n \frac{R_i}{\xi_{X_i}}\, Y_i.
\]
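A minimal simulation sketch of the Horvitz–Thompson estimator (with a small illustrative $B$ and randomly generated $\theta$ and $\xi$, not the $B = 10^{10}$ of the example):

```python
# Simulate the model of Example 5.4 and check that the Horvitz-Thompson
# estimator psi_hat = (1/n) sum_i (R_i / xi_{X_i}) Y_i is close to psi(theta).
import random

random.seed(1)
B = 50
theta = [random.random() for _ in range(B)]             # unknown success probabilities
xi = [0.2 + 0.6 * random.random() for _ in range(B)]    # known observation probabilities
psi = sum(theta) / B                                    # target psi(theta)

def ht_estimate(n):
    s = 0.0
    for _ in range(n):
        x = random.randrange(B)                 # X_i ~ Uniform{1, ..., B}
        r = random.random() < xi[x]             # R_i ~ Bernoulli(xi_{X_i})
        if r:
            y = random.random() < theta[x]      # Y_i observed only when R_i = 1
            s += y / xi[x]                      # inverse-probability weighting
    return s / n

est = ht_estimate(200000)
assert abs(est - psi) < 0.025                   # unbiased, variance O(1/(n * delta))
```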


Note that
\begin{align*}
E(\hat{\psi}_n) = E\left\{\frac{R_1}{\xi_{X_1}} Y_1\right\}
&= E E\left\{\frac{R_1}{\xi_{X_1}} Y_1 \Big| X_1, R_1\right\}
= E\left\{\frac{R_1}{\xi_{X_1}} E\{Y_1 | X_1, R_1\}\right\} \\
&= E\left\{\frac{R_1}{\xi_{X_1}}\, \theta_{X_1}\right\}
= E E\left\{\frac{R_1}{\xi_{X_1}}\, \theta_{X_1} \Big| X_1\right\}
= E\left\{\frac{\theta_{X_1}}{\xi_{X_1}} E\{R_1 | X_1\}\right\} \\
&= E\left\{\frac{\theta_{X_1}}{\xi_{X_1}}\, \xi_{X_1}\right\}
= E\{\theta_{X_1}\}
= B^{-1} \sum_{j=1}^B \theta_j = \psi(\theta).
\end{align*}
Thus $\hat{\psi}_n$ is an unbiased estimator of $\psi(\theta)$. Moreover,
\[
Var(\hat{\psi}_n) = \frac{1}{n} \left\{ \frac{1}{B} \sum_{j=1}^B \frac{\theta_j}{\xi_j} - \psi(\theta)^2 \right\}. \tag{1}
\]

Exercise 5.1 Show that (1) holds and hence that
\[
Var(\hat{\psi}_n) \le \frac{1}{n\delta}
\]
under the assumption that $\xi_j \ge \delta > 0$ for all $1 \le j \le B$. [Hint: use the formula $Var(Y) = E\,Var(Y|X) + Var[E(Y|X)]$ twice.]


6 Finding Minimax Rules

Definition 6.1 A prior $\Lambda_0$ for which $R(\Lambda, d_\Lambda)$ is maximized is called a least favorable prior:
\[
R(\Lambda_0, d_{\Lambda_0}) = \sup_\Lambda R(\Lambda, d_\Lambda).
\]

Theorem 6.1 Suppose that $\Lambda$ is a prior distribution on $\Theta$ such that
\[
R(\Lambda, d_\Lambda) = \int_\Theta R(\theta, d_\Lambda)\, d\Lambda(\theta) = \sup_\theta R(\theta, d_\Lambda). \tag{1}
\]
Then:
(i) $d_\Lambda$ is minimax.
(ii) If $d_\Lambda$ is unique Bayes with respect to $\Lambda$, $d_\Lambda$ is unique minimax.
(iii) $\Lambda$ is least favorable.

Proof. (i) Let $d$ be another rule. Then
\begin{align*}
\sup_\theta R(\theta, d) &\ge \int_\Theta R(\theta, d)\, d\Lambda(\theta) \\
&\ge \int_\Theta R(\theta, d_\Lambda)\, d\Lambda(\theta) \quad \text{since } d_\Lambda \text{ is Bayes wrt } \Lambda \tag{a} \\
&= \sup_\theta R(\theta, d_\Lambda) \quad \text{by (1)}.
\end{align*}
Hence $d_\Lambda$ is minimax.
(ii) If $d_\Lambda$ is unique Bayes, then $>$ holds in (a), so $d_\Lambda$ is unique minimax.
(iii) Let $\Lambda^*$ be some other prior distribution. Then
\begin{align*}
r_{\Lambda^*} \equiv \int_\Theta R(\theta, d_{\Lambda^*})\, d\Lambda^*(\theta)
&\le \int_\Theta R(\theta, d_\Lambda)\, d\Lambda^*(\theta) \quad \text{since } d_{\Lambda^*} \text{ is Bayes wrt } \Lambda^* \\
&\le \sup_\theta R(\theta, d_\Lambda) = R(\Lambda, d_\Lambda) \equiv r_\Lambda \quad \text{by (1)}. \qquad \Box
\end{align*}

Corollary 1 If $d_\Lambda$ is Bayes with respect to $\Lambda$ and has constant risk, $R(\theta, d_\Lambda) = $ constant, then $d_\Lambda$ is minimax.

Proof. If $d_\Lambda$ has constant risk, then (1) holds. $\Box$

Corollary 2 Let
\[
\Theta_\Lambda \equiv \{\theta \in \Theta : R(\theta, d_\Lambda) = \sup_{\theta'} R(\theta', d_\Lambda)\}
\]
be the set of $\theta$'s where the risk of $d_\Lambda$ assumes its maximum. Then $d_\Lambda$ is minimax if $\Lambda(\Theta_\Lambda) = 1$. Equivalently, $d_\Lambda$ is minimax if there is a set $\Theta_\Lambda$ with $\Lambda(\Theta_\Lambda) = 1$ and $R(\theta, d_\Lambda) = \sup_{\theta'} R(\theta', d_\Lambda)$ for all $\theta \in \Theta_\Lambda$.


Example 6.1 Suppose $X \sim \text{Binomial}(n, \theta)$ with $\theta \sim \text{Beta}(\alpha, \beta)$. Thus $(\theta | X = x) \sim \text{Beta}(\alpha + x, \beta + n - x)$, and as we computed in Section 5,
\[
d_\Lambda(X) = \frac{\alpha + X}{\alpha + \beta + n}
= \frac{\alpha + \beta}{\alpha + \beta + n} \cdot \frac{\alpha}{\alpha + \beta} + \frac{n}{\alpha + \beta + n} \cdot \frac{X}{n}.
\]
Consequently,
\begin{align*}
R(\theta, d_\Lambda)
&= \left( \frac{n}{\alpha + \beta + n} \right)^2 \frac{\theta(1 - \theta)}{n}
+ \left( \frac{\alpha + n\theta}{\alpha + \beta + n} - \theta \right)^2 \\
&= \frac{1}{(\alpha + \beta + n)^2} \left\{ \alpha^2 + (n - 2\alpha(\alpha + \beta))\theta + ((\alpha + \beta)^2 - n)\theta^2 \right\} \\
&= \frac{\alpha^2}{(\alpha + \beta + n)^2}
\end{align*}
if $2\alpha(\alpha + \beta) = n$ and $(\alpha + \beta)^2 = n$. But solving these two equations yields $\alpha = \beta = \sqrt{n}/2$. Thus for these choices of $\alpha$ and $\beta$, the risk of the Bayes rule is constant in $\theta$, and hence
\[
d_M(X) = \frac{1}{1 + \sqrt{n}} \cdot \frac{1}{2} + \frac{\sqrt{n}}{1 + \sqrt{n}} \cdot \frac{X}{n}
\]
is Bayes with respect to $\Lambda = \text{Beta}(\sqrt{n}/2, \sqrt{n}/2)$ and has risk
\[
R(\theta, d_M) = \frac{1}{4(1 + \sqrt{n})^2} \le \frac{\theta(1 - \theta)}{n} = R(\theta, X/n)
\]
if
\[
|\theta - 1/2| \le \frac{1}{2} \frac{\sqrt{1 + 2\sqrt{n}}}{1 + \sqrt{n}} \approx \frac{1}{\sqrt{2}\, n^{1/4}}.
\]
Hence $\text{Beta}(\sqrt{n}/2, \sqrt{n}/2)$ is least favorable and $d_M$ is minimax. Note that $d_M$ is a consistent estimator of $\theta$:
\[
d_M(X) \to_p \theta
\]
as $n \to \infty$, but its bias is of the order $n^{-1/2}$ (the bias equals $(1/2 - \theta)/(1 + \sqrt{n})$). Furthermore, $d_M$ is a (locally) regular estimator of $\theta$, but the limit distribution is not centered at zero if $\theta \ne 1/2$: under $P_{\theta_n}$ with $\theta_n = \theta + t n^{-1/2}$,
\[
\sqrt{n}(d_M(X) - \theta_n) \to_d N(1/2 - \theta,\ \theta(1 - \theta)).
\]
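A numerical sketch of this example for $n = 25$: the risk of $d_M$ is constant in $\theta$, below the risk of $X/n$ near $\theta = 1/2$ and above it near the endpoints:

```python
# Example 6.1: the Beta(sqrt(n)/2, sqrt(n)/2) Bayes rule has constant risk
# 1/(4 (1 + sqrt(n))^2), computed directly from variance + bias^2.
import math

n = 25
a = b = math.sqrt(n) / 2

def risk(theta):
    d_var = (n / (a + b + n))**2 * theta * (1 - theta) / n   # variance term
    bias = (a + n * theta) / (a + b + n) - theta             # bias term
    return d_var + bias**2

const = 1 / (4 * (1 + math.sqrt(n))**2)
for theta in (0.0, 0.1, 0.5, 0.9, 1.0):
    assert abs(risk(theta) - const) < 1e-12   # risk is constant in theta

assert risk(0.5) < 0.5 * 0.5 / n        # at theta = 1/2, d_M beats X/n
assert risk(0.01) > 0.01 * 0.99 / n     # near the edge, X/n wins
```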

If a least favorable prior does not exist, then we can still consider improper priors or limits of proper priors. Let $\{\Lambda_k\}$ be a sequence of prior distributions, let $d_k$ denote the Bayes estimator corresponding to $\Lambda_k$, and set
\[
r_k \equiv \int_\Theta R(\theta, d_k)\, d\Lambda_k(\theta).
\]
Suppose that
\[
r_k \to r < \infty \quad \text{as } k \to \infty. \tag{2}
\]


Definition 6.2 The sequence of prior distributions $\{\Lambda_k\}$ with Bayes risks $\{r_k\}$ is said to be least favorable if $r_\Lambda \equiv R(\Lambda, d_\Lambda) \le r$ for any prior distribution $\Lambda$.

Theorem 6.2 Suppose that $\{\Lambda_k\}$ is a sequence of prior distributions with Bayes risks satisfying
\[
r_k \to r, \tag{3}
\]
and $d$ is an estimator for which
\[
\sup_\theta R(\theta, d) = r. \tag{4}
\]
Then:
A. $d$ is minimax, and
B. $\{\Lambda_k\}$ is least favorable.

Proof. A. Suppose $d^*$ is any other estimator. Then
\[
\sup_\theta R(\theta, d^*) \ge \int R(\theta, d^*)\, d\Lambda_k(\theta) \ge r_k
\]
for all $k \ge 1$. Hence
\[
\sup_\theta R(\theta, d^*) \ge r = \sup_\theta R(\theta, d) \quad \text{by (4)},
\]
so $d$ is minimax.
B. If $\Lambda$ is any prior distribution, then
\[
r_\Lambda = \int R(\theta, d_\Lambda)\, d\Lambda(\theta) \le \int R(\theta, d)\, d\Lambda(\theta) \le \sup_\theta R(\theta, d) = r
\]
by (4). $\Box$

Example 6.2 Let $X_1, \ldots, X_n$ be i.i.d. $N(\theta, \sigma^2)$ given $\theta = \theta$, and suppose that $\theta \sim N(\mu, \tau^2)$ and that $\sigma^2$ is known. Then it follows from Example 5.3 that
\[
d_\Lambda(X) = E\{\theta | X\} = \frac{1/\tau^2}{1/\tau^2 + n/\sigma^2}\, \mu + \frac{n/\sigma^2}{1/\tau^2 + n/\sigma^2}\, \bar{X}_n
\]
and
\begin{align*}
R(\Lambda, d_\Lambda) \equiv r_\Lambda &= E\{\theta - d_\Lambda(X)\}^2
= E E\{[\theta - d_\Lambda(X)]^2 | X\}
= E\, Var\{\theta | X\} \\
&= \frac{1}{1/\tau^2 + n/\sigma^2}
\to \frac{\sigma^2}{n} = R(\theta, \bar{X}) \quad \text{as } \tau^2 \to \infty
\end{align*}
for all $\theta \in \Theta = \mathbb{R}$. Hence by Theorem 6.2, $\bar{X}$ is a minimax estimator of $\theta$.
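A tiny numerical sketch of the limiting argument: the Bayes risk $r_\tau = 1/(1/\tau^2 + n/\sigma^2)$ increases to $\sigma^2/n$ as $\tau^2 \to \infty$ (the values of $\sigma^2$ and $n$ are illustrative):

```python
# Example 6.2: Bayes risks of the normal-prior Bayes rules increase to the
# constant risk sigma^2/n of the sample mean, which is therefore minimax.
sigma2, n = 2.0, 8

def r_tau(tau2):
    return 1.0 / (1.0 / tau2 + n / sigma2)

risks = [r_tau(t) for t in (1.0, 10.0, 100.0, 1e6)]
assert all(a < b for a, b in zip(risks, risks[1:]))   # increasing in tau^2
assert abs(risks[-1] - sigma2 / n) < 1e-4             # limit is sigma^2/n = 0.25
```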

The remainder of this section is aimed at extending minimaxity of estimators from smaller models to larger ones.


Lemma 6.1 Suppose that $X \sim P \in \mathcal{M} \equiv \{$all $P$'s on $\mathcal{X}\}$, and that $\nu : \mathcal{P} \subset \mathcal{M} \to \mathbb{R}$ is a functional (e.g. $\nu(P) = E_P(X) = \int x\, dP(x)$, $\nu(P) = Var_P(X)$, $\nu(P) = Pf$ for some fixed function $f$). Suppose that $d$ is a minimax estimator of $\nu(P)$ for $P \in \mathcal{P}_0 \subset \mathcal{P}_1$. If
\[
\sup_{P \in \mathcal{P}_0} R(P, d) = \sup_{P \in \mathcal{P}_1} R(P, d),
\]
then $d$ is also minimax for estimating $\nu(P)$, $P \in \mathcal{P}_1$.

Proof. Suppose that $d$ is not minimax for $\mathcal{P}_1$. Then there exists a rule $d^*$ with smaller maximum risk:
\[
\sup_{P \in \mathcal{P}_0} R(P, d^*) \le \sup_{P \in \mathcal{P}_1} R(P, d^*) < \sup_{P \in \mathcal{P}_1} R(P, d) = \sup_{P \in \mathcal{P}_0} R(P, d),
\]
which contradicts the hypothesis that $d$ is minimax for $\mathcal{P}_0$. $\Box$

Example 6.3 This continues the normal mean example 6.2. Suppose that $\sigma^2$ is unknown. Then, with $\Theta = \{(\theta, \sigma^2) : \theta \in \mathbb{R}, 0 < \sigma^2 < \infty\}$,
\[
\sup_{(\theta, \sigma^2) \in \Theta} R((\theta, \sigma^2), \bar{X}_n) = \sup_{(\theta, \sigma^2) \in \Theta} \frac{\sigma^2}{n} = \infty.
\]
So, to get something reasonable, we need to restrict $\sigma^2$. Let
\[
\mathcal{P}_0 = \{N(\theta, M) : \theta \in \mathbb{R}\}, \qquad
\mathcal{P}_1 = \{N(\theta, \sigma^2) : \theta \in \mathbb{R}, 0 < \sigma^2 \le M\}.
\]
Then $\bar{X}_n$ is minimax for $\mathcal{P}_0$ by our earlier calculation, $\mathcal{P}_0 \subset \mathcal{P}_1$, and
\[
\sup_{\mathcal{P}_1} R(P, \bar{X}_n) = \sup_{\mathcal{P}_0} R(P, \bar{X}_n) = \frac{M}{n}.
\]
Thus $\bar{X}_n$ is a minimax estimator of $\theta$ for $\mathcal{P}_1$.

Example 6.4 Let $X_1, \ldots, X_n$ be i.i.d. $P \in \mathcal{P}_\mu$ where
\[
\mathcal{P}_\mu = \{\text{all probability measures } P : E_P|X| < \infty\},
\]
and consider estimation of $\nu(P) = E_P(X)$ with squared-error loss for the families
\begin{align*}
\mathcal{P}_{b\sigma^2} &\equiv \{P \in \mathcal{P}_\mu : Var_P(X) \le M < \infty\}, \\
\mathcal{P}_{br} &\equiv \{P \in \mathcal{P}_\mu : P(a \le X \le b) = 1\} \quad \text{for some fixed } a, b \in \mathbb{R}.
\end{align*}
Then:
A. $\bar{X}$ is minimax for $\mathcal{P}_{b\sigma^2}$ by Example 6.3 and Lemma 6.1, since it is minimax for $\mathcal{P}_0 \equiv \{N(\theta, \sigma^2) : \theta \in \mathbb{R}, 0 < \sigma^2 \le M\}$ and
\[
\sup_{P \in \mathcal{P}_0} R(P, \bar{X}) = \sup_{P \in \mathcal{P}_{b\sigma^2}} R(P, \bar{X}).
\]
B. Without loss of generality suppose $a = 0$ and $b = 1$. Let
\[
\mathcal{P}_1 = \{P : P([0,1]) = 1\}, \qquad
\mathcal{P}_0 = \{P \in \mathcal{P}_1 : P(X = 1) = p,\ P(X = 0) = 1 - p \ \text{for some } 0 < p < 1\}.
\]


For $\mathcal{P}_0$ we know that the minimax estimator is
\[
d_M(X) = \frac{1}{1 + \sqrt{n}} \cdot \frac{1}{2} + \frac{\sqrt{n}}{1 + \sqrt{n}}\, \bar{X}.
\]
Now, with $E_P X \equiv \theta$,
\begin{align*}
R(P, d_M) &= \frac{1}{(1 + \sqrt{n})^2} \left\{ Var_P(X) + \left( \frac{1}{2} - \theta \right)^2 \right\} \\
&\le \frac{1}{(1 + \sqrt{n})^2} \{\theta - \theta^2 + (1/4) - \theta + \theta^2\}
\quad \text{since } 0 \le X \le 1 \text{ implies } E_P X^2 \le E_P X \equiv \theta \\
&= \frac{1/4}{(1 + \sqrt{n})^2}.
\end{align*}
Thus
\[
\sup_{P \in \mathcal{P}_1} R(P, d_M) = \sup_{P \in \mathcal{P}_0} R(P, d_M),
\]
and by Lemma 6.1, $d_M$ is minimax.

Remark 6.1 The restriction $\sigma^2 \le M$ in Example 6.3 is crucial; note that $\nu(P) = E_P(X)$ is discontinuous at every $P$ in the sense that we can easily have $P_m \to_d P$, but $\nu(P_m)$ fails to converge to $\nu(P)$.

Remark 6.2 If $\Theta = [-M, M] \subset \mathbb{R}$, then $\bar{X}$ is no longer minimax in the normal case; see Bickel (1981). The approximately least favorable densities for $M \to \infty$ are
\[
\pi_M(\theta) = M^{-1} \cos^2\big((\pi/2)(\theta/M)\big) 1_{[-M,M]}(\theta).
\]


7 Admissibility and Inadmissibility

Our goal in this section is to establish some basic and simple results about admissibility and inadmissibility of some classical estimators. In particular, we will show that $\bar{X}$ is admissible for the Gaussian location model for $k = 1$, but inadmissible for the Gaussian location model for $k \ge 3$. The fundamental work in this area is that of Stein (1956) and James and Stein (1961). For an interesting expository paper, see Efron and Morris (1977); for more work on admissibility issues see e.g. Eaton (1992), (1997), and Brown (1971).

Theorem 7.1 Any unique Bayes estimator is admissible.

Proof. Suppose that $d_\Lambda$ is unique Bayes with respect to $\Lambda$ and is inadmissible. Then there exists an estimator $d$ such that $R(\theta, d) \le R(\theta, d_\Lambda)$ for all $\theta$, with strict inequality for some $\theta$, and hence
\[
\int_\Theta R(\theta, d)\, d\Lambda(\theta) \le \int_\Theta R(\theta, d_\Lambda)\, d\Lambda(\theta),
\]
so $d$ is also Bayes with respect to $\Lambda$, which contradicts uniqueness of $d_\Lambda$. Thus $d_\Lambda$ is admissible. $\Box$

Lemma 7.1 If the loss function $L(\theta, a)$ is squared error (or is convex in $a$) and, with $Q$ defined by $Q(A) = \int P_\theta(X \in A)\, d\Lambda(\theta)$, a.e. $Q$ implies a.e. $\mathcal{P}$, then a Bayes rule with finite Bayes risk is unique a.e. $\mathcal{P}$. [a.e. $\mathcal{P}$ means $P(N) = 0$ for all $P \in \mathcal{P}$.]

Proof. See Lehmann and Casella, TPE, Corollary 4.1.4, page 229. $\Box$

Example 7.1 Consider the Bayes estimator of a normal mean, $d_\Lambda = p_n \mu + (1 - p_n)\bar{X}$, $p_n = (1/\tau^2)/(1/\tau^2 + n/\sigma^2)$. The Bayes risk is finite and a.e. $Q$ implies a.e. $\mathcal{P}$. Hence $d_\Lambda$ is unique Bayes and admissible.

Theorem 7.2 If $X$ is a random variable with mean $\theta$ and variance $\sigma^2$, then $aX + b$ is inadmissible as an estimator of $\theta$ for squared-error loss if

(i) $a > 1$;
(ii) $a < 0$;
(iii) $a = 1$, $b \ne 0$.

Proof. For any $a, b$ the risk of the rule $aX + b$ is
\[
R(\theta, aX + b) = a^2 \sigma^2 + \{(a - 1)\theta + b\}^2 \equiv \rho(a, b).
\]
(i) If $a > 1$,
\[
\rho(a, b) \ge a^2 \sigma^2 > \sigma^2 = \rho(1, 0),
\]
so $aX + b$ is dominated by $X$.
(ii) If $a < 0$, then $(a - 1)^2 > 1$ and
\[
\rho(a, b) \ge \{(a - 1)\theta + b\}^2 = (a - 1)^2 \left\{\theta + \frac{b}{a - 1}\right\}^2
> \left\{\theta + \frac{b}{a - 1}\right\}^2 = \rho(0, -b/(a - 1)).
\]
(iii) If $a = 1$, $b \ne 0$,
\[
\rho(1, b) = \sigma^2 + b^2 > \sigma^2 = \rho(1, 0),
\]
so $X + b$ is dominated by $X$. $\Box$
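The three dominance claims can be checked on a grid of $\theta$ values; a sketch with illustrative $(a, b)$ pairs:

```python
# Theorem 7.2: the risk rho(a, b) = a^2 sigma^2 + ((a - 1) theta + b)^2 of
# aX + b, checked against the dominating estimator in each of the three cases.
sigma2 = 1.0

def rho(a, b, theta):
    return a * a * sigma2 + ((a - 1) * theta + b)**2

thetas = [x / 10.0 for x in range(-50, 51)]

# (i) a > 1: dominated by X itself
assert all(rho(1.5, 0.3, t) > rho(1, 0, t) for t in thetas)
# (ii) a < 0: dominated by the constant estimator -b/(a - 1)
a, b = -0.5, 2.0
assert all(rho(a, b, t) > rho(0, -b / (a - 1), t) for t in thetas)
# (iii) a = 1, b != 0: dominated by X
assert all(rho(1, 0.7, t) > rho(1, 0, t) for t in thetas)
```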

Example 7.2 Thus $aX + b$ is inadmissible for $a < 0$ or $a > 1$. When $a = 0$, $d = b$ is admissible since it is the only estimator with zero risk at $\theta = b$. When $a = 1$, $b \ne 0$, $d$ is inadmissible. What is left is $X$: is $X$ admissible as the estimator of a normal mean in $\mathbb{R}$? The following theorem answers this affirmatively.

Theorem 7.3 If $X_1, \ldots, X_n$ are i.i.d. $N(\theta, \sigma^2)$, $\theta \in \Theta = \mathbb{R}$ with $\sigma^2$ known, then $\bar{X}$ is an admissible estimator of $\theta$.

Proof. (Limiting Bayes method.) Suppose that $\bar{X}$ is inadmissible (and $\sigma = 1$). Then there is an estimator $d^*$ such that $R(\theta, d^*) \le 1/n = R(\theta, \bar{X})$ for all $\theta$, with risk $< 1/n$ for some $\theta$. Now
\[
R(\theta, d) = E_\theta(\theta - d(X))^2
\]
is continuous in $\theta$ for every $d$, and hence there exist $\epsilon > 0$ and $\theta_0 < \theta_1$ so that
\[
R(\theta, d^*) < \frac{1}{n} - \epsilon \quad \text{for all } \theta_0 < \theta < \theta_1.
\]
Let
\[
r^*_\tau \equiv \int_\Theta R(\theta, d^*)\, d\Lambda(\theta)
\]
where $\Lambda = N(0, \tau^2)$, and let $d_\tau$ be the Bayes rule with respect to $\Lambda$, so that
\[
r_\tau \equiv \int_\Theta R(\theta, d_\tau)\, d\Lambda(\theta) = \frac{1}{1/\tau^2 + n} = \frac{\tau^2}{1 + n\tau^2}
\]
and $r_\tau \le r^*_\tau$. Thus
\begin{align*}
\frac{1/n - r^*_\tau}{1/n - r_\tau}
&= \frac{\frac{1}{\sqrt{2\pi}\,\tau} \int_{-\infty}^{\infty} \{1/n - R(\theta, d^*)\} \exp(-\theta^2/2\tau^2)\, d\theta}{1/n - \tau^2/(1 + n\tau^2)} \\
&\ge \frac{\frac{1}{\sqrt{2\pi}\,\tau}\, \epsilon \int_{\theta_0}^{\theta_1} \exp(-\theta^2/2\tau^2)\, d\theta}{1/(n(1 + n\tau^2))} \\
&= \frac{n(1 + n\tau^2)}{\sqrt{2\pi}\,\tau}\, \epsilon \int_{\theta_0}^{\theta_1} \exp(-\theta^2/2\tau^2)\, d\theta \\
&\to \infty \cdot \epsilon \cdot (\theta_1 - \theta_0) = \infty \quad \text{as } \tau \to \infty.
\end{align*}
Hence
\[
\frac{1}{n} - r^*_\tau > \frac{1}{n} - r_\tau
\]
for $\tau >$ some $\tau_0$; i.e. $r^*_\tau < r_\tau$ for some $\tau > \tau_0$, which contradicts the fact that $d_\tau$ is Bayes with respect to $\Lambda_\tau$ with Bayes risk $r_\tau$. Hence $\bar{X}$ is admissible. $\Box$


Proof. (Information inequality method.) The risk of any rule $d$ is
\[
R(\theta, d) = E_\theta(d - \theta)^2 = Var_\theta[d(X)] + b^2(\theta)
\ge \frac{[1 + b'(\theta)]^2}{n} + b^2(\theta)
\]
since $I(\theta) = 1$. Suppose that $d$ is any estimator satisfying
\[
R(\theta, d) \le 1/n \quad \text{for all } \theta.
\]
Then
\[
\frac{1}{n} \ge b^2(\theta) + \frac{[1 + b'(\theta)]^2}{n}. \tag{a}
\]
But (a) implies that:
(i) $|b(\theta)| \le 1/\sqrt{n}$ for all $\theta$; i.e. $b$ is bounded.
(ii) $b'(\theta) \le 0$, since $1 + 2b'(\theta) + [b'(\theta)]^2 \le 1$.
(iii) There exist $\theta_i \to \infty$ such that $b'(\theta_i) \to 0$. [If $b'(\theta) \le -\epsilon$ for all $\theta > \theta_0$, then $b(\theta)$ is not bounded.] Similarly, there exist $\theta_i \to -\infty$ such that $b'(\theta_i) \to 0$.
(iv)
\[
b^2(\theta) \le \frac{1}{n} - \frac{[1 + b'(\theta)]^2}{n} = -\frac{2b'(\theta) + [b'(\theta)]^2}{n}.
\]
Hence $b(\theta) \equiv 0$, and $R(\theta, d) \equiv 1/n$. $\Box$

Theorem 7.4 (Stein's theorem) If $k \ge 3$, then $\bar{X}$ is inadmissible.

Remark 7.1 The sample mean is admissible when $k = 2$; see Stein (1956), Ferguson (1967), page 170.

Proof. Let $g : \mathbb{R}^k \to \mathbb{R}^k$ have
\[
E\left| \frac{\partial}{\partial x_i} g_i(X) \right| < \infty,
\]
and consider estimators of the form
\[
\hat{\theta}_n = \bar{X} + n^{-1} g(\bar{X}).
\]
Now
\begin{align*}
E_\theta|\bar{X} - \theta|^2 - E_\theta|\hat{\theta}_n - \theta|^2
&= E_\theta|\bar{X} - \theta|^2 - E_\theta|\bar{X} - \theta + n^{-1} g(\bar{X})|^2 \\
&= -2n^{-1} E_\theta \langle \bar{X} - \theta, g(\bar{X}) \rangle - n^{-2} E_\theta |g(\bar{X})|^2. \tag{a}
\end{align*}
To proceed further we need an identity due to Stein.


Lemma 7.2 If $X \sim N(\theta, \sigma^2)$, and $g$ is a function with $E|g'(X)| < \infty$, then
\[
\sigma^2 E g'(X) = E(X - \theta) g(X). \tag{1}
\]
If $X \sim N_k(\theta, \sigma^2 I)$, and $g : \mathbb{R}^k \to \mathbb{R}^k$ has
\[
E\left| \frac{\partial}{\partial x_i} g_i(X) \right| < \infty,
\]
then:
\[
\sigma^2 E \frac{\partial}{\partial x_i} g_i(X) = E(X_i - \theta_i) g_i(X), \quad i = 1, \ldots, k. \tag{2}
\]

Proof. Without loss of generality, suppose that $\theta = 0$ and $\sigma^2 = 1$. Integration by parts gives
\begin{align*}
E g'(X) &= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} g'(x) \exp(-x^2/2)\, dx \\
&= -\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} g(x) \exp(-x^2/2) \{-2x/2\}\, dx \\
&= E X g(X).
\end{align*}
In applying integration by parts here, we have ignored the term $uv|_{-\infty}^{\infty} = g(x)\varphi(x)|_{-\infty}^{\infty}$. To prove that this does in fact vanish, it is easiest to apply Fubini's theorem (twice!) in a slightly devious way as follows: let $\varphi(t) = (2\pi)^{-1/2} \exp(-t^2/2)$ be the standard normal density. Since $\varphi'(t) = -t\varphi(t)$, we have both
\[
\varphi(x) = -\int_x^{\infty} \varphi'(t)\, dt = \int_x^{\infty} t \varphi(t)\, dt
\]
and
\[
\varphi(x) = \int_{-\infty}^x \varphi'(t)\, dt = -\int_{-\infty}^x t \varphi(t)\, dt.
\]
Therefore we can write
\begin{align*}
E g'(X) &= \int_{-\infty}^{\infty} g'(x) \varphi(x)\, dx \\
&= \int_0^{\infty} g'(x) \int_x^{\infty} t \varphi(t)\, dt\, dx - \int_{-\infty}^0 g'(x) \int_{-\infty}^x t \varphi(t)\, dt\, dx \\
&= \int_0^{\infty} t \varphi(t) \left\{ \int_0^t g'(x)\, dx \right\} dt - \int_{-\infty}^0 t \varphi(t) \left\{ \int_t^0 g'(x)\, dx \right\} dt \\
&= \left( \int_0^{\infty} + \int_{-\infty}^0 \right) \{t \varphi(t)\} [g(t) - g(0)]\, dt \\
&= \int_{-\infty}^{\infty} t g(t) \varphi(t)\, dt = E X g(X).
\end{align*}
Here the third equality is justified by the hypothesis $E|g'(X)| < \infty$ and Fubini's theorem.

To prove (2), write $X^{(i)} \equiv (X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_k)$. Then
\begin{align*}
\sigma^2 E \frac{\partial}{\partial x_i} g_i(X)
&= \sigma^2 E E\left\{ \frac{\partial}{\partial x_i} g_i(X) \Big| X^{(i)} \right\} \\
&= E E\{(X_i - \theta_i) g_i(X) | X^{(i)}\} \quad \text{by (1)} \\
&= E(X_i - \theta_i) g_i(X). \qquad \Box
\end{align*}
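Stein's identity (1) is easy to probe by Monte Carlo; a sketch with the illustrative choice $g(x) = \sin x$ (so $g'(x) = \cos x$), which satisfies the integrability hypothesis:

```python
# Monte Carlo check of Stein's identity E g'(X) = E (X - theta) g(X)
# for X ~ N(theta, 1), using g(x) = sin(x), g'(x) = cos(x).
import math
import random

random.seed(2)
theta = 1.5
n = 200000
lhs = rhs = 0.0
for _ in range(n):
    x = random.gauss(theta, 1.0)
    lhs += math.cos(x)                 # g'(x)
    rhs += (x - theta) * math.sin(x)   # (x - theta) g(x)
lhs /= n
rhs /= n
assert abs(lhs - rhs) < 0.02           # both estimate the same quantity
```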

Proof. Now we return to the proof of the theorem. Using (2) in (a) yields
\[ -2n^{-2} E_\theta \sum_{i=1}^k \frac{\partial g_i}{\partial x_i}(X) - n^{-2} E_\theta |g(X)|^2 . \tag{a} \]
Now let $\psi : \mathbb{R}^k \to \mathbb{R}$ be twice differentiable, and set
\[ g(x) = \nabla\{\log \psi(x)\} = \frac{1}{\psi(x)} \Big( \frac{\partial \psi(x)}{\partial x_1}, \ldots, \frac{\partial \psi(x)}{\partial x_k} \Big) . \]
Thus
\[ \frac{\partial g_i}{\partial x_i}(x) = \frac{1}{\psi(x)} \frac{\partial^2}{\partial x_i^2}\psi(x) - \frac{1}{\psi^2(x)} \Big( \frac{\partial \psi(x)}{\partial x_i} \Big)^2 = \frac{1}{\psi(x)} \frac{\partial^2}{\partial x_i^2}\psi(x) - g_i(x)^2 \]
and
\[ \sum_{i=1}^k \frac{\partial g_i}{\partial x_i}(x) = \frac{1}{\psi(x)} \Delta\psi(x) - |g(x)|^2 . \]
Hence the right side of (a) is
\[ = n^{-2} E_\theta|g(X)|^2 - 2n^{-2} E_\theta \Big[ \frac{1}{\psi(X)} \Delta\psi(X) \Big] > 0 \tag{b} \]

if $\psi(x) \ge 0$ and $\Delta\psi \le 0$ with $g \ne 0$ (i.e. $\psi$ is superharmonic). Here is one example of such a function:

A. Suppose that $\psi(x) = |x|^{-(k-2)} = \{x_1^2 + \cdots + x_k^2\}^{-(k-2)/2}$. Then
\[ g(x) = \nabla \log \psi(x) = -\frac{k-2}{|x|^2}\, x \]
and $\Delta\psi(x) = 0$, so $\psi$ is harmonic. Thus
\[ \hat\theta_n = \Big( 1 - \frac{k-2}{n|X|^2} \Big) X \]
and
\begin{align*}
E_\theta|X - \theta|^2 - E_\theta|\hat\theta_n - \theta|^2 &= n^{-2} E_\theta|g(X)|^2 = \Big(\frac{k-2}{n}\Big)^2 E_\theta|X|^{-2} \\
&= \Big(\frac{k-2}{\sqrt n}\Big)^2 E_\theta|\sqrt n(X - \theta) + \sqrt n\,\theta|^{-2} \\
&= \Big(\frac{k-2}{\sqrt n}\Big)^2 E_0|X + \sqrt n\,\theta|^{-2} \\
&= \begin{cases} \big(\frac{k-2}{n}\big)^2 E_0|n^{-1/2}X + \theta|^{-2} = O(n^{-2}), & \theta \ne 0, \\ \frac{(k-2)^2}{n} \cdot \frac{1}{k-2} = \frac{k-2}{n}, & \theta = 0, \end{cases}
\end{align*}
since $|X|^2 \sim \chi^2_k$ when $\theta = 0$, with $E(1/\chi^2_k) = 1/(k-2)$. Hence
\[ \frac{E_0|\hat\theta_n - \theta|^2}{E_0|X - \theta|^2} = \frac{2}{k} < 1 . \]

For general $\theta$,
\begin{align*}
R(\theta, \hat\theta_n) &= \frac{k}{n} - \frac{(k-2)^2}{n} E_0\Big[ \frac{1}{|X + \sqrt n\,\theta|^2} \Big] \\
&= \frac{k}{n}\Big( 1 - \frac{k-2}{k}\, E_0\Big[ \frac{k-2}{\chi^2_k(\delta)} \Big] \Big)
\end{align*}
since $|X + \sqrt n\,\theta|^2 \sim \chi^2_k(\delta)$ with $\delta = n|\theta|^2/2$. Thus
\[ \frac{R(\theta, \hat\theta_n)}{R(\theta, X)} = 1 - \frac{k-2}{k}\, E_0\Big[ \frac{k-2}{\chi^2_k(\delta)} \Big] \begin{cases} = 2/k & \text{when } \theta = 0, \\ \to 1 & \text{as } n \to \infty \text{ for fixed } \theta \ne 0, \\ \to 1 & \text{as } |\theta| \to \infty \text{ for fixed } n. \end{cases} \qquad\Box \]
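The exact computation above gives risk ratio $2/k$ at $\theta = 0$. A Monte Carlo sketch (not part of the notes; the seed, the choice $k = 10$, and the single-observation setup are arbitrary) illustrates this numerically:

```python
import random

# At theta = 0 with a single X ~ N_k(0, I), the James-Stein estimator is
# theta_hat = (1 - (k-2)/|X|^2) X, and E|theta_hat|^2 / E|X|^2 = 2/k.

rng = random.Random(0)
k, reps = 10, 40_000
risk_js = risk_x = 0.0
for _ in range(reps):
    x = [rng.gauss(0.0, 1.0) for _ in range(k)]
    norm2 = sum(xi * xi for xi in x)
    shrink = 1.0 - (k - 2) / norm2
    risk_js += shrink ** 2 * norm2   # |theta_hat - 0|^2 for this draw
    risk_x += norm2                  # |X - 0|^2 for this draw
print(risk_js / risk_x)  # close to 2/k = 0.2
```

With $40{,}000$ replications the simulated ratio matches $2/k$ to within Monte Carlo error.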

Remark 7.2 Note that the James-Stein estimator
\[ \hat\theta_n = \Big( 1 - \frac{k-2}{n|X|^2} \Big) X \]
derived above is not regular at $\theta = 0$: if $\theta_n = t n^{-1/2}$, then $\sqrt n(X - \theta_n) \stackrel{d}{=} Z \sim N_k(0, \sigma^2 I)$ under $P_{\theta_n}$, so that
\begin{align*}
\sqrt n(\hat\theta_n - \theta_n) &= \sqrt n(X - \theta_n) - \frac{k-2}{|\sqrt n(X - \theta_n) + t|^2} \big( \sqrt n(X - \theta_n) + t \big) \\
&\stackrel{d}{=} Z - \frac{k-2}{|Z + t|^2}(Z + t),
\end{align*}
which has a distribution depending on $t$.

Remark 7.3 It turns out that the James-Stein estimator $\hat\theta_n$ is itself inadmissible; see Lehmann and Casella, TPE, pages 356-357, and Section 5.7, pages 376-389.

Remark 7.4 Another interesting function $\psi$ is
\[ \psi(x) = \begin{cases} |x|^{-(k-2)}, & |x| \ge \sqrt{k-2}, \\ (k-2)^{-(k-2)/2} \exp\big( [(k-2) - |x|^2]/2 \big), & |x| < \sqrt{k-2}. \end{cases} \]
For this $\psi$ we have
\[ g(x) = \nabla \log \psi(x) = \begin{cases} -\frac{k-2}{|x|^2}\, x, & |x| \ge \sqrt{k-2}, \\ -x, & |x| \le \sqrt{k-2}. \end{cases} \]


Another approach to deriving the James-Stein estimator is via an empirical Bayes approach. We know that the Bayes estimator for squared error loss when $X \sim N_k(\theta, \sigma^2 I)$ and $\theta \sim N_k(0, \tau^2 I) \equiv \Lambda_\tau$ is
\[ d_\Lambda(X) = \frac{\tau^2}{\sigma^2 + \tau^2}\, X , \]
with Bayes risk
\[ R(\Lambda_\tau, d_\Lambda) = \frac{k}{1/\sigma^2 + 1/\tau^2} = \frac{k\sigma^2\tau^2}{\sigma^2 + \tau^2} . \]
Regarding $\tau$ as unknown and estimating it via the marginal distribution $Q$ of $X$ is a (parametric) empirical Bayes approach. Since the marginal distribution $Q$ of $X$ is $N_k(0, (\sigma^2 + \tau^2)I)$, with density
\[ q(x; \tau^2) = \frac{1}{\big( \sqrt{2\pi(\sigma^2 + \tau^2)} \big)^k} \exp\Big( -\frac{\|x\|^2}{2(\sigma^2 + \tau^2)} \Big) , \]
the MLE of $\sigma^2 + \tau^2$ is $\|X\|^2/k$, and the resulting MLE $\hat\tau^2$ of $\tau^2$ is given by
\[ \hat\tau^2 = \Big( \frac{\|X\|^2}{k} - \sigma^2 \Big)^+ . \]
Thus
\[ \frac{\hat\tau^2}{\sigma^2 + \hat\tau^2} = \Big( 1 - \frac{k\sigma^2}{\|X\|^2} \Big)^+ , \]
and this leads to the following version of the (positive part) James-Stein estimator:
\[ d_{EB,MLE}(X) = \Big( 1 - \frac{k\sigma^2}{\|X\|^2} \Big)^+ X . \]
If, instead of the MLE, we use an unbiased estimator of $\tau^2/(\sigma^2 + \tau^2)$, then we get the positive part James-Stein estimator
\[ d_{JS}(X) = \Big( 1 - \frac{(k-2)\sigma^2}{\|X\|^2} \Big)^+ X . \]
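The two shrinkage rules above can be written as a tiny sketch (the helper names are hypothetical, and $\sigma^2$ is assumed known, as in the derivation):

```python
import random

# The two empirical Bayes shrinkage rules derived above, applied to a single
# observed vector x, one draw of X ~ N_k(theta, sigma^2 I).

def d_eb_mle(x, sigma2):
    """(1 - k sigma^2/||x||^2)^+ x : plug-in MLE of tau^2/(sigma^2 + tau^2)."""
    k, norm2 = len(x), sum(xi * xi for xi in x)
    w = max(0.0, 1.0 - k * sigma2 / norm2)
    return [w * xi for xi in x]

def d_js(x, sigma2):
    """(1 - (k-2) sigma^2/||x||^2)^+ x : positive-part James-Stein rule."""
    k, norm2 = len(x), sum(xi * xi for xi in x)
    w = max(0.0, 1.0 - (k - 2) * sigma2 / norm2)
    return [w * xi for xi in x]

rng = random.Random(1)
theta = [2.0] * 5 + [0.0] * 5                    # true mean vector, k = 10
x = [t + rng.gauss(0.0, 1.0) for t in theta]     # one draw with sigma^2 = 1
for name, est in [("EB-MLE", d_eb_mle(x, 1.0)), ("JS", d_js(x, 1.0))]:
    print(name, sum((e - t) ** 2 for e, t in zip(est, theta)))
```

Since $k > k - 2$, the EB-MLE rule always shrinks at least as hard toward $0$ as the positive-part James-Stein rule.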


8 Asymptotic theory of Bayes estimators

We begin with a basic result for Bayes estimation of a Bernoulli probability $\theta$. Suppose that $(Y_1, Y_2, \ldots) \mid \Theta = \theta$ are i.i.d. Bernoulli$(\theta)$, and suppose that the prior distribution $\Lambda$ is an arbitrary distribution on $(0,1)$. The Bayes estimator of $\theta$ with respect to squared error loss is the posterior mean $d(Y_1, \ldots, Y_n) = E(\Theta \mid Y_1, \ldots, Y_n)$. Now this sequence is a (uniformly integrable) martingale; this type of martingale is sometimes known as a Doob martingale. Hence by the martingale convergence theorem,
\[ d(Y_1, \ldots, Y_n) = E(\Theta \mid Y_1, \ldots, Y_n) \to_{a.s.} E(\Theta \mid Y_1, Y_2, \ldots) . \]
To prove consistency, it remains only to show that $E(\Theta \mid Y_1, Y_2, \ldots) = \Theta$. To see this, we compute conditionally on $\Theta$:
\[ E\Big( \sum_{i=1}^n Y_i \,\Big|\, \Theta \Big) = n\Theta, \qquad \mathrm{Var}\Big( \sum_{i=1}^n Y_i \,\Big|\, \Theta \Big) = n\Theta(1 - \Theta) . \]
Hence, with $\hat\theta_n \equiv \overline Y_n$,
\[ E(\hat\theta_n - \Theta)^2 \le \frac{1}{4n} , \]
and by Chebychev's inequality $\hat\theta_n$ converges in probability to $\Theta$. Therefore $\hat\theta_{n_k} \to \Theta$ almost surely for some subsequence $n_k$. Hence $\Theta = \tilde\Theta$ a.s., where $\tilde\Theta \equiv \lim_k \hat\theta_{n_k}$ is measurable with respect to $Y_1, Y_2, \ldots$. Thus we have
\[ E(\Theta \mid Y_1, Y_2, \ldots) = E(\tilde\Theta \mid Y_1, Y_2, \ldots) = \tilde\Theta = \Theta \quad \text{a.s. } P_\Lambda , \]
where
\[ P_\Lambda(Y \in A, \Theta \in B) = \int_B P_\theta(Y \in A)\, d\Lambda(\theta) . \]
Therefore
\[ P_\Lambda\big( d(Y_1, \ldots, Y_n) \to \Theta \big) = \int_{[0,1]} P_\theta\big( d(Y_1, \ldots, Y_n) \to \theta \big)\, d\Lambda(\theta) = 1 , \]
and this implies that
\[ P_\theta\big( d(Y_1, \ldots, Y_n) \to \theta \big) = 1 \quad \text{a.e. } \Lambda . \]
Thus the Bayes estimator $d(Y_1, \ldots, Y_n) \equiv E(\Theta \mid Y_1, \ldots, Y_n)$ is consistent for $\Lambda$-a.e. $\theta$.

A general result of this type for smooth, finite-dimensional families was established by Doob (1948), and has been generalized by Breiman, Le Cam, and Schwartz (1964), Freedman (1963), and Schwartz (1965). See van der Vaart (1998), pages 149-151, for the general result. The upshot is that for smooth, finite-dimensional families the Bayes estimator is consistent at $(\theta, \Lambda)$ if and only if $\theta$ is in the support of $\Lambda$. However the assumption of finite-dimensionality is important: Freedman (1963) gives an example of an inconsistent Bayes rule when the sample space is $\mathcal X = \mathbb Z^+$ and $\Theta$ is the collection of all probabilities on $\mathcal X$. The paper by Diaconis and Freedman (1986) on the consistency of Bayes estimates is an outgrowth of Freedman (1963).
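The Bernoulli result above is easy to visualize in the conjugate special case of a Beta$(a, b)$ prior, for which the posterior mean has the closed form $(a + S_n)/(a + b + n)$ with $S_n = \sum_{i \le n} Y_i$. A small sketch (the particular $\theta$, $a$, $b$, and seed are arbitrary choices):

```python
import random

# Consistency of the Bayes estimator for a Bernoulli(theta) probability:
# with a Beta(a, b) prior the posterior mean (a + S_n)/(a + b + n) -> theta.

rng = random.Random(2)
theta, a, b = 0.3, 2.0, 5.0
s = n = 0
for n_target in (10, 1_000, 100_000):
    while n < n_target:
        s += 1 if rng.random() < theta else 0
        n += 1
    post_mean = (a + s) / (a + b + n)
    print(n, post_mean)  # posterior mean approaches theta = 0.3
```

The prior's influence, of order $(a+b)/n$, washes out exactly as the martingale argument predicts.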

Asymptotic Efficiency of Bayes Estimators

Now we consider a more general situation, but still with a real-valued parameter $\theta$. Let $X_1, \ldots, X_n$ be i.i.d. $p_\theta(x)$ with respect to $\mu$, where $\theta \in \Theta \subset \mathbb{R}$ and $\theta_0 \in \Theta$ (an open interval) is true; write $P$ for $P_{\theta_0}$.

Assumptions:

B1: (a)-(f) of Theorem 2.6 (TPE, pages 440-441) hold. Thus it follows that
\[ l_n(\theta) - l_n(\theta_0) = (\theta - \theta_0)\, \dot l_\theta(X; \theta_0) - \frac12 (\theta - \theta_0)^2 \{ n I(\theta_0) + R_n(\theta) \} \tag{1} \]
where $n^{-1} R_n(\theta) \to_p 0$; see problems 6.3.22 and 6.8.3, TPE, pages 503 and 514. We strengthen this to:

B2: for any $\epsilon > 0$ there exists a $\delta > 0$ such that
\[ P\Big( \sup_{|\theta - \theta_0| \le \delta} |n^{-1} R_n(\theta)| \ge \epsilon \Big) \to 0 \quad \text{as } n \to \infty . \]

B3: For any $\delta > 0$ there exists an $\epsilon > 0$ such that
\[ P\Big( \sup_{|\theta - \theta_0| \ge \delta} \big( l_n(\theta) - l_n(\theta_0) \big) \le -\epsilon \Big) \to 1 . \]

B4: The prior density $\pi$ of $\theta$ is continuous and $> 0$ for all $\theta \in \Theta$.

B5: $\int_\Theta |\theta|\, \pi(\theta)\, d\theta < \infty$.

Theorem 8.1 Suppose that $\pi^*(t \mid X)$ is the posterior density of $\sqrt n(\theta - T_n)$, where
\[ T_n \equiv \theta_0 + \frac{1}{I(\theta_0)} \{ n^{-1} \dot l_\theta(X; \theta_0) \} . \]
(i) If B1-B4 hold, then
\[ d_{TV}\big( \Pi^*(\cdot \mid X), N(0, 1/I(\theta_0)) \big) = \int \Big| \pi^*(t \mid X) - \sqrt{I(\theta_0)}\, \phi\big( t\sqrt{I(\theta_0)} \big) \Big|\, dt \to_p 0 . \tag{2} \]
(ii) If B1-B5 hold, then
\[ \int (1 + |t|) \Big| \pi^*(t \mid X) - \sqrt{I(\theta_0)}\, \phi\big( t\sqrt{I(\theta_0)} \big) \Big|\, dt \to_p 0 . \]

Theorem 8.2 If B1-B5 hold and $\tilde\theta_n$ is the Bayes estimator with respect to $\pi$ for squared error loss, then
\[ \sqrt n(\tilde\theta_n - \theta_0) \to_d N(0, 1/I(\theta_0)) , \]
so that $\tilde\theta_n$ is $\sqrt n$-consistent and asymptotically efficient.

Example 8.1 Suppose that $X_1, \ldots, X_n$ are i.i.d. $N(\theta, \sigma^2)$ and that the prior on $\theta$ is $N(\mu, \tau^2)$. Then the posterior distribution is given by $(\theta \mid X) \sim N(p_n \overline X + (1 - p_n)\mu,\ \sigma_n^2/n)$, where
\[ p_n \equiv \frac{n/\sigma^2}{n/\sigma^2 + 1/\tau^2}, \qquad \sigma_n^2 = \frac{n}{n/\sigma^2 + 1/\tau^2} , \]
so that $p_n \to 1$, $\sqrt n(1 - p_n) \to 0$, and $\sigma_n^2 \to \sigma^2$. Note that
\[ T_n = \theta_0 + \frac{1}{1/\sigma^2} \cdot \frac{\overline X - \theta_0}{\sigma^2} = \overline X , \]
and the Bayes estimator with squared error loss is
\[ E(\theta \mid X) = p_n \overline X + (1 - p_n)\mu . \]
Thus
\[ \big( \sqrt n(\theta - \overline X) \mid X \big) \sim N\big( \sqrt n(1 - p_n)(\mu - \overline X),\ \sigma_n^2 \big) \to_d N(0, \sigma^2) = N(0, 1/I(\theta_0)) \quad \text{a.s.} \]
in agreement with theorem 8.1, while
\begin{align*}
\sqrt n\big( E(\theta \mid X) - \theta_0 \big) &= \sqrt n\big( p_n \overline X + (1 - p_n)\mu - \theta_0 \big) \\
&= p_n \sqrt n(\overline X - \theta_0) + \sqrt n(1 - p_n)(\mu - \theta_0) \\
&\stackrel{d}{=} p_n N(0, \sigma^2) + \sqrt n(1 - p_n)(\mu - \theta_0) \to_d N(0, \sigma^2)
\end{align*}
in agreement with theorem 8.2.
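The two convergences claimed in Example 8.1 can be checked by direct computation; the following sketch uses only the formulas of the example (the particular values of $\sigma^2$, $\tau^2$, $\mu$, and $\overline X$ are arbitrary choices):

```python
import math

# Example 8.1: the posterior of sqrt(n)(theta - Xbar) is
# N( sqrt(n)(1 - p_n)(mu - Xbar), sigma_n^2 ); the mean shift tends to 0
# and sigma_n^2 tends to sigma^2 as n grows.

sigma2, tau2, mu, xbar = 2.0, 1.0, 5.0, 0.0
for n in (10, 1_000, 100_000):
    p_n = (n / sigma2) / (n / sigma2 + 1 / tau2)
    sigma_n2 = n / (n / sigma2 + 1 / tau2)
    shift = math.sqrt(n) * (1 - p_n) * (mu - xbar)
    print(n, shift, sigma_n2)
```

The mean shift vanishes at rate $n^{-1/2}$ while the variance stabilizes at $\sigma^2$, which is the Bernstein-von Mises phenomenon of theorem 8.1 in closed form.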

Proof. We first suppose that theorem 8.1 is proved, and show that it implies theorem 8.2. Note that
\[ \sqrt n(\tilde\theta_n - \theta_0) = \sqrt n(\tilde\theta_n - T_n) + \sqrt n(T_n - \theta_0) . \tag{a} \]
Since
\[ \sqrt n(T_n - \theta_0) \to_d N(0, 1/I(\theta_0)) , \tag{b} \]
it suffices to show that
\[ \sqrt n(\tilde\theta_n - T_n) \to_p 0 . \tag{c} \]
To prove (c), note that by definition of $\tilde\theta_n$,
\[ \tilde\theta_n = \int \theta\, \pi(\theta \mid X)\, d\theta = \int \Big( \frac{t}{\sqrt n} + T_n \Big)\, \pi^*(t \mid X)\, dt , \]
and hence, since $\int t \sqrt{I(\theta_0)}\, \phi(t\sqrt{I(\theta_0)})\, dt = 0$,
\[ \sqrt n(\tilde\theta_n - T_n) = \int t\, \pi^*(t \mid X)\, dt - \int t \sqrt{I(\theta_0)}\, \phi\big( t\sqrt{I(\theta_0)} \big)\, dt . \]
It follows that
\[ \sqrt n\, |\tilde\theta_n - T_n| \le \int |t| \Big| \pi^*(t \mid X) - \sqrt{I(\theta_0)}\, \phi\big( t\sqrt{I(\theta_0)} \big) \Big|\, dt \to_p 0 \]
as $n \to \infty$ by theorem 8.1(ii). Thus (c) holds, and the proof is complete; it remains only to prove theorem 8.1.


Proof of theorem 8.1: (i) By definition of $T_n$,
\[ \pi^*(t \mid X) = \frac{ \pi(T_n + n^{-1/2}t) \exp\big( l(T_n + n^{-1/2}t) \big) }{ \int \pi(T_n + n^{-1/2}u) \exp\big( l(T_n + n^{-1/2}u) \big)\, du } \equiv e^{\omega(t)}\, \pi(T_n + n^{-1/2}t) / C_n , \]
where
\[ \omega(t) \equiv l(T_n + n^{-1/2}t) - l(\theta_0) - \frac{1}{2 n I(\theta_0)} \{ \dot l_\theta(X; \theta_0) \}^2 \tag{d} \]
and
\[ C_n \equiv \int e^{\omega(u)}\, \pi(T_n + n^{-1/2}u)\, du . \tag{e} \]
Now if we can show that
\[ J_{1n} \equiv \int \Big| e^{\omega(t)}\, \pi(T_n + n^{-1/2}t) - \exp\big( -t^2 I(\theta_0)/2 \big)\, \pi(\theta_0) \Big|\, dt \to_p 0 , \tag{f} \]
then
\[ C_n \to_p \int \exp\big( -t^2 I(\theta_0)/2 \big)\, \pi(\theta_0)\, dt = \pi(\theta_0) \sqrt{2\pi / I(\theta_0)} , \tag{g} \]
and the left side of (2) is $J_n/C_n$, where
\[ J_n \equiv \int \Big| e^{\omega(t)}\, \pi(T_n + n^{-1/2}t) - C_n \sqrt{I(\theta_0)}\, \phi\big( t\sqrt{I(\theta_0)} \big) \Big|\, dt . \tag{h} \]
Furthermore $J_n \le J_{1n} + J_{2n}$, where $J_{1n}$ is defined in (f) and
\begin{align*}
J_{2n} &\equiv \int \Big| C_n \sqrt{I(\theta_0)}\, \phi\big( t\sqrt{I(\theta_0)} \big) - \exp\big( -t^2 I(\theta_0)/2 \big)\, \pi(\theta_0) \Big|\, dt \tag{i} \\
&= \Big| C_n \sqrt{ I(\theta_0)/2\pi } - \pi(\theta_0) \Big| \int \exp\big( -t^2 I(\theta_0)/2 \big)\, dt \to_p 0
\end{align*}
by (g). Hence it remains only to prove that (f) holds.
(ii) The same proof works with $J'_{1n}$ replacing $J_{1n}$, where $J'_{1n}$ has an additional factor of $(1 + |t|)$. $\Box$

Lemma 8.1 The following identity holds:
\begin{align*}
\omega(t) &\equiv l(T_n + n^{-1/2}t) - l(\theta_0) - \frac{1}{2 n I(\theta_0)} \{ \dot l_\theta(X; \theta_0) \}^2 \tag{3} \\
&= -\frac{I(\theta_0)\, t^2}{2} - \frac{1}{2n} R_n(T_n + n^{-1/2}t) \Big[ t + \frac{1}{I(\theta_0)}\, \frac{ \dot l_\theta(X; \theta_0) }{ \sqrt n } \Big]^2 .
\end{align*}

Proof. This follows immediately from the definition of $T_n$, (1), and algebra. $\Box$

Proof that $J_{1n} \to_p 0$: Recall the definition of $J_{1n}$ given in (f) of the proof of theorem 8.1. Divide the region of integration into:
(i) $|t| \le M$;
(ii) $M \le |t| \le \delta \sqrt n$; and
(iii) $\delta \sqrt n < |t| < \infty$.
For the region (i), the integral is bounded by $2M$ times the supremum of the integrand over the set $|t| \le M$, and by application of lemma 8.1 it is seen that this supremum converges to zero in probability if
\[ \sup_{|t| \le M} \Big| \frac1n R_n(T_n + n^{-1/2}t) \Big[ t + \frac{1}{I(\theta_0)}\, \frac{ \dot l_\theta(X; \theta_0) }{ \sqrt n } \Big]^2 \Big| \to_p 0 \tag{a} \]
and
\[ \sup_{|t| \le M} \big| \pi(T_n + n^{-1/2}t) - \pi(\theta_0) \big| \to_p 0 . \tag{b} \]
Now $\pi$ is continuous by B4 and $T_n \to_p \theta_0$ by B1, so $\pi$ is uniformly continuous in a neighborhood of $\theta_0$ and, with probability tending to one,
\[ \theta_0 - \delta \le T_n - n^{-1/2}M \le T_n + n^{-1/2}t \le T_n + n^{-1/2}M \le \theta_0 + \delta \tag{c} \]
for $|t| \le M$, so (b) holds. Now $n^{-1/2} \dot l_\theta(X; \theta_0)$ is bounded in probability (by B1 and the central limit theorem), so it follows from (c) and B2 that (a) holds.
(ii) For the region $M \le |t| \le \delta\sqrt n$ it suffices to show that the integrand is bounded by an integrable function with probability close to one, since then the integral can be made small by choosing $M$ sufficiently large. Since the second term of the integrand in $J_{1n}$ is integrable, it suffices to find such an integrable bound for the first term. In particular, it can be shown that for every $\epsilon > 0$ there exist a $C = C(\epsilon, \delta)$ and an integer $N = N(\epsilon, \delta)$ so that, for $n \ge N$,
\[ P\Big( \exp(\omega(t))\, \pi(T_n + n^{-1/2}t) \le C \exp\big( -t^2 I(\theta_0)/4 \big) \text{ for all } |t| \le \delta\sqrt n \Big) \ge 1 - \epsilon ; \]
this follows from the assumption B2; see TPE, pages 494-496.
(iii) The integral over the region $|t| \ge \delta\sqrt n$ converges to zero by an argument similar to that for the region (ii), but now the assumption B3 comes into play; see TPE, pages 495-496.
The proof for $J'_{1n}$ requires only trivial changes using assumption B5. $\Box$

Remark 8.1 This proof is from Lehmann and Casella, TPE, pages 489-496. This type of theorem apparently dates back to Laplace (1820), and was later rederived by Bernstein (1917) and von Mises (1931). More general versions have been established by Le Cam (1958), Bickel and Yahav (1969), and Ibragimov and Has'minskii (1972). See the discussion on pages 488 and 493 of TPE. For a recent treatment along the lines of Le Cam (1958) that covers the case of $\Theta \subset \mathbb{R}^d$, see chapter 10 of van der Vaart (1998). For a version with the true measure $Q$ generating the data not in $\mathcal P$, see Hartigan (1983).

Remark 8.2 See Ibragimov and Has'minskii (1981) and Strasser (1981) for more on the consistency and efficiency of Bayes estimators.

Remark 8.3 Analogues of these theorems for nonparametric and semiparametric problems are currently an active research area. See Ghosal, Ghosh, and van der Vaart (2000), Ghosal and van der Vaart (2001), and Shen (2002).

Now we will continue the development for convex likelihoods started in section 4.7. The notation will be as in section 4.7, and we will make use of lemmas 4.7.1 and 4.7.2. Here, as in section 4.7, we will not assume that the distribution $P$ governing the data is a member of the parametric model $\mathcal P = \{ P_\theta : \theta \in \Theta \subset \mathbb{R}^d \}$.

Here are the hypotheses we will impose.

C1. $\pi(\theta) \le C_1 \exp( C_2 |\theta| )$ for all $\theta$, for some positive constants $C_1, C_2$.

C2. $\pi$ is continuous at $\theta_0$.

C3. $\log p(x, \theta_0 + t) - \log p(x, \theta_0) = \psi(x)^T t + R(x; t)$ is concave in $t$.

C4. $E_P \psi(X_1)\psi(X_1)^T \equiv K$, $E_P \psi(X_1) = 0$.

C5. $E_P R(X_1; t) = -t^T J t/2 + o(|t|^2)$ and $\mathrm{Var}_P\big( R(X_1; t) \big) = o(|t|^2)$, where $J$ is symmetric and positive definite.

C6. $X_1, \ldots, X_n$ are i.i.d. $P$.

Theorem 8.3 Suppose that C1-C6 hold. Then the maximum likelihood estimator $\hat\theta_n$ is $\sqrt n$-consistent for $\theta_0 = \theta_0(P)$ and satisfies
\[ \sqrt n(\hat\theta_n - \theta_0) \to_d N_d\big( 0,\ J^{-1} K (J^{-1})^T \big) . \]
Moreover the Bayes estimator $\tilde\theta_n = E\{ \theta \mid X_1, \ldots, X_n \}$ satisfies
\[ \sqrt n(\tilde\theta_n - \hat\theta_n) \to_p 0 , \]
and hence also
\[ \sqrt n(\tilde\theta_n - \theta_0) \to_d N_d\big( 0,\ J^{-1} K (J^{-1})^T \big) . \]
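The conclusion $\sqrt n(\tilde\theta_n - \hat\theta_n) \to_p 0$ can be illustrated in a simple conjugate special case (a hypothetical example, not from the notes): for Bernoulli data with a Uniform$(0,1)$ prior, the posterior mean is $(S_n + 1)/(n + 2)$ while the MLE is $S_n/n$, so the difference is computable in closed form:

```python
import random

# Bayes posterior mean vs. MLE for Bernoulli(theta) data with a Uniform(0,1)
# prior: sqrt(n) * ( (S_n + 1)/(n + 2) - S_n/n ) -> 0 as n grows.

rng = random.Random(3)
theta = 0.4
s = n = 0
for n_target in (100, 10_000, 100_000):
    while n < n_target:
        s += 1 if rng.random() < theta else 0
        n += 1
    mle = s / n
    post_mean = (s + 1) / (n + 2)
    print(n, n ** 0.5 * (post_mean - mle))  # tends to 0
```

Algebra gives $\sqrt n(\tilde\theta_n - \hat\theta_n) = \sqrt n (n - 2 S_n)/(n(n+2)) \approx (1 - 2\hat\theta_n)/\sqrt n$, consistent with the $o_p(n^{-1/2})$ difference asserted by the theorem.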

Proof. Let $L_n(\theta) \equiv \prod_{i=1}^n p(X_i, \theta)$ denote the likelihood function. Define random convex functions $A_n$ by
\[ \exp( -A_n(t) ) = L_n( \hat\theta_n + t/\sqrt n ) / L_n( \hat\theta_n ) . \]
By definition of the MLE, $A_n$ achieves its minimum value of zero at $t = 0$. Then
\[ \tilde\theta_n = \frac{ \int \theta\, L_n(\theta) \pi(\theta)\, d\theta }{ \int L_n(\theta) \pi(\theta)\, d\theta } = \hat\theta_n + \frac{1}{\sqrt n} \cdot \frac{ \int t\, \exp( -A_n(t) )\, \pi( \hat\theta_n + t/\sqrt n )\, \exp( -C_2 |\hat\theta_n| )\, dt }{ \int \exp( -A_n(t) )\, \pi( \hat\theta_n + t/\sqrt n )\, \exp( -C_2 |\hat\theta_n| )\, dt } \]
by the change of variable $\theta = \hat\theta_n + t/\sqrt n$.

Claim: The random functions $A_n$ converge in probability uniformly on compact sets to $t^T J t / 2$.


Proof. Let $h(x; \theta) \equiv -\log p(x; \theta)$ and
\[ A_n^0(t) = n \mathbb{P}_n \{ h(x; \theta_0 + t/\sqrt n) - h(x; \theta_0) \} = -n \mathbb{P}_n \{ \log p(x; \theta_0 + t/\sqrt n) - \log p(x; \theta_0) \} . \]
From the proof of theorem 4.7.6 (applied to $h$),
\[ A_n^0(t) = U_n^T t + \frac12 t^T J t + o_p(1) , \tag{a} \]
and this holds uniformly in $t$ in compact sets by Lemma 4.x.y. Note that
\[ A_n(t) = A_n^0\big( t + \sqrt n(\hat\theta_n - \theta_0) \big) - A_n^0\big( \sqrt n(\hat\theta_n - \theta_0) \big) . \]
Since $\sqrt n(\hat\theta_n - \theta_0) = O_p(1)$, for every $\epsilon > 0$ there exists a compact set $K \equiv K_\epsilon$ such that $P\big( \sqrt n(\hat\theta_n - \theta_0) \in K \big) \ge 1 - \epsilon$. Hence by the uniformity in $t$ in the convergence in (a),
\begin{align*}
A_n(t) &= U_n^T\big( t + \sqrt n(\hat\theta_n - \theta_0) \big) - U_n^T\big( \sqrt n(\hat\theta_n - \theta_0) \big) \\
&\quad + \frac12 \Big\{ \big( t + \sqrt n(\hat\theta_n - \theta_0) \big)^T J \big( t + \sqrt n(\hat\theta_n - \theta_0) \big) - \sqrt n(\hat\theta_n - \theta_0)^T J \sqrt n(\hat\theta_n - \theta_0) \Big\} + o_p(1) \\
&= U_n^T t + \frac12 t^T J t + \sqrt n(\hat\theta_n - \theta_0)^T J t + o_p(1) \\
&= \frac12 t^T J t + o_p(1) ,
\end{align*}
where the last line follows because $\sqrt n(\hat\theta_n - \theta_0) = -J^{-1} U_n + o_p(1)$. Since $A_n$ is convex and converges pointwise in probability to $t^T J t/2$, it converges uniformly in probability on compact subsets by lemma 4.7.1. $\Box$

Now we return to the main proof. Define $\gamma_n \equiv \inf_{|u|=1} A_n(u)$. It converges in probability to $\gamma_0 \equiv \inf_{|u|=1} u^T J u / 2 > 0$. By the same argument used in lemma 4.7.2 it follows that
\[ A_n(t) \ge \gamma_n |t| \quad \text{for } |t| > 1 . \]
Now the integrand in the numerator above converges in probability for each fixed $t$ to
\[ t\, \exp( -t^T J t/2 )\, \pi(\theta_0)\, \exp( -C_2 |\theta_0| ) , \]
and similarly the integrand of the denominator converges for each fixed $t$ to
\[ \exp( -t^T J t/2 )\, \pi(\theta_0)\, \exp( -C_2 |\theta_0| ) . \]
Furthermore the domination hypotheses of the following convergence lemma hold for both the numerator and denominator with dominating function
\[ D(t) \equiv 2 C_1\, 1\{ |t| \le 1 \} + C_1 |t| \exp( -\gamma_0 |t|/2 )\, 1\{ |t| > 1 \} . \]
Hence the ratio of integrals converges in probability to
\[ \frac{ \int t\, \exp( -t^T J t/2 )\, \pi(\theta_0) \exp( -C_2|\theta_0| )\, dt }{ \int \exp( -t^T J t/2 )\, \pi(\theta_0) \exp( -C_2|\theta_0| )\, dt } = 0 . \qquad\Box \]


Lemma 8.2 Suppose that $X_n(t), Y_n(t)$ are jointly measurable random functions, $X_n(t, \omega), Y_n(t, \omega)$ for $(t, \omega) \in K \times \Omega$ where $K \subset \mathbb{R}^d$ is compact, with $|X_n(t, \omega)| \le Y_n(t, \omega)$, that $\nu$ is a measure on $\mathbb{R}^d$, and:
(i) $Y_n(t) \to_p Y(t)$ and $X_n(t) \to_p X(t)$ for $\nu$-almost all $t \in K$.
(ii) $\int_K Y_n(t)\, d\nu(t) \to_p \int_K Y(t)\, d\nu(t)$ with $\big| \int_K Y(t)\, d\nu(t) \big| < \infty$ almost surely.
Then
\[ \int_K X_n(t)\, d\nu(t) \to_p \int_K X(t)\, d\nu(t) . \]

Proof. It suffices to show almost sure convergence for some further subsequence of any given subsequence. By convergence in probability for fixed $t$ and the dominated convergence theorem,
\[ H_n(\epsilon) \equiv (P \times \nu)\big( \{ (\omega, t) : |X_n(t, \omega) - X(t, \omega)| > \epsilon \} \big) = \int_K P\big( \{ \omega : |X_n(t, \omega) - X(t, \omega)| > \epsilon \} \big)\, d\nu(t) \to 0 , \]
and similarly for $\{Y_n\}$. By replacing $\epsilon$ by $\epsilon_n \downarrow 0$ and extraction of subsequences we can find a subsequence $\{n'\}$ so that the integrals $\int_K Y_{n'}\, d\nu$ converge almost surely, and so that for some set $N \subset K \times \Omega$ with $(P \times \nu)(N) = 0$ we get convergence for all $(\omega, t) \in N^c$ of both $X_{n'}$ and $Y_{n'}$. Then $\nu( \{ t : (\omega, t) \in N \} ) = 0$ for almost all $\omega$. Hence by Fatou's lemma applied to $Y_{n'} \pm X_{n'}$ we deduce that
\[ \int_K X_{n'}(t)\, d\nu(t) \to_{a.s.} \int_K X(t)\, d\nu(t) . \qquad\Box \]

Corollary 1 Suppose that:
(i) $G_n(t) \to_p G(t)$ for each fixed $t \in \mathbb{R}^d$.
(ii) $P\big( |G_n(t)| \le D(t) \text{ for all } t \in \mathbb{R}^d \big) \to 1$.
(iii) $\int D(t)\, dt < \infty$.
Then $\int G_n(t)\, dt \to_p \int G(t)\, dt$.

Proof. Let $\epsilon > 0$; choose a compact set $K = K_\epsilon$ so that $\int_{K^c} D(t)\, dt < \epsilon$. This is possible since $\int D(t)\, dt < \infty$. Then, using $|G(t)| \le D(t)$,
\begin{align*}
\Big| \int G_n(t)\, dt - \int G(t)\, dt \Big| &\le \Big| \int_K G_n(t)\, dt - \int_K G(t)\, dt \Big| + \int_{K^c} |G_n(t)|\, dt + \int_{K^c} |G(t)|\, dt \\
&\le \Big| \int_K G_n(t)\, 1_{[|G_n(t)| \le D(t)]}\, dt - \int_K G(t)\, dt \Big| \\
&\quad + \int_K 1_{[|G_n(t)| > D(t)]}\, |G_n(t)|\, dt + \int_{K^c} 1_{[|G_n(t)| > D(t)]}\, |G_n(t)|\, dt \\
&\quad + 2 \int_{K^c} D(t)\, dt .
\end{align*}
Now apply lemma 8.2 with $Y_n(t) \equiv D(t)$ and
\[ X_n(t) \equiv G_n(t)\, 1_{[|G_n(t)| \le D(t)]} . \]
Then $X_n(t) \to_p G(t)$ and the second hypothesis of the lemma holds easily. Hence $\int X_n(t)\, dt \to_p \int G(t)\, dt$, so that the first term above $\to_p 0$. The second and third terms converge in probability to zero because the set over which the integral is taken is empty with arbitrarily high probability; and the fourth term is bounded by $2\epsilon$ by choice of $K$. $\Box$