
Peter Hoff Shrinkage estimators October 31, 2013

Contents

1 Shrinkage estimators

2 Admissible linear shrinkage estimators

3 Admissibility of unbiased normal mean estimators

4 Motivating the James-Stein estimator

4.1 What is wrong with X?

4.2 An oracle estimator

4.3 Adaptive shrinkage estimation

5 Risk of δJS

5.1 Risk bound for δJS

5.2 Stein's identity

6 Some oracle inequalities

6.1 A simple oracle inequality

7 Unknown variance or covariance

Much of this content comes from Lehmann and Casella [1998], sections 5.2, 5.4, 5.5, 4.6 and 4.7.

1 Shrinkage estimators

Consider a model {p(x|θ) : θ ∈ Θ} for a random variable X such that

E[X|θ] = µ(θ), 0 < Var[X|θ] = σ²(θ) < ∞ for all θ ∈ Θ.


A linear estimator of µ(θ) is an estimator of the form

δab(X) = aX + b.

Is δab admissible?

Theorem 1 (LC thm 5.2.6). δab(X) = aX + b is inadmissible for E[X|θ] under

squared error loss whenever

1. a > 1,

2. a = 1 and b ≠ 0, or

3. a < 0.

Proof. The risk of δab is

R(θ, δab) = E[(aX + b − µ)²|θ]
= E[(a(X − µ) + (b − µ(1 − a)))²|θ]
= a²E[(X − µ)²|θ] + 2a(b − µ(1 − a))E[(X − µ)|θ] + (b − µ(1 − a))²
= a²σ² + (b − µ(1 − a))².

1. If a > 1, then R(θ, δab) ≥ a²σ² > σ² = R(θ, X), so δab is dominated by X.

2. If a = 1 and b ≠ 0, then R(θ, δab) = σ² + b² > σ² = R(θ, X), so δab is dominated by X.

3. If a < 0, then

R(θ, δab) > (b − µ(1 − a))²
= (1 − a)²(b/(1 − a) − µ)²
≥ (b/(1 − a) − µ)² = R(θ, b/(1 − a)),

since (1 − a)² > 1, and so δab is dominated by the constant estimator b/(1 − a).
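This domination is easy to check by simulation. The following Python sketch is a quick Monte Carlo illustration of case 1 for a normal mean; the particular values of a, b, θ and σ are arbitrary choices for the check.

    import numpy as np

    rng = np.random.default_rng(0)
    theta, sigma = 2.0, 1.0               # arbitrary mean and sd for the check
    a, b = 1.5, 0.3                       # case a > 1 of the theorem
    X = rng.normal(theta, sigma, size=200_000)

    risk_lin = np.mean((a * X + b - theta) ** 2)   # Monte Carlo risk of aX + b
    risk_X = np.mean((X - theta) ** 2)             # Monte Carlo risk of X, approx sigma^2
    print(risk_lin, risk_X)   # risk_lin approx a^2 sigma^2 + (b - theta(1-a))^2 > sigma^2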


Letting w = 1− a and µ0 = b/(1− a), the result suggests that if we want to use an

admissible linear estimator, it should be of the form

δ(X) = wµ0 + (1 − w)X, w ∈ [0, 1].

We call such estimators linear shrinkage estimators as they “shrink” the estimate

from X towards µ0. Intuitively, you can think of µ0 as your “guess” as to the value

of µ, and w as the confidence you have in your guess. Of course, the closer your

guess is to the truth, the better your estimator.

If µ0 represents your guess as to µ(θ), it seems natural to require that µ0 ∈ µ(Θ) =

{µ : µ = µ(θ), θ ∈ Θ}, i.e. µ0 is a possible value of µ.

Lemma 1. If µ(Θ) is convex and µ0 ∉ µ(Θ), then δ(X) = wµ0 + (1 − w)X is not admissible.

Proof.

For the one-dimensional case, suppose µ0 > µ(θ) for all θ ∈ Θ. Let µ̃0 = sup_Θ µ(θ) and δ̃(X) = wµ̃0 + (1 − w)X. Then δ̃(X) dominates δ(X): the two estimators have the same variance, and δ(X) has the higher bias for all θ. The proof is similar for the case µ0 < µ(θ) for all θ ∈ Θ.

Exercise: Generalize this result to higher dimensions.

2 Admissible linear shrinkage estimators

We have shown that δ(X) = wµ0 + (1− w)X is inadmissible for µ(θ) = E[X|θ] if

• w ∉ [0, 1], or


• µ0 ∉ µ(Θ).

Restricting attention to w ∈ [0, 1] and µ0 ∈ µ(Θ), it may seem that such estimators

should always be admissible, but “always” is almost always too inclusive.

Exercise: Give an example where wµ0 + (1 − w)X is not admissible, even with

w ∈ (0, 1) and µ0 ∈ µ(Θ).

Linear shrinkage via conjugate priors

What about using a Bayesian argument? Recall,

Theorem. Any unique Bayes estimator is admissible.

If we can show that wµ0 + (1−w)X is unique Bayes under some prior, then we will

have shown admissibility.

Let X1, . . . , Xn ∼ i.i.d. p(x|θ), where

p(x|θ) ∈ P = {p(x|θ) = h(x) exp(x · θ − A(θ)) : θ ∈ H}.

Consider estimation of µ = E[X|θ] under squared error loss. Let π(θ) ∝ exp(n0µ0 · θ − n0A(θ)), where n0 > 0 and µ0 ∈ Conv{E[X|θ] : θ ∈ H}. Recall that under this prior,

E[µ] ≡ E[E[X|θ]] = µ0.

Then π(θ|x) ∝ exp(n1µ1 · θ − n1A(θ)), where n1 = n0 + n and

n1µ1 = n0µ0 + nx̄
µ1 = (n0/(n0 + n))µ0 + (n/(n0 + n))x̄.

Under this posterior distribution,

E[µ|x] ≡ E[E[X|θ]|x] = µ1.


Therefore, the unique Bayes estimator of µ = E[X|θ] under squared error loss is

µ1 = wµ0 + (1 − w)x̄, where w = n0/(n0 + n),

and so this linear shrinkage estimator is admissible.
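For the normal model, one member of this exponential family class, the posterior-mean shrinkage can be computed directly; a minimal Python sketch, with arbitrary illustrative values of n0 and µ0:

    import numpy as np

    rng = np.random.default_rng(1)
    mu0, n0 = 0.0, 5.0            # prior guess and prior "sample size" (illustrative)
    mu_true, n = 3.0, 20          # truth used only to simulate data
    x = rng.normal(mu_true, 1.0, size=n)

    w = n0 / (n0 + n)                      # shrinkage weight
    mu1 = w * mu0 + (1 - w) * x.mean()     # posterior mean: shrinks xbar toward mu0
    print(w, x.mean(), mu1)                # mu1 lies between mu0 and xbar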

Example (multiple normal means):

Let X ∼ Np(θ, σ²I). First consider the case that σ² is known, so that

p(x|θ) = (2πσ²)^{−p/2} exp(−(x − θ) · (x − θ)/[2σ²])
∝θ exp(x · θ/σ² − θ · θ/[2σ²]).

Consider the normal prior

π(θ) = (2πτ0²)^{−p/2} exp(−(θ − θ0) · (θ − θ0)/[2τ0²])
∝θ exp(θ0 · θ/τ0² − θ · θ/[2τ0²]),

where τ0² is analogous to 1/n0 in the general formulation for exponential families.

The posterior density is

π(θ|x) ∝θ exp{[θ0/τ0² + x/σ²] · θ − θ · θ[1/σ² + 1/τ0²]/2}
= exp{θ1 · θ/τ1² − θ · θ/[2τ1²]},

where

• 1/τ1² = 1/τ0² + 1/σ²

• θ1 = [(1/τ0²)/(1/τ0² + 1/σ²)]θ0 + [(1/σ²)/(1/τ0² + 1/σ²)]x ≡ wθ0 + (1 − w)x.

So {θ|x} ∼ Np(θ1, τ1²I), which means that

E[θ|x] = θ1 = wθ0 + (1 − w)x

uniquely minimizes the posterior risk under squared error loss. The posterior mean

is therefore a unique Bayes estimator and also an admissible estimator of θ. Since

this result holds for all τ0² > 0, we have the following:


Lemma 2. For each w ∈ (0, 1) and θ0 ∈ Rp, the estimator δwθ0(x) = wθ0 + (1 − w)x is admissible for estimating θ in the model X ∼ Np(θ, σ²I), θ ∈ Rp, where σ² is known.

Of course, what we would like is the following lemma:

Lemma 3. For each w ∈ (0, 1) and θ0 ∈ Rp, the estimator δwθ0(X) = wθ0 + (1 − w)X is admissible for estimating θ in the model X ∼ Np(θ, σ²I), θ ∈ Rp, σ² ∈ R+.

How can this result be obtained?

Theorem 2. Let P = {p(x|θ, ψ) : (θ, ψ) ∈ Θ × Ψ}, and for ψ0 ∈ Ψ, let Pψ0 =

{p(x|θ, ψ0) : θ ∈ Θ} be a submodel. If δ is admissible for estimating θ under Pψ0 for

each ψ0 ∈ Ψ, then δ is admissible for estimating θ under P.

Proof. Suppose δ satisfies the conditions of the theorem but is not admissible.

Then there exists a δ′ ∈ D such that

R((θ, ψ), δ′) ≤ R((θ, ψ), δ) ∀(θ, ψ), and
R((θ0, ψ0), δ′) < R((θ0, ψ0), δ) for some (θ0, ψ0).

But this contradicts the assumption that δ is admissible for estimating θ under Pψ0 .

Therefore, no such δ′ can exist and so δ is admissible for P .

A corollary to this theorem is the admissibility of wθ0 + (1 − w)X in the normal

model with unknown variance.

3 Admissibility of unbiased normal mean estimators

Let X ∼ Np(θ, σ²I), θ ∈ Rp, σ² > 0.

For estimation of θ under squared error loss, we have shown that the linear shrinkage

estimator δ(x) = wθ0 + (1− w)x is


• inadmissible if w ∉ [0, 1],

• admissible if w ∈ (0, 1).

What remains is to evaluate admissibility for w ∈ {0, 1}. Admissibility for w = 1 is easy to show: the estimator δ1θ0(x) = θ0 beats everything at θ0 and so can't be dominated. The last and most interesting case is that of w = 0, i.e. δ0(X) = X, the unbiased MLE and UMVUE.

Blyth’s method

Recall Blyth’s method for showing admissibility using a limiting Bayes argument:

Theorem 3 (LC 5.7.13). Suppose Θ ⊂ Rp is open, and that R(θ, δ) is continuous in θ for all δ ∈ D. Let δ be an estimator and {πn} be a sequence of measures such that for any open ball B ⊂ Θ,

[R(πn, δ) − R(πn, δπn)]/πn(B) → 0 as n → ∞.

Then δ is admissible.

Let's try to use this to show admissibility of δ0(X) = X in the normal means problem. We begin with the case that σ² = 1 is known.

X ∼ Np(θ, I)
θ ∼ Np(0, τ²I)
{θ|X} ∼ Np([τ²/(1 + τ²)]X, [τ²/(1 + τ²)]I)

The unique Bayes estimator is δτ² = E[θ|X] = [τ²/(1 + τ²)]X ≡ (1 − w)X.

To apply the theorem, we need to compute the Bayes risk of δτ² and X under the Np(0, τ²I) prior πτ². The loss we will use is "scaled" squared error loss, L(θ, d) = Σj (θj − dj)²/p. Because the risk is the average of the individual MSEs, the Bayes


risk is just the average of the Bayes risks from the p components,

R(θ, δ) = E[Σ_{j=1}^p (θj − δj)²]/p = Σ_{j=1}^p E[(θj − δj)²]/p,

and so calculating the Bayes risk is similar to calculating the risk in the p = 1 problem. For δτ²(x) = ax, where a = 1 − w = τ²/(1 + τ²), we have

E[(aX − θ)²] = E[(a(X − θ) − (1 − a)θ)²]
= a²E[(X − θ)²] − 2a(1 − a)E[(X − θ)θ] + (1 − a)²E[θ²]
= a² + (1 − a)²τ²
= (τ²/(1 + τ²))² + τ²/(1 + τ²)² = τ²/(1 + τ²).

A more intuitive way to calculate this makes use of the fact that δτ²(X) = E[θ|X], so

E[(θ − δτ²)²] = E[(θ − E[θ|X])²]
= Ex[Eθ|x[(θ − E[θ|X])²]]
= Ex[Var[θ|X]]
= Ex[τ²/(1 + τ²)] = τ²/(1 + τ²).

Similarly,

E[(X − θ)²] = Eθ[Ex|θ[(X − θ)²]] = Eθ[1] = 1.

So R(πτ², δτ²) = τ²/(1 + τ²) and R(πτ², X) = 1. Returning to the p-variate case, since the Bayes risk is the arithmetic average of the risks for each of the p components of θ, we have

R(πτ², δτ²) = τ²/(1 + τ²), R(πτ², X) = 1.


Note that

δτ²(X) = [τ²/(1 + τ²)]X ↑ X as τ² ↑ ∞,
R(πτ², δτ²) ↑ R(πτ², X) as τ² ↑ ∞,

so X is a "limiting Bayes" estimator, for which the risk difference from the Bayes estimator converges to zero. This is promising; let's now apply the theorem. Letting B be any open finite ball in Rp, we need to see if the following limit is zero:

lim_{τ²→∞} [R(πτ², X) − R(πτ², δτ²)]/πτ²(B) = lim_{τ²→∞} [1 − τ²/(1 + τ²)]/πτ²(B)
= lim_{τ²→∞} [(1 + τ²)πτ²(B)]⁻¹.

Now πτ²(B) → 0 as τ² → ∞ for any bounded set B. Therefore, the limit is zero only if

lim_{τ²→∞} τ²πτ²(B) = ∞.

We have

τ²πτ²(B) = τ² ∫_B (2πτ²)^{−p/2} exp(−||θ||²/[2τ²]) dθ
= (2π)^{−p/2} × (τ²)^{1−p/2} × ∫_B exp(−||θ||²/[2τ²]) dθ
≡ (2π)^{−p/2} × (a) × (b).

Now take the limit as τ² → ∞:

(b) → Vol(B) as τ² → ∞,
(a) → ∞ if p = 1, (a) → 1 if p = 2, (a) → 0 if p > 2.

Therefore, the desired limit is achieved for p = 1 but not p > 1.

• By the theorem, X is admissible when p = 1.


• For p > 1, this method of showing admissibility does not work.

– For p = 2, X can be shown to be admissible using Blyth’s method with

non-normal priors (see LC exercise 5.4.5).

– For p > 2, X can’t be shown to be admissible because it isn’t.
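The limiting behavior of [(1 + τ²)πτ²(B)]⁻¹ derived above can be checked numerically. A sketch, for a ball B of radius 1 at the origin, using the fact that ||θ||²/τ² ∼ χ²_p under πτ²; it assumes scipy is available.

    import numpy as np
    from scipy.stats import chi2

    r = 1.0                                  # ball B of radius 1 centered at the origin
    for p in (1, 2, 3):
        for tau2 in (1e2, 1e4, 1e6):
            prob_B = chi2.cdf(r**2 / tau2, df=p)     # pi_{tau^2}(B) under N_p(0, tau^2 I)
            print(p, tau2, 1.0 / ((1 + tau2) * prob_B))
    # p = 1: ratio -> 0 (Blyth's condition holds); p = 2: converges to a
    # positive constant; p = 3: ratio -> infinity.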

Interpreting the failure of Blyth’s method:

The admissibility condition in Blyth's method comes from considering the existence of an estimator δ that dominates X. If such an estimator exists, then by continuity of the risks,

∃ ε > 0 and an open ball B ⊂ Θ such that R(θ, X) − R(θ, δ) > ε ∀θ ∈ B,

which implies for each prior πk that

R(πk, X) − R(πk, δ) = ∫ [R(θ, X) − R(θ, δ)]πk(dθ)
≥ ∫_B [R(θ, X) − R(θ, δ)]πk(dθ) ≥ επk(B).

Comparing to the Bayes risk of the Bayes estimator δk under πk then gives

R(πk, X) − R(πk, δk) ≥ R(πk, X) − R(πk, δ) ≥ επk(B) ∀k,

as δk has Bayes risk less than or equal to that of δ. Could such a δ exist?

Could exist: Suppose B is a ball such that πk(B) goes to zero very fast. Then

an estimator (like X) can have a good limiting Bayes risk and still do poorly on B.

This allows for the possibility of domination by another estimator that does better

on B.


Couldn't exist: On the other hand, if R(πk, X) − R(πk, δk) goes to zero very fast (e.g. faster than the probability of any ball B), then in a sense X would have to be doing well everywhere, and could not be dominated; this is Blyth's method for showing admissibility.

What fails in the admissibility proof for the normal means problem is that for p > 2,

the probability πk(B) of an open ball B is going to zero much faster than the Bayes

risk difference, leaving a large enough “gap” for some other estimator to do better.

4 Motivating the James-Stein estimator

Stein [1956] showed that X is inadmissible for θ in the normal means problem when p > 2. This was surprising, as X is the MLE and UMVUE for θ. In this section, we motivate the James-Stein estimator, which will turn out to dominate X when p > 2.

For large p,

• X may be close to θ, but

• X · X = ||X||² may be far from θ · θ = ||θ||².

If X ∼ Np(θ, I),

E[||X||²] = E[Σ_{j=1}^p Xj²] = Σ_{j=1}^p (θj² + 1) = ||θ||² + p,

so for large p, the magnitude of the estimator vector X is expected to be much larger

than the magnitude of the estimand vector θ. More insight can be gained as follows:

Note that every vector x can be expressed as

x = sθ + r, for some s ∈ R and r : θ · r = 0.


Here, s is the magnitude of the projection of x in the direction of θ, and r is the residual vector. Using this decomposition, we can write the squared-error loss of ax for estimating θ as

||ax − θ||² = (ax − θ) · (ax − θ)
= ((as − 1)θ + ar) · ((as − 1)θ + ar)
= (as − 1)²||θ||² + a²||r||².

Now consider replacing x with X ∼ Np(θ, I). The random-variable version of the above equation is then

||aX − θ||² = (aS − 1)²||θ||² + a²||R||².

Exercise: Show that

• S ∼ N(1, 1/||θ||²),

• ||R||² ∼ χ²_{p−1},

• S and R are independent.

Now imagine a situation where p is growing but ||θ||² remains fixed. The distribution of (aS − 1)²||θ||² remains fixed whereas the distribution of a²||R||² blows up. This suggests that if we think ||θ||²/p is small we should use an estimator like aX with a < 1 to control the error that comes from ||R||². But what should the value of a be?
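The exercise's claims, and the dominance of the residual term, can be checked by simulation; a sketch with an arbitrary θ of small fixed norm:

    import numpy as np

    rng = np.random.default_rng(2)
    p = 50
    theta = np.zeros(p)
    theta[0] = 2.0                            # ||theta||^2 = 4, held fixed as p grows
    X = rng.normal(theta, 1.0, size=(100_000, p))

    S = X @ theta / (theta @ theta)           # projection coefficient in X = S*theta + R
    R = X - S[:, None] * theta
    R2 = np.sum(R**2, axis=1)

    print(S.mean(), S.var())    # approx 1 and 1/||theta||^2 = 0.25
    print(R2.mean())            # approx p - 1 = 49: the residual term dominates the loss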

4.2 An oracle estimator:

Question: Among estimators {aX : a ∈ [0, 1]}, which has the smallest risk?

Solution:

E[||aX − θ||²] = E[||a(X − θ) − (1 − a)θ||²]
= a²p + (1 − a)²||θ||².


Taking derivatives, the minimizing value a∗ of a satisfies

2ap − 2(1 − a)||θ||² = 0
a/(1 − a) = ||θ||²/p
a∗ = ||θ||²/(||θ||² + p).

Thus the optimal shrinkage "estimator" is given by δa∗(X) = a∗X. This is not really an estimator in the usual sense, because the ideal degree of shrinkage a∗ depends on θ. For this reason, a∗X is sometimes called an "oracle estimator": you would need an oracle to tell you the value of ||θ||² before you could use it. Note that the risk of this estimator is

E[||a∗X − θ||²] = [||θ||⁴p + p²||θ||²]/(||θ||² + p)²
= p||θ||²(||θ||² + p)/(||θ||² + p)²
= p||θ||²/(||θ||² + p),

and so

E[||a∗X − θ||²] = p||θ||²/(||θ||² + p) < p = E[||X − θ||²].

The risk differential is large if ||θ||² is small compared to p.
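The oracle-risk formula can be verified by simulation; a sketch, for an arbitrary θ:

    import numpy as np

    rng = np.random.default_rng(3)
    p = 20
    theta = rng.normal(0, 0.5, size=p)        # arbitrary theta with small ||theta||^2/p
    t2 = theta @ theta
    a_star = t2 / (t2 + p)                    # oracle shrinkage factor

    X = rng.normal(theta, 1.0, size=(100_000, p))
    mc_risk = np.mean(np.sum((a_star * X - theta)**2, axis=1))
    print(mc_risk, p * t2 / (t2 + p), p)      # Monte Carlo risk matches the formula, < p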

4.3 Adaptive shrinkage estimation

As shown above, the optimal amount of shrinkage a∗ is

a∗ = ||θ||²/(||θ||² + p) = (||θ||²/p)/(||θ||²/p + 1).


Note that ||θ||²/p is the variability of the θj values around zero. Can this variability be estimated? Consider the following hierarchical model:

Xj = θj + εj, ε1, . . . , εp ∼ i.i.d. N(0, 1)
θ1, . . . , θp ∼ i.i.d. N(0, τ²)

If you’d like to connect this with some actual inference problem, imagine that each

Xj is the sample mean or t-statistic calculated from observations from experiment j,

with population mean θj.

Suppose you believed this model and knew the value of τ². If you were interested in finding an estimator δ(X) that minimized the expected squared error ||θ − δ(X)||²

under repeated sampling of

• θ1, . . . , θp, followed by sampling of

• X1, . . . , Xp,

you would want to come up with an estimator δ(X) that minimized

E[||θ − δ(X)||²] = ∫∫ ||θ − δ(x)||² p(dx|θ) p(dθ).

Exercise: Show that δτ²(X) = [τ²/(1 + τ²)]X minimizes the expected loss.

If we knew τ², then the estimator to use would be [τ²/(1 + τ²)]X. We generally don't know τ², but maybe it can be estimated from the data. Under the above model,

X = θ + ε
θ ∼ Np(0, τ²I)
ε ∼ Np(0, I)
Cov(θ, ε) = 0.

This means that the distribution of X marginalized over θ is

X ∼ Np(0, (τ² + 1)I).


An unbiased estimator of τ² + 1 is clearly ||X||²/p, so an unbiased estimator of τ² is

τ̂² = (||X||² − p)/p.

However, we were interested in estimating τ²/(τ² + 1), not τ². If p > 2, you can use the fact that ||X||² ∼ gamma(p/2, 1/[2(τ² + 1)]) to show that

E[||X||⁻²] = [(p − 2)(τ² + 1)]⁻¹,

and so

E[(p − 2)/||X||²] = 1/(τ² + 1)
E[1 − (p − 2)/||X||²] = τ²/(τ² + 1).
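This unbiasedness can be checked by simulation under the hierarchical model; a sketch with arbitrary values of p and τ²:

    import numpy as np

    rng = np.random.default_rng(4)
    p, tau2, nsim = 10, 3.0, 200_000
    theta = rng.normal(0, np.sqrt(tau2), size=(nsim, p))   # theta_j ~ N(0, tau^2)
    X = theta + rng.normal(0, 1.0, size=(nsim, p))         # X | theta ~ N_p(theta, I)

    est = (p - 2) / np.sum(X**2, axis=1)
    print(est.mean(), 1 / (tau2 + 1))    # both approx 0.25: unbiased for 1/(tau^2+1)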

Again, [τ²/(τ² + 1)]X would be the optimal estimator in this hierarchical model if we knew τ². If we don't know τ², we might instead plug in the unbiased estimate of τ²/(τ² + 1) just derived, giving

δJS(X) = (1 − (p − 2)/||X||²)X.

This estimator is called the James-Stein estimator. As we will see, it has many interesting properties:

• For large p in the hierarchical normal model, it is almost as good as the oracle estimator [τ²/(τ² + 1)]X:

E_{X,θ}[||θ − δJS||²] ≈ E_{X,θ}[||θ − [τ²/(τ² + 1)]X||²].

• Even if the hierarchical normal model isn't correct, it is still almost as good as the oracle estimator a∗X in the normal means model:

E_{X|θ}[||θ − δJS||²] ≈ E_{X|θ}[||θ − a∗X||²].

• In the normal means problem, this estimator dominates the unbiased estimator X if p > 2:

E_{X|θ}[||θ − δJS||²] < E_{X|θ}[||θ − X||²] ∀θ.

We will show this last inequality first.
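Before doing so, here is a minimal implementation of δJS together with a Monte Carlo comparison of its risk to that of X; the values of p and ||θ||² below are arbitrary choices for illustration.

    import numpy as np

    rng = np.random.default_rng(5)

    def james_stein(x):
        # delta_JS(x) = (1 - (p-2)/(x.x)) x, applied row-wise; requires p > 2
        p = x.shape[-1]
        return (1 - (p - 2) / np.sum(x**2, axis=-1, keepdims=True)) * x

    p = 10
    for norm2 in (0.0, 5.0, 50.0):            # several values of ||theta||^2
        theta = np.zeros(p)
        theta[0] = np.sqrt(norm2)
        X = rng.normal(theta, 1.0, size=(200_000, p))
        risk_js = np.mean(np.sum((james_stein(X) - theta)**2, axis=1)) / p
        print(norm2, risk_js)     # always < 1 = R(theta, X); gain largest near theta = 0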


5 Risk of δJS

We will show that δJS dominates X by showing that its risk R(θ, δJS) = E[||δJS − θ||²]/p is uniformly less than 1 = R(θ, X). This will not be done by computing the risk function in closed form, but by bounding it above. The bound will be obtained via an identity that has applications beyond the calculation of R(θ, δJS).

5.1 Risk bound for δJS

We can write the James-Stein estimator as

δJS(x) = (1 − (p − 2)/(x · x))x
= x − [(p − 2)/(x · x)]x
≡ x − g(x).

Under X ∼ Np(θ, I),

E[||δJS − θ||²] = E[||(X − θ) − g(X)||²]
= E[||X − θ||²] + E[||g(X)||²] − 2E[(X − θ) · g(X)],

where all expectations are with respect to the distribution of X given θ. The first expectation is p and the second is

E[(p − 2)² (X · X)/(X · X)²] = (p − 2)²E[1/(X · X)].

The third expectation is more complicated, but in the next subsection we'll derive an identity (Stein's identity) for computing E[(X − θ) · g(X)] that is applicable for arbitrary functions g. Stein's identity as applied to g(x) = [(p − 2)/(x · x)]x gives

E[(X − θ) · g(X)] = E[(p − 2)²/(X · X)].


Using this for the above risk calculation gives

E[||δJS − θ||²] = p + (p − 2)²E[1/(X · X)] − 2(p − 2)²E[1/(X · X)]
= p − (p − 2)²E[1/(X · X)].

Note that we haven't actually calculated the risk of δJS in closed form: our formula depends on the expectation of 1/(X · X), which is an inverse moment of a noncentral χ² distribution whose noncentrality parameter depends on θ. However, computing this moment is not necessary to show that δJS dominates X: since 1/(x · x) > 0 for all x ≠ 0, we have

E[||δJS − θ||²] = p − (p − 2)²E[1/(X · X)] < p = E[||X − θ||²].

Since the expectation of (X · X)⁻¹ is complicated, further study of the risk of δJS is often achieved via a study of its unbiased risk estimate. From the above calculation, we see that

E[||δJS − θ||²] = E[p − (p − 2)²/(X · X)],

and so p − (p − 2)²/(X · X) can be said to be an unbiased estimate of the risk of δJS.
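This unbiasedness is easy to check by simulation; a sketch with an arbitrary θ:

    import numpy as np

    rng = np.random.default_rng(6)
    p = 10
    theta = rng.normal(0, 1.0, size=p)
    X = rng.normal(theta, 1.0, size=(200_000, p))

    xx = np.sum(X**2, axis=1)
    delta = (1 - (p - 2) / xx)[:, None] * X
    loss = np.sum((delta - theta)**2, axis=1)    # actual loss; requires knowing theta
    ure = p - (p - 2)**2 / xx                    # unbiased risk estimate; uses X only
    print(loss.mean(), ure.mean())               # agree up to Monte Carlo error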

5.2 Stein’s identity

We start with a univariate version of the identity:

Lemma 4 (Stein's identity). Let X ∼ N(µ, σ²) and let g(x) be such that E[|g′(X)|] < ∞. Then

E[g(X)(X − µ)] = σ²E[g′(X)].

Proof.

The proof follows from Fubini's theorem and a bit of calculus. Letting p(x) = φ([x − µ]/σ)/σ, note that p′(x) = −[(x − µ)/σ²]p(x). By the fundamental theorem of calculus,

p(x) = ∫_{−∞}^x −[(y − µ)/σ²]p(y) dy = ∫_x^∞ [(y − µ)/σ²]p(y) dy.


The expectation we wish to calculate is

E[g′(X)] = ∫_{−∞}^∞ g′(x)p(x) dx
= ∫_0^∞ g′(x)p(x) dx + ∫_{−∞}^0 g′(x)p(x) dx.

Doing the first part, we have

∫_0^∞ g′(x)p(x) dx = ∫_0^∞ g′(x) ∫_x^∞ [(y − µ)/σ²]p(y) dy dx
= ∫_{0<x<y<∞} g′(x)[(y − µ)/σ²]p(y) dy dx
= ∫_0^∞ [(y − µ)/σ²]p(y) ∫_0^y g′(x) dx dy
= ∫_0^∞ [(y − µ)/σ²]p(y)[g(y) − g(0)] dy
= E[(g(X) − g(0))[(X − µ)/σ²]1(X > 0)].

Similarly, the second part is

∫_{−∞}^0 g′(x)p(x) dx = E[(g(X) − g(0))[(X − µ)/σ²]1(X < 0)].

Adding the two parts gives

E[g′(X)] = E[(g(X) − g(0))[(X − µ)/σ²]]
= E[g(X)[(X − µ)/σ²]] − g(0)E[(X − µ)/σ²] = E[g(X)[(X − µ)/σ²]],

and so

E[g(X)(X − µ)] = σ²E[g′(X)].
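The identity is easy to check numerically for a particular g; a sketch with g(x) = sin(x), whose derivative is bounded (so E[|g′(X)|] < ∞):

    import numpy as np

    rng = np.random.default_rng(7)
    mu, sigma = 1.5, 2.0
    X = rng.normal(mu, sigma, size=1_000_000)

    lhs = np.mean(np.sin(X) * (X - mu))      # E[g(X)(X - mu)]
    rhs = sigma**2 * np.mean(np.cos(X))      # sigma^2 E[g'(X)]
    print(lhs, rhs)                          # agree up to Monte Carlo error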

Stein's lemma is often alternatively proven with integration by parts. These proofs go roughly as follows: as before,

p(x) = (2πσ²)^{−1/2} exp{−(x − µ)²/[2σ²]}
dp(x)/dx = −(1/σ²)(x − µ)p(x).


E[g′(X)] = ∫_{−∞}^∞ g′(x)p(x) dx
= lim_{c→∞} ∫_{−c}^c g′(x)p(x) dx
≡ lim_{c→∞} ∫_{−c}^c v′(x)u(x) dx (taking v(x) = g(x) and u(x) = p(x))
= lim_{c→∞} [u(x)v(x)|_{−c}^c − ∫_{−c}^c v(x)u′(x) dx]
= lim_{c→∞} [g(x)p(x)|_{−c}^c − ∫_{−c}^c g(x)[−(x − µ)/σ²]p(x) dx]
= E[g(X)(X − µ)/σ²] + lim_{c→∞} g(x)p(x)|_{−c}^c.

To complete the proof we have to show that the last limit is zero. This is straightforward to show if p(x) decreases monotonically in |x|:

Lemma 5. Let p(x) be decreasing to zero in |x| and let E[|g′(X)|] < ∞. Then g(x)p(x) → 0 as x → ±∞.

Proof. Given ε > 0, ∃ K such that

∫_K^∞ |g′(x)|p(x) dx < ε/3.

Then for any t sufficiently large,

1. p(t) < p(K) (∫_0^K |g′(x)|p(x) dx)⁻¹ × ε/3, and

2. p(t)|g(0)| < ε/3.


From this, we have

|g(t)p(t)| = p(t)|g(0) + ∫_0^t g′(x) dx|
= p(t)|g(0) + ∫_0^K g′(x) dx + ∫_K^t g′(x) dx|
≤ p(t)|g(0)| + p(t)∫_0^K |g′(x)| dx + p(t)∫_K^t |g′(x)| dx
≤ p(t)|g(0)| + [p(t)/p(K)]∫_0^K |g′(x)|p(x) dx + ∫_K^t |g′(x)|p(x) dx
≤ ε/3 + ε/3 + ε/3 = ε,

where the second-to-last line holds because p(K) ≤ p(x) on x ∈ (0, K) and p(t) ≤ p(x) on x ∈ (K, t), as p(x) is monotonically decreasing.

This identity generalizes to other exponential families. See LC Theorem 1.5.15.

For computing the risk of a vector-valued function g : Rp → Rp, we will need a

multivariate version of the above identity.

Lemma 6 (Stein's identity, multivariate version). Let X ∼ Np(µ, σ²I) and let g : Rp → Rp be such that E[|∂gi(X)/∂xi|] < ∞ for each i. Then

E[(X − µ) · g(X)] = σ²E[∇ · g(X)],

where ∇ · g = Σ_{j=1}^p ∂gj(x)/∂xj.

Proof. This is just a corollary of the univariate version:

E[∂gp(X)/∂xp] = ∫_{x_{−p}} ( ∫ [∂gp(x)/∂xp] p(xp) dxp ) Π_{j=1}^{p−1} p(xj) dxj
= ∫_{x_{−p}} E_{xp}[gp(X)(Xp − µp)]/σ² Π_{j=1}^{p−1} p(xj) dxj
= (1/σ²)E[gp(X)(Xp − µp)],


and similarly for each other j ∈ {1, . . . , p}. Therefore

E[∇ · g(X)] = Σ_j E[∂gj(X)/∂xj]
= Σ_j E[gj(X)(Xj − µj)]/σ²
= E[g(X) · (X − µ)]/σ².

Now we are in a position to apply the lemma to obtain the unbiased risk estimator of δJS. Recall we needed to calculate E[(X − θ) · g(X)], where g(x) = [(p − 2)/(x · x)]x. Applying the lemma, we have

g(x) = (p − 2)(x1/(x · x), . . . , xp/(x · x))
∇ · g(x) = (p − 2) Σ_j [(x · x) − 2xj²]/(x · x)²
= [(p − 2)/(x · x)²][p(x · x) − 2(x · x)]
= (p − 2)²/(x · x),

and so

E[(X − θ) · g(X)] = E[(p − 2)²/(X · X)],

as we used above.
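The divergence calculation can be double-checked with finite differences; a sketch for p = 4 at an arbitrary point:

    import numpy as np

    def g(x):
        # g(x) = (p - 2) x / (x.x), the James-Stein shift term
        return (len(x) - 2) * x / (x @ x)

    def divergence(f, x, h=1e-6):
        # numerical divergence sum_j d f_j / d x_j via central differences
        out = 0.0
        for j in range(len(x)):
            e = np.zeros(len(x))
            e[j] = h
            out += (f(x + e)[j] - f(x - e)[j]) / (2 * h)
        return out

    x = np.array([1.0, -2.0, 0.5, 3.0])
    p = len(x)
    print(divergence(g, x), (p - 2)**2 / (x @ x))   # both equal (p-2)^2/(x.x)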

6 Some oracle inequalities

6.1 A simple oracle inequality

Recall that if we knew ||θ||², the optimal estimator in the class {δa(x) = ax : a ∈ [0, 1]} would be δa∗, where

a∗ = (||θ||²/p)/(||θ||²/p + 1).


We also showed that the risk of this estimator is

R(θ, a∗X) = E[||a∗X − θ||²]/p = ||θ||²/(||θ||² + p) < 1.

Not surprisingly, it turns out that

R(θ, a∗X) ≤ R(θ, δJS)

(use the fact that X · X has a noncentral χ² distribution, or condition on X · X).

But how much worse is δJS than the oracle estimator δa∗? Recall the risk of δJS is

R(θ, δJS) = 1 − [(p − 2)²/p]E[(X · X)⁻¹].

Since 1/x is convex, Jensen's inequality gives

E[1/(X · X)] ≥ 1/E[X · X] = 1/(||θ||² + p),

and so

R(θ, δJS) ≤ 1 − (p − 2)²/[p(||θ||² + p)]

p[R(θ, δJS) − R(θ, δa∗)] ≤ p − (p − 2)²/(||θ||² + p) − p||θ||²/(||θ||² + p)
= [p||θ||² + p² − p² + 4p − 4 − p||θ||²]/(||θ||² + p)
= 4(p − 1)/(||θ||² + p)
≤ 4(p − 1)/p ≤ 4,

and so

R(θ, δa∗) ≤ R(θ, δJS) ≤ R(θ, δa∗) + 4/p.

Additional work can get you to

R(θ, δa∗) ≤ R(θ, δJS) ≤ R(θ, δa∗) + 2/p.

For more on this, see Johnstone [2002] or Candes [2006].
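A sketch comparing the simulated risk gap to the 4/p bound, using the closed-form oracle risk:

    import numpy as np

    rng = np.random.default_rng(8)
    p = 10
    for norm2 in (0.0, 10.0, 100.0):
        theta = np.zeros(p)
        theta[0] = np.sqrt(norm2)
        X = rng.normal(theta, 1.0, size=(200_000, p))
        xx = np.sum(X**2, axis=1)
        risk_js = np.mean(np.sum(((1 - (p - 2) / xx)[:, None] * X - theta)**2,
                                 axis=1)) / p
        risk_oracle = norm2 / (norm2 + p)            # closed-form R(theta, a*X)
        print(norm2, risk_js - risk_oracle, 4 / p)   # the gap stays below 4/p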


7 Unknown variance or covariance

Suppose X ∼ Np(θ, σ²I), where σ² is known. Letting

• X̃ = X/σ and

• θ̃ = θ/σ,

we have X̃ ∼ Np(θ̃, I). The James-Stein estimator δ̃JS of θ̃ = θ/σ is then

δ̃(X̃) = (1 − (p − 2)/(X̃ · X̃))X̃
= (1 − σ²(p − 2)/(X · X))X/σ.

It seems natural then that the JSE of θ should be σ times the JSE of θ̃ = θ/σ, giving

δJS = (1 − σ²(p − 2)/(X · X))X.

Of course, often σ² is not known. Is there a version of the JSE in this case? Yes, if you have information about σ². Consider the following hierarchical sampling scheme:

Xi,j = θj + εi,j, i = 1, . . . , n, j = 1, . . . , p,
{εi,j} ∼ i.i.d. N(0, σ²).

Letting Xj = X̄·j = Σ_{i=1}^n Xi,j/n, we now basically have the situation described above.

Also note that the data contain information about σ² via the pooled sample sum of squares:

S = Σ_{j=1}^p Σ_{i=1}^n (Xi,j − X̄·j)² ∼ σ²χ²_{p(n−1)}.

Note further that X̄ and S are statistically independent. For this and similar situations, James and Stein [1961] considered estimators of the following form: let X ∼ Np(θ, σ²I) be independent of S ∼ σ²χ²_k, and define the estimator

δc(X, S) = (1 − cS/(X · X))X.


The value of c that minimizes the risk of δc for all θ is c = (p − 2)/(k + 2), resulting in the following estimator:

δJS(X, S) = (1 − [(p − 2)/(k + 2)] S/(X · X))X.

In particular, note that this estimator dominates X.
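A sketch of this estimator computed from replicated data. Note that here X̄j has variance σ²/n, so S/n (rather than S) is used so that the scales match; this rescaling is our reading of "we now basically have the situation described above", and all parameter values below are arbitrary.

    import numpy as np

    rng = np.random.default_rng(9)
    n, p, sigma2 = 5, 10, 4.0
    theta = rng.normal(0, 1.0, size=p)
    data = rng.normal(theta, np.sqrt(sigma2), size=(n, p))   # X_ij = theta_j + eps_ij

    Xbar = data.mean(axis=0)             # Xbar_j ~ N(theta_j, sigma^2/n)
    S = np.sum((data - Xbar)**2)         # pooled SS ~ sigma^2 chi^2_k, k = p(n-1)
    k = p * (n - 1)

    # (S/n)/(k+2) estimates sigma^2/n, the variance of each Xbar_j
    shrink = 1 - (p - 2) / (k + 2) * (S / n) / (Xbar @ Xbar)
    delta = shrink * Xbar                # James-Stein estimate of theta
    print(shrink)
    print(theta[:3], Xbar[:3], delta[:3])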

This result also generalizes to the correlated-data case: let X ∼ Np(θ, Σ) and S ∼ Wishart(Σ, n) be independent, and consider estimators of the form

δc(X, S) = (1 − c/(X^T S⁻¹ X))X.

James and Stein [1961] show that the estimator obtained by setting

c = (p − 2)/(n − p + 3)

minimizes the risk for all values of θ.

See Brown and Han for recent work on these and related problems, including exten-

sions to regression problems.

References

L. D. Brown and X. Han. Optimal estimation of multidimensional normal means with an unknown variance.

Emmanuel J. Candes. Modern statistical estimation via oracle inequalities. Acta Numer., 15:257–325, 2006. ISSN 0962-4929. doi: 10.1017/S0962492906230010. URL http://dx.doi.org.offcampus.lib.washington.edu/10.1017/S0962492906230010.

W. James and Charles Stein. Estimation with quadratic loss. In Proc. 4th Berkeley Sympos. Math. Statist. and Prob., Vol. I, pages 361–379. Univ. California Press, Berkeley, Calif., 1961.


I. M. Johnstone. Function estimation and Gaussian sequence models. Unpublished manuscript, 2002.

E. L. Lehmann and George Casella. Theory of point estimation. Springer Texts in Statistics. Springer-Verlag, New York, second edition, 1998. ISBN 0-387-98502-6.

Charles Stein. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, 1954–1955, vol. I, pages 197–206, Berkeley and Los Angeles, 1956. University of California Press.
