

DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION AND PRIOR SELECTION

By

VICTOR MERGEL

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2006


Copyright 2006

by

Victor Mergel


ACKNOWLEDGMENTS

I would like to express my sincere gratitude to my advisor, Dr. Malay Ghosh, for his support and professional guidance. Working with him was not only enjoyable but also a very valuable personal experience.

I would also like to thank Michael Daniels, Panos M. Pardalos, Brett Presnell, and Ronald Randles for their careful reading of and extensive comments on this dissertation.


TABLE OF CONTENTS

ACKNOWLEDGMENTS

ABSTRACT

CHAPTER

1 INTRODUCTION AND LITERATURE REVIEW

1.1 Statistical Decision Theory
1.2 Literature Review
1.2.1 Point Estimation of the Multivariate Normal Mean
1.2.2 Shrinkage towards Regression Surfaces
1.2.3 Baranchik Class of Estimators Dominating the Sample Mean
1.3 Shrinkage Predictive Distribution for the Multivariate Normal Density
1.3.1 Shrinkage of Predictive Distribution
1.3.2 Minimax Shrinkage towards Points or Subspaces
1.4 Prior Selection Methods and Shrinkage Argument
1.4.1 Prior Selection
1.4.2 Shrinkage Argument

2 ESTIMATION, PREDICTION AND THE STEIN PHENOMENON UNDER DIVERGENCE LOSS

2.1 Some Preliminary Results
2.2 Minimaxity Results
2.3 Admissibility for p = 1
2.4 Inadmissibility Results for p ≥ 3
2.5 Lindley's Estimator and Shrinkage to Regression Surface

3 POINT ESTIMATION UNDER DIVERGENCE LOSS WHEN VARIANCE-COVARIANCE MATRIX IS UNKNOWN

3.1 Preliminary Results
3.2 Inadmissibility Results when Variance-Covariance Matrix is Proportional to Identity Matrix
3.3 Unknown Positive Definite Variance-Covariance Matrix

4 REFERENCE PRIORS UNDER DIVERGENCE LOSS

4.1 First Order Reference Prior under Divergence Loss
4.2 Reference Prior Selection under Divergence Loss for One Parameter Exponential Family

5 SUMMARY AND FUTURE RESEARCH

5.1 Summary
5.2 Future Research

REFERENCES

BIOGRAPHICAL SKETCH


Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION AND PRIOR SELECTION

By

Victor Mergel

August 2006

Chair: Malay Ghosh
Major Department: Statistics

In this dissertation, we consider the following problems: (1) estimate a normal

mean under a general divergence loss and (2) find a predictive density of a new

observation drawn independently of the sampled observations from a normal

distribution with the same mean but possibly with a different variance under

the same loss. The general divergence loss includes as special cases both the

Kullback-Leibler and Bhattacharyya-Hellinger losses. The sample mean, which is a

Bayes estimator of the population mean under this loss and the improper uniform

prior, is shown to be minimax in any arbitrary dimension. A counterpart of this

result for the predictive density is also proved in any arbitrary dimension. The

admissibility of these rules holds in one dimension, and we conjecture that the

result is true in two dimensions as well. However, the general Baranchik class of

estimators, which includes the James-Stein estimator and the Strawderman class

of estimators, dominates the sample mean in three or higher dimensions for the

estimation problem. An analogous class of predictive densities is defined and any

member of this class is shown to dominate the predictive density corresponding

to a uniform prior in three or higher dimensions. For the prediction problem,


in the special case of Kullback-Leibler loss, our results complement to a certain

extent some of the recent important work of Komaki and George et al. While our

proposed approach produces a general class of empirical Bayes predictive densities

dominating the predictive density under a uniform prior, George et al. produce a

general class of Bayes predictors achieving a similar dominance. We show also that

various modifications of the James-Stein estimator continue to dominate the sample

mean, and by the duality of the estimation and predictive density results which we

will show, similar results continue to hold for the prediction problem as well. In

the last chapter we consider the problem of objective prior selection by maximizing

the distance between the prior and the posterior. We show that the reference prior

under divergence loss coincides with Jeffreys’ prior except in one special case.


CHAPTER 1
INTRODUCTION AND LITERATURE REVIEW

1.1 Statistical Decision Theory

Statistical Decision Theory primarily consists of three basic elements:

• the sample space X ,

• the parameter space Θ,

and

• the action space, A.

We assume that an unknown element θ ∈ Θ labels the otherwise known

distribution. We are concerned with inferential procedures for θ using the sampled

observations x (real or vector valued).

A decision rule δ is a function with domain space X , and a range space A.

Thus, for each x ∈ X we have an action a = δ(x) ∈ A. For every θ ∈ Θ and

δ(x) ∈ A, we incur a loss L(θ, δ(x)). The long-term average loss associated with δ is

the expectation Eθ[L(θ, δ(X))] and this expectation is called the risk function of δ

and will be denoted as R(θ, δ).

Since the risk function depends on the unknown parameter θ, it is very often impossible to find a decision rule that is optimal for every θ. Thus the statistician needs to restrict attention to classes of decision rules such as Bayes, minimax, and admissible rules.

The method required to solve the statistical problem at hand depends strongly on the parametric model considered (the class P = {Pθ, θ ∈ Ω} to which the distribution of X belongs), the structure of the decision space, and the choice of loss function.

The choice of the decision space often depends on the statistical problem at hand. For example, two-decision problems are used in testing of hypotheses, while for point estimation problems the decision space often coincides with the parameter space.


The choice of the loss function is up to the decision maker, and it is supposed to evaluate the penalty (or error) associated with the decision δ when the parameter takes the value θ. When the setting of an experiment is such that the loss function cannot be determined, the most common option is to resort to classical losses such as quadratic loss or absolute error loss. Sometimes the experimental setting is very uninformative and the decision maker may need to use an intrinsic loss such as the general divergence loss considered in this dissertation. This is discussed, for example, in [56].

In this dissertation, we mostly look at the point estimation problem of the multivariate normal mean under general divergence loss, and we also consider the prediction problem, where we are interested in estimating the density function f(x | θ) itself.

In multidimensional settings, for dimensions high enough, the best invariant

estimator is not always admissible. There often exists a class of estimators that

dominates the intuitive choice. For quadratic loss this effect was first discovered by

Stein [58]. In this dissertation, we consider the estimation and prediction problems

simultaneously under a broader class of losses to examine whether the Stein effect

continues to hold.

Since many results for estimation and prediction problems for the multivariate

normal distribution, as considered in this dissertation, seem to have some inherent

similarities with the parallel theory of estimating a multivariate normal mean under

the quadratic loss, we will begin with a literature review of known results.

1.2 Literature Review

1.2.1 Point Estimation of the Multivariate Normal Mean

Suppose X ∼ N(θ, Ip). For estimating the unknown normal mean θ under quadratic loss, the best equivariant estimator is X, the MLE, which is also the posterior mean under the improper uniform prior (see [56], pp. 429–431). Blyth [14] showed that this estimator is minimax and admissible when p = 1. Unfortunately,

this estimator may fail to be admissible in multidimensional problems. For

simultaneous estimation of p (≥ 2) normal means, this natural estimator is

admissible for p = 2, but it is inadmissible for p ≥ 3 for a wide class of losses.

This fact was first discovered by [58] for the sum of squared error losses, i.e. when

L(θ, a) = ‖θ − a‖². The inadmissibility result was later extended by Brown [17]

to a wider class of losses. For the sum of squared error losses, an explicit estimator

dominating the sample mean was proposed by James and Stein [42].

For estimating the multivariate normal mean, Stein [58] recommended using

"spherically symmetric estimators" of θ since, under the loss L(θ, a) = ‖θ − a‖²,

X is an admissible estimator of θ if and only if it is admissible in the class of

all spherically symmetric estimators. The definition of spherically symmetric

estimators is as follows.

Definition 1.2.1.1. An estimator δ(X) of θ is said to be spherically symmetric if

and only if δ(X) has the form δ(X) = h(‖X‖²)X.

Stein used this result and the Cramer-Rao inequality to prove admissibility

of X for p = 2. Later Brown and Hwang [19] provided a Blyth type argument for

proving the same result.

As mentioned earlier, X is a generalized Bayes estimator of θ ∈ Rp under the loss L(θ, a) = ‖θ − a‖² and the uniform prior. Stein [58] showed the existence of constants a and b such that (1 − b/(a + ‖X‖²))X dominates X for p ≥ 3. Later, James and Stein [42] produced the explicit estimator

δ(X) = (1 − (p − 2)/‖X‖²) X,

which dominates X for p ≥ 3.
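As a concrete illustration of the estimator above (added here, not part of the original text), the following minimal Python sketch computes the James-Stein estimator for a single observed p-vector and checks its squared-error risk against that of X by Monte Carlo; the function name and simulation settings are illustrative choices only.

```python
import numpy as np

def james_stein(x):
    """James-Stein estimator (1 - (p - 2)/||x||^2) x for a single p-vector x, p >= 3."""
    x = np.asarray(x, dtype=float)
    p = x.size
    if p < 3:
        raise ValueError("James-Stein shrinkage requires p >= 3")
    return (1.0 - (p - 2) / np.dot(x, x)) * x

# Monte Carlo check of risk dominance under squared error loss (X ~ N(theta, I_p))
rng = np.random.default_rng(0)
p, n_rep = 10, 20000
theta = np.ones(p)                                   # an arbitrary true mean
x = rng.normal(loc=theta, size=(n_rep, p))
js = np.array([james_stein(row) for row in x])
print("risk of X  :", np.mean(np.sum((x - theta) ** 2, axis=1)))   # close to p
print("risk of JS :", np.mean(np.sum((js - theta) ** 2, axis=1)))  # strictly smaller
```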

Efron and Morris [28] show how James-Stein estimators arise in an empirical

Bayes context. A good review of empirical Bayes (EB) and hierarchical Bayes (HB)


approaches can be found in [35]. As described in [8], an EB scenario is one in

which known relationships among the coordinates of a parameter vector allow use

of the data to estimate some features of the prior distribution. Both EB and HB

procedures recognize the uncertainty in the prior information. However, while the

EB method estimates the unknown prior parameters in some classical way like the

MLE or the method of moments from the marginal distributions (after integrating

θ out) of observations, the HB procedure models the prior distribution in stages.

To illustrate this, we begin with the following setup.

A Conditional on θ1, . . . , θp, let X1, . . . , Xp be independent with

Xi ∼ N(θi, σ2), i = 1, . . . , p, σ2(> 0) being known. Without loss of generality,

assume σ2 = 1.

B The θi’s have independent N(µi, A), i = 1, . . . , p priors.

The posterior distribution of θ given X = x is then

N((1−B)x + Bµ, (1−B)Ip),

where B = (1 + A)−1. The posterior mean (the usual Bayes estimate) of θ is given

by

E(θ|X = x) = (1−B)x + Bµ. (1–1)

Now consider the following three scenarios.

Case I. Let µ1 = . . . = µp = µ, where µ (real) is unknown, but A (> 0) is known.

Based on the marginal distribution of X, X̄ is the UMVUE, MLE and the best equivariant estimator of µ. Thus, using the EB approach, an EB estimator of θ is

θ̂_EB^(1) = (1 − B)X + B X̄ 1_p.  (1–2)

This estimator was proposed by Lindley and Smith (1972), but they used an HB approach. Their model was:


(i) conditional on θ and µ, X ∼ N(θ, Ip);

(ii) conditional on µ, θ ∼ N(µ1p, AIp);

(iii) µ is uniform on (−∞,∞).

Then the joint pdf of X, θ and µ is given by

f(x, θ, µ) ∝ exp[−(1/2)‖x − θ‖²] A^{−p/2} exp[−(1/(2A))‖θ − µ1_p‖²].  (1–3)

Thus the joint pdf of X and θ is

f(x, θ) ∝ exp[−(1/2)(θ^T Dθ − 2θ^T x + x^T x)],  (1–4)

and the posterior distribution of θ given X = x is N(D^{−1}x, D^{−1}), where D = A^{−1}[(A + 1)I_p − p^{−1}J_p].

Thus one gets

E(θ|X = x) = (1 − B)x + B x̄ 1_p ,  (1–5)

and

V(θ|X = x) = (1 − B)I_p + B p^{−1}J_p ,  (1–6)

which gives the same estimator as under the EB approach. But the EB approach ignores the uncertainty involved in estimating the prior parameters, and thus underestimates the posterior variance.

Lindley and Smith [51] have shown that the risk of θ̂_EB^(1) is not uniformly smaller than that of X under squared error loss. However, there is a Bayes risk superiority of θ̂_EB^(1) over X, as shown in the following theorem of Ghosh [35].

Theorem 1.2.1.2. Consider the model X|θ ∼ N(θ, I_p) and the prior θ ∼ N(µ1_p, AI_p). Let E denote expectation over the joint distribution of X and θ. Then, assuming the loss L1(θ, a) = (a − θ)(a − θ)^T, and writing θ̂_B as the Bayes estimator of θ under L1,

E L1(θ, X) = I_p ;  E L1(θ, θ̂_B) = (1 − B)I_p ;  (1–7)

E L1(θ, θ̂_EB^(1)) = (1 − B)I_p + B p^{−1}J_p .  (1–8)

Now assuming the quadratic loss L2(θ, a) = (a − θ)^T Q(a − θ), where Q is a known non-negative definite weight matrix,

E L2(θ, X) = tr(Q) ;  E L2(θ, θ̂_B) = (1 − B) tr(Q) ;  (1–9)

E L2(θ, θ̂_EB^(1)) = (1 − B) tr(Q) + B tr(Q p^{−1}J_p).  (1–10)

Case II. Assume that µ is known but its components need not be equal. Also assume A to be unknown. Now ‖X − µ‖² ∼ B^{−1}χ²_p is a complete sufficient statistic. Accordingly, for p ≥ 3, the UMVUE of B is given by (p − 2)/‖X − µ‖².

Substituting this estimator of B, an EB estimator of θ is given by

θ̂_EB^(2) = X − [(p − 2)/‖X − µ‖²](X − µ).  (1–11)

This estimator is known as the James-Stein estimator (see [42]). The EB interpretation of this estimator was given in a series of articles by Efron and Morris ([27], [28], [29]).

James and Stein have shown that for p ≥ 3, the risk of θ̂_EB^(2) is smaller than that of X under squared error loss. However, if the loss is changed to the arbitrary quadratic loss L2 of the previous theorem, then the risk dominance of θ̂_EB^(2) over X does not necessarily hold. The estimator θ̂_EB^(2) dominates X under the loss L2 ([8]; [15]) if

(i) tr(Q) > 2 ch1(Q), and

(ii) 0 < p − 2 < 2[tr(Q)/ch1(Q) − 2],

where ch1(Q) denotes the largest eigenvalue of Q.

The Bayes risk dominance, however, still holds, as follows from the next result (see [35]).


Theorem 1.2.1.3. Let X|θ ∼ N(θ, I_p) and θ ∼ N(µ1_p, AI_p). It follows that for p ≥ 3,

E[L1(θ, θ̂_EB^(2))] = I_p − B(p − 2)p^{−1}I_p ,  (1–12)

and

E[L2(θ, θ̂_EB^(2))] = tr(Q) − B(p − 2)p^{−1}tr(Q).  (1–13)

Consider the HB approach in this case with X ∼ N(θ, I_p), θ ∼ N(µ, AI_p), and A having the Type II beta density ∝ A^{m−1}(1 + A)^{−(m+n)}, with m > 0, n > 0.

Using the iterated formula for conditional expectations,

θ̂_HB^(2) ≡ E(θ|x) = E(E(θ|B, x)) = (1 − B̂)x + B̂µ,  (1–14)

where B = (A + 1)^{−1}, and

B̂ = ∫₀¹ B^{p/2+n}(1 − B)^{m−1} exp[−(B/2)‖x − µ‖²] dB ÷ ∫₀¹ B^{p/2+n−1}(1 − B)^{m−1} exp[−(B/2)‖x − µ‖²] dB.  (1–15)

Strawderman [60] considered the case m = 1, and found sufficient conditions on n under which the risk of θ̂_HB^(2) is smaller than that of X. His results were generalized by Faith [31].

When m = 1, the posterior mode of B is

B̂_MO = min((p + 2n − 2)/‖x − µ‖², 1).  (1–16)

This leads to the estimator

θ̂_HB^(3) = (1 − B̂_MO)X + B̂_MO µ  (1–17)

of θ. When n = 0, this estimator becomes the positive-part James-Stein estimator, which dominates the usual James-Stein estimator.
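The positive-part construction mentioned above can be sketched in a few lines of Python (an illustration added here, not taken from the original text); the function below uses the UMVUE-style constant p − 2, whereas the posterior-mode rule in (1–16) would use p + 2n − 2 instead.

```python
import numpy as np

def positive_part_james_stein(x, mu):
    """Positive-part James-Stein estimator shrinking x toward a known vector mu:
    the shrinkage factor is truncated at 1, i.e. B = min((p - 2)/||x - mu||^2, 1)."""
    x, mu = np.asarray(x, dtype=float), np.asarray(mu, dtype=float)
    p = x.size
    b = min((p - 2) / np.sum((x - mu) ** 2), 1.0)
    return x - b * (x - mu)
```

Truncating the shrinkage factor at 1 prevents the estimator from overshooting past µ, which is what makes the positive-part rule dominate the unrestricted James-Stein estimator.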


Case III. We consider the same model as in Case I, except that now µ and A > 0 are both unknown. In this case (X̄, Σ_{i=1}^p (Xi − X̄)²) is complete sufficient, so that the UMVUEs of µ and B are given by X̄ and (p − 3)/Σ_{i=1}^p (Xi − X̄)². The EB estimator of θ in this case is

θ̂_EB^(3) = X − [(p − 3)/Σ_{i=1}^p (Xi − X̄)²](X − X̄1_p).  (1–18)

This modification of the James-Stein estimator was proposed by Lindley [50]. Whereas the James-Stein estimator shrinks X toward a specified point, the above estimator shrinks X towards the hyperplane spanned by 1_p.

The estimator θ̂_EB^(3) is known to dominate X for p > 3. Ghosh [35] has found the Bayes risk of this estimator under the L1 and L2 losses.

Theorem 1.2.1.4. Assume the model and the prior given in Theorem 1.2.1.2. Then for p ≥ 4,

E[L1(θ, θ̂_EB^(3))] = I_p − B(p − 3)(p − 1)^{−1}(I_p − p^{−1}J_p) ,  (1–19)

and

E[L2(θ, θ̂_EB^(3))] = tr(Q) − B(p − 3)(p − 1)^{−1}tr[Q(I_p − p^{−1}J_p)].  (1–20)

To find the HB estimator of θ in this case consider the model where

(i) conditional on θ, µ and A, X ∼ N(θ, Ip);

(ii) conditional on µ and A, θ ∼ N(µ1p, AIp);

(iii) marginally µ and A are independently distributed with µ uniform on (−∞,∞),

and A has uniform improper pdf on (0,∞).

Under this model, as shown in [52],

E(θ|x) = x − E(B|x)(x − x̄1_p) ,  (1–21)

and

V(θ|x) = V(B|x)(x − x̄1_p)(x − x̄1_p)^T + I_p − E(B|x)(I_p − p^{−1}J_p),  (1–22)

where

E(B|x) = ∫₀¹ B^{(p−3)/2} exp[−(B/2)Σ_{i=1}^p (xi − x̄)²] dB ÷ ∫₀¹ B^{(p−5)/2} exp[−(B/2)Σ_{i=1}^p (xi − x̄)²] dB ,  (1–23)

and

E(B²|x) = ∫₀¹ B^{(p−1)/2} exp[−(B/2)Σ_{i=1}^p (xi − x̄)²] dB ÷ ∫₀¹ B^{(p−5)/2} exp[−(B/2)Σ_{i=1}^p (xi − x̄)²] dB .  (1–24)

Also, one can obtain a positive-part version of Lindley's estimator by substituting the posterior mode of B, namely min((p − 5)/Σ_{i=1}^p (Xi − X̄)², 1), in (1–1). Morris (1981) suggests approximations to E(B|x) and E(B²|x) involving replacement of ∫₀¹ by ∫₀^∞ both in the numerator and in the denominator of (1–23) and (1–24), leading to the following approximations:

E(B|x) ≈ (p − 3)/Σ_{i=1}^p (xi − x̄)²

and

E(B²|x) ≈ (p − 1)(p − 3)/[Σ_{i=1}^p (xi − x̄)²]² ,

so that

V(B|x) ≈ 2(p − 3)/[Σ_{i=1}^p (xi − x̄)²]² .


Morris [52] points out that the above approximations amount to putting a uniform prior on A over (−1, ∞), which gives the approximation

E(θ|x) ≈ X − [(p − 3)/Σ_{i=1}^p (Xi − X̄)²](X − X̄1_p) ,  (1–25)

which is Lindley's modification of the James-Stein estimator, with

V(θ|x) ≈ [2(p − 3)/(Σ_{i=1}^p (Xi − X̄)²)²](X − X̄1_p)(X − X̄1_p)^T + I_p − [(p − 3)/Σ_{i=1}^p (Xi − X̄)²](I_p − p^{−1}J_p).  (1–26)
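The following short Python sketch (added for illustration, not from the original text) implements Lindley's modification (1–18)/(1–25), shrinking the observation vector toward the subspace spanned by 1_p.

```python
import numpy as np

def lindley_estimator(x):
    """Lindley's modification of the James-Stein estimator:
    x - [(p - 3) / sum_i (x_i - xbar)^2] * (x - xbar * 1_p), intended for p >= 4."""
    x = np.asarray(x, dtype=float)
    p = x.size
    xbar = x.mean()
    s = np.sum((x - xbar) ** 2)
    return x - ((p - 3) / s) * (x - xbar)
```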

1.2.2 Shrinkage towards Regression Surfaces

In the previous section, the sample mean was shrunk towards a point or a subspace spanned by the vector 1_p. Ghosh [35] synthesized EB and HB methods to shrink the sample mean towards an arbitrary regression surface. The HB approach was discussed in detail in [51] with known variance components, while the EB procedure was discussed in [53].

The setup considered in [35] is as follows.

I. Conditional on θ, b and A, let X1, . . . , Xp be independent with Xi ∼ N(θi, Vi), i = 1, . . . , p, where the Vi's are known positive constants;

II. Conditional on b and A, the θi's are independently distributed with θi ∼ N(z_i^T b, A), i = 1, . . . , p, where z1, . . . , zp are known regression vectors of dimension r and b is r × 1;

III. b and A are marginally independent with b ∼ uniform(R^r) and A ∼ uniform(0, ∞). We assume that p ≥ r + 3. Also, we write Z^T = (z1, . . . , zp), G = Diag(V1, . . . , Vp), and assume rank(Z) = r.


As shown in [35], under this model

E(θi|x) = E[(1 − Ui)xi + Ui z_i^T b̂ | x] ;  (1–27)

V(θi|x) = V[Ui(xi − z_i^T b̂) | x] + E[Vi(1 − Ui) + Vi Ui(1 − Ui) z_i^T (Z^T DZ)^{−1} z_i | x] ;  (1–28)

Cov(θi, θj|x) = Cov[Ui(xi − z_i^T b̂), Uj(xj − z_j^T b̂) | x] + E[A Ui Uj z_i^T (Z^T DZ)^{−1} z_j | x] ,  (1–29)

where Ui = Vi/(A + Vi), i = 1, . . . , p, b̂ = (Z^T DZ)^{−1}(Z^T Dx), and D = Diag(1 − U1, . . . , 1 − Up).

Morris [53] has approximated E[θi|x] by xi − ûi(xi − z_i^T b̂), and V[θi|x] by v̂i(xi − z_i^T b̂)² + Vi(1 − ûi)[1 + ûi z_i^T (Z^T D̂Z)^{−1} z_i], i = 1, . . . , p. In the above,

v̂i = [2/(p − r − 2)] û_i² (V̄ + Â) ÷ (Vi + Â), i = 1, . . . , p,

V̄ = Σ_{i=1}^p Vi(Vi + Â)^{−1} ÷ Σ_{i=1}^p (Vi + Â)^{−1}, D̂ = Diag(1 − û1, . . . , 1 − ûp), and b̂ is obtained by substituting the estimator of A. The v̂i's are purported to estimate the V(Ui|x)'s.

When V1 = . . . = Vp = V, with u1 = . . . = up = V/(A + V) = u, D = (1 − u)I_p, Z^T DZ = (1 − u)Z^T Z, and b̂ = (Z^T Z)^{−1}Z^T x = b, the following result holds:

E(θi|x) = xi − E(U|x)(xi − z_i^T b) ,  (1–30)

and

V(θi|x) = V(U|x)(xi − z_i^T b)² + V − V E(U|x)(1 − z_i^T (Z^T Z)^{−1} z_i) .  (1–31)


If one adopts Morris's approximations, then one estimates E(U|x) by Û = V(p − r − 2)/SSE and V(U|x) by [2/(p − r − 2)]Û², where SSE = Σ_{i=1}^p x_i² − (Σ_{i=1}^p xi zi)^T (Z^T Z)^{−1}(Σ_{i=1}^p xi zi).
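A minimal sketch of this equal-variance shrinkage toward a regression surface is given below (illustrative code added here, not part of the original text); the truncation of the estimated shrinkage factor at 1 is an extra positive-part safeguard, and the function and variable names are hypothetical.

```python
import numpy as np

def shrink_to_regression(x, Z, V):
    """EB shrinkage toward a regression surface when V_1 = ... = V_p = V,
    using the moment estimate U-hat = V (p - r - 2) / SSE from Section 1.2.2."""
    x = np.asarray(x, dtype=float)
    Z = np.asarray(Z, dtype=float)
    p, r = Z.shape
    b_hat = np.linalg.solve(Z.T @ Z, Z.T @ x)       # least-squares fit (Z'Z)^{-1} Z'x
    resid = x - Z @ b_hat
    sse = np.sum(resid ** 2)                        # = sum x_i^2 - (sum x_i z_i)'(Z'Z)^{-1}(sum x_i z_i)
    u_hat = min(V * (p - r - 2) / sse, 1.0)         # estimate of U = V/(A + V), truncated at 1
    return x - u_hat * resid                        # approximate posterior mean of theta
```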

1.2.3 Baranchik Class of Estimators Dominating the Sample Mean

In some situations, when prior information is available and one believes that the unknown θ is close to some known vector µ, it makes more sense to shrink the estimator X towards µ instead of 0. In such cases Baranchik [5] proposed using a more general class of shrinkage minimax estimators dominating X. Let S = Σ_{i=1}^p (Xi − µi)² and φi(X) = −[τ(S)/S](Xi − µi); then the estimator X + φ(X) will dominate X under the quadratic loss function if the following conditions hold:

(i) 0 < τ(S) < 2(p − 2),

(ii) τ(S) is nondecreasing in S and differentiable in S.

Efron and Morris [30] have slightly widened Baranchik's class of minimax estimators. They proved that the following conditions guarantee that the estimator X + φ(X) dominates X under the quadratic loss function:

(i) 0 < τ(S) < 2(p − 2), p > 2,

(ii) τ(S) is differentiable in S, and

(iii) u(S) = S^{(p−2)/2} τ(S)/[2(p − 2) − τ(S)] is increasing in S.

Thus the Baranchik class of estimators dominates the best equivariant estimator. The natural question was whether that class had a subclass of admissible estimators.

Strawderman [60] showed that there exists a subclass of the Baranchik class of estimators which is proper Bayes with respect to the following class of two-stage priors. The prior distribution for θ is constructed as follows.


Conditional on A = a, θ ∼ N(0, aI_p), while A itself has pdf

g(a) = δ(1 + a)^{−1−δ}, a ≥ 0, δ > 0.

Under this two-stage prior, the Bayes estimator of θ has the Baranchik form

(1 − τ(S)/S)X

with

τ(S) = p + 2δ − 2 exp(−S/2) ÷ ∫₀¹ λ^{p/2+δ−1} exp(−λS/2) dλ,

and conditions (i)–(iii) hold for p > 4. When p = 5 and 0 < δ < 1/2, we get a class of proper Bayes, and thus admissible, estimators dominating X. When p > 5, choosing 0 < δ < 1 leads to a proper Bayes class of estimators dominating X.

For p = 3 and 4, Strawderman [61] showed that there do not exist any proper Bayes estimators dominating X.
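For a numerical illustration of the Strawderman-type rule (code added here, not part of the original text), τ(S) can be evaluated by one-dimensional quadrature and plugged into the Baranchik form; this is a sketch under the formula displayed above, with illustrative function names.

```python
import numpy as np
from scipy.integrate import quad

def strawderman_tau(s, p, delta):
    """tau(S) = p + 2*delta - 2*exp(-S/2) / integral_0^1 lam^(p/2+delta-1) exp(-lam*S/2) dlam."""
    integral, _ = quad(lambda lam: lam ** (p / 2 + delta - 1) * np.exp(-lam * s / 2), 0.0, 1.0)
    return p + 2 * delta - 2 * np.exp(-s / 2) / integral

def strawderman_estimator(x, delta):
    """Proper Bayes (Baranchik-form) estimator (1 - tau(S)/S) X under the two-stage prior."""
    x = np.asarray(x, dtype=float)
    s = float(np.dot(x, x))
    return (1.0 - strawderman_tau(s, x.size, delta) / s) * x

# Example: p = 5 observations, delta = 0.4 (within the range 0 < delta < 1/2 quoted above)
print(strawderman_estimator(np.array([1.0, -0.5, 2.0, 0.3, -1.2]), delta=0.4))
```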

1.3 Shrinkage Predictive Distribution for the Multivariate Normal Density

1.3.1 Shrinkage of Predictive Distribution

When fitting a parametric model or estimating a parametric density function, the two most common approaches are to estimate the unknown parameter first and then plug this estimate into the density, or to estimate the predictive density directly without estimating the unknown parameter. The first (plug-in) method often fails to provide a good overall fit to the unknown density even when the parameter estimator itself has optimal properties. The second method tends to produce density estimators with smaller risk than those constructed by the plug-in method.

To evaluate the goodness of fit of a predictive distribution p(y|x) (where x is the observed random vector) to the unknown p(y|θ), the most often used measure of divergence is the Kullback-Leibler [46] directed measure of divergence

L(θ, p(y|x)) = ∫ p(y|θ) log[p(y|θ)/p(y|x)] dy,  (1–32)

which is nonnegative, and is zero if and only if p(y|x) coincides with p(y|θ). Then the average loss, or risk function, of the predictive distribution p(y|x) can be defined as follows:

R_KL(θ, p) = ∫ p(x|θ) L(θ, p(y|x)) dx,  (1–33)

and under a (possibly improper) prior distribution π on θ, the Bayes risk is

r(π, p) = ∫ R_KL(θ, p) π(θ) dθ.  (1–34)

As shown in Aitchison [1], the Bayes predictive density under the prior π is given by

pπ(y|x) = ∫ p(x|θ) p(y|θ) π(θ) dθ ÷ ∫ p(x|θ) π(θ) dθ,  (1–35)

and this density is superior to any p(y|x) as a fit to the class of models.

Let X|θ ∼ N(θ, v_x I_p) and Y|θ ∼ N(θ, v_y I_p) be independent p-dimensional multivariate normal vectors with common unknown mean θ and known variances v_x, v_y. As shown by Murray [54] and Ng [55], the best invariant predictive density in this situation is the constant-risk Bayes rule under the uniform prior π_U(θ) = 1, which can be written as

p_U(y|x) = {2π(v_x + v_y)}^{−p/2} exp[−‖y − x‖²/(2(v_x + v_y))].  (1–36)
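For reference, the best invariant predictive density (1–36) can be coded directly; the snippet below is an added illustration (not from the original text), with vx and vy the known variances defined above.

```python
import numpy as np

def p_U(y, x, vx, vy):
    """Best invariant (uniform-prior) predictive density (1-36):
    (2*pi*(vx + vy))^(-p/2) * exp(-||y - x||^2 / (2*(vx + vy)))."""
    y, x = np.asarray(y, dtype=float), np.asarray(x, dtype=float)
    p = x.size
    v = vx + vy
    return (2.0 * np.pi * v) ** (-p / 2) * np.exp(-np.sum((y - x) ** 2) / (2.0 * v))
```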

Although invariance is a frequently used restriction, for point estimation the best invariant estimator θ̂ = X of θ is not admissible if the dimension of the problem is p ≥ 3. It is known that the James-Stein estimator dominates the best invariant estimator θ̂. Komaki [45] showed that the same effect holds for the prediction problem. The best predictive density p_U(y|x), which is invariant under the translation group, is not admissible when p ≥ 3, and is dominated by the predictive density under the Stein harmonic prior (see [59])

π_H(θ) ∝ ‖θ‖^{−(p−2)}.  (1–37)

Under this prior, the Bayesian predictive density is given by Komaki [45] as

p_H(y|x) = (v_x/v_y + 1)^{(p−2)/2} × [φ_p(‖(v_y^{−1}y + v_x^{−1}x)(v_y^{−1} + v_x^{−1})^{−1/2}‖) ÷ φ_p(‖v_x^{−1/2}x‖)] × {2π(v_x + v_y)}^{−p/2} exp[−‖y − x‖²/(2(v_x + v_y))],  (1–38)

where

φ_p(u) = u^{−p+2} ∫₀^{u²/2} v^{p/2−1} exp(−v) dv.  (1–39)
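The function φ_p in (1–39) is an incomplete-gamma integral, so (1–38) can be evaluated numerically as in the following sketch (added illustration, not part of the original text); it assumes SciPy's regularized lower incomplete gamma function `scipy.special.gammainc`, and the helper names are hypothetical.

```python
import numpy as np
from scipy.special import gamma, gammainc

def phi_p(u, p):
    """phi_p(u) = u^(-(p-2)) * integral_0^{u^2/2} v^(p/2-1) e^(-v) dv  (eq. 1-39),
    written via the regularized lower incomplete gamma function P(p/2, u^2/2)."""
    return u ** (-(p - 2)) * gamma(p / 2) * gammainc(p / 2, u ** 2 / 2)

def p_H(y, x, vx, vy):
    """Harmonic-prior predictive density (eq. 1-38), for p >= 3."""
    y, x = np.asarray(y, dtype=float), np.asarray(x, dtype=float)
    p = x.size
    w = (y / vy + x / vx) / np.sqrt(1.0 / vy + 1.0 / vx)     # argument of phi_p in the numerator
    base = (2.0 * np.pi * (vx + vy)) ** (-p / 2) * np.exp(-np.sum((y - x) ** 2) / (2.0 * (vx + vy)))
    ratio = phi_p(np.linalg.norm(w), p) / phi_p(np.linalg.norm(x / np.sqrt(vx)), p)
    return (vx / vy + 1.0) ** ((p - 2) / 2) * ratio * base
```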

The harmonic prior is a special case of the Strawderman class of priors given by

π_S(θ) ∝ ∫₀^∞ a^{−p/2} exp(−‖θ‖²/(2a)) a^{−δ−1} da ∝ ‖θ‖^{−p−2δ}

with δ = −1. More recently, Liang [49] showed that p_U(y|x) is dominated by the proper Bayes rule p_a(y|x) under the Strawderman prior π_a(θ):

θ|s ∼ N(0, s v_0 I_p), s ∼ (1 + s)^{a−2},  (1–40)

when v_x ≤ v_0, p = 5 and a ∈ [.5, 1), or p ≥ 6 and a ∈ [0, 1). When a = 2, it is well known that π_H(θ) is a special case of π_a(θ).

As shown in George, Liang and Xu [34], this result closely parallels some key developments concerning minimax estimation of a multivariate normal mean under quadratic loss. As shown in Brown [17], any Bayes rule θ̂_π = E(θ|X) under quadratic loss has the form

θ̂_π = X + ∇ log m_π(X).  (1–41)

A similar representation was proved by George, Liang and Xu [34] for the predictive density under the KL loss:

p_π(y|x) = p_U(y|x) m_π(w; v_w)/m_π(x; v_x),  (1–43)

where

W = (v_y X + v_x Y)/(v_x + v_y) = (v_x^{−1}X + v_y^{−1}Y)/(v_x^{−1} + v_y^{−1}),  (1–44)

v_w = v_x v_y/(v_x + v_y),  (1–45)

and m_π(w; v_w) is the marginal distribution of W.

Now we present the domination result of George et al. under the KL loss.

Theorem 1.3.1.1. For Z|θ ∼ N(θ, vI_p) and a given prior π on θ, let m_π(z; v) be the marginal distribution of Z. If m_π(z; v) is finite for all z, then p_π(y|x) will dominate p_U(y|x) if any one of the following conditions holds for all v ≤ v_x:

(i) (∂/∂v) E_{θ,v} log m_π(Z; v) < 0,

(ii) (∂/∂v) m_π(√v z + θ; v) ≤ 0 for all θ, with strict inequality on some interval A,

(iii) √(m_π(z; v)) is superharmonic, with strict inequality on some interval A,

(iv) m_π(z; v) is superharmonic, with strict inequality on some interval A, or

(v) π(θ) is superharmonic.

From the previous theorem and the minimaxity of p_U(y|x) under the KL loss, the minimaxity of p_π(y|x) follows.

George et al. [34] also proved the following theorem, which is similar to Theorem 1 of Fourdrinier et al. [33].

Theorem 1.3.1.2. If h is a positive function such that

(i) −(s + 1)h′(s)/h(s) can be decomposed as l1(s) + l2(s), where l1 ≤ A is nondecreasing while 0 < l2 ≤ B with (1/2)A + B ≤ (p − 2)/4, and

(ii) lim_{s→∞} h(s)/(s + 1)^{p/2} = 0,

then

(i) √(m_π(z; v)) is superharmonic for all v ≤ v_0, and

(ii) the Bayes rule p_h(y|x) under the prior π_h(θ) dominates p_U(y|x) and is minimax when v ≤ v_0.

1.3.2 Minimax Shrinkage towards Points or Subspaces

When a prior distribution is centered around 0, minimax Bayes rules p_π(y|x) yield the most risk reduction when θ is close to 0 (see [34]). Recentering the prior π(θ) around any b ∈ R^p results in π_b(θ) = π(θ − b). The marginal m_π^b corresponding to π_b can be obtained directly by recentering the marginal m_π:

m_π^b(z; v) = m_π(z − b; v).  (1–46)

Such recentered marginals yield the predictive distributions

p_π^b(y|x) = p_U(y|x) m_π^b(w; v_w)/m_π^b(x; v_x).  (1–47)

More generally, in order to recenter a prior π(θ) around a (possibly affine) subspace B ⊂ R^p, George et al. [34] considered only priors spherically symmetric in θ, recentered as

π_B(θ) = π(θ − P_B θ),  (1–48)

where P_B θ = argmin_{b∈B} ‖θ − b‖ is the projection of θ onto B. Note that the dimension of θ − P_B θ must be taken into account when considering π. Thus, for example, recentering the harmonic prior π_H(θ) = ‖θ‖^{−(p−2)} around the subspace spanned by 1_p yields

π_H^B(θ) = ‖θ − θ̄1_p‖^{−(p−3)}.  (1–49)

Recentered priors yield the predictive distributions

p_π^B(y|x) = p_U(y|x) m_π^B(w; v_w)/m_π^B(x; v_x).  (1–50)

George et al. [34] also considered multiple shrinkage prediction. Using the mixture prior

π*(θ) = Σ_{i=1}^N w_i π_{B_i}(θ)  (1–51)

leads to the predictive distribution

p*(y|x) = p_U(y|x) [Σ_{i=1}^N w_i m_π^{B_i}(w; v_w)] ÷ [Σ_{i=1}^N w_i m_π^{B_i}(x; v_x)].  (1–52)

1.4 Prior Selection Methods and Shrinkage Argument

1.4.1 Prior Selection

Since Bayes [6], and later since Fisher [32], the idea of Bayesian inference has been debated. The cornerstone of Bayesian analysis, namely prior selection, has been criticized for arbitrariness and for the overwhelming difficulty of choosing a prior. From the very beginning, when Laplace proposed the uniform prior as a noninformative prior, its inconsistencies were found, generating further criticism. This gave way to new ideas such as that of Jeffreys [44], who proposed a prior which remains invariant under any one-to-one reparametrization. Jeffreys' prior was derived as the positive square root of the Fisher information matrix. However, this prior was not an ideal one in the presence of nuisance parameters. Bernardo [12] noticed that Jeffreys' prior can lead to the marginalization paradox (see Dawid et al. [26]) for inferences about µ/σ when the model is normal with mean µ and variance σ². These inconsistencies led Bernardo [12], and later Berger and Bernardo ([9], [10], [11]), to propose uninformative priors known as "reference" priors. Two basic ideas were used by Bernardo to construct his prior: the idea of missing information and a stepwise procedure to deal with nuisance parameters. Without any nuisance parameters, Bernardo's prior is identical to Jeffreys' prior. The missing-information idea makes one choose the prior that is furthest, in terms of Kullback-Leibler distance, from the posterior under this prior, and thus allows the observed sample to change the prior the most.

Another class of reference priors is obtained using the invariance principle, which is attributed to Laplace's principle of insufficient reason. Indeed, the simplest example of invariance involves permutations on a finite set. The only invariant distribution in this case is the uniform distribution over this set. Laplace's idea was generalized as follows. Consider a random variable X from a family of distributions parameterized by θ. If there exists a group of transformations h_a such that the distribution of Y = h_a(X) belongs to the same family with corresponding parameter h_a(θ), then we want the prior distribution for the parameter θ to be invariant under this group of transformations. A good description of this approach is given in Jaynes [43], Hartigan [37] and Dawid [25].

A somewhat different criterion is based on matching the posterior coverage probability of a Bayesian credible set with the corresponding frequentist coverage probability. Most often, matching is accomplished through (a) posterior quantiles, (b) highest posterior density (HPD) regions, or (c) inversion of test statistics.

In this dissertation we will find uninformative priors by asymptotically maximizing the divergence between the prior and the corresponding posterior distribution. To develop the asymptotic expansions we will use the so-called "shrinkage argument" suggested by J.K. Ghosh [36]. This method is particularly suitable for carrying out the asymptotics, and avoids the calculation of multivariate cumulants inherent in any multidimensional Edgeworth expansion.


1.4.2 Shrinkage Argument

We follow the description of Datta and Mukerjee [24] to explain the shrinkage

argument.

Consider a possibly vector-valued random variable X with a probability density function p(x; θ), where θ ∈ R^p or some open subset thereof. We need to find an asymptotic expansion for E_θ[h(X; θ)], where h is a jointly measurable function. The following steps describe the Bayesian approach towards the evaluation of E_θ[h(X; θ)].

Step 1: Consider a proper prior density π(θ) for θ such that the support of π(θ) is a compact rectangle in the parameter space, and π(θ) vanishes on the boundary of the support while being positive in the interior. Under this prior one obtains the posterior expectation Eπ[h(X; θ)|X].

Step 2: In this step one finds the following expectation, for θ in the interior of the support of π:

λ(θ) = E_θ[Eπ[h(X; θ)|X]].

Step 3: Integrate λ(θ) with respect to π(θ) and then let π(θ) converge to the prior degenerate at the true value of θ, supposing that the true value of θ is an interior point of the support of π(θ). This yields E_θ[h(X; θ)].

The rationale behind this process, assuming integrability of h(X; θ) with respect to the joint probability measure, is as follows.

Note that the posterior density of θ under the prior π(θ) is given by

p(X; θ)π(θ)/m(X),

where

m(X) = ∫ p(X; θ)π(θ) dθ.

Hence, in Step 1, we get

Eπ[h(X; θ)|X] = K(X)/m(X),

where

K(X) = ∫ h(X; θ)p(X; θ)π(θ) dθ.

Step 2 yields

λ(θ) = ∫ [K(x)/m(x)] p(x; θ) dx.

In Step 3 one gets

∫ λ(θ)π(θ) dθ = ∫∫ [K(x)/m(x)] p(x; θ)π(θ) dθ dx = ∫ [K(x)/m(x)] [∫ p(x; θ)π(θ) dθ] dx = ∫ K(x) dx = ∫∫ h(x; θ)p(x; θ)π(θ) dθ dx = ∫ E_θ[h(X; θ)] π(θ) dθ.

The last integral gives the desired expectation when π(θ) converges to the prior degenerate at the true value of θ.


CHAPTER 2
ESTIMATION, PREDICTION AND THE STEIN PHENOMENON UNDER DIVERGENCE LOSS

2.1 Some Preliminary Results

We will start this section with the definition of the divergence loss. Among others, we refer to Amari [2] and Cressie and Read [23]. This loss is given by

L_β(θ, a) = [1 − ∫ p^{1−β}(x|θ) p^{β}(x|a) dx] ÷ [β(1 − β)].  (2–1)

This loss is to be interpreted as its limit when β → 0 or β → 1. The KL loss obtains in these two limiting situations. For β = 1/2, the divergence loss is 4 times the BH (Bhattacharyya-Hellinger) loss. Throughout this dissertation, we perform the calculations with β ∈ (0, 1), and pass to the endpoints only in the limit when needed.

Let X and Y be conditionally independent given θ with corresponding

pdf’s p(x|θ) and p(y|θ). We begin with a general expression for the predictive

density of Y based on X under the divergence loss and a prior pdf π(θ), possibly

improper. Under the KL loss and the prior pdf π(θ), the predictive density of Y is

given by

πKL(y|x) =

∫p(y|θ)π(θ|x) dθ,

where π(θ|x) is the posterior of θ based on X = x (see [1]). The predictive density

is proper if and only if the posterior pdf is proper. We now provide a similar result

based on the general divergence loss which includes the previous result of Aitchison

as a special case when β → 0.


Lemma 2.1.0.1. Under the divergence loss and the prior π, the Bayes predictive density of Y is given by

π_D(y|x) = k^{1/(1−β)}(y, x) ÷ ∫ k^{1/(1−β)}(y, x) dy ,  (2–2)

where k(y, x) = ∫ p^{1−β}(y|θ) π(θ|x) dθ.

Proof of Lemma 2.1.0.1. Under the divergence loss, the posterior risk of predicting p(y|θ) by a pdf p(y|x) is β^{−1}(1 − β)^{−1} times

1 − ∫ [∫ p^{1−β}(y|θ) p^{β}(y|x) dy] π(θ|x) dθ = 1 − ∫ p^{β}(y|x) [∫ p^{1−β}(y|θ) π(θ|x) dθ] dy = 1 − ∫ k(y, x) p^{β}(y|x) dy.  (2–3)

An application of Hölder's inequality now shows that the integral in (2–3) is maximized at p(y|x) ∝ k^{1/(1−β)}(y, x). Again by the same inequality, the denominator of (2–2) is finite provided the posterior pdf is proper. This leads to the result, noting that π_D(y|x) has to be a pdf. □

The next lemma, to be used repeatedly in the sequel, provides an expression

for the integral of the product of two normal densities each raised to a certain

power.

Lemma 2.1.0.2. Let N_p(x|µ, Σ) denote the pdf of a p-variate normal random variable with mean vector µ and positive definite variance-covariance matrix Σ. Then for α1 > 0, α2 > 0,

∫ [N_p(x|µ1, Σ1)]^{α1} [N_p(x|µ2, Σ2)]^{α2} dx = (2π)^{p(1−α1−α2)/2} |Σ1|^{(1−α1)/2} |Σ2|^{(1−α2)/2} |α1Σ2 + α2Σ1|^{−1/2} × exp[−(α1α2/2)(µ1 − µ2)^T (α1Σ2 + α2Σ1)^{−1}(µ1 − µ2)].  (2–4)
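Because Lemma 2.1.0.2 is used repeatedly below, a quick one-dimensional numerical check may be helpful; the following sketch (added here, not part of the original text) compares the left and right sides of (2–4) for p = 1 using numerical quadrature, with arbitrarily chosen constants.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# One-dimensional check of Lemma 2.1.0.2 (p = 1, scalar variances s1, s2).
a1, a2 = 0.7, 0.3            # alpha_1, alpha_2 > 0
m1, m2 = 0.5, 2.0            # mu_1, mu_2
s1, s2 = 1.3, 0.8            # Sigma_1, Sigma_2

lhs, _ = quad(lambda x: norm.pdf(x, m1, np.sqrt(s1)) ** a1 * norm.pdf(x, m2, np.sqrt(s2)) ** a2,
              -np.inf, np.inf)
rhs = ((2 * np.pi) ** (0.5 * (1 - a1 - a2))
       * s1 ** (0.5 * (1 - a1)) * s2 ** (0.5 * (1 - a2))
       * (a1 * s2 + a2 * s1) ** (-0.5)
       * np.exp(-0.5 * a1 * a2 * (m1 - m2) ** 2 / (a1 * s2 + a2 * s1)))
print(lhs, rhs)   # the two values agree to numerical precision
```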


Proof of Lemma 2.1.0.2. Writing H = α1Σ1^{−1} + α2Σ2^{−1} and g = H^{−1}(α1Σ1^{−1}µ1 + α2Σ2^{−1}µ2), it follows after some simplification that

∫ [N_p(x|µ1, Σ1)]^{α1} [N_p(x|µ2, Σ2)]^{α2} dx
= (2π)^{−p(α1+α2)/2} |Σ1|^{−α1/2} |Σ2|^{−α2/2} ∫ exp[−(1/2)(x − g)^T H(x − g) − (1/2){α1(µ1^T Σ1^{−1}µ1) + α2(µ2^T Σ2^{−1}µ2) − g^T Hg}] dx
= (2π)^{p(1−α1−α2)/2} |Σ1|^{−α1/2} |Σ2|^{−α2/2} |H|^{−1/2} × exp[−(1/2){α1(µ1^T Σ1^{−1}µ1) + α2(µ2^T Σ2^{−1}µ2) − g^T Hg}].  (2–5)

It can be checked that

α1(µ1^T Σ1^{−1}µ1) + α2(µ2^T Σ2^{−1}µ2) − g^T Hg = α1α2(µ1 − µ2)^T (α1Σ2 + α2Σ1)^{−1}(µ1 − µ2),  (2–6)

and

|H|^{−1/2} = |Σ1|^{1/2}|Σ2|^{1/2}|α1Σ2 + α2Σ1|^{−1/2}.  (2–7)

Then by (2–6) and (2–7), the right-hand side of (2–5) equals

(2π)^{p(1−α1−α2)/2}|Σ1|^{(1−α1)/2}|Σ2|^{(1−α2)/2}|α1Σ2 + α2Σ1|^{−1/2} × exp[−(α1α2/2)(µ1 − µ2)^T (α1Σ2 + α2Σ1)^{−1}(µ1 − µ2)].

This proves the lemma. □

The above results are now used to obtain the Bayes estimator of θ and the Bayes predictive density of a future Y ∼ N(θ, σ_y²I_p) under the general divergence loss and the N(µ, AI_p) prior for θ. We continue to assume that, conditional on θ, X ∼ N(θ, σ_x²I_p), where σ_x² > 0 is known. The Bayes estimator of θ is obtained by minimizing

1 − ∫ exp[−(β(1−β)/(2σ_x²))‖θ − a‖²] N(θ|(1−B)X + Bµ, σ_x²(1−B)I_p) dθ

with respect to a, where B = σ_x²(σ_x² + A)^{−1}. By Lemma 2.1.0.2,

∫ exp[−(β(1−β)/(2σ_x²))‖θ − a‖²] N(θ|(1−B)X + Bµ, σ_x²(1−B)I_p) dθ
= (2πσ_x²/(β(1−β)))^{p/2} ∫ N(θ|a, σ_x²β^{−1}(1−β)^{−1}I_p) N(θ|(1−B)X + Bµ, σ_x²(1−B)I_p) dθ
∝ exp[−‖a − (1−B)X − Bµ‖² ÷ {2σ_x²(β^{−1}(1−β)^{−1} + 1 − B)}],  (2–8)

which is maximized with respect to a at (1−B)X + Bµ. Hence, the Bayes estimator of θ under the N(µ, AI_p) prior and the general divergence loss is (1−B)X + Bµ, the posterior mean. Also, by Lemma 2.1.0.2, the Bayes predictive density under the divergence loss is given by

π_D(y|X) ∝ [∫ N^{1−β}(θ|y, σ_y²I_p) N(θ|(1−B)X + Bµ, σ_x²(1−B)I_p) dθ]^{1/(1−β)} ∝ N(y|(1−B)X + Bµ, {σ_x²(1−B)(1−β) + σ_y²}I_p).

In the limiting (B → 0) case, i.e., under the uniform prior π(θ) = 1, the Bayes estimator of θ is X, and the Bayes predictive density of Y is N(y|X, {σ_x²(1−β) + σ_y²}I_p).

It may be noted that, by Lemma 2.1.0.2 with α1 = 1 − β and α2 = β, the divergence loss for the plug-in predictive density N(y|X, σ_x²I_p), which we denote by δ0, is

L(θ, δ0) = (1/(β(1−β)))[1 − ∫ N^{1−β}(y|θ, σ_y²I_p) N^{β}(y|X, σ_x²I_p) dy]
= (1/(β(1−β)))[1 − (σ_y²)^{pβ/2}(σ_x²)^{p(1−β)/2}{(1−β)σ_x² + βσ_y²}^{−p/2} × exp{−β(1−β)‖X − θ‖² ÷ (2((1−β)σ_x² + βσ_y²))}].  (2–9)

Noting that ‖X − θ‖² ∼ σ_x²χ²_p, the corresponding risk is given by

R(θ, δ0) = (1/(β(1−β)))[1 − (σ_y²)^{pβ/2}(σ_x²)^{p(1−β)/2}{(1−β)σ_x² + βσ_y²}^{−p/2}{1 + β(1−β)σ_x²/((1−β)σ_x² + βσ_y²)}^{−p/2}]
= (1/(β(1−β)))[1 − (σ_y²)^{pβ/2}(σ_x²)^{p(1−β)/2}{(1−β²)σ_x² + βσ_y²}^{−p/2}].  (2–10)

On the other hand, by Lemma 2.1.0.2 again, the divergence loss for the Bayes predictive density (under the uniform prior) of N(y|θ, σ_y²I_p), which we denote by δ*, is

L(θ, δ*) = (1/(β(1−β)))[1 − ∫ N^{1−β}(y|θ, σ_y²I_p) N^{β}(y|X, ((1−β)σ_x² + σ_y²)I_p) dy]
= (1/(β(1−β)))[1 − (σ_y²)^{pβ/2}((1−β)σ_x² + σ_y²)^{p(1−β)/2}{(1−β)²σ_x² + σ_y²}^{−p/2} × exp{−β(1−β)‖X − θ‖² ÷ (2((1−β)²σ_x² + σ_y²))}].  (2–11)

The corresponding risk is

R(θ, δ*) = (1/(β(1−β)))[1 − (σ_y²)^{pβ/2}((1−β)σ_x² + σ_y²)^{p(1−β)/2}{(1−β)²σ_x² + σ_y²}^{−p/2}{1 + β(1−β)σ_x²/((1−β)²σ_x² + σ_y²)}^{−p/2}]
= (1/(β(1−β)))[1 − (σ_y²)^{pβ/2}((1−β)σ_x² + σ_y²)^{−pβ/2}].  (2–12)
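The closed-form risks (2–10) and (2–12) are easy to tabulate; the sketch below (an added illustration, not from the original text) codes both expressions and shows numerically that the uniform-prior Bayes predictive density has the smaller risk, anticipating (2–13).

```python
import numpy as np

def risk_plug_in(p, beta, vx, vy):
    """Risk (2-10) of the plug-in predictive density N(y | X, sigma_x^2 I_p)."""
    term = vy ** (p * beta / 2) * vx ** (p * (1 - beta) / 2) * ((1 - beta ** 2) * vx + beta * vy) ** (-p / 2)
    return (1 - term) / (beta * (1 - beta))

def risk_bayes_uniform(p, beta, vx, vy):
    """Risk (2-12) of the Bayes predictive density N(y | X, ((1-beta) sigma_x^2 + sigma_y^2) I_p)."""
    term = vy ** (p * beta / 2) * ((1 - beta) * vx + vy) ** (-p * beta / 2)
    return (1 - term) / (beta * (1 - beta))

for beta in (0.1, 0.5, 0.9):
    print(beta,
          risk_plug_in(p=5, beta=beta, vx=1.0, vy=2.0),
          risk_bayes_uniform(p=5, beta=beta, vx=1.0, vy=2.0))   # second value is smaller
```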


To show that R(θ, δ0) > R(θ, δ*) for all θ, σ_x² > 0 and σ_y² > 0, it suffices to show that

(σ_y²)^{pβ/2}(σ_x²)^{p(1−β)/2}{(1−β²)σ_x² + βσ_y²}^{−p/2} < (σ_y²)^{pβ/2}((1−β)σ_x² + σ_y²)^{−pβ/2},  (2–13)

or equivalently that

1 + β(σ_y²/σ_x² − β) > (1 + σ_y²/σ_x² − β)^β,  (2–14)

for all 0 < β < 1, σ_x² > 0 and σ_y² > 0. But the last inequality is a consequence of the elementary inequality (1 + z)^u < 1 + uz for all real z and 0 < u < 1.

In the next section, we prove the minimaxity of X as an estimator of θ and the minimaxity of N(y|X, ((1−β)σ_x² + σ_y²)I_p) as the predictive density of Y in any arbitrary dimension.

2.2 Minimaxity Results

Suppose X ∼ N(θ, σ_x²I_p), where θ ∈ R^p. By Lemma 2.1.0.2, under the general divergence loss given in (2–1), the risk of X is given by

R(θ, X) = (1/(β(1−β)))[1 − {1 + β(1−β)}^{−p/2}]  (2–15)

for all θ. We now prove the minimaxity of X as an estimator of θ.

Theorem 2.2.0.3. X is a minimax estimator of θ in any arbitrary dimension under the divergence loss given in (2–1).

Proof of Theorem 2.2.0.3. Consider the sequence of proper priors N(0, σ_n²I_p) for θ, where σ_n² → ∞ as n → ∞. We denote this sequence of priors by π_n. The Bayes estimator of θ, namely the posterior mean, under the prior π_n is

δ_{π_n}(X) = (1 − B_n)X,  (2–16)

with B_n = σ_x²(σ_x² + σ_n²)^{−1}.

The Bayes risk of δ_{π_n} under the prior π_n is given by

r(π_n, δ_{π_n}) = (1/(β(1−β)))[1 − E[exp{−(β(1−β)/(2σ_x²))‖δ_{π_n}(X) − θ‖²}]],  (2–17)

where the expectation is taken over the joint distribution of X and θ, with θ having the prior π_n.

Since under the prior π_n,

θ|X = x ∼ N(δ_{π_n}(x), σ_x²(1 − B_n)I_p),

it follows that

‖θ − δ_{π_n}(X)‖² | X = x ∼ σ_x²(1 − B_n)χ²_p,

which does not depend on x. Accordingly, from (2–17),

r(π_n, δ_{π_n}) = (1/(β(1−β)))[1 − {1 + β(1−β)(1 − B_n)}^{−p/2}].  (2–18)

Since B_n → 0 as n → ∞, it follows from (2–18) that

r(π_n, δ_{π_n}) → (1/(β(1−β)))[1 − {1 + β(1−β)}^{−p/2}]

as n → ∞.

Noting (2–15), an appeal to a result of Hodges and Lehmann [40] now shows that X is a minimax estimator of θ for all p.

Next we prove the minimaxity of the predictive density

δ*(X) = N(y|X, ((1−β)σ_x² + σ_y²)I_p)

of Y having pdf N(y|θ, σ_y²I_p).

Theorem 2.2.0.4. δ*(X) is a minimax predictive density of N(y|θ, σ_y²I_p) in any arbitrary dimension under the general divergence loss given in (2–1).


Proof of Theorem 2.2.0.4. We have already shown that the predictive density δ*(X) of N(y|θ, σ_y²I_p) has constant risk

(1/(β(1−β)))[1 − (σ_y²)^{pβ/2}((1−β)σ_x² + σ_y²)^{−pβ/2}]

under the divergence loss given in (2–1). Under the same sequence π_n of priors considered earlier in this section, by Lemma 2.1.0.2, the Bayes predictive density of N(y|θ, σ_y²I_p) is given by N(y|(1−B_n)X, {(1−β)(1−B_n)σ_x² + σ_y²}I_p). By Lemma 2.1.0.2 once again, one gets the identity

∫ N^{1−β}(y|θ, σ_y²I_p) N^{β}(y|(1−B_n)X, {(1−β)(1−B_n)σ_x² + σ_y²}I_p) dy
= (σ_y²)^{pβ/2}{(1−β)(1−B_n)σ_x² + σ_y²}^{p(1−β)/2}{(1−β)²(1−B_n)σ_x² + σ_y²}^{−p/2} × exp[−β(1−β)‖θ − (1−B_n)X‖² ÷ {2((1−β)²(1−B_n)σ_x² + σ_y²)}].  (2–19)

Noting once again that ‖θ − (1−B_n)X‖² | X = x ∼ σ_x²(1−B_n)χ²_p, the posterior risk of δ*(X) simplifies to

(1/(β(1−β)))[1 − (σ_y²)^{pβ/2}{(1−β)(1−B_n)σ_x² + σ_y²}^{p(1−β)/2}{(1−β)²(1−B_n)σ_x² + σ_y²}^{−p/2}{1 + β(1−β)σ_x²(1−B_n)/((1−β)²(1−B_n)σ_x² + σ_y²)}^{−p/2}]
= (1/(β(1−β)))[1 − (σ_y²)^{pβ/2}{(1−β)(1−B_n)σ_x² + σ_y²}^{−pβ/2}].  (2–20)

Since this expression does not depend on x, it is also the Bayes risk of δ*(X). The Bayes risk converges to

(1/(β(1−β)))[1 − (σ_y²)^{pβ/2}((1−β)σ_x² + σ_y²)^{−pβ/2}].

An appeal to Hodges and Lehmann [40] once again proves the theorem.


2.3 Admissibility for p = 1

We use Blyth's [14] original technique for proving admissibility. First consider the estimation problem. Suppose that X is not an admissible estimator of θ. Then there exists an estimator δ0(X) of θ such that R(θ, δ0) ≤ R(θ, X) for all θ, with strict inequality for some θ = θ0. Let η = R(θ0, X) − R(θ0, δ0(X)) > 0. Due to continuity of the risk function, there exists an interval [θ0 − ε, θ0 + ε] with ε > 0 such that R(θ, X) − R(θ, δ0(X)) ≥ η/2 for all θ ∈ [θ0 − ε, θ0 + ε]. Now with the same prior π_n(θ) = N(θ|0, σ_n²),

r(π_n, X) − r(π_n, δ0(X)) = ∫_R [R(θ, X) − R(θ, δ0(X))] π_n(dθ) ≥ ∫_{θ0−ε}^{θ0+ε} [R(θ, X) − R(θ, δ0(X))] π_n(dθ)
≥ (η/2) ∫_{θ0−ε}^{θ0+ε} (2πσ_n²)^{−1/2} exp{−θ²/(2σ_n²)} dθ ≥ (η/2)(2πσ_n²)^{−1/2}(2ε).  (2–21)

Again,

r(π_n, X) − r(π_n, δ_{π_n}(X)) = [{1 + β(1−β)(1−B_n)}^{−1/2} − {1 + β(1−β)}^{−1/2}] ÷ [β(1−β)] = O_e(B_n)  (2–22)

for large n, where O_e denotes the exact order. Since B_n = σ_x²(σ_x² + σ_n²)^{−1} and σ_n² → ∞ as n → ∞, denoting by C(> 0) a generic constant, it follows from (2–21) and (2–22) that for large n, say n ≥ n0,

[r(π_n, X) − r(π_n, δ0(X))] ÷ [r(π_n, X) − r(π_n, δ_{π_n}(X))] ≥ (η/4)(2π)^{−1/2} σ_n^{−1} C B_n^{−1} → ∞  (2–23)

as n → ∞.

Hence, for large n, r(π_n, δ_{π_n}(X)) > r(π_n, δ0(X)), which contradicts the Bayesness of δ_{π_n}(X) with respect to π_n. This proves the admissibility of X for p = 1. □


For the prediction problem, suppose there exists a density p(y|ν(X)) which dominates N(y|X, ((1−β)σ_x² + σ_y²)I_p). Since

[{(1−β)(1−B_n)σ_x² + σ_y²}^{−β/2} − {(1−β)σ_x² + σ_y²}^{−β/2}] ÷ [β(1−β)] = O_e(B_n)

for large n under the same prior π_n, using a similar argument,

[r(π_n, N(y|X, ((1−β)σ_x² + σ_y²))) − r(π_n, p(y|ν(X)))] ÷ [r(π_n, N(y|X, ((1−β)σ_x² + σ_y²))) − r(π_n, N(y|X, ((1−β)(1−B_n)σ_x² + σ_y²)))] = O(σ_n^{−1}B_n^{−1}) → ∞,  (2–24)

as n → ∞. An argument similar to the previous result now completes the proof. □

Remark 1. The above technique of proving admissibility does not work for p = 2. This is because for p = 2, the ratios on the left-hand side of (2–23) and (2–24) are greater than or equal to some constant times σ_n^{−2}B_n^{−1} for large n, which tends to a constant as n → ∞. We conjecture the admissibility of X or N(y|X, ((1−β)σ_x² + σ_y²)I_p) for p = 2 under the general divergence loss for the respective problems of estimation and prediction.

2.4 Inadmissibility Results for p ≥ 3

Let S = ‖X‖²/σ_x². The Baranchik class of estimators for θ is given by

δ_τ(X) = (1 − τ(S)/S)X,

where one needs some restrictions on τ. The special choice τ(S) = p − 2 (with p ≥ 3) leads to the James-Stein estimator.

It is important to note that the class of estimators δ_τ(X) can be motivated from an empirical Bayes (EB) point of view. To see this, we first note that with the N(0, AI_p) (A > 0) prior for θ, the Bayes estimator of θ under the divergence loss is (1−B)X, where B = σ_x²(A + σ_x²)^{−1}.

An EB estimator of θ estimates B from the marginal distribution of X. Marginally, X ∼ N(0, σ_x²B^{−1}I_p), so that S is minimal sufficient for B. Thus, a general EB estimator of θ can be written in the form δ_τ(X). In particular, the UMVUE of B is (p−2)/S, which leads to the James-Stein estimator [28].
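To make the EB motivation concrete, the following Monte Carlo sketch (added here, not part of the original text) approximates the divergence risk (2–25) of X and of the James-Stein member of the Baranchik class, δ_τ with τ(S) = p − 2, at one illustrative parameter point; it is a numerical illustration only, not a substitute for the dominance results proved below.

```python
import numpy as np

def divergence_loss(est, theta, vx, beta):
    """Estimation loss (2-25): (1 - exp(-beta(1-beta)||est - theta||^2 / (2 vx))) / (beta(1-beta))."""
    d2 = np.sum((est - theta) ** 2, axis=-1)
    return (1.0 - np.exp(-beta * (1.0 - beta) * d2 / (2.0 * vx))) / (beta * (1.0 - beta))

rng = np.random.default_rng(1)
p, beta, vx, n_rep = 8, 0.5, 1.0, 50000
theta = np.full(p, 0.5)
x = rng.normal(theta, np.sqrt(vx), size=(n_rep, p))
s = np.sum(x ** 2, axis=1) / vx                       # S = ||X||^2 / sigma_x^2
delta_tau = (1.0 - (p - 2) / s)[:, None] * x          # Baranchik form with tau(S) = p - 2
print("risk of X        :", divergence_loss(x, theta, vx, beta).mean())
print("risk of delta_tau:", divergence_loss(delta_tau, theta, vx, beta).mean())
```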

Note that for the estimation problem,

L(θ, δ_τ(X)) = {1 − exp[−(β(1−β)/(2σ_x²))‖δ_τ(X) − θ‖²]} ÷ [β(1−β)],  (2–25)

while for the prediction problem,

L(N(y|θ, σ_y²I_p), N(y|δ_τ(X), ((1−β)σ_x² + σ_y²)I_p))
= (1/(β(1−β)))[1 − (σ_y²)^{pβ/2}((1−β)σ_x² + σ_y²)^{p(1−β)/2}{(1−β)²σ_x² + σ_y²}^{−p/2} × exp{−β(1−β)‖δ_τ(X) − θ‖² ÷ (2((1−β)²σ_x² + σ_y²))}].  (2–26)

The first result of this section finds an expression for

E[exp{−(b/σ_x²)‖δ_τ(X) − θ‖²}], b > 0.

We will need b = β(1−β)/2 and b = β(1−β)σ_x² ÷ {2((1−β)²σ_x² + σ_y²)} to evaluate (2–25) and (2–26).

Theorem 2.4.0.5.

E[exp{−(b/σ_x²)‖δ_τ(X) − θ‖²}] = (2b + 1)^{−p/2} Σ_{r=0}^∞ [exp(−φ) φ^r / r!] I_b(r),  (2–27)

where φ = (b + 1/2)‖θ‖²/σ_x² and

I_b(r) = ∫₀^∞ {1 − (b/t) τ(2t/(2b + 1))}^{2r} [t^{r+p/2−1} ÷ Γ(r + p/2)] × exp[−t − b(b + 1/2)τ²(2t/(2b + 1))/t + 2bτ(2t/(2b + 1))] dt.  (2–28)


Proof. Recall that S = ‖X‖²/σ_x². For ‖θ‖ = 0 the proof is straightforward, so we consider ‖θ‖ > 0. Let Z = X/σ_x and η = θ/σ_x. First we re-express (1/σ_x²)‖(1 − τ(S)/S)X − θ‖² as

‖(1 − τ(‖Z‖²)/‖Z‖²)Z − η‖² = S + τ²(S)/S − 2τ(S) + ‖η‖² − 2η^T Z(1 − τ(S)/S).  (2–29)

We begin with the orthogonal transformation Y = CZ, where C is an orthogonal matrix with its first row given by (θ1/‖θ‖, . . . , θp/‖θ‖). Writing Y = (Y1, . . . , Yp)^T, the right-hand side of (2–29) can be written as

S + τ²(S)/S − 2τ(S) + ‖η‖² − 2‖η‖Y1(1 − τ(S)/S),  (2–30)

where S = ‖Y‖². Also, we note that Y1, . . . , Yp are mutually independent, with Y1 ∼ N(‖η‖, 1) and Y2, . . . , Yp iid N(0, 1). Now writing Z̃ = Σ_{i=2}^p Y_i² ∼ χ²_{p−1}, we have

E[exp{−(b/σ_x²)‖δ_τ(X) − θ‖²}]
= ∫₀^{+∞} ∫_{−∞}^{+∞} exp[−b{y1² + z + τ²(y1² + z)/(y1² + z) − 2τ(y1² + z) + ‖η‖² − 2‖η‖y1(1 − τ(y1² + z)/(y1² + z))}]
 × (2π)^{−1/2} exp{−(1/2)(y1 − ‖η‖)²} exp{−z/2} z^{(p−1)/2−1} ÷ {2^{(p−1)/2}Γ((p−1)/2)} dy1 dz
= ∫₀^{+∞} ∫_{−∞}^{+∞} (2π)^{−1/2} exp[−(b + 1/2)(y1 − ‖η‖)² − bτ²(y1² + z)/(y1² + z) + 2bτ(y1² + z) − 2b‖η‖y1 τ(y1² + z)/(y1² + z)]
 × exp{−(b + 1/2)z} z^{(p−3)/2} ÷ {2^{(p−1)/2}Γ((p−1)/2)} dy1 dz.  (2–31)

We first simplify

∫_{−∞}^{+∞} (2π)^{−1/2} exp[−(b + 1/2)(y1 − ‖η‖)² − bτ²(y1² + z)/(y1² + z) + 2bτ(y1² + z) − 2b‖η‖y1 τ(y1² + z)/(y1² + z)] dy1
= ∫₀^{+∞} (2π)^{−1/2} exp[−(b + 1/2)(y1² + ‖η‖²) − bτ²(y1² + z)/(y1² + z) + 2bτ(y1² + z)]
 × [exp{2(b + 1/2)‖η‖y1 − 2b‖η‖y1 τ(y1² + z)/(y1² + z)} + exp{−2(b + 1/2)‖η‖y1 + 2b‖η‖y1 τ(y1² + z)/(y1² + z)}] dy1  (2–32)
= 2∫₀^{+∞} (2π)^{−1/2} exp[−(b + 1/2)(y1² + ‖η‖²) − bτ²(y1² + z)/(y1² + z) + 2bτ(y1² + z)]
 × [Σ_{r=0}^∞ (2‖η‖y1)^{2r}/(2r)! {(b + 1/2) − bτ(y1² + z)/(y1² + z)}^{2r}] dy1
= ∫₀^{+∞} (2π)^{−1/2} exp[−(b + 1/2)(w + ‖η‖²) − bτ²(w + z)/(w + z) + 2bτ(w + z)]
 × [Σ_{r=0}^∞ (2‖η‖)^{2r} w^{r−1/2}/(2r)! {(b + 1/2) − bτ(w + z)/(w + z)}^{2r}] dw,

where w = y1².

With the substitution v = w + z and u = w/(w + z), it follows from (2–31) and (2–32) that

E[exp{−(b/σ_x²)‖δ_τ(X) − θ‖²}]
= ∫₀^{+∞} ∫₀^1 (2π)^{−1/2} Σ_{r=0}^∞ exp{−(b + 1/2)‖η‖²} (2b‖η‖)^{2r}/(2r)! {((2b)^{−1} + 1) − τ(v)/v}^{2r}
 × exp[−(b + 1/2)v − bτ²(v)/v + 2bτ(v)] v^{r+p/2−1} u^{r+1/2−1}(1 − u)^{(p−1)/2−1} ÷ {2^{(p−1)/2}Γ((p−1)/2)} du dv.  (2–33)

By the Legendre duplication formula, namely

(2r)! = Γ(2r + 1) = Γ(r + 1/2)Γ(r + 1) 2^{2r} π^{−1/2},

(2–33) simplifies to

E[exp{−(b/σ_x²)‖δ_τ(X) − θ‖²}]
= ∫₀^{+∞} ∫₀^1 Σ_{r=0}^∞ exp{−(b + 1/2)‖η‖²} {(b + 1/2)‖η‖²}^r (b + 1/2)^r / r! {1 − τ(v)/(((2b)^{−1} + 1)v)}^{2r}
 × v^{r+p/2−1} exp[−(b + 1/2)v − bτ²(v)/v + 2bτ(v)] u^{r+1/2−1}(1 − u)^{(p−1)/2−1} Γ(r + p/2) ÷ {2^{p/2} Γ(r + 1/2)Γ((p−1)/2)Γ(r + p/2)} du dv.  (2–34)

Integrating with respect to u, (2–34) leads to

E[exp{−(b/σ_x²)‖δ_τ(X) − θ‖²}]
= Σ_{r=0}^∞ [exp(−φ) φ^r / r!] ∫₀^{+∞} (b + 1/2)^r {1 − τ(v)/(((2b)^{−1} + 1)v)}^{2r} v^{r+p/2−1} ÷ {2^{p/2}Γ(r + p/2)}
 × exp[−(b + 1/2)v − bτ²(v)/v + 2bτ(v)] dv,  (2–35)

where φ = (b + 1/2)‖η‖². Now putting t = (b + 1/2)v, we get from (2–35)

E[exp{−(b/σ_x²)‖δ_τ(X) − θ‖²}]
= (2b + 1)^{−p/2} Σ_{r=0}^∞ [exp(−φ) φ^r / r!] ∫₀^{+∞} exp[−t − b(b + 1/2)τ²(2t/(2b + 1))/t + 2bτ(2t/(2b + 1))]
 × {1 − (b/t)τ(2t/(2b + 1))}^{2r} t^{r+p/2−1} ÷ Γ(r + p/2) dt.

The theorem follows. □

Page 43: DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION …ufdcimages.uflib.ufl.edu/UF/E0/01/56/78/00001/mergel_v.pdf · Chair: Malay Ghosh Major Department: Statistics In this dissertation,

36

As a consequence of this theorem, putting b = β(1 − β)/2, it follows from

(2–25) and (2–27) that

R(θ, δτ (X)) =

1− (1 + β(1− β))−p/2∞∑

r=0

exp(−φ)φr/r! Iβ(1−β)/2(r)

β(1− β),

while putting b = β(1−β)σ2x

2((1−β)2σ2x+σ2

y), it follows from (2–26) and (2–27) that

R(N(y|θ, σ2

yIp), N(y

∣∣ δτ (X), ((1− β)σ2x + σ2

y)Ip

))

=

1− (σ2y)

pβ/2((1− β)σ2x + σ2

y)−pβ/2

∞∑r=0

exp(−φ)φr/r! I β(1−β)σ2x

2((1−β)2σ2x+σ2

y)

(r)

β(1− β).

Hence, proving Ib(r) > 1 for all b > 0 under certain conditions on τ leads to

R(θ, δτ (X)) < R(θ,X)

and

R(N(y|θ, σ2

yIp), N(y

∣∣δτ (X), ((1− β)σ2x + σ2

y)Ip

))

< R(N

(y|θ, σ2

yIp), N(y|X, ((1− β)σ2

x + σ2y)Ip

))

for all θ. In the limiting case when β → 0, i.e. for the KL loss, one gets

RKL (θ, δτ (X)) < p/2 = RKL(θ, X) for all θ, since as shown in Section 1,

for estimation, the KL loss is half of the squared error loss. Similarly, for the

prediction problem, as β → 0,

RKL(N(y|θ, σ2yIp), N(y| δτ (X), ((1− β)σ2

x + σ2y)Ip)

<p

2log

(σ2

x + σ2y

σ2y

)= RKL(N(y|θ, σ2

yIp), N(y|X, ((1− β)σ2x + σ2

y)Ip)

for all θ.

The following theorem provides sufficient conditions on the function τ(·) which

guarantee Ib(r) > 1 for all r = 0, 1, · · · .

Page 44: DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION …ufdcimages.uflib.ufl.edu/UF/E0/01/56/78/00001/mergel_v.pdf · Chair: Malay Ghosh Major Department: Statistics In this dissertation,

37

Theorem 2.4.0.6. Let p ≥ 3. Suppose

(i) 0 < τ(t) < 2(p− 2) for all t > 0;

(ii) τ(t) is a differentiable nondecreasing function of t.

Then Ib(r) > 1 for all b > 0.

Proof of Theorem 2.4.0.6.

Define τ0(t) = τ( 2t2b+1

). Notice that τ0(t) will also satisfy conditions of Theorem

2.4.0.6. Now

t +b(b + 1

2

)

tτ 2

(2t

2b + 1

)− 2bτ

(2t

2b + 1

)= t

(1− bτ0(t)

t

)2

+b

2tτ 20 (t)

and

Ib(r) =

+∞∫

0

exp

−t

(1− bτ0(t)

t

)2 (

t(1− b

tτ0(t)

)2)r+ p

2−1

Γ(r + p2)

× exp

−bτ 2

0 (t)

2t

1− b

tτ0(t)

−(p−2)

dt. (2–36)

Define t0 = supt > 0 : τ0(t)/t ≥ b−1. Since τ0(t)/t is continuous in t with

limt→0

τ0(t)/t = +∞ and limt→∞

τ0(t)/t = 0, there exists such a t0 which also satisfies

τ0(t0)/t0 = b−1. We now need the following lemma.

Lemma 2.4.0.7. For t ≥ t0, b > 0 and τ0(t) satisfying conditions of Theorem

2.4.0.6 the following inequality holds:

exp

−bτ 2

0 (t)

2t

(1− b

tτ0(t)

)−(p−2)

− q′(t) ≥ 0, (2–37)

where q(t) = t(1− bτ0(t)

t

)2

.

Proof of 2.4.0.7. Notice first that for t ≥ t0, by the inequality, (1 − z)−c ≥exp(cz) for c > 0 and 0 < z < 1, one gets

exp

−bτ 2

0 (t)

2t

(1− b

tτ0(t)

)−(p−2)

≥ exp

−bτ 2

0 (t)

2t+

b(p− 2)τ0(t)

t

, (2–38)

Page 45: DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION …ufdcimages.uflib.ufl.edu/UF/E0/01/56/78/00001/mergel_v.pdf · Chair: Malay Ghosh Major Department: Statistics In this dissertation,

38

for 0 < τ0(t) < 2(p− 2).

Notice that

q′(t) = 1− 2bτ ′0(t) +2b2τ0(t)τ

′0(t)

t− b2τ 2

0 (t)

t2. (2–39)

Thus q′(t) ≤ 1 for t ≥ t0 if and only if

2bτ ′0(t)(

1− bτ0(t)

t

)+

b2τ 20 (t)

t2≥ 0. (2–40)

The last inequality is true since τ ′0(t) ≥ 0 for all t > 0 and τ0(t)/t ≤ b−1 for all

t ≥ t0. The lemma follows. ¥

In view of previous lemma, it follows from (2–37) that

Ib(r) ≥ 1

Γ(r + p

2

)+∞∫

t0

exp−q(t)(q(t))r+ p2−1q′(t) dt = 1.

This proves Theorem 2.4.0.6. ¥

Remark 2. Baranchik [5], under squared error loss, proved the dominance of

δτ (X) over X under (i) and (ii). We may note that the special choice τ(t) = p− 2

for all t leading to the James-Stein estimator, satisfies both conditions (i) and (ii)

of the theorem.

Remark 3. We may note that the Baranchik class of estimators shrinks the sample

mean X towards 0. Instead one can shrink X towards any arbitrary constant µ. In

particular, if we consider the N(µ, AIp) prior for θ, where µ ∈ Rp is known, then

the Bayes estimator of θ is (1−B)X + Bµ, where B = σ2x(A+ σ2

x)−1. A general EB

estimator of θ is then given by

δ∗∗(X) =

(1− τ(S ′)

S ′

)X +

τ(S ′)S ′

µ,

Page 46: DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION …ufdcimages.uflib.ufl.edu/UF/E0/01/56/78/00001/mergel_v.pdf · Chair: Malay Ghosh Major Department: Statistics In this dissertation,

39

where S ′ = ‖X −µ‖2/σ2x, and Theorem 2.4.0.6 with obvious modifications will then

provide the dominance of the EB estimator δ∗∗(X) over X under the divergence

loss. The corresponding prediction result is also true.

Remark 4. The special case with τ(t) = c satisfies conditions of the theorem if

0 < c < 2(p− 2). This is the original James-Stein result.

Remark 5. Strawderman [60] considered the hierarchical prior

θ|A ∼ N(0, AIp),

where A has pdf

π(A) = δ(1 + A)−1−δI[A>0]

with δ > 0.

Under the above prior, assuming squared error loss, and recalling that S =

||X||2/σ2x, the Bayes estimator of θ is given by

(1− τ(S)

S

)X,

where

τ(t) = p + 2δ − 2exp(− t2)∫ 1

p2+δ−1exp(−λ

2t) dλ

(2–41)

Under the general divergence loss, it is not clear whether this estimator is the

hierarchical Bayes estimator of θ, although its EB interpretation continues to

hold. Besides, as it is well known, this particular τ satisfies conditions of Theorem

2.4.0.6 if p > 4 + 2δ. Thus the Strawderman class of estimators dominates X

under the general divergence loss. The corresponding predictive density also

dominates N(y

∣∣X, ((1− β)σ2x + σ2

y)Ip

). For the special KL loss, the present

results complement those of Komaki [45] and George et al. [34]. The predictive

density obtained by these authors under the Strawderman prior, (and Stein’s

superharmonic prior as a special case) are quite different from the general class of

Page 47: DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION …ufdcimages.uflib.ufl.edu/UF/E0/01/56/78/00001/mergel_v.pdf · Chair: Malay Ghosh Major Department: Statistics In this dissertation,

40

EB predictive densities of this dissertation. One of the virtues of the latter is that

the expressions are in closed form, and thus these densities are easy to implement.

2.5 Lindley’s Estimator and Shrinkage to Regeression Surface

Lindley [50] considered a modification of the James-Stein estimator. Rather

then shrinking X towards an arbitrary point, say µ, he proposed shrinking X

towards X1p, where X = p−1∑p

i=1 Xi and 1p is a p-component column vector with

each element equal to 1. Writing R =p∑

i=1

(Xi − X)2/σ2x, Lindley’s estimator is given

by

δ(X) = X − p− 3

R(X − X1p), p ≥ 4. (2–42)

The above estimator has a simple EB interpretation. Suppose

X|θ ∼ N(θ, σ2xIp) and θ has the Np(µ1p, AIp) prior. Then the Bayes estimator of

θ is given by (1− B)X + Bµ1p where B = σ2x(A + σ2

x)−1. Now if both µ and A are

unknown, since marginally X ∼ N(µ1p, σ2xB

−1Ip), (X, R) is complete sufficient for

µ and B, and the UMVUE of µ and B−1 are given by X and (p− 3)/R, p ≥ 4.

Following Baranchik [5] a more general class of EB estimators is given by

δτ∗(X) = X − τ(R)

R(X − X1p), p ≥ 4. (2–43)

Theorem 2.5.0.8. Assume

(i) 0 < τ(t) < 2(p− 3) for all t > 0 , p ≥ 4;

(ii) τ(t) is a nondecreasing differentiable function of t.

Then the estimator δτ∗(X) dominates X under the divergence loss given in 2–1.

Similarly, N(y| δτ∗(X), ((1− β)σ2

x + σ2y)Ip) dominates N(y|X, ((1− β)σ2

x + σ2y)Ip)

as the predictor of N(y| θ, σ2yIp).

Proof of Theorem 2.5.0.8. Let

θ = p−1

p∑i=1

θi, η = θ/σx

Page 48: DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION …ufdcimages.uflib.ufl.edu/UF/E0/01/56/78/00001/mergel_v.pdf · Chair: Malay Ghosh Major Department: Statistics In this dissertation,

41

and

ζ2 =1

σ2x

p∑i=1

(θi − θ)2.

As in the proof of Theorem 2.4.0.5 we first rewrite

1

σ2x

‖δτ∗(X)− θ‖2 = ‖Z − η − τ(R)

R(Z − Z1p)‖2

=

∥∥∥∥(Z − Z1p)− (η − η1p) + (Z − η)1p − τ(R)

R(Z − Z1p)

∥∥∥∥2

=

[1− τ(R)

R

]2

R + p(Z − η)2 + ζ2 − 2(η − η1p)T

(1− τ(R)

R

)(Z − Z1p). (2–44)

By the orthogonal transformation G = (G1, . . . , Gp)T = CZ, where C

is an orthogonal matrix with first two rows given by (p−12 , . . . , p−

12 ) and ((η1 −

η)/ζ, . . . , (ηp − η)/ζ). We can rewrite

1

σ2x

‖δτ∗(X)− θ‖2

=

[1− τ(G2

2 + Q)

G22 + Q

]2

(G22 + Q) + (G1 −√p η)2 + ζ2 − 2ζG2

(1− τ(G2

2 + Q)

G22 + Q

),

(2–45)

where Q =p∑

i=3

G2i and G1, G2, . . . , Gp are mutually independent with

G1 ∼ N(√

p η, 1), G2 ∼ N(ζ, 1) and G3, . . . , Gp are iid N(0, 1). Hence due to the

independence of G1 with (G2, . . . , Gp), and the fact that (G1 − √p η)2 ∼ χ21, from

(2–45),

E

[exp

− b

σ2x

‖δτ∗(X)− θ‖2

]= (2b + 1)−

p2 E

[exp

−b

((1− τ(G2

2 + Q)

G22 + Q

)2

×(G22 + Q) + ζ2 − 2ζG2

(1− τ(G2

2 + Q)

G22 + Q

))]

= (2b + 1)−p2

∞∑r=0

exp−ϕ∗ϕr∗

r!×

×+∞∫

0

exp

−t− b(b +

1

2)τ 20 (t)

t+ 2bτ0(t)4

(1− b

τ0(t)

t

)2rtr+

p2−1

Γ(r + p2)dt, (2–46)

Page 49: DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION …ufdcimages.uflib.ufl.edu/UF/E0/01/56/78/00001/mergel_v.pdf · Chair: Malay Ghosh Major Department: Statistics In this dissertation,

42

where ϕ∗ = (b + 12)ζ2 and as before τ0(t) = τ( t

b+ 12

). The second equality in 2–46

follows after long simplifications proceeding as in the proof of Theorem 2.4.0.5.

Hence, by (2–46), the dominance of δτ∗(X) over X follows if the right hand

side of (2–46) ≥ (2b+1)−p/2. This however is an immediate consequence of Theorem

2.4.0.6. ¥

The above result can immediately be extended to shrinkage towards an

arbitrary regression surface. Suppose now that X|θ ∼ N(θ, σ2xIp) and θ ∼

Np(Kβ, AIp) where K is a known p × r matrix of rank r(< p) and β is r × 1

regression coefficient. Writing P = K(KT K)−1KT , the projection of X on the

regression surface is given by P X = Kβ, where β = (KT K)−1KT X is the least

squares estimator of β. Now we consider the general class of estimators given by

X − τ(R∗)R∗ (X − P X),

where R∗ = ‖X − P X‖2/σ2x. The above estimator also has an EB interpretation

noting that marginally (β, R∗) is complete sufficient for (β, A).

The following theorem now extends Theorem 2.5.0.8.

Theorem 2.5.0.9. Let p ≥ r + 3 and

(i) 0 < τ(t) < 2(p− r − 2) for all t > 0 ;

(ii) τ(t) is a nondecreasing differentiable function of t.

Then the estimator X − τ(R∗)R∗ (X − P X) dominates X under the divergence loss.

A similar dominance result holds for prediction of N(y|θ, σ2yIp).

Page 50: DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION …ufdcimages.uflib.ufl.edu/UF/E0/01/56/78/00001/mergel_v.pdf · Chair: Malay Ghosh Major Department: Statistics In this dissertation,

CHAPTER 3POINT ESTIMATION UNDER DIVERGENCE LOSS WHEN VARIANCE

COVARIANCE MATRIX IS UNKNOWN

3.1 Preliminary Results

In this chapter we will consider the following situation. Let vectors X i ∼Np (θ,Σ) i = 1, . . . , n be n i.i.d. random vectors, where Σ is the unknown

variance covariance matrix. In Section 3.2 we consider Σ = σ2Ip with σ2 unknown,

while in section 3.3 we consider the most general situation of unknown Σ. Our goal

is to estimate the unknown vector θ under divergence loss.

First note that X is distributed as Np (θ, 1nΣ). And thus divergence loss for an

estimator a of θ is as follows:

Lβ(θ,a) =1− ∫

f 1−β(x|θ)fβ(x|a)dx

β(1− β)=

1− exp[−nβ(1−β)

2(a− θ)TΣ−1(a− θ)

]

β(1− β).

(3–1)

The best unbiased equivariant estimator is X. We will begin with the

expression for the risk of this estimator.

Lemma 3.1.0.10. Let X i ∼ Np (θ,Σ) i = 1, . . . , n be i.i.d. Then the risk of the

best unbiased estimator X of θ is as follows:

Rβ(θ) =1

β(1− β)

(1− 1

[1 + β(1− β)]p/2

). (3–2)

Proof of the lemma 3.1.0.10. Note first that

n(X − θ)TΣ−1(X − θ) ∼ χ2p.

Then

[exp

−nβ(1− β)

2(X − θ)TΣ−1(X − θ)

]=

1

(1 + β(1− β))p/2. (3–3)

43

Page 51: DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION …ufdcimages.uflib.ufl.edu/UF/E0/01/56/78/00001/mergel_v.pdf · Chair: Malay Ghosh Major Department: Statistics In this dissertation,

44

The lemma follows from 3–3.

¥

Thus for any rival estimator δ(X) to dominate X under the divergence loss

we will need the following inequality to be true for all possible values of θ:

[exp

−nβ(1− β)

2(δ(X)− θ)TΣ−1(δ(X)− θ)

]≥ 1

(1 + β(1− β))p/2. (3–4)

3.2 Inadmissibility Results when Variance-Covariance Matrix isProportional to Identity Matrix

Let X ∼ Np(θ, σ2Ip), where σ(> 0) is unknown, while S ∼ σ2

m+2χ2

m,

independently of X. This situation arises quite naturally in a balanced fixed effects

one-way ANOVA model. For example, let

Xij = θi + εij (i = 1, . . . , p; j = 1, . . . , n)

where the εij are i.i.d. Np(0, σ20). Then the minimal sufficient statistics is given by

(X1, . . . , Xp, S), where

Xi = n−1

n∑j=1

Xij (i = 1, . . . , p)

and

S = [(n− 1)p + 2]−1

p∑i=1

n∑j=1

(Xij − Xi)2.

This leads to the proposed setup with X = (X1, . . . , Xp)T , θ = (θ1, . . . , θp)

T ,

σ2 = σ20/n and m = (n− 1)p.

Efron and Morris [30], in the above scenario, proposed a general class of

shrinkage estimators dominating the sample mean in three or higher dimensions

under squared error loss. This class of estimators was developed along the ones of

Baranchik [5].

Page 52: DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION …ufdcimages.uflib.ufl.edu/UF/E0/01/56/78/00001/mergel_v.pdf · Chair: Malay Ghosh Major Department: Statistics In this dissertation,

45

Using equation (3–1), the divergence loss for an estimator a of θ is given by

Lβ(θ,a) =1− exp

[−β(1−β)

2σ2 ‖θ − a‖2]

β(1− β). (3–5)

The above loss to be interpreted as its limit when β → 0 or β → 1. The KL loss

occurs as a special case when β → 0. Also, noting that ‖X − θ‖2 ∼ σ2χ2p, the risk

of the classical estimator X of θ is readily calculated as

Rβ(θ,X) =1− [1 + β(1− β)]−p/2

β(1− β). (3–6)

Throughout we will perform calculations in the case 0 < β < 1, and will pass

to the limit as and when needed.

Following Baranchik [5] and Efron and Morris [30], we consider the rival class

of estimators

δ(X) =

(1− Sτ(‖X‖2/S)

‖X‖2

)X, (3–7)

where we will impose some conditions later on τ.

First we observe that under the divergence loss given in (3–5),

Lβ(θ, δ(X)) =1− exp

[−β(1−β)

2σ2 ‖δ(X)− θ‖2]

β(1− β). (3–8)

We now prove the following dominance result.

Theorem 3.2.0.11. Let p ≥ 3. Assume

(i) 0 < τ(t) < 2(p− 2) for all t > 0;

(ii) τ(t) is a differentiable nondecreasing function of t for t > 0.

Then R(θ, δ(X)) < R(θ,X) for all θ ∈ Rp.

Proof of Theorem 3.2.0.11 First with the transformation

Y = σ−1X, η = σ−1θ and U = S/σ2,

Page 53: DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION …ufdcimages.uflib.ufl.edu/UF/E0/01/56/78/00001/mergel_v.pdf · Chair: Malay Ghosh Major Department: Statistics In this dissertation,

46

one can rewrite

R(θ, δ(X)) =

1− E

[exp

−β(1−β)

2

∥∥∥∥(

1− τ(‖Y ‖2/U)U

‖Y ‖2

)Y − η

∥∥∥∥2]

β(1− β), (3–9)

where Y ∼ Np(η, Ip) and U ∼ (m + 2)−1χ2m is distributed independently of Y .

Hence a comparison of (3–9) with (3–6) reveals that Theorem (3.2.0.11) holds if

and only if

E

[exp

−β(1− β)

2

∥∥∥∥(

1− Uτ(‖Y ‖2/U)

‖Y ‖2

)Y − η

∥∥∥∥2]

> [1+β(1−β)]−p/2. (3–10)

Next writing

Z =U(m + 2)

2

and

τ0(t/z) =2

m + 2τ

((m + 2)t

2z

),

we reexpress left hand side of (3–10) as

E

[exp

−β(1− β)

2

∥∥∥∥(

1− Z

‖Y ‖2τ0(‖Y ‖2/Z)

)Y − η

∥∥∥∥2]

. (3–11)

Note that in order to find the above expectation, we first condition on Z and

then average over the distribution of Z. By the independence of Z and Y and

Theorem 2.4.0.5, the expression given in (4–29) simplifies to

[1 + β(1− β)]−p/2

∞∑r=0

exp(−φ)φr

r!I∗b (r), (3–12)

where φ = 12[1 + β(1− β)]‖η‖2, and writing b = β(1−β)

2,

I∗b (r) =

∞∫

0

∞∫

0

[1− bz

tτ0(t/z)

]2rtr+

p2−1

Γ(r + p

2

)

× exp

−t− b(b + 1/2)z2

tτ 20 (t/z) + 2bzτ0(t/z)

exp(−z)

zm2−1

Γ(

m2

) dt dz. (3–13)

Page 54: DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION …ufdcimages.uflib.ufl.edu/UF/E0/01/56/78/00001/mergel_v.pdf · Chair: Malay Ghosh Major Department: Statistics In this dissertation,

47

From (3–10) - (4–31), it remains only to show that

I∗b (r) > 1 ∀r = 0, 1, . . . ; p ≥ 3

under conditions (i) and (ii) of the theorem. To show this we first use the

transformation

t = zu.

Then from (4–31),

I∗b (r) =

∞∫

0

∞∫

0

[1− bτ0(u)/u]2r ur+ p2−1

Γ(r + p

2

(m2

)

× exp

[−zu + 1 +

b(b + 1/2)

uτ 20 (u)− 2bτ0(u)

]zr+

(p+m)2

−1 dz du

=

∞∫

0

[1− bτ0(u)/u]2r ur+ p2−1

B(r + p

2, m

2

)[u + 1 +

b(b + 1/2)

uτ 20 (u)− 2bτ0(u)

]−(r+ p+m2 )

du.

(3–14)

Since τ0(u)/u is a continuous function of u with

limu→0

τ0(u)/u = +∞ and limu→∞

τ0(u)/u = 0,

it follows that there exists u0 such that

u0 = supu > 0|τ0(u)/u ≥ 1/b and τ0(u0)/u0 = 1/b.

Page 55: DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION …ufdcimages.uflib.ufl.edu/UF/E0/01/56/78/00001/mergel_v.pdf · Chair: Malay Ghosh Major Department: Statistics In this dissertation,

48

Thus for u > u0, τ0(u)/u ≤ 1/b, from (3–14),

I∗b (r) ≥∞∫

u0

[1− bτ0(u)/u]2r ur+ p2−1

B(r + p

2, m

2

)[u + 1 +

b(b + 1/2)

uτ 20 (u)− 2bτ0(u)

]−(r+ p+m2 )

du

=

∞∫

u0

u(1− bτ0(u)/u)2r+ p2−1 (1− bτ0(u)/u)−(p−2)

B(r + p

2, m

2

)

×[u(1− bτ0(u)/u)2 + 1 +

bτ 20 (u)

2u

]−(r+ p+m2 )

du

=

∞∫

u0

[u(1− bτ0(u)/u)2 /1 + bτ 20 (u)/2u ]

r+ p2−1

[1 + u(1−bτ0(u)/u)2

1+bτ20 (u)/(2u)

]r+ p+m2

× (1− bτ0(u)/u)−(p−2) (1 + bτ 2

0 (u)/(2u))−(m

2+1)

du. (3–15)

By the inequalities

(1− bτ0(u)/u)−(p−2) ≥ exp[(p− 2)bτ0(u)/u]

and(1 + bτ 2

0 (u)/(2u))−(m

2+1) ≥ exp[−(1 + m/2)bτ 2

0 (u)/(2u)],

it follows that

(1− bτ0(u)

u

)−(p−2) (1 +

bτ 20 (u)

2u

)−(m2

+1)

≥ exp

[(p− 2)

bτ0(u)

u− (m + 2)bτ 2

0 (u)

4u

]

= exp

[bτ0(u)(m + 2)

4u

4(p− 2)

m + 2− τ0(u)

]> 1 (3–16)

since 0 < τ0(u) < 4(p− 2)/(m + 2).

Moreover, putting

w =u

(1− bτ0(u)

u

)2

1 +bτ2

0 (u)

2u

=2[u− bτ0(u)]2

[2u + bτ 20 (u)]

,

Page 56: DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION …ufdcimages.uflib.ufl.edu/UF/E0/01/56/78/00001/mergel_v.pdf · Chair: Malay Ghosh Major Department: Statistics In this dissertation,

49

it follows that

dw

du=

2(u− bτ0(u))

[2u + bτ 20 (u)]2

[2(1− bτ ′0(u))(2u + bτ 2

0 (u))− (u− bτ0(u))(2 + 2bτ0(u)τ ′0(u))]

=2(u− bτ0(u))

[2u + bτ 20 (u)]2

[2u + 2bτ0(u) + 2bτ 2

0 (u)− 4buτ ′0(u)− 2buτ0(u)τ ′0(u)].

Hence dwdu≤ 1 if and only if

2[u− bτ0(u)][2u + 2bτ0(u) + 2bτ 20 (u)− 4buτ ′0(u)− 2buτ0(u)τ ′0(u)] ≤ [2u + bτ 2

0 (u)]2.

The last inequality is equivalent to

b2τ 20 (u)[2 + τ0(u)]2 + 4buτ ′0(u)[2 + τ0(u)][u− bτ0(u)] ≥ 0. (3–17)

Since for u ≥ u0, u ≥ bτ0(u), (3–17) holds if τ ′0(u) ≥ 0, and the latter is true

due to assumption (ii). Now from (3–15) - (3–17) noting that w = 0 when u = u0,

one gets

I∗b (r) >

∞∫

0

wr+ p2−1

(1 + w)r+ p+m2 B

(r + p

2, m

2

) dw = 1,

for all r = 0, 1, 2, . . . . This completes the proof of Theorem 3.2.0.11. ¥

Remark 1. It is interesting to note that I∗b (r) > 1 for all r = 0, 1, 2, . . . and any

arbitrary b > 0. The particular choice b = β(1 − β)/2 does not have any special

significance.

We now consider an extension of the above result when V (X) = Σ is an

unknown variance-covariance matrix. We solve the problem by reducing the risk

expression of the corresponding shrinkage estimator to the one in this section after

a suitable transformation.

3.3 Unknown Positive Definite Variance-Covariance Matrix

Consider the situation when Z1, . . . , Zn (n ≥ 2) are i.i.d. Np(θ,Σ), where Σ is

an unknown positive definite matrix. The goal is once again to estimate θ.

Page 57: DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION …ufdcimages.uflib.ufl.edu/UF/E0/01/56/78/00001/mergel_v.pdf · Chair: Malay Ghosh Major Department: Statistics In this dissertation,

50

The usual estimator of θ is Z = n−1n∑

i=1

Zi (say). It is the MLE, UMVUE and

the best equivariant estimator of θ, and is distributed as Np(θ, n−1Σ). In addition

the usual estimator of Σ is

S =1

n− 1

n∑i=1

(Zi − Z)(Zi − Z)T

and S is distributed independently of Z.

Based on distribution of Z, the minimal sufficient statistic for θ for any given

Σ, the divergence loss is given by (see equation 3–1)

Lβ(θ, a) =1− exp

[−nβ(1−β)

2(a− θ)TΣ−1(a− θ)

]

β(1− β). (3–18)

The corresponding risk of Z is the same as the one given in (3–2), i.e.

Rβ(θ, Z) =1

β(1− β)[1− 1 + β(1− β)−p/2]. (3–19)

Consider now the general class of estimators

δτ (Z,S) =

[1− τ(nZ

TS−1Z)

nZTS−1Z

]Z (3–20)

of θ. Under the divergence loss given in (2–1),

Lβ(θ, δτ (Z,S)) =1− exp

[−nβ(1−β)

2(δτ (Z,S)− θ)TΣ−1(δτ (Z,S)− θ)

]

β(1− β). (3–21)

By the Helmert orthogonal transformation,

H1 =1√2(Z2 −Z1),

H2 =1√6(2Z3 −Z1 −Z2),

. . .

Hn−1 =1√

n(n− 1)[(n− 1)Zn −Z1 −Z2 − . . .−Zn−1]

Page 58: DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION …ufdcimages.uflib.ufl.edu/UF/E0/01/56/78/00001/mergel_v.pdf · Chair: Malay Ghosh Major Department: Statistics In this dissertation,

51

and

Hn =1√n

n∑i=1

Zi =√

nZ,

one can rewrite δτ (Z,S) as

δτ (Z,S) =

1−

τ

((n− 1)HT

n (n−1∑i=1

H iHTi )−1Hn

)

(n− 1)HTn (

n−1∑i=1

H iHTi )−1Hn)

n−

12 Hn, (3–22)

where H1, . . . , Hn are mutually independent with H1, . . . , Hn−1 i.i.d. N(0,Σ) and

Hn ∼ N(√

nθ,Σ).

Let

Y i = Σ− 12 H i (i = 1, . . . , n)

and

η = Σ− 12 (√

nθ).

Then from (3–21) and (3–22) one can rewrite

Lβ(θ, δτ (Z,S)) =

1− exp

−β(1−β)

2

∥∥∥∥∥∥

1−

τ

((n−1)Y T

n (n−1∑i=1

Y iYT

i )−1Y n

)

(n−1)Y T

n (n−1∑i=1

Y iYT

i )−1Y n

Y n − η

∥∥∥∥∥∥

2

β(1− β),

(3–23)

where Y 1, . . . , Y n are mutually independent with Y 1, . . . , Y n−1 i.i.d. N(0, Ip) and

Y n ∼ N(η, Ip).

Now from Arnold ([3], p. 333) or Anderson([4], p.172),

Y Tn

(n−1∑i=1

Y iYTi

)−1

Y nd=‖Y n‖2

U,

where U ∼ χ2n−p, and is distributed independently of Y n. Now from (3–23)

Lβ(θ, δτ (Z, S)) =

1− exp

[−β(1−β)

2

∥∥∥∥(

1− τ((n−1)‖Y n‖2/U)(n−1)‖Y n‖2/U

)Y n − η

∥∥∥∥2]

β(1− β). (3–24)

Page 59: DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION …ufdcimages.uflib.ufl.edu/UF/E0/01/56/78/00001/mergel_v.pdf · Chair: Malay Ghosh Major Department: Statistics In this dissertation,

52

Next writing

Z =U

2

and

τ0(t/z) =2

n− 1τ

(n− 1

2t/z

),

Lβ(θ, δτ (Z,S)) =

1− exp

[−β(1−β)

2

∥∥∥∥(

1− τ0(‖Y n‖2/Z)‖Y n‖2/Z

)Y n − η

∥∥∥∥2]

β(1− β), (3–25)

By Theorem 3.2.0.11, δτ (Z, S) dominates Z as an estimator of θ provided 0 <

τ0(u) < 4(p−2)n−p+2

for all u and 3 ≤ p ≤ n. Accordingly, δτ (Z,S) dominates Z provided

0 < τ(u) < 2(p−2)(n−1)n−p+2

. We state this result in the form of the following theorem.

Theorem 3.3.0.12. Let p ≥ 3. Assume

(i) 0 < τ(t) < 2(p−2)(n−1)n−p+2

for all t > 0;

(ii) τ(t) is a differentiable nondecreasing function of t for t > 0.

Then R(θ, δτ (Z,S)) < R(θ, Z) for all θ ∈ Rp.

Page 60: DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION …ufdcimages.uflib.ufl.edu/UF/E0/01/56/78/00001/mergel_v.pdf · Chair: Malay Ghosh Major Department: Statistics In this dissertation,

CHAPTER 4REFERENCE PRIORS UNDER DIVERGENCE LOSS

4.1 First Order Reference Prior under Divergence Loss

In this section we will find a reference prior for estimation problems under

divergence losses. Such a prior is obtained by maximizing the expected distance

between prior and corresponding posterior distribution and thus can be interpreted

as noninformative or prior which changes the most on average when the sample is

observed.

If we use divergence as a distance between a proper prior distribution π(θ)

(putting all its mass on a compact set if needed) and the corresponding posterior

distribution π(θ|x) we can reexpress the expected divergence as

Rβ(π) =1− ∫∫

πβ(θ)π1−β(θ|x)m(x) dx dθ

β(1− β)

=1− ∫∫

mβ(x)p1−β(x|θ)π(θ) dx dθ

β(1− β). (4–1)

Using this expression one can easily see that in order to find a prior that

maximizes Rβ(π) we need to find an asymptotic expression for

∫mβ(x)p1−β(x|θ) dx

first.

In this section we assume that we have a parametric family pθ : θ ∈ Θ,Θ ⊂ Rp, of probability density functions pθ ≡ p(x|θ) : θ ∈ Θ with respect to

a finite dominating measure λ(dx) on a measurable space X , and we have a prior

distribution for θ that has a pdf π(θ) with respect to Lebesgue measure.

53

Page 61: DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION …ufdcimages.uflib.ufl.edu/UF/E0/01/56/78/00001/mergel_v.pdf · Chair: Malay Ghosh Major Department: Statistics In this dissertation,

54

Next we will give a definition of Divergence rate when parameter β ≤ 1

and sample size is n. We define the relative Divergence rate between the true

distribution P n

θ(x) and the marginal distribution of the sample of size n mn(x) to

be

Rβ(θ, π) = DRβ(P n

θ‖mn(x))

=1− ∫

(mn(x))β/n(P n

θ(x))1−β/n

λ(dx)

β/n(1− β/n). (4–2)

It is easy to check for β → 0 that this definition is equivalent to the definition

of relative entropy rate considered for example in Clarke and Barron [21].

Using this definition, we can define for a given prior π the corresponding Bayes

risk as follows:

Rβ(π) = E[DRβ

(P n

θ‖mn(x))]

. (4–3)

To find an asymptotic expansion for this risk function, we will reexpress the

risk function as follows:

Rβ(θ, π) =1− Eθ

[exp

−β

nln p(x|θ)

m(x)

]

β/n(1− β/n), (4–4)

where

p(x|θ) =n∏

i=1

p(xi|θ),

and

m(x) =

∫p(x|θ)π(θ) dθ.

Clarke and Barron [20] derived the following formula:

lnp(x|θ)

m(x)=

p

2ln

n

2π+

1

2ln |I(θ)|+ ln

1

π(θ)− 1

2ST

n (I(θ))−1 Sn + o(1), (4–5)

where o(1) → 0 in L1(P ) as well as in probability as n → ∞. Here, Sn =

(1/√

n)∇ ln p(x|θ) is the standardized score function for which E(SnSTn ) = I(θ)

and E[ST

n (I(θ))−1 Sn

]= p.

Page 62: DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION …ufdcimages.uflib.ufl.edu/UF/E0/01/56/78/00001/mergel_v.pdf · Chair: Malay Ghosh Major Department: Statistics In this dissertation,

55

Using this formula we can write the following asymptotic expansion for risk

function (4–1):

Rβ(θ, π) =

1− (2πn

) pβ2n π

βn (θ)

|I(θ)| β2n

Eθ[exp

β2n

STn (I(θ))−1 Sn

](1 + o(1))

β/n(1− β/n). (4–6)

Since STn (I(θ))−1 Sn is asymptotically distributed as χ2

p we can rewrite the

above expression as follow:

Rβ(θ, π) =

1− (2πn

) pβ2n π

βn (θ)

|I(θ)| β2n

1

(1− βn)

p/2 + o(n−p/4)

β/n(1− β/n). (4–7)

Hence, the prior which minimizes the Bayes risk will be the one that

maximizes the integral: ∫π

βn

+1(θ)

|I(θ)| β2n

dθ (4–8)

subject to the constraint ∫π(θ) dθ = 1.

A simple calculus of variations argument gives this maximizer as

π(θ) ∝ |I(θ)| 12 (4–9)

which is Jeffreys’ prior.

4.2 Reference Prior Selection under Divergence Loss for OneParameter Exponential Family

Let X1, . . . , Xn be iid with common pdf (with respect to some σ-finite

measure) belonging to the regular one-parameter exponential family, and is

given by

p(x|θ) = exp[θx− ψ(θ) + c(x)]. (4–10)

Page 63: DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION …ufdcimages.uflib.ufl.edu/UF/E0/01/56/78/00001/mergel_v.pdf · Chair: Malay Ghosh Major Department: Statistics In this dissertation,

56

Consider a prior π(θ) for θ which puts all its mass on a compact set. We will

pass on to the limit later as needed. Then the posterior is given by

π(θ|x1, . . . , xn) ∝ exp[nθx− ψ(θ)]π(θ). (4–11)

We denote the same by π(θ|x). Also, let p(x|θ) denote the conditional pdf of X

given θ and m(x) the marginal pdf of X.

The general expected divergence between the prior and the posterior is given

by

Rβ(π) =1− ∫ [∫

πβ(θ)π1−β(θ| x) dθ]m(x) dx

β(1− β). (4–12)

From the relation p(x|θ)π(θ) = π(θ|x)m(x), one can reexpress Rβ(π) given in

(4–12) as

Rβ(π) =1− ∫∫

πβ+1(θ)π−β(θ| x)p(x|θ) dx dθ

β(1− β)=

1− ∫πβ+1(θ)E

[π−β(θ| x)| θ] dθ

β(1− β).

(4–13)

Let I(θ) = ψ′′(θ) denote the per observation Fisher information number. Then

we have the following theorem.

Theorem 4.2.0.13.

[π−β(θ| x)

]=

(2π

nI(θ)

)β2 1√

1− β

[1 +

1

n

(− ψ′′′(θ)β

(1− β)I2(θ)

π′(θ)π(θ)

+(ψ′′′(θ))2β(3β2 + 7β + 10)

24(1− β)I3(θ)− β

2I(θ)

(π′(θ)π(θ)

)2

+β(2− β)

2(1− β)I(θ)

π′′(θ)π(θ)

− ψ(4)(θ)β(2 + β)

8(1− β)I2(θ)

)]+ o(n−1−β/2). (4–14)

Proof of Theorem 4.2.0.13. Let θ denote the MLE of θ. Throughout this

section we will use the following notations:

l(θ) = θx− ψ(θ),

a2 = l′′(θ) = −ψ′′(θ),

Page 64: DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION …ufdcimages.uflib.ufl.edu/UF/E0/01/56/78/00001/mergel_v.pdf · Chair: Malay Ghosh Major Department: Statistics In this dissertation,

57

c = ψ′′(θ),

a3 = l′′′(θ) = −ψ′′′(θ),

a4 = l(4)(θ) = −ψ(4)(θ),

and

h =√

n(θ − θ).

We will use ”shrinkage” argument as it presented in Datta and Mukerjee [24] .

From Datta and Mukerjee ([24] p.13)

ππ(h| x) =

√c

2πexp

−ch2

2

[1 +

1√n

π′(θ)

π(θ)h +

1

6a3h

3

+1

n

1

2

(π′′(θ)

π(θ)h2 − π′′(θ)

π(θ)c

)+

1

6

(a3

π′(θ)

π(θ)h4 − 3a3

c2

π′(θ)

π(θ)

)

+1

24

(a4h

4 − 3a4

c2

)+

1

72

(a2

3h6 − 15a2

3

c3

)]+ o(n−1). (4–15)

With the general expansion

(1

a1 + a2√n

+ a3

n+ o(n−1)

= a−β1

(1− β

a2

a1

√n

n

(β + 1

2

a22

a21

− a3

a1

))+ o(n−1),

we get

π−β(h| x) =( c

)−β2exp

cβh2

2

[1− β√

n

π′(θ)

π(θ)h +

1

6a3h

3

n

1 + β

2

(π′(θ)

π(θ)h +

1

6a3h

3

)2

− 1

2

(π′′(θ)

π(θ)h2 − π′′(θ)

π(θ)c

)− 1

6

(a3

π′(θ)

π(θ)h4 − 3a3

c2

π′(θ)

π(θ)

)

− 1

24

(a4h

4 − 3a4

c2

)− 1

72

(a2

3h6 − 15a2

3

c3

)]+ o(n−1). (4–16)

Page 65: DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION …ufdcimages.uflib.ufl.edu/UF/E0/01/56/78/00001/mergel_v.pdf · Chair: Malay Ghosh Major Department: Statistics In this dissertation,

58

Using (4–15) and (4–16) we will get the following

π−β(h| x)ππ(h| x) =( c

) 1−β2

exp

−c(1− β)h2

2

[1

+1√n

π′(θ)

π(θ)h− β

π′(θ)

π(θ)h +

1− β

6a3h

3

+1

n

−β

(π′(θ)

π(θ)

π′(θ)

π(θ)h2 +

1

6a3h

4

(π′(θ)

π(θ)+

π′(θ)

π(θ)

)+

1

36a2

3h6

)

+β(1 + β)

2

(π′(θ)

π(θ)h +

1

6a3h

3

)2

− β

2

(π′′(θ)

π(θ)h2 − π′′(θ)

π(θ)c

)+

1

2

(π′′(θ)

π(θ)h2 − π′′(θ)

π(θ)c

)

+1

6

(a3

π′(θ)

π(θ)h4 − 3a3

c2

π′(θ)

π(θ)− β

(a3

π′(θ)

π(θ)h4 − 3a3

c2

π′(θ)

π(θ)

))

+1− β

24

(a4h

4 − 3a4

c2

)+

1− β

72

(a2

3h6 − 15a2

3

c3

)]+ o(n−1). (4–17)

Integrating this last expression with respect to h will get the following

∫π−β(h| x)ππ(h| x) dh =

( c

)−β2 1√

1− β

[1

+1

n

−β

(π′(θ)

π(θ)

π′(θ)

π(θ)

1

c(1− β)+

a3

2c2(1− β)2

(π′(θ)

π(θ)+

π′(θ)

π(θ)

)+

15a23

36c3(1− β)3

)

+β(1 + β)

2c(1− β)

(π′(θ)

π(θ)

)2

+a3

c(1− β)

π′(θ)

π(θ)+

15a23

36c2(1− β)2

2c(1− β)

π′′(θ)

π(θ)− β2

2c(1− β)

π′′(θ)

π(θ)+

a3β(2− β)

2c2(1− β)2

π′(θ)

π(θ)

−a3β2(2− β)

2c2(1− β)2

π′(θ)

π(θ)+

a4β(2− β)

8c2(1− β)+

15a23(1− (1− β)3)

72c3(1− β)2

]+ o(n−1). (4–18)

From the relation θ = h/√

n + θ we get

Eπ[π−β(θ| x)| x]

= n−β2

∫π−β(h| x)ππ(h| x) dh. (4–19)

Page 66: DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION …ufdcimages.uflib.ufl.edu/UF/E0/01/56/78/00001/mergel_v.pdf · Chair: Malay Ghosh Major Department: Statistics In this dissertation,

59

Thus by (4–18) and (4–19) we have

λ(θ) = Eθ

(Eπ

[π−β(θ| x)| x])

=

(2π

nI(θ)

)β2 1√

1− β

[1

+1

n

− β

(1− β)I(θ)

(π′(θ)π(θ)

π′(θ)π(θ)

− ψ′′′(θ)2(1− β)I(θ)

(π′(θ)π(θ)

+π′(θ)π(θ)

)+

5(ψ′′′(θ))2

12(1− β)2I2(θ)

)

+β(1 + β)

2(1− β)I(θ)

((π′(θ)π(θ)

)2

− ψ′′′(θ)(1− β)I(θ)

π′(θ)π(θ)

+5(ψ′′′(θ))2

12(1− β)2I2(θ)

)

2(1− β)I(θ)

(π′′(θ)π(θ)

− βπ′′(θ)π(θ)

)− ψ′′′(θ)β(2− β)

2(1− β)2I2(θ)

(π′(θ)π(θ)

− βπ′(θ)π(θ)

)

−ψ(4)(θ)β(2− β)

8(1− β)I2(θ)+

5(ψ′′′(θ))2β(β2 − 3β + 3)

24(1− β)2I3(θ)

]+ o(n−1−β/2). (4–20)

In the next step, we find

∫λ(θ)π(θ) dθ =

∫ (2π

nI(θ)

)β2 1√

1− β

×[1 +

1

n

ψ′′′(θ)β

2(1− β)2I2(θ)

π′(θ)π(θ)

− 5β(ψ′′′(θ))2

12(1− β)3I3(θ)+

β(1 + β)

2(1− β)I(θ)

×((

π′(θ)π(θ)

)2

− ψ′′′(θ)(1− β)I(θ)

π′(θ)π(θ)

+5(ψ′′′(θ))2

12(1− β)2I2(θ)

)− β2

2(1− β)I(θ)

π′′(θ)π(θ)

+ψ′′′(θ)β2(2− β)

2(1− β)2I2(θ)

π′(θ)π(θ)

−ψ(4)(θ)β(2− β)

8(1− β)I2(θ)+

5(ψ′′′(θ))2β(β2 − 3β + 3)

24(1− β)2I3(θ)

]π(θ) dθ

+

∫ (2π

nI(θ)

)β2 1√

1− β

[1

n

(− β

(1− β)I(θ)

π′(θ)π(θ)

+ψ′′′(θ)β

2(1− β)2I2(θ)

−ψ′′′(θ)β(2− β)

2(1− β)2I2(θ)

)π′(θ)

]dθ

+

∫ (2π

nI(θ)

)β2 β

2nI(θ)(1− β)3/2π′′(θ) dθ + o(n−1−β/2). (4–21)

The last step will give an expression for Eθ

[π−β(θ| x)

]. We consider π(θ) to

converge weakly to the degenerate prior at true θ and have chosen π(θ) in such a

way that we could integrate the last two integrals in (4–21) by parts and have the

first term equal to zero every time we use integration by parts.

Page 67: DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION …ufdcimages.uflib.ufl.edu/UF/E0/01/56/78/00001/mergel_v.pdf · Chair: Malay Ghosh Major Department: Statistics In this dissertation,

60

Thus we will have

[π−β(θ| x)

]=

(2π

nI(θ)

)β2 1√

1− β

[1 +

1

n

(ψ′′′(θ)β2

2(1− β)I2(θ)

π′(θ)π(θ)

+5(ψ′′′(θ))2β(2− β)

24(1− β)I3(θ)+

β(1 + β)

2(1− β)I(θ)

(π′(θ)π(θ)

)2

− β2

2(1− β)I(θ)

π′′(θ)π(θ)

− ψ(4)(θ)β(2− β)

8(1− β)I2(θ)

)]

+1

n

d

[(2π

nI(θ)

)β2 1√

1− β

(1− β)I(θ)

π′(θ)π(θ)

+ψ′′′(θ)β

2(1− β)I2(θ)

)]

+1

n

d2

dθ2

[(2π

nI(θ)

)β2 1√

1− β

β

2(1− β)I(θ)

]+ o(n−1−β/2). (4–22)

Finally, we have

d

[(2π

nI(θ)

)β2 1√

1− β

(1− β)I(θ)

π′(θ)π(θ)

+ψ′′′(θ)β

2(1− β)I2(θ)

)]

=

(2π

nI(θ)

)β2 1√

1− β

(1− β)I(θ)

π′′(θ)π(θ)

− β

(1− β)I(θ)

(π′(θ)π(θ)

)2

−β(1 + β/2)ψ′′′(θ)(1− β)I2(θ)

π′(θ)π(θ)

+βψ(4)(θ)

2(1− β)I2(θ)− β(2 + β/2)(ψ′′′(θ))2

2(1− β)I3(θ)

), (4–23)

and

d2

dθ2

[(2π

nI(θ)

)β2 1√

1− β

β

2(1− β)I(θ)

]= − d

[(2π

nI(θ)

)β2 1√

1− β

β(1 + β/2)ψ′′′(θ)2(1− β)I2(θ)

]

=

(2π

nI(θ)

)β2 1√

1− β

((ψ′′′(θ))2β(1 + β/2)(2 + β/2)

2(1− β)I3(θ)− ψ(4)(θ)β(1 + β/2)

2(1− β)I2(θ)

),

(4–24)

From (4–22) - (4–24), we get

[π−β(θ| x)

]=

(2π

nI(θ)

)β2 1√

1− β

[1 +

1

n

(− ψ′′′(θ)β

(1− β)I2(θ)

π′(θ)π(θ)

+(ψ′′′(θ))2β(3β2 + 7β + 10)

24(1− β)I3(θ)− β

2I(θ)

(π′(θ)π(θ)

)2

+β(2− β)

2(1− β)I(θ)

π′′(θ)π(θ)

− ψ(4)(θ)β(2 + β)

8(1− β)I2(θ)

)]+ o(n−1−β/2). (4–25)

Page 68: DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION …ufdcimages.uflib.ufl.edu/UF/E0/01/56/78/00001/mergel_v.pdf · Chair: Malay Ghosh Major Department: Statistics In this dissertation,

61

This completes the proof ¥.

In view of Theorem 4.2.0.13, for β < 1 and β 6= 0 or −1, one has

Rβ(π) =

1− (2πn

)β/2(1− β)−

12

∫ π(θ)

I12 (θ)

β

π(θ) dθ

β(1− β)+ o(1). (4–26)

Thus the first order approximation to Rβ(π) is given by

1− (2πn

)β/2(1− β)−

12

∫ π(θ)

I12 (θ)

β

π(θ) dθ

β(1− β).

We want to maximize this expression with respect to π(θ) subject to∫

π(θ) dθ = 1.

We will show that Jeffreys’ prior assymptotically maximizes Rβ(π) when

|β| < 1.

To do this we will use Holders inequality as follow

Holders inequality for positive exponents ([39], p.190) Let p, q > 1 be

real numbers satisfying 1/p + 1/q = 1. Let f ∈ Lp, g ∈ Lq. Then fg ∈ L1 and

∫|fg| dµ ≤

(∫|f |p dµ

) 1p(∫

|g|q dµ

) 1q

(4–27)

with equlity iff |f |p ∝ |g|q.Holders inequality for negative exponents ([39], p.191) Let 0 < q < 1 and

p ∈ R be such that 1/p + 1/q = 1 (hence p < 0). If f, g are measurable functions

then ∫|fg| dµ ≥

(∫|f |p dµ

) 1p(∫

|g|q dµ

) 1q

(4–28)

with equlity iff |f |p ∝ |g|q.First we will consider 0 < β < 1. In this case it is enough to minimize

∫π1+β(θ)

(I−

12 (θ)

dθ.

Page 69: DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION …ufdcimages.uflib.ufl.edu/UF/E0/01/56/78/00001/mergel_v.pdf · Chair: Malay Ghosh Major Department: Statistics In this dissertation,

62

From Holders inequality for positive exponents with

p = 1 + β, q =1 + β

β,

g(θ) =(I

12 (θ)

) β1+β

and

f(θ) = π(θ)(I−

12 (θ)

) β1+β

,

we can write

(∫|f |p dµ

) 1p

=

∫π1+β(θ)

(I−

12 (θ)

≥ 1(∫I

12 (θ) dθ

with ”=” iff

π(θ) ∝ I12 (θ).

Next, consider case when −1 < β < 0. In this case assymptotical maximization

of Rβ(π) is equivalent to maximization of

∫π1+β(θ)

(I

12 (θ)

)−β

dθ.

From Holders inequality for positive exponents with

p =1

1 + β, q =

1

−β,

f(θ) = π1+β(θ)

and

g(θ) =(I

12 (θ)

)−β

we obtain ∫π1+β(θ)

(I

12 (θ)

)−β

dθ ≤(∫

I12 (θ) dθ

)−β

with ”=” iff

π(θ) ∝ I12 (θ).

Page 70: DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION …ufdcimages.uflib.ufl.edu/UF/E0/01/56/78/00001/mergel_v.pdf · Chair: Malay Ghosh Major Department: Statistics In this dissertation,

63

When β < −1 using Holders inequality for negative exponents with

p =1

1 + β< 0, 0 < q =

1

−β< 1,

f(θ) = π1+β(θ)

and

g(θ) =(I

12 (θ)

)−β

we obtain ∫π1+β(θ)

(I

12 (θ)

)−β

dθ ≥(∫

I12 (θ) dθ

)−β

with ”=” iff

π(θ) ∝ I12 (θ).

Unfortunately, in this case, Jeffreys’ prior is a minimizer and since it is the only

solution for corresponding Euler-Lagrange equation, there is no proper prior that

will asymptotically maximize Rβ(π).

There are only two cases left β = 0, which corresponds to KL loss and β = −1,

which corresponds to chi-square loss.

As β → 0+, one obtains the expression due to Clarke and Barron (1990, 1994)

[20], [21], namely

R0(π) =1

2log

n

2πe−

∫π(θ) log

π(θ)

I1/2(θ)dθ + o(1)

which is maximized when∫

π(θ) log π(θ)

I1/2(θ)dθ = 0, i.e. π(θ) = I1/2(θ), once again

leading to Jeffreys’ prior.

The only exception is β = −1, the chi-square distance as considered in Clarke

and Sun (1997) [22]. In this case πβ+1(θ) = 1 so that the first order term as

obtained from Theorem 4.2.0.13 is a constant, and one needs to consider the second

order term. In this case,

Page 71: DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION …ufdcimages.uflib.ufl.edu/UF/E0/01/56/78/00001/mergel_v.pdf · Chair: Malay Ghosh Major Department: Statistics In this dissertation,

64

1+2R−1(π) =

∫ (2π

nI(θ)

)− 12 1√

2

[1 +

1

n

(ψ′′′(θ)2I2(θ)

π′(θ)π(θ)

−(ψ′′′(θ))2

8I3(θ)+

1

2I(θ)

(π′(θ)π(θ)

)2

− 3

4I(θ)

π′′(θ)π(θ)

+ψ(4)(θ)

16I2(θ)

)]dθ + o(n−1/2). (4–29)

Reference prior is obtained by maximizing the expected chi-square distance

between prior distribution and corresponding posterior. By (4–29), this amounts to

maximizing the following integral with respect to prior π(θ):

∫ (ψ′′′(θ)I3/2(θ)

π′(θ)π(θ)

+1

I1/2(θ)

(π′(θ)π(θ)

)2

− 3

2I1/2(θ)

π′′(θ)π(θ)

)dθ. (4–30)

To simplify this further, we will use the substitution

y(θ) =π′(θ)π(θ)

so that (4–30) reduces to

∫ (ψ′′′(θ)I3/2(θ)

y(θ)− 1

2I1/2(θ)y2(θ)− 3

2I1/2(θ)y′(θ)

)dθ. (4–31)

Maximizing the last expression with respect to y(θ) and noting that ψ′′′(θ) =

I ′(θ), one gets Euler-Lagrange equation:

∂L

∂y− d

(∂L

∂y′

)

with L the functional under integral sign in (4–31).

The last expression is equivalent to

I ′(θ)4I3/2(θ)

− y(θ)

I1/2(θ)= 0.

Solving, we get

y(θ) =I ′(θ)4I(θ)

,

Page 72: DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION …ufdcimages.uflib.ufl.edu/UF/E0/01/56/78/00001/mergel_v.pdf · Chair: Malay Ghosh Major Department: Statistics In this dissertation,

65

thereby producing the reference prior

π(θ) ∝ I1/4(θ). (4–32)

Page 73: DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION …ufdcimages.uflib.ufl.edu/UF/E0/01/56/78/00001/mergel_v.pdf · Chair: Malay Ghosh Major Department: Statistics In this dissertation,

CHAPTER 5SUMMARY AND FUTURE RESEARCH

5.1 Summary

This dissertation revisits the problem of simultaneous estimation of normal

means. It is shown that a general class of shrinkage estimators as introduced

by Baranchik [5] dominates the sample mean in three or higher dimensions

under a general divergence loss which includes the Kullback-Leibler (KL) and

Bhattacharyya-Hellinger (BH) losses ([13]; [38]) as special cases. An analogous

result is found for estimating the predictive density of a normal variable with

the same mean and a known but possibly different scalar multiple of the identity

matrix as its variance. The results are extended to accommodate shrinkage towards

a regression surface.

These results are extended to the estimation of the multivariate normal

mean with an unknown variance-covariance matrix. First, it is shown that for an

unknown scalar multiple of the identity matrix as the variance-covariance matrix,

a general class of estimators along the lines of Baranchik [5] and Efron and Morris

[30] continues to dominate the sample mean in three or higher dimensions. Second

it is shown that even for an unknown positive definite variance-covariance matrix,

the dominance continues to hold for a general class of a suitably defined shrinkage

estimators.

Also the problem of prior selection for an estimation problem is considered. It

is shown that the first order reference prior under divergence loss coincides with

Jeffreys’ prior.

5.2 Future Research

The following is a list of future research problems:

66

Page 74: DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION …ufdcimages.uflib.ufl.edu/UF/E0/01/56/78/00001/mergel_v.pdf · Chair: Malay Ghosh Major Department: Statistics In this dissertation,

67

• The admissibility of MLE under Divergence loss is an open question when

p = 2. It is conjectured that the MLE is admissible but the proofs under

squared error loss are difficult to use for the Divergence loss.

• Another important problem is to find an admissible class of estimators of the

multivariate normal mean under general divergence loss.

• Extend the results for simultaneous estimation problem with an unknown

variance-covariance matrix to prediction problems.

• Find the link identity similar to those of George et al.[34] between estimation

and prediction problems if such an identity exists.

• Explain the Stein phenomenon using differential-geometric methods on

statistical manifolds as in Amari [2].

Page 75: DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION …ufdcimages.uflib.ufl.edu/UF/E0/01/56/78/00001/mergel_v.pdf · Chair: Malay Ghosh Major Department: Statistics In this dissertation,

REFERENCES

[1] Aitchison, J. (1975). Goodness of prediction fit. Biometrika 62 547-554.

[2] Amari, S. (1982). Differential geometry of curved exponential families -curvatures and information loss. Ann. Statist. 10 357-387.

[3] Arnold, S.F. (1981). The theory of linear models and multivariate analysis.John Wiley & Sons, New York.

[4] Anderson, T.W. (1984). An introduction to multivariate statisticalanalysis. 2nd ed. John Wiley & Sons, New York.

[5] Baranchick, A.J. (1970). A family of minimax estimators of the mean of amultivariate normal distribution. Ann. Math. Statist. 41 642-645.

[6] Bayes, T.R. (1763). An essay towards solving a problem in the doctrineof chances. Phylosophical Transactions of the Royal Society 53 370-418.Reprinted in Biometrika 45 243-315, 1958.

[7] Berger, J.O. (1975). Minimax estimation of location vectors for a wideclass of densities. Ann. Statist. 3 1318-1328.

[8] Berger, J.O. (1985). Statistical Decision Theory and Bayesian Analysis(2nd edition). Springer-Verlag, New York.

[9] Berger, J.O. and Benardo, J.M. (1989). Estimating a product ofmeans: Bayesian analysis with reference priors. J. Amer. Statist. Assoc. 84200-207.

[10] Berger, J.O. and Benardo, J.M. (1992a). Reference priors in a variancecomponents problem. In Bayesian Analysis in Statistics and Econometrics(P.K. Goel and N.S. Lyengar eds.) 177-194. Springer-Verlag, New York.

[11] Berger, J.O. and Benardo, J.M. (1992b). On the development ofreference priors (with discussion). In Bayes Statistics 4 (J.M. Benardo et aleds.) 35-60. Oxford Univ. Press, London.

[12] Benardo, J.M. (1979). Reference posterior distributions for Bayesianinference. J. Roy. Statist. Soc. B 41 113-147.

[13] Bhattacharyya, A.K. (1943). On a measure of divergence betweentwo statistical populations defined by their probability distributions. Bull.Calcutta Math. Soc. 35 99-109.

68

Page 76: DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION …ufdcimages.uflib.ufl.edu/UF/E0/01/56/78/00001/mergel_v.pdf · Chair: Malay Ghosh Major Department: Statistics In this dissertation,

69

[14] Blyth, C.R. (1951). On minimax statistical decision procedures and theiradmissibility. Ann. Math. Statist. 22 22-42.

[15] Bock, M.E. (1975). Minimax estimators of the mean of a multivariatenormal distribution. Ann. Statist. 3 209-218.

[16] Bolza, O. (1904). Lectures on the Calculus of Variations. Univ. ChicagoPress, Chicago.

[17] Brown, L.D. (1966). On the admissibility of invariant estimator of one ormore location parameters. Ann. Math. Stat. 38 1087-1136.

[18] Brown, L.D. (1971). Admissible estimators, recurrent diffusions andinsoluble boundary value problems. Ann. Math. Statist. 42 855-903.

[19] Brown, L.D. and Hwang, J.T. (1982). A unified admissibility proof.Statistical Decision Theory and related topics Academic Press, New York, 3205-267.

[20] Clarke, B. and Barron, A. (1990). Information-theoretic asymptotics ofBayes methods. IEEE Trans. Inform. Theory 36 453-471.

[21] Clarke, B. and Barron, A. (1994). Jeffreys’ prior is asymptotically leastfavorable under entropy risk. J. Statist. Plann. Infer. 41 37-60.

[22] Clarke, B. and Sun, D. (1997). Reference priors under the chi-squaredistance. Sankhya, Ser.A 59 215-231.

[23] Cressie, N. and Read, T. R. C. (1984). Multinomial Goodness-of-FitTests. J. Roy. Statist. Soc. B 46 440-464.

[24] Datta, G.S. and Mukerjee, R. (2004). Probability matching priors:higher order asymptotics. Springer, New York.

[25] Dawid, A.P. (1983). Invariant Prior Distributions. in Encyclopedia ofStatistical Sciences eds. Kotz, S. and Johnson, N.L. New York: John Wiley,228-236.

[26] Dawid, A.P., Stone, N. and Zidek, J.V. (1973). Marginalizationparadoxes in Bayesian and structural inference (with discussion). J. Roy.Statist. Soc. B 35 189-233.

[27] Efron, B. and Morris, C. (1972). Limiting the risk of Bayes andempirical Bayes estimators - Part II: The empirical Bayes case. J. Amer.Statist. Assoc. 67 130-139.

[28] Efron, B. and Morris, C. (1973). Stein’s estimation rule and itscompetitors - an empirical Bayes approach. J. Amer. Statist. Assoc. 68117-130.

Page 77: DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION …ufdcimages.uflib.ufl.edu/UF/E0/01/56/78/00001/mergel_v.pdf · Chair: Malay Ghosh Major Department: Statistics In this dissertation,

70

[29] Efron, B. and Morris, C. (1975). Data analysis using Stein’s estimatorand its generalizations. J. Amer. Statist. Assoc. 70 311-319.

[30] Efron, B. and Morris, C. (1976). Families of minimax estimators of themean of a multivariate normal distribution. Ann. Statist. 4 11-21.

[31] Faith, R.E. (1978). Minimax Bayes and point estimations of a multivariatenormal mean. J. Mult. Anal. 8 372-379.

[32] Fisher, R.A. (1922). On the mathematical foundations of theoreticalstatistics. Phylosophical Transactions of the Royal Society of London, ser.A,222 309-368.

[33] Fourdrinier, D., Strawderman, W.E. and Wells, M.T. (1998). Onthe construction of Bayes minimax estimators. Ann. Statist. 26 660-671.

[34] George, E.I., Liang, F. and Xu, X. (2006). Improved minimaxpredictive dencities under Kullbak-Leibler loss. Ann. Statist. 34 78-92.

[35] Ghosh, M. (1992). Hierarchical and Empirical Bayes Multivariateestimation. Current Issues in Statistical Inference: Essays in Honor of D.Basu, Ghosh, M. and Pathak, P.K. eds., Institute of Mathematical StatisticsLecture Notes and Monograph Series, 17 151-177.

[36] Ghosh, J.K. and Mukerjee, R. (1991). Characterization of priors underwhich Bayesian and Barlett corrections are equivalent in the multiparametercase. J. Mult. Anal. 38 385-393.

[37] Hartigan, J.A. (1964). Invariant Prior Distributions. Ann. Math. Statist.35 836-845.

[38] Hellinger, E. (1909). Neue Begrundung der Theorie quadratischen Formenvon unendlichen vielen Veranderlichen. Journal fur Reine und AngewandteMathematik 136 210271.

[39] Hewitt, E. and Stromberg, K. (1969). Real and Abstract Analysis. AModern Treatment of the Theory of Functions of a Real Variable. secondprinting corrected, Springer-Verlag, Berlin.

[40] Hodges, J.L. and Lehmann, E.L. (1950). Some problems in minimaxpoint estimation. Ann. Math. Statist. 21 182-197.

[41] Hwang, J.T. and Casella, G. (1982). Minimax confidence sets for themean of a multivariate normal distribution. Ann. Statist. 10 868-881.

[42] James, R. and Stein, C. (1961). Estimation with quadratic loss. Pro-ceedings of the Fourth Berkeley Symposium on Mathematical Statistics andProbability University of California Press, 1 361-380.

Page 78: DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION …ufdcimages.uflib.ufl.edu/UF/E0/01/56/78/00001/mergel_v.pdf · Chair: Malay Ghosh Major Department: Statistics In this dissertation,

71

[43] Jaynes, E.T. (1968). Prior probabilities. IEEE Transactions on SystemsScience and Cybernetics SSC-4 227-241.

[44] Jeffreys, H. (1961). Theory of Probability. (3rd edition.) London: OxfordUniversity Press.

[45] Komaki, F. (2001). A shrinkage predictive distribution for multivariatenormal observations. Biometrika 88 859-864.

[46] Kullback, S. and Liebler, R.A. (1951). On information and sufficiency.Ann. Math. Statist. 22 525-540.

[47] Lehmann, E.L. (1986). Testing Statistical Hypotheses. (2nd edition). J.Wiley, New York.

[48] Lehmann, E.L. and Casella, G. (1998). Theory of Point Estimation.(2nd edition). Springer-Verlag, New York.

[49] Liang, F. (2002). Exact Minimax Strategies for predictive density estimationand data. Ph.D. dissertation, Dept. Statistics, Yale Univ.

[50] Lindley, D.V. (1962). Discussions of Professor Stein’s paper ’Confidencesets for the mean of a multivariate distribution’. J. Roy. Statist. Soc. B 24265-296.

[51] Lindley, D.V. and Smith, A.F.M. (1972). Bayes estimates for the linearmodel. J. Roy. Statist. Soc. B 34 1-41.

[52] Morris, C. (1981). Parametric empirical Bayes confidence intervals.Scientific Inference, Data Analysis, and Robustness. eds. Box, G.E.P.,Leonard, T. and Jeff Wu, C.F. Academic Press, 25-50.

[53] Morris, C. (1983). Parametric empirical Bayes inference and applications.J. Amer. Statist. Assoc. 78 47-65.

[54] Murray, G.D. (1977). A note on the estimation of probability densityfunctions. Biometrika 64 150-152.

[55] Ng, V.M. (1980). On the estimation of parametric density functions.Biometrika 67 505-506.

[56] Robert, C.P. (2001). The Bayesian Choice. (2nd edition). Springer-Verlag,New York.

[57] Rukhin, A.L. (1995). Admissibility: Survey of a concept in progress. Inter.Statist. Review, 63 95-115.

[58] Stein, C. (1955). Inadmissibility of the usual estimator for the meanof a multivariate normal distribution. Proceedings of the Third Berkeley

Page 79: DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION …ufdcimages.uflib.ufl.edu/UF/E0/01/56/78/00001/mergel_v.pdf · Chair: Malay Ghosh Major Department: Statistics In this dissertation,

72

Symposium on Mathematical Statistics and Probability Berkeley and LosAngeles, University of California Press, 197-206.

[59] Stein, C. (1974). Estimation of the mean of a multivariate normaldistribution. Proceedings of the Prague Symposium on Asymptotic Statisticsed. Hajek, J. Prague, Universita Karlova, 345-381.

[60] Strawderman, W.E. (1971). Proper Bayes minimax estimators of themultivariate normal mean. Ann. Math. Statist. 42 385-388.

[61] Strawderman, W. E. (1972). On the existence of proper Bayes minimaxestimators of the mean of a multivariate distribution. Proceedings of theSixth Berkeley Symposium on Mathematical Statistics and ProbabilityBerkeley and Los Angeles, University of California Press, 6 51-55.

Page 80: DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION …ufdcimages.uflib.ufl.edu/UF/E0/01/56/78/00001/mergel_v.pdf · Chair: Malay Ghosh Major Department: Statistics In this dissertation,

BIOGRAPHICAL SKETCH

The author was born in Korukivka, Ukraine in 1973. He received the Specialist

and Candidate of Science degrees in Probability Theory and Statistics from Kiev

National University of Taras Shevchenko in 1997 and 2001 respectively. In 2001 he

came to UF to pursue Ph.D. degree in Department of Statistics.

73