divergence loss for shrinkage estimation, prediction...
TRANSCRIPT
DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION ANDPRIOR SELECTION
By
VICTOR MERGEL
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOLOF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OFDOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2006
Copyright 2006
by
Victor Mergel
ACKNOWLEDGMENTS
I would like to express my sincere gratitude to my advisor Dr. Malay
Ghosh for his support and professional guidance. Working with him was not
only enjoyable but also very valuable personal experience.
I would also like to thank Michael Daniels, Panos M. Pardalos, Brett Presnell,
and Ronald Randles, for their careful reading and extensive comments of this
dissertation.
iii
TABLE OF CONTENTS
page
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
CHAPTER
1 INTRODUCTION AND LITERATURE REVIEW . . . . . . . . . . . . 1
1.1 Statistical Decision Theory . . . . . . . . . . . . . . . . . . . . . . . 11.2 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Point Estimation of the Multivariate Normal Mean . . . . . 21.2.2 Shrinkage towards Regression Surfaces . . . . . . . . . . . . 101.2.3 Baranchik Class of Estimators Dominating the Sample Mean 12
1.3 Shrinkage Predictive Distribution for the Multivariate Normal Density 131.3.1 Shrinkage of Predictive Distribution . . . . . . . . . . . . . 131.3.2 Minimax Shrinkage towards Points or Subspaces . . . . . . . 17
1.4 Prior Selection Methods and Shrinkage Argument . . . . . . . . . . 181.4.1 Prior Selection . . . . . . . . . . . . . . . . . . . . . . . . . . 181.4.2 Shrinkage Argument . . . . . . . . . . . . . . . . . . . . . . 20
2 ESTIMATION, PREDICTION AND THE STEIN PHENOMENON UNDERDIVERGENCE LOSS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.1 Some Preliminary Results . . . . . . . . . . . . . . . . . . . . . . . 222.2 Minimaxity Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 272.3 Admissibility for p = 1 . . . . . . . . . . . . . . . . . . . . . . . . . 302.4 Inadmissibility Results for p ≥ 3 . . . . . . . . . . . . . . . . . . . . 312.5 Lindley’s Estimator and Shrinkage to Regeression Surface . . . . . . 40
3 POINT ESTIMATION UNDER DIVERGENCE LOSS WHEN VARIANCECOVARIANCE MATRIX IS UNKNOWN . . . . . . . . . . . . . . . . . 43
3.1 Preliminary Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 433.2 Inadmissibility Results when Variance-Covariance Matrix is Proportional
to Identity Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443.3 Unknown Positive Definite Variance-Covariance Matrix . . . . . . . 49
4 REFERENCE PRIORS UNDER DIVERGENCE LOSS . . . . . . . . . 53
4.1 First Order Reference Prior under Divergence Loss . . . . . . . . . 53
iv
4.2 Reference Prior Selection under Divergence Loss for One ParameterExponential Family . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5 SUMMARY AND FUTURE RESEARCH . . . . . . . . . . . . . . . . . 66
5.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 665.2 Future Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
v
Abstract of Dissertation Presented to the Graduate Schoolof the University of Florida in Partial Fulfillment of theRequirements for the Degree of Doctor of Philosophy
DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION ANDPRIOR SELECTION
By
Victor Mergel
August 2006
Chair: Malay GhoshMajor Department: Statistics
In this dissertation, we consider the following problems: (1) estimate a normal
mean under a general divergence loss and (2) find a predictive density of a new
observation drawn independently of the sampled observations from a normal
distribution with the same mean but possibly with a different variance under
the same loss. The general divergence loss includes as special cases both the
Kullback-Leibler and Bhattacharyya- Hellinger losses. The sample mean, which is a
Bayes estimator of the population mean under this loss and the improper uniform
prior, is shown to be minimax in any arbitrary dimension. A counterpart of this
result for the predictive density is also proved in any arbitrary dimension. The
admissibility of these rules holds in one dimension, and we conjecture that the
result is true in two dimensions as well. However, the general Baranchik class of
estimators, which includes the James-Stein estimator and the Strawderman class
of estimators, dominates the sample mean in three or higher dimensions for the
estimation problem. An analogous class of predictive densities is defined and any
member of this class is shown to dominate the predictive density corresponding
to a uniform prior in three or higher dimensions. For the prediction problem,
vi
in the special case of Kullback-Leibler loss, our results complement to a certain
extent some of the recent important work of Komaki and George et al. While our
proposed approach produces a general class of empirical Bayes predictive densities
dominating the predictive density under a uniform prior, George et al. produce a
general class of Bayes predictors achieving a similar dominance. We show also that
various modifications of the James-Stein estimator continue to dominate the sample
mean, and by the duality of the estimation and predictive density results which we
will show, similar results continue to hold for the prediction problem as well. In
the last chapter we consider the problem of objective prior selection by maximizing
the distance between the prior and the posterior. We show that the reference prior
under divergence loss coincides with Jeffreys’ prior except in one special case.
vii
CHAPTER 1INTRODUCTION AND LITERATURE REVIEW
1.1 Statistical Decision Theory
Statistical Decision Theory primarily consists of three basic elements:
• the sample space X ,
• the parameter space Θ,
and
• the action space, A.
We assume that an unknown element θ ∈ Θ labels the otherwise known
distribution. We are concerned with inferential procedures for θ using the sampled
observations x (real or vector valued).
A decision rule δ is a function with domain space X , and a range space A.
Thus, for each x ∈ X we have an action a = δ(x) ∈ A. For every θ ∈ Θ and
δ(x) ∈ A, we incur a loss L(θ, δ(x)). The long-term average loss associated with δ is
the expectation Eθ[L(θ, δ(X))] and this expectation is called the risk function of δ
and will be denoted as R(θ, δ).
Since the risk function depends on the unknown parameter θ, it is very often
impossible to find a decision rule that is optimal for every θ. Thus the statistician
needs to restrict decision rules such as Bayes, Minimax and admissible rules.
The method required to solve the statistical problem at hand strongly depends
on the parametric model considered (the class P = Pθ, θ ∈ Ω in which the
distribution of X belongs), structure of decision space and choice of loss function.
The choice of the decision space often depends on the statistical problem at
hand. For example, two-decision problems are used in testing of hypotheses; for
point estimation problems decision space often coincides with parameter space.
1
2
The choice of the loss function is up to the decision maker, and it is supposed
to evaluate the penalty (or error) associated with the decision δ when the
parameter takes the value θ. When the setting of an experiment is such that loss
function cannot be determined, the most common option is to resort to classical
losses such as quadratic loss or absolute error loss. Sometimes, the experiment
settings are very uninformative and decision maker may need to use an intrinsic
loss such as general divergence loss as considered in this dissertation. This is
discussed for example in [56].
In this dissertation, we mostly look at the point estimation problem of the
multivariate normal mean under general divergence loss and we consider also the
prediction problem, where we are interested in estimating the density function
f(x| θ) itself.
In multidimensional settings, for dimensions high enough, the best invariant
estimator is not always admissible. There often exists a class of estimators that
dominates the intuitive choice. For quadratic loss this effect was first discovered by
Stein [58]. In this dissertation, we consider the estimation and prediction problems
simultaneously under a broader class of losses to examine whether the Stein effect
continues to hold.
Since many results for estimation and prediction problems for the multivariate
normal distribution, as considered in this dissertation, seem to have some inherent
similarities with the parallel theory of estimating a multivariate normal mean under
the quadratic loss, we will begin with a literature review of known results.
1.2 Literature Review
1.2.1 Point Estimation of the Multivariate Normal Mean
Suppose X ∼ N(θ, Ip). For estimating the unknown normal mean θ under
quadratic loss, the best equivariant estimator is X, the MLE which is also the
posterior mean under improper uniform prior see [56] pp.429-431. Blyth, see [14],
3
showed that this estimator is minimax and admissible when p = 1. Unfortunately,
this estimator may fail to be admissible in multidimensional problems. For
simultaneous estimation of p (≥ 2) normal means, this natural estimator is
admissible for p = 2, but it is inadmissible for p ≥ 3 for a wide class of losses.
This fact was first discovered by [58] for the sum of squared error losses, i.e. when
L(θ,a) =‖ θ − a ‖2. The inadmissibility result was later extended by Brown [17]
to a wider class of losses. For the sum of squared error losses, an explicit estimator
dominating the sample mean was proposed by James and Stein [42].
For estimating the multivariate normal mean, Stein [58] recommended using
”spherically symmetric estimators” of θ since, under the loss L(θ, a) =‖ θ − a ‖2,
X is an admissible estimator of θ if and only if it is admissible in the class of
all spherically symmetric estimators. The definition of spherically symmetric
estimators is as follows.
Definition 1.2.1.1. An estimator δ(X) of θ is said to be spherically symmetric if
and only if δ(X) has the form δ(X) = h(‖X‖2)X
Stein used this result and the Cramer-Rao inequality to prove admissibility
of X for p = 2. Later Brown and Hwang [19] provided a Blyth type argument for
proving the same result.
As mentioned earlier, X is a generalized Bayes estimator of θ ∈ Rp under the
loss L(θ,a) =‖ θ − a ‖2 and the uniform prior. Stein [58] showed the existence of
a, b such that
(1− b
a+‖X‖2
)X dominates X for p ≥ 3. Later James and Stein
[42] have produced the explicit estimator
δ(X) =
(1− p− 2
‖X‖2
)X
which dominates X for p ≥ 3.
Efron and Morris [28] show how James-Stein estimators arise in an empirical
Bayes context. A good review of empirical Bayes (EB) and hierarchical Bayes (HB)
4
approaches could be found in [35]. As described in [8], an EB scenario is one in
which known relationships among the coordinates of a parameter vector allow use
of the data to estimate some features of the prior distribution. Both EB and HB
procedures recognize the uncertainty in the prior information. However, while the
EB method estimates the unknown prior parameters in some classical way like the
MLE or the method of moments from the marginal distributions (after integrating
θ out) of observations, the HB procedure models the prior distribution in stages.
To ilustrate this, we begin with the following setup.
A Conditional on θ1, . . . , θp, let X1, . . . , Xp be independent with
Xi ∼ N(θi, σ2), i = 1, . . . , p, σ2(> 0) being known. Without loss of generality,
assume σ2 = 1.
B The θi’s have independent N(µi, A), i = 1, . . . , p priors.
The posterior distribution of θ given X = x is then
N((1−B)x + Bµ, (1−B)Ip),
where B = (1 + A)−1. The posterior mean (the usual Bayes estimate) of θ is given
by
E(θ|X = x) = (1−B)x + Bµ. (1–1)
Now consider the following three scenarios.
Case I. Let µ1 = . . . = µp = µ, where µ (real) is unknown, but A (> 0) is known.
Based on the marginal distribution of X, X is the UMVUE, MLE and the best
equivariant estimator of µ. Thus using EB approach, an EB estimator of θ is:
θ(1)
EB = (1−B)X + BX1p. (1–2)
This estimator was proposed in Lindley and Smith (1972) but they have used
HB approach. Their model was:
5
(i) conditional on θ and µ, X ∼ N(θ, Ip);
(ii) conditional on µ, θ ∼ N(µ1p, AIp);
(iii) µ is uniform on (−∞,∞).
Then the joint pdf of X, θ and µ is given by
f(x,θ, µ) ∝ exp
[−1
2‖ x− θ ‖2
]A− 1
2p exp
[− 1
2A‖ θ − µ1p ‖2
]. (1–3)
Thus joint pdf of X and θ is
f(x,θ) ∝ exp
[−1
2(θT D − 2θT x + xT x)
], (1–4)
and posterior distribution of θ given X = x is N(D−1x,D−1), where
D = A−1[(A + 1)Ip − p−1Jp ].
Thus one gets
E(θ|X = x) = (1−B)x + Bx1p , (1–5)
and
V (θ|X = x) = (1−B)Ip + Bp−1Jp , (1–6)
which gives the same estimator as under the EB approach. But the EB approach
ignores the uncertainty involved in estimating the prior parameters, and thus
underestimates the posterior variance.
Lindley and Smith [51] have shown that the risk of θ(1)
EB is not uniformly
smaller than that of X under squared error loss. However, there is a Bayes risk
superiority of θ(1)
EB over X as it is shown in the following theorem of Ghosh [35]:
Theorem 1.2.1.2. Consider the model X|θ ∼ N(θ, Ip) and the prior
θ ∼ N(µ1p, AIp). Let E denote expectation over the joint distribution of X and θ.
Then, assuming the loss L1(θ, a) = (a − θ)(a − θ)T , and writing θB as the Bayes
estimator of θ under L1,
EL1(θ, X) = Ip ; EL1(θ, θB) = (1−B)Ip ; (1–7)
6
EL1(θ, θ(1)
EB) = (1−B)Ip + Bp−1Jp . (1–8)
Now assuming the quadratic loss L2(θ,a) = (a − θ)T Q(a − θ), where Q is a
known non-negative definite weight matrix,
EL2(θ, X) = tr(Q); EL2(θ, θB) = (1−B)tr(Q); (1–9)
EL2(θ, θ(1)
EB) = (1−B)tr(Q) + Btr(Qp−1Jp). (1–10)
Case II. Assume that µ is known but its components need not to be equal. Also
assume A to be unknown. ‖X − µ‖2 ∼ B−1χ2p is complete sufficient statistics.
Accordingly, for p ≥ 3, the UMVUE of B is given by (p− 2)/‖X − µ‖2.
Substituting this estimator of B, an EB estimator of θ is given by
θ(2)
EB = X − p− 2
‖X − µ‖2(X − µ). (1–11)
This estimator is known as a James-Stein estimator (see [42]). The EB
interpretation of this estimator was given in a series of articles by Efron and Morris
([27], [28], [29]).
James and Stein have shown that for p ≥ 3, the risk of θ(2)
EB is smaller than
that of X under the squared error loss. However, if the loss is changed to the
arbitrary quadratic loss L2 of the previous theorem, then the risk dominance of
θ(2)
EB over X does not necessarily hold. The θ(2)
EB dominates X under the loss L2
([8]; [15]) if
(i) tr(Q) > 2ch1(Q) and
(ii) 0 < p− 2 < 2[tr(Q)/ch1(Q)− 2] where ch1(Q) denotes the largest eigenvalue of
Q.
The Bayes risk dominance however still holds which follows from the following
(see [35]).
7
Theorem 1.2.1.3. Let X|θ ∼ N(θ, Ip) and θ ∼ N(µ1p, AIp). It follows that for
p ≥ 3,
E[L1(θ, θ(2)
EB)] = Ip −B(p− 2)p−1Ip , (1–12)
and
E[L2(θ, θ(2)
EB)] = tr(Q)−B(p− 2)p−1tr(Q). (1–13)
Consider the HB approach in this case with X ∼ N(θ, Ip), θ ∼N(µ, AIp) and A has Type II beta density ∝ Am−1(1+A)−(m+n), with m > 0, n > 0.
Using the iterated formula for conditional expectations
θ(2)
HB ≡ E(θ|x) = E(E(θ|B, x)) = (1− ˆB)x +
ˆBµ, (1–14)
where B = (A + 1)−1, and
ˆB =
1∫
0
B12p+n(1−B)m−1 exp
[−1
2B‖x− µ‖2
]dB
÷1∫
0
B12p+n−1(1−B)m−1 exp
[−1
2B‖x− µ‖2
]dB. (1–15)
Strawderman [60] considered the case m = 1, and found sufficient conditions
on n under which the risk of θ(2)
HB is smaller than that of X. His results were
generalized by Faith [31].
When m = 1, the posterior mode of B is
BMO = min((p + 2n− 2)/‖x− µ‖2, 1) (1–16)
This leads to
θ(3)HB = (1− BMO)X + BMOµ, (1–17)
of θ. When n = 0 this estimator will become the positive part James-Stein
estimator, which dominates the usual James-Stein estimator.
8
Case III. We consider the same model as in Case I, except that now µ and A > 0
are both unknown. In this case (X,p∑
i=1
(Xi − X)2) is complete sufficient, so that the
UMVUE’s of µ and B are given by X and (p− 3)/p∑
i=1
(Xi − X)2. The EB estimator
of θ in this case is
θ(3)
EB = X − p− 3p∑
i=1
(Xi − X)2
(X − X1p). (1–18)
This modification of the James-Stein estimator was proposed by Lindley [50].
Whereas, the James-Stein estimator shrinks X toward a specified point, the above
estimator shrinks X towards a hyperplane spanned by 1p.
The estimator θ(3)
EB is known to dominate X for p > 3. Ghosh [35] has found
the Bayes risk of this estimator under the L1 and L2 losses.
Theorem 1.2.1.4. Assume the model and the prior given in Theorem 1.2.1.2 Then
for p ≥ 4,
E[L1(θ, θ(3)
EB)] = Ip −B(p− 3)(p− 1)−1(Ip − p−1Jp) , (1–19)
and
E[L2(θ, θ(3)
EB)] = tr(Q)−B(p− 3)(p− 1)−1tr[Q(Ip − p−1Jp)]. (1–20)
To find the HB estimator of θ in this case consider the model where
(i) conditional on θ, µ and A, X ∼ N(θ, Ip);
(ii) conditional on µ and A, θ ∼ N(µ1p, AIp);
(iii) marginally µ and A are independently distributed with µ uniform on (−∞,∞),
and A has uniform improper pdf on (0,∞).
Under this model, as shown in [52],
E(θ|x) = x− E(B|x)(x− x1p) , (1–21)
9
and
V (θ|x) = V (B|x)(x− x1p)(x− x1p)T + Ip − E(B|x)(Ip − p−1Jp), (1–22)
where
E(B|x) =
1∫
0
B12(p−3) exp
[−1
2B
p∑i=1
(xi − x)2
]dB
÷1∫
0
B12(p−5) exp
[−1
2B
p∑i=1
(xi − x)2
]dB , (1–23)
and
E(B2|x) =
1∫
0
B12(p−1) exp
[−1
2B
p∑i=1
(xi − x)2
]dB
÷1∫
0
B12(p−5) exp
[−1
2B
p∑i=1
(xi − x)2
]dB . (1–24)
Also, one can obtain a positive-part version of Lindley’s estimator by
substituting the posterior mode of B namely min
((p− 5)
/p∑
i=1
(Xi − X)2, 1
)
in 1–1. Morris (1981) suggests approximations to E(B|x) and E(B2|x) involving
replacement of1∫0
by∞∫0
both in the numerator as well as in denominator of 1–23
and 1–24 leading to the following approximations:
E(B|x).= (p− 3)
/p∑
i=1
(xi − x)2
and
E(B2|x).= (p− 1)(p− 3)
/p∑
i=1
(xi − x)2
2
,
so that
V (B|x).= 2(p− 3)
/p∑
i=1
(xi − x)2
2
.
10
Morris [52] points out that the above approximations amount to putting a
uniform prior to A on (−1,∞) which gives the approximation
E(θ|x).= X − p− 3
p∑i=1
(Xi − X)2
(X − X1p) , (1–25)
which is Lindley’s modification of the James-Stein estimator with
V (θ|x).=
2(p− 3)(p∑
i=1
(Xi − X)2
)2 (X − X1p)(X − X1p)T
+ Ip − p− 3p∑
i=1
(Xi − X)2
(Ip − p−1Jp). (1–26)
1.2.2 Shrinkage towards Regression Surfaces
In the previous section, the sample mean was shrunk towards a point or a
subspace spanned by the vector 1p. Ghosh [35] synthesized EB and HB methods to
shrink the sample mean towards an arbitrary regression surface. The HB approach
was discussed in details in [51] with known variance components, while the EB
procedure was disscussed in [53].
The set up considered in [35] is as follow
I Conditional on θ, b and a let X1, . . . , Xp be independent with
Xi ∼ N(θi, Vi), i = 1, . . . , p, where the Vi’s are known positive constants;
II Conditional on b and a, θi’s are independently distributed with
θi ∼ N(zTi b, a), i = 1, . . . , p, where z1, . . . , zp are known regression vectors of
dimension r and b is r × 1.
III B and A are marginally independent with B ∼ uniform(Rp) and
A ∼ uniform(0,∞). We assume that p ≥ r + 3. Also, we write
ZT = (z1, . . . , zp); G = Diag(V1, . . . , Vp) and assume rank(Z) = r.
11
As shown in [35], under this model
E(θi|x) = E[(1− Ui)xi + UizTi b |x] ; (1–27)
V (θi|x) =
V [Ui(xi − zTi b) |x] + E[Vi(1− Ui) + ViUi(1− Ui)z
Ti (ZT DZ)−1zi |x] ; (1–28)
Cov(θi, θj|x) =
Cov[Ui(xi − zTi b), Uj(xj − zT
j b) |x] + E[AUiUjzTi (ZT DZ)−1zj |x] , (1–29)
where Ui = Vi/(A + Vi), b = (ZT DZ)−1(ZT Dx),D = diag(1− u1, . . . , 1− up) with
(i = 1, . . . , p).
Morris [53] has approximated E[θi|x] by xi − ui(xi − zTiˆb), and V [θi|x] by
vi(xi − zTiˆb)2 + Vi(1 − ui)[1 + uiz
Ti (ZT DZ)−1zi], i = 1, . . . , p. In the above
vi = [2/(p− r − 2)]U2i (V + a)÷ (Vi + a), i = 1, . . . , p,
V =∑p
i=1 Vi(Vi + a)−1 ÷ ∑pi=1(Vi + a)−1, D = Diag(1 − u1, . . . , 1 − up), and
ˆb
is obtained from b by substituting the estimator of A. The vi’s are purported to
estimate V (Ui|x)’s.
When V1 = . . . = Vp = V , with u1 = . . . = up = V/(a + V ) = u,
D = (1 − u)Ip,ZT DZ = (1 − u)ZT Z,
ˆb = (ZT Z)−1ZT x = b, then the following
result holds
E(θi|x) = xi − E(U |x)(xi − zTi b) , (1–30)
and
V (θi|x) = V (U |x)(xi − zTi b)2 + V − V E(U |x)(1 − zT
i (ZT Z)−1zi) . (1–31)
12
If one adopts Morris’s approximations, then one estimates E(U |x) by
ˆUV (p − r − 2)/SSE and V (U |x) by [2/(p − r − 2)]
ˆU2 (SSE =
∑pi=1 x2
i −(∑p
i=1 xizi)T
(ZT Z)−1 (∑p
i=1 xizi)).
1.2.3 Baranchik Class of Estimators Dominating the Sample Mean
In some situations when prior information is available and one believes
that unknown θ is close to some known vector µ it makes more sense to shrink
estimator X to µ instead of 0. In such cases Baranchick [5] proposed using a more
general class of shrinkage minimax estimators dominating X. Let S =p∑
i=1
(Xi − µi)2
and φi(X) = − τ(S)S
(Xi − µi) then estimator X + φ(X) will dominate X under
quadratic loss function if the following conditions hold:
(i) 0 < τ(S) < 2(p− 2),
(ii) τ(S) is nondecreasing in S and differentiable in S.
Efron and Morris [30] have slightly widened Baranchick’s class of minimax
estimators. They have proved that the following conditions will guarantee that the
estimator X + φ(X) dominates X under the quadratic loss function:
(i) 0 < τ(S) < 2(p− 2), p > 2,
(ii) τ(S) is differentiable in S,
and
(iii) u(S) = Sp−22 τ(S)
2(p−2)−τ(S)is increasing in S.
Thus the Baranchick class of estimators dominates the best equivariant
estimator. The natural question was if that class had a subclass of admissible
estimators.
Strawderman [60] shows that there exists a subclass in the Baranchick class of
estimators which is proper Bayes with respect to the following class of two stage
priors. The prior distribution for θ is constructed as follows.
13
Conditional on A = a,θ ∼ N(0, aIp), while A itself has pdf
g(a) = δ(1 + a)−1−δ, a ≥ 0, δ > 0.
Under this two stage prior, the Bayes estimator of θ has the Baranchick form
(1− τ(S)
S
)X
with
τ(S) = p + 2δ − 2 exp−S2
1∫0
λp2+δ−1 exp−λ
2S dλ
and conditions (i)-(iii) holds for p > 4. When p = 5 and 0 < δ < 12
we will get
class of proper Bayes and thus admissible estimators dominating X. When p > 5
choosing 0 < δ < 1 will lead to proper Bayes class of estimators dominating X.
For p = 3 and 4 Strawderman [61] showed that there do not exist any proper
Bayes estimators dominating X.
1.3 Shrinkage Predictive Distribution for the Multivariate NormalDensity
1.3.1 Shrinkage of Predictive Distribution
When fitting a parametric model or estimating parametric density function,
two most common methods are: estimate the unknown parameter first and
then use this estimator to estimate the unknown density, or try to estimate
the predictive density without directly estimating the unknown parameter. In
situations like this, the first method often fails to provide a good overall fit to the
unknown density even when the estimator of unknown quantity may have optimal
properties itself. The second method seems to produce estimators with smaller risk
then those constructed with plug-in method.
To evaluate the goodness of fit of predictive distribution p(y|x) (where x is
the observed random vector) to the unknown p(y|θ), the most often used measure
14
of divergence is Kullback-Leibler [46] directed measure of divergence
L(θ, p(y|x)) =
∫p(y|θ) log
p(y|θ)
p(y|x)
dy (1–32)
which is nonnegative, and is zero if and only if p(y|x) coincides with p(y|θ).
Then the average loss or risk function of predictive distribution of p(y|x) could be
defined as follows:
RKL(θ, p) =
∫p(x|θ)L(θ, p(y|x)) dx, (1–33)
and under (possibly improper) prior distribution π on θ, the Bayes risk is
r(π, p) =
∫RKL(θ, p)π(θ) dθ. (1–34)
As shown in Atchinson [1] the Bayes predictive density under the prior π is
given by
pπ(y|x) =
∫p(x|θ)p(y|θ)π(θ) dθ∫
p(x|θ)π(θ) dθ, (1–35)
and this density is superior to any p(y|x) as a fit to the class of models.
Let X|θ ∼ N(θ, vxIp) and Y |θ ∼ N(θ, vyIp) be independent p-dimensional
multivariate normal vectors with common unknown mean θ and known variances
vx, vy. As shown by Murray [54] and Ng [55], the best invariant predictive density
in this situation is the constant risk Bayes rule under the uniform prior πU(θ) = 1,
which can be written as
pU(y|x) =1
2π(vx + vy) p2
exp
− ‖y − x‖2
2(vx + vy)
. (1–36)
Although invariance is a frequently used restriction, for point estimation, the
best invariant estimator θ = X of θ is not admissible if dimension of the problem
is p ≥ 3. It is known that the James-Stein estimator dominates the best invariant
estimator θ. Komaki [45] showed that the same effect holds for the prediction
problem. The best predictive density pU(y|x) which is invariant under translation
group is not admissible when p ≥ 3, and is dominated by predictive density under
15
the Stein harmonic prior (see [59])
πH(θ) ∝ ‖θ‖−(p−2). (1–37)
Under this prior, the Bayesian predictive density is given by Komaki [45]
pH(y|x) =
(vx
vy
+ 1
) p−22 φp‖(v−1
y y + v−1x x)(v−1
y + v−1x )−
12‖
φp(‖v−12
x x‖)× 1
2π(vx + vy) p2
exp
− 1
2(vx + vy)‖y − x‖2
, (1–38)
where
φp(u) = u−p+2
12u2∫
0
v12p−2 exp(−v) dv. (1–39)
The harmonic prior is a special case of the Strawderman class of priors given
by
πS(θ) ∝∞∫
0
a−p2 exp
−‖θ‖
2
2a
a−δ−1 da ∝ ‖θ‖−p−2δ
with δ = −1. More recently, Liang [49] showed that pU(y|x) is dominated by the
proper Bayes rule pa(y|x) under the Strawderman prior πa(θ)
θ|s ∼ N(0, sv0Ip), s ∼ (1 + s)a−2, (1–40)
when vx ≤ v0, p = 5 and a ∈ [.5, 1) or p ≥ 6 and a ∈ [0, 1). When a = 2, it is
well known that πH(θ) is a special case of πa(θ). As shown in George, Liang and
Xu [34], this result closely parallels some key developments concerning minimax
estimation of a multivariate normal mean under quadratic loss. As shown in Brown
[17], any Bayes rule θp = E(θ|X) under quadratic loss,has the form
θ = X +∇ log mπ(X). (1–41)
As shown in George, Liang and Xu [34], this result closely parallels some
key developments concerning minimax estimation of a multivariate normal mean
16
under quadratic loss. As shown in Brown [17], any Bayes rule θp = E(θ|X) under
quadratic loss,has the form
θ = X +∇ log mπ(X). (1–42)
Similar representation was proved by George, Liang and Xu [34], for the
predictive density under the KL loss:
pπ(y|x) = pU(y|x)mπ(w; vw)
mπ(x; vx), (1–43)
where
W =vyX + vxY
vx + vy
=v−1
x X + v−1y Y
v−1x + v−1
y
, (1–44)
vw =vxvy
vx + vy
, (1–45)
and mπ(w; vw) is a marginal distribution of W .
Now we present the domination result of George et al. under the KL loss.
Theorem 1.3.1.1. For Z|θ ∼ N(θ, vIp) and a given prior π on θ, let mπ(z; v)
be the marginal distribution of Z. If mπ(z; v) is finite for all z, then pπ(y|x) will
dominate pU(y|x), if any one of the following conditions holds for all v ≤ vx :
(i) ∂∂v
Eθ,vlog mπ(Z; v) < 0,
(ii) ∂∂v
mπ(√
vz + θ; v) ≤ 0 for all θ, with strict inequality on some interval A,
(iii)√
mπ(z; v) is superharmonic with strict inequality on some interval A,
(iv) mπ(z; v) is superharmonic with strict inequality on some interval A,
or
(v) π(θ) is superharmonic.
From the previous theorem and minimaxity of pU(y|x) under the KL loss, the
minimaxity of pπ(y|x) follows.
George et al. [34] also proved the following theorem which is similar to
Theorem 1 of Fourdrinier et al. [33].
17
Theorem 1.3.1.2. If h is a positive function such that
(i) −(s + 1)h′(s)/h(s) can be decomposed as l1(s) + l2(s) where l1 ≤ A is
nondecreasing while 0 < l2 ≤ B with 12A + B ≤ (p− 2)/4,
(ii) lims→∞ h(s)/(s + 1)p/2 = 0.
Then
(i)√
mπ(z; v) is superharmonic for all v ≤ v0.
(ii) the Bayes rule ph(y|x) under prior πh(θ) dominates pU(y|x) and is
minimax when v ≤ v0.
1.3.2 Minimax Shrinkage towards Points or Subspaces
When a prior distribution is centered around 0, minimax Bayes rules pπ(y|x)
yield most risk reduction when θ is close to 0 (see [34]). Recentering the prior π(θ)
around any b ∈ Rp results in πb(θ) = π(θ − b). The marginal mπ
b corresponding to
πb can be directly obtained by recentering the marginal mπ :
mbπ (z; v) = mπ(z − b; v). (1–46)
Such recentered marginals yield predictive distributions
pbπ (y|x) = pU(y|x)mb
π (w; vw)
mbπ (x; vx)
. (1–47)
More generally, in order to recenter a prior π(θ) around (a possibly affine)
subspace B ⊂ Rp, George et al. [34] considered only spherically symmetric in θ
priors recentered as
πB(θ) = π(θ − PBθ), (1–48)
where PBθ = argminb∈B‖θ − b‖ is the projection of θ onto B. Note that the
dimension of θ − PBθ must be taken into account when concidering π. Thus, for
example, recentering the harmonic prior πH(θ) = ‖θ‖−(p−2) around the subspace
18
spanned by 1p yields
πBH(θ) = ‖θ − θ1p‖−(p−3). (1–49)
Recentered priors yields predictive distributions
πBπ (y|x) = pU(y|x)
mBπ (w; vw)
mBπ (x; vx)
. (1–50)
George et al. [34] also considered multiple shrinkage prediction. Using the
mixture prior
π∗(θ) =N∑
i=1
wiπBi(θ), (1–51)
leads to the predictive distribution
p∗(y|x) = pU(y|x)
∑Ni=1 wim
Biπ (w; vw)∑N
i=1 wimBiπ (x; vx)
. (1–52)
1.4 Prior Selection Methods and Shrinkage Argument
1.4.1 Prior Selection
Since Bayes [6] and later since Fisher [32], the idea of Bayesian inference
was debated. The cornerstone of Bayesian analysis, namely prior selection was
criticized for arbitrariness and overwhelming difficulty in the choice of prior. From
the very beginning, when Laplace proposed a uniform prior as a noninformative
prior, its inconsistencies have been found, generating further criticism. This has
given a way to new ideas such as the one of Jeffreys [44] who proposed a prior
which remains invariant under any one-to-one reparametrization. Jeffrey’ prior
was derived as the positive square root of Fisher information matrix. However,
this prior was not an ideal one in the presence of nuisance parameters. Bernardo
[12] has noticed that Jeffreys prior can lead to marginalization paradox (see Dawid
et al. [26]) for inferences about µ/σ when the model is normal with mean µ and
variance σ2. These inconsistencies have led Bernardo [12] and later Berger and
Bernardo ( [9], [10], [11]) to propose uninformative priors known as ”reference”
19
priors. Two basic ideas were used by Bernardo to construct his prior: the idea of
missing information and the stepwise procedure to deal with nuisance parameters.
Without any nuisance parameters, Bernardo’s prior is identical to Jeffreys prior.
The missing information idea makes one to choose the prior which is furthest in
terms of Kullback-Leibler distance from the posterior under this prior and thus
allows the observed sample to change this prior the most.
Another class of reference priors is obtained using the invariance principle
which is attributed to Laplace’s idea of insufficient reasons. Indeed, the simplest
example of invariance involves permutations on a finite set. The only invariant
distribution in this case is the uniform distribution over this set. Laplace’s idea was
generalized as follows. Consider a random variable X from a family of distributions
parameterized by θ. Then if there exists a group of transformations, say ha(θ) on
the parameter space, such that the distribution of Y = ha(X) belongs to the same
family with corresponding parameter ha(θ), then we want the prior distribution for
parameter θ to be invariant under this group of transformations. Good description
of this approach is given in Jaynes [43], Hartigan [37] and Dawid [25].
A somewhat different criteria is based on matching posterior coverage
probability of a Bayesian credible set with the corresponding frequentist coverage
probability. Most often matching is accomplished by matching a) posteriors
quantiles b) highest posterior densities (HPD) regions or c) inversion of test
statistics.
In this dissertation we will find uninformative priors by maximizing divergence
between prior and corresponding posterior distribution asymptotically. To develop
asymptotic expansions we will use so called ”shrinkage argument” suggested
by J.K. Ghosh [36]. This method is particularly suitable for carrying out the
asymptotics, and avoids calculation of multivariate cumulants inherent in any
multidimentional Edgeworth expansion.
20
1.4.2 Shrinkage Argument
We follow the description of Datta and Mukerjee [24] to explain the shrinkage
argument.
Consider a possibly vector-valued random variable X with a probability
density function p(x; θ) with θ ∈ Rp or some open subset thereof. We need to find
an asymptotic expansion for Eθ[h(X; θ)] where h is a joint measurable function.
The following steps describe the Bayesian approach towards the evaluation of
Eθ[h(X; θ)].
Step 1: Consider a proper prior density π(θ) for θ, such that support of π(θ) is a
compact rectangle in the parameter space, and π(θ) vanishes on the boundary of
support while being positive in the interior. Under this prior one obtains posterior
expectation of Eπ[h(X; θ)|X].
Step 2: In this step one finds the following expectation for θ being in the interior
of support of π
λ(θ) = Eθ [Eπ[h(X; θ)|X]] .
Step 3: Integrate λ(θ) with respect to π(θ) and then allow π(θ) converge to the
degenerate prior at the true value of θ supposing that the true value of θ is an
interior point of the support π(θ). This yields Eθ[h(X; θ)].
The rationale behind this process, if integrability of h(X; θ) with respect to
joint probability measure assumed, is as follows.
Note that posterior density of θ under prior π(θ) is given by
p(X; θ)π(θ)
m(X),
where
m(X) =
∫p(X; θ)π(θ) dθ.
21
Hence, in step 1, we will get
Eπ[h(X; θ)|X] = K(X)/m(X),
where
K(X) =
∫h(X; θ)p(X; θ)π(θ) dθ.
Step 2 yields
λ(θ) =
∫K(x)/m(x)p(x; θ) dx.
In step 3 one would get
∫λ(θ)π(θ) dθ =
∫∫K(x)/m(x)p(x; θ)π(θ) dθ dx
=
∫K(x)/m(x)
∫p(x; θ)π(θ) dθ
dx
=
∫K(x) dx =
∫∫h(x; θ)p(x; θ)π(θ) dθ dx
=
∫[Eθ[h(X; θ)]π(θ) dθ.
The last integral gives us desired expectation when π(θ) converge to the degenerate
prior at the true value of θ.
CHAPTER 2ESTIMATION, PREDICTION AND THE STEIN PHENOMENON UNDER
DIVERGENCE LOSS
2.1 Some Preliminary Results
We will start this section with definition of divergence loss. Among others, we
refer to Amari [2] and Cressie and Read [23]. This loss is given by
Lβ(θ,a) =1− ∫
p1−β(x|θ)pβ(x|a)dx
β(1− β). (2–1)
This above loss is to be interpreted as its limit when β → 0 or β → 1. The KL loss
obtains in these two limiting situations. For β = 1/2, the divergence loss is 4 times
the BH loss. Throughout this dissertation, we will perform the calculations with
β ∈ (0, 1), and pass on to the endpoints only in the limit when needed.
Let X and Y be conditionally independent given θ with corresponding
pdf’s p(x|θ) and p(y|θ). We begin with a general expression for the predictive
density of Y based on X under the divergence loss and a prior pdf π(θ), possibly
improper. Under the KL loss and the prior pdf π(θ), the predictive density of Y is
given by
πKL(y|x) =
∫p(y|θ)π(θ|x) dθ,
where π(θ|x) is the posterior of θ based on X = x see [1]. The predictive density
is proper if and only if the posterior pdf is proper. We now provide a similar result
based on the general divergence loss which includes the previous result of Aitchison
as a special case when β → 0.
22
23
Lemma 2.1.0.1. Under the divergence loss and the prior π, the Bayes predictive
density of Y is given by
πD(y|x) = k1
1−β (y,x)
/∫k
11−β (y,x) dy , (2–2)
where k(y,x) =∫
p1−β(y|θ)π(θ|x) dθ.
Proof of Lemma 2.1.0.1. Under the divergence loss, the posterior risk of
predicting p(y|θ), by a pdf p(y|x), is β−1(1− β)−1 times
1−∫ [∫
p1−β(y|θ)pβ(y|x)dy
]π(θ|x)dθ
= 1−∫
pβ(y|x)
∫p1−β(y|θ)π(θ|x) dθ
dy
= 1−∫
k(y,x)pβ(y|x)dy. (2–3)
An application of Holder’s inequality now shows that the integral in (2–3) is
maximized at p(y|x) ∝ k1
1−β (y,x). Again by the same inequality, the denominator
of (2–2) is finite provided the posterior pdf is proper. This leads to the result
noting that πD(y|x) has to be a pdf. ¥
The next lemma, to be used repeatedly in the sequel, provides an expression
for the integral of the product of two normal densities each raised to a certain
power.
Lemma 2.1.0.2. Let Np(x|µ,Σ) denote the pdf of a p-variate normal random
variable with mean vector µ and positive definite variance-covariance matrix Σ.
Then for α1 > 0, α2 > 0,
∫[Np(x|µ1,Σ1)]
α1 [Np(x|µ2,Σ2)]α2 dx
= (2π)p2(1−α1−α2) |Σ1|
12(1−α1) |Σ2|
12(1−α2) |α1Σ2 + α2Σ1|−
12
× exp[−α1α2
2(µ1 − µ2)
T (α1Σ2 + α2Σ1)−1 (µ1 − µ2)
]. (2–4)
24
Proof of Lemma 2.1.0.2. Writing H = α1Σ−11 +α2Σ
−12 and g = H−1(α1Σ
−11 µ1+
α2Σ−12 µ2), it follows after some simplification that
∫[Np(x|µ1,Σ1)]
α1 [Np(x|µ2,Σ2)]α2 dx
= (2π)−p2(α1+α2) |Σ1|−
12α1 |Σ2|−
12α2
∫exp
[−1
2(x− g)T H(x− g)
−1
2
α1(µ
T1 Σ−1
1 µ1) + α2(µT2 Σ−1
2 µ2)− gT Hg]
dx
= (2π)p2(1−α1−α2) |Σ1|−
12α1 |Σ2|−
12α2 |H|− 1
2
× exp
[−1
2
α1(µ
T1 Σ−1
1 µ1) + α2(µT2 Σ−1
2 µ2)− gT Hg]
. (2–5)
It can be checked that
α1(µT1 Σ−1
1 µ1) + α2(µT2 Σ−1
2 µ2)− gT Hg
= α1α2(µ1 − µ2)T (α1Σ2 + α2Σ1)
−1(µ1 − µ2), (2–6)
and
|H|−1/2 = |Σ1|1/2|Σ2|1/2|α1Σ2 + α2Σ1|−1/2. (2–7)
Then by (2–6) and (2–7),
right hand side of (2–5)
= (2π)p(1−α1−α2)/2|Σ1|(1−α1)/2|Σ2|(1−α2)/2|α1Σ2 + α2Σ1|−1/2
× exp[−α1α2
2(µ1 − µ2)
T (α1Σ2 + α2Σ1)−1(µ1 − µ2)].
This proves the lemma. ¥
The above results are now used to obtain the Bayes estimator of θ and
the Bayesian predictive density of a future Y ∼ N(θ, σ2yIp) under the general
divergence loss and the N(µ, AIp) prior for θ. We continue to assume that
conditional on θ, X ∼ N(θ, σ2xIp), where σ2
x > 0 is known. The Bayes estimator of
25
θ is obtained by minimizing
1−∫
exp[−β(1− β)
2σ2x
||θ − a||2]N(θ|(1−B)X + Bµ, σ2x(1−B)Ip)dθ
with respect to a, where B = σ2x(σ
2x + A)−1. By Lemma 2.1.0.2,
∫exp[−β(1− β)
2σ2x
||θ − a||2]N(θ|(1−B)X + Bµ, σ2x(1−B)Ip)dθ =
(2πσ2
x
β(1− β)
)p/2 ∫N(θ|a, σ2
xβ−1(1− β)−1Ip)N(θ|(1−B)X + Bµ, σ2
x(1−B)Ip)dθ
∝ exp
[− ||a− (1−B)X −Bµ||2
2σ2x(β
−1(1− β)−1 + 1−B)
](2–8)
which is maximized with respect to a at (1 − B)X + Bµ. Hence, the Bayes
estimator of θ under the N(µ, AIp) prior and the general divergence loss is
(1− B)X + Bµ, the posterior mean. Also, by Lemma 2.1.0.2, the Bayes predictive
density under the divergence loss is given by
πD(y|X) ∝[∫
[N1−β(θ|y, σ2yIp)N(θ| (1−B)X + Bµ, σ2
x(1−B)Ip)dθ
] 11−β
∝ N(y|(1−B)X + Bµ,(σ2
x(1−B)(1− β) + σ2y
)Ip).
In the limiting (B → 0) case, i.e. under the uniform prior π(θ) = 1, the Bayes
estimator of θ is X, and the Bayes predictive density of Y is
N(y|X,(σ2
x(1− β) + σ2y
)Ip).
It may be noted that by Lemma 2.1.0.2 with α1 = 1 − β and α2 = β, the
divergence loss for the plug-in predictive density N(θ, σ2xIp), which we denote by
26
δ0, is
L(θ, δ0) =1
β(1− β)
[1−
∫N1−β(y|θ, σ2
yIp)Nβ(y|X, σ2
xIp) dy
]
=1
β(1− β)
[1− (σ2
y)pβ2 (σ2
x)p(1−β)
2 (1− β)σ2x + βσ2
y−p/2
× exp
− β(1− β)‖X − θ‖2
2((1− β)σ2x + βσ2
y)
]. (2–9)
Noting that ‖X − θ‖2 ∼ σ2xχ
2p, the corresponding risk is given by
R(θ, δ0) =1
β(1− β)
[1− (σ2
y)pβ2 (σ2
x)p(1−β)
2 (1− β)σ2x + βσ2
y−p/2
×
1 +β(1− β)σ2
x
(1− β)σ2x + βσ2
y
− p2
]
=1
β(1− β)
[1− (σ2
y)pβ2 (σ2
x)p(1−β)
2 (1− β2)σ2x + βσ2
y−p/2]. (2–10)
On the other hand, by Lemma 2.1.0.2 again, the divergence loss for the Bayes
predictive density (under uniform prior) of N(y|θ, σ2yIp) which we denote by δ∗ is
L(θ, δ∗) =1
β(1− β)
[1−
∫N1−β(y|θ, σ2
yIp)Nβ(y|X, ((1− β)σ2
x + σ2y)Ip) dy
]
=1
β(1− β)
[1− (σ2
y)pβ2 ((1− β)σ2
x + σ2y)
p(1−β)2 (1− β)2σ2
x + σ2y−p/2
× exp
−β(1− β)‖X − θ‖2
2((1− β)2σ2x + σ2
y)
]. (2–11)
The corresponding risk
R(θ, δ∗) =1
β(1− β)
[1− (σ2
y)pβ2 ((1− β)σ2
x + σ2y)
p(1−β)2 ((1− β)2σ2
x + σ2y)− p
2
×
1 +β(1− β)σ2
x
(1− β)2σ2x + σ2
y
− p2
]=
1
β(1− β)
[1− (σ2
y)pβ2 ((1− β)σ2
x + σ2y)
−pβ2
].
(2–12)
27
To show that R(θ, δ0) > R(θ, δ∗) for all θ, σ2x > 0 and σ2
y > 0, it suffices to
show that
(σ2y)
pβ2 (σ2
x)p(1−β)
2 (1− β2)σ2x + βσ2
y−p/2 < (σ2y)
pβ2 ((1− β)σ2
x + σ2y)
−pβ2 , (2–13)
or equivalently that
1 + β(σ2y/σ
2x − β) > (1 + σ2
y/σ2x − β)β, (2–14)
for all 0 < β < 1, σ2x > 0 and σ2
y > 0. But the last inequality is a consequence of the
elementary inequality (1 + z)u < 1 + uz for all real z and 0 < u < 1.
In the next section, we prove the minimaxity of X as an estimator of θ and
the minimaxity of N(y|X, ((1−β)σ2x +σ2
y)Ip) as the predictive density of Y in any
arbitrary dimension.
2.2 Minimaxity Results
Suppose X ∼ N(θ, σ2xIp), where θ ∈ Rp. By Lemma 2.1.0.2 under the general
divergence loss given in 2–1, the risk of X is given by
R(θ,X) =1
β(1− β)[1− 1 + β(1− β)−p/2] (2–15)
for all θ. We now prove the minimaxity of X as an estimator of θ.
Theorem 2.2.0.3. X is a minimax estimator of the θ in any arbitrary dimension
under the divergence loss given in 2–1.
Proof of Theorem 2.2.0.3. Consider the sequence of proper priors N(0, σ2nIp)
for θ, where σ2n −→ ∞ as n −→ ∞. We denote this sequence of priors by πn. The
Bayes estimator of θ, namely the posterior mean, under the prior πn is
δπn(X) = (1−Bn)X, (2–16)
with Bn = σ2x(σ
2x + σ2
n)−1.
28
The Bayes risk of δπn under the prior πn is given by:
r(πn, δπn) =
1
β(1− β)
[1− E
[exp
−β(1− β)
2σ2x
‖δπn(X)− θ‖2
]], (2–17)
where expectation is taken over the joint distribution of X and θ, with θ having
the prior πn.
Since under the prior πn,
θ|X = x ∼ N(δπn(x), σ2
x(1−Bn)Ip
),
it follows that
‖θ − δπn(X)‖2 |X = x ∼ σ2x(1−Bn)χ2
p,
which does not depend on x. Accordingly, from (3.3),
r(πn, δπn) =1
β(1− β)[1− 1 + β(1− β)(1−Bn)−p/2]. (2–18)
Since Bn → 0 as n →∞, it follows from 2–18 that
r(πn, δπn) → 1
β(1− β)[1− 1 + β(1− β)−p/2]
as n →∞.
Noting (2–15), an appeal to a result of Hodges and Lehmann [40] now shows
that X is a minimax estimator of θ for all p.
Next we prove the minimaxity of the predictive density
δ∗(X) = N(y|X, ((1− β)σ2x + σ2
y)Ip)
of Y having pdf N(y|θ, σ2yIp).
Theorem 2.2.0.4. δ∗(X) is a minimax predictive density of N(y|θ, σ2yIp) in any
arbitrary dimension under the general divergence loss given in (2–1).
29
Proof of Theorem 2.2.0.4. We have shown already that the predictive density
δ∗(X) of N(y|θ, σ2yIp) has constant risk
1
β(1− β)
[1− (σ2
y)pβ2 ((1− β)σ2
x + σ2y)
−pβ2
]
under the divergence loss given in (2–1). Under the same sequence πn of priors
considered earlier in this section, by Lemma 2.1.0.2, the Bayes predictive density of
N(y|θ, σ2yIp) is given by N(y| (1−Bn)X, (1− β)(1−Bn)σ2
x + σ2yIp). By Lemma
2.1.0.2 once again, one gets the identity
∫N1−β(y|θ, σ2
yIp)Nβ(y| (1−Bn)X, (1− β)(1−Bn)σ2
x + σ2yIp)dy
= (σ2y)
pβ2 ((1− β)(1−Bn)σ2
x + σ2y)
p(1−β)2 (1− β)2(1−Bn)σ2
x + σ2y−p/2
× exp[−β(1− β)||θ − (1−Bn)X||22((1− β)2(1−Bn)σ2
x + σ2y)
]. (2–19)
Noting once again that ||θ − (1 − Bn)X||2|X = x ∼ σ2x(1 − Bn)χ2
p, the posterior
risk of δ∗(X) simplifies to
1
β(1− β)
[1− (σ2
y)pβ2 ((1− β)(1−Bn)σ2
x + σ2y)
p(1−β)2
×((1− β)2(1−Bn)σ2x + σ2
y)− p
2
1 +
β(1− β)σ2x(1−Bn)
(1− β)2(1−Bn)σ2x + σ2
y
− p2
]
=1
β(1− β)
[1− (σ2
y)pβ2 ((1− β)(1−Bn)σ2
x + σ2y)− pβ
2
]. (2–20)
Since the expression does not depend on x, this is also the same as the Bayes
risk of δ∗(X). The Bayes risk converges to
1
β(1− β)
[1− (σ2
y)pβ2 ((1− β)σ2
x + σ2y)
−pβ2
].
An appeal to Hodges and Lehmann [40] once again proves the theorem.
30
2.3 Admissibility for p = 1
We use Blyth’s [14] original technique for proving admissibility. First consider
the estimation problem. Suppose that X is not an admissible estimator of θ. Then
there exists an estimator δ0(X) of θ such that R(θ, δ0) ≤ R(θ, X) for all θ with
strict inequality for some θ = θ0. Let η = R(θ0, X) − R(θ0, δ0(X)) > 0. Due to
continuity of the risk function, there exists an interval [θ0 − ε, θ0 + ε] with ε > 0
such that R(θ, X)− R(θ, δ0(X)) ≥ 12η for all θ ∈ [θ0 − ε, θ0 + ε]. Now with the same
prior πn(θ) = N(θ| 0, σ2n),
r(πn, X)− r(πn, δ0(X))
=
∫
R
[R(θ,X)−R(θ, δ0(X))] πn(dθ) ≥θ0+ε∫
θ0−ε
[R(θ,X)−R(θ, δ0(X))] πn(dθ)
≥ 1
2η
θ0+ε∫
θ0−ε
(2πσ2n)−
12 exp
− 1
2σ2n
θ2
dθ ≥ 1
2η(2πσ2
n)−12 (2ε) . (2–21)
Again,
r(πn, X)− r(πn, δπn(X)) =[1 + β(1− β)(1−Bn)−1/2 − 1 + β(1− β)−1/2]
β(1− β)
= Oe(Bn) (2–22)
for large n, where Oe denotes the exact order. Since Bn = σ2x(σ
2x + σ2
n)−1 and
σ2n → ∞ as n → ∞, denoting C(> 0) as a generic constant, it follows from (2–21)
and (2–22) that for large n, say n ≥ n0,
r(πn, X)− r(πn, δ0(X))
r(πn, X)− r(πn, δπn(X))≥ 1
4η(2π)−1/2σ−1
n CB−1n →∞ (2–23)
as n →∞.
Hence, for large n, r(πn, δπn(X)) > r(πn, δ0(X)) which contradicts the
Bayesness of δπn(X) with respect to πn. This proves the admissibility of X for
p = 1. ¥
31
For the prediction problem, suppose there exists a density p(y| ν(X)) which
dominates N(y|X, ((1− β)σ2x + σ2
y). Since
[((1− β)(1−Bn)σ2x + σ2
y))−β
2 − ((1− β)σ2x + σ2
y))−β
2 ]
β(1− β)= Oe(Bn)
for large n under the same prior πn, using a similar argument,
r(πn, N(y|X, ((1− β)σ2x + σ2
y)))− r(πn, p(y|ν(X)))
r(πn, N(y|X, ((1− β)σ2x + σ2
y)))− r(πn, N(y|X, ((1− β)(1−Bn)σ2x + σ2
y)))
= O(σ−1n B−1
n ) →∞, (2–24)
as n →∞. An argument similar to the previous result now completes the proof. ¥
Remark 1. The above technique of proving admissibility does not work for
p = 2. This is because for p = 2, the ratios in the left hand side of (2–23)
and (2–24) are greater than or equal to some constant times σ−2n B−1
n for large n
which tends to a constant as n → ∞. We conjecture the admissibility of X or
N(y
∣∣ X, ((1− β)σ2x + σ2
y)Ip
)for p = 2 under the general divergence loss for the
respective problems of estimation and prediction.
2.4 Inadmissibility Results for p ≥ 3
Let S = ‖X‖2/σ2x. The Baranchik class of estimators for θ is given by
δτ (X) =
(1− τ(S)
S
)X,
where one needs some restrictions on τ. The special choice τ(S) = p − 2 (with
p ≥ 3) leads to the James-Stein estimator.
It is important to note that the class of estimators δτ (X) can be motivated
from an empirical Bayes (EB) point of view. To see this, we first note that with the
N(0, AIp) (A > 0) prior for θ, the Bayes estimator of θ under the divergence loss is
(1−B)X, where B = σ2x(A + σ2
x)−1.
32
An EB estimator of θ estimates B from the marginal distribution of X.
Marginally, X ∼ N(0, σ2xB
−1Ip) so that S is minimal sufficient for B. Thus, a
general EB estimator of θ can be written in the form δτ (X). In particular, the
UMVUE of B is (p− 2)/S which leads to the James-Stein estimator [28].
Note that for the estimation problem,
L(θ, δτ (X)) =1− exp
[−β(1−β)
2σ2x‖δτ (X)− θ‖2
]
β(1− β), (2–25)
while for the prediction problem,
L(N(y|θ, Ip), N
(y| δτ (X), ((1− β)σ2
x + σ2y)Ip
))
=1− (σ2
y)pβ2 ((1− β)σ2
x + σ2y)
p(1−β)2 (1− β)2σ2
x + σ2y−p/2
β(1− β)
× exp
[−β(1− β)‖δτ (X)− θ‖2
2((1− β)2σ2x + σ2
y)
]. (2–26)
The first result of this section finds an expression for
Eθ
[exp
− b
σ2x
‖δτ (X)− θ‖2
], b > 0.
We will need b = β(1− β)/2 and b = β(1−β)σ2x
2((1−β)2σ2x+σ2
y)to evaluate (2–25) and (2–26).
Theorem 2.4.0.5.
Eθ
[exp
− b
σ2x
‖δτ (X)− θ‖2
]= (2b + 1)−p/2
∞∑r=0
exp(−φ)φr/r! Ib(r), (2–27)
where φ = (b + 12)‖θ‖
2
σ2x
and
Ib(r) =
∞∫
0
1− b
tτ
(2t
2b + 1
)2rtr+
p2−1
Γ(r + p
2
)
× exp
[−t− b(b + 1
2)
tτ 2
(2t
2b + 1
)+ 2bτ
(2t
2b + 1
)]dt. (2–28)
33
Proof.
Recall that S = ||X||2/σ2x. For ||θ|| = 0, proof is straightforward. So
we consider ||θ|| > 0. Let Z = X/σx and η = θ/σx. First we reexpress
1σ2
x
∥∥∥(1− τ(S)
S
)X − θ
∥∥∥2
as
∥∥∥∥(
1− τ(‖Z‖2)
‖Z‖2
)Z − η
∥∥∥∥2
= S+τ 2(S)
S−2τ(S)+‖η‖2−2ηT Z
(1− τ(S)
S
). (2–29)
We begin with the orthogonal transformation Y = CZ where C is an orthogonal
matrix with its first row given by (θ1/‖θ‖, . . . , θp/‖θ‖). Writing Y = (Y1, . . . , Yp)T ,
the right hand side of (2–29) can be written as
S +τ 2(S)
S− 2τ(S) + ‖η‖2 − 2‖η‖Y1
(1− τ(S)
S
), (2–30)
where S = ‖Y ‖2. Also we note that Y1, . . . , Yp are mutually independent with
Y1 ∼ N(‖η‖, 1), and Y2, . . . , Yp are iid N(0, 1). Now writing Z =p∑
i=2
Y 2i ∼ χ2
p−1 we
have
E
[exp
− b
σ2x
‖δτ (X)− θ‖2
]
=
+∞∫
0
+∞∫
−∞
exp
[−b
y2
1 + z +τ 2(y2
1 + z)
(y21 + z)
−2τ(y21 + z) + ‖η‖2 − 2‖η‖y1
(1− τ(y2
1 + z)
y21 + z
)]×
× (2π)−1/2 exp
−1
2(y1 − ‖η‖)2
exp
−1
2z
z
12(p−1)−1
2p−12 Γ
(p−12
) dy1 dz
=
+∞∫
0
+∞∫
−∞
(2π)−1/2 exp
[−
(b +
1
2
)(y1 − ‖η‖)2 − bτ 2(y2
1 + z)
(y21 + z)
+ 2bτ(y21 + z)
−2b‖η‖y1τ(y2
1 + z)
y21 + z
]exp
−
(b +
1
2
)z
z
p−32
2p−12 Γ
(p−12
) dy1 dz (2–31)
34
We first simplify
+∞∫
−∞
(2π)−1/2 exp
[−
(b +
1
2
)(y1 − ‖η‖)2
−bτ 2(y21 + z)
(y21 + z)
+ 2bτ(y21 + z)− 2b‖η‖y1
τ(y21 + z)
y21 + z
]dy1
=
+∞∫
0
(2π)−1/2 exp
[−
(b +
1
2
)(y2
1 + ‖η‖2)− bτ 2(y21 + z)
(y21 + z)
+ 2bτ(y21 + z)
]
×[exp
2
(b +
1
2
)‖η‖y1 − 2b‖η‖y1
τ(y21 + z)
y21 + z
+ exp
−2
(b +
1
2
)‖η‖y1 + 2b‖η‖y1
τ(y21 + z)
y21 + z
]dy1 (2–32)
= 2
+∞∫
0
(2π)−1/2 exp
[−
(b +
1
2
)(y2
1 + ‖η‖2)− bτ 2(y21 + z)
(y21 + z)
+ 2bτ(y21 + z)
]
×[ ∞∑
r=0
(2‖η‖y1)2r
(2r)!
(b +
1
2
)− bτ(y2
1 + z)
y21 + z
2r]
dy1
= 2
+∞∫
0
(2π)−1/2 exp
[−
(b +
1
2
)(w + ‖η‖2)− bτ 2(w + z)
(w + z)+ 2bτ(w + z)
]
×[ ∞∑
r=0
‖2η‖2rwr− 12
(2r)!
(b +
1
2
)− bτ(w + z)
w + z
2r]
dw,
where w = y21.
With the substitution v = w + z and u = w/(w + z), it follows from (2–31) and
(2–32) that
E
[exp
− b
σ2x
‖δτ (X)− θ‖2
]
=
+∞∫
0
1∫
0
(2π)−1/2
∞∑r=0
exp
(−
(b +
1
2
)‖η‖2
)(2b‖η‖)2r
(2r)!
((2b)−1 + 1
)− τ(v)
v
2r
× exp
[−
(b +
1
2
)v − bτ 2(v)
v+ 2bτ(v)
]vr+ p
2−1ur+ 1
2−1(1− u)
p−12−1
2p−12 Γ
(p−12
) du dv (2–33)
35
By the Legendre duplication formula, namely,
(2r)! = Γ(2r + 1) = Γ
(r +
1
2
)Γ(r + 1)22rπ−1/2,
(2–33) simplifies into
E
[exp
− b
σ2x
‖δτ (X)− θ‖2
]
=
+∞∫
0
1∫
0
(2π)−1/2
∞∑r=0
exp
(−
(b +
1
2
)‖η‖2
)(2b‖η‖)2r
√π
r!Γ(r + 1
2
)22r
((2b)−1 + 1
)− τ(v)
v
2r
× vr+ p2−1 exp
[−
(b +
1
2
)v − bτ 2(v)
v+ 2bτ(v)
]ur+ 1
2−1(1− u)
p−12−1
2p−12 Γ
(p−12
) du dv
=
+∞∫
0
1∫
0
∞∑r=0
exp
(−
(b +
1
2
)‖η‖2
)((b +
1
2
)‖η‖2
)r(b + 1
2
)r
r!
1− τ(v)
((2b)−1 + 1) v
2r
×vr+ p2−1 exp
[−
(b +
1
2
)v − bτ 2(v)
v+ 2bτ(v)
]ur+ 1
2−1(1− u)
p−12−1Γ
(r + p
2
)
2p2 Γ
(r + 1
2
)Γ
(p−12
)Γ
(r + p
2
) du dv
(2–34)
Integrating with respect to u, (2–34) leads to
E
[exp
− b
σ2x
‖δτ (X)− θ‖2
]
=∞∑
r=0
exp−φφr
r!
+∞∫
0
(b +
1
2
)r (1− τ(v)
((2b)−1 + 1) v
)2rvr+ p
2−1
2p2 Γ
(r + p
2
)
× exp
[−
(b +
1
2
)v − bτ 2(v)
v+ 2bτ(v)
]dv (2–35)
where φ =(b + 1
2
) ‖η‖2. Now putting t =(b + 1
2
)v, we get from (2–35)
E
[exp
− b
σ2x
‖δτ (X)− θ‖2
]= (2b + 1)−
p2
∞∑r=0
exp−φφr
r!
×+∞∫
0
exp
−t− b
(b + 1
2
)τ 2( 2t
2b+1)
t+ 2bτ
(2t
2b + 1
) (1− b
tτ
(2t
2b + 1
))2rtr+
p2−1
Γ(r + p2)dt,
The theorem follows. ¥
36
As a consequence of this theorem, putting b = β(1 − β)/2, it follows from
(2–25) and (2–27) that
R(θ, δτ (X)) =
1− (1 + β(1− β))−p/2∞∑
r=0
exp(−φ)φr/r! Iβ(1−β)/2(r)
β(1− β),
while putting b = β(1−β)σ2x
2((1−β)2σ2x+σ2
y), it follows from (2–26) and (2–27) that
R(N(y|θ, σ2
yIp), N(y
∣∣ δτ (X), ((1− β)σ2x + σ2
y)Ip
))
=
1− (σ2y)
pβ/2((1− β)σ2x + σ2
y)−pβ/2
∞∑r=0
exp(−φ)φr/r! I β(1−β)σ2x
2((1−β)2σ2x+σ2
y)
(r)
β(1− β).
Hence, proving Ib(r) > 1 for all b > 0 under certain conditions on τ leads to
R(θ, δτ (X)) < R(θ,X)
and
R(N(y|θ, σ2
yIp), N(y
∣∣δτ (X), ((1− β)σ2x + σ2
y)Ip
))
< R(N
(y|θ, σ2
yIp), N(y|X, ((1− β)σ2
x + σ2y)Ip
))
for all θ. In the limiting case when β → 0, i.e. for the KL loss, one gets
RKL (θ, δτ (X)) < p/2 = RKL(θ, X) for all θ, since as shown in Section 1,
for estimation, the KL loss is half of the squared error loss. Similarly, for the
prediction problem, as β → 0,
RKL(N(y|θ, σ2yIp), N(y| δτ (X), ((1− β)σ2
x + σ2y)Ip)
<p
2log
(σ2
x + σ2y
σ2y
)= RKL(N(y|θ, σ2
yIp), N(y|X, ((1− β)σ2x + σ2
y)Ip)
for all θ.
The following theorem provides sufficient conditions on the function τ(·) which
guarantee Ib(r) > 1 for all r = 0, 1, · · · .
37
Theorem 2.4.0.6. Let p ≥ 3. Suppose
(i) 0 < τ(t) < 2(p− 2) for all t > 0;
(ii) τ(t) is a differentiable nondecreasing function of t.
Then Ib(r) > 1 for all b > 0.
Proof of Theorem 2.4.0.6.
Define τ0(t) = τ( 2t2b+1
). Notice that τ0(t) will also satisfy conditions of Theorem
2.4.0.6. Now
t +b(b + 1
2
)
tτ 2
(2t
2b + 1
)− 2bτ
(2t
2b + 1
)= t
(1− bτ0(t)
t
)2
+b
2tτ 20 (t)
and
Ib(r) =
+∞∫
0
exp
−t
(1− bτ0(t)
t
)2 (
t(1− b
tτ0(t)
)2)r+ p
2−1
Γ(r + p2)
× exp
−bτ 2
0 (t)
2t
1− b
tτ0(t)
−(p−2)
dt. (2–36)
Define t0 = supt > 0 : τ0(t)/t ≥ b−1. Since τ0(t)/t is continuous in t with
limt→0
τ0(t)/t = +∞ and limt→∞
τ0(t)/t = 0, there exists such a t0 which also satisfies
τ0(t0)/t0 = b−1. We now need the following lemma.
Lemma 2.4.0.7. For t ≥ t0, b > 0 and τ0(t) satisfying conditions of Theorem
2.4.0.6 the following inequality holds:
exp
−bτ 2
0 (t)
2t
(1− b
tτ0(t)
)−(p−2)
− q′(t) ≥ 0, (2–37)
where q(t) = t(1− bτ0(t)
t
)2
.
Proof of 2.4.0.7. Notice first that for t ≥ t0, by the inequality, (1 − z)−c ≥exp(cz) for c > 0 and 0 < z < 1, one gets
exp
−bτ 2
0 (t)
2t
(1− b
tτ0(t)
)−(p−2)
≥ exp
−bτ 2
0 (t)
2t+
b(p− 2)τ0(t)
t
, (2–38)
38
for 0 < τ0(t) < 2(p− 2).
Notice that
q′(t) = 1− 2bτ ′0(t) +2b2τ0(t)τ
′0(t)
t− b2τ 2
0 (t)
t2. (2–39)
Thus q′(t) ≤ 1 for t ≥ t0 if and only if
2bτ ′0(t)(
1− bτ0(t)
t
)+
b2τ 20 (t)
t2≥ 0. (2–40)
The last inequality is true since τ ′0(t) ≥ 0 for all t > 0 and τ0(t)/t ≤ b−1 for all
t ≥ t0. The lemma follows. ¥
In view of previous lemma, it follows from (2–37) that
Ib(r) ≥ 1
Γ(r + p
2
)+∞∫
t0
exp−q(t)(q(t))r+ p2−1q′(t) dt = 1.
This proves Theorem 2.4.0.6. ¥
Remark 2. Baranchik [5], under squared error loss, proved the dominance of
δτ (X) over X under (i) and (ii). We may note that the special choice τ(t) = p− 2
for all t leading to the James-Stein estimator, satisfies both conditions (i) and (ii)
of the theorem.
Remark 3. We may note that the Baranchik class of estimators shrinks the sample
mean X towards 0. Instead one can shrink X towards any arbitrary constant µ. In
particular, if we consider the N(µ, AIp) prior for θ, where µ ∈ Rp is known, then
the Bayes estimator of θ is (1−B)X + Bµ, where B = σ2x(A+ σ2
x)−1. A general EB
estimator of θ is then given by
δ∗∗(X) =
(1− τ(S ′)
S ′
)X +
τ(S ′)S ′
µ,
39
where S ′ = ‖X −µ‖2/σ2x, and Theorem 2.4.0.6 with obvious modifications will then
provide the dominance of the EB estimator δ∗∗(X) over X under the divergence
loss. The corresponding prediction result is also true.
Remark 4. The special case with τ(t) = c satisfies conditions of the theorem if
0 < c < 2(p− 2). This is the original James-Stein result.
Remark 5. Strawderman [60] considered the hierarchical prior
θ|A ∼ N(0, AIp),
where A has pdf
π(A) = δ(1 + A)−1−δI[A>0]
with δ > 0.
Under the above prior, assuming squared error loss, and recalling that S =
||X||2/σ2x, the Bayes estimator of θ is given by
(1− τ(S)
S
)X,
where
τ(t) = p + 2δ − 2exp(− t2)∫ 1
0λ
p2+δ−1exp(−λ
2t) dλ
(2–41)
Under the general divergence loss, it is not clear whether this estimator is the
hierarchical Bayes estimator of θ, although its EB interpretation continues to
hold. Besides, as it is well known, this particular τ satisfies conditions of Theorem
2.4.0.6 if p > 4 + 2δ. Thus the Strawderman class of estimators dominates X
under the general divergence loss. The corresponding predictive density also
dominates N(y
∣∣X, ((1− β)σ2x + σ2
y)Ip
). For the special KL loss, the present
results complement those of Komaki [45] and George et al. [34]. The predictive
density obtained by these authors under the Strawderman prior, (and Stein’s
superharmonic prior as a special case) are quite different from the general class of
40
EB predictive densities of this dissertation. One of the virtues of the latter is that
the expressions are in closed form, and thus these densities are easy to implement.
2.5 Lindley’s Estimator and Shrinkage to Regeression Surface
Lindley [50] considered a modification of the James-Stein estimator. Rather
then shrinking X towards an arbitrary point, say µ, he proposed shrinking X
towards X1p, where X = p−1∑p
i=1 Xi and 1p is a p-component column vector with
each element equal to 1. Writing R =p∑
i=1
(Xi − X)2/σ2x, Lindley’s estimator is given
by
δ(X) = X − p− 3
R(X − X1p), p ≥ 4. (2–42)
The above estimator has a simple EB interpretation. Suppose
X|θ ∼ N(θ, σ2xIp) and θ has the Np(µ1p, AIp) prior. Then the Bayes estimator of
θ is given by (1− B)X + Bµ1p where B = σ2x(A + σ2
x)−1. Now if both µ and A are
unknown, since marginally X ∼ N(µ1p, σ2xB
−1Ip), (X, R) is complete sufficient for
µ and B, and the UMVUE of µ and B−1 are given by X and (p− 3)/R, p ≥ 4.
Following Baranchik [5] a more general class of EB estimators is given by
δτ∗(X) = X − τ(R)
R(X − X1p), p ≥ 4. (2–43)
Theorem 2.5.0.8. Assume
(i) 0 < τ(t) < 2(p− 3) for all t > 0 , p ≥ 4;
(ii) τ(t) is a nondecreasing differentiable function of t.
Then the estimator δτ∗(X) dominates X under the divergence loss given in 2–1.
Similarly, N(y| δτ∗(X), ((1− β)σ2
x + σ2y)Ip) dominates N(y|X, ((1− β)σ2
x + σ2y)Ip)
as the predictor of N(y| θ, σ2yIp).
Proof of Theorem 2.5.0.8. Let
θ = p−1
p∑i=1
θi, η = θ/σx
41
and
ζ2 =1
σ2x
p∑i=1
(θi − θ)2.
As in the proof of Theorem 2.4.0.5 we first rewrite
1
σ2x
‖δτ∗(X)− θ‖2 = ‖Z − η − τ(R)
R(Z − Z1p)‖2
=
∥∥∥∥(Z − Z1p)− (η − η1p) + (Z − η)1p − τ(R)
R(Z − Z1p)
∥∥∥∥2
=
[1− τ(R)
R
]2
R + p(Z − η)2 + ζ2 − 2(η − η1p)T
(1− τ(R)
R
)(Z − Z1p). (2–44)
Consider the orthogonal transformation G = (G₁, . . . , G_p)ᵀ = CZ, where C is an orthogonal matrix whose first two rows are (p^{−1/2}, . . . , p^{−1/2}) and ((η₁ − η̄)/ζ, . . . , (η_p − η̄)/ζ). We can then rewrite

(1/σ_x²)‖δ_{τ*}(X) − θ‖²
  = [1 − τ(G₂² + Q)/(G₂² + Q)]²(G₂² + Q) + (G₁ − √p η̄)² + ζ² − 2ζG₂(1 − τ(G₂² + Q)/(G₂² + Q)),   (2–45)

where Q = Σ_{i=3}^p G_i², and G₁, G₂, . . . , G_p are mutually independent with
G₁ ∼ N(√p η̄, 1), G₂ ∼ N(ζ, 1) and G₃, . . . , G_p i.i.d. N(0, 1). Hence, due to the
independence of G₁ and (G₂, . . . , G_p), and the fact that (G₁ − √p η̄)² ∼ χ²₁, from
(2–45),
E[exp{−(b/σ_x²)‖δ_{τ*}(X) − θ‖²}]
  = (2b + 1)^{−1/2} E[exp{−b((1 − τ(G₂² + Q)/(G₂² + Q))²(G₂² + Q) + ζ² − 2ζG₂(1 − τ(G₂² + Q)/(G₂² + Q)))}]
  = (2b + 1)^{−p/2} Σ_{r=0}^∞ (exp(−ϕ*)ϕ*^r / r!)
      × ∫₀^∞ exp{−t − b(b + 1/2)τ₀²(t)/t + 2bτ₀(t)} (1 − bτ₀(t)/t)^{2r} t^{r+(p−1)/2−1} / Γ(r + (p−1)/2) dt,   (2–46)

where ϕ* = (b + 1/2)ζ² and, as before, τ₀(t) = τ(t/(b + 1/2)). The second equality in (2–46)
follows after long simplifications, proceeding as in the proof of Theorem 2.4.0.5.
Hence, by (2–46), the dominance of δ_{τ*}(X) over X follows if the right-hand
side of (2–46) is ≥ (2b + 1)^{−p/2}. This, however, is an immediate consequence of Theorem
2.4.0.6 (with p replaced by p − 1). □
The above result can immediately be extended to shrinkage towards an
arbitrary regression surface. Suppose now that X|θ ∼ N(θ, σ_x²I_p) and θ ∼
N_p(Kβ, AI_p), where K is a known p × r matrix of rank r (< p) and β is an r × 1
vector of regression coefficients. Writing P = K(KᵀK)^{−1}Kᵀ, the projection of X on the
regression surface is given by PX = Kβ̂, where β̂ = (KᵀK)^{−1}KᵀX is the least
squares estimator of β. Now we consider the general class of estimators given by

X − (τ(R*)/R*)(X − PX),

where R* = ‖X − PX‖²/σ_x². The above estimator also has an EB interpretation,
noting that marginally (β̂, R*) is complete sufficient for (β, A).
The following theorem now extends Theorem 2.5.0.8.
Theorem 2.5.0.9. Let p ≥ r + 3 and
(i) 0 < τ(t) < 2(p− r − 2) for all t > 0 ;
(ii) τ(t) is a nondecreasing differentiable function of t.
Then the estimator X − (τ(R*)/R*)(X − PX) dominates X under the divergence loss.
A similar dominance result holds for prediction of N(y|θ, σ2yIp).
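As an illustration (not part of the original text), the following Python sketch computes the estimator of Theorem 2.5.0.9 that shrinks toward the regression surface; the constant default τ(t) = p − r − 2 is one choice satisfying condition (i) when p ≥ r + 3. The function name is an illustrative assumption.

```python
import numpy as np

def shrink_toward_regression_surface(x, K, sigma2_x, tau=None):
    """x - (tau(R*)/R*) (x - P x), where P = K (K'K)^{-1} K' projects onto the
    column space of the known p x r matrix K and R* = ||x - P x||^2 / sigma2_x."""
    x = np.asarray(x, dtype=float)
    K = np.asarray(K, dtype=float)
    p, r = K.shape
    beta_hat, *_ = np.linalg.lstsq(K, x, rcond=None)   # least squares coefficients
    px = K @ beta_hat                                  # projection P x = K beta_hat
    r_star = np.sum((x - px) ** 2) / sigma2_x
    t = (p - r - 2) if tau is None else tau(r_star)    # 0 < tau < 2(p - r - 2) required
    return x - (t / r_star) * (x - px)
```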
CHAPTER 3
POINT ESTIMATION UNDER DIVERGENCE LOSS WHEN VARIANCE-COVARIANCE MATRIX IS UNKNOWN
3.1 Preliminary Results
In this chapter we will consider the following situation. Let X_i ∼ N_p(θ, Σ), i = 1, . . . , n, be i.i.d. random vectors, where Σ is the unknown
variance-covariance matrix. In Section 3.2 we consider Σ = σ²I_p with σ² unknown,
while in Section 3.3 we consider the most general situation of an unknown Σ. Our goal
is to estimate the unknown vector θ under the divergence loss.
First note that X̄ is distributed as N_p(θ, n^{−1}Σ), and thus the divergence loss for an
estimator a of θ is as follows:

L_β(θ, a) = [1 − ∫ f^{1−β}(x|θ) f^β(x|a) dx] / [β(1 − β)]
          = [1 − exp{−(nβ(1 − β)/2)(a − θ)ᵀΣ^{−1}(a − θ)}] / [β(1 − β)].   (3–1)
The best unbiased equivariant estimator is X̄. We will begin with the
expression for the risk of this estimator.
Lemma 3.1.0.10. Let X_i ∼ N_p(θ, Σ), i = 1, . . . , n, be i.i.d. Then the risk of the
best unbiased estimator X̄ of θ is as follows:

R_β(θ) = (1/(β(1 − β)))(1 − 1/[1 + β(1 − β)]^{p/2}).   (3–2)

Proof of Lemma 3.1.0.10. Note first that

n(X̄ − θ)ᵀΣ^{−1}(X̄ − θ) ∼ χ²_p.

Then

E_θ[exp{−(nβ(1 − β)/2)(X̄ − θ)ᵀΣ^{−1}(X̄ − θ)}] = 1/(1 + β(1 − β))^{p/2}.   (3–3)

The lemma follows from (3–3). □
Thus, for any rival estimator δ(X̄) to dominate X̄ under the divergence loss,
we will need the following inequality to be true for all possible values of θ:

E_θ[exp{−(nβ(1 − β)/2)(δ(X̄) − θ)ᵀΣ^{−1}(δ(X̄) − θ)}] ≥ 1/(1 + β(1 − β))^{p/2}.   (3–4)

3.2 Inadmissibility Results when Variance-Covariance Matrix is Proportional to Identity Matrix
Let X ∼ N_p(θ, σ²I_p), where σ (> 0) is unknown, while S ∼ (σ²/(m + 2))χ²_m,
independently of X. This situation arises quite naturally in a balanced fixed effects
independently of X. This situation arises quite naturally in a balanced fixed effects
one-way ANOVA model. For example, let
Xij = θi + εij (i = 1, . . . , p; j = 1, . . . , n)
where the εij are i.i.d. Np(0, σ20). Then the minimal sufficient statistics is given by
(X1, . . . , Xp, S), where
Xi = n−1
n∑j=1
Xij (i = 1, . . . , p)
and
S = [(n− 1)p + 2]−1
p∑i=1
n∑j=1
(Xij − Xi)2.
This leads to the proposed setup with X = (X1, . . . , Xp)T , θ = (θ1, . . . , θp)
T ,
σ2 = σ20/n and m = (n− 1)p.
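As an illustrative sketch (not from the dissertation), the following Python function computes the statistics (X̄₁, . . . , X̄_p, S) and m described above from a balanced one-way ANOVA data matrix; the function name is a hypothetical label.

```python
import numpy as np

def anova_sufficient_stats(X):
    """Given a p x n data matrix X with X[i, j] = X_ij from the balanced one-way
    ANOVA model, return (xbar, S, m): the p treatment means, the pooled statistic
    S = [(n - 1)p + 2]^{-1} * sum_ij (X_ij - Xbar_i)^2, and m = (n - 1)p."""
    X = np.asarray(X, dtype=float)
    p, n = X.shape
    xbar = X.mean(axis=1)
    ss = np.sum((X - xbar[:, None]) ** 2)
    m = (n - 1) * p
    S = ss / (m + 2)
    return xbar, S, m
```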
Efron and Morris [30], in the above scenario, proposed a general class of
shrinkage estimators dominating the sample mean in three or higher dimensions
under squared error loss. This class of estimators was developed along the lines of
Baranchik [5].
Using equation (3–1), the divergence loss for an estimator a of θ is given by

L_β(θ, a) = [1 − exp{−(β(1 − β)/(2σ²))‖θ − a‖²}] / [β(1 − β)].   (3–5)

The above loss is to be interpreted as its limit when β → 0 or β → 1. The KL loss
occurs as a special case when β → 0. Also, noting that ‖X − θ‖² ∼ σ²χ²_p, the risk
of the classical estimator X of θ is readily calculated as

R_β(θ, X) = [1 − {1 + β(1 − β)}^{−p/2}] / [β(1 − β)].   (3–6)
Throughout we will perform calculations in the case 0 < β < 1, and will pass
to the limit as and when needed.
Following Baranchik [5] and Efron and Morris [30], we consider the rival class
of estimators
δ(X) = (1 − Sτ(‖X‖²/S)/‖X‖²) X,   (3–7)
where we will impose some conditions later on τ.
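As an illustration (not part of the original text), here is a Python sketch of the estimator (3–7); the constant default τ(t) = p − 2 satisfies condition (i) of the theorem below when p ≥ 3. The function name is an illustrative assumption.

```python
import numpy as np

def shrinkage_unknown_sigma(x, S, tau=None):
    """delta(x) = (1 - S * tau(||x||^2 / S) / ||x||^2) * x  as in (3-7);
    S is the independent estimate of sigma^2 and tau must satisfy 0 < tau < 2(p-2)."""
    x = np.asarray(x, dtype=float)
    p = x.size
    norm2 = np.dot(x, x)
    t = (p - 2) if tau is None else tau(norm2 / S)
    return (1.0 - S * t / norm2) * x
```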
First we observe that under the divergence loss given in (3–5),
L_β(θ, δ(X)) = [1 − exp{−(β(1 − β)/(2σ²))‖δ(X) − θ‖²}] / [β(1 − β)].   (3–8)
We now prove the following dominance result.
Theorem 3.2.0.11. Let p ≥ 3. Assume
(i) 0 < τ(t) < 2(p− 2) for all t > 0;
(ii) τ(t) is a differentiable nondecreasing function of t for t > 0.
Then R(θ, δ(X)) < R(θ,X) for all θ ∈ Rp.
Proof of Theorem 3.2.0.11. First, with the transformation
Y = σ^{−1}X, η = σ^{−1}θ and U = S/σ²,
one can rewrite

R(θ, δ(X)) = [1 − E[exp{−(β(1 − β)/2)‖(1 − τ(‖Y‖²/U)U/‖Y‖²)Y − η‖²}]] / [β(1 − β)],   (3–9)

where Y ∼ N_p(η, I_p) and U ∼ (m + 2)^{−1}χ²_m is distributed independently of Y.
Hence a comparison of (3–9) with (3–6) reveals that Theorem 3.2.0.11 holds if
and only if

E[exp{−(β(1 − β)/2)‖(1 − Uτ(‖Y‖²/U)/‖Y‖²)Y − η‖²}] > [1 + β(1 − β)]^{−p/2}.   (3–10)

Next, writing

Z = U(m + 2)/2   and   τ₀(t/z) = (2/(m + 2)) τ((m + 2)t/(2z)),

we re-express the left-hand side of (3–10) as

E[exp{−(β(1 − β)/2)‖(1 − (Z/‖Y‖²)τ₀(‖Y‖²/Z))Y − η‖²}].   (3–11)
Note that in order to find the above expectation, we first condition on Z and
then average over the distribution of Z. By the independence of Z and Y and
Theorem 2.4.0.5, the expression given in (3–11) simplifies to

[1 + β(1 − β)]^{−p/2} Σ_{r=0}^∞ (exp(−φ)φ^r / r!) I*_b(r),   (3–12)

where φ = (1/2)[1 + β(1 − β)]‖η‖², and writing b = β(1 − β)/2,

I*_b(r) = ∫₀^∞ ∫₀^∞ ([1 − (bz/t)τ₀(t/z)]^{2r} t^{r+p/2−1} / Γ(r + p/2))
      × exp{−t − b(b + 1/2)z²τ₀²(t/z)/t + 2bzτ₀(t/z)} exp(−z) (z^{m/2−1} / Γ(m/2)) dt dz.   (3–13)

From (3–10)–(3–13), it remains only to show that

I*_b(r) > 1 for all r = 0, 1, . . . and p ≥ 3

under conditions (i) and (ii) of the theorem. To show this we first use the
transformation t = zu. Then from (3–13),
I*_b(r) = ∫₀^∞ ∫₀^∞ ([1 − bτ₀(u)/u]^{2r} u^{r+p/2−1} / [Γ(r + p/2)Γ(m/2)])
      × exp[−z{u + 1 + b(b + 1/2)τ₀²(u)/u − 2bτ₀(u)}] z^{r+(p+m)/2−1} dz du

  = ∫₀^∞ ([1 − bτ₀(u)/u]^{2r} u^{r+p/2−1} / B(r + p/2, m/2)) [u + 1 + b(b + 1/2)τ₀²(u)/u − 2bτ₀(u)]^{−(r+(p+m)/2)} du.   (3–14)

Since τ₀(u)/u is a continuous function of u with

lim_{u→0} τ₀(u)/u = +∞   and   lim_{u→∞} τ₀(u)/u = 0,

it follows that there exists u₀ such that

u₀ = sup{u > 0 : τ₀(u)/u ≥ 1/b}   and   τ₀(u₀)/u₀ = 1/b.
Thus for u > u₀, τ₀(u)/u ≤ 1/b, and from (3–14),

I*_b(r) ≥ ∫_{u₀}^∞ ([1 − bτ₀(u)/u]^{2r} u^{r+p/2−1} / B(r + p/2, m/2)) [u + 1 + b(b + 1/2)τ₀²(u)/u − 2bτ₀(u)]^{−(r+(p+m)/2)} du

  = ∫_{u₀}^∞ ([u(1 − bτ₀(u)/u)²]^{r+p/2−1} (1 − bτ₀(u)/u)^{−(p−2)} / B(r + p/2, m/2))
      × [u(1 − bτ₀(u)/u)² + 1 + bτ₀²(u)/(2u)]^{−(r+(p+m)/2)} du

  = ∫_{u₀}^∞ ([u(1 − bτ₀(u)/u)²/(1 + bτ₀²(u)/(2u))]^{r+p/2−1} / {[1 + u(1 − bτ₀(u)/u)²/(1 + bτ₀²(u)/(2u))]^{r+(p+m)/2} B(r + p/2, m/2)})
      × (1 − bτ₀(u)/u)^{−(p−2)} (1 + bτ₀²(u)/(2u))^{−(m/2+1)} du.   (3–15)

By the inequalities

(1 − bτ₀(u)/u)^{−(p−2)} ≥ exp[(p − 2)bτ₀(u)/u]   and   (1 + bτ₀²(u)/(2u))^{−(m/2+1)} ≥ exp[−(1 + m/2)bτ₀²(u)/(2u)],

it follows that

(1 − bτ₀(u)/u)^{−(p−2)} (1 + bτ₀²(u)/(2u))^{−(m/2+1)} ≥ exp[(p − 2)bτ₀(u)/u − (m + 2)bτ₀²(u)/(4u)]
  = exp[(bτ₀(u)(m + 2)/(4u))(4(p − 2)/(m + 2) − τ₀(u))] > 1,   (3–16)

since 0 < τ₀(u) < 4(p − 2)/(m + 2).
Moreover, putting

w = u(1 − bτ₀(u)/u)² / (1 + bτ₀²(u)/(2u)) = 2[u − bτ₀(u)]² / [2u + bτ₀²(u)],

it follows that

dw/du = (2(u − bτ₀(u)) / [2u + bτ₀²(u)]²) [2(1 − bτ₀′(u))(2u + bτ₀²(u)) − (u − bτ₀(u))(2 + 2bτ₀(u)τ₀′(u))]
      = (2(u − bτ₀(u)) / [2u + bτ₀²(u)]²) [2u + 2bτ₀(u) + 2bτ₀²(u) − 4buτ₀′(u) − 2buτ₀(u)τ₀′(u)].

Hence dw/du ≤ 1 if and only if

2[u − bτ₀(u)][2u + 2bτ₀(u) + 2bτ₀²(u) − 4buτ₀′(u) − 2buτ₀(u)τ₀′(u)] ≤ [2u + bτ₀²(u)]².

The last inequality is equivalent to

b²τ₀²(u)[2 + τ₀(u)]² + 4buτ₀′(u)[2 + τ₀(u)][u − bτ₀(u)] ≥ 0.   (3–17)

Since u ≥ bτ₀(u) for u ≥ u₀, (3–17) holds if τ₀′(u) ≥ 0, and the latter is true
due to assumption (ii). Now from (3–15)–(3–17), noting that w = 0 when u = u₀,
one gets

I*_b(r) > ∫₀^∞ w^{r+p/2−1} / [(1 + w)^{r+(p+m)/2} B(r + p/2, m/2)] dw = 1,

for all r = 0, 1, 2, . . . . This completes the proof of Theorem 3.2.0.11. □
Remark 1. It is interesting to note that I∗b (r) > 1 for all r = 0, 1, 2, . . . and any
arbitrary b > 0. The particular choice b = β(1 − β)/2 does not have any special
significance.
We now consider an extension of the above result when V (X) = Σ is an
unknown variance-covariance matrix. We solve the problem by reducing the risk
expression of the corresponding shrinkage estimator to the one in this section after
a suitable transformation.
3.3 Unknown Positive Definite Variance-Covariance Matrix
Consider the situation when Z1, . . . , Zn (n ≥ 2) are i.i.d. Np(θ,Σ), where Σ is
an unknown positive definite matrix. The goal is once again to estimate θ.
The usual estimator of θ is Z̄ = n^{−1} Σ_{i=1}^n Z_i. It is the MLE, the UMVUE and
the best equivariant estimator of θ, and is distributed as N_p(θ, n^{−1}Σ). In addition,
the usual estimator of Σ is

S = (n − 1)^{−1} Σ_{i=1}^n (Z_i − Z̄)(Z_i − Z̄)ᵀ,

and S is distributed independently of Z̄.
Based on the distribution of Z̄, the minimal sufficient statistic for θ for any given
Σ, the divergence loss is given by (see equation (3–1))

L_β(θ, a) = [1 − exp{−(nβ(1 − β)/2)(a − θ)ᵀΣ^{−1}(a − θ)}] / [β(1 − β)].   (3–18)

The corresponding risk of Z̄ is the same as the one given in (3–2), i.e.

R_β(θ, Z̄) = [1/(β(1 − β))][1 − {1 + β(1 − β)}^{−p/2}].   (3–19)
Consider now the general class of estimators

δ_τ(Z̄, S) = [1 − τ(nZ̄ᵀS^{−1}Z̄)/(nZ̄ᵀS^{−1}Z̄)] Z̄   (3–20)

of θ. Under the divergence loss given in (2–1),

L_β(θ, δ_τ(Z̄, S)) = [1 − exp{−(nβ(1 − β)/2)(δ_τ(Z̄, S) − θ)ᵀΣ^{−1}(δ_τ(Z̄, S) − θ)}] / [β(1 − β)].   (3–21)

By the Helmert orthogonal transformation,

H₁ = (1/√2)(Z₂ − Z₁),
H₂ = (1/√6)(2Z₃ − Z₁ − Z₂),
. . .
H_{n−1} = (1/√(n(n − 1)))[(n − 1)Z_n − Z₁ − Z₂ − . . . − Z_{n−1}]

and

H_n = (1/√n) Σ_{i=1}^n Z_i = √n Z̄,
one can rewrite δ_τ(Z̄, S) as

δ_τ(Z̄, S) = [1 − τ((n − 1)H_nᵀ(Σ_{i=1}^{n−1} H_iH_iᵀ)^{−1}H_n) / ((n − 1)H_nᵀ(Σ_{i=1}^{n−1} H_iH_iᵀ)^{−1}H_n)] n^{−1/2} H_n,   (3–22)

where H₁, . . . , H_n are mutually independent with H₁, . . . , H_{n−1} i.i.d. N(0, Σ) and
H_n ∼ N(√n θ, Σ).
Let

Y_i = Σ^{−1/2} H_i (i = 1, . . . , n)   and   η = Σ^{−1/2}(√n θ).
Then from (3–21) and (3–22) one can rewrite

L_β(θ, δ_τ(Z̄, S)) = [1 − exp{−(β(1 − β)/2)‖(1 − τ((n − 1)Y_nᵀ(Σ_{i=1}^{n−1} Y_iY_iᵀ)^{−1}Y_n) / ((n − 1)Y_nᵀ(Σ_{i=1}^{n−1} Y_iY_iᵀ)^{−1}Y_n)) Y_n − η‖²}] / [β(1 − β)],   (3–23)

where Y₁, . . . , Y_n are mutually independent with Y₁, . . . , Y_{n−1} i.i.d. N(0, I_p) and
Y_n ∼ N(η, I_p).
Now from Arnold ([3], p. 333) or Anderson ([4], p. 172),

Y_nᵀ(Σ_{i=1}^{n−1} Y_iY_iᵀ)^{−1}Y_n =_d ‖Y_n‖²/U,

where U ∼ χ²_{n−p} and is distributed independently of Y_n. Now from (3–23),

L_β(θ, δ_τ(Z̄, S)) = [1 − exp{−(β(1 − β)/2)‖(1 − τ((n − 1)‖Y_n‖²/U)/((n − 1)‖Y_n‖²/U)) Y_n − η‖²}] / [β(1 − β)].   (3–24)
Next, writing

Z = U/2   and   τ₀(t/z) = (2/(n − 1)) τ((n − 1)t/(2z)),

we have

L_β(θ, δ_τ(Z̄, S)) = [1 − exp{−(β(1 − β)/2)‖(1 − τ₀(‖Y_n‖²/Z)/(‖Y_n‖²/Z)) Y_n − η‖²}] / [β(1 − β)].   (3–25)

By Theorem 3.2.0.11, δ_τ(Z̄, S) dominates Z̄ as an estimator of θ provided 0 <
τ₀(u) < 4(p − 2)/(n − p + 2) for all u and 3 ≤ p ≤ n. Accordingly, δ_τ(Z̄, S) dominates Z̄ provided
0 < τ(u) < 2(p − 2)(n − 1)/(n − p + 2). We state this result in the form of the following theorem.
Theorem 3.3.0.12. Let p ≥ 3. Assume
(i) 0 < τ(t) < 2(p − 2)(n − 1)/(n − p + 2) for all t > 0;
(ii) τ(t) is a differentiable nondecreasing function of t for t > 0.
Then R(θ, δ_τ(Z̄, S)) < R(θ, Z̄) for all θ ∈ R^p.
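As an illustrative sketch (not from the dissertation), the following Python function computes the estimator (3–20) from raw data; the constant choice τ = (p − 2)(n − 1)/(n − p + 2), half of the upper bound in condition (i) of Theorem 3.3.0.12, is an assumption made for the example.

```python
import numpy as np

def shrinkage_unknown_cov(Z):
    """Estimator (3-20): [1 - tau(n zbar' S^{-1} zbar) / (n zbar' S^{-1} zbar)] zbar,
    computed from an n x p data matrix Z of i.i.d. N_p(theta, Sigma) rows (n > p >= 3)."""
    Z = np.asarray(Z, dtype=float)
    n, p = Z.shape
    zbar = Z.mean(axis=0)
    S = (Z - zbar).T @ (Z - zbar) / (n - 1)      # usual sample covariance matrix
    q = n * zbar @ np.linalg.solve(S, zbar)      # n * zbar' S^{-1} zbar
    tau = (p - 2) * (n - 1) / (n - p + 2)        # constant tau inside the allowed range
    return (1.0 - tau / q) * zbar
```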
CHAPTER 4
REFERENCE PRIORS UNDER DIVERGENCE LOSS
4.1 First Order Reference Prior under Divergence Loss
In this section we find a reference prior for estimation problems under
divergence losses. Such a prior is obtained by maximizing the expected distance
between the prior and the corresponding posterior distribution, and thus can be
interpreted as a noninformative prior, namely the prior that changes the most on
average once the sample is observed.
If we use the divergence as a distance between a proper prior distribution π(θ)
(putting all its mass on a compact set if needed) and the corresponding posterior
distribution π(θ|x), we can re-express the expected divergence as

R_β(π) = [1 − ∫∫ π^β(θ) π^{1−β}(θ|x) m(x) dx dθ] / [β(1 − β)]
       = [1 − ∫∫ m^β(x) p^{1−β}(x|θ) π(θ) dx dθ] / [β(1 − β)].   (4–1)

Using this expression, one can easily see that in order to find a prior that
maximizes R_β(π) we need to find an asymptotic expression for

∫ m^β(x) p^{1−β}(x|θ) dx

first.
In this section we assume that we have a parametric family {p_θ ≡ p(x|θ) : θ ∈ Θ}, Θ ⊂ R^p, of probability density functions with respect to
a finite dominating measure λ(dx) on a measurable space X, and we have a prior
distribution for θ that has a pdf π(θ) with respect to Lebesgue measure.
Next we give a definition of the divergence rate when the parameter is β ≤ 1
and the sample size is n. We define the relative divergence rate between the true
distribution P^n_θ and the marginal distribution m_n(x) of the sample of size n to
be

R_β(θ, π) = DR_β(P^n_θ ‖ m_n) = [1 − ∫ (m_n(x))^{β/n} (P^n_θ(x))^{1−β/n} λ(dx)] / [(β/n)(1 − β/n)].   (4–2)

It is easy to check, for β → 0, that this definition is equivalent to the definition
of the relative entropy rate considered, for example, in Clarke and Barron [21].
Using this definition, we can define for a given prior π the corresponding Bayes
risk as follows:

R_β(π) = E[DR_β(P^n_θ ‖ m_n)].   (4–3)
To find an asymptotic expansion for this risk function, we will re-express the
risk function as follows:

R_β(θ, π) = [1 − E_θ[exp{−(β/n) ln(p(x|θ)/m(x))}]] / [(β/n)(1 − β/n)],   (4–4)

where

p(x|θ) = ∏_{i=1}^n p(x_i|θ)   and   m(x) = ∫ p(x|θ) π(θ) dθ.

Clarke and Barron [20] derived the following formula:

ln(p(x|θ)/m(x)) = (p/2) ln(n/(2π)) + (1/2) ln|I(θ)| + ln(1/π(θ)) − (1/2) S_nᵀ(I(θ))^{−1}S_n + o(1),   (4–5)

where o(1) → 0 in L₁(P) as well as in probability as n → ∞. Here, S_n =
(1/√n)∇ ln p(x|θ) is the standardized score function, for which E(S_nS_nᵀ) = I(θ)
and E[S_nᵀ(I(θ))^{−1}S_n] = p.
Using this formula we can write the following asymptotic expansion for the risk
function (4–1):

R_β(θ, π) = [1 − (2π/n)^{pβ/(2n)} π^{β/n}(θ) |I(θ)|^{−β/(2n)} E_θ[exp{(β/(2n)) S_nᵀ(I(θ))^{−1}S_n}] (1 + o(1))] / [(β/n)(1 − β/n)].   (4–6)

Since S_nᵀ(I(θ))^{−1}S_n is asymptotically distributed as χ²_p, we can rewrite the
above expression as follows:

R_β(θ, π) = [1 − (2π/n)^{pβ/(2n)} π^{β/n}(θ) |I(θ)|^{−β/(2n)} (1 − β/n)^{−p/2} + o(n^{−p/4})] / [(β/n)(1 − β/n)].   (4–7)

Hence, the prior that maximizes the Bayes risk will be the one that
minimizes the integral

∫ π^{β/n+1}(θ) |I(θ)|^{−β/(2n)} dθ   (4–8)

subject to the constraint ∫ π(θ) dθ = 1 (for β > 0; for β < 0 the integral is instead maximized).
A simple calculus of variations argument gives this extremal as

π(θ) ∝ |I(θ)|^{1/2},   (4–9)

which is Jeffreys' prior.
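For completeness, the variational step can be sketched as follows (a filled-in detail, not part of the original text; λ denotes a Lagrange multiplier for the normalization constraint):

```latex
% Sketch of the calculus-of-variations step, with Lagrange multiplier \lambda:
\frac{\partial}{\partial \pi}\!\left[\pi^{1+\beta/n}(\theta)\,|I(\theta)|^{-\beta/(2n)}
      - \lambda\,\pi(\theta)\right] = 0
\;\Longrightarrow\;
\Bigl(1+\tfrac{\beta}{n}\Bigr)\,\pi^{\beta/n}(\theta)\,|I(\theta)|^{-\beta/(2n)} = \lambda
\;\Longrightarrow\;
\pi(\theta) \propto |I(\theta)|^{1/2}.
```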
4.2 Reference Prior Selection under Divergence Loss for One-Parameter Exponential Family
Let X₁, . . . , X_n be i.i.d. with common pdf (with respect to some σ-finite
measure) belonging to the regular one-parameter exponential family, given by

p(x|θ) = exp[θx − ψ(θ) + c(x)].   (4–10)

Consider a prior π(θ) for θ which puts all its mass on a compact set. We will
pass to the limit later as needed. Then the posterior is given by

π(θ|x₁, . . . , x_n) ∝ exp[n{θx̄ − ψ(θ)}] π(θ).   (4–11)

We denote the same by π(θ|x). Also, let p(x|θ) denote the conditional pdf of X
given θ and m(x) the marginal pdf of X.
The general expected divergence between the prior and the posterior is given
by

R_β(π) = [1 − ∫ [∫ π^β(θ) π^{1−β}(θ|x) dθ] m(x) dx] / [β(1 − β)].   (4–12)

From the relation p(x|θ)π(θ) = π(θ|x)m(x), one can re-express R_β(π) given in
(4–12) as

R_β(π) = [1 − ∫∫ π^{β+1}(θ) π^{−β}(θ|x) p(x|θ) dx dθ] / [β(1 − β)]
       = [1 − ∫ π^{β+1}(θ) E[π^{−β}(θ|x) | θ] dθ] / [β(1 − β)].   (4–13)
Let I(θ) = ψ′′(θ) denote the per observation Fisher information number. Then
we have the following theorem.
Theorem 4.2.0.13.
E_θ[π^{−β}(θ|x)] = (2π/(nI(θ)))^{β/2} (1 − β)^{−1/2} [1 + (1/n)(−(ψ‴(θ)β/((1 − β)I²(θ)))(π′(θ)/π(θ))
  + (ψ‴(θ))²β(3β² + 7β + 10)/(24(1 − β)I³(θ)) − (β/(2I(θ)))(π′(θ)/π(θ))²
  + (β(2 − β)/(2(1 − β)I(θ)))(π″(θ)/π(θ)) − ψ⁽⁴⁾(θ)β(2 + β)/(8(1 − β)I²(θ)))] + o(n^{−1−β/2}).   (4–14)
Proof of Theorem 4.2.0.13. Let θ̂ denote the MLE of θ. Throughout this
section we will use the following notation:

l(θ) = θx̄ − ψ(θ),  a₂ = l″(θ̂) = −ψ″(θ̂),  c = ψ″(θ̂),  a₃ = l‴(θ̂) = −ψ‴(θ̂),  a₄ = l⁽⁴⁾(θ̂) = −ψ⁽⁴⁾(θ̂),

and

h = √n(θ − θ̂).

We will use the "shrinkage" argument as presented in Datta and Mukerjee [24].
From Datta and Mukerjee ([24], p. 13),

π_π(h|x) = √(c/(2π)) exp(−ch²/2) [1 + (1/√n){(π′(θ)/π(θ))h + (1/6)a₃h³}
  + (1/n){(1/2)((π″(θ)/π(θ))h² − π″(θ)/(π(θ)c)) + (1/6)(a₃(π′(θ)/π(θ))h⁴ − (3a₃/c²)(π′(θ)/π(θ)))
  + (1/24)(a₄h⁴ − 3a₄/c²) + (1/72)(a₃²h⁶ − 15a₃²/c³)}] + o(n⁻¹).   (4–15)
With the general expansion

(1/(a₁ + a₂/√n + a₃/n + o(n⁻¹)))^β = a₁^{−β}(1 − β(a₂/(a₁√n)) + (β/n)(((β + 1)/2)(a₂²/a₁²) − a₃/a₁)) + o(n⁻¹),

we get
π^{−β}(h|x) = (c/(2π))^{−β/2} exp(cβh²/2) [1 − (β/√n){(π′(θ)/π(θ))h + (1/6)a₃h³}
  + (β/n){((1 + β)/2)((π′(θ)/π(θ))h + (1/6)a₃h³)²
  − (1/2)((π″(θ)/π(θ))h² − π″(θ)/(π(θ)c)) − (1/6)(a₃(π′(θ)/π(θ))h⁴ − (3a₃/c²)(π′(θ)/π(θ)))
  − (1/24)(a₄h⁴ − 3a₄/c²) − (1/72)(a₃²h⁶ − 15a₃²/c³)}] + o(n⁻¹).   (4–16)
Using (4–15) and (4–16), we get

π^{−β}(h|x) π_π(h|x) = (c/(2π))^{(1−β)/2} exp(−c(1 − β)h²/2) [1
  + (1/√n){(π′(θ)/π(θ))h − β(π′(θ)/π(θ))h + ((1 − β)/6)a₃h³}
  + (1/n){−β((π′(θ)/π(θ))·(π′(θ)/π(θ))h² + (1/6)a₃h⁴(π′(θ)/π(θ) + π′(θ)/π(θ)) + (1/36)a₃²h⁶)
  + (β(1 + β)/2)((π′(θ)/π(θ))h + (1/6)a₃h³)²
  − (β/2)((π″(θ)/π(θ))h² − π″(θ)/(π(θ)c)) + (1/2)((π″(θ)/π(θ))h² − π″(θ)/(π(θ)c))
  + (1/6)(a₃(π′(θ)/π(θ))h⁴ − (3a₃/c²)(π′(θ)/π(θ)) − β(a₃(π′(θ)/π(θ))h⁴ − (3a₃/c²)(π′(θ)/π(θ))))
  + ((1 − β)/24)(a₄h⁴ − 3a₄/c²) + ((1 − β)/72)(a₃²h⁶ − 15a₃²/c³)}] + o(n⁻¹).   (4–17)
Integrating this last expression with respect to h, we get

∫ π^{−β}(h|x) π_π(h|x) dh = (c/(2π))^{−β/2} (1 − β)^{−1/2} [1
  + (1/n){−β((π′(θ)/π(θ))·(π′(θ)/π(θ))(1/(c(1 − β))) + (a₃/(2c²(1 − β)²))(π′(θ)/π(θ) + π′(θ)/π(θ)) + 15a₃²/(36c³(1 − β)³))
  + (β(1 + β)/(2c(1 − β)))((π′(θ)/π(θ))² + (a₃/(c(1 − β)))(π′(θ)/π(θ)) + 15a₃²/(36c²(1 − β)²))
  + (β/(2c(1 − β)))(π″(θ)/π(θ)) − (β²/(2c(1 − β)))(π″(θ)/π(θ))
  + (a₃β(2 − β)/(2c²(1 − β)²))(π′(θ)/π(θ)) − (a₃β²(2 − β)/(2c²(1 − β)²))(π′(θ)/π(θ))
  + a₄β(2 − β)/(8c²(1 − β)) + 15a₃²(1 − (1 − β)³)/(72c³(1 − β)²)}] + o(n⁻¹).   (4–18)
From the relation θ = h/√n + θ̂, we get

E^π[π^{−β}(θ|x)|x] = n^{−β/2} ∫ π^{−β}(h|x) π_π(h|x) dh.   (4–19)
Thus by (4–18) and (4–19) we have

λ(θ) = E_θ(E^π[π^{−β}(θ|x)|x]) = (2π/(nI(θ)))^{β/2} (1 − β)^{−1/2} [1
  + (1/n){−(β/((1 − β)I(θ)))((π′(θ)/π(θ))·(π′(θ)/π(θ)) − (ψ‴(θ)/(2(1 − β)I(θ)))(π′(θ)/π(θ) + π′(θ)/π(θ)) + 5(ψ‴(θ))²/(12(1 − β)²I²(θ)))
  + (β(1 + β)/(2(1 − β)I(θ)))((π′(θ)/π(θ))² − (ψ‴(θ)/((1 − β)I(θ)))(π′(θ)/π(θ)) + 5(ψ‴(θ))²/(12(1 − β)²I²(θ)))
  + (β/(2(1 − β)I(θ)))((π″(θ)/π(θ)) − β(π″(θ)/π(θ))) − (ψ‴(θ)β(2 − β)/(2(1 − β)²I²(θ)))((π′(θ)/π(θ)) − β(π′(θ)/π(θ)))
  − ψ⁽⁴⁾(θ)β(2 − β)/(8(1 − β)I²(θ)) + 5(ψ‴(θ))²β(β² − 3β + 3)/(24(1 − β)²I³(θ))}] + o(n^{−1−β/2}).   (4–20)
In the next step, we find

∫ λ(θ)π(θ) dθ = ∫ (2π/(nI(θ)))^{β/2} (1 − β)^{−1/2}
  × [1 + (1/n){(ψ‴(θ)β/(2(1 − β)²I²(θ)))(π′(θ)/π(θ)) − 5β(ψ‴(θ))²/(12(1 − β)³I³(θ))
  + (β(1 + β)/(2(1 − β)I(θ)))((π′(θ)/π(θ))² − (ψ‴(θ)/((1 − β)I(θ)))(π′(θ)/π(θ)) + 5(ψ‴(θ))²/(12(1 − β)²I²(θ)))
  − (β²/(2(1 − β)I(θ)))(π″(θ)/π(θ)) + (ψ‴(θ)β²(2 − β)/(2(1 − β)²I²(θ)))(π′(θ)/π(θ))
  − ψ⁽⁴⁾(θ)β(2 − β)/(8(1 − β)I²(θ)) + 5(ψ‴(θ))²β(β² − 3β + 3)/(24(1 − β)²I³(θ))}] π(θ) dθ

  + ∫ (2π/(nI(θ)))^{β/2} (1 − β)^{−1/2} [(1/n)(−(β/((1 − β)I(θ)))(π′(θ)/π(θ)) + ψ‴(θ)β/(2(1 − β)²I²(θ)) − ψ‴(θ)β(2 − β)/(2(1 − β)²I²(θ))) π′(θ)] dθ

  + ∫ (2π/(nI(θ)))^{β/2} (β/(2nI(θ)(1 − β)^{3/2})) π″(θ) dθ + o(n^{−1−β/2}).   (4–21)
The last step gives an expression for E_θ[π^{−β}(θ|x)]. We let π(θ) converge weakly to the
prior degenerate at the true θ, and π(θ) is chosen in such a way that the last two
integrals in (4–21) can be integrated by parts with the boundary term equal to zero
each time integration by parts is used.
Thus we will have

E_θ[π^{−β}(θ|x)] = (2π/(nI(θ)))^{β/2} (1 − β)^{−1/2} [1 + (1/n)((ψ‴(θ)β²/(2(1 − β)I²(θ)))(π′(θ)/π(θ))
  + 5(ψ‴(θ))²β(2 − β)/(24(1 − β)I³(θ)) + (β(1 + β)/(2(1 − β)I(θ)))(π′(θ)/π(θ))²
  − (β²/(2(1 − β)I(θ)))(π″(θ)/π(θ)) − ψ⁽⁴⁾(θ)β(2 − β)/(8(1 − β)I²(θ)))]

  + (1/n)(d/dθ)[(2π/(nI(θ)))^{β/2} (1 − β)^{−1/2} ((β/((1 − β)I(θ)))(π′(θ)/π(θ)) + ψ‴(θ)β/(2(1 − β)I²(θ)))]

  + (1/n)(d²/dθ²)[(2π/(nI(θ)))^{β/2} (1 − β)^{−1/2} (β/(2(1 − β)I(θ)))] + o(n^{−1−β/2}).   (4–22)
Finally, we have

(d/dθ)[(2π/(nI(θ)))^{β/2} (1 − β)^{−1/2} ((β/((1 − β)I(θ)))(π′(θ)/π(θ)) + ψ‴(θ)β/(2(1 − β)I²(θ)))]
  = (2π/(nI(θ)))^{β/2} (1 − β)^{−1/2} ((β/((1 − β)I(θ)))(π″(θ)/π(θ)) − (β/((1 − β)I(θ)))(π′(θ)/π(θ))²
    − (β(1 + β/2)ψ‴(θ)/((1 − β)I²(θ)))(π′(θ)/π(θ)) + βψ⁽⁴⁾(θ)/(2(1 − β)I²(θ)) − β(2 + β/2)(ψ‴(θ))²/(2(1 − β)I³(θ))),   (4–23)
and

(d²/dθ²)[(2π/(nI(θ)))^{β/2} (1 − β)^{−1/2} (β/(2(1 − β)I(θ)))]
  = −(d/dθ)[(2π/(nI(θ)))^{β/2} (1 − β)^{−1/2} (β(1 + β/2)ψ‴(θ)/(2(1 − β)I²(θ)))]
  = (2π/(nI(θ)))^{β/2} (1 − β)^{−1/2} ((ψ‴(θ))²β(1 + β/2)(2 + β/2)/(2(1 − β)I³(θ)) − ψ⁽⁴⁾(θ)β(1 + β/2)/(2(1 − β)I²(θ))).   (4–24)
From (4–22)–(4–24), we get

E_θ[π^{−β}(θ|x)] = (2π/(nI(θ)))^{β/2} (1 − β)^{−1/2} [1 + (1/n)(−(ψ‴(θ)β/((1 − β)I²(θ)))(π′(θ)/π(θ))
  + (ψ‴(θ))²β(3β² + 7β + 10)/(24(1 − β)I³(θ)) − (β/(2I(θ)))(π′(θ)/π(θ))²
  + (β(2 − β)/(2(1 − β)I(θ)))(π″(θ)/π(θ)) − ψ⁽⁴⁾(θ)β(2 + β)/(8(1 − β)I²(θ)))] + o(n^{−1−β/2}).   (4–25)

This completes the proof. □
In view of Theorem 4.2.0.13, for β < 1 and β ≠ 0, −1, one has

R_β(π) = [1 − (2π/n)^{β/2} (1 − β)^{−1/2} ∫ {π(θ)/I^{1/2}(θ)}^β π(θ) dθ] / [β(1 − β)] + o(1).   (4–26)

Thus the first order approximation to R_β(π) is given by

[1 − (2π/n)^{β/2} (1 − β)^{−1/2} ∫ {π(θ)/I^{1/2}(θ)}^β π(θ) dθ] / [β(1 − β)].

We want to maximize this expression with respect to π(θ) subject to
∫ π(θ) dθ = 1.
We will show that Jeffreys' prior asymptotically maximizes R_β(π) when
|β| < 1.
To do this we will use Hölder's inequality, as follows.
Hölder's inequality for positive exponents ([39], p. 190). Let p, q > 1 be
real numbers satisfying 1/p + 1/q = 1, and let f ∈ L_p, g ∈ L_q. Then fg ∈ L₁ and

∫ |fg| dµ ≤ (∫ |f|^p dµ)^{1/p} (∫ |g|^q dµ)^{1/q},   (4–27)

with equality iff |f|^p ∝ |g|^q.
Hölder's inequality for negative exponents ([39], p. 191). Let 0 < q < 1 and
p ∈ R be such that 1/p + 1/q = 1 (hence p < 0). If f, g are measurable functions,
then

∫ |fg| dµ ≥ (∫ |f|^p dµ)^{1/p} (∫ |g|^q dµ)^{1/q},   (4–28)
with equality iff |f|^p ∝ |g|^q.
First we consider 0 < β < 1. In this case it is enough to minimize

∫ π^{1+β}(θ) (I^{−1/2}(θ))^β dθ.

From Hölder's inequality for positive exponents with

p = 1 + β,  q = (1 + β)/β,  g(θ) = (I^{1/2}(θ))^{β/(1+β)}  and  f(θ) = π(θ)(I^{−1/2}(θ))^{β/(1+β)},

and noting that fg = π, so that ∫ |fg| dµ = 1, we can write

∫ π^{1+β}(θ) (I^{−1/2}(θ))^β dθ = ∫ |f|^p dµ ≥ (∫ |fg| dµ)^p / (∫ |g|^q dµ)^{p/q} = 1/(∫ I^{1/2}(θ) dθ)^β,

with "=" iff

π(θ) ∝ I^{1/2}(θ).
Next, consider the case −1 < β < 0. In this case asymptotic maximization
of R_β(π) is equivalent to maximization of

∫ π^{1+β}(θ) (I^{1/2}(θ))^{−β} dθ.

From Hölder's inequality for positive exponents with

p = 1/(1 + β),  q = 1/(−β),  f(θ) = π^{1+β}(θ)  and  g(θ) = (I^{1/2}(θ))^{−β},

we obtain

∫ π^{1+β}(θ) (I^{1/2}(θ))^{−β} dθ ≤ (∫ I^{1/2}(θ) dθ)^{−β},

with "=" iff

π(θ) ∝ I^{1/2}(θ).
When β < −1, using Hölder's inequality for negative exponents with

p = 1/(1 + β) < 0,  0 < q = 1/(−β) < 1,  f(θ) = π^{1+β}(θ)  and  g(θ) = (I^{1/2}(θ))^{−β},

we obtain

∫ π^{1+β}(θ) (I^{1/2}(θ))^{−β} dθ ≥ (∫ I^{1/2}(θ) dθ)^{−β},

with "=" iff

π(θ) ∝ I^{1/2}(θ).

Unfortunately, in this case Jeffreys' prior is a minimizer, and since it is the only
solution of the corresponding Euler-Lagrange equation, there is no proper prior that
asymptotically maximizes R_β(π).
There are only two cases left: β = 0, which corresponds to the KL loss, and β = −1,
which corresponds to the chi-square loss.
As β → 0+, one obtains the expression due to Clarke and Barron (1990, 1994)
[20], [21], namely

R₀(π) = (1/2) log(n/(2πe)) − ∫ π(θ) log(π(θ)/I^{1/2}(θ)) dθ + o(1),

which is maximized by taking π(θ) ∝ I^{1/2}(θ), once again leading to Jeffreys' prior.
The only exception is β = −1, the chi-square distance as considered in Clarke
and Sun (1997) [22]. In this case π^{β+1}(θ) = 1, so that the first order term as
obtained from Theorem 4.2.0.13 is a constant, and one needs to consider the second
order term. In this case,
1 + 2R_{−1}(π) = ∫ (2π/(nI(θ)))^{−1/2} (1/√2) [1 + (1/n)((ψ‴(θ)/(2I²(θ)))(π′(θ)/π(θ)) − (ψ‴(θ))²/(8I³(θ))
  + (1/(2I(θ)))(π′(θ)/π(θ))² − (3/(4I(θ)))(π″(θ)/π(θ)) + ψ⁽⁴⁾(θ)/(16I²(θ)))] dθ + o(n^{−1/2}).   (4–29)
The reference prior is obtained by maximizing the expected chi-square distance
between the prior distribution and the corresponding posterior. By (4–29), this amounts to
maximizing the following integral with respect to the prior π(θ):

∫ ((ψ‴(θ)/I^{3/2}(θ))(π′(θ)/π(θ)) + (1/I^{1/2}(θ))(π′(θ)/π(θ))² − (3/(2I^{1/2}(θ)))(π″(θ)/π(θ))) dθ.   (4–30)

To simplify this further, we will use the substitution

y(θ) = π′(θ)/π(θ),

so that (4–30) reduces to

∫ ((ψ‴(θ)/I^{3/2}(θ)) y(θ) − (1/(2I^{1/2}(θ))) y²(θ) − (3/(2I^{1/2}(θ))) y′(θ)) dθ.   (4–31)
Maximizing the last expression with respect to y(θ) and noting that ψ‴(θ) =
I′(θ), one gets the Euler-Lagrange equation

∂L/∂y − (d/dθ)(∂L/∂y′) = 0,

with L the functional under the integral sign in (4–31).
This equation is equivalent to

I′(θ)/(4I^{3/2}(θ)) − y(θ)/I^{1/2}(θ) = 0.

Solving, we get

y(θ) = I′(θ)/(4I(θ)),

thereby producing the reference prior

π(θ) ∝ I^{1/4}(θ).   (4–32)
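As an illustration (not part of the original text), consider the Poisson family written in its natural parameter; the chi-square reference prior (4–32) then differs from Jeffreys' prior as follows (the choice of family is an assumption made only for this example):

```latex
% Poisson family in its natural parameter \theta = \log\lambda:
% \psi(\theta) = e^{\theta}, so I(\theta) = \psi''(\theta) = e^{\theta}.
\pi_{\chi^2}(\theta) \propto I^{1/4}(\theta) = e^{\theta/4},
\qquad
\pi_{J}(\theta) \propto I^{1/2}(\theta) = e^{\theta/2};
% on the mean scale \lambda = e^{\theta} (Jacobian d\theta/d\lambda = 1/\lambda):
\pi_{\chi^2}(\lambda) \propto \lambda^{-3/4},
\qquad
\pi_{J}(\lambda) \propto \lambda^{-1/2}.
```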
CHAPTER 5
SUMMARY AND FUTURE RESEARCH
5.1 Summary
This dissertation revisits the problem of simultaneous estimation of normal
means. It is shown that a general class of shrinkage estimators as introduced
by Baranchik [5] dominates the sample mean in three or higher dimensions
under a general divergence loss which includes the Kullback-Leibler (KL) and
Bhattacharyya-Hellinger (BH) losses ([13]; [38]) as special cases. An analogous
result is found for estimating the predictive density of a normal variable with
the same mean and a known but possibly different scalar multiple of the identity
matrix as its variance. The results are extended to accommodate shrinkage towards
a regression surface.
These results are extended to the estimation of the multivariate normal
mean with an unknown variance-covariance matrix. First, it is shown that for an
unknown scalar multiple of the identity matrix as the variance-covariance matrix,
a general class of estimators along the lines of Baranchik [5] and Efron and Morris
[30] continues to dominate the sample mean in three or higher dimensions. Second,
it is shown that even for an unknown positive definite variance-covariance matrix,
the dominance continues to hold for a general class of suitably defined shrinkage
estimators.
Also the problem of prior selection for an estimation problem is considered. It
is shown that the first order reference prior under divergence loss coincides with
Jeffreys’ prior.
5.2 Future Research
The following is a list of future research problems:
• The admissibility of the MLE under divergence loss is an open question when
p = 2. It is conjectured that the MLE is admissible, but the proofs under
squared error loss are difficult to adapt to the divergence loss.
• Another important problem is to find an admissible class of estimators of the
multivariate normal mean under general divergence loss.
• Extend the results for simultaneous estimation problem with an unknown
variance-covariance matrix to prediction problems.
• Find a link identity, similar to that of George et al. [34], between estimation
and prediction problems, if such an identity exists.
• Explain the Stein phenomenon using differential-geometric methods on
statistical manifolds as in Amari [2].
REFERENCES
[1] Aitchison, J. (1975). Goodness of prediction fit. Biometrika 62 547-554.
[2] Amari, S. (1982). Differential geometry of curved exponential families -curvatures and information loss. Ann. Statist. 10 357-387.
[3] Arnold, S.F. (1981). The theory of linear models and multivariate analysis.John Wiley & Sons, New York.
[4] Anderson, T.W. (1984). An introduction to multivariate statisticalanalysis. 2nd ed. John Wiley & Sons, New York.
[5] Baranchik, A.J. (1970). A family of minimax estimators of the mean of a multivariate normal distribution. Ann. Math. Statist. 41 642-645.
[6] Bayes, T.R. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society 53 370-418. Reprinted in Biometrika 45 243-315, 1958.
[7] Berger, J.O. (1975). Minimax estimation of location vectors for a wideclass of densities. Ann. Statist. 3 1318-1328.
[8] Berger, J.O. (1985). Statistical Decision Theory and Bayesian Analysis(2nd edition). Springer-Verlag, New York.
[9] Berger, J.O. and Bernardo, J.M. (1989). Estimating a product of means: Bayesian analysis with reference priors. J. Amer. Statist. Assoc. 84 200-207.
[10] Berger, J.O. and Bernardo, J.M. (1992a). Reference priors in a variance components problem. In Bayesian Analysis in Statistics and Econometrics (P.K. Goel and N.S. Iyengar, eds.) 177-194. Springer-Verlag, New York.
[11] Berger, J.O. and Bernardo, J.M. (1992b). On the development of reference priors (with discussion). In Bayesian Statistics 4 (J.M. Bernardo et al., eds.) 35-60. Oxford Univ. Press, London.
[12] Bernardo, J.M. (1979). Reference posterior distributions for Bayesian inference. J. Roy. Statist. Soc. B 41 113-147.
[13] Bhattacharyya, A.K. (1943). On a measure of divergence betweentwo statistical populations defined by their probability distributions. Bull.Calcutta Math. Soc. 35 99-109.
[14] Blyth, C.R. (1951). On minimax statistical decision procedures and theiradmissibility. Ann. Math. Statist. 22 22-42.
[15] Bock, M.E. (1975). Minimax estimators of the mean of a multivariatenormal distribution. Ann. Statist. 3 209-218.
[16] Bolza, O. (1904). Lectures on the Calculus of Variations. Univ. ChicagoPress, Chicago.
[17] Brown, L.D. (1966). On the admissibility of invariant estimator of one ormore location parameters. Ann. Math. Stat. 38 1087-1136.
[18] Brown, L.D. (1971). Admissible estimators, recurrent diffusions andinsoluble boundary value problems. Ann. Math. Statist. 42 855-903.
[19] Brown, L.D. and Hwang, J.T. (1982). A unified admissibility proof.Statistical Decision Theory and related topics Academic Press, New York, 3205-267.
[20] Clarke, B. and Barron, A. (1990). Information-theoretic asymptotics ofBayes methods. IEEE Trans. Inform. Theory 36 453-471.
[21] Clarke, B. and Barron, A. (1994). Jeffreys’ prior is asymptotically leastfavorable under entropy risk. J. Statist. Plann. Infer. 41 37-60.
[22] Clarke, B. and Sun, D. (1997). Reference priors under the chi-squaredistance. Sankhya, Ser.A 59 215-231.
[23] Cressie, N. and Read, T. R. C. (1984). Multinomial Goodness-of-FitTests. J. Roy. Statist. Soc. B 46 440-464.
[24] Datta, G.S. and Mukerjee, R. (2004). Probability matching priors:higher order asymptotics. Springer, New York.
[25] Dawid, A.P. (1983). Invariant Prior Distributions. in Encyclopedia ofStatistical Sciences eds. Kotz, S. and Johnson, N.L. New York: John Wiley,228-236.
[26] Dawid, A.P., Stone, N. and Zidek, J.V. (1973). Marginalizationparadoxes in Bayesian and structural inference (with discussion). J. Roy.Statist. Soc. B 35 189-233.
[27] Efron, B. and Morris, C. (1972). Limiting the risk of Bayes andempirical Bayes estimators - Part II: The empirical Bayes case. J. Amer.Statist. Assoc. 67 130-139.
[28] Efron, B. and Morris, C. (1973). Stein’s estimation rule and itscompetitors - an empirical Bayes approach. J. Amer. Statist. Assoc. 68117-130.
[29] Efron, B. and Morris, C. (1975). Data analysis using Stein’s estimatorand its generalizations. J. Amer. Statist. Assoc. 70 311-319.
[30] Efron, B. and Morris, C. (1976). Families of minimax estimators of themean of a multivariate normal distribution. Ann. Statist. 4 11-21.
[31] Faith, R.E. (1978). Minimax Bayes and point estimations of a multivariatenormal mean. J. Mult. Anal. 8 372-379.
[32] Fisher, R.A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London, Ser. A, 222 309-368.
[33] Fourdrinier, D., Strawderman, W.E. and Wells, M.T. (1998). Onthe construction of Bayes minimax estimators. Ann. Statist. 26 660-671.
[34] George, E.I., Liang, F. and Xu, X. (2006). Improved minimax predictive densities under Kullback-Leibler loss. Ann. Statist. 34 78-92.
[35] Ghosh, M. (1992). Hierarchical and Empirical Bayes Multivariateestimation. Current Issues in Statistical Inference: Essays in Honor of D.Basu, Ghosh, M. and Pathak, P.K. eds., Institute of Mathematical StatisticsLecture Notes and Monograph Series, 17 151-177.
[36] Ghosh, J.K. and Mukerjee, R. (1991). Characterization of priors under which Bayesian and Bartlett corrections are equivalent in the multiparameter case. J. Mult. Anal. 38 385-393.
[37] Hartigan, J.A. (1964). Invariant Prior Distributions. Ann. Math. Statist.35 836-845.
[38] Hellinger, E. (1909). Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen. Journal für die Reine und Angewandte Mathematik 136 210-271.
[39] Hewitt, E. and Stromberg, K. (1969). Real and Abstract Analysis. AModern Treatment of the Theory of Functions of a Real Variable. secondprinting corrected, Springer-Verlag, Berlin.
[40] Hodges, J.L. and Lehmann, E.L. (1950). Some problems in minimaxpoint estimation. Ann. Math. Statist. 21 182-197.
[41] Hwang, J.T. and Casella, G. (1982). Minimax confidence sets for themean of a multivariate normal distribution. Ann. Statist. 10 868-881.
[42] James, W. and Stein, C. (1961). Estimation with quadratic loss. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability University of California Press, 1 361-380.
[43] Jaynes, E.T. (1968). Prior probabilities. IEEE Transactions on SystemsScience and Cybernetics SSC-4 227-241.
[44] Jeffreys, H. (1961). Theory of Probability. (3rd edition.) London: OxfordUniversity Press.
[45] Komaki, F. (2001). A shrinkage predictive distribution for multivariatenormal observations. Biometrika 88 859-864.
[46] Kullback, S. and Leibler, R.A. (1951). On information and sufficiency. Ann. Math. Statist. 22 525-540.
[47] Lehmann, E.L. (1986). Testing Statistical Hypotheses. (2nd edition). J.Wiley, New York.
[48] Lehmann, E.L. and Casella, G. (1998). Theory of Point Estimation.(2nd edition). Springer-Verlag, New York.
[49] Liang, F. (2002). Exact Minimax Strategies for predictive density estimationand data. Ph.D. dissertation, Dept. Statistics, Yale Univ.
[50] Lindley, D.V. (1962). Discussions of Professor Stein’s paper ’Confidencesets for the mean of a multivariate distribution’. J. Roy. Statist. Soc. B 24265-296.
[51] Lindley, D.V. and Smith, A.F.M. (1972). Bayes estimates for the linearmodel. J. Roy. Statist. Soc. B 34 1-41.
[52] Morris, C. (1981). Parametric empirical Bayes confidence intervals.Scientific Inference, Data Analysis, and Robustness. eds. Box, G.E.P.,Leonard, T. and Jeff Wu, C.F. Academic Press, 25-50.
[53] Morris, C. (1983). Parametric empirical Bayes inference and applications.J. Amer. Statist. Assoc. 78 47-65.
[54] Murray, G.D. (1977). A note on the estimation of probability densityfunctions. Biometrika 64 150-152.
[55] Ng, V.M. (1980). On the estimation of parametric density functions.Biometrika 67 505-506.
[56] Robert, C.P. (2001). The Bayesian Choice. (2nd edition). Springer-Verlag,New York.
[57] Rukhin, A.L. (1995). Admissibility: Survey of a concept in progress. Inter.Statist. Review, 63 95-115.
[58] Stein, C. (1955). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability Berkeley and Los Angeles, University of California Press, 197-206.
[59] Stein, C. (1974). Estimation of the mean of a multivariate normaldistribution. Proceedings of the Prague Symposium on Asymptotic Statisticsed. Hajek, J. Prague, Universita Karlova, 345-381.
[60] Strawderman, W.E. (1971). Proper Bayes minimax estimators of themultivariate normal mean. Ann. Math. Statist. 42 385-388.
[61] Strawderman, W. E. (1972). On the existence of proper Bayes minimaxestimators of the mean of a multivariate distribution. Proceedings of theSixth Berkeley Symposium on Mathematical Statistics and ProbabilityBerkeley and Los Angeles, University of California Press, 6 51-55.
BIOGRAPHICAL SKETCH
The author was born in Korukivka, Ukraine in 1973. He received the Specialist
and Candidate of Science degrees in Probability Theory and Statistics from Kiev
National University of Taras Shevchenko in 1997 and 2001, respectively. In 2001 he
came to the University of Florida to pursue a Ph.D. degree in the Department of Statistics.