Estimation in Gaussian Noise: Properties of the Minimum Mean-square Error

Dongning Guo, Shlomo Shamai (Shitz), and Sergio Verdú

June 18, 2008

The work of D. Guo has been supported by the NSF under grant CCF-0644344 and DARPA under grant W911NF-07-1-0028. The work of S. Shamai and S. Verdú has been supported by the Binational US-Israel Scientific Foundation under Grant 2004140.

D. Guo is with the Department of Electrical Engineering and Computer Science, Northwestern University, Evanston, IL 60208, USA. S. Shamai (Shitz) is with the Department of Electrical Engineering, Technion-Israel Institute of Technology, 32000 Haifa, Israel. S. Verdú is with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544, USA.

Abstract

This work studies the minimum mean-square error (MMSE) of estimating an arbitrary random variable from an observation contaminated by Gaussian noise. The MMSE can be regarded as a function of the signal-to-noise ratio (SNR), as well as a functional of the distribution of the random variable. The MMSE is shown to be an analytic function of the SNR, and simple expressions for its first three derivatives are obtained. This paper also shows that the MMSE is convex in the SNR and concave in the input distribution. Moreover, it is shown that there can be only one SNR for which the MMSE of a Gaussian random variable and that of a non-Gaussian random variable coincide. These properties lead to simple proofs of the facts that Gaussian inputs achieve both the secrecy capacity of scalar Gaussian wiretap channels and the capacity of scalar Gaussian broadcast channels.

Index Terms: Entropy, estimation, Gaussian noise, Gaussian broadcast channel, Gaussian wiretap channel, minimum mean-square error (MMSE), mutual information.

I. INTRODUCTION

The concept of mean square error has assumed a central role in the theory and practice of estimation since the time of Gauss and Legendre. In particular, minimization of mean square error underlies numerous methods in statistical sciences. The focus of this paper is the minimum mean-square error (MMSE) of estimating an arbitrary random variable contaminated by additive Gaussian noise.

Let (X, Y) be random variables with an arbitrary joint distribution. We denote the conditional mean estimate of X given Y as E{X | Y}, and the corresponding conditional variance is a function of Y which we denote by

\mathrm{var}\{X \mid Y\} = E\left\{ (X - E\{X \mid Y\})^2 \,\middle|\, Y \right\}. \tag{1}

It is well known that the conditional mean estimate is optimal in the mean-square sense. In fact, the MMSE of estimating X given Y is nothing but the average conditional variance

\mathrm{mmse}(X|Y) = E\{\mathrm{var}\{X|Y\}\}. \tag{2}

Throughout this paper, the variables X and Y are related through models of the following form:

Y = \sqrt{\mathrm{snr}}\, X + N \tag{3}

where N ∼ N(0, 1) is standard Gaussian, and snr ≥ 0 stands for the gain of the model in signal-to-noise ratio (SNR). The MMSE of estimating the input (X) of the model given the noisy output (Y) is also denoted by

\mathrm{mmse}(X, \mathrm{snr}) = \mathrm{mmse}\left(X \,\middle|\, \sqrt{\mathrm{snr}}\, X + N\right) \tag{4}
= E\left\{ \left(X - E\left\{X \,\middle|\, \sqrt{\mathrm{snr}}\, X + N\right\}\right)^2 \right\}. \tag{5}

Evidently, the MMSE can be regarded as a function of the SNR, and as a functional of the input distribution P_X.

For illustration, the function mmse(X, snr) is plotted in Figure 1 for three special inputs: the standard Gaussian variable, a Gaussian variable with variance 1/2, and a binary variable equally likely to be ±1.

Fig. 1. The MMSE of Gaussian input and binary input as a function of the SNR. Upper thick curve: standard Gaussian input, with MMSE 1/(1 + snr). Thin curve: Gaussian input with variance 1/2, with MMSE 1/(2 + snr). Lower thick curve: binary input equally likely to be ±1, with MMSE 1 − ∫ (2π)^{−1/2} e^{−y²/2} tanh(snr − √snr y) dy.
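The three curves in Fig. 1 are straightforward to reproduce numerically. The following Python sketch (added for illustration; not part of the original paper) evaluates the two Gaussian closed forms and the binary-input integral by Gauss-Hermite quadrature.

import numpy as np

# Gaussian inputs of variance v have mmse = v / (1 + v*snr); the equiprobable
# +/-1 input has mmse = 1 - E{ tanh(snr - sqrt(snr)*Z) } with Z ~ N(0,1),
# which is evaluated here by Gauss-Hermite quadrature.

def mmse_gaussian(snr, var=1.0):
    return var / (1.0 + var * snr)

def mmse_binary(snr, n_nodes=80):
    # Gauss-Hermite integrates against exp(-t^2); substitute y = sqrt(2)*t.
    t, w = np.polynomial.hermite.hermgauss(n_nodes)
    y = np.sqrt(2.0) * t
    return 1.0 - np.sum(w * np.tanh(snr - np.sqrt(snr) * y)) / np.sqrt(np.pi)

for snr in [0.0, 0.5, 1.0, 2.0, 4.0, 8.0]:
    print(snr, mmse_gaussian(snr), mmse_gaussian(snr, 0.5), mmse_binary(snr))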

The main results in this paper can be summarized briefly as follows. In Section II, a simple property of the MMSE under input scaling is shown, followed by several results on the finiteness of the moments of the error achieved by conditional mean estimation.

Section III collects several previously known results which connect information measures and the MMSE. In particular, the mutual information between a random variable and its Gaussian-noise-contaminated observation can be written as an integral of the MMSE as a function of the SNR. This is due to the following relationship established in [1]:

\mathrm{mmse}(X, \mathrm{snr}) = 2\, \frac{d}{d\,\mathrm{snr}}\, I\left(X; \sqrt{\mathrm{snr}}\, X + N\right). \tag{6}

Moreover, the entropy and the differential entropy also admit integral expressions as functions of the MMSE.
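Relation (6) is easy to check numerically for a specific input. The Python sketch below (added for illustration; not from the paper) compares a finite-difference derivative of the mutual information with half the MMSE for an equiprobable binary input X = ±1, for which p_Y(y) = 0.5 φ(y − √snr) + 0.5 φ(y + √snr) and var{X | Y = y} = 1 − tanh²(√snr y).

import numpy as np
from scipy.integrate import quad

def phi(y):
    return np.exp(-0.5 * y * y) / np.sqrt(2.0 * np.pi)

def p_y(y, s):
    return 0.5 * (phi(y - s) + phi(y + s))

def mutual_info_binary(snr, lim=12.0):
    s = np.sqrt(snr)
    # I(X; Y) = h(Y) - h(N) in nats; the integrand vanishes outside +/-(lim + s)
    h_y = quad(lambda y: -p_y(y, s) * np.log(p_y(y, s)), -lim - s, lim + s)[0]
    return h_y - 0.5 * np.log(2.0 * np.pi * np.e)

def mmse_binary(snr, lim=12.0):
    s = np.sqrt(snr)
    # mmse = E{ var(X|Y) } = E{ 1 - tanh^2(sqrt(snr) * Y) }
    return quad(lambda y: p_y(y, s) * (1.0 - np.tanh(s * y) ** 2), -lim - s, lim + s)[0]

snr, h = 1.5, 1e-4
lhs = mmse_binary(snr)
rhs = 2.0 * (mutual_info_binary(snr + h) - mutual_info_binary(snr - h)) / (2.0 * h)
print(lhs, rhs)   # the two values agree to several decimal places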

In Section IV, mmse(X, snr) is found to be an analytic function of snr on (0, ∞) for every input distribution regardless of the existence of its moments (even the mean of the input can be infinite). In other words, the MMSE can be arbitrarily well approximated by its Taylor series expansion at all positive SNR. The development is facilitated using the incremental channel device, which was introduced in [1] to investigate the change in the mutual information due to an infinitesimal increase in the noise variance.

The first three derivatives of the MMSE with respect to the SNR are found in Section V, conveniently expressed in terms of the average central moments of the input conditioned on the output. As a consequence, the derivatives of the input-output mutual information of Gaussian channels up to the fourth order find simple expressions in terms of estimation-theoretic quantities.

Section VI shows that the MMSE is concave in the distribution P_X. The monotonicity of the MMSE of a partial sum of independent identically distributed (i.i.d.) random variables is also investigated. Moreover, it is shown that the MMSE curve of a non-Gaussian input cannot coincide with that of a Gaussian input for all SNRs. In fact, the two MMSE curves intersect at most once on [0, ∞).

Applications of the properties of the MMSE to the Gaussian wiretap channel and the scalar Gaussian broadcast channel are shown in Section VII. Sidestepping Shannon's entropy-power inequality (EPI), the properties of the MMSE lead to simple and natural proofs of the fact that Gaussian input is optimal in both problems.

II. SCALING AND FINITENESS

The input X and the observation Y in the model described by (3) are tied probabilistically by the conditional Gaussian density

p_{Y|X}(y|x; \mathrm{snr}) = \frac{1}{\sqrt{2\pi}} \exp\left[ -\frac{1}{2}\left(y - \sqrt{\mathrm{snr}}\, x\right)^2 \right]. \tag{7}


Let us define

q_i(y; \mathrm{snr}) = E\left\{ X^i\, p_{Y|X}(y|X; \mathrm{snr}) \right\}, \quad i = 0, 1, \dots \tag{8}

Since p_{Y|X}(y|x; snr) is bounded and vanishes quadratic-exponentially fast as either x or y becomes large with the other variable bounded, it is easy to see that q_i(y; snr) exists for all y ∈ R if snr > 0. For the same reason,

E\left\{ |X|^i \,\middle|\, Y = y \right\} < \infty. \tag{9}

In particular, q_0(y; snr) is nothing but the marginal distribution p_Y(y; snr). Moreover, the conditional mean can be expressed as [1], [2]

\hat{X}(y) = E\{X \mid Y = y\} = \frac{q_1(y; \mathrm{snr})}{q_0(y; \mathrm{snr})} \tag{10}

and the MMSE can be calculated as [2]

\mathrm{mmse}(X, \mathrm{snr}) = E\{X^2\} - \int_{-\infty}^{\infty} \frac{q_1^2(y; \mathrm{snr})}{q_0(y; \mathrm{snr})}\, dy. \tag{11}
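As a concrete illustration of (8)-(11), the minimal Python sketch below (not from the paper; the three-point input is an arbitrary example) evaluates q_0 and q_1 by direct summation for a discrete input and computes the MMSE via (11) by numerical integration.

import numpy as np
from scipy.integrate import quad

# Arbitrary example of a discrete input: values xs with probabilities ps.
xs = np.array([-1.0, 0.0, 2.0])
ps = np.array([0.25, 0.50, 0.25])

def p_y_given_x(y, x, snr):
    return np.exp(-0.5 * (y - np.sqrt(snr) * x) ** 2) / np.sqrt(2.0 * np.pi)   # (7)

def q(i, y, snr):
    # q_i(y; snr) = E{ X^i p_{Y|X}(y|X; snr) }, cf. (8)
    return np.sum(ps * xs ** i * p_y_given_x(y, xs, snr))

def mmse(snr, lim=15.0):
    second_moment = np.sum(ps * xs ** 2)
    integral = quad(lambda y: q(1, y, snr) ** 2 / q(0, y, snr), -lim, lim)[0]
    return second_moment - integral   # (11)

for snr in [0.0, 0.5, 1.0, 4.0]:
    print(snr, mmse(snr))   # at snr = 0 the MMSE equals the input variance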

Note that the estimation error X − E{X | Y} remains the same if X is subject to a constant shift. Hence the following well-known fact:

Proposition 1: For every random variable X and a ∈ R,

\mathrm{mmse}(X + a, \mathrm{snr}) = \mathrm{mmse}(X, \mathrm{snr}). \tag{12}

The following is also straightforward from the definition of the MMSE.

Proposition 2: For every random variable X and a ∈ R,

\mathrm{mmse}(aX, \mathrm{snr}) = a^2\, \mathrm{mmse}(X, a^2\, \mathrm{snr}). \tag{13}

The input to a Gaussian model with nonzero SNR can always be estimated with finite mean square error based on the output, regardless of the moments of the input distribution. In fact, \hat{X} = Y/\sqrt{\mathrm{snr}} achieves mean square error of 1/snr, even if E{X} does not exist. Moreover, the trivial estimate of 0 achieves mean square error of E{X^2} if it is finite. Hence the following result.

Proposition 3: For every input X,

\mathrm{mmse}(X, \mathrm{snr}) \le \min\left\{ E\{X^2\},\ \frac{1}{\mathrm{snr}} \right\}. \tag{14}

For the same reason that leads to (9), all moments of the estimation error are also finite at positive SNR.


Proposition 4: For every random variable X, snr > 0, n ≥ 1 and N ∼ N(0, 1),

E\left\{ \left| X - E\left\{ X \,\middle|\, \sqrt{\mathrm{snr}}\, X + N \right\} \right|^n \right\} < \infty. \tag{15}

In fact, for every realization of Y = \sqrt{\mathrm{snr}}\, X + N,

E\left\{ \left| X - E\{X \mid Y = y\} \right|^n \,\middle|\, Y = y \right\} < \infty. \tag{16}

Note that the estimation error of the input is proportional to the estimation error of the noise, i.e.,

X - E\{X \mid Y\} = X - \frac{Y}{\sqrt{\mathrm{snr}}} + E\left\{ \frac{Y}{\sqrt{\mathrm{snr}}} - X \,\middle|\, Y \right\} \tag{17}
= \frac{1}{\sqrt{\mathrm{snr}}} \left( E\{N \mid Y\} - N \right). \tag{18}

Hence, Proposition 4 also applies to the estimation error of the noise based on the output.

Proof of Proposition 4: Conditioned on Y = y,

E\{ |X - E\{X \mid Y\}|^n \mid Y = y \} \le E\{ (|X| + |E\{X \mid Y\}|)^n \mid Y = y \} \tag{19}
\le E\{ (2|X|)^n + (2|E\{X \mid Y\}|)^n \mid Y = y \} \tag{20}
\le 2^{n+1} E\{ |X|^n \mid Y = y \} \tag{21}
< \infty \tag{22}

where (21) is due to Jensen's inequality and (22) is due to (9). Without conditioning on the received signal, using (18) and similar techniques leading to (22),

E\{ |X - E\{X \mid Y\}|^n \} = \mathrm{snr}^{-n/2}\, E\{ |E\{N \mid Y\} - N|^n \} \tag{23}
\le \mathrm{snr}^{-n/2}\, 2^{n+1}\, E\{ |N|^n \} \tag{24}
= \sqrt{\frac{2^{3n+2}}{\pi\, \mathrm{snr}^n}}\ \Gamma\!\left( \frac{n+1}{2} \right) \tag{25}

which is finite for all snr > 0 and n ≥ 0.
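For reference, the step from (24) to (25) uses the absolute moments of a standard Gaussian; the standard identity is spelled out below for completeness.

E\{|N|^n\} = \sqrt{\frac{2^n}{\pi}}\ \Gamma\!\left(\frac{n+1}{2}\right), \qquad \text{so that} \qquad \mathrm{snr}^{-n/2}\, 2^{n+1}\, E\{|N|^n\} = \sqrt{\frac{2^{3n+2}}{\pi\, \mathrm{snr}^n}}\ \Gamma\!\left(\frac{n+1}{2}\right).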

III. MMSE AND INFORMATION MEASURES

Since [1], a number of connections between information measures and the MMSE have been identified. The central result is given by (6), which equates the MMSE to twice the derivative of the input-output mutual information of a Gaussian channel with respect to the SNR. This relationship implies the following integral expression for the input-output mutual information of Gaussian channels:

I\left(X; \sqrt{\mathrm{snr}}\, g(X) + N\right) = \frac{1}{2} \int_0^{\mathrm{snr}} \mathrm{mmse}(g(X), \gamma)\, d\gamma \tag{26}

which holds for any one-to-one function g(·). Sending snr → ∞ in (26) leads to the following results.

Proposition 5 ([1], [3]): If X is a discrete random variable, then

H(X) = \frac{1}{2} \int_0^{\infty} \mathrm{mmse}(g(X), \gamma)\, d\gamma \tag{27}

holds for any one-to-one function g(·). In case X is a continuous random variable with a density, then

h(X) = \frac{1}{2} \log(2\pi e) - \frac{1}{2} \int_0^{\infty} \left[ \frac{1}{1+\gamma} - \mathrm{mmse}(g(X), \gamma) \right] d\gamma. \tag{28}

It is interesting to note that it is not at all obvious that the integrals on the right hand side of (27) and (28) are independent of the choice of the one-to-one function g(·).
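As a quick numerical illustration of (27) (a sketch, not from the paper), the entropy of an equiprobable binary input can be recovered by integrating its MMSE curve over the SNR; the integral below is truncated at a large SNR, where the binary MMSE is already negligible.

import numpy as np
from scipy.integrate import quad

# Check H(X) = (1/2) * integral_0^inf mmse(X, gamma) d gamma for X = +/-1
# equiprobable, whose entropy is log(2) nats.  The MMSE comes from the closed
# form 1 - E{ tanh(gamma - sqrt(gamma)*Z) }, Z ~ N(0,1).

def mmse_binary(gamma, n_nodes=80):
    t, w = np.polynomial.hermite.hermgauss(n_nodes)   # integrates against exp(-t^2)
    y = np.sqrt(2.0) * t
    return 1.0 - np.sum(w * np.tanh(gamma - np.sqrt(gamma) * y)) / np.sqrt(np.pi)

half_integral = 0.5 * quad(mmse_binary, 0.0, 60.0, limit=200)[0]
print(half_integral, np.log(2.0))   # both are approximately 0.6931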

The above information-estimation relationships have found a number of applications, e.g., in nonlinear filtering [1], in multiuser communications [2], and in the proof of Shannon's EPI and its variations [3]–[5]. In Section VII, the integral representation of mutual information (26) is used to prove the optimality of Gaussian input in Gaussian wiretap channels and scalar Gaussian broadcast channels.

Moreover, all the above results can be generalized to vector-valued inputs, where the MMSE becomes the sum of the MMSEs of all input elements [1], [4]. In particular, the derivative of the mutual information of a multiple-input multiple-output model with respect to each element of the channel matrix has been obtained [6].

IV. THE ANALYTICITY OF THE MMSE

For arbitrary but fixed distribution P_X, this section establishes the analyticity of mmse(X, snr) as a function of snr, and thereby clears the way towards calculating its derivatives in Section V. The key that underlies the analyticity is that p_{Y|X}(y|x; snr) given in (7) is smooth and vanishes quickly as either x or y becomes large with the other variable bounded. These properties can be used to establish the analyticity of mmse(X, snr) at snr = 0+ and then for all snr > 0 using the incremental channel technique [1].

Proposition 6: For every random variable X, the MMSE function mmse(X, snr) is analytic for all snr > 0, namely, the function is locally given by a convergent power series. The function is also analytic on [0, ∞) if E{X^2} < ∞.

An immediate corollary is the following.


Corollary 1: For every input, mmse(X, snr) is infinitely differentiable for all snr > 0. The statement also holds for snr = 0+ if E{X^2} < ∞.

The analyticity of the MMSE also implies that the MMSE can be reconstructed from its local derivatives, which, as we shall see, are in general polynomials in the conditional moments of the input given the observation.

A. Analyticity at snr = 0

We first establish the analyticity of mmse(X, snr) at snr = 0 assuming that the input X has finite moments of all orders. The following sequence of lemmas is essential to the proof.

Lemma 1: If E{|X|^n} < ∞, n = 1, 2, . . . , then

\int x^i \exp\left[ -\frac{1}{2}\left(y - \sqrt{\delta}\, x\right)^2 \right] dP_X(x) \tag{29}

is analytic in δ for all δ > 0 and y ∈ R. Furthermore, its Taylor series expansion at δ = 0 is absolutely convergent.

Proof: Fix y. The function \exp\left[ -\frac{1}{2}(y - \sqrt{\delta}\, x)^2 \right] is bounded and analytic in δ, and in fact its Taylor series expansion in \sqrt{\delta}, denoted as

\exp\left[ -\frac{1}{2} y^2 \right] + \sum_{n=1}^{\infty} f_n(y)\, x^n\, \delta^{\frac{n}{2}} \tag{30}

converges absolutely for all x and δ ≥ 0. The following equivalent form of (29),

\exp\left[ -\frac{1}{2} y^2 \right] E\{X^i\} + \sum_{n=1}^{\infty} f_n(y)\, E\{X^{n+i}\}\, \delta^{\frac{n}{2}} \tag{31}

can then be shown to be absolutely convergent using the bounded convergence theorem.

Lemma 2: Let Y = \sqrt{\delta}\, X + N with δ > 0. For all y, the conditional mean estimate \hat{X}(y, \delta) = E\{X \mid Y = y\} is absolutely convergent in δ.

Proof: Note that q_0(y; snr) > 0 for all y and snr. Since the integral (29) is absolutely convergent in δ for i = 0, 1, their ratio, which is the conditional mean, must also be absolutely convergent (and hence also analytic in δ).

Lemma 3: For every y, the conditional variance

\mathrm{var}\{X \mid Y = y\} = E\left\{ \left(X - \hat{X}(y, \delta)\right)^2 \,\middle|\, Y = y \right\} \tag{32}

is analytic in δ.

Proof: For every y, the integral

\int \left(x - \hat{X}(y, \delta)\right)^2 p_{Y|X}(y|x; \delta)\, dP_X(x) \tag{33}

is analytic in δ due to Lemma 1. Dividing by p_Y(y; δ), which is the integral in (29) with i = 0 and positive for all y ∈ R, yields the conditional variance. Hence the proof of the lemma.

The analyticity at snr = 0 can now be established.

Lemma 4:

\mathrm{mmse}(X, \delta) = \int_{-\infty}^{\infty} E\left\{ \left(X - \hat{X}(y, \delta)\right)^2 \,\middle|\, Y = y \right\} p_Y(y; \delta)\, dy \tag{34}

is analytic at δ = 0.

Proof: Note that the integrand, as a product of two absolutely convergent series, is also absolutely convergent. The analyticity of the integral can be established using the bounded convergence theorem.

B. The Incremental Channel

Fig. 2. An incremental Gaussian channel.

Let snr be an arbitrary but fixed positive number. In the following, we describe mmse(X, snr) by translating the function to the origin snr = 0, so that the analyticity of the MMSE at zero SNR can be extended to all SNR. Consider mmse(X, snr + γ) as a function of γ > 0. We examine the incremental channel [1] (a cascade of two Gaussian channels) as depicted in Figure 2:

Y_1 = X + \sigma_1 N_1 \tag{35a}
Y_2 = Y_1 + \sigma_2 N_2 \tag{35b}

where X is the input, and N_1 and N_2 are independent standard Gaussian random variables. Let \sigma_1, \sigma_2 > 0 satisfy \sigma_1^2 = 1/(\mathrm{snr} + \gamma) and \sigma_1^2 + \sigma_2^2 = 1/\mathrm{snr} so that the SNR of the first channel (35a) is snr + γ and that of the composite channel is snr. A linear combination of (35a) and (35b) yields

(\mathrm{snr} + \gamma)\, Y_1 = \mathrm{snr}\, Y_2 + \gamma\, X + \sqrt{\gamma}\, N \tag{36}

where we have defined

N = \frac{1}{\sqrt{\gamma}} \left( \gamma\, \sigma_1 N_1 - \mathrm{snr}\, \sigma_2 N_2 \right). \tag{37}


Clearly, the incremental channel (35) is equivalent to (36) paired with (35b). Due to the mutual independence of (X, N_1, N_2), N is a standard Gaussian random variable independent of X. Moreover, (X, N, \sigma_1 N_1 + \sigma_2 N_2) are mutually independent. Thus N is independent of (X, Y_2) by (35). Based on the above observations, the joint distribution of X and Y_1 conditioned on Y_2 = y_2 is exactly the input-output relationship of a Gaussian channel with SNR equal to γ described by (36) with Y_2 = y_2. Note that knowledge of Y_2 does not improve the estimation error based on Y_1 because Y_2 is a degraded version of Y_1.
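For completeness, one can verify directly from the definitions of σ_1 and σ_2 that the N defined in (37) is indeed standard Gaussian and uncorrelated with the composite noise; this short check (added here, implicit in the original argument) uses σ_2^2 = 1/snr − 1/(snr + γ) = γ/(snr(snr + γ)):

\mathrm{var}(N) = \frac{\gamma^2 \sigma_1^2 + \mathrm{snr}^2 \sigma_2^2}{\gamma} = \frac{1}{\gamma}\left( \frac{\gamma^2}{\mathrm{snr}+\gamma} + \frac{\mathrm{snr}\,\gamma}{\mathrm{snr}+\gamma} \right) = 1, \qquad E\{ N\, (\sigma_1 N_1 + \sigma_2 N_2) \} = \frac{\gamma \sigma_1^2 - \mathrm{snr}\, \sigma_2^2}{\sqrt{\gamma}} = 0.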

For convenience, let the conditional MMSE be defined for any pair of jointly distributed variables (X, U) and SNR γ ≥ 0:

\mathrm{mmse}(X, \gamma \mid U) = E\left\{ \left( X - E\{X \mid \sqrt{\gamma}\, X + N, U\} \right)^2 \right\} \tag{38}

where N ∼ N(0, 1) is independent of (X, U). It can be regarded as the MMSE achieved with side information U available to the estimator. The MMSE of estimating X conditioned on Y_2 = y_2 can then be written as

\mathrm{mmse}(X, \gamma \mid Y_2 = y_2). \tag{39}

Taking the expectation over the distribution of Y_2, which depends on snr but not on γ, the average MMSE of estimating X given Y_1 is then expressed as

\mathrm{mmse}(X, \mathrm{snr} + \gamma) = \mathrm{mmse}(X, \gamma \mid Y_2). \tag{40}

C. Analyticity At Arbitrary SNR

Proposition 6 can now be established using the incremental channel as well as the analyticity at snr = 0. For any snr > 0, consider the series expansion of the MMSE at snr + δ about snr. The Gaussian channel can be translated to an equivalent conditional Gaussian channel at SNR equal to δ using (40). The finiteness of the central moments of the input conditioned on the output at snr is guaranteed by Proposition 4. Applying the analyticity of the conditional MMSE at δ = 0 leads to the analyticity of mmse(X, snr) for all snr > 0.

V. DERIVATIVES OF THE MMSE

A. Derivatives of the MMSE

With the smoothness of the MMSE established in Corollary 1, we next investigate its derivatives with respect to the SNR. Note that the Taylor series expansion of the MMSE around snr = 0+ to the third order has been obtained in [1] as

\mathrm{mmse}(X, \mathrm{snr}) = 1 - \mathrm{snr} + \mathrm{snr}^2 - \frac{1}{6}\left[ \left(E X^4\right)^2 - 6\, E X^4 - 2\left(E X^3\right)^2 + 15 \right] \mathrm{snr}^3 + O\left(\mathrm{snr}^4\right) \tag{41}

where X is assumed to have zero mean and unit variance. The technique is to expand (7) in terms of the small signal \sqrt{\mathrm{snr}}\, X, evaluate (8) using the moments of X, and then calculate (11), where the integral over y can be evaluated as a Gaussian integral. Evidently, the polynomial expansion can be carried out to arbitrarily high orders, although the resulting expressions become increasingly complicated.

The above expansion of the MMSE at snr = 0+ can be lifted to arbitrary SNR using the incremental channel technique. Essentially, conditioned on the degraded received signal at a slightly weaker SNR, the channel at the stronger SNR can be viewed as a Gaussian channel with small SNR, for which the Taylor series expansion is essentially given by (41). Finiteness of the input moments is not required for snr > 0 because the conditional moments are always finite due to Proposition 4.

We denote the following random variables:

M_i = E\left\{ (X - E\{X \mid Y\})^i \,\middle|\, Y \right\}, \quad i = 1, 2, \dots \tag{42}

which all exist according to Proposition 4. Evidently, M_1 = 0, M_2 = \mathrm{var}\left\{ X \,\middle|\, \sqrt{\mathrm{snr}}\, X + N \right\}, and

E\{M_2\} = \mathrm{mmse}(X, \mathrm{snr}). \tag{43}

If the input distribution P_X is symmetric, i.e., X and −X are identically distributed, then M_i = 0 for all odd i.

The derivatives of the MMSE are found to be the expected value of polynomials of the M_i, whose existence is guaranteed by Proposition 4.

Proposition 7: For every random variable X and every snr > 0,

\frac{d\, \mathrm{mmse}(X, \mathrm{snr})}{d\, \mathrm{snr}} = -E\left\{ M_2^2 \right\} \tag{44}

where N ∼ N(0, 1) and the M_i are defined in (42). Moreover,

\frac{d^2 \mathrm{mmse}(X, \mathrm{snr})}{d\, \mathrm{snr}^2} = 2\, E\left\{ M_2^3 \right\} \tag{45}

and

\frac{d^3 \mathrm{mmse}(X, \mathrm{snr})}{d\, \mathrm{snr}^3} = E\left\{ -M_4^2 + 6 M_4 M_2^2 + 2 M_3^2 M_2 - 15 M_2^4 \right\}. \tag{46}

The derivatives are also valid for snr = 0+ if X has finite moments.

It is easy to check that the derivatives found in Proposition 7 are consistent with the Taylor series expansion (41) at zero SNR. In principle, any derivative of the MMSE can be obtained as an expectation of a polynomial of the conditional moments. We relegate the proof of Proposition 7 to Section V-D.

According to Proposition 7, the first derivative of the MMSE is negative and the second positive, hence the following.

Corollary 2: The function mmse(X, snr) is monotonically decreasing and convex in snr.
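As a numerical sanity check of (44) (a sketch, not from the paper), the following Python code compares a finite-difference estimate of d mmse/d snr with −E{M_2^2} for an equiprobable binary input, for which M_2 = 1 − tanh²(√snr · Y).

import numpy as np
from scipy.integrate import quad

# Verify d mmse(X, snr)/d snr = -E{M_2^2} for X = +/-1 equiprobable.

def phi(y):
    return np.exp(-0.5 * y * y) / np.sqrt(2.0 * np.pi)

def expect_over_y(func, snr, lim=12.0):
    # E{ func(Y) } with Y = sqrt(snr)*X + N, X = +/-1 equiprobable
    s = np.sqrt(snr)
    p_y = lambda y: 0.5 * (phi(y - s) + phi(y + s))
    return quad(lambda y: p_y(y) * func(y), -lim - s, lim + s)[0]

def M2(y, snr):
    return 1.0 - np.tanh(np.sqrt(snr) * y) ** 2

snr, h = 2.0, 1e-4
mmse = lambda g: expect_over_y(lambda y: M2(y, g), g)
finite_diff = (mmse(snr + h) - mmse(snr - h)) / (2.0 * h)
closed_form = -expect_over_y(lambda y: M2(y, snr) ** 2, snr)
print(finite_diff, closed_form)   # the two values agree closely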

B. Derivatives of the Mutual Information

Based on Proposition 7, the following derivatives of the mutual information are extensions of the key information-estimation relationship (6).

Corollary 3: For every distribution P_X, the mutual information I(X; \sqrt{\mathrm{snr}}\, X + N) is analytic in snr. Moreover,

\frac{d^i}{d\, \mathrm{snr}^i}\, I\left(X; \sqrt{\mathrm{snr}}\, X + N\right) = (-1)^{i-1}\, \frac{(i-1)!}{2}\, E\left\{ M_2^i \right\} \tag{47}

for i = 1, 2, 3, and

\frac{d^4}{d\, \mathrm{snr}^4}\, I\left(X; \sqrt{\mathrm{snr}}\, X + N\right) = \frac{1}{2}\, E\left\{ -M_4^2 + 6 M_4 M_2^2 + 2 M_3^2 M_2 - 15 M_2^4 \right\}. \tag{48}

Corollary 3 is a generalization of previous results on the small-SNR expansion of the mutual information such as in [7]. Note that (47) with i = 1 is exactly the original relationship between the mutual information and the MMSE given by (6) in light of (43). Interestingly, the first three derivatives of the mutual information can be simply evaluated using the first three moments of the conditional variance M_2.
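For instance, for a standard Gaussian input the conditional variance is the constant M_2 = (1 + snr)^{-1} with M_3 = 0 and M_4 = 3(1 + snr)^{-2} (cf. the end of Section V), and Corollary 3 reduces to the familiar derivatives of I = (1/2) ln(1 + snr) in nats; this worked check is added here for illustration:

\frac{d^i}{d\,\mathrm{snr}^i}\, \frac{1}{2}\ln(1+\mathrm{snr}) = (-1)^{i-1}\, \frac{(i-1)!}{2}\, (1+\mathrm{snr})^{-i} = (-1)^{i-1}\, \frac{(i-1)!}{2}\, E\{M_2^i\}, \qquad i = 1, 2, 3,

\frac{1}{2}\, E\left\{ -M_4^2 + 6 M_4 M_2^2 + 2 M_3^2 M_2 - 15 M_2^4 \right\} = \frac{-9 + 18 + 0 - 15}{2}\, (1+\mathrm{snr})^{-4} = -3\,(1+\mathrm{snr})^{-4} = \frac{d^4}{d\,\mathrm{snr}^4}\, \frac{1}{2}\ln(1+\mathrm{snr}).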

C. Derivatives of the Conditional MMSE

The derivatives in Proposition 7 can be generalized to the conditional MMSE defined in (38). For every u, let X_u denote a random variable indexed by u with distribution P_{X|U=u}. Then the conditional MMSE can be seen as an average

\mathrm{mmse}(X, \mathrm{snr} \mid U) = \int \mathrm{mmse}(X_u, \mathrm{snr})\, dP_U(u). \tag{49}

The following is a straightforward extension of (44).

Corollary 4: For every jointly distributed (X, U),

\frac{d}{d\, \mathrm{snr}}\, \mathrm{mmse}(X, \mathrm{snr} \mid U) = -E\left\{ (M_2(U))^2 \right\} \tag{50}

where for every u and i = 1, 2, . . . ,

M_i(u) = E\left\{ \left[ X_u - E\{X_u \mid Y\} \right]^i \,\middle|\, Y = \sqrt{\mathrm{snr}}\, X_u + N \right\}. \tag{51}


D. Proof of Proposition 7 on the Derivatives

The first derivative of the mutual information with respect to the SNR is derived in [1] using the incremental channel technique. The same technique is adequate for the analysis of the derivatives of various other information-theoretic and estimation-theoretic quantities.

The MMSE of estimating an input Z with zero mean, unit variance (\sigma_Z^2 = 1) and finite higher-order moments admits the following Taylor series expansion in the SNR [1, equation (91)]:

\mathrm{mmse}(Z, \mathrm{snr}) = 1 - \mathrm{snr} + \mathrm{snr}^2 - \frac{1}{6}\left[ \left(E Z^4\right)^2 - 6\, E Z^4 - 2\left(E Z^3\right)^2 + 15 \right] \mathrm{snr}^3 + O\left(\mathrm{snr}^4\right). \tag{52}

This expansion can be obtained from (11) by expanding q_i(y; snr) in the vicinity of snr = 0 (see equation (90) in [1]). In general, given an arbitrary random variable X, we denote its central moments by

m_i = E\left\{ (X - E\{X\})^i \right\}, \quad i = 1, 2, \dots \tag{53}

Suppose all moments of X are finite; then the random variable can be represented as X = E\{X\} + \sqrt{m_2}\, Z where Z has zero mean and unit variance. By (52) and Proposition 2,

\mathrm{mmse}(X, \mathrm{snr}) = m_2\, \mathrm{mmse}(Z, \mathrm{snr}\, m_2) \tag{54}
= m_2 - \mathrm{snr}\, m_2^2 + \mathrm{snr}^2 m_2^3 - \frac{1}{6}\, \mathrm{snr}^3 \left[ m_4^2 - 6 m_4 m_2^2 - 2 m_3^2 m_2 + 15 m_2^4 \right] + O\left(\mathrm{snr}^4\right) \tag{55}

because E\{Z^i\} = m_2^{-i/2}\, m_i. In general, taking into account the variance of the input, we have

\mathrm{mmse}'(X, 0) = -m_2^2 \tag{56}
\mathrm{mmse}''(X, 0) = 2 m_2^3 \tag{57}
\mathrm{mmse}'''(X, 0) = -m_4^2 + 6 m_4 m_2^2 + 2 m_3^2 m_2 - 15 m_2^4. \tag{58}

Now that the MMSE at an arbitrary SNR is rewritten as the expectation of MMSEs at zero SNR, we can make use of known derivatives at zero SNR to obtain derivatives at any SNR. In particular, because of (56),

\left. \frac{d}{d\gamma}\, \mathrm{mmse}(X, \gamma \mid Y_2 = y_2) \right|_{\gamma=0} = -\left( \mathrm{var}\{X \mid Y_2 = y_2\} \right)^2 \tag{59}

and

\left. \frac{d}{d\gamma}\, \mathrm{mmse}(X, \gamma \mid Y_2) \right|_{\gamma=0} = -E\left\{ \left( \mathrm{var}\{X \mid Y_2\} \right)^2 \right\}. \tag{60}


Thus,

\frac{d}{d\, \mathrm{snr}}\, \mathrm{mmse}(X, \mathrm{snr}) = \left. \frac{d}{d\gamma}\, \mathrm{mmse}(X, \mathrm{snr} + \gamma) \right|_{\gamma=0} \tag{61}
= \left. \frac{d}{d\gamma}\, \mathrm{mmse}(X, \gamma \mid Y_2) \right|_{\gamma=0} \tag{62}
= -E\left\{ \left( \mathrm{var}\{X \mid Y_2\} \right)^2 \right\} \tag{63}
= -E\left\{ \left( \mathrm{var}\left\{ X \,\middle|\, \sqrt{\mathrm{snr}}\, X + N \right\} \right)^2 \right\} \tag{64}
= -E\left\{ M_2^2 \right\} \tag{65}

where (62) is due to (40) and the fact that the distribution of Y_2 is not dependent on γ, and (63) is due to (60). Hence (44) is proved. Moreover, because of (57),

\left. \frac{d^2}{d\gamma^2}\, \mathrm{mmse}(X, \gamma \mid Y_2 = y_2) \right|_{\gamma=0} = 2 \left( \mathrm{var}\{X \mid Y_2 = y_2\} \right)^3 \tag{66}

and hence

\frac{d^2}{d\, \mathrm{snr}^2}\, \mathrm{mmse}(X, \mathrm{snr}) = 2\, E\left\{ \left( \mathrm{var}\left\{ X \,\middle|\, \sqrt{\mathrm{snr}}\, X + N \right\} \right)^3 \right\} \tag{67}

proves (45). Similar arguments, together with (58), lead to the third derivative of the MMSE, which is obtained as (46).

Finally, the Taylor series expansion (52) for mmse(Z, snr) can be extended to arbitrary order by further expanding the functions q_i(y; snr), although the resulting expressions quickly become tedious. Nonetheless, all coefficients of the powers of snr are polynomials in the moments of Z. Using the incremental channel technique, derivatives of mmse(X, snr) of arbitrary order can be calculated, all of which are polynomials in the conditional moments of the input.

Proposition 7 is easily verified in the special case of standard Gaussian input (X ∼ N(0, 1)), where, conditioned on Y = y, the input is Gaussian distributed:

X \sim \mathcal{N}\left( \frac{\sqrt{\mathrm{snr}}}{1 + \mathrm{snr}}\, y,\ \frac{1}{1 + \mathrm{snr}} \right). \tag{68}

In this case M_2 = (1 + \mathrm{snr})^{-1}, M_3 = 0 and M_4 = 3(1 + \mathrm{snr})^{-2} are constants, and (44), (45) and (46) are straightforward.

VI. PROPERTIES OF THE MMSE FUNCTIONAL

For any fixed snr, mmse(X, snr) can be regarded as a functional of the input distribution P_X. Meanwhile, the MMSE function, {mmse(X, snr), snr ∈ [0, ∞)}, can be regarded as a "transform" of the input distribution. This section studies the properties of this MMSE functional.


A. Conditioning Reduces the MMSE

As a fundamental measure of uncertainty, the MMSE decreases with additional side information available to the estimator. This is because an informed optimal estimator performs no worse than any uninformed estimator, since it can simply discard the side information. The following result is well known; it can also be viewed as a consequence of the fact that data processing increases the MMSE.

Lemma 5: For any jointly distributed (X,U) and snr ≥ 0,

mmse(X, snr|U) ≤ mmse(X, snr) (69)

in which the equality holds if and only if X is independent of U .

B. Concavity and Monotonicity

Proposition 8: The functional mmse(X, snr) is concave in P_X for every snr ≥ 0.

Proof: Let B be a Bernoulli variable which takes the value 0 with probability α. Consider any random variables X_0, X_1 independent of B. Let Z = X_B. Then the distribution of Z is α P_{X_0} + (1 − α) P_{X_1}. Consider the problem of estimating Z given \sqrt{\mathrm{snr}}\, Z + N where N is standard Gaussian. Note that if B is revealed, one can choose either the optimal estimator for P_{X_0} or that for P_{X_1} depending on the value of B. In fact, Lemma 5 leads to

\mathrm{mmse}(Z, \mathrm{snr}) \ge \mathrm{mmse}(Z, \mathrm{snr} \mid B) \tag{70}
= \alpha\, \mathrm{mmse}(X_0, \mathrm{snr}) + (1 - \alpha)\, \mathrm{mmse}(X_1, \mathrm{snr}) \tag{71}

which is the desired concavity of mmse(X, snr) in P_X.

Note that the reasoning in the proof of Proposition 8 can also be used to prove the concavity of I(X; Y) in the distribution P_X.

Proposition 8 suggests that a mixture of random variables is harder to estimate than the individual variables on average. A related result in [3] states that a linear combination of two random variables X_1 and X_2 is also harder to estimate than the individual variables in a certain average sense:

Proposition 9 ([3]): For every snr ≥ 0 and α ∈ [0, 2π],

\mathrm{mmse}(\cos\alpha\, X_1 + \sin\alpha\, X_2, \mathrm{snr}) \ge \cos^2\alpha\, \mathrm{mmse}(X_1, \mathrm{snr}) + \sin^2\alpha\, \mathrm{mmse}(X_2, \mathrm{snr}). \tag{72}

A related result concerns the MMSE of estimating a normalized sum of i.i.d. random variables. Let X_1, X_2, . . . be i.i.d. with finite variance and S_n = (X_1 + · · · + X_n)/\sqrt{n}. It has been shown that the entropy of S_n increases monotonically to that of a Gaussian random variable of the same variance [5], [9]. The following monotonicity result of the MMSE of estimating S_n in Gaussian noise can be established.

Proposition 10: Let X_1, X_2, . . . be i.i.d. with finite variance. Let S_n = (X_1 + · · · + X_n)/\sqrt{n}. Then for every snr ≥ 0,

\mathrm{mmse}(S_{n+1}, \mathrm{snr}) \ge \mathrm{mmse}(S_n, \mathrm{snr}). \tag{73}

Because of the central limit theorem, as n → ∞ the MMSE converges to the MMSE of estimating a Gaussian random variable with the same variance as that of X, namely (cf. (77))

\frac{\sigma_X^2}{\mathrm{snr}\, \sigma_X^2 + 1}. \tag{74}

Proposition 10 is a simple corollary of the following general result in [5].

Proposition 11 ([5]): Let X_1, . . . , X_n be independent. For any λ_1, . . . , λ_n ≥ 0 which sum up to one and any γ ≥ 0,

\mathrm{mmse}\left( \sum_{i=1}^{n} X_i,\ \gamma \right) \ge \sum_{i=1}^{n} \lambda_i\, \mathrm{mmse}\left( \frac{X_{\setminus i}}{\sqrt{(n-1)\lambda_i}},\ \gamma \right) \tag{75}

where X_{\setminus i} = \sum_{j=1, j \ne i}^{n} X_j.

Setting λ_i = 1/n in (75) yields Proposition 10.

In view of the representation of the entropy or differential entropy using the MMSE in Section III, taking the integral of both sides of (73) proves the corresponding monotonicity of the entropy or differential entropy of S_n, whichever is well-defined. More generally, reference [5] applies Propositions 5 and 11 to prove the following result, which is originally given in [9].

Theorem 1 ([9]): Under the same condition as in Proposition 11,

h\left( \sum_{i=1}^{n} a_i X_i \right) \ge \sum_{i=1}^{n} \frac{1 - a_i^2}{n-1}\ h\left( \sum_{j=1, j \ne i}^{n} \frac{a_j X_j}{\sqrt{1 - a_i^2}} \right) \tag{76}

where (a_1, . . . , a_n) is an arbitrary vector with unit norm.

C. Gaussian and Non-Gaussian Inputs

Is the MMSE transform one-to-one? We conjecture the following relationship between the MMSEs of different inputs.

Conjecture 1: The MMSEs of two input distributions coincide for all SNR if and only if the two distributions are identical modulo a shift in mean and polarity. In other words, for any zero-mean random variables X and Z, mmse(X, snr) ≡ mmse(Z, snr) for all snr ∈ [0, ∞) if and only if X is identically distributed as either Z or −Z.

If the conjecture is true, then the input distribution is completely determined by its MMSE transform, modulo polarity and a shift in its mean value.

We have been able to prove the conjecture in the case that one of the inputs is Gaussian. It is widely understood that linear estimation is optimal for Gaussian inputs, which achieves the Gaussian MMSE. The suboptimality of linear estimation for non-Gaussian inputs implies that any non-Gaussian input achieves strictly smaller MMSE than a Gaussian input of the same variance. This well-known result is illustrated in Figure 1 and stated as follows.

Proposition 12: For every snr ≥ 0 and random variable X with variance no greater than σ^2,

\mathrm{mmse}(X, \mathrm{snr}) \le \frac{\sigma^2}{1 + \mathrm{snr}\, \sigma^2}. \tag{77}

Equality in (77) is achieved if and only if the distribution of X is Gaussian with variance σ^2.

Proof: Due to Propositions 1 and 2, it is enough to prove the result assuming that E\{X\} = 0 and \mathrm{var}\{X\} = \sigma^2. Consider the linear estimator for the channel (3):

\hat{X}_l = \frac{\sqrt{\mathrm{snr}}\, \sigma^2}{\mathrm{snr}\, \sigma^2 + 1}\, Y \tag{78}

which achieves the least mean square error among all linear estimators, namely exactly the right hand side of (77). The inequality (77) is evident due to the suboptimality of constraining the estimator to be linear. In order to show strict inequality if X is non-Gaussian, we show instead that X must be Gaussian if the linear estimator is optimal. It is well known that the optimal estimate \hat{X} satisfies the orthogonality property, i.e.,

E\left\{ f(Y)\left( X - \hat{X} \right) \right\} = 0 \tag{79}

for all functions f of the observation Y. Plugging in f(Y) = Y^k with k = 1, 2, . . . , all moments of X can be obtained, and they turn out to be the moments of a Gaussian random variable with zero mean and variance σ^2. Due to Carleman's theorem [10], the distribution is uniquely determined by its moments to be Gaussian.

D. The Unique Crossing Property

In view of Proposition 12 and the scaling property of the MMSE, at any given SNR, the MMSE of a non-Gaussian input is equal to the MMSE of some Gaussian input with reduced variance. The following result suggests that there is some additional simple ordering of the MMSEs due to Gaussian and non-Gaussian inputs.

Proposition 13: Consider the difference between the MMSE of a standard Gaussian random variable and that of an arbitrary variable X,

f(\gamma) = (1 + \gamma)^{-1} - \mathrm{mmse}(X, \gamma) \tag{80}

at SNR equal to γ. If X is not standard Gaussian, the function f has at most one zero, i.e., the curves of mmse(X, γ) and (1 + γ)^{-1} intersect at most once on [0, ∞). In particular, if 0 ≤ snr_0 < ∞ is a zero of the function (i.e., f(snr_0) = 0), then

1) f(0) ≤ 0;
2) f(γ) is strictly increasing on γ ∈ [0, snr_0];
3) f(γ) > 0 for every γ ∈ (snr_0, ∞); and
4) lim_{γ→∞} f(γ) = 0.

Furthermore, the result holds if the term (1 + γ)^{-1} in (80) is replaced by σ^2/(1 + σ^2 γ) with any σ, which is the MMSE of a Gaussian variable with variance σ^2 corrupted by standard Gaussian noise.

Fig. 3. An example of the difference f(γ) = (1 + γ)^{-1} − mmse(X, γ) between the MMSE of a standard Gaussian input and that of a binary input equally likely to be ±√2. The difference crosses the horizontal axis only once, at snr_0.
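The curve in Fig. 3 can be reproduced with a few lines of Python (a sketch, not from the paper): by Proposition 2, the MMSE of the ±√2 input equals 2 · mmse_b(2γ), where mmse_b is the MMSE of the equiprobable ±1 input.

import numpy as np

def mmse_pm1(gamma, n_nodes=80):
    # mmse of equiprobable +/-1 input: 1 - E{ tanh(gamma - sqrt(gamma)*Z) }, Z ~ N(0,1)
    t, w = np.polynomial.hermite.hermgauss(n_nodes)
    y = np.sqrt(2.0) * t
    return 1.0 - np.sum(w * np.tanh(gamma - np.sqrt(gamma) * y)) / np.sqrt(np.pi)

def f(gamma):
    # by Proposition 2, mmse(sqrt(2)*B, gamma) = 2 * mmse(B, 2*gamma)
    return 1.0 / (1.0 + gamma) - 2.0 * mmse_pm1(2.0 * gamma)

for g in [0.0, 0.5, 1.0, 2.0, 4.0, 8.0]:
    print(g, f(g))
# f starts at 1 - 2 = -1, increases, crosses zero once, then decays back to 0.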

Proof: If var{X} < 1, the proposition holds because f(γ) has no zero due to Proposition 12. We suppose in the following that var{X} ≥ 1 but X is not standard Gaussian. An instance of the function f(γ) with X equally likely to be ±√2 is shown in Figure 3. Evidently f(0) = 1 − var{X} ≤ 0. Consider the derivative of the difference (80) at any γ with f(γ) < 0. By Proposition 7,

f'(\gamma) = E\left\{ M_2^2 \right\} - (1 + \gamma)^{-2} \tag{81}
> E\left\{ M_2^2 \right\} - \left( \mathrm{mmse}(X, \gamma) \right)^2 \tag{82}
= E\left\{ M_2^2 \right\} - \left( E\, M_2 \right)^2 \tag{83}
\ge 0 \tag{84}

where (82) holds because f(γ) < 0 for all 0 ≤ γ < snr_0 by assumption, and (83) is due to (43). It is clear that f(0) = 1 − var{X} ≤ 0, lim_{γ→∞} f(γ) = 0 and f(γ) has one zero at snr_0. The smoothness of f and its monotone increase for all γ with f(γ) < 0 imply that f has no more than one zero crossing (by a zero crossing we mean that f takes nonzero values in arbitrarily small neighborhoods of the crossing).

For any fixed σ, the above arguments can be repeated with σ^2 γ treated as the SNR. The property of f(σ^2 γ) implies that the proposition holds with the standard Gaussian MMSE replaced by the MMSE of a Gaussian variable with variance σ^2.

Corollary 5: As functions of γ, mmse(X, γ) and (1 + γ)^{-1} intersect at most once on [0, ∞) unless X ∼ N(0, 1), for which the MMSEs coincide for all γ.

The single-crossing property can be generalized to the conditional MMSE defined in (38).

Proposition 14: Let X and U be jointly distributed variables. The function

f(\gamma) = (1 + \gamma)^{-1} - \mathrm{mmse}(X, \gamma \mid U) \tag{85}

has at most one zero on [0, ∞) unless X is standard Gaussian conditioned on U. In particular, if snr_0 < ∞ is the unique zero, then f(γ) is strictly increasing on γ ∈ [0, snr_0] and strictly positive on γ ∈ (snr_0, ∞). Furthermore, the result holds if (1 + γ)^{-1} in (85) is replaced by σ^2/(1 + σ^2 γ) with any σ.

Proof: For every u, let X_u denote a random variable indexed by u with distribution P_{X|U=u}. Define also a random variable for every u,

M(u, \gamma) = M_2(X_u, \gamma) \tag{86}
= \mathrm{var}\left\{ X_u \,\middle|\, \sqrt{\gamma}\, X_u + N \right\} \tag{87}

where N ∼ N(0, 1). Evidently, E\{M(u, \gamma)\} = \mathrm{mmse}(X_u, \gamma) and hence

f(\gamma) = \frac{1}{1 + \gamma} - E\{ E\{ M(U, \gamma) \mid U \} \} \tag{88}
= \frac{1}{1 + \gamma} - E\{ M(U, \gamma) \}. \tag{89}

Clearly,

f'(\gamma) = -\frac{1}{(1 + \gamma)^2} - E\left\{ \frac{d}{d\gamma}\, M(U, \gamma) \right\} \tag{90}
= E\left\{ M^2(U, \gamma) \right\} - \frac{1}{(1 + \gamma)^2} \tag{91}

by Proposition 7. In view of (89), for all γ such that f(γ) ≤ 0, we have

f'(\gamma) \ge E\left\{ M^2(U, \gamma) \right\} - \left( E\{ M(U, \gamma) \} \right)^2 \tag{92}
\ge 0 \tag{93}

by (91) and Jensen's inequality. The fact that (1 + γ)^{-1} in (85) can be replaced by σ^2/(1 + σ^2 γ) is due to the same reason as in the proof of Proposition 13.

E. The High-SNR Asymptotics

The asymptotics of mmse(X, γ) as γ → ∞ can be further characterized as follows. Above all, it is upper bounded by σ_X^2/(1 + σ_X^2 γ) due to Proposition 12. Moreover, the MMSE can vanish faster than exponentially in γ at an arbitrary rate, for instance under a sufficiently skewed binary input [11]. On the other hand, the decay of the MMSE of a non-Gaussian random variable need not be faster than the MMSE of a Gaussian variable. For example, let X = Z + \sqrt{\sigma_X^2 - 1}\, B where σ_X > 1, Z ∼ N(0, 1) and the Bernoulli variable B are independent. Clearly, X is harder to estimate than Z but no harder than σ_X Z, i.e.,

\frac{1}{1 + \gamma} < \mathrm{mmse}(X, \gamma) < \frac{\sigma_X^2}{1 + \sigma_X^2\, \gamma} \tag{94}

where the difference between the upper and lower bounds is O(γ^{-2}). As a consequence, the function f defined in (80) may not have any zero even if f(0) = 1 − σ_X^2 < 0 and lim_{γ→∞} f(γ) = 0.

VII. APPLICATIONS TO CHANNEL CAPACITY

A. Secrecy Capacity of the Gaussian Wiretap Channel

This section makes use of the MMSE as a middleman to show that the secrecy capacity of the Gaussian wiretap channel is achieved by Gaussian inputs. The wiretap channel was introduced by Wyner in [12] in the context of discrete memoryless channels. Let X denote the input, and let Y and Z denote the output of the main channel and the wiretapper's channel, respectively. The problem is to find the rate at which reliable communication is possible between X and Y, while keeping the mutual information between the message and the wiretapper's observation as small as possible. Assuming that the wiretapper sees a degraded output of the main channel, Wyner showed that such communication can achieve any rate up to the secrecy capacity

C_s = \sup_{P_X} \left[ I(X; Y) - I(X; Z) \right] \tag{95}

where the supremum is taken over all admissible choices of the input distribution. Wyner also derived the achievable rate-equivocation region.

We consider the following Gaussian wiretap channel studied in [13]:

Y = \sqrt{\mathrm{snr}_1}\, X + N_1 \tag{96}
Z = \sqrt{\mathrm{snr}_2}\, X + N_2 \tag{97}

where snr_1 ≥ snr_2 and N_1, N_2 ∼ N(0, 1) are independent. Let the average energy per channel use be constrained in every codeword of length N:

\frac{1}{N} \sum_{i=1}^{N} x_i^2 \le 1. \tag{98}

Reference [13] showed that the optimal input which achieves the supremum in (95) is standard Gaussian by means of Shannon's EPI. In particular, the secrecy capacity is

C_s = \frac{1}{2} \log\left( \frac{1 + \mathrm{snr}_1}{1 + \mathrm{snr}_2} \right). \tag{99}

A simple proof of the optimality of Gaussian input can be obtained by using the integral expression of the mutual information via the MMSE. For a certain fixed input distribution P_X, let us write

I(X; Y) - I(X; Z) = \frac{1}{2} \int_{\mathrm{snr}_2}^{\mathrm{snr}_1} \mathrm{mmse}(X, \gamma)\, d\gamma \tag{100}

using (26). Under the constraint E\{X^2\} ≤ 1, the maximum of (100) is achieved by standard Gaussian input because it maximizes the MMSE for every SNR under the power constraint. Plugging mmse(X, γ) = (1 + γ)^{-1} into (100) yields the secrecy capacity given in (99). In fact the whole rate-equivocation region can be obtained using the same techniques.
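The argument is easy to see numerically as well; the sketch below (not from the paper, with arbitrary snr_1 and snr_2) evaluates (100) for the Gaussian input and for an equiprobable binary input, showing that the Gaussian choice attains the larger secrecy rate, namely (99).

import numpy as np
from scipy.integrate import quad

# Secrecy rate (100) = 0.5 * integral_{snr2}^{snr1} mmse(X, gamma) d gamma
# for two unit-power inputs: standard Gaussian and equiprobable +/-1.

snr1, snr2 = 4.0, 1.0   # arbitrary example with snr1 >= snr2

def mmse_gaussian(gamma):
    return 1.0 / (1.0 + gamma)

def mmse_binary(gamma, n_nodes=80):
    t, w = np.polynomial.hermite.hermgauss(n_nodes)
    y = np.sqrt(2.0) * t
    return 1.0 - np.sum(w * np.tanh(gamma - np.sqrt(gamma) * y)) / np.sqrt(np.pi)

rate_gauss = 0.5 * quad(mmse_gaussian, snr2, snr1)[0]
rate_binary = 0.5 * quad(mmse_binary, snr2, snr1)[0]
closed_form = 0.5 * np.log((1.0 + snr1) / (1.0 + snr2))   # (99), in nats

print(rate_gauss, closed_form)   # equal
print(rate_binary)               # strictly smaller than the Gaussian rate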

B. The Gaussian Broadcast Channel

In light of the mutual information-MMSE relationship (26), comparing the mutual information due to different inputs is equivalent to comparing the integrals of the corresponding MMSEs. In this section, we use the single-crossing property to show that Gaussian input achieves the capacity region of scalar Gaussian broadcast channels.

Consider the following degraded Gaussian broadcast channel:

Y_1 = \sqrt{\mathrm{snr}_1}\, X + N_1 \tag{101}
Y_2 = \sqrt{\mathrm{snr}_2}\, X + N_2 \tag{102}

where snr_1 ≥ snr_2 and N_1, N_2 ∼ N(0, 1) are independent. Note that the formulation of the Gaussian broadcast channel is statistically identical to that of the Gaussian wiretap channel, except for a different goal: the rates between the sender and both receivers are to be maximized, rather than minimizing the rate between the sender and the (degraded) wiretapper. The capacity region of degraded broadcast channels under a unit input power constraint is given by [14]:

\bigcup_{P_{UX}:\ E\{X^2\} \le 1} \left\{ \begin{array}{l} R_1 \le I(X; Y_1 \mid U) \\ R_2 \le I(U; Y_2) \end{array} \right\} \tag{103}

where U is an auxiliary random variable with U–X–(Y_1, Y_2) being a Markov chain. It has long been recognized that the capacity-achieving input X for (101) and (102) is Gaussian. In fact one choice of P_{UX} is jointly Gaussian with standard Gaussian marginals and E\{UX\} = \sqrt{1 - \alpha}.

The resulting capacity region of the Gaussian broadcast channel is

\bigcup_{\alpha \in [0,1]} \left\{ \begin{array}{l} R_1 \le \dfrac{1}{2} \log\left(1 + \alpha\, \mathrm{snr}_1\right) \\[2mm] R_2 \le \dfrac{1}{2} \log\left( \dfrac{1 + \mathrm{snr}_2}{1 + \alpha\, \mathrm{snr}_2} \right) \end{array} \right\}. \tag{104}
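The boundary of (104) is easy to trace numerically; the short sketch below (not from the paper, with arbitrary snr_1 and snr_2) sweeps the power-splitting parameter α.

import numpy as np

# Trace the boundary of the Gaussian broadcast capacity region (104) in nats.
snr1, snr2 = 10.0, 2.0          # arbitrary example with snr1 >= snr2

for alpha in np.linspace(0.0, 1.0, 6):
    r1 = 0.5 * np.log(1.0 + alpha * snr1)
    r2 = 0.5 * np.log((1.0 + snr2) / (1.0 + alpha * snr2))
    print(f"alpha={alpha:.1f}  R1={r1:.3f}  R2={r2:.3f}")
# alpha = 1 gives the single-user rate to receiver 1; alpha = 0 gives it to receiver 2.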

The conventional proof of the optimality of Gaussian inputs relies on the EPI in conjunction with Fano's inequality [15]. The converse can also be proved directly from (103) using only the EPI [8], [16]. In the following we show a simple alternative proof using the single-crossing property of the MMSE.

Fig. 4. The thin curves show the MMSE (solid line) and mutual information (dashed line) of a Gaussian input. The thick curves show the MMSE (solid) and mutual information (dashed) of a binary input. The two mutual informations are identical at snr_2, which must be greater than snr_0, where the two MMSE curves cross.

Due to the power constraint on X, there must exist α ∈ [0, 1] such that

I(X; Y_2 \mid U) = \frac{1}{2} \log\left(1 + \alpha\, \mathrm{snr}_2\right) \tag{105}
= \frac{1}{2} \int_0^{\mathrm{snr}_2} \frac{\alpha}{\alpha \gamma + 1}\, d\gamma. \tag{106}


By the chain rule,

I(U ;Y2) = I(U,X;Y2)− I(X;Y2|U) (107)

= I(X;Y2)− I(X;Y2|U). (108)

By (103) and (105), the desired bound on R_2 is established:

R_2 \le \frac{1}{2} \log\left(1 + \mathrm{snr}_2\right) - \frac{1}{2} \log\left(1 + \alpha\, \mathrm{snr}_2\right) \tag{109}
= \frac{1}{2} \log\left( \frac{1 + \mathrm{snr}_2}{1 + \alpha\, \mathrm{snr}_2} \right). \tag{110}

It remains to establish the desired bound for R_1. The idea is illustrated in Figure 4, where the crossing of the MMSE curves implies some ordering of the corresponding mutual informations. Note that

I(X; Y_2 \mid U = u) = \frac{1}{2} \int_0^{\mathrm{snr}_2} \mathrm{mmse}(X_u, \gamma)\, d\gamma \tag{111}

and hence

I(X; Y_2 \mid U) = \frac{1}{2} \int_0^{\mathrm{snr}_2} E\left\{ \mathrm{mmse}(X_U, \gamma \mid U) \right\} d\gamma. \tag{112}

Comparing (112) with (106), there must exist 0 ≤ snr_0 ≤ snr_2 such that

E\left\{ \mathrm{mmse}(X_U, \mathrm{snr}_0 \mid U) \right\} = \frac{\alpha}{\alpha\, \mathrm{snr}_0 + 1}. \tag{113}

By Proposition 14, this implies that for all γ ≥ snr_2 ≥ snr_0,

E\left\{ \mathrm{mmse}(X_U, \gamma \mid U) \right\} \le \frac{\alpha}{\alpha \gamma + 1}. \tag{114}

Consequently,

R_1 \le I(X; Y_1 \mid U) \tag{115}
= \frac{1}{2} \int_0^{\mathrm{snr}_1} E\left\{ \mathrm{mmse}(X_U, \gamma \mid U) \right\} d\gamma \tag{116}
= \frac{1}{2} \left( \int_0^{\mathrm{snr}_2} + \int_{\mathrm{snr}_2}^{\mathrm{snr}_1} \right) E\left\{ \mathrm{mmse}(X_U, \gamma \mid U) \right\} d\gamma \tag{117}
\le \frac{1}{2} \log\left(1 + \alpha\, \mathrm{snr}_2\right) + \frac{1}{2} \int_{\mathrm{snr}_2}^{\mathrm{snr}_1} \frac{\alpha}{\alpha \gamma + 1}\, d\gamma \tag{118}
= \frac{1}{2} \log\left(1 + \alpha\, \mathrm{snr}_1\right) \tag{119}

where the inequality (118) is due to (105), (112) and (114).


REFERENCES

[1] D. Guo, S. Shamai, and S. Verdú, "Mutual information and minimum mean-square error in Gaussian channels," IEEE Trans. Inform. Theory, vol. 51, pp. 1261–1282, Apr. 2005.

[2] D. Guo and S. Verdú, "Randomly spread CDMA: Asymptotics via statistical physics," IEEE Trans. Inform. Theory, vol. 51, pp. 1982–2010, June 2005.

[3] S. Verdú and D. Guo, "A simple proof of the entropy power inequality," IEEE Trans. Inform. Theory, pp. 2165–2166, May 2006.

[4] D. Guo, S. Shamai, and S. Verdú, "Proof of entropy power inequalities via MMSE," in Proc. IEEE Int. Symp. Inform. Theory, pp. 1011–1015, Seattle, WA, USA, July 2006.

[5] A. M. Tulino and S. Verdú, "Monotonic decrease of the non-Gaussianness of the sum of independent random variables: A simple proof," IEEE Trans. Inform. Theory, vol. 52, pp. 4295–4297, Sept. 2006.

[6] D. P. Palomar and S. Verdú, "Gradient of mutual information in linear vector Gaussian channels," IEEE Trans. Inform. Theory, vol. 52, pp. 141–154, Jan. 2006.

[7] V. Prelov and S. Verdú, "Second-order asymptotics of mutual information," IEEE Trans. Inform. Theory, vol. 50, pp. 1567–1580, Aug. 2004.

[8] D. Tuninetti, S. Shamai, and G. Caire, "Scalar fading Gaussian broadcast channels with perfect receiver CSI: Is Gaussian input optimal?," in Proc. Workshop Inform. Theory Applications, San Diego, CA, USA, Jan. 2007.

[9] S. Artstein, K. M. Ball, F. Barthe, and A. Naor, "Solution of Shannon's problem on the monotonicity of entropy," J. Amer. Math. Soc., vol. 17, pp. 975–982, 2004.

[10] W. Feller, An Introduction to Probability Theory and Its Applications, vol. II. John Wiley & Sons, Inc., 2nd ed., 1971.

[11] D. Guo, Gaussian Channels: Information, Estimation and Multiuser Detection. PhD thesis, Department of Electrical Engineering, Princeton University, 2004.

[12] A. D. Wyner, "The wire-tap channel," The Bell System Technical Journal, vol. 54, pp. 1355–1387, Oct. 1975.

[13] S. K. Leung-Yan-Cheong and M. E. Hellman, "The Gaussian wire-tap channel," IEEE Trans. Inform. Theory, vol. IT-24, pp. 451–456, 1978.

[14] R. G. Gallager, "Capacity and coding for degraded broadcast channels," Problemy Peredachi Informatsii, vol. 10, pp. 3–14, July–Sept. 1974. Translated: Probl. Inform. Transmission, pp. 185–193, July–Sept. 1974.

[15] P. P. Bergmans, "A simple converse for broadcast channels with additive white Gaussian noise," IEEE Trans. Inform. Theory, vol. 20, pp. 279–280, Mar. 1974.

[16] A. El Gamal, "Course EE478: Multiple user information theory." http://www-isl.stanford.edu/people/abbas/, 2003.
