Source: shannon.cm.nctu.edu.tw/it/c2-8s04.pdf
-
Chapter 8
Hypothesis Testing
Po-Ning Chen
Department of Communications Engineering
National Chiao-Tung University
Hsin Chu, Taiwan 30050
-
Error exponent and divergence II:8-1
Definition 8.1 (exponent) A real number a is said to be the exponent for a
sequence of non-negative quantities {an}n≥1 converging to zero, if

    a = lim_{n→∞} −(1/n) log an.

• Operationally, the exponent is an index of the exponential rate of convergence
of the sequence an: for any γ > 0,

    e^{−n(a+γ)} ≤ an ≤ e^{−n(a−γ)}, for n large enough.
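As a quick numerical illustration of this definition (the particular sequence is my own example, not from the notes), a polynomial prefactor does not change the exponent, since (1/n) log n → 0:

```python
import math

# Hypothetical example: a_n = n^2 * e^{-0.5 n} converges to zero with
# exponent 0.5; the factor n^2 drops out of -(1/n) log a_n in the limit.
def empirical_exponent(n):
    # work in log domain to avoid underflow of e^{-0.5 n} for large n
    log_a_n = 2 * math.log(n) - 0.5 * n   # log of a_n = n^2 e^{-n/2}
    return -log_a_n / n

for n in (10, 100, 1000, 10000):
    print(n, empirical_exponent(n))   # approaches 0.5 as n grows
```

The printed values increase toward the true exponent 0.5, consistent with the sandwich bound above for any γ > 0.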
• Recall that in proving the channel coding theorem, the probability of decoding
error for channel block codes can be made arbitrarily close to zero when the
rate of the codes is less than the channel capacity.
• Mathematically, this result can be written as: Pe(C̃*n) → 0 as n → ∞,
provided R = lim sup_{n→∞} (1/n) log ‖C̃*n‖ < C, where C̃*n is the optimal code
for block length n.
• From the theorem, we only know that the decoding error vanishes as the block
length increases; it does not reveal how fast the decoding error approaches zero.
-
Error exponent and divergence II:8-2
• In other words, we do not know the rate of convergence of the decoding error.
Sometimes this information is very important, especially for deciding the
block length sufficient to achieve some error bound.
• The first step in investigating the rate of convergence of the decoding error is
to compute its exponent, provided the decoding error decays to zero exponentially fast
(it indeed does for memoryless channels). This exponent, as a function of the
rate, is called the channel reliability function, and will be discussed in
the next chapter.
• For hypothesis testing problems, the type II error probability at a fixed test
level also decays to zero as the number of observations increases. As it turns
out, its exponent is the divergence of the null hypothesis distribution against
the alternative hypothesis distribution.
-
Stein’s lemma II:8-3
Lemma 8.2 (Stein’s lemma) For a sequence of i.i.d. observations Xn which is
possibly drawn from either the null hypothesis distribution PXn or the alternative
hypothesis distribution PX̂n, the type II error satisfies

    (∀ ε ∈ (0, 1))  lim_{n→∞} −(1/n) log β*n(ε) = D(PX‖PX̂),

where β*n(ε) = min_{αn≤ε} βn, and αn and βn represent the type I and type II errors,
respectively.
Proof: [1. Forward Part]
In the forward part, we prove that there exists an acceptance region for the null
hypothesis such that

    lim inf_{n→∞} −(1/n) log βn(ε) ≥ D(PX‖PX̂).

step 1: divergence typical set. For any δ > 0, define the divergence typical set
as

    An(δ) = {xn : |(1/n) log[PXn(xn)/PX̂n(xn)] − D(PX‖PX̂)| < δ}.

Note that in the divergence typical set,

    PX̂n(xn) ≤ PXn(xn) e^{−n(D(PX‖PX̂)−δ)}.
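A small numerical check of the idea behind the typical set (my own sketch, with Bernoulli distributions as assumed example parameters): by the weak law of large numbers, (1/n) log[PXn(xn)/PX̂n(xn)] concentrates around D(PX‖PX̂) when xn is drawn i.i.d. from PX, which is why An(δ) captures almost all of the PXn mass:

```python
import math, random

random.seed(0)
p, q = 0.5, 0.9          # assumed example: P_X = Bernoulli(p), P_X^ = Bernoulli(q)
n = 200_000

def log_ratio(x):        # log P_X(x) / P_X^(x) for one symbol
    return math.log(p / q) if x == 1 else math.log((1 - p) / (1 - q))

xs = [1 if random.random() < p else 0 for _ in range(n)]
empirical = sum(log_ratio(x) for x in xs) / n
divergence = p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))
print(empirical, divergence)   # the two numbers nearly coincide
```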
-
Stein’s lemma II:8-4
step 2: computation of type I error. By the weak law of large numbers,
PXn(An(δ)) → 1. Hence,

    αn = PXn(Acn(δ)) < ε, for sufficiently large n.

step 3: computation of type II error.

    βn(ε) = PX̂n(An(δ))
          = Σ_{xn∈An(δ)} PX̂n(xn)
          ≤ Σ_{xn∈An(δ)} PXn(xn) e^{−n(D(PX‖PX̂)−δ)}
          = e^{−n(D(PX‖PX̂)−δ)} Σ_{xn∈An(δ)} PXn(xn)
          = e^{−n(D(PX‖PX̂)−δ)} (1 − αn).

Hence,

    −(1/n) log βn(ε) ≥ D(PX‖PX̂) − δ + (1/n) log(1 − αn),
-
Stein’s lemma II:8-5
which implies

    lim inf_{n→∞} −(1/n) log βn(ε) ≥ D(PX‖PX̂) − δ.

The above inequality is true for any δ > 0. Therefore

    lim inf_{n→∞} −(1/n) log βn(ε) ≥ D(PX‖PX̂).
[2. Converse Part]
In the converse part, we prove that for any acceptance region Bn for the null
hypothesis satisfying the type I error constraint, i.e.,

    αn(Bn) = PXn(Bcn) ≤ ε,

its type II error βn(Bn) satisfies

    lim sup_{n→∞} −(1/n) log βn(Bn) ≤ D(PX‖PX̂).
-
Stein’s lemma II:8-6
    βn(Bn) = PX̂n(Bn) ≥ PX̂n(Bn ∩ An(δ))
           = Σ_{xn∈Bn∩An(δ)} PX̂n(xn)
           ≥ Σ_{xn∈Bn∩An(δ)} PXn(xn) e^{−n(D(PX‖PX̂)+δ)}
           = e^{−n(D(PX‖PX̂)+δ)} PXn(Bn ∩ An(δ))
           ≥ e^{−n(D(PX‖PX̂)+δ)} (1 − PXn(Bcn) − PXn(Acn(δ)))
           = e^{−n(D(PX‖PX̂)+δ)} (1 − αn(Bn) − PXn(Acn(δ)))
           ≥ e^{−n(D(PX‖PX̂)+δ)} (1 − ε − PXn(Acn(δ))).

Hence,

    −(1/n) log βn(Bn) ≤ D(PX‖PX̂) + δ + (1/n) log(1 − ε − PXn(Acn(δ))),

which implies that

    lim sup_{n→∞} −(1/n) log βn(Bn) ≤ D(PX‖PX̂) + δ.

The above inequality is true for any δ > 0. Therefore,

    lim sup_{n→∞} −(1/n) log βn(Bn) ≤ D(PX‖PX̂).
-
Composition of sequence of i.i.d. observations II:8-7
• Stein’s lemma gives the exponent of the type II error probability at a fixed test
level.
• This exponent, which is the divergence of the null hypothesis distribution
against the alternative hypothesis distribution, is independent of the type I
error bound ε for i.i.d. observations.
• Specifically, in the i.i.d. setting, the probability of each sequence xn depends
only on its composition, which is defined as the |X|-dimensional vector

    (#1(xn)/n, #2(xn)/n, . . . , #k(xn)/n),

where X = {1, 2, . . . , k}, and #i(xn) is the number of occurrences of symbol
i in xn.
• The probability of xn can therefore be written as

    PXn(xn) = PX(1)^{#1(xn)} × PX(2)^{#2(xn)} × · · · × PX(k)^{#k(xn)}.
• Note that #1(xn) + · · · + #k(xn) = n.
• Since the composition of a sequence determines its probability deterministically, all
sequences with the same composition have the same statistical properties,
and hence should be treated alike when processing.
-
Composition of sequence of i.i.d. observations II:8-8
• Instead of manipulating the sequences of observations based on the typical-set-like
concept, we may focus on their compositions.
• As it turns out, this approach yields simpler proofs and better geometrical
explanations for theories in the i.i.d. setting.
• (It should be pointed out that when the composition alone cannot determine
the probability, this viewpoint does not seem to be effective.)
Lemma 8.3 (polynomial bound on number of composition) The num-
ber of compositions increases polynomially fast, while the number of possible se-
quences increases exponentially fast.
Proof:
• Let Pn denote the set of all possible compositions. Each composition is determined
by the counts #1(xn), . . . , #|X|(xn), each taking a value in {0, 1, . . . , n}; hence
• |Pn| ≤ (n + 1)^{|X|},
which is polynomial in n, whereas the number of sequences |X|^n is exponential in n.
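The contrast in Lemma 8.3 can be seen directly (the alphabet size and values of n below are my own choices): the exact number of compositions is the number of ways to split n among |X| symbols, C(n+|X|−1, |X|−1), which is polynomial in n, while the number of sequences is |X|^n:

```python
import math

k = 4                      # |X|, an assumed example alphabet size
for n in (10, 50, 100):
    num_compositions = math.comb(n + k - 1, k - 1)   # stars-and-bars count
    num_sequences = k ** n
    assert num_compositions <= (n + 1) ** k          # the bound used in the proof
    print(n, num_compositions, num_sequences)
```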
-
Composition of sequence of i.i.d. observations II:8-9
Lemma 8.4 (probability of sequences of the same composition) The
probability of the sequences of composition C with respect to distribution PXn
satisfies

    (n + 1)^{−|X|} e^{−nD(PC‖PX)} ≤ PXn(C) ≤ e^{−nD(PC‖PX)},

where PC is the composition distribution for composition C, and C (by abuse of
notation, without ambiguity) also denotes the set of all sequences (in Xn) of
composition C.
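The sandwich bound of Lemma 8.4 can be verified exhaustively on a small binary example (the values of n and PX are my own assumptions):

```python
import math

n, p = 20, 0.3            # assumed example: P_X = Bernoulli(p), |X| = 2

def kl(a, b):             # D(Bernoulli(a) || Bernoulli(b)) in nats
    t = 0.0
    if a > 0: t += a * math.log(a / b)
    if a < 1: t += (1 - a) * math.log((1 - a) / (1 - b))
    return t

for k in range(n + 1):
    # exact probability of the composition class with k ones
    prob_class = math.comb(n, k) * p**k * (1 - p)**(n - k)
    d = kl(k / n, p)
    upper = math.exp(-n * d)
    lower = upper / (n + 1) ** 2          # (n+1)^{-|X|} with |X| = 2
    # tiny slack guards against floating-point rounding at the endpoints
    assert lower <= prob_class <= upper * (1 + 1e-9)
print("Lemma 8.4 bounds verified for n =", n)
```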
Theorem 8.5 (Sanov’s Theorem) Let En be the set consisting of all compositions
over finite alphabet X whose composition distribution belongs to P. Fix
a sequence of product distributions PXn = ∏_{i=1}^n PX. Then,

    lim inf_{n→∞} −(1/n) log PXn(En) ≥ inf_{P∈P} D(P‖PX).

If, in addition, for every distribution P in P there exists a sequence of composition
distributions PC1, PC2, PC3, . . . ∈ P such that lim sup_{n→∞} D(PCn‖PX) = D(P‖PX), then

    lim sup_{n→∞} −(1/n) log PXn(En) ≤ inf_{P∈P} D(P‖PX).
-
Geometrical interpretation for Sanov’s theorem II:8-10
[Figure: the probability simplex, containing the set P of distributions (with
points such as P1 and P2) and the point PX outside it; the exponent
min_{P∈P} D(P‖PX) is attained at the point of P closest to PX in divergence.]
The geometric meaning for Sanov’s theorem.
-
Geometrical interpretation for Sanov’s theorem II:8-11
Example 8.6 • Question: One wants to roughly estimate the probability
that the average of the throws is greater than or equal to 4, when tossing a fair die
n times.
• Observe that whether the requirement is satisfied depends only on the
composition of the observations.
• Let En be the set of compositions which satisfy the requirement:

    En = {C : Σ_{i=1}^6 i·PC(i) ≥ 4}.
• To minimize D(PC‖PX) for C ∈ En, we can use the Lagrange multiplier technique
(since divergence is convex in its first argument), with the constraints on PC being:

    Σ_{i=1}^6 i·PC(i) = k  and  Σ_{i=1}^6 PC(i) = 1

for k = 4, 5, 6, . . . , n.
• So it becomes the minimization of:

    Σ_{i=1}^6 PC(i) log[PC(i)/PX(i)] + λ1 (Σ_{i=1}^6 i·PC(i) − k) + λ2 (Σ_{i=1}^6 PC(i) − 1).
-
Geometrical interpretation for Sanov’s theorem II:8-12
• By taking the derivatives, we find that the minimizer should be of the form

    PC(i) = e^{λ1·i} / Σ_{j=1}^6 e^{λ1·j},

where λ1 is chosen to satisfy

    Σ_{i=1}^6 i·PC(i) = k.    (8.1.1)

• Since the above is true for all k ≥ 4, it suffices to take the smallest one as our
solution, i.e., k = 4.
• Finally, by solving (8.1.1) for k = 4 numerically, the minimizer is

    PC* = (0.1031, 0.1227, 0.1461, 0.1740, 0.2072, 0.2468),

and the exponent of the desired probability is D(PC*‖PX) = 0.0433 nat.
• Consequently,

    PXn(En) ≈ e^{−0.0433·n}.
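The numerical solution of (8.1.1) for k = 4 can be reproduced with a few lines (a sketch; the use of bisection and its tolerance are my own choices, relying on the fact that the tilted mean is increasing in λ1):

```python
import math

def tilted(lmbda):
    # exponentially tilted distribution P_C(i) ~ e^{lambda * i}, i = 1..6
    w = [math.exp(lmbda * i) for i in range(1, 7)]
    z = sum(w)
    return [x / z for x in w]

def mean(dist):
    return sum(i * pi for i, pi in enumerate(dist, start=1))

lo, hi = 0.0, 2.0                  # mean(tilted) increases with lambda
for _ in range(100):               # bisection to solve mean = 4, i.e. (8.1.1)
    mid = (lo + hi) / 2
    if mean(tilted(mid)) < 4:
        lo = mid
    else:
        hi = mid

p_star = tilted(lo)
exponent = sum(pi * math.log(6 * pi) for pi in p_star)   # D(P_C* || uniform)
print([round(x, 4) for x in p_star])   # matches (0.1031, ..., 0.2468) above
print(exponent)                        # close to 0.0433 nat
```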
-
Divergence typical set on composition II:8-13
• Divergence typical set in Stein’s lemma:

    An(δ) = {xn : |(1/n) log[PXn(xn)/PX̂n(xn)] − D(PX‖PX̂)| < δ}.

• Its composition analogue:

    Tn(δ) = {xn ∈ Xn : D(PCxn‖PX) ≤ δ},

where Cxn represents the composition of xn.
• PXn(Tn(δ)) → 1 is justified by

    1 − PXn(Tn(δ)) = Σ_{C : D(PC‖PX)>δ} PXn(C)
                   ≤ Σ_{C : D(PC‖PX)>δ} e^{−nD(PC‖PX)}  (from Lemma 8.4)
                   ≤ Σ_{C : D(PC‖PX)>δ} e^{−nδ}
                   ≤ (n + 1)^{|X|} e^{−nδ}  (cf. Lemma 8.3).
-
Universal source coding on composition II:8-14
• Universal code

    fn : Xn → ∪_{i=1}^∞ {0, 1}^i

for i.i.d. sources:

    (1/n) Σ_{xn∈Xn} PXn(xn) ℓ(fn(xn)) → H(X),

as n goes to infinity.
Example 8.7 (universal encoding using compositions)
• Binary-index the compositions using log2(n + 1)^{|X|} = |X| log2(n + 1) bits, and denote this
binary index for composition C by a(C).
– Let Cxn denote the composition of xn, i.e., xn ∈ Cxn.
• Binary-index the elements in C using n · H(PC) bits, and denote this binary
index for the elements of Cxn by b(Cxn).
– For each composition C, the number of sequences xn in C is at
most 2^{n·H(PC)} (here H(PC) is measured in bits, i.e., the logarithmic base
in the entropy is 2; see the proof of Lemma 8.4).
-
Universal source coding on composition II:8-15
• Define a universal encoding function fn as
fn(xn) = concatenation{a(Cxn), b(Cxn)}.
• Then this encoding rule is a universal code for all i.i.d. sources.
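A minimal working sketch of this two-part code for a binary alphabet (the function names and the lexicographic ranking scheme are my own; the composition index here is simply the number of ones, and the in-class index is the sequence's rank among all sequences with that many ones):

```python
import math

def rank_in_class(bits):
    """Lexicographic rank of `bits` among sequences with the same number of ones."""
    n, k, r = len(bits), sum(bits), 0
    for i, b in enumerate(bits):
        if b == 1:
            # all sequences placing 0 here (with k ones still to come) are smaller
            r += math.comb(n - i - 1, k)
            k -= 1
    return r

def unrank(n, k, r):
    """Inverse of rank_in_class: rebuild the sequence from (n, #ones, rank)."""
    bits = []
    for i in range(n):
        c = math.comb(n - i - 1, k)   # completions that start with 0 here
        if k > 0 and r >= c:
            bits.append(1); r -= c; k -= 1
        else:
            bits.append(0)
    return bits

x = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
code = (sum(x), rank_in_class(x))      # (composition index a, in-class index b)
assert unrank(len(x), code[0], code[1]) == x   # lossless round trip
print(code)
```

The pair needs about log2(n+1) bits for the composition plus log2 C(n,k) ≤ n·H(PC) bits for the rank, matching the length analysis that follows.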
-
Universal source coding on composition II:8-16
Proof:
    ℓ̄n = Σ_{xn∈Xn} PXn(xn) ℓ(a(Cxn)) + Σ_{xn∈Xn} PXn(xn) ℓ(b(Cxn))
       ≤ Σ_{xn∈Xn} PXn(xn) · log2(n + 1)^{|X|} + Σ_{xn∈Xn} PXn(xn) · n · H(PCxn)
       = |X| · log2(n + 1) + Σ_{C} PXn(C) · n · H(PC).

Hence,

    (1/n) ℓ̄n ≤ |X| log2(n + 1)/n + Σ_{C} PXn(C) H(PC).
-
Universal source coding on composition II:8-17

    Σ_{C} PXn(C) H(PC)
      = Σ_{C∈Tn(δ)} PXn(C) H(PC) + Σ_{C∉Tn(δ)} PXn(C) H(PC)
      ≤ max_{C : D(PC‖PX)≤δ/log(2)} H(PC) + Σ_{C : D(PC‖PX)>δ/log(2)} PXn(C) H(PC)
      ≤ max_{C : D(PC‖PX)≤δ/log(2)} H(PC) + Σ_{C : D(PC‖PX)>δ/log(2)} 2^{−nD(PC‖PX)} H(PC)  (from Lemma 8.4)
      ≤ max_{C : D(PC‖PX)≤δ/log(2)} H(PC) + Σ_{C : D(PC‖PX)>δ/log(2)} e^{−nδ} H(PC)
      ≤ max_{C : D(PC‖PX)≤δ/log(2)} H(PC) + Σ_{C : D(PC‖PX)>δ/log(2)} e^{−nδ} log2 |X|
      ≤ max_{C : D(PC‖PX)≤δ/log(2)} H(PC) + (n + 1)^{|X|} e^{−nδ} log2 |X|,

where the second term of the last step vanishes as n → ∞. (Note that when the base-2
logarithm is taken in the divergence instead of the natural logarithm, the range [0, δ] in
-
Universal source coding on composition II:8-18
Tn(δ) should be replaced by [0, δ/ log(2)].) It remains to show that
    max_{C : D(PC‖PX)≤δ/log(2)} H(PC) ≤ H(X) + γ(δ),

where γ(δ) only depends on δ, and approaches zero as δ → 0. . . .
-
Likelihood ratio versus divergence II:8-19
• Recall that the Neyman-Pearson lemma indicates that the optimal test for two
hypotheses is of the form

    PXn(xn)/PX̂n(xn) ≷ τ.    (8.1.2)

• This is the likelihood ratio test, and the quantity PXn(xn)/PX̂n(xn) is called
the likelihood ratio.
• If the log operation is performed on both sides of (8.1.2), the test remains the same.
-
Likelihood ratio versus divergence II:8-20
    log[PXn(xn)/PX̂n(xn)]
      = Σ_{i=1}^n log[PX(xi)/PX̂(xi)]
      = Σ_{a∈X} #a(xn) log[PX(a)/PX̂(a)]
      = Σ_{a∈X} n·PCxn(a) log[PX(a)/PX̂(a)]
      = n · Σ_{a∈X} PCxn(a) log[(PX(a)/PCxn(a)) · (PCxn(a)/PX̂(a))]
      = n [Σ_{a∈X} PCxn(a) log(PCxn(a)/PX̂(a)) − Σ_{a∈X} PCxn(a) log(PCxn(a)/PX(a))]
      = n [D(PCxn‖PX̂) − D(PCxn‖PX)].

Hence, (8.1.2) is equivalent to

    D(PCxn‖PX̂) − D(PCxn‖PX) ≷ (1/n) log τ.    (8.1.3)

• This equivalence means that for hypothesis testing, the selection of the acceptance
region can be made upon compositions instead of observations.
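The identity just derived is easy to check numerically on a toy example (the alphabet and distributions below are my own assumptions):

```python
import math
from collections import Counter

P  = {'a': 0.5, 'b': 0.3, 'c': 0.2}    # assumed null hypothesis P_X
Q  = {'a': 0.2, 'b': 0.3, 'c': 0.5}    # assumed alternative P_X^
xn = list("aabacbccab")
n  = len(xn)

# left-hand side: log-likelihood ratio of the whole sequence
llr = sum(math.log(P[x] / Q[x]) for x in xn)

# right-hand side: n * [D(P_Cxn || P_X^) - D(P_Cxn || P_X)]
comp = {a: cnt / n for a, cnt in Counter(xn).items()}   # composition P_Cxn
def kl(R, S):
    return sum(r * math.log(r / S[a]) for a, r in R.items())

assert abs(llr - n * (kl(comp, Q) - kl(comp, P))) < 1e-9
print(llr)
```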
-
Likelihood ratio versus divergence II:8-21
• In other words, the optimal decision function can be defined as:

    φ(C) = 0, if composition C is classified as belonging to the null hypothesis
              according to (8.1.3);
           1, otherwise.
-
Exponent of Bayesian cost II:8-22
• Randomization is of no help to the Bayesian test: the randomized decision

    φ(xn) = 0, with probability η;  1, with probability 1 − η;

satisfies

    π0 η PXn(xn) + π1 (1 − η) PX̂n(xn) ≥ min{π0 PXn(xn), π1 PX̂n(xn)}.

• Now suppose the acceptance region for the null hypothesis is

    A = {C : D(PC‖PX̂) − D(PC‖PX) > τ′}.

• Then by Sanov’s theorem, the exponent of the type II error βn is

    min_{C∈A} D(PC‖PX̂).

• Similarly, the exponent of the type I error αn is

    min_{C∈Ac} D(PC‖PX).
-
Exponent of Bayesian cost II:8-23
• Lagrange multipliers: by taking the derivative of

    D(PX̃‖PX̂) + λ (D(PX̃‖PX̂) − D(PX̃‖PX) − τ′) + ν (Σ_{x∈X} PX̃(x) − 1)

with respect to each PX̃(x), we have

    log[PX̃(x)/PX̂(x)] + 1 + λ log[PX(x)/PX̂(x)] + ν = 0.

Solving these equations, we obtain that the optimal PX̃ is of the form

    PX̃(x) = Pλ(x) = PX^λ(x) PX̂^{1−λ}(x) / Σ_{a∈X} PX^λ(a) PX̂^{1−λ}(a).

• The geometrical explanation for Pλ is that it lies on the “straight line”
between PX and PX̂ (in the sense of the divergence measure) over the probability
space.
-
Exponent of Bayesian cost II:8-24
[Figure: the points PX and PX̂ with PX̃ on the “line segment” between them;
the divergences D(PX̃‖PX) and D(PX̃‖PX̂) are marked, together with the
decision boundary D(PC‖PX) = D(PC‖PX̂) − τ′.]
The divergence view on hypothesis testing.
-
Exponent of Bayesian cost II:8-25
• When λ → 0, Pλ → PX̂; when λ → 1, Pλ → PX.
• Usually, Pλ is named the tilted or twisted distribution.
• The value of λ depends on τ′ = (1/n) log τ.
• It is known from detection theory that the best τ for Bayes testing is π1/π0,
which is fixed.
• Therefore,

    τ′ = lim_{n→∞} (1/n) log(π1/π0) = 0,

which implies that the optimal exponent for the Bayes error is the minimum of
D(Pλ‖PX) subject to D(Pλ‖PX) = D(Pλ‖PX̂), namely the mid-point (λ = 1/2)
of the line segment (PX, PX̂) in probability space. This quantity is called
the Chernoff bound.
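The balancing condition D(Pλ‖PX) = D(Pλ‖PX̂) can be solved numerically for the tilted family (a sketch with my own example distributions; it uses the fact that D(Pλ‖PX) − D(Pλ‖PX̂) is decreasing in λ):

```python
import math

PX = [0.7, 0.2, 0.1]    # assumed null hypothesis
QX = [0.1, 0.3, 0.6]    # assumed alternative hypothesis

def p_lambda(l):
    # tilted distribution P_lambda ~ PX^l * QX^(1-l)
    w = [p**l * q**(1 - l) for p, q in zip(PX, QX)]
    z = sum(w)
    return [x / z for x in w]

def kl(r, s):
    return sum(a * math.log(a / b) for a, b in zip(r, s) if a > 0)

def gap(l):              # D(P_l||PX) - D(P_l||QX): positive at l=0, negative at l=1
    pl = p_lambda(l)
    return kl(pl, PX) - kl(pl, QX)

lo, hi = 0.0, 1.0
for _ in range(200):     # bisection for the mid-point of the divergence segment
    mid = (lo + hi) / 2
    if gap(mid) > 0:
        lo = mid
    else:
        hi = mid

pl = p_lambda(lo)
print(lo, kl(pl, PX))    # balancing lambda and the Bayes-error (Chernoff) exponent
```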
-
Large deviations theory II:8-26
• Large deviations theory is basically concerned with techniques for computing the
exponent of an exponentially decaying probability.
-
Tilted or twisted distribution II:8-27
• Suppose the probability PX(An) of a set decreases to zero exponentially
fast, and its exponent is equal to a > 0.
• Over the probability space, let P denote the set of those distributions PX̃ for
which PX̃(An) exhibits zero exponent.
• Then, applying the same concept as in Sanov’s theorem, we can expect that

    a = min_{PX̃∈P} D(PX̃‖PX).

• Now suppose the minimum of the above function occurs at f(PX̃) = τ for
some constant τ and some differentiable function f(·); the minimizer should
be of the form

    (∀ a ∈ X)  PX̃(a) = PX(a) e^{λ ∂f(PX̃)/∂PX̃(a)} / Σ_{a′∈X} PX(a′) e^{λ ∂f(PX̃)/∂PX̃(a′)}.

As a result, PX̃ is the distribution obtained from PX by exponential twisting via
the partial derivative of the function f.
• Note that PX̃ is usually written as P(λ)X, since it is generated by twisting PX
with twisting factor λ.
-
Conventional twisted distribution II:8-28
• The conventional definition of the twisted distribution is based on the divergence
function, i.e., f(PX̃) = D(PX̃‖PX̂) − D(PX̃‖PX).
• Since

    ∂D(PX̃‖PX)/∂PX̃(a) = log[PX̃(a)/PX(a)] + 1,

the twisted distribution becomes

    (∀ a ∈ X)  PX̃(a) = PX(a) e^{λ log[PX̂(a)/PX(a)]} / Σ_{a′∈X} PX(a′) e^{λ log[PX̂(a′)/PX(a′)]}
                      = PX^{1−λ}(a) PX̂^λ(a) / Σ_{a′∈X} PX^{1−λ}(a′) PX̂^λ(a′).
-
Cramér’s theorem II:8-29
• Question: Consider a sequence of i.i.d. random variables, Xn, and suppose
that we are interested in the probability of the set

    {(X1 + · · · + Xn)/n > τ}.

• Observe that (X1 + · · · + Xn)/n can be re-written as

    Σ_{a∈X} a · PC(a).

• Therefore, the function f becomes

    f(PX̃) = Σ_{a∈X} a·PX̃(a),

and its partial derivative with respect to PX̃(a) is a.
• The resultant twisted distribution is

    (∀ a ∈ X)  P(λ)X(a) = PX(a) e^{λa} / Σ_{a′∈X} PX(a′) e^{λa′}.

• So the exponent of PXn{(X1 + · · · + Xn)/n > τ} is

    min_{PX̃ : Σ_a a·PX̃(a) > τ} D(PX̃‖PX) = min_{P(λ)X : Σ_a a·P(λ)X(a) > τ} D(P(λ)X‖PX).
-
Cramér’s theorem II:8-30
• It should be pointed out that Σ_{a′∈X} PX(a′) e^{λa′} is the moment generating
function of PX.
• The conventional Cramér result does not use the divergence. Instead, it
introduces the large deviation rate function, defined by

    IX(x) = sup_{θ∈ℝ} [θx − log MX(θ)],    (8.2.4)

where MX(θ) is the moment generating function of X.
• In this formulation, the exponent of the above probability is respectively lower-
and upper-bounded by

    inf_{x≥τ} IX(x)  and  inf_{x>τ} IX(x).

An example of how to obtain the exponent bounds is illustrated in the next
subsection.
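The two viewpoints agree: for a Bernoulli source, the rate function (8.2.4) coincides with the divergence minimized over compositions, since IX(x) = D(Bernoulli(x)‖Bernoulli(p)). A quick numerical check (p and the grid search over θ are my own choices):

```python
import math

p = 0.3                                  # assumed example: X ~ Bernoulli(p)

def log_mgf(t):
    return math.log(1 - p + p * math.exp(t))

def rate_numeric(x, grid=40000, lo=-20.0, hi=20.0):
    # crude grid evaluation of sup_theta [theta*x - log M_X(theta)]
    return max(t * x - log_mgf(t)
               for t in (lo + (hi - lo) * i / grid for i in range(grid + 1)))

def rate_closed(x):
    # D(Bernoulli(x) || Bernoulli(p)) in nats
    return x * math.log(x / p) + (1 - x) * math.log((1 - x) / (1 - p))

for x in (0.4, 0.5, 0.7):
    assert abs(rate_numeric(x) - rate_closed(x)) < 1e-3
print("I_X(x) matches D(Ber(x)||Ber(p)) on the test points")
```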
-
Exponent and moment generating function II:8-31
A) Preliminaries: Observe that since E[X] = µ < λ and E[|X − µ|²] < ∞,

    Pr{(X1 + · · · + Xn)/n ≥ λ} → 0 as n → ∞.

Hence, we can compute its rate of convergence (to zero).
B) Upper bound on the probability:

    Pr{(X1 + · · · + Xn)/n ≥ λ}
      = Pr{θ(X1 + · · · + Xn) ≥ θnλ}, for any θ > 0
      = Pr{exp(θ(X1 + · · · + Xn)) ≥ exp(θnλ)}
      ≤ E[exp(θ(X1 + · · · + Xn))] / exp(θnλ)
      = E^n[exp(θX)] / exp(θnλ)
      = (MX(θ) / exp(θλ))^n.

Hence,

    lim inf_{n→∞} −(1/n) log Pr{(X1 + · · · + Xn)/n > λ} ≥ θλ − log MX(θ).
-
Exponent and moment generating function II:8-32
Since the above inequality holds for every θ > 0, we have

    lim inf_{n→∞} −(1/n) log Pr{(X1 + · · · + Xn)/n > λ} ≥ max_{θ>0} [θλ − log MX(θ)]
                                                        = θ*λ − log MX(θ*),

where θ* > 0 is the optimizer of the maximum operation. (The positivity of
θ* can be easily verified from the concavity of the function θλ − log MX(θ) in
θ, whose derivative at θ = 0 equals (λ − µ), which is strictly greater than 0.)
Consequently,

    lim inf_{n→∞} −(1/n) log Pr{(X1 + · · · + Xn)/n > λ} ≥ θ*λ − log MX(θ*)
                                                        = sup_{θ∈ℝ} [θλ − log MX(θ)] = IX(λ).

C) Lower bound on the probability: omitted.
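The upper bound of part B) can be checked against an exactly computable tail (a sketch; the Bernoulli parameters and the crude sweep over θ are my own choices):

```python
import math

n, p, lam = 50, 0.3, 0.5      # assumed example: X_i ~ Bernoulli(p), threshold lam

# exact tail Pr{(X1+...+Xn)/n >= lam} via the binomial distribution
exact_tail = sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
                 for k in range(math.ceil(n * lam), n + 1))

def chernoff(theta):
    # the bound (M_X(theta)/e^{theta*lam})^n, valid for every theta > 0
    mgf = 1 - p + p * math.exp(theta)
    return (mgf / math.exp(theta * lam)) ** n

best = min(chernoff(0.01 * i) for i in range(1, 500))   # sweep theta in (0, 5)
assert exact_tail <= best      # the Chernoff bound indeed dominates the tail
print(exact_tail, best)
```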
-
Theories on Large deviations II:8-33
In this section, we derive inequalities on the exponent of the probability
Pr{Zn/n ∈ [a, b]}, which form a slight extension of the Gärtner-Ellis theorem.
-
Extension of Gärtner-Ellis upper bounds II:8-34
Definition 8.8 In this subsection, {Zn}∞n=1 denotes an infinite sequence of
arbitrary random variables.
Definition 8.9 Define

    ϕn(θ) = (1/n) log E[exp{θZn}]  and  ϕ̄(θ) = lim sup_{n→∞} ϕn(θ).

The sup-large deviation rate function of an arbitrary random sequence {Zn}∞n=1
is defined as

    Ī(x) = sup_{θ∈ℝ : ϕ̄(θ)>−∞} [θx − ϕ̄(θ)].    (8.3.5)

The range of the supremum operation in (8.3.5) is always non-empty since ϕ̄(0) =
0, i.e., {θ ∈ ℝ : ϕ̄(θ) > −∞} ≠ ∅. Hence, Ī(x) is always defined. With the
above definition, the first extension theorem of Gärtner-Ellis can be stated as
follows.
Theorem 8.10 For a, b ∈ ℝ and a ≤ b,

    lim sup_{n→∞} (1/n) log Pr{Zn/n ∈ [a, b]} ≤ − inf_{x∈[a,b]} Ī(x).
• The bound obtained in the above theorem is not in general tight.
-
Extension of Gärtner-Ellis upper bounds II:8-35
Example 8.11 Suppose that Pr{Zn = 0} = 1 − e^{−2n} and Pr{Zn = −2n} =
e^{−2n}. Then from Definition 8.9, we have

    ϕn(θ) = (1/n) log E[e^{θZn}] = (1/n) log[1 − e^{−2n} + e^{−(θ+1)·2n}],

and

    ϕ̄(θ) = lim sup_{n→∞} ϕn(θ) = 0, for θ ≥ −1;  −2(θ + 1), for θ < −1.

Hence, {θ ∈ ℝ : ϕ̄(θ) > −∞} = ℝ and

    Ī(x) = sup_{θ∈ℝ} [θx − ϕ̄(θ)]
         = sup_{θ∈ℝ} [θx + 2(θ + 1)·1{θ < −1}]
         = −x, for −2 ≤ x ≤ 0;  ∞, otherwise,

where 1{·} represents the indicator function of a set.
-
Extension of Gärtner-Ellis upper bounds II:8-36
Consequently, by Theorem 8.10,

    lim sup_{n→∞} (1/n) log Pr{Zn/n ∈ [a, b]} ≤ − inf_{x∈[a,b]} Ī(x)
      = 0, for 0 ∈ [a, b];  b, for b ∈ [−2, 0];  −∞, otherwise.

The exponent of Pr{Zn/n ∈ [a, b]} in the above example is in fact given by

    lim_{n→∞} (1/n) log PZn{Zn/n ∈ [a, b]} = − inf_{x∈[a,b]} I*(x),

where

    I*(x) = 2, for x = −2;  0, for x = 0;  ∞, otherwise.    (8.3.6)

Thus, the upper bound obtained in Theorem 8.10 is not tight.
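The computation of ϕ̄(θ) in Example 8.11 can be checked numerically (a sketch; the values of n and θ are my own choices, and the log-sum-exp form is used to avoid overflow):

```python
import math

def phi_n(theta, n):
    # (1/n) log[1 - e^{-2n} + e^{-(theta+1)*2n}], computed stably
    a = math.log1p(-math.exp(-2 * n))     # log(1 - e^{-2n}), essentially 0
    b = -(theta + 1) * 2 * n              # log of the second term
    m = max(a, b)
    return (m + math.log(math.exp(a - m) + math.exp(b - m))) / n

def phi_bar(theta):
    # the claimed limit from Example 8.11
    return 0.0 if theta >= -1 else -2 * (theta + 1)

for theta in (-3.0, -1.5, -0.5, 1.0):
    assert abs(phi_n(theta, 2000) - phi_bar(theta)) < 1e-3
print("phi_n converges to the claimed phi_bar")
```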
-
Extension of Gärtner-Ellis upper bounds II:8-37
Definition 8.12 Define

    ϕn(θ; h) = (1/n) log E[exp{n·θ·h(Zn/n)}]  and  ϕ̄h(θ) = lim sup_{n→∞} ϕn(θ; h),

where h(·) is a given real-valued continuous function. The twisted sup-large
deviation rate function of an arbitrary random sequence {Zn}∞n=1 with respect to a
real-valued continuous function h(·) is defined as

    J̄h(x) = sup_{θ∈ℝ : ϕ̄h(θ)>−∞} [θ·h(x) − ϕ̄h(θ)].    (8.3.7)

Theorem 8.13 Suppose that h(·) is a real-valued continuous function. Then for
a, b ∈ ℝ and a ≤ b,

    lim sup_{n→∞} (1/n) log Pr{Zn/n ∈ [a, b]} ≤ − inf_{x∈[a,b]} J̄h(x).
-
Extension of Gärtner-Ellis upper bounds II:8-38
Example 8.14 Let us, again, investigate the {Zn}∞n=1 defined in Example 8.11.
Take

    h(x) = (1/2)(x + 2)² − 1.

Then from Definition 8.12, we have

    ϕn(θ; h) = (1/n) log E[exp{nθh(Zn/n)}]
             = (1/n) log[exp{nθ} − exp{n(θ − 2)} + exp{−n(θ + 2)}],

and

    ϕ̄h(θ) = lim sup_{n→∞} ϕn(θ; h) = −(θ + 2), for θ ≤ −1;  θ, for θ > −1.

Hence, {θ ∈ ℝ : ϕ̄h(θ) > −∞} = ℝ and

    J̄h(x) = sup_{θ∈ℝ} [θh(x) − ϕ̄h(θ)] = −(1/2)(x + 2)² + 2, for x ∈ [−4, 0];  ∞, otherwise.
-
Extension of Gärtner-Ellis upper bounds II:8-39
Consequently, by Theorem 8.13,

    lim sup_{n→∞} (1/n) log Pr{Zn/n ∈ [a, b]} ≤ − inf_{x∈[a,b]} J̄h(x)
      = max{(a + 2)²/2, (b + 2)²/2} − 2, for −4 ≤ a < b ≤ 0;
        0, for a > 0 or b < −4;
        −∞, otherwise.    (8.3.8)

For b ∈ (−2, 0) and a ∈ [−2 − √(2b + 4), b), the upper bound attained in the
previous example is strictly less than that given in Example 8.11, and hence an
improvement is obtained. However, for b ∈ (−2, 0) and a < −2 − √(2b + 4), the
upper bound in (8.3.8) is actually looser. Accordingly, we combine the two upper
bounds from Examples 8.11 and 8.14 to get

    lim sup_{n→∞} (1/n) log Pr{Zn/n ∈ [a, b]} ≤ − max{inf_{x∈[a,b]} J̄h(x), inf_{x∈[a,b]} Ī(x)}
      = 0, for 0 ∈ [a, b];
        (1/2)(b + 2)² − 2, for b ∈ [−2, 0];
        −∞, otherwise.
-
Extension of Gärtner-Ellis upper bounds II:8-40
Theorem 8.15 For a, b ∈ ℝ and a ≤ b,

    lim sup_{n→∞} (1/n) log Pr{Zn/n ∈ [a, b]} ≤ − inf_{x∈[a,b]} J̄(x),

where J̄(x) = sup_{h∈H} J̄h(x) and H is the set of all real-valued continuous functions.
Example 8.16 Let us again study the {Zn}∞n=1 in Example 8.11 (also in Example
8.14). Suppose c > 1. Take hc(x) = c1(x + c2)² − c, where

    c1 = (c + √(c² − 1))/2  and  c2 = 2√(c + 1)/(√(c + 1) + √(c − 1)).

Then from Definition 8.12, we have

    ϕn(θ; hc) = (1/n) log E[exp{nθhc(Zn/n)}]
              = (1/n) log[exp{nθ} − exp{n(θ − 2)} + exp{−n(θ + 2)}],

and

    ϕ̄hc(θ) = lim sup_{n→∞} ϕn(θ; hc) = −(θ + 2), for θ ≤ −1;  θ, for θ > −1.
-
Extension of Gärtner-Ellis upper bounds II:8-41
Hence, {θ ∈ ℝ : ϕ̄hc(θ) > −∞} = ℝ and

    J̄hc(x) = sup_{θ∈ℝ} [θhc(x) − ϕ̄hc(θ)]
            = −c1(x + c2)² + c + 1, for x ∈ [−2c2, 0];  ∞, otherwise.

From Theorem 8.15,

    J̄(x) = sup_{h∈H} J̄h(x) ≥ max{lim inf_{c→∞} J̄hc(x), Ī(x)} = I*(x),

where I*(x) is defined in (8.3.6). Consequently,

    lim sup_{n→∞} (1/n) log Pr{Zn/n ∈ [a, b]} ≤ − inf_{x∈[a,b]} J̄(x) ≤ − inf_{x∈[a,b]} I*(x)
      = 0, if 0 ∈ [a, b];
        −2, if −2 ∈ [a, b] and 0 ∉ [a, b];
        −∞, otherwise,

and a tight upper bound is finally obtained!
-
Extension of Gärtner-Ellis upper bounds II:8-42
Definition 8.17 Define ϕ_h(θ) = lim inf_{n→∞} ϕn(θ; h), where ϕn(θ; h) was defined
in Definition 8.12. The twisted inf-large deviation rate function of an arbitrary
random sequence {Zn}∞n=1 with respect to a real-valued continuous function h(·) is
defined as

    J_h(x) = sup_{θ∈ℝ : ϕ_h(θ)>−∞} [θ·h(x) − ϕ_h(θ)].

Theorem 8.18 For a, b ∈ ℝ and a ≤ b,

    lim inf_{n→∞} (1/n) log Pr{Zn/n ∈ [a, b]} ≤ − inf_{x∈[a,b]} J(x),

where J(x) = sup_{h∈H} J_h(x) and H is the set of all real-valued continuous functions.
-
Extension of Gärtner-Ellis lower bounds II:8-43
• We wish to know when

    lim sup_{n→∞} (1/n) log Pr{Zn/n ∈ (a, b)} ≥ − inf_{x∈(a,b)} J̄h(x).    (8.3.9)

Definition 8.19 Define the sup-Gärtner-Ellis set with respect to a real-valued
continuous function h(·) as

    D̄h = ∪_{θ∈ℝ : ϕ̄h(θ)>−∞} D̄(θ; h),

where

    D̄(θ; h) = {x ∈ ℝ : lim sup_{t↓0} [ϕ̄h(θ + t) − ϕ̄h(θ)]/t ≤ h(x) ≤ lim inf_{t↓0} [ϕ̄h(θ) − ϕ̄h(θ − t)]/t}.

Let us briefly remark on the sup-Gärtner-Ellis set defined above.
• It can be shown that the sup-Gärtner-Ellis set reduces to

    D̄h = ∪_{θ∈ℝ : ϕ̄h(θ)>−∞} {x ∈ ℝ : ϕ̄′h(θ) = h(x)},

if the derivative ϕ̄′h(θ) exists for all θ.
-
Extension of Gärtner-Ellis lower bounds II:8-44
Theorem 8.20 Suppose that h(·) is a real-valued continuous function. Then if
(a, b) ⊂ D̄h,

    lim sup_{n→∞} (1/n) log Pr{Zn/n ∈ (a, b)} ≥ − inf_{x∈(a,b)} J̄h(x).

Example 8.21 Suppose Zn = X1 + · · · + Xn, where {Xi}ni=1 are i.i.d. Gaussian
random variables with mean 1 and variance 1 if n is even, and with mean −1 and
variance 1 if n is odd. Then the exact large deviation rate formula Ī*(x) that
satisfies, for all a < b,

    − inf_{x∈[a,b]} Ī*(x) ≥ lim sup_{n→∞} (1/n) log Pr{Zn/n ∈ [a, b]}
                          ≥ lim sup_{n→∞} (1/n) log Pr{Zn/n ∈ (a, b)}
                          ≥ − inf_{x∈(a,b)} Ī*(x)

is

    Ī*(x) = (|x| − 1)²/2.    (8.3.10)

Case A: h(x) = x.
For the affine h(·), ϕn(θ) = θ + θ²/2 when n is even, and ϕn(θ) = −θ + θ²/2
-
Extension of Gärtner-Ellis lower bounds II:8-45
when n is odd. Hence, ϕ̄(θ) = |θ| + θ²/2, and

    D̄h = (∪_{θ>0} {v ∈ ℝ : v = 1 + θ}) ∪ (∪_{θ<0} {v ∈ ℝ : v = −1 + θ}) = (−∞, −1) ∪ (1, ∞).

Since

    Ī(x) = (|x| − 1)²/2, for |x| > 1;  0, for |x| ≤ 1,

we obtain for any a ∈ (−∞, −1) ∪ (1, ∞),

    lim_{ε↓0} lim sup_{n→∞} (1/n) log Pr{Zn/n ∈ (a − ε, a + ε)}
      ≥ − lim_{ε↓0} inf_{x∈(a−ε,a+ε)} Ī(x) = −(|a| − 1)²/2,
-
Extension of Gärtner-Ellis lower bounds II:8-46
which can be shown to be tight by Theorem 8.13 (or directly by (8.3.10)). Note that
the above inequality does not hold for any a ∈ (−1, 1). To fill the gap, a
different h(·) must be employed.
Case B: h(x) = |x − a|.
For n even,

    E[e^{nθh(Zn/n)}] = E[e^{nθ|Zn/n−a|}]
      = ∫_{−∞}^{na} e^{−θx+nθa} (1/√(2πn)) e^{−(x−n)²/(2n)} dx
        + ∫_{na}^{∞} e^{θx−nθa} (1/√(2πn)) e^{−(x−n)²/(2n)} dx
      = e^{nθ(θ−2+2a)/2} ∫_{−∞}^{na} (1/√(2πn)) e^{−[x−n(1−θ)]²/(2n)} dx
        + e^{nθ(θ+2−2a)/2} ∫_{na}^{∞} (1/√(2πn)) e^{−[x−n(1+θ)]²/(2n)} dx
      = e^{nθ(θ−2+2a)/2} · Φ((θ + a − 1)√n) + e^{nθ(θ+2−2a)/2} · Φ((θ − a + 1)√n),

where Φ(·) represents the unit Gaussian cdf.
-
Extension of Gärtner-Ellis lower bounds II:8-47
Similarly, for n odd,

    E[e^{nθh(Zn/n)}] = e^{nθ(θ+2+2a)/2} · Φ((θ + a + 1)√n) + e^{nθ(θ−2−2a)/2} · Φ((θ − a − 1)√n).

Observe that for any b ∈ ℝ,

    lim_{n→∞} (1/n) log Φ(b√n) = 0, for b ≥ 0;  −b²/2, for b < 0.

Hence,

    ϕ̄h(θ) = −(|a| − 1)²/2, for θ < |a| − 1;
             θ[θ + 2(1 − |a|)]/2, for |a| − 1 ≤ θ < 0;
             θ[θ + 2(1 + |a|)]/2, for θ ≥ 0.
-
Extension of Gärtner-Ellis lower bounds II:8-48
Therefore,

    D̄h = (∪_{θ>0} {x ∈ ℝ : |x − a| = θ + 1 + |a|}) ∪ (∪_{|a|−1≤θ<0} {x ∈ ℝ : |x − a| = θ + 1 − |a|}),

and

    J̄h(x) = (|x − a| − 1 − |a|)²/2, for x > a + 1 + |a| or x < a − 1 − |a|;
             (|x − a| − 1 + |a|)²/2, otherwise.    (8.3.11)
-
Extension of Gärtner-Ellis lower bounds II:8-49
We then apply Theorem 8.20 to obtain

    lim_{ε↓0} lim sup_{n→∞} (1/n) log Pr{Zn/n ∈ (a − ε, a + ε)}
      ≥ − lim_{ε↓0} inf_{x∈(a−ε,a+ε)} J̄h(x)
      = − lim_{ε↓0} (ε − 1 + |a|)²/2
      = −(|a| − 1)²/2.

Note that the above lower bound is valid for any a ∈ (−1, 1), and can be shown
to be tight, again, by Theorem 8.13 (or directly by (8.3.10)).
Finally, by combining the results of Cases A) and B), the true large deviation
rate of {Zn}n≥1 is completely characterized.
-
Extension of Gärtner-Ellis lower bounds II:8-50
Definition 8.22 Define the inf-Gärtner-Ellis set with respect to a real-valued
continuous function h(·) as

    D_h = ∪_{θ∈ℝ : ϕ_h(θ)>−∞} D(θ; h),

where

    D(θ; h) = {x ∈ ℝ : lim sup_{t↓0} [ϕ_h(θ + t) − ϕ_h(θ)]/t ≤ h(x) ≤ lim inf_{t↓0} [ϕ_h(θ) − ϕ_h(θ − t)]/t}.

Theorem 8.23 Suppose that h(·) is a real-valued continuous function. Then if
(a, b) ⊂ D_h,

    lim inf_{n→∞} (1/n) log Pr{Zn/n ∈ (a, b)} ≥ − inf_{x∈(a,b)} J_h(x).
-
Properties II:8-51
Property 8.24 Let Ī(x) and I(x) be the sup- and inf- large deviation rate func-
tions of an infinite sequence of arbitrary random variables {Zn}∞n=1, respectively.
Denote mn = (1/n)E[Zn]. Let m̄ = lim sup_{n→∞} mn and m = lim inf_{n→∞} mn. Then
1. Ī(x) and I(x) are both convex.
2. Ī(x) is continuous over {x ∈ ℝ : Ī(x) < ∞}. Likewise, I(x) is continuous
over {x ∈ ℝ : I(x) < ∞}.
3. Ī(x) attains its minimum value 0 for m ≤ x ≤ m̄.
4. I(x) ≥ 0, but I(x) does not necessarily attain its minimum value at either x = m̄
or x = m.
-
Properties II:8-52
Property 8.25 Suppose that h(·) is a real-valued continuous function. Let J̄h(x)
and J_h(x) be the corresponding twisted sup- and inf- large deviation rate functions,
respectively. Denote mn(h) = E[h(Zn/n)]. Let

    m̄h = lim sup_{n→∞} mn(h)  and  m_h = lim inf_{n→∞} mn(h).

Then
1. J̄h(x) ≥ 0, with equality if m_h ≤ h(x) ≤ m̄h.
2. J_h(x) ≥ 0, but J_h(x) does not necessarily attain its minimum value at either
x = m̄h or x = m_h.
-
Probabilistic subexponential behavior II:8-53
• Subexponential behavior: an = (1/n)e^{−2n} and bn = (1/√n)e^{−2n} have the same
exponent, but contain different subexponential terms.
-
Berry-Esseen theorem for compound i.i.d. sequence II:8-54
• The Berry-Esseen theorem states that the distribution of the sum of independent
zero-mean random variables {Xi}ni=1, normalized by the standard deviation
of the sum, differs from the Gaussian distribution by at most C·rn/s³n, where
s²n and rn are respectively the sums of the marginal variances and the marginal
absolute third moments, and C is an absolute constant.
• Specifically, for every a ∈ ℝ,

    |Pr{(1/sn)(X1 + · · · + Xn) ≤ a} − Φ(a)| ≤ C·rn/s³n,    (8.4.12)

where Φ(·) represents the unit Gaussian cdf.
• The striking feature of this theorem is that the upper bound depends only on
the variance and the absolute third moment, and hence can provide a good
asymptotic estimate based on only the first three moments.
• The absolute constant C is commonly taken to be 6. When {Xi}ni=1 are identically
distributed, in addition to independent, the absolute constant can be reduced to
3, and has been reported to be improved down to 2.05.
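The inequality (8.4.12) can be illustrated on an exactly computable case (a sketch; the Bernoulli parameter, the sample size, and the conservative choice C = 6 are my own assumptions, and the cdf discrepancy is evaluated only on the lattice points of the binomial sum):

```python
import math

q, n = 0.3, 200
mu, var = q, q * (1 - q)
sigma = math.sqrt(var)
rho = q * (1 - q) * ((1 - q)**2 + q**2)   # E|X - mu|^3 for Bernoulli(q)
bound = 6 * rho / (sigma**3 * math.sqrt(n))   # C * r_n / s_n^3 with C = 6

def Phi(a):
    return 0.5 * (1 + math.erf(a / math.sqrt(2)))

# exact cdf of the standardized sum (S_n - n*mu)/(sigma*sqrt(n))
pmf = [math.comb(n, k) * q**k * (1 - q)**(n - k) for k in range(n + 1)]
worst, cum = 0.0, 0.0
for k in range(n + 1):
    cum += pmf[k]
    a = (k - n * mu) / (sigma * math.sqrt(n))
    worst = max(worst, abs(cum - Phi(a)))
assert worst <= bound
print(worst, bound)
```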
Definition: compound i.i.d. sequence. The samples that concern us in this
section actually consist of two i.i.d. subsequences (the sequence is therefore named a
compound i.i.d. sequence).
-
Berry-Esseen theorem for compound i.i.d. sequence II:8-55
Lemma 8.26 (smoothing lemma) Fix the bandlimited filtering function

    vT(x) = [1 − cos(Tx)]/(πTx²) = 2 sin²(Tx/2)/(πTx²) = (T/2π) sinc²(Tx/(2π))
          = Four⁻¹[Λ(f/(T/(2π)))].

For any cumulative distribution function H(·) on the real line ℝ,

    sup_{x∈ℝ} |ΔT(x)| ≥ (1/2)η − [6/(Tπ√(2π))] h(T√(2π)/(2η)),

where

    ΔT(t) = ∫_{−∞}^{∞} [H(t − x) − Φ(t − x)] vT(x) dx,  η = sup_{x∈ℝ} |H(x) − Φ(x)|,

and

    h(u) = u ∫_u^∞ [1 − cos(x)]/x² dx
         = (π/2)u + 1 − cos(u) − u ∫_0^u [sin(x)/x] dx, if u ≥ 0;  0, otherwise.
-
Berry-Esseen theorem for compound i.i.d. sequence II:8-56
Lemma 8.27 For any cumulative distribution function H(·) with characteristic
function ϕH(ζ),

    η ≤ (1/π) ∫_{−T}^{T} |ϕH(ζ) − e^{−(1/2)ζ²}| dζ/|ζ| + [12/(Tπ√(2π))] h(T√(2π)/(2η)),

where η and h(·) are defined in Lemma 8.26.
where η and h(·) are defined in Lemma 8.26.
Theorem 8.28 (BE theorem for compound i.i.d. sequences) Let Yn =
Σ_{i=1}^n Xi be the sum of independent random variables, among which {Xi}di=1 are
identically Gaussian distributed, and {Xi}ni=d+1 are identically distributed but not
necessarily Gaussian.
• Denote the mean-variance pairs of X1 and Xd+1 by (µ, σ²) and (µ̂, σ̂²), respec-
tively.
• Define

    ρ = E[|X1 − µ|³],  ρ̂ = E[|Xd+1 − µ̂|³]  and  s²n = Var[Yn] = σ²d + σ̂²(n − d).

• Also denote the cdf of (Yn − E[Yn])/sn by Hn(·).
-
Berry-Esseen theorem for compound i.i.d. sequence II:8-57
Then for all y ∈ ℝ,

    |Hn(y) − Φ(y)| ≤ Cn,d · (2/√π) · [(n − d − 1)/(2(n − d) − 3√2)] · ρ̂/(σ̂²sn),

where Cn,d is the unique positive number satisfying

    (π/6)Cn,d − h(Cn,d) = [√π(2(n − d) − 3√2)]/[12(n − d − 1)]
                          · [√6/π² · (3 − √2)^{3/2} + (9/2)(11 − 6√2)/√(n − d)],

provided that n − d ≥ 3.
-
Berry-Esseen theorem for compound i.i.d. sequence II:8-58
[Figure: plot of the function (π/6)u − h(u) over 0 ≤ u ≤ 5.]
Function of (π/6)u − h(u).
-
Berry-Esseen theorem for compound i.i.d. sequence II:8-59
By letting d = 0, the Berry-Esseen inequality for i.i.d. sequences can also be
readily obtained from the previous theorem.
Corollary 8.29 (Berry-Esseen theorem for i.i.d. sequence) Let

    Yn = Σ_{i=1}^n Xi

be the sum of independent random variables with common marginal distribution.
Denote the marginal mean and variance by (µ̂, σ̂²). Define ρ̂ = E[|X1 − µ̂|³]. Also
denote the cdf of (Yn − nµ̂)/(√n σ̂) by Hn(·). Then for all y ∈ ℝ,

    |Hn(y) − Φ(y)| ≤ Cn · (2/√π) · [(n − 1)/(2n − 3√2)] · ρ̂/(σ̂³√n),

where Cn is the unique positive solution of

    (π/6)u − h(u) = [√π(2n − 3√2)]/[12(n − 1)] · [√6/π² · (3 − √2)^{3/2} + (9/2)(11 − 6√2)/√n],

provided that n ≥ 3.
-
Berry-Esseen theorem for compound i.i.d. sequence II:8-60
• Let us briefly remark on the previous corollary. We observe numerically
that the quantity

    Cn · (2/√π) · (n − 1)/(2n − 3√2)

is decreasing in n, and ranges from 3.628 down to 1.627 (cf. the figure on slide
II:8-62).
– We can upper-bound Cn by the unique positive solution Dn of

    (π/6)u − h(u) = (√π/6) · [√6/π² · (3 − √2)^{3/2} + (9/2)(11 − 6√2)/√n],

which is strictly decreasing in n. Hence,

    Cn · (2/√π) · (n − 1)/(2n − 3√2) ≤ En = Dn · (2/√π) · (n − 1)/(2n − 3√2),

and the right-hand side of the above inequality is strictly decreasing in n (since
both Dn and (n − 1)/(2n − 3√2) are decreasing), ranging from
E3 = 4.1911, . . . , E9 = 2.0363, . . . , E100 = 1.6833 down to E∞ = 1.6266. If the
property of strict decrease is preferred, one can use Dn instead of
Cn in the Berry-Esseen inequality. Note that both Cn and Dn converge to
2.8831 . . . as n goes to infinity.
-
Berry-Esseen theorem for compound i.i.d. sequence II:8-61
• Numerical results show that the quantity lies below 2 for n ≥ 9, and is smaller than
1.68 for n ≥ 100. In other words, we can upper-bound this quantity by 1.68 for
n ≥ 100, and thereby establish a better estimate of the original Berry-Esseen
constant.
-
Berry-Esseen theorem for compound i.i.d. sequence II:8-62
[Figure: the quantity Cn·(2/√π)·(n − 1)/(2n − 3√2) plotted against n for
n between 3 and 200, decreasing toward its limit and crossing 1.68 near n = 100.]
The Berry-Esseen constant as a function of the sample size n. The sample size n
is plotted in log-scale.
-
Generalized Neyman-Pearson Hypothesis Testing II:8-63
The general expression of the Neyman-Pearson type-II error exponent subject to
a constant bound on the type-I error has been proved for arbitrary observations.
In this section, we will state the results in terms of the ε-inf/sup-divergence rates.
Theorem 8.30 (Neyman-Pearson type-II error exponent for a fixed
test level) Consider a sequence of random observations which is assumed to have a
probability distribution governed by either PX (null hypothesis) or PX̂ (alternative
hypothesis). Then, the type-II error exponent satisfies

    lim_{δ↑ε} D̄δ(X‖X̂) ≤ lim sup_{n→∞} −(1/n) log β*n(ε) ≤ D̄ε(X‖X̂),
    lim_{δ↑ε} Dδ(X‖X̂) ≤ lim inf_{n→∞} −(1/n) log β*n(ε) ≤ Dε(X‖X̂),

where β*n(ε) represents the minimum type-II error probability subject to a fixed
type-I error bound ε ∈ [0, 1).
The general formula for Neyman-Pearson type-II error exponent subject to an
exponential test level has also been proved in terms of the ε-inf/sup-divergence
rates.
-
Generalized Neyman-Pearson Hypothesis Testing II:8-64
Theorem 8.31 (Neyman-Pearson type-II error exponent for an ex-
ponential test level) Fix s ∈ (0, 1) and ε ∈ [0, 1). It is possible to choose
decision regions for a binary hypothesis testing problem with arbitrary datawords
of blocklength n (which are governed by either the null hypothesis distribution PXn
or the alternative hypothesis distribution PX̂n) such that

    lim inf_{n→∞} −(1/n) log βn ≥ D̄ε(X̂(s)‖X̂)  and  lim sup_{n→∞} −(1/n) log αn ≥ D_{(1−ε)}(X̂(s)‖X),    (8.5.13)

or

    lim inf_{n→∞} −(1/n) log βn ≥ Dε(X̂(s)‖X̂)  and  lim sup_{n→∞} −(1/n) log αn ≥ D̄_{(1−ε)}(X̂(s)‖X),    (8.5.14)

where X̂(s) exhibits the tilted distributions {P(s)X̂n}∞n=1 defined by

    dP(s)X̂n(xn) = [1/Ωn(s)] exp{s log[dPXn/dPX̂n](xn)} dPX̂n(xn),

and

    Ωn(s) = ∫_{Xn} exp{s log[dPXn/dPX̂n](xn)} dPX̂n(xn).

Here, αn and βn are the type-I and type-II error probabilities, respectively.