Source: shannon.cm.nctu.edu.tw/it/c2-8s04.pdf
-
Chapter 8
Hypothesis Testing
Po-Ning Chen
Department of Communications Engineering
National Chiao-Tung University
Hsin Chu, Taiwan 30050
-
Error exponent and divergence II:8-1
Definition 8.1 (exponent) A real number a is said to be the exponent for a
sequence of non-negative quantities {an}n≥1 converging to zero, if

    a = lim_{n→∞} −(1/n) log an.

• Operationally, the exponent is an index of the exponential rate of convergence
of the sequence an: for any γ > 0,

    e^{−n(a+γ)} ≤ an ≤ e^{−n(a−γ)}, for n large enough.
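As a quick numerical illustration of this definition (the particular sequence is my own example, not from the notes), a polynomial prefactor does not change the exponent, since (1/n) log n → 0:

```python
import math

# Hypothetical example: a_n = n^2 * e^{-0.5 n} converges to zero with
# exponent 0.5; the factor n^2 drops out of -(1/n) log a_n in the limit.
def empirical_exponent(n):
    # work in log domain to avoid underflow of e^{-0.5 n} for large n
    log_a_n = 2 * math.log(n) - 0.5 * n   # log of a_n = n^2 e^{-n/2}
    return -log_a_n / n

for n in (10, 100, 1000, 10000):
    print(n, empirical_exponent(n))   # approaches 0.5 as n grows
```

The printed values increase toward the true exponent 0.5, consistent with the sandwich bound above for any γ > 0.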
• Recall that in proving the channel coding theorem, the probability of decoding
error for channel block codes can be made arbitrarily close to zero when the
rate of the codes is less than the channel capacity.
• Mathematically, this result can be written as: Pe(C̃*n) → 0 as n → ∞,
provided R = lim sup_{n→∞} (1/n) log ‖C̃*n‖ < C, where C̃*n is the optimal code
for block length n.
• From the theorem, we only know that the decoding error vanishes as the block
length increases; it does not reveal how fast the decoding error approaches zero.
-
Error exponent and divergence II:8-2
• In other words, we do not know the rate of convergence of the decoding error.
Sometimes this information is very important, especially for deciding the
block length sufficient to achieve some error bound.
• The first step in investigating the rate of convergence of the decoding error is
to compute its exponent, provided the decoding error decays to zero exponentially fast
(it indeed does for memoryless channels). This exponent, as a function of the
rate, is called the channel reliability function, and will be discussed in
the next chapter.
• For hypothesis testing problems, the type II error probability at a fixed test
level also decays to zero as the number of observations increases. As it turns
out, its exponent is the divergence of the null hypothesis distribution against
the alternative hypothesis distribution.
-
Stein’s lemma II:8-3
Lemma 8.2 (Stein’s lemma) For a sequence of i.i.d. observations Xn which is
possibly drawn from either the null hypothesis distribution PXn or the alternative
hypothesis distribution PX̂n, the type II error satisfies

    (∀ ε ∈ (0, 1))  lim_{n→∞} −(1/n) log β*n(ε) = D(PX‖PX̂),

where β*n(ε) = min_{αn≤ε} βn, and αn and βn represent the type I and type II errors,
respectively.
Proof: [1. Forward Part]
In the forward part, we prove that there exists an acceptance region for the null
hypothesis such that

    lim inf_{n→∞} −(1/n) log βn(ε) ≥ D(PX‖PX̂).

step 1: divergence typical set. For any δ > 0, define the divergence typical set
as

    An(δ) = {xn : |(1/n) log[PXn(xn)/PX̂n(xn)] − D(PX‖PX̂)| < δ}.

Note that in the divergence typical set,

    PX̂n(xn) ≤ PXn(xn) e^{−n(D(PX‖PX̂)−δ)}.
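A small numerical check of the idea behind the typical set (my own sketch, with Bernoulli distributions as assumed example parameters): by the weak law of large numbers, (1/n) log[PXn(xn)/PX̂n(xn)] concentrates around D(PX‖PX̂) when xn is drawn i.i.d. from PX, which is why An(δ) captures almost all of the PXn mass:

```python
import math, random

random.seed(0)
p, q = 0.5, 0.9          # assumed example: P_X = Bernoulli(p), P_X^ = Bernoulli(q)
n = 200_000

def log_ratio(x):        # log P_X(x) / P_X^(x) for one symbol
    return math.log(p / q) if x == 1 else math.log((1 - p) / (1 - q))

xs = [1 if random.random() < p else 0 for _ in range(n)]
empirical = sum(log_ratio(x) for x in xs) / n
divergence = p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))
print(empirical, divergence)   # the two numbers nearly coincide
```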
-
Stein’s lemma II:8-4
step 2: computation of type I error. By the weak law of large numbers,
PXn(An(δ)) → 1. Hence,

    αn = PXn(Acn(δ)) < ε, for sufficiently large n.

step 3: computation of type II error.

    βn(ε) = PX̂n(An(δ))
          = Σ_{xn∈An(δ)} PX̂n(xn)
          ≤ Σ_{xn∈An(δ)} PXn(xn) e^{−n(D(PX‖PX̂)−δ)}
          = e^{−n(D(PX‖PX̂)−δ)} Σ_{xn∈An(δ)} PXn(xn)
          = e^{−n(D(PX‖PX̂)−δ)} (1 − αn).

Hence,

    −(1/n) log βn(ε) ≥ D(PX‖PX̂) − δ + (1/n) log(1 − αn),
-
Stein’s lemma II:8-5
which implies

    lim inf_{n→∞} −(1/n) log βn(ε) ≥ D(PX‖PX̂) − δ.

The above inequality is true for any δ > 0. Therefore

    lim inf_{n→∞} −(1/n) log βn(ε) ≥ D(PX‖PX̂).
[2. Converse Part]
In the converse part, we prove that for any acceptance region Bn for the null
hypothesis satisfying the type I error constraint, i.e.,

    αn(Bn) = PXn(Bcn) ≤ ε,

its type II error βn(Bn) satisfies

    lim sup_{n→∞} −(1/n) log βn(Bn) ≤ D(PX‖PX̂).
-
Stein’s lemma II:8-6
    βn(Bn) = PX̂n(Bn) ≥ PX̂n(Bn ∩ An(δ))
           = Σ_{xn∈Bn∩An(δ)} PX̂n(xn)
           ≥ Σ_{xn∈Bn∩An(δ)} PXn(xn) e^{−n(D(PX‖PX̂)+δ)}
           = e^{−n(D(PX‖PX̂)+δ)} PXn(Bn ∩ An(δ))
           ≥ e^{−n(D(PX‖PX̂)+δ)} (1 − PXn(Bcn) − PXn(Acn(δ)))
           = e^{−n(D(PX‖PX̂)+δ)} (1 − αn(Bn) − PXn(Acn(δ)))
           ≥ e^{−n(D(PX‖PX̂)+δ)} (1 − ε − PXn(Acn(δ))).

Hence,

    −(1/n) log βn(Bn) ≤ D(PX‖PX̂) + δ + (1/n) log(1 − ε − PXn(Acn(δ))),

which implies that

    lim sup_{n→∞} −(1/n) log βn(Bn) ≤ D(PX‖PX̂) + δ.

The above inequality is true for any δ > 0. Therefore,

    lim sup_{n→∞} −(1/n) log βn(Bn) ≤ D(PX‖PX̂).
-
Composition of sequence of i.i.d. observations II:8-7
• Stein’s lemma gives the exponent of the type II error probability at a fixed test
level.
• This exponent, which is the divergence of the null hypothesis distribution
against the alternative hypothesis distribution, is independent of the type I
error bound ε for i.i.d. observations.
• Specifically, in the i.i.d. setting, the probability of each sequence xn depends
only on its composition, which is defined as the |X|-dimensional vector

    (#1(xn)/n, #2(xn)/n, . . . , #k(xn)/n),

where X = {1, 2, . . . , k}, and #i(xn) is the number of occurrences of symbol
i in xn.
• The probability of xn can therefore be written as

    PXn(xn) = PX(1)^{#1(xn)} × PX(2)^{#2(xn)} × · · · × PX(k)^{#k(xn)}.
• Note that #1(xn) + · · · + #k(xn) = n.
• Since the composition of a sequence determines its probability deterministically, all
sequences with the same composition have the same statistical properties,
and hence should be treated alike when processing.
-
Composition of sequence of i.i.d. observations II:8-8
• Instead of manipulating the sequences of observations based on the typical-set-like
concept, we may focus on their compositions.
• As it turns out, this approach yields simpler proofs and better geometrical
explanations for theories in the i.i.d. setting.
• (It should be pointed out that when the composition alone cannot determine
the probability, this viewpoint does not seem to be effective.)
Lemma 8.3 (polynomial bound on number of composition) The num-
ber of compositions increases polynomially fast, while the number of possible se-
quences increases exponentially fast.
Proof:
• Let Pn denote the set of all possible compositions. Each composition is determined
by the counts #1(xn), . . . , #|X|(xn), each taking a value in {0, 1, . . . , n}; hence
• |Pn| ≤ (n + 1)^{|X|},
which is polynomial in n, whereas the number of sequences |X|^n is exponential in n.
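The contrast in Lemma 8.3 can be seen directly (the alphabet size and values of n below are my own choices): the exact number of compositions is the number of ways to split n among |X| symbols, C(n+|X|−1, |X|−1), which is polynomial in n, while the number of sequences is |X|^n:

```python
import math

k = 4                      # |X|, an assumed example alphabet size
for n in (10, 50, 100):
    num_compositions = math.comb(n + k - 1, k - 1)   # stars-and-bars count
    num_sequences = k ** n
    assert num_compositions <= (n + 1) ** k          # the bound used in the proof
    print(n, num_compositions, num_sequences)
```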
-
Composition of sequence of i.i.d. observations II:8-9
Lemma 8.4 (probability of sequences of the same composition) The
probability of the sequences of composition C with respect to distribution PXn
satisfies

    (n + 1)^{−|X|} e^{−nD(PC‖PX)} ≤ PXn(C) ≤ e^{−nD(PC‖PX)},

where PC is the composition distribution for composition C, and C (by abuse of
notation, without ambiguity) also denotes the set of all sequences (in Xn) of
composition C.
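The sandwich bound of Lemma 8.4 can be verified exhaustively on a small binary example (the values of n and PX are my own assumptions):

```python
import math

n, p = 20, 0.3            # assumed example: P_X = Bernoulli(p), |X| = 2

def kl(a, b):             # D(Bernoulli(a) || Bernoulli(b)) in nats
    t = 0.0
    if a > 0: t += a * math.log(a / b)
    if a < 1: t += (1 - a) * math.log((1 - a) / (1 - b))
    return t

for k in range(n + 1):
    # exact probability of the composition class with k ones
    prob_class = math.comb(n, k) * p**k * (1 - p)**(n - k)
    d = kl(k / n, p)
    upper = math.exp(-n * d)
    lower = upper / (n + 1) ** 2          # (n+1)^{-|X|} with |X| = 2
    # tiny slack guards against floating-point rounding at the endpoints
    assert lower <= prob_class <= upper * (1 + 1e-9)
print("Lemma 8.4 bounds verified for n =", n)
```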
Theorem 8.5 (Sanov’s Theorem) Let En be the set consisting of all compositions
over finite alphabet X whose composition distribution belongs to P. Fix
a sequence of product distributions PXn = ∏_{i=1}^n PX. Then,

    lim inf_{n→∞} −(1/n) log PXn(En) ≥ inf_{P∈P} D(P‖PX).

If, in addition, for every distribution P in P there exists a sequence of composition
distributions PC1, PC2, PC3, . . . ∈ P such that lim sup_{n→∞} D(PCn‖PX) = D(P‖PX), then

    lim sup_{n→∞} −(1/n) log PXn(En) ≤ inf_{P∈P} D(P‖PX).
-
Geometrical interpretation for Sanov’s theorem II:8-10
[Figure: the probability simplex, containing the set P of distributions (with
points such as P1 and P2) and the point PX outside it; the exponent
min_{P∈P} D(P‖PX) is attained at the point of P closest to PX in divergence.]
The geometric meaning for Sanov’s theorem.
-
Geometrical interpretation for Sanov’s theorem II:8-11
Example 8.6 • Question: One wants to roughly estimate the probability
that the average of the throws is greater than or equal to 4, when tossing a fair die
n times.
• Observe that whether the requirement is satisfied depends only on the
composition of the observations.
• Let En be the set of compositions which satisfy the requirement:

    En = {C : Σ_{i=1}^6 i·PC(i) ≥ 4}.
• To minimize D(PC‖PX) for C ∈ En, we can use the Lagrange multiplier technique
(since divergence is convex in its first argument), with the constraints on PC being:

    Σ_{i=1}^6 i·PC(i) = k  and  Σ_{i=1}^6 PC(i) = 1

for k = 4, 5, 6, . . . , n.
• So it becomes the minimization of:

    Σ_{i=1}^6 PC(i) log[PC(i)/PX(i)] + λ1 (Σ_{i=1}^6 i·PC(i) − k) + λ2 (Σ_{i=1}^6 PC(i) − 1).
-
Geometrical interpretation for Sanov’s theorem II:8-12
• By taking the derivatives, we find that the minimizer should be of the form

    PC(i) = e^{λ1·i} / Σ_{j=1}^6 e^{λ1·j},

where λ1 is chosen to satisfy

    Σ_{i=1}^6 i·PC(i) = k.    (8.1.1)

• Since the above is true for all k ≥ 4, it suffices to take the smallest one as our
solution, i.e., k = 4.
• Finally, by solving (8.1.1) for k = 4 numerically, the minimizer is

    PC* = (0.1031, 0.1227, 0.1461, 0.1740, 0.2072, 0.2468),

and the exponent of the desired probability is D(PC*‖PX) = 0.0433 nat.
• Consequently,

    PXn(En) ≈ e^{−0.0433·n}.
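The numerical solution of (8.1.1) for k = 4 can be reproduced with a few lines (a sketch; the use of bisection and its tolerance are my own choices, relying on the fact that the tilted mean is increasing in λ1):

```python
import math

def tilted(lmbda):
    # exponentially tilted distribution P_C(i) ~ e^{lambda * i}, i = 1..6
    w = [math.exp(lmbda * i) for i in range(1, 7)]
    z = sum(w)
    return [x / z for x in w]

def mean(dist):
    return sum(i * pi for i, pi in enumerate(dist, start=1))

lo, hi = 0.0, 2.0                  # mean(tilted) increases with lambda
for _ in range(100):               # bisection to solve mean = 4, i.e. (8.1.1)
    mid = (lo + hi) / 2
    if mean(tilted(mid)) < 4:
        lo = mid
    else:
        hi = mid

p_star = tilted(lo)
exponent = sum(pi * math.log(6 * pi) for pi in p_star)   # D(P_C* || uniform)
print([round(x, 4) for x in p_star])   # matches (0.1031, ..., 0.2468) above
print(exponent)                        # close to 0.0433 nat
```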
-
Divergence typical set on composition II:8-13
• Divergence typical set in Stein’s lemma:

    An(δ) = {xn : |(1/n) log[PXn(xn)/PX̂n(xn)] − D(PX‖PX̂)| < δ}.

• Its composition analogue:

    Tn(δ) = {xn ∈ Xn : D(PCxn‖PX) ≤ δ},

where Cxn represents the composition of xn.
• PXn(Tn(δ)) → 1 is justified by

    1 − PXn(Tn(δ)) = Σ_{C : D(PC‖PX)>δ} PXn(C)
                   ≤ Σ_{C : D(PC‖PX)>δ} e^{−nD(PC‖PX)}  (from Lemma 8.4)
                   ≤ Σ_{C : D(PC‖PX)>δ} e^{−nδ}
                   ≤ (n + 1)^{|X|} e^{−nδ}  (cf. Lemma 8.3).
-
Universal source coding on composition II:8-14
• Universal code

    fn : Xn → ∪_{i=1}^∞ {0, 1}^i

for i.i.d. sources:

    (1/n) Σ_{xn∈Xn} PXn(xn) ℓ(fn(xn)) → H(X),

as n goes to infinity.
Example 8.7 (universal encoding using compositions)
• Binary-index the compositions using log2(n + 1)^{|X|} = |X| log2(n + 1) bits, and denote this
binary index for composition C by a(C).
– Let Cxn denote the composition of xn, i.e., xn ∈ Cxn.
• Binary-index the elements in C using n · H(PC) bits, and denote this binary
index for the elements of Cxn by b(Cxn).
– For each composition C, the number of sequences xn in C is at
most 2^{n·H(PC)} (here H(PC) is measured in bits, i.e., the logarithmic base
in the entropy is 2; see the proof of Lemma 8.4).
-
Universal source coding on composition II:8-15
• Define a universal encoding function fn as
fn(xn) = concatenation{a(Cxn), b(Cxn)}.
• Then this encoding rule is a universal code for all i.i.d. sources.
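A minimal working sketch of this two-part code for a binary alphabet (the function names and the lexicographic ranking scheme are my own; the composition index here is simply the number of ones, and the in-class index is the sequence's rank among all sequences with that many ones):

```python
import math

def rank_in_class(bits):
    """Lexicographic rank of `bits` among sequences with the same number of ones."""
    n, k, r = len(bits), sum(bits), 0
    for i, b in enumerate(bits):
        if b == 1:
            # all sequences placing 0 here (with k ones still to come) are smaller
            r += math.comb(n - i - 1, k)
            k -= 1
    return r

def unrank(n, k, r):
    """Inverse of rank_in_class: rebuild the sequence from (n, #ones, rank)."""
    bits = []
    for i in range(n):
        c = math.comb(n - i - 1, k)   # completions that start with 0 here
        if k > 0 and r >= c:
            bits.append(1); r -= c; k -= 1
        else:
            bits.append(0)
    return bits

x = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
code = (sum(x), rank_in_class(x))      # (composition index a, in-class index b)
assert unrank(len(x), code[0], code[1]) == x   # lossless round trip
print(code)
```

The pair needs about log2(n+1) bits for the composition plus log2 C(n,k) ≤ n·H(PC) bits for the rank, matching the length analysis that follows.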
-
Universal source coding on composition II:8-16
Proof:
    ℓ̄n = Σ_{xn∈Xn} PXn(xn) ℓ(a(Cxn)) + Σ_{xn∈Xn} PXn(xn) ℓ(b(Cxn))
       ≤ Σ_{xn∈Xn} PXn(xn) · log2(n + 1)^{|X|} + Σ_{xn∈Xn} PXn(xn) · n · H(PCxn)
       = |X| · log2(n + 1) + Σ_{C} PXn(C) · n · H(PC).

Hence,

    (1/n) ℓ̄n ≤ |X| log2(n + 1)/n + Σ_{C} PXn(C) H(PC).
-
Universal source coding on composition II:8-17

    Σ_{C} PXn(C) H(PC)
      = Σ_{C∈Tn(δ)} PXn(C) H(PC) + Σ_{C∉Tn(δ)} PXn(C) H(PC)
      ≤ max_{C : D(PC‖PX)≤δ/log(2)} H(PC) + Σ_{C : D(PC‖PX)>δ/log(2)} PXn(C) H(PC)
      ≤ max_{C : D(PC‖PX)≤δ/log(2)} H(PC) + Σ_{C : D(PC‖PX)>δ/log(2)} 2^{−nD(PC‖PX)} H(PC)  (from Lemma 8.4)
      ≤ max_{C : D(PC‖PX)≤δ/log(2)} H(PC) + Σ_{C : D(PC‖PX)>δ/log(2)} e^{−nδ} H(PC)
      ≤ max_{C : D(PC‖PX)≤δ/log(2)} H(PC) + Σ_{C : D(PC‖PX)>δ/log(2)} e^{−nδ} log2 |X|
      ≤ max_{C : D(PC‖PX)≤δ/log(2)} H(PC) + (n + 1)^{|X|} e^{−nδ} log2 |X|,

where the second term of the last step vanishes as n → ∞. (Note that when the base-2
logarithm is taken in the divergence instead of the natural logarithm, the range [0, δ] in
-
Universal source coding on composition II:8-18
Tn(δ) should be replaced by [0, δ/ log(2)].) It remains to show that
    max_{C : D(PC‖PX)≤δ/log(2)} H(PC) ≤ H(X) + γ(δ),

where γ(δ) only depends on δ, and approaches zero as δ → 0. . . .
-
Likelihood ratio versus divergence II:8-19
• Recall that the Neyman-Pearson lemma indicates that the optimal test for two
hypotheses is of the form

    PXn(xn)/PX̂n(xn) ≷ τ.    (8.1.2)

• This is the likelihood ratio test, and the quantity PXn(xn)/PX̂n(xn) is called
the likelihood ratio.
• If the log operation is performed on both sides of (8.1.2), the test remains the same.
-
Likelihood ratio versus divergence II:8-20
    log[PXn(xn)/PX̂n(xn)]
      = Σ_{i=1}^n log[PX(xi)/PX̂(xi)]
      = Σ_{a∈X} #a(xn) log[PX(a)/PX̂(a)]
      = Σ_{a∈X} n·PCxn(a) log[PX(a)/PX̂(a)]
      = n · Σ_{a∈X} PCxn(a) log[(PX(a)/PCxn(a)) · (PCxn(a)/PX̂(a))]
      = n [Σ_{a∈X} PCxn(a) log(PCxn(a)/PX̂(a)) − Σ_{a∈X} PCxn(a) log(PCxn(a)/PX(a))]
      = n [D(PCxn‖PX̂) − D(PCxn‖PX)].

Hence, (8.1.2) is equivalent to

    D(PCxn‖PX̂) − D(PCxn‖PX) ≷ (1/n) log τ.    (8.1.3)

• This equivalence means that for hypothesis testing, the selection of the acceptance
region can be made upon compositions instead of observations.
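The identity just derived is easy to check numerically on a toy example (the alphabet and distributions below are my own assumptions):

```python
import math
from collections import Counter

P  = {'a': 0.5, 'b': 0.3, 'c': 0.2}    # assumed null hypothesis P_X
Q  = {'a': 0.2, 'b': 0.3, 'c': 0.5}    # assumed alternative P_X^
xn = list("aabacbccab")
n  = len(xn)

# left-hand side: log-likelihood ratio of the whole sequence
llr = sum(math.log(P[x] / Q[x]) for x in xn)

# right-hand side: n * [D(P_Cxn || P_X^) - D(P_Cxn || P_X)]
comp = {a: cnt / n for a, cnt in Counter(xn).items()}   # composition P_Cxn
def kl(R, S):
    return sum(r * math.log(r / S[a]) for a, r in R.items())

assert abs(llr - n * (kl(comp, Q) - kl(comp, P))) < 1e-9
print(llr)
```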
-
Likelihood ratio versus divergence II:8-21
• In other words, the optimal decision function can be defined as:

    φ(C) = 0, if composition C is classified as belonging to the null hypothesis
              according to (8.1.3);
           1, otherwise.
-
Exponent of Bayesian cost II:8-22
• Randomization is of no help to the Bayesian test: the randomized decision

    φ(xn) = 0, with probability η;  1, with probability 1 − η;

satisfies

    π0 η PXn(xn) + π1 (1 − η) PX̂n(xn) ≥ min{π0 PXn(xn), π1 PX̂n(xn)}.

• Now suppose the acceptance region for the null hypothesis is

    A = {C : D(PC‖PX̂) − D(PC‖PX) > τ′}.

• Then by Sanov’s theorem, the exponent of the type II error βn is

    min_{C∈A} D(PC‖PX̂).

• Similarly, the exponent of the type I error αn is

    min_{C∈Ac} D(PC‖PX).
-
Exponent of Bayesian cost II:8-23
• Lagrange multipliers: by taking the derivative of

    D(PX̃‖PX̂) + λ (D(PX̃‖PX̂) − D(PX̃‖PX) − τ′) + ν (Σ_{x∈X} PX̃(x) − 1)

with respect to each PX̃(x), we have

    log[PX̃(x)/PX̂(x)] + 1 + λ log[PX(x)/PX̂(x)] + ν = 0.

Solving these equations, we obtain that the optimal PX̃ is of the form

    PX̃(x) = Pλ(x) = PX^λ(x) PX̂^{1−λ}(x) / Σ_{a∈X} PX^λ(a) PX̂^{1−λ}(a).

• The geometrical explanation for Pλ is that it lies on the “straight line”
between PX and PX̂ (in the sense of the divergence measure) over the probability
space.
-
Exponent of Bayesian cost II:8-24
[Figure: the points PX and PX̂ with PX̃ on the “line segment” between them;
the divergences D(PX̃‖PX) and D(PX̃‖PX̂) are marked, together with the
decision boundary D(PC‖PX) = D(PC‖PX̂) − τ′.]
The divergence view on hypothesis testing.
-
Exponent of Bayesian cost II:8-25
• When λ → 0, Pλ → PX̂; when λ → 1, Pλ → PX.
• Usually, Pλ is named the tilted or twisted distribution.
• The value of λ depends on τ′ = (1/n) log τ.
• It is known from detection theory that the best τ for Bayes testing is π1/π0,
which is fixed.
• Therefore,

    τ′ = lim_{n→∞} (1/n) log(π1/π0) = 0,

which implies that the optimal exponent for the Bayes error is the minimum of
D(Pλ‖PX) subject to D(Pλ‖PX) = D(Pλ‖PX̂), namely the mid-point (λ = 1/2)
of the line segment (PX, PX̂) in probability space. This quantity is called
the Chernoff bound.
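The balancing condition D(Pλ‖PX) = D(Pλ‖PX̂) can be solved numerically for the tilted family (a sketch with my own example distributions; it uses the fact that D(Pλ‖PX) − D(Pλ‖PX̂) is decreasing in λ):

```python
import math

PX = [0.7, 0.2, 0.1]    # assumed null hypothesis
QX = [0.1, 0.3, 0.6]    # assumed alternative hypothesis

def p_lambda(l):
    # tilted distribution P_lambda ~ PX^l * QX^(1-l)
    w = [p**l * q**(1 - l) for p, q in zip(PX, QX)]
    z = sum(w)
    return [x / z for x in w]

def kl(r, s):
    return sum(a * math.log(a / b) for a, b in zip(r, s) if a > 0)

def gap(l):              # D(P_l||PX) - D(P_l||QX): positive at l=0, negative at l=1
    pl = p_lambda(l)
    return kl(pl, PX) - kl(pl, QX)

lo, hi = 0.0, 1.0
for _ in range(200):     # bisection for the mid-point of the divergence segment
    mid = (lo + hi) / 2
    if gap(mid) > 0:
        lo = mid
    else:
        hi = mid

pl = p_lambda(lo)
print(lo, kl(pl, PX))    # balancing lambda and the Bayes-error (Chernoff) exponent
```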
-
Large deviations theory II:8-26
• Large deviations theory is basically concerned with techniques for computing the
exponent of an exponentially decaying probability.
-
Tilted or twisted distribution II:8-27
• Suppose the probability PX(An) of a set decreases to zero exponentially
fast, and its exponent is equal to a > 0.
• Over the probability space, let P denote the set of those distributions PX̃ for
which PX̃(An) exhibits zero exponent.
• Then, applying the same concept as in Sanov’s theorem, we can expect that

    a = min_{PX̃∈P} D(PX̃‖PX).

• Now suppose the minimum of the above function occurs at f(PX̃) = τ for
some constant τ and some differentiable function f(·); the minimizer should
be of the form

    (∀ a ∈ X)  PX̃(a) = PX(a) e^{λ ∂f(PX̃)/∂PX̃(a)} / Σ_{a′∈X} PX(a′) e^{λ ∂f(PX̃)/∂PX̃(a′)}.

As a result, PX̃ is the distribution obtained from PX by exponential twisting via
the partial derivative of the function f.
• Note that PX̃ is usually written as P(λ)X, since it is generated by twisting PX
with twisting factor λ.
-
Conventional twisted distribution II:8-28
• The conventional definition of the twisted distribution is based on the divergence
function, i.e., f(PX̃) = D(PX̃‖PX̂) − D(PX̃‖PX).
• Since

    ∂D(PX̃‖PX)/∂PX̃(a) = log[PX̃(a)/PX(a)] + 1,

the twisted distribution becomes

    (∀ a ∈ X)  PX̃(a) = PX(a) e^{λ log[PX̂(a)/PX(a)]} / Σ_{a′∈X} PX(a′) e^{λ log[PX̂(a′)/PX(a′)]}
                      = PX^{1−λ}(a) PX̂^λ(a) / Σ_{a′∈X} PX^{1−λ}(a′) PX̂^λ(a′).
-
Cramér’s theorem II:8-29
• Question: Consider a sequence of i.i.d. random variables, Xn, and suppose
that we are interested in the probability of the set

    {(X1 + · · · + Xn)/n > τ}.

• Observe that (X1 + · · · + Xn)/n can be re-written as

    Σ_{a∈X} a · PC(a).

• Therefore, the function f becomes

    f(PX̃) = Σ_{a∈X} a·PX̃(a),

and its partial derivative with respect to PX̃(a) is a.
• The resultant twisted distribution is

    (∀ a ∈ X)  P(λ)X(a) = PX(a) e^{λa} / Σ_{a′∈X} PX(a′) e^{λa′}.

• So the exponent of PXn{(X1 + · · · + Xn)/n > τ} is

    min_{PX̃ : Σ_a a·PX̃(a) > τ} D(PX̃‖PX) = min_{P(λ)X : Σ_a a·P(λ)X(a) > τ} D(P(λ)X‖PX).
-
Cramér’s theorem II:8-30
• It should be pointed out that Σ_{a′∈X} PX(a′) e^{λa′} is the moment generating
function of PX.
• The conventional Cramér result does not use the divergence. Instead, it
introduces the large deviation rate function, defined by

    IX(x) = sup_{θ∈ℝ} [θx − log MX(θ)],    (8.2.4)

where MX(θ) is the moment generating function of X.
• In this formulation, the exponent of the above probability is respectively lower-
and upper-bounded by

    inf_{x≥τ} IX(x)  and  inf_{x>τ} IX(x).

An example of how to obtain the exponent bounds is illustrated in the next
subsection.
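The two viewpoints agree: for a Bernoulli source, the rate function (8.2.4) coincides with the divergence minimized over compositions, since IX(x) = D(Bernoulli(x)‖Bernoulli(p)). A quick numerical check (p and the grid search over θ are my own choices):

```python
import math

p = 0.3                                  # assumed example: X ~ Bernoulli(p)

def log_mgf(t):
    return math.log(1 - p + p * math.exp(t))

def rate_numeric(x, grid=40000, lo=-20.0, hi=20.0):
    # crude grid evaluation of sup_theta [theta*x - log M_X(theta)]
    return max(t * x - log_mgf(t)
               for t in (lo + (hi - lo) * i / grid for i in range(grid + 1)))

def rate_closed(x):
    # D(Bernoulli(x) || Bernoulli(p)) in nats
    return x * math.log(x / p) + (1 - x) * math.log((1 - x) / (1 - p))

for x in (0.4, 0.5, 0.7):
    assert abs(rate_numeric(x) - rate_closed(x)) < 1e-3
print("I_X(x) matches D(Ber(x)||Ber(p)) on the test points")
```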
-
Exponent and moment generating function II:8-31
A) Preliminaries: Observe that since E[X] = µ < λ and E[|X − µ|²] < ∞,

    Pr{(X1 + · · · + Xn)/n ≥ λ} → 0 as n → ∞.

Hence, we can compute its rate of convergence (to zero).
B) Upper bound on the probability:

    Pr{(X1 + · · · + Xn)/n ≥ λ}
      = Pr{θ(X1 + · · · + Xn) ≥ θnλ}, for any θ > 0
      = Pr{exp(θ(X1 + · · · + Xn)) ≥ exp(θnλ)}
      ≤ E[exp(θ(X1 + · · · + Xn))] / exp(θnλ)
      = E^n[exp(θX)] / exp(θnλ)
      = (MX(θ) / exp(θλ))^n.

Hence,

    lim inf_{n→∞} −(1/n) log Pr{(X1 + · · · + Xn)/n > λ} ≥ θλ − log MX(θ).
-
Exponent and moment generating function II:8-32
Since the above inequality holds for every θ > 0, we have

    lim inf_{n→∞} −(1/n) log Pr{(X1 + · · · + Xn)/n > λ} ≥ max_{θ>0} [θλ − log MX(θ)]
                                                        = θ*λ − log MX(θ*),

where θ* > 0 is the optimizer of the maximum operation. (The positivity of
θ* can be easily verified from the concavity of the function θλ − log MX(θ) in
θ, whose derivative at θ = 0 equals (λ − µ), which is strictly greater than 0.)
Consequently,

    lim inf_{n→∞} −(1/n) log Pr{(X1 + · · · + Xn)/n > λ} ≥ θ*λ − log MX(θ*)
                                                        = sup_{θ∈ℝ} [θλ − log MX(θ)] = IX(λ).

C) Lower bound on the probability: omitted.
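The upper bound of part B) can be checked against an exactly computable tail (a sketch; the Bernoulli parameters and the crude sweep over θ are my own choices):

```python
import math

n, p, lam = 50, 0.3, 0.5      # assumed example: X_i ~ Bernoulli(p), threshold lam

# exact tail Pr{(X1+...+Xn)/n >= lam} via the binomial distribution
exact_tail = sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
                 for k in range(math.ceil(n * lam), n + 1))

def chernoff(theta):
    # the bound (M_X(theta)/e^{theta*lam})^n, valid for every theta > 0
    mgf = 1 - p + p * math.exp(theta)
    return (mgf / math.exp(theta * lam)) ** n

best = min(chernoff(0.01 * i) for i in range(1, 500))   # sweep theta in (0, 5)
assert exact_tail <= best      # the Chernoff bound indeed dominates the tail
print(exact_tail, best)
```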
-
Theories on Large deviations II:8-33
In this section, we derive inequalities on the exponent of the probability
Pr{Zn/n ∈ [a, b]}, which form a slight extension of the Gärtner-Ellis theorem.
-
Extension of Gärtner-Ellis upper bounds II:8-34
Definition 8.8 In this subsection, {Zn}∞n=1 denotes an infinite sequence of
arbitrary random variables.
Definition 8.9 Define

    ϕn(θ) = (1/n) log E[exp{θZn}]  and  ϕ̄(θ) = lim sup_{n→∞} ϕn(θ).

The sup-large deviation rate function of an arbitrary random sequence {Zn}∞n=1
is defined as

    Ī(x) = sup_{θ∈ℝ : ϕ̄(θ)>−∞} [θx − ϕ̄(θ)].    (8.3.5)

The range of the supremum operation in (8.3.5) is always non-empty since ϕ̄(0) =
0, i.e., {θ ∈ ℝ : ϕ̄(θ) > −∞} ≠ ∅. Hence, Ī(x) is always defined. With the
above definition, the first extension theorem of Gärtner-Ellis can be stated as
follows.
Theorem 8.10 For a, b ∈ ℝ and a ≤ b,

    lim sup_{n→∞} (1/n) log Pr{Zn/n ∈ [a, b]} ≤ − inf_{x∈[a,b]} Ī(x).
• The bound obtained in the above theorem is not in general tight.
-
Extension of Gärtner-Ellis upper bounds II:8-35
Example 8.11 Suppose that Pr{Zn = 0} = 1 − e^{−2n} and Pr{Zn = −2n} =
e^{−2n}. Then from Definition 8.9, we have

    ϕn(θ) = (1/n) log E[e^{θZn}] = (1/n) log[1 − e^{−2n} + e^{−(θ+1)·2n}],

and

    ϕ̄(θ) = lim sup_{n→∞} ϕn(θ) = 0, for θ ≥ −1;  −2(θ + 1), for θ < −1.

Hence, {θ ∈ ℝ : ϕ̄(θ) > −∞} = ℝ and

    Ī(x) = sup_{θ∈ℝ} [θx − ϕ̄(θ)]
         = sup_{θ∈ℝ} [θx + 2(θ + 1)·1{θ < −1}]
         = −x, for −2 ≤ x ≤ 0;  ∞, otherwise,

where 1{·} represents the indicator function of a set.
-
Extension of Gärtner-Ellis upper bounds II:8-36
Consequently, by Theorem 8.10,

    lim sup_{n→∞} (1/n) log Pr{Zn/n ∈ [a, b]} ≤ − inf_{x∈[a,b]} Ī(x)
      = 0, for 0 ∈ [a, b];  b, for b ∈ [−2, 0];  −∞, otherwise.

The exponent of Pr{Zn/n ∈ [a, b]} in the above example is in fact given by

    lim_{n→∞} (1/n) log PZn{Zn/n ∈ [a, b]} = − inf_{x∈[a,b]} I*(x),

where

    I*(x) = 2, for x = −2;  0, for x = 0;  ∞, otherwise.    (8.3.6)

Thus, the upper bound obtained in Theorem 8.10 is not tight.
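The computation of ϕ̄(θ) in Example 8.11 can be checked numerically (a sketch; the values of n and θ are my own choices, and the log-sum-exp form is used to avoid overflow):

```python
import math

def phi_n(theta, n):
    # (1/n) log[1 - e^{-2n} + e^{-(theta+1)*2n}], computed stably
    a = math.log1p(-math.exp(-2 * n))     # log(1 - e^{-2n}), essentially 0
    b = -(theta + 1) * 2 * n              # log of the second term
    m = max(a, b)
    return (m + math.log(math.exp(a - m) + math.exp(b - m))) / n

def phi_bar(theta):
    # the claimed limit from Example 8.11
    return 0.0 if theta >= -1 else -2 * (theta + 1)

for theta in (-3.0, -1.5, -0.5, 1.0):
    assert abs(phi_n(theta, 2000) - phi_bar(theta)) < 1e-3
print("phi_n converges to the claimed phi_bar")
```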
-
Extension of Gärtner-Ellis upper bounds II:8-37
Definition 8.12 Define

    ϕn(θ; h) = (1/n) log E[exp{n·θ·h(Zn/n)}]  and  ϕ̄h(θ) = lim sup_{n→∞} ϕn(θ; h),

where h(·) is a given real-valued continuous function. The twisted sup-large
deviation rate function of an arbitrary random sequence {Zn}∞n=1 with respect to a
real-valued continuous function h(·) is defined as

    J̄h(x) = sup_{θ∈ℝ : ϕ̄h(θ)>−∞} [θ·h(x) − ϕ̄h(θ)].    (8.3.7)

Theorem 8.13 Suppose that h(·) is a real-valued continuous function. Then for
a, b ∈ ℝ and a ≤ b,

    lim sup_{n→∞} (1/n) log Pr{Zn/n ∈ [a, b]} ≤ − inf_{x∈[a,b]} J̄h(x).
-
Extension of Gärtner-Ellis upper bounds II:8-38
Example 8.14 Let us, again, investigate the {Zn}∞n=1 defined in Example 8.11.
Take

    h(x) = (1/2)(x + 2)² − 1.

Then from Definition 8.12, we have

    ϕn(θ; h) = (1/n) log E[exp{nθh(Zn/n)}]
             = (1/n) log[exp{nθ} − exp{n(θ − 2)} + exp{−n(θ + 2)}],

and

    ϕ̄h(θ) = lim sup_{n→∞} ϕn(θ; h) = −(θ + 2), for θ ≤ −1;  θ, for θ > −1.

Hence, {θ ∈ ℝ : ϕ̄h(θ) > −∞} = ℝ and

    J̄h(x) = sup_{θ∈ℝ} [θh(x) − ϕ̄h(θ)] = −(1/2)(x + 2)² + 2, for x ∈ [−4, 0];  ∞, otherwise.
-
Extension of Gärtner-Ellis upper bounds II:8-39
Consequently, by Theorem 8.13,

    lim sup_{n→∞} (1/n) log Pr{Zn/n ∈ [a, b]} ≤ − inf_{x∈[a,b]} J̄h(x)
      = max{(a + 2)²/2, (b + 2)²/2} − 2, for −4 ≤ a < b ≤ 0;
        0, for a > 0 or b < −4;
        −∞, otherwise.    (8.3.8)

For b ∈ (−2, 0) and a ∈ [−2 − √(2b + 4), b), the upper bound attained in the
previous example is strictly less than that given in Example 8.11, and hence an
improvement is obtained. However, for b ∈ (−2, 0) and a < −2 − √(2b + 4), the
upper bound in (8.3.8) is actually looser. Accordingly, we combine the two upper
bounds from Examples 8.11 and 8.14 to get

    lim sup_{n→∞} (1/n) log Pr{Zn/n ∈ [a, b]} ≤ − max{inf_{x∈[a,b]} J̄h(x), inf_{x∈[a,b]} Ī(x)}
      = 0, for 0 ∈ [a, b];
        (1/2)(b + 2)² − 2, for b ∈ [−2, 0];
        −∞, otherwise.
-
Extension of Gärtner-Ellis upper bounds II:8-40
Theorem 8.15 For a, b ∈ ℝ and a ≤ b,

    lim sup_{n→∞} (1/n) log Pr{Zn/n ∈ [a, b]} ≤ − inf_{x∈[a,b]} J̄(x),

where J̄(x) = sup_{h∈H} J̄h(x) and H is the set of all real-valued continuous functions.
Example 8.16 Let us again study the {Zn}∞n=1 in Example 8.11 (also in Example
8.14). Suppose c > 1. Take hc(x) = c1(x + c2)² − c, where

    c1 = (c + √(c² − 1))/2  and  c2 = 2√(c + 1)/(√(c + 1) + √(c − 1)).

Then from Definition 8.12, we have

    ϕn(θ; hc) = (1/n) log E[exp{nθhc(Zn/n)}]
              = (1/n) log[exp{nθ} − exp{n(θ − 2)} + exp{−n(θ + 2)}],

and

    ϕ̄hc(θ) = lim sup_{n→∞} ϕn(θ; hc) = −(θ + 2), for θ ≤ −1;  θ, for θ > −1.
-
Extension of Gärtner-Ellis upper bounds II:8-41
Hence, {θ ∈ ℝ : ϕ̄hc(θ) > −∞} = ℝ and

    J̄hc(x) = sup_{θ∈ℝ} [θhc(x) − ϕ̄hc(θ)]
            = −c1(x + c2)² + c + 1, for x ∈ [−2c2, 0];  ∞, otherwise.

From Theorem 8.15,

    J̄(x) = sup_{h∈H} J̄h(x) ≥ max{lim inf_{c→∞} J̄hc(x), Ī(x)} = I*(x),

where I*(x) is defined in (8.3.6). Consequently,

    lim sup_{n→∞} (1/n) log Pr{Zn/n ∈ [a, b]} ≤ − inf_{x∈[a,b]} J̄(x) ≤ − inf_{x∈[a,b]} I*(x)
      = 0, if 0 ∈ [a, b];
        −2, if −2 ∈ [a, b] and 0 ∉ [a, b];
        −∞, otherwise,

and a tight upper bound is finally obtained!
-
Extension of Gärtner-Ellis upper bounds II:8-42
Definition 8.17 Define ϕ_h(θ) = lim inf_{n→∞} ϕn(θ; h), where ϕn(θ; h) was defined
in Definition 8.12. The twisted inf-large deviation rate function of an arbitrary
random sequence {Zn}∞n=1 with respect to a real-valued continuous function h(·) is
defined as

    J_h(x) = sup_{θ∈ℝ : ϕ_h(θ)>−∞} [θ·h(x) − ϕ_h(θ)].

Theorem 8.18 For a, b ∈ ℝ and a ≤ b,

    lim inf_{n→∞} (1/n) log Pr{Zn/n ∈ [a, b]} ≤ − inf_{x∈[a,b]} J(x),

where J(x) = sup_{h∈H} J_h(x) and H is the set of all real-valued continuous functions.
-
Extension of Gärtner-Ellis lower bounds II:8-43
• We wish to know when

    lim sup_{n→∞} (1/n) log Pr{Zn/n ∈ (a, b)} ≥ − inf_{x∈(a,b)} J̄h(x).    (8.3.9)

Definition 8.19 Define the sup-Gärtner-Ellis set with respect to a real-valued
continuous function h(·) as

    D̄h = ∪_{θ∈ℝ : ϕ̄h(θ)>−∞} D̄(θ; h),

where

    D̄(θ; h) = {x ∈ ℝ : lim sup_{t↓0} [ϕ̄h(θ + t) − ϕ̄h(θ)]/t ≤ h(x) ≤ lim inf_{t↓0} [ϕ̄h(θ) − ϕ̄h(θ − t)]/t}.

Let us briefly remark on the sup-Gärtner-Ellis set defined above.
• It can be shown that the sup-Gärtner-Ellis set reduces to

    D̄h = ∪_{θ∈ℝ : ϕ̄h(θ)>−∞} {x ∈ ℝ : ϕ̄′h(θ) = h(x)},

if the derivative ϕ̄′h(θ) exists for all θ.
-
Extension of Gärtner-Ellis lower bounds II:8-44
Theorem 8.20 Suppose that h(·) is a real-valued continuous function. Then if
(a, b) ⊂ D̄h,

    lim sup_{n→∞} (1/n) log Pr{Zn/n ∈ (a, b)} ≥ − inf_{x∈(a,b)} J̄h(x).

Example 8.21 Suppose Zn = X1 + · · · + Xn, where {Xi}ni=1 are i.i.d. Gaussian
random variables with mean 1 and variance 1 if n is even, and with mean −1 and
variance 1 if n is odd. Then the exact large deviation rate formula Ī*(x) that
satisfies, for all a < b,

    − inf_{x∈[a,b]} Ī*(x) ≥ lim sup_{n→∞} (1/n) log Pr{Zn/n ∈ [a, b]}
                          ≥ lim sup_{n→∞} (1/n) log Pr{Zn/n ∈ (a, b)}
                          ≥ − inf_{x∈(a,b)} Ī*(x)

is

    Ī*(x) = (|x| − 1)²/2.    (8.3.10)

Case A: h(x) = x.
For the affine h(·), ϕn(θ) = θ + θ²/2 when n is even, and ϕn(θ) = −θ + θ²/2
-
Extension of Gärtner-Ellis lower bounds II:8-45
when n is odd. Hence, ϕ̄(θ) = |θ| + θ²/2, and

    D̄h = (∪_{θ>0} {v ∈ ℝ : v = 1 + θ}) ∪ (∪_{θ<0} {v ∈ ℝ : v = −1 + θ}) = (−∞, −1) ∪ (1, ∞).

Since

    Ī(x) = (|x| − 1)²/2, for |x| > 1;  0, for |x| ≤ 1,

we obtain for any a ∈ (−∞, −1) ∪ (1, ∞),

    lim_{ε↓0} lim sup_{n→∞} (1/n) log Pr{Zn/n ∈ (a − ε, a + ε)}
      ≥ − lim_{ε↓0} inf_{x∈(a−ε,a+ε)} Ī(x) = −(|a| − 1)²/2,
-
Extension of Gärtner-Ellis lower bounds II:8-46
which can be shown to be tight by Theorem 8.13 (or directly by (8.3.10)). Note that
the above inequality does not hold for any a ∈ (−1, 1). To fill the gap, a
different h(·) must be employed.
Case B: h(x) = |x − a|.
For n even,

    E[e^{nθh(Zn/n)}] = E[e^{nθ|Zn/n−a|}]
      = ∫_{−∞}^{na} e^{−θx+nθa} (1/√(2πn)) e^{−(x−n)²/(2n)} dx
        + ∫_{na}^{∞} e^{θx−nθa} (1/√(2πn)) e^{−(x−n)²/(2n)} dx
      = e^{nθ(θ−2+2a)/2} ∫_{−∞}^{na} (1/√(2πn)) e^{−[x−n(1−θ)]²/(2n)} dx
        + e^{nθ(θ+2−2a)/2} ∫_{na}^{∞} (1/√(2πn)) e^{−[x−n(1+θ)]²/(2n)} dx
      = e^{nθ(θ−2+2a)/2} · Φ((θ + a − 1)√n) + e^{nθ(θ+2−2a)/2} · Φ((θ − a + 1)√n),

where Φ(·) represents the unit Gaussian cdf.
-
Extension of Gärtner-Ellis lower bounds II:8-47
Similarly, for n odd,

    E[e^{nθh(Zn/n)}] = e^{nθ(θ+2+2a)/2} · Φ((θ + a + 1)√n) + e^{nθ(θ−2−2a)/2} · Φ((θ − a − 1)√n).

Observe that for any b ∈ ℝ,

    lim_{n→∞} (1/n) log Φ(b√n) = 0, for b ≥ 0;  −b²/2, for b < 0.

Hence,

    ϕ̄h(θ) = −(|a| − 1)²/2, for θ < |a| − 1;
             θ[θ + 2(1 − |a|)]/2, for |a| − 1 ≤ θ < 0;
             θ[θ + 2(1 + |a|)]/2, for θ ≥ 0.
-
Extension of Gärtner-Ellis lower bounds II:8-48
Therefore,

    D̄h = (∪_{θ>0} {x ∈ ℝ : |x − a| = θ + 1 + |a|}) ∪ (∪_{|a|−1≤θ<0} {x ∈ ℝ : |x − a| = θ + 1 − |a|}),

and

    J̄h(x) = (|x − a| − 1 − |a|)²/2, for x > a + 1 + |a| or x < a − 1 − |a|;
             (|x − a| − 1 + |a|)²/2, otherwise.    (8.3.11)
-
Extension of Gärtner-Ellis lower bounds II:8-49
We then apply Theorem 8.20 to obtain

    lim_{ε↓0} lim sup_{n→∞} (1/n) log Pr{Zn/n ∈ (a − ε, a + ε)}
      ≥ − lim_{ε↓0} inf_{x∈(a−ε,a+ε)} J̄h(x)
      = − lim_{ε↓0} (ε − 1 + |a|)²/2
      = −(|a| − 1)²/2.

Note that the above lower bound is valid for any a ∈ (−1, 1), and can be shown
to be tight, again, by Theorem 8.13 (or directly by (8.3.10)).
Finally, by combining the results of Cases A) and B), the true large deviation
rate of {Zn}n≥1 is completely characterized.
-
Extension of Gärtner-Ellis lower bounds II:8-50
Definition 8.22 Define the inf-Gärtner-Ellis set with respect to a real-valued
continuous function h(·) as

    D_h = ∪_{θ∈ℝ : ϕ_h(θ)>−∞} D(θ; h),

where

    D(θ; h) = {x ∈ ℝ : lim sup_{t↓0} [ϕ_h(θ + t) − ϕ_h(θ)]/t ≤ h(x) ≤ lim inf_{t↓0} [ϕ_h(θ) − ϕ_h(θ − t)]/t}.

Theorem 8.23 Suppose that h(·) is a real-valued continuous function. Then if
(a, b) ⊂ D_h,

    lim inf_{n→∞} (1/n) log Pr{Zn/n ∈ (a, b)} ≥ − inf_{x∈(a,b)} J_h(x).
-
Properties II:8-51
Property 8.24 Let Ī(x) and I(x) be the sup- and inf- large deviation rate func-
tions of an infinite sequence of arbitrary random variables {Zn}∞n=1, respectively.
Denote mn = (1/n)E[Zn]. Let m̄ = lim sup_{n→∞} mn and m = lim inf_{n→∞} mn. Then
1. Ī(x) and I(x) are both convex.
2. Ī(x) is continuous over {x ∈ ℝ : Ī(x) < ∞}. Likewise, I(x) is continuous
over {x ∈ ℝ : I(x) < ∞}.
3. Ī(x) attains its minimum value 0 for m ≤ x ≤ m̄.
4. I(x) ≥ 0, but I(x) does not necessarily attain its minimum value at either x = m̄
or x = m.
-
Properties II:8-52
Property 8.25 Suppose that h(·) is a real-valued continuous function. Let J̄h(x)
and J_h(x) be the corresponding twisted sup- and inf- large deviation rate functions,
respectively. Denote mn(h) = E[h(Zn/n)]. Let

    m̄h = lim sup_{n→∞} mn(h)  and  m_h = lim inf_{n→∞} mn(h).

Then
1. J̄h(x) ≥ 0, with equality if m_h ≤ h(x) ≤ m̄h.
2. J_h(x) ≥ 0, but J_h(x) does not necessarily attain its minimum value at either
x = m̄h or x = m_h.
-
Probabilistic subexponential behavior II:8-53
• Subexponential behavior: an = (1/n)e^{−2n} and bn = (1/√n)e^{−2n} have the same
exponent, but contain different subexponential terms.
-
Berry-Esseen theorem for compound i.i.d. sequence II:8-54
• The Berry-Esseen theorem states that the distribution of the sum of independent
zero-mean random variables {Xi}ni=1, normalized by the standard deviation
of the sum, differs from the Gaussian distribution by at most C·rn/s³n, where
s²n and rn are respectively the sums of the marginal variances and the marginal
absolute third moments, and C is an absolute constant.
• Specifically, for every a ∈ ℝ,

    |Pr{(1/sn)(X1 + · · · + Xn) ≤ a} − Φ(a)| ≤ C·rn/s³n,    (8.4.12)

where Φ(·) represents the unit Gaussian cdf.
• The striking feature of this theorem is that the upper bound depends only on
the variance and the absolute third moment, and hence can provide a good
asymptotic estimate based on only the first three moments.
• The absolute constant C is commonly taken to be 6. When {Xi}ni=1 are identically
distributed, in addition to independent, the absolute constant can be reduced to
3, and has been reported to be improved down to 2.05.
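The inequality (8.4.12) can be illustrated on an exactly computable case (a sketch; the Bernoulli parameter, the sample size, and the conservative choice C = 6 are my own assumptions, and the cdf discrepancy is evaluated only on the lattice points of the binomial sum):

```python
import math

q, n = 0.3, 200
mu, var = q, q * (1 - q)
sigma = math.sqrt(var)
rho = q * (1 - q) * ((1 - q)**2 + q**2)   # E|X - mu|^3 for Bernoulli(q)
bound = 6 * rho / (sigma**3 * math.sqrt(n))   # C * r_n / s_n^3 with C = 6

def Phi(a):
    return 0.5 * (1 + math.erf(a / math.sqrt(2)))

# exact cdf of the standardized sum (S_n - n*mu)/(sigma*sqrt(n))
pmf = [math.comb(n, k) * q**k * (1 - q)**(n - k) for k in range(n + 1)]
worst, cum = 0.0, 0.0
for k in range(n + 1):
    cum += pmf[k]
    a = (k - n * mu) / (sigma * math.sqrt(n))
    worst = max(worst, abs(cum - Phi(a)))
assert worst <= bound
print(worst, bound)
```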
Definition: compound i.i.d. sequence. The samples that concern us in this
section actually consist of two i.i.d. subsequences (the sequence is therefore named a
compound i.i.d. sequence).
-
Berry-Esseen theorem for compound i.i.d. sequence II:8-55
Lemma 8.26 (smoothing lemma) Fix the bandlimited filtering function

    vT(x) = [1 − cos(Tx)]/(πTx²) = 2 sin²(Tx/2)/(πTx²) = (T/2π) sinc²(Tx/(2π))
          = Four⁻¹[Λ(f/(T/(2π)))].

For any cumulative distribution function H(·) on the real line ℝ,

    sup_{x∈ℝ} |ΔT(x)| ≥ (1/2)η − [6/(Tπ√(2π))] h(T√(2π)/(2η)),

where

    ΔT(t) = ∫_{−∞}^{∞} [H(t − x) − Φ(t − x)] vT(x) dx,  η = sup_{x∈ℝ} |H(x) − Φ(x)|,

and

    h(u) = u ∫_u^∞ [1 − cos(x)]/x² dx
         = (π/2)u + 1 − cos(u) − u ∫_0^u [sin(x)/x] dx, if u ≥ 0;  0, otherwise.
-
Berry-Esseen theorem for compound i.i.d. sequence II:8-56
Lemma 8.27 For any cumulative distribution function H(·) with characteristic
function ϕH(ζ),

    η ≤ (1/π) ∫_{−T}^{T} |ϕH(ζ) − e^{−(1/2)ζ²}| dζ/|ζ| + [12/(Tπ√(2π))] h(T√(2π)/(2η)),

where η and h(·) are defined in Lemma 8.26.
where η and h(·) are defined in Lemma 8.26.
Theorem 8.28 (BE theorem for compound i.i.d. sequences) Let Yn =
Σ_{i=1}^n Xi be the sum of independent random variables, among which {Xi}di=1 are
identically Gaussian distributed, and {Xi}ni=d+1 are identically distributed but not
necessarily Gaussian.
• Denote the mean-variance pairs of X1 and Xd+1 by (µ, σ²) and (µ̂, σ̂²), respec-
tively.
• Define

    ρ = E[|X1 − µ|³],  ρ̂ = E[|Xd+1 − µ̂|³]  and  s²n = Var[Yn] = σ²d + σ̂²(n − d).

• Also denote the cdf of (Yn − E[Yn])/sn by Hn(·).
-
Berry-Esseen theorem for compound i.i.d. sequence II:8-57
Then for all y ∈ ℝ,

    |Hn(y) − Φ(y)| ≤ Cn,d · (2/√π) · [(n − d − 1)/(2(n − d) − 3√2)] · ρ̂/(σ̂²sn),

where Cn,d is the unique positive number satisfying

    (π/6)Cn,d − h(Cn,d) = [√π(2(n − d) − 3√2)]/[12(n − d − 1)]
                          · [√6/π² · (3 − √2)^{3/2} + (9/2)(11 − 6√2)/√(n − d)],

provided that n − d ≥ 3.
-
Berry-Esseen theorem for compound i.i.d. sequence II:8-58
[Figure: plot of the function (π/6)u − h(u) over 0 ≤ u ≤ 5.]
Function of (π/6)u − h(u).
-
Berry-Esseen theorem for compound i.i.d. sequence II:8-59
By letting d = 0, the Berry-Esseen inequality for i.i.d. sequences can also be
readily obtained from the previous theorem.
Corollary 8.29 (Berry-Esseen theorem for i.i.d. sequence) Let

    Yn = Σ_{i=1}^n Xi

be the sum of independent random variables with common marginal distribution.
Denote the marginal mean and variance by (µ̂, σ̂²). Define ρ̂ = E[|X1 − µ̂|³]. Also
denote the cdf of (Yn − nµ̂)/(√n σ̂) by Hn(·). Then for all y ∈ ℝ,

    |Hn(y) − Φ(y)| ≤ Cn · (2/√π) · [(n − 1)/(2n − 3√2)] · ρ̂/(σ̂³√n),

where Cn is the unique positive solution of

    (π/6)u − h(u) = [√π(2n − 3√2)]/[12(n − 1)] · [√6/π² · (3 − √2)^{3/2} + (9/2)(11 − 6√2)/√n],

provided that n ≥ 3.
-
Berry-Esseen theorem for compound i.i.d. sequence II:8-60
• Let us briefly remark on the previous corollary. We observe numerically
that the quantity

    Cn · (2/√π) · (n − 1)/(2n − 3√2)

is decreasing in n, and ranges from 3.628 down to 1.627 (cf. the figure on slide
II:8-62).
– We can upper-bound Cn by the unique positive solution Dn of

    (π/6)u − h(u) = (√π/6) · [√6/π² · (3 − √2)^{3/2} + (9/2)(11 − 6√2)/√n],

which is strictly decreasing in n. Hence,

    Cn · (2/√π) · (n − 1)/(2n − 3√2) ≤ En = Dn · (2/√π) · (n − 1)/(2n − 3√2),

and the right-hand side of the above inequality is strictly decreasing in n (since
both Dn and (n − 1)/(2n − 3√2) are decreasing), ranging from
E3 = 4.1911, . . . , E9 = 2.0363, . . . , E100 = 1.6833 down to E∞ = 1.6266. If the
property of strict decrease is preferred, one can use Dn instead of
Cn in the Berry-Esseen inequality. Note that both Cn and Dn converge to
2.8831 . . . as n goes to infinity.
-
Berry-Esseen theorem for compound i.i.d. sequence II:8-61
• Numerical results show that the quantity lies below 2 for n ≥ 9, and is smaller than
1.68 for n ≥ 100. In other words, we can upper-bound this quantity by 1.68 for
n ≥ 100, and thereby establish a better estimate of the original Berry-Esseen
constant.
-
Berry-Esseen theorem for compound i.i.d. sequence II:8-62
[Figure: the quantity Cn·(2/√π)·(n − 1)/(2n − 3√2) plotted against n for
n between 3 and 200, decreasing toward its limit and crossing 1.68 near n = 100.]
The Berry-Esseen constant as a function of the sample size n. The sample size n
is plotted in log-scale.
-
Generalized Neyman-Pearson Hypothesis Testing II:8-63
The general expression of the Neyman-Pearson type-II error exponent subject to
a constant bound on the type-I error has been proved for arbitrary observations.
In this section, we will state the results in terms of the ε-inf/sup-divergence rates.
Theorem 8.30 (Neyman-Pearson type-II error exponent for a fixed
test level) Consider a sequence of random observations which is assumed to have a
probability distribution governed by either PX (null hypothesis) or PX̂ (alternative
hypothesis). Then, the type-II error exponent satisfies

    lim_{δ↑ε} D̄δ(X‖X̂) ≤ lim sup_{n→∞} −(1/n) log β*n(ε) ≤ D̄ε(X‖X̂),
    lim_{δ↑ε} Dδ(X‖X̂) ≤ lim inf_{n→∞} −(1/n) log β*n(ε) ≤ Dε(X‖X̂),

where β*n(ε) represents the minimum type-II error probability subject to a fixed
type-I error bound ε ∈ [0, 1).
The general formula for Neyman-Pearson type-II error exponent subject to an
exponential test level has also been proved in terms of the ε-inf/sup-divergence
rates.
-
Generalized Neyman-Pearson Hypothesis Testing II:8-64
Theorem 8.31 (Neyman-Pearson type-II error exponent for an ex-
ponential test level) Fix s ∈ (0, 1) and ε ∈ [0, 1). It is possible to choose
decision regions for a binary hypothesis testing problem with arbitrary datawords
of blocklength n (which are governed by either the null hypothesis distribution PXn
or the alternative hypothesis distribution PX̂n) such that

    lim inf_{n→∞} −(1/n) log βn ≥ D̄ε(X̂(s)‖X̂)  and  lim sup_{n→∞} −(1/n) log αn ≥ D_{(1−ε)}(X̂(s)‖X),    (8.5.13)

or

    lim inf_{n→∞} −(1/n) log βn ≥ Dε(X̂(s)‖X̂)  and  lim sup_{n→∞} −(1/n) log αn ≥ D̄_{(1−ε)}(X̂(s)‖X),    (8.5.14)

where X̂(s) exhibits the tilted distributions {P(s)X̂n}∞n=1 defined by

    dP(s)X̂n(xn) = [1/Ωn(s)] exp{s log[dPXn/dPX̂n](xn)} dPX̂n(xn),

and

    Ωn(s) = ∫_{Xn} exp{s log[dPXn/dPX̂n](xn)} dPX̂n(xn).

Here, αn and βn are the type-I and type-II error probabilities, respectively.