
  • Chapter 8

    Hypothesis Testing

    Po-Ning Chen

    Department of Communications Engineering

    National Chiao-Tung University

    Hsin Chu, Taiwan 30050

  • Error exponent and divergence II:8-1

    Definition 8.1 (exponent) A real number a is said to be the exponent for a
    sequence of non-negative quantities $\{a_n\}_{n\ge 1}$ converging to zero if

    $$a = \lim_{n\to\infty}\left(-\frac{1}{n}\log a_n\right).$$

    • In operation, the exponent is an index of the exponential rate of convergence
      of the sequence $a_n$: for any $\gamma > 0$,

      $$e^{-n(a+\gamma)} \le a_n \le e^{-n(a-\gamma)}, \quad \text{for $n$ large enough}.$$

    • Recall that in proving the channel coding theorem, the probability of decoding
      error for channel block codes can be made arbitrarily close to zero when the
      rate of the codes is less than the channel capacity.

    • This result can be written mathematically as $P_e(\mathcal{C}^*_n) \to 0$ as $n \to \infty$,
      provided $R = \limsup_{n\to\infty}(1/n)\log\|\mathcal{C}^*_n\| < C$, where $\mathcal{C}^*_n$ is the
      optimal code of block length $n$.

    • From the theorem we only know that the decoding error vanishes as the block length
      increases; it does not reveal how fast the decoding error approaches zero.
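    As a quick numerical sanity check of Definition 8.1, here is a minimal sketch
    (assuming Python with NumPy; the sequence $a_n = n\,e^{-2n}$ is an illustrative
    choice, not one from the text):

```python
import numpy as np

# Numerically estimate the exponent of a vanishing sequence via -(1/n) log a_n.
# For a_n = n * exp(-2n) the exponent is 2; the polynomial factor n only
# contributes a vanishing correction.
for n in [10, 100, 1000, 10000]:
    log_a_n = np.log(n) - 2*n
    print(n, -log_a_n / n)     # approaches 2
```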

  • Error exponent and divergence II:8-2

    • In other words, we do not know the rate of convergence of the decoding error.
      Sometimes this information is very important, especially for deciding the block
      length sufficient to achieve some error bound.

    • The first step in investigating the rate of convergence of the decoding error is
      to compute its exponent, if the decoding error decays to zero exponentially fast
      (it indeed does for memoryless channels). This exponent, as a function of the
      rate, is in fact called the channel reliability function, and will be discussed in
      the next chapter.

    • For hypothesis testing problems, the type II error probability at a fixed test
      level also decays to zero as the number of observations increases. As it turns
      out, its exponent is the divergence of the null hypothesis distribution against
      the alternative hypothesis distribution.

  • Stein’s lemma II:8-3

    Lemma 8.2 (Stein's lemma) For a sequence of i.i.d. observations $X^n$ drawn from
    either the null hypothesis distribution $P_{X^n}$ or the alternative hypothesis
    distribution $P_{\hat X^n}$, the type II error satisfies

    $$(\forall\, \varepsilon \in (0,1))\quad \lim_{n\to\infty} -\frac{1}{n}\log \beta_n^*(\varepsilon) = D(P_X \| P_{\hat X}),$$

    where $\beta_n^*(\varepsilon) = \min_{\alpha_n \le \varepsilon} \beta_n$, and $\alpha_n$ and $\beta_n$ denote the type I and type II errors,
    respectively.

    Proof: [1. Forward Part]

    In the forward part, we prove that there exists an acceptance region for the null
    hypothesis such that

    $$\liminf_{n\to\infty} -\frac{1}{n}\log \beta_n(\varepsilon) \ge D(P_X \| P_{\hat X}).$$

    step 1: divergence typical set. For any $\delta > 0$, define the divergence typical
    set as

    $$A_n(\delta) \triangleq \left\{ x^n : \left| \frac{1}{n}\log\frac{P_{X^n}(x^n)}{P_{\hat X^n}(x^n)} - D(P_X\|P_{\hat X}) \right| < \delta \right\}.$$

    Note that within the divergence typical set,

    $$P_{\hat X^n}(x^n) \le P_{X^n}(x^n)\, e^{-n(D(P_X\|P_{\hat X}) - \delta)}.$$

  • Stein’s lemma II:8-4

    step 2: computation of type I error. By the weak law of large numbers,
    $P_{X^n}(A_n(\delta)) \to 1$. Hence,

    $$\alpha_n = P_{X^n}(A_n^c(\delta)) < \varepsilon, \quad \text{for sufficiently large } n.$$

    step 3: computation of type II error.

    $$\begin{aligned}
    \beta_n(\varepsilon) = P_{\hat X^n}(A_n(\delta))
      &= \sum_{x^n \in A_n(\delta)} P_{\hat X^n}(x^n) \\
      &\le \sum_{x^n \in A_n(\delta)} P_{X^n}(x^n)\, e^{-n(D(P_X\|P_{\hat X})-\delta)} \\
      &= e^{-n(D(P_X\|P_{\hat X})-\delta)} \sum_{x^n \in A_n(\delta)} P_{X^n}(x^n) \\
      &= e^{-n(D(P_X\|P_{\hat X})-\delta)} (1 - \alpha_n).
    \end{aligned}$$

    Hence,

    $$-\frac{1}{n}\log \beta_n(\varepsilon) \ge D(P_X\|P_{\hat X}) - \delta - \frac{1}{n}\log(1-\alpha_n),$$

  • Stein’s lemma II:8-5

    which implies

    $$\liminf_{n\to\infty} -\frac{1}{n}\log \beta_n(\varepsilon) \ge D(P_X\|P_{\hat X}) - \delta.$$

    The above inequality holds for every $\delta > 0$. Therefore

    $$\liminf_{n\to\infty} -\frac{1}{n}\log \beta_n(\varepsilon) \ge D(P_X\|P_{\hat X}).$$

    [2. Converse Part]

    In the converse part, we prove that for any acceptance region $B_n$ for the null
    hypothesis satisfying the type I error constraint, i.e.,

    $$\alpha_n(B_n) = P_{X^n}(B_n^c) \le \varepsilon,$$

    its type II error $\beta_n(B_n)$ satisfies

    $$\limsup_{n\to\infty} -\frac{1}{n}\log \beta_n(B_n) \le D(P_X\|P_{\hat X}).$$

  • Stein’s lemma II:8-6

    $$\begin{aligned}
    \beta_n(B_n) = P_{\hat X^n}(B_n) &\ge P_{\hat X^n}(B_n \cap A_n(\delta)) = \sum_{x^n \in B_n \cap A_n(\delta)} P_{\hat X^n}(x^n) \\
      &\ge \sum_{x^n \in B_n \cap A_n(\delta)} P_{X^n}(x^n)\, e^{-n(D(P_X\|P_{\hat X})+\delta)} \\
      &= e^{-n(D(P_X\|P_{\hat X})+\delta)}\, P_{X^n}(B_n \cap A_n(\delta)) \\
      &\ge e^{-n(D(P_X\|P_{\hat X})+\delta)} \left(1 - P_{X^n}(B_n^c) - P_{X^n}(A_n^c(\delta))\right) \\
      &= e^{-n(D(P_X\|P_{\hat X})+\delta)} \left(1 - \alpha_n(B_n) - P_{X^n}(A_n^c(\delta))\right) \\
      &\ge e^{-n(D(P_X\|P_{\hat X})+\delta)} \left(1 - \varepsilon - P_{X^n}(A_n^c(\delta))\right).
    \end{aligned}$$

    Hence,

    $$-\frac{1}{n}\log \beta_n(B_n) \le D(P_X\|P_{\hat X}) + \delta - \frac{1}{n}\log\left(1 - \varepsilon - P_{X^n}(A_n^c(\delta))\right),$$

    which implies that

    $$\limsup_{n\to\infty} -\frac{1}{n}\log \beta_n(B_n) \le D(P_X\|P_{\hat X}) + \delta.$$

    The above inequality holds for every $\delta > 0$. Therefore,

    $$\limsup_{n\to\infty} -\frac{1}{n}\log \beta_n(B_n) \le D(P_X\|P_{\hat X}).$$
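    Stein's lemma can be checked numerically in a simple binary setting. The sketch
    below (assuming Python with NumPy/SciPy; the Bernoulli parameters and test level
    are illustrative choices) uses the fact that for Bernoulli hypotheses the
    log-likelihood ratio is monotone in the number of ones, so a level-$\varepsilon$
    Neyman-Pearson test simply thresholds that count; the convergence of
    $-(1/n)\log\beta_n$ to $D(P_X\|P_{\hat X})$ is slow, since the threshold differs from $p$
    by $O(1/\sqrt{n})$:

```python
import numpy as np
from scipy.stats import binom

# Null: Bernoulli(p); alternative: Bernoulli(q).  Illustrative values, not from the text.
p, q, eps = 0.6, 0.3, 0.1
D = p*np.log(p/q) + (1-p)*np.log((1-p)/(1-q))   # D(P_X || P_Xhat) in nats

for n in [100, 500, 2000]:
    k = np.arange(n + 1)
    # The log-likelihood ratio log P_X^n / P_Xhat^n is increasing in the number of ones,
    # so the Neyman-Pearson acceptance region for the null is {k >= t}.
    # Pick the largest t with type-I error P_p(K < t) <= eps (no randomization).
    t = np.searchsorted(binom.cdf(k, n, p), eps, side='right')
    beta = binom.sf(t - 1, n, q)       # type-II error P_q(K >= t)
    print(n, -np.log(beta)/n, D)       # -(1/n) log beta_n approaches D(P_X||P_Xhat)
```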

  • Composition of sequence of i.i.d. observations II:8-7

    • Stein's lemma gives the exponent of the type II error probability for a fixed test
      level.

    • As a result, this exponent, which is the divergence of the null hypothesis
      distribution against the alternative hypothesis distribution, is independent of
      the type I error bound $\varepsilon$ for i.i.d. observations.

    • Specifically, under the i.i.d. assumption the probability of each sequence $x^n$
      depends only on its composition, defined as the $|\mathcal{X}|$-dimensional vector

      $$\left( \frac{\#_1(x^n)}{n}, \frac{\#_2(x^n)}{n}, \ldots, \frac{\#_k(x^n)}{n} \right),$$

      where $\mathcal{X} = \{1, 2, \ldots, k\}$ and $\#_i(x^n)$ is the number of occurrences of symbol
      $i$ in $x^n$.

    • The probability of $x^n$ can therefore be written as

      $$P_{X^n}(x^n) = P_X(1)^{\#_1(x^n)} \times P_X(2)^{\#_2(x^n)} \times \cdots \times P_X(k)^{\#_k(x^n)}.$$

    • Note that $\#_1(x^n) + \cdots + \#_k(x^n) = n$.

    • Since the composition of a sequence determines its probability, all sequences with
      the same composition have the same statistical properties, and hence should be
      treated identically when processing.

  • Composition of sequence of i.i.d. observations II:8-8

    • Instead of manipulating the sequences of observations through a typical-set-like
      concept, we may focus on their compositions.

    • As it turns out, such an approach yields simpler proofs and better geometrical
      explanations for theories in the i.i.d. setting.

    • (It needs to be pointed out that for cases where the composition alone cannot
      determine the probability, this viewpoint does not seem to be effective.)

    Lemma 8.3 (polynomial bound on number of compositions) The number
    of compositions increases polynomially fast, while the number of possible sequences
    increases exponentially fast.

    Proof:

    • Let $\mathcal{P}_n$ denote the set of all possible compositions.

    • Each composition is determined by the counts $(\#_1(x^n), \ldots, \#_{|\mathcal{X}|}(x^n))$, and each
      count takes a value in $\{0, 1, \ldots, n\}$; hence $|\mathcal{P}_n| \le (n+1)^{|\mathcal{X}|}$, whereas the
      number of sequences is $|\mathcal{X}|^n$.
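    A minimal numerical illustration of Lemma 8.3 (assuming Python; the alphabet size
    and block lengths are illustrative choices): the number of compositions is the
    multiset coefficient $\binom{n+|\mathcal{X}|-1}{|\mathcal{X}|-1}$, which stays below the polynomial bound
    $(n+1)^{|\mathcal{X}|}$ while $|\mathcal{X}|^n$ explodes.

```python
from math import comb

# Compare the number of compositions (types) with the polynomial bound (n+1)^|X|
# and with the exponential number of sequences |X|^n.
k = 4          # |X|, an illustrative alphabet size
for n in [5, 10, 20, 40]:
    num_compositions = comb(n + k - 1, k - 1)   # ways to split n into k counts
    print(n, num_compositions, (n + 1) ** k, k ** n)
```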

  • Composition of sequence of i.i.d. observations II:8-9

    Lemma 8.4 (probability of sequences of the same composition) The
    probability of the sequences of composition $C$ with respect to the distribution $P_{X^n}$
    satisfies

    $$\frac{1}{(n+1)^{|\mathcal{X}|}}\, e^{-n D(P_C\|P_X)} \le P_{X^n}(C) \le e^{-n D(P_C\|P_X)},$$

    where $P_C$ is the composition distribution for composition $C$, and $C$ (by abuse of
    notation, without ambiguity) is also used to represent the set of all sequences (in
    $\mathcal{X}^n$) of composition $C$.

    Theorem 8.5 (Sanov's theorem) Let $E_n$ be the set consisting of all compositions
    over the finite alphabet $\mathcal{X}$ whose composition distribution belongs to $\mathcal{P}$. Fix
    a sequence of product distributions $P_{X^n} = \prod_{i=1}^n P_X$. Then

    $$\liminf_{n\to\infty} -\frac{1}{n}\log P_{X^n}(E_n) \ge \inf_{P\in\mathcal{P}} D(P\|P_X).$$

    If, in addition, for every distribution $P$ in $\mathcal{P}$ there exists a sequence of composition
    distributions $P_{C_1}, P_{C_2}, P_{C_3}, \ldots \in \mathcal{P}$ such that $\limsup_{n\to\infty} D(P_{C_n}\|P_X) = D(P\|P_X)$, then

    $$\limsup_{n\to\infty} -\frac{1}{n}\log P_{X^n}(E_n) \le \inf_{P\in\mathcal{P}} D(P\|P_X).$$

  • Geometrical interpretation for Sanov’s theorem II:8-10

    [Figure: probability simplex containing the set $\mathcal{P}$ of distributions (with points $P_1$, $P_2$)
    and the source distribution $P_X$ outside it; the exponent in Sanov's theorem is the
    divergence from $P_X$ to the closest distribution in $\mathcal{P}$, i.e. $\min_{P\in\mathcal{P}} D(P\|P_X)$.]

    The geometric meaning of Sanov's theorem.

  • Geometrical interpretation for Sanov’s theorem II:8-11

    Example 8.6 • Question: One wants to roughly estimate the probability
    that the average of the throws is greater than or equal to 4 when tossing a fair die
    n times.

    • Observe that whether the requirement is satisfied depends only on the
      composition of the observations.

    • Let $E_n$ be the set of compositions which satisfy the requirement:

      $$E_n = \left\{ C : \sum_{i=1}^6 i\, P_C(i) \ge 4 \right\}.$$

    • To minimize $D(P_C\|P_X)$ over $C \in E_n$, we can use the Lagrange multiplier technique
      (since the divergence is convex in its first argument), with the constraints on $P_C$
      being

      $$\sum_{i=1}^6 i\, P_C(i) = k \quad \text{and} \quad \sum_{i=1}^6 P_C(i) = 1$$

      for each achievable value $k \ge 4$.

    • So the problem becomes minimizing

      $$\sum_{i=1}^6 P_C(i)\log\frac{P_C(i)}{P_X(i)} + \lambda_1\left(\sum_{i=1}^6 i\, P_C(i) - k\right) + \lambda_2\left(\sum_{i=1}^6 P_C(i) - 1\right).$$

  • Geometrical interpretation for Sanov’s theorem II:8-12

    • By taking the derivatives, we find that the minimizer should be of the form

      $$P_C(i) = \frac{e^{\lambda_1 \cdot i}}{\sum_{j=1}^6 e^{\lambda_1 \cdot j}},$$

      with $\lambda_1$ chosen to satisfy

      $$\sum_{i=1}^6 i\, P_C(i) = k. \tag{8.1.1}$$

    • Since the minimized divergence is increasing in $k$ for $k$ above the fair-die mean
      3.5, it suffices to take the smallest admissible value, i.e., $k = 4$.

    • Finally, by solving (8.1.1) for $k = 4$ numerically, the minimizer is

      $$P_{C^*} = (0.1031,\, 0.1227,\, 0.1461,\, 0.1740,\, 0.2072,\, 0.2468),$$

      and the exponent of the desired probability is $D(P_{C^*}\|P_X) = 0.0433$ nat.

    • Consequently, $P_{X^n}(E_n) \approx e^{-0.0433\, n}$ (a numerical check of this computation
      is sketched below).
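    A minimal numerical sketch (assuming Python with NumPy/SciPy) that solves (8.1.1)
    for k = 4 and reproduces the minimizer and exponent stated above:

```python
import numpy as np
from scipy.optimize import brentq

# Reproduce the minimizer in Example 8.6: P_C(i) proportional to exp(lambda1 * i)
# with mean 4, for a fair die P_X(i) = 1/6.
i = np.arange(1, 7)
p_x = np.full(6, 1/6)

def mean_of_tilt(lam):
    w = np.exp(lam * i)
    return (i * w).sum() / w.sum()

lam1 = brentq(lambda lam: mean_of_tilt(lam) - 4.0, 0.0, 2.0)  # solve (8.1.1) with k = 4
p_c = np.exp(lam1 * i) / np.exp(lam1 * i).sum()
exponent = (p_c * np.log(p_c / p_x)).sum()                    # D(P_C* || P_X) in nats

print(np.round(p_c, 4))      # approx [0.1031 0.1227 0.1461 0.1740 0.2072 0.2468]
print(round(exponent, 4))    # approx 0.0433
```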

  • Divergence typical set on composition II:8-13

    • Divergence typical set in Stein's lemma:

      $$A_n(\delta) \triangleq \left\{ x^n : \left| \frac{1}{n}\log\frac{P_{X^n}(x^n)}{P_{\hat X^n}(x^n)} - D(P_X\|P_{\hat X}) \right| < \delta \right\}.$$

    • Composition analogue:

      $$T_n(\delta) \triangleq \left\{ x^n \in \mathcal{X}^n : D(P_{C_{x^n}}\|P_X) \le \delta \right\},$$

      where $C_{x^n}$ represents the composition of $x^n$.

    • $P_{X^n}(T_n(\delta)) \to 1$ is justified by

      $$\begin{aligned}
      1 - P_{X^n}(T_n(\delta)) &= \sum_{\{C\,:\,D(P_C\|P_X)>\delta\}} P_{X^n}(C) \\
        &\le \sum_{\{C\,:\,D(P_C\|P_X)>\delta\}} e^{-n D(P_C\|P_X)}, \quad \text{from Lemma 8.4,} \\
        &\le \sum_{\{C\,:\,D(P_C\|P_X)>\delta\}} e^{-n\delta} \\
        &\le (n+1)^{|\mathcal{X}|} e^{-n\delta}, \quad \text{cf. Lemma 8.3.}
      \end{aligned}$$

  • Universal source coding on composition II:8-14

    • Universal code

      $$f_n : \mathcal{X}^n \to \bigcup_{i=1}^{\infty} \{0,1\}^i$$

      for i.i.d. sources:

      $$\frac{1}{n}\sum_{x^n\in\mathcal{X}^n} P_{X^n}(x^n)\,\ell(f_n(x^n)) \to H(X)$$

      as $n$ goes to infinity, where $\ell(\cdot)$ denotes codeword length.

    Example 8.7 (universal encoding using compositions)

    • Binary-index the compositions using $\log_2 (n+1)^{|\mathcal{X}|}$ bits, and denote this binary
      index for composition $C$ by $a(C)$.

      – Let $C_{x^n}$ denote the composition of $x^n$, i.e. $x^n \in C_{x^n}$.

    • Binary-index the elements in $C$ using $n \cdot H(P_C)$ bits, and denote this binary
      index for the elements of $C$ by $b(C_{x^n})$.

      – For each composition $C$, the number of sequences $x^n$ in $C$ is at most
        $2^{n\cdot H(P_C)}$. (Here, $H(P_C)$ is measured in bits, i.e., the logarithmic base of the
        entropy is 2; see the proof of Lemma 8.4.)

  • Universal source coding on composition II:8-15

    • Define a universal encoding function fn as

    fn(xn) = concatenation{a(Cxn), b(Cxn)}.

    • Then this encoding rule is a universal code for all i.i.d. sources.
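    A minimal sketch of the per-symbol length of this two-part code (assuming Python
    with NumPy; the Bernoulli(0.2) source is an illustrative choice, not one from the
    text), showing the rate approach H(X):

```python
import numpy as np
from math import log2, ceil

# Per-symbol length of the composition-based universal code: |X|*log2(n+1) bits for
# the composition index a(C) plus n*H(P_C) bits for the index b(C) within the class.
rng = np.random.default_rng(0)
p = 0.2                                     # i.i.d. Bernoulli(0.2) source
h_source = -(p*log2(p) + (1-p)*log2(1-p))   # H(X) in bits

for n in [100, 1000, 10000]:
    xn = rng.random(n) < p                  # sample x^n
    counts = np.bincount(xn.astype(int), minlength=2)
    p_c = counts / n                        # empirical composition P_C
    h_c = -sum(q*log2(q) for q in p_c if q > 0)
    length = ceil(2*log2(n + 1)) + ceil(n*h_c)
    print(n, length/n, h_source)            # per-symbol rate approaches H(X)
```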

  • Universal source coding on composition II:8-16

    Proof:

    $$\begin{aligned}
    \bar{\ell}_n &= \sum_{x^n\in\mathcal{X}^n} P_{X^n}(x^n)\,\ell(a(C_{x^n})) + \sum_{x^n\in\mathcal{X}^n} P_{X^n}(x^n)\,\ell(b(C_{x^n})) \\
      &\le \sum_{x^n\in\mathcal{X}^n} P_{X^n}(x^n)\cdot\log_2(n+1)^{|\mathcal{X}|} + \sum_{x^n\in\mathcal{X}^n} P_{X^n}(x^n)\cdot n\cdot H(P_{C_{x^n}}) \\
      &\le |\mathcal{X}|\cdot\log_2(n+1) + \sum_{\{C\}} P_{X^n}(C)\cdot n\cdot H(P_C).
    \end{aligned}$$

    Hence,

    $$\frac{1}{n}\bar{\ell}_n \le \frac{|\mathcal{X}|\log_2(n+1)}{n} + \sum_{\{C\}} P_{X^n}(C)\, H(P_C).$$

  • Universal source coding on composition II:8-17

    $$\begin{aligned}
    \sum_{\{C\}} & P_{X^n}(C)\, H(P_C) \\
      &= \sum_{\{C\in T_n(\delta)\}} P_{X^n}(C)H(P_C) + \sum_{\{C\notin T_n(\delta)\}} P_{X^n}(C)H(P_C) \\
      &\le \max_{\{C\,:\,D(P_C\|P_X)\le\delta/\log 2\}} H(P_C) + \sum_{\{C\,:\,D(P_C\|P_X)>\delta/\log 2\}} P_{X^n}(C)H(P_C) \\
      &\le \max_{\{C\,:\,D(P_C\|P_X)\le\delta/\log 2\}} H(P_C) + \sum_{\{C\,:\,D(P_C\|P_X)>\delta/\log 2\}} 2^{-nD(P_C\|P_X)}H(P_C), \quad\text{(from Lemma 8.4)} \\
      &\le \max_{\{C\,:\,D(P_C\|P_X)\le\delta/\log 2\}} H(P_C) + \sum_{\{C\,:\,D(P_C\|P_X)>\delta/\log 2\}} e^{-n\delta}H(P_C) \\
      &\le \max_{\{C\,:\,D(P_C\|P_X)\le\delta/\log 2\}} H(P_C) + (n+1)^{|\mathcal{X}|} e^{-n\delta}\log_2|\mathcal{X}|,
    \end{aligned}$$

    where the second term of the last step vanishes as $n \to \infty$.

  • Universal source coding on composition II:8-18

    (Note that when the base-2 logarithm is taken in the divergence instead of the
    natural logarithm, the range $[0,\delta]$ in $T_n(\delta)$ should be replaced by $[0, \delta/\log 2]$.)
    It remains to show that

    $$\max_{\{C\,:\,D(P_C\|P_X)\le\delta/\log 2\}} H(P_C) \le H(X) + \gamma(\delta),$$

    where $\gamma(\delta)$ depends only on $\delta$ and approaches zero as $\delta \to 0$. …

  • Likelihood ratio versus divergence II:8-19

    • Recall that the Neyman-Pearson lemma indicates that the optimal test for two
      hypotheses is of the form

      $$\frac{P_{X^n}(x^n)}{P_{\hat X^n}(x^n)} \gtrless \tau. \tag{8.1.2}$$

    • This is the likelihood ratio test, and the quantity $P_{X^n}(x^n)/P_{\hat X^n}(x^n)$ is called
      the likelihood ratio.

    • If a log operation is performed on both sides of (8.1.2), the test remains equivalent.

  • Likelihood ratio versus divergence II:8-20

    $$\begin{aligned}
    \log\frac{P_{X^n}(x^n)}{P_{\hat X^n}(x^n)}
      &= \sum_{i=1}^n \log\frac{P_X(x_i)}{P_{\hat X}(x_i)} \\
      &= \sum_{a\in\mathcal{X}} [\#_a(x^n)] \log\frac{P_X(a)}{P_{\hat X}(a)} \\
      &= \sum_{a\in\mathcal{X}} [n P_{C_{x^n}}(a)] \log\frac{P_X(a)}{P_{\hat X}(a)} \\
      &= n\cdot\sum_{a\in\mathcal{X}} P_{C_{x^n}}(a) \log\left(\frac{P_X(a)}{P_{C_{x^n}}(a)}\cdot\frac{P_{C_{x^n}}(a)}{P_{\hat X}(a)}\right) \\
      &= n\left[\sum_{a\in\mathcal{X}} P_{C_{x^n}}(a)\log\frac{P_{C_{x^n}}(a)}{P_{\hat X}(a)} - \sum_{a\in\mathcal{X}} P_{C_{x^n}}(a)\log\frac{P_{C_{x^n}}(a)}{P_X(a)}\right] \\
      &= n\left[ D(P_{C_{x^n}}\|P_{\hat X}) - D(P_{C_{x^n}}\|P_X) \right].
    \end{aligned}$$

    Hence, (8.1.2) is equivalent to

    $$D(P_{C_{x^n}}\|P_{\hat X}) - D(P_{C_{x^n}}\|P_X) \gtrless \frac{1}{n}\log\tau. \tag{8.1.3}$$

    • This equivalence means that for hypothesis testing, the selection of the acceptance
      region can be made based on compositions instead of individual observation
      sequences.

  • Likelihood ratio versus divergence II:8-21

    • In other words, the optimal decision function can be defined as

      $$\phi(C) = \begin{cases}
        0, & \text{if composition $C$ is classified as belonging to the null hypothesis according to (8.1.3);}\\
        1, & \text{otherwise.}
      \end{cases}$$
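    A minimal check (assuming Python with NumPy; the alphabet, distributions, and $\tau$
    are illustrative choices) that the composition form (8.1.3) reproduces the
    log-likelihood-ratio statistic of (8.1.2):

```python
import numpy as np

# Verify numerically that the composition form (8.1.3) of the test agrees with the
# plain log-likelihood-ratio test (8.1.2).
rng = np.random.default_rng(1)
alphabet = np.arange(4)
p_x    = np.array([0.4, 0.3, 0.2, 0.1])   # null
p_xhat = np.array([0.1, 0.2, 0.3, 0.4])   # alternative
n, tau = 200, 1.0

xn = rng.choice(alphabet, size=n, p=p_x)
counts = np.bincount(xn, minlength=4)
p_c = counts / n                          # composition of the observed sequence

def div(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

llr = np.sum(counts * np.log(p_x / p_xhat))            # log of (8.1.2)
lhs = div(p_c, p_xhat) - div(p_c, p_x)                 # left side of (8.1.3)
print(np.isclose(llr / n, lhs), llr / n, np.log(tau) / n)
```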

  • Exponent of Bayesian cost II:8-22

    • Randomization is of no help to the Bayesian test: the randomized rule

      $$\phi(x^n) = \begin{cases} 0, & \text{with probability } \eta;\\ 1, & \text{with probability } 1-\eta;\end{cases}$$

      satisfies

      $$\pi_0\,\eta\,P_{X^n}(x^n) + \pi_1(1-\eta)P_{\hat X^n}(x^n) \ge \min\{\pi_0 P_{X^n}(x^n),\; \pi_1 P_{\hat X^n}(x^n)\}.$$

    • Now suppose the acceptance region for the null hypothesis is

      $$A \triangleq \{C : D(P_C\|P_{\hat X}) - D(P_C\|P_X) > \tau'\}.$$

    • Then by Sanov's theorem, the exponent of the type II error $\beta_n$ is

      $$\min_{C\in A} D(P_C\|P_{\hat X}).$$

    • Similarly, the exponent of the type I error $\alpha_n$ is

      $$\min_{C\in A^c} D(P_C\|P_X).$$

  • Exponent of Bayesian cost II:8-23

    • Lagrange multiplier: taking the derivative of

      $$D(P_{\tilde X}\|P_{\hat X}) + \lambda\left(D(P_{\tilde X}\|P_{\hat X}) - D(P_{\tilde X}\|P_X) - \tau'\right) + \nu\left(\sum_{x\in\mathcal{X}} P_{\tilde X}(x) - 1\right)$$

      with respect to each $P_{\tilde X}(x)$, we have

      $$\log\frac{P_{\tilde X}(x)}{P_{\hat X}(x)} + 1 + \lambda\log\frac{P_X(x)}{P_{\hat X}(x)} + \nu = 0.$$

      Solving these equations, we obtain that the optimal $P_{\tilde X}$ is of the form

      $$P_{\tilde X}(x) = P_\lambda(x) \triangleq \frac{P_X^\lambda(x)\, P_{\hat X}^{1-\lambda}(x)}{\sum_{a\in\mathcal{X}} P_X^\lambda(a)\, P_{\hat X}^{1-\lambda}(a)}.$$

    • The geometrical explanation for $P_\lambda$ is that it lies on the "straight line" between
      $P_X$ and $P_{\hat X}$ (in the sense of the divergence measure) over the probability space.

  • Exponent of Bayesian cost II:8-24

    [Figure: the segment of tilted distributions between $P_X$ and $P_{\hat X}$ in the probability
    space, with $P_{\tilde X}$ on it; $D(P_{\tilde X}\|P_X)$ and $D(P_{\tilde X}\|P_{\hat X})$ measure its "distance" to the two
    endpoints, and the acceptance boundary is the set $D(P_C\|P_X) = D(P_C\|P_{\hat X}) - \tau'$.]

    The divergence view of hypothesis testing.

  • Exponent of Bayesian cost II:8-25

    • When $\lambda \to 0$, $P_\lambda \to P_{\hat X}$; when $\lambda \to 1$, $P_\lambda \to P_X$.

    • $P_\lambda$ is usually called the tilted or twisted distribution.

    • The value of $\lambda$ depends on $\tau' = (1/n)\log\tau$.

    • It is known from detection theory that the best $\tau$ for Bayes testing is $\pi_1/\pi_0$,
      which is fixed. Therefore,

      $$\tau' = \lim_{n\to\infty}\frac{1}{n}\log\frac{\pi_1}{\pi_0} = 0,$$

      which implies that the optimal exponent for the Bayes error is the minimum of
      $D(P_\lambda\|P_X)$ subject to $D(P_\lambda\|P_X) = D(P_\lambda\|P_{\hat X})$, namely the point of the segment
      $(P_X, P_{\hat X})$ that is equidistant (in divergence) from the two endpoints, i.e. its
      mid-point in the divergence sense. This quantity is called the Chernoff bound.
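    A minimal sketch (assuming Python with NumPy/SciPy; the two distributions are
    illustrative choices) that locates the tilted distribution equidistant in
    divergence from $P_X$ and $P_{\hat X}$, i.e. the Chernoff point discussed above:

```python
import numpy as np
from scipy.optimize import brentq

# Find the tilted distribution P_lambda with D(P_lambda||P_X) = D(P_lambda||P_Xhat).
p_x    = np.array([0.7, 0.2, 0.1])
p_xhat = np.array([0.1, 0.3, 0.6])

def tilt(lam):
    w = p_x**lam * p_xhat**(1 - lam)
    return w / w.sum()

def div(p, q):
    return np.sum(p * np.log(p / q))

def gap(lam):
    p = tilt(lam)
    return div(p, p_x) - div(p, p_xhat)

lam_star = brentq(gap, 1e-6, 1 - 1e-6)   # equidistant point on the tilted segment
p_star = tilt(lam_star)
print(lam_star, div(p_star, p_x))        # Chernoff exponent D(P_lambda* || P_X)
```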

  • Large deviations theory II:8-26

    • The large deviations theory basically considers techniques for computing the
      exponent of an exponentially decaying probability.

  • Tilted or twisted distribution II:8-27

    • Suppose the probability $P_{X^n}(A_n)$ of a set $A_n$ decays to zero exponentially fast,
      and its exponent equals $a > 0$.

    • Over the probability space, let $\mathcal{P}$ denote the set of those distributions $P_{\tilde X}$ for
      which $P_{\tilde X^n}(A_n)$ exhibits zero exponent.

    • Then, applying the same concept as in Sanov's theorem, we can expect that

      $$a = \min_{P_{\tilde X}\in\mathcal{P}} D(P_{\tilde X}\|P_X).$$

    • Now suppose the minimizer of the above problem occurs on a constraint surface
      $f(P_{\tilde X}) = \tau$ for some constant $\tau$ and some differentiable function $f(\cdot)$; then the
      minimizer should be of the form

      $$(\forall\, a\in\mathcal{X})\quad P_{\tilde X}(a) = \frac{P_X(a)\, e^{\lambda\, \partial f(P_{\tilde X})/\partial P_{\tilde X}(a)}}{\displaystyle\sum_{a'\in\mathcal{X}} P_X(a')\, e^{\lambda\, \partial f(P_{\tilde X})/\partial P_{\tilde X}(a')}}.$$

      As a result, $P_{\tilde X}$ is the distribution obtained from $P_X$ by exponential twisting via
      the partial derivative of the function $f$.

    • Note that $P_{\tilde X}$ is usually written as $P_X^{(\lambda)}$, since it is generated by twisting $P_X$
      with twisting factor $\lambda$.

  • Conventional twisted distribution II:8-28

    • The conventional definition of the twisted distribution is based on the divergence
      function, i.e., $f(P_{\tilde X}) = D(P_{\tilde X}\|P_{\hat X}) - D(P_{\tilde X}\|P_X)$.

    • Since

      $$\frac{\partial D(P_{\tilde X}\|P_X)}{\partial P_{\tilde X}(a)} = \log\frac{P_{\tilde X}(a)}{P_X(a)} + 1,$$

      the twisted distribution becomes

      $$(\forall\, a\in\mathcal{X})\quad P_{\tilde X}(a)
        = \frac{P_X(a)\, e^{\lambda\log\frac{P_{\hat X}(a)}{P_X(a)}}}{\displaystyle\sum_{a'\in\mathcal{X}} P_X(a')\, e^{\lambda\log\frac{P_{\hat X}(a')}{P_X(a')}}}
        = \frac{P_X^{1-\lambda}(a)\, P_{\hat X}^{\lambda}(a)}{\displaystyle\sum_{a'\in\mathcal{X}} P_X^{1-\lambda}(a')\, P_{\hat X}^{\lambda}(a')}.$$

  • Cramer’s theorem II:8-29

    • Question: Consider a sequence of i.i.d. random variables $X^n$, and suppose that
      we are interested in the probability of the set

      $$\left\{ \frac{X_1 + \cdots + X_n}{n} > \tau \right\}.$$

    • Observe that $(X_1 + \cdots + X_n)/n$ can be rewritten as

      $$\sum_{a\in\mathcal{X}} a\cdot P_C(a).$$

    • Therefore, the function $f$ becomes

      $$f(P_{\tilde X}) = \sum_{a\in\mathcal{X}} a\, P_{\tilde X}(a),$$

      and its partial derivative with respect to $P_{\tilde X}(a)$ is $a$.

    • The resulting twisted distribution is

      $$(\forall\, a\in\mathcal{X})\quad P_X^{(\lambda)}(a) = \frac{P_X(a)\, e^{\lambda a}}{\displaystyle\sum_{a'\in\mathcal{X}} P_X(a')\, e^{\lambda a'}}.$$

    • So the exponent of $P_{X^n}\{(X_1+\cdots+X_n)/n > \tau\}$ is

      $$\min_{\left\{P_{\tilde X}\,:\,\sum_a a P_{\tilde X}(a) \ge \tau\right\}} D(P_{\tilde X}\|P_X)
        = \min_{\left\{P_X^{(\lambda)}\,:\,\sum_a a P_X^{(\lambda)}(a) \ge \tau\right\}} D(P_X^{(\lambda)}\|P_X).$$

  • Cramer’s theorem II:8-30

    • It should be pointed out that $\sum_{a'\in\mathcal{X}} P_X(a')\,e^{\lambda a'}$ is the moment generating
      function of $P_X$.

    • The conventional Cramer result does not use the divergence. Instead, it introduces
      the large deviation rate function, defined by

      $$I_X(x) \triangleq \sup_{\theta\in\mathbb{R}}\,[\theta x - \log M_X(\theta)], \tag{8.2.4}$$

      where $M_X(\theta)$ is the moment generating function of $X$.

    • In this formulation, the exponent of the above probability is respectively lower-
      and upper-bounded by

      $$\inf_{x\ge\tau} I_X(x) \quad\text{and}\quad \inf_{x>\tau} I_X(x).$$

    An example of how to obtain the exponent bounds is given in the next subsection.
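    A minimal sketch (assuming Python with SciPy) that evaluates the rate function
    (8.2.4) for the fair-die example at x = 4; it recovers the same exponent 0.0433 nat
    as the divergence computation in Example 8.6, and the maximizing $\theta$ coincides with
    the $\lambda_1$ found there:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Evaluate I_X(x) = sup_theta [theta*x - log M_X(theta)] for a fair die at x = 4.
i = np.arange(1, 7)

def neg_objective(theta, x=4.0):
    log_mgf = np.log(np.mean(np.exp(theta * i)))   # log M_X(theta) for the fair die
    return -(theta * x - log_mgf)

res = minimize_scalar(neg_objective, bounds=(-5, 5), method='bounded')
print(-res.fun)   # I_X(4) approx 0.0433 nat
```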

  • Exponent and moment generating function II:8-31

    A) Preliminaries: Observe that since $E[X] = \mu < \lambda$ and $E[|X-\mu|^2] < \infty$,

    $$\Pr\left\{\frac{X_1+\cdots+X_n}{n} \ge \lambda\right\} \to 0 \quad\text{as } n\to\infty.$$

    Hence, we can compute its rate of convergence (to zero).

    B) Upper bound of the probability:

    $$\begin{aligned}
    \Pr\left\{\frac{X_1+\cdots+X_n}{n} \ge \lambda\right\}
      &= \Pr\{\theta(X_1+\cdots+X_n) \ge \theta n\lambda\}, \quad\text{for any } \theta > 0 \\
      &= \Pr\{\exp(\theta(X_1+\cdots+X_n)) \ge \exp(\theta n\lambda)\} \\
      &\le \frac{E[\exp(\theta(X_1+\cdots+X_n))]}{\exp(\theta n\lambda)} \\
      &= \frac{E^n[\exp(\theta X)]}{\exp(\theta n\lambda)} = \left(\frac{M_X(\theta)}{\exp(\theta\lambda)}\right)^n.
    \end{aligned}$$

    Hence,

    $$\liminf_{n\to\infty} -\frac{1}{n}\log\Pr\left\{\frac{X_1+\cdots+X_n}{n} \ge \lambda\right\} \ge \theta\lambda - \log M_X(\theta).$$

  • Exponent and moment generating function II:8-32

    Since the above inequality holds for every $\theta > 0$, we have

    $$\liminf_{n\to\infty} -\frac{1}{n}\log\Pr\left\{\frac{X_1+\cdots+X_n}{n} \ge \lambda\right\}
      \ge \max_{\theta>0}\,[\theta\lambda - \log M_X(\theta)] = \theta^*\lambda - \log M_X(\theta^*),$$

    where $\theta^* > 0$ is the optimizer of the maximum operation. (The positivity of $\theta^*$ can
    be verified from the concavity of the function $\theta\lambda - \log M_X(\theta)$ in $\theta$ and the fact that
    its derivative at $\theta = 0$ equals $\lambda - \mu$, which is strictly greater than 0.) Consequently,

    $$\liminf_{n\to\infty} -\frac{1}{n}\log\Pr\left\{\frac{X_1+\cdots+X_n}{n} \ge \lambda\right\}
      \ge \theta^*\lambda - \log M_X(\theta^*) = \sup_{\theta\in\mathbb{R}}\,[\theta\lambda - \log M_X(\theta)] = I_X(\lambda).$$

    C) Lower bound of the probability: omitted.

  • Theories on Large deviations II:8-33

    In this section, we derive inequalities on the exponent of the probability
    $\Pr\{Z_n/n \in [a, b]\}$; they form a slight extension of the Gärtner-Ellis theorem.

  • Extension of Gärtner-Ellis upper bounds II:8-34

    Definition 8.8 In this subsection, $\{Z_n\}_{n=1}^{\infty}$ denotes an infinite sequence of
    arbitrary random variables.

    Definition 8.9 Define

    $$\varphi_n(\theta) \triangleq \frac{1}{n}\log E\left[\exp\{\theta Z_n\}\right] \quad\text{and}\quad \bar\varphi(\theta) \triangleq \limsup_{n\to\infty}\varphi_n(\theta).$$

    The sup-large deviation rate function of an arbitrary random sequence $\{Z_n\}_{n=1}^{\infty}$
    is defined as

    $$\bar I(x) \triangleq \sup_{\{\theta\in\mathbb{R}\,:\,\bar\varphi(\theta)>-\infty\}} \left[\theta x - \bar\varphi(\theta)\right]. \tag{8.3.5}$$

    The range of the supremum operation in (8.3.5) is always non-empty since $\bar\varphi(0) = 0$,
    i.e. $\{\theta\in\mathbb{R} : \bar\varphi(\theta) > -\infty\} \neq \emptyset$. Hence, $\bar I(x)$ is always defined. With the
    above definition, the first extension theorem of Gärtner-Ellis can be stated as follows.

    Theorem 8.10 For $a, b \in \mathbb{R}$ and $a \le b$,

    $$\limsup_{n\to\infty} \frac{1}{n}\log\Pr\left\{\frac{Z_n}{n}\in[a,b]\right\} \le -\inf_{x\in[a,b]}\bar I(x).$$

    • The bound obtained in the above theorem is not in general tight.

  • Extension of Gärtner-Ellis upper bounds II:8-35

    Example 8.11 Suppose that $\Pr\{Z_n = 0\} = 1 - e^{-2n}$ and $\Pr\{Z_n = -2n\} = e^{-2n}$.
    Then from Definition 8.9, we have

    $$\varphi_n(\theta) \triangleq \frac{1}{n}\log E\left[e^{\theta Z_n}\right] = \frac{1}{n}\log\left[1 - e^{-2n} + e^{-(\theta+1)\cdot 2n}\right],$$

    and

    $$\bar\varphi(\theta) \triangleq \limsup_{n\to\infty}\varphi_n(\theta) = \begin{cases} 0, & \text{for } \theta \ge -1;\\ -2(\theta+1), & \text{for } \theta < -1.\end{cases}$$

    Hence, $\{\theta\in\mathbb{R} : \bar\varphi(\theta) > -\infty\} = \mathbb{R}$ and

    $$\bar I(x) = \sup_{\theta\in\mathbb{R}}\left[\theta x - \bar\varphi(\theta)\right]
      = \sup_{\theta\in\mathbb{R}}\left[\theta x + 2(\theta+1)\mathbf{1}\{\theta < -1\}\right]
      = \begin{cases} -x, & \text{for } -2 \le x \le 0;\\ \infty, & \text{otherwise,}\end{cases}$$

    where $\mathbf{1}\{\cdot\}$ represents the indicator function of a set.

  • Extension of Gärtner-Ellis upper bounds II:8-36

    Consequently, by Theorem 8.10,

    $$\limsup_{n\to\infty}\frac{1}{n}\log\Pr\left\{\frac{Z_n}{n}\in[a,b]\right\} \le -\inf_{x\in[a,b]}\bar I(x)
      = \begin{cases} 0, & \text{for } 0\in[a,b];\\ b, & \text{for } b\in[-2,0);\\ -\infty, & \text{otherwise.}\end{cases}$$

    The exponent of $\Pr\{Z_n/n\in[a,b]\}$ in the above example is indeed given by

    $$\lim_{n\to\infty}\frac{1}{n}\log\Pr\left\{\frac{Z_n}{n}\in[a,b]\right\} = -\inf_{x\in[a,b]} I^*(x),$$

    where

    $$I^*(x) = \begin{cases} 2, & \text{for } x = -2;\\ 0, & \text{for } x = 0;\\ \infty, & \text{otherwise.}\end{cases} \tag{8.3.6}$$

    Thus, the upper bound obtained in Theorem 8.10 is not tight.
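    A small numerical check (assuming Python with NumPy; the interval [a, b] is an
    illustrative choice) of the gap between the Theorem 8.10 bound and the true
    exponent (8.3.6) for this example:

```python
import numpy as np

# Evaluate the sup-rate function (8.3.5) on a theta grid, using the closed form of
# phi_bar derived above, and compare the Theorem 8.10 bound with the true exponent.
theta = np.linspace(-50, 50, 20001)
phi_bar = np.where(theta >= -1, 0.0, -2.0*(theta + 1.0))

def I_bar(x):
    return np.max(theta*x - phi_bar)        # sup over the (truncated) theta grid

a, b = -1.5, -0.5
xs = np.linspace(a, b, 201)
bound = -min(I_bar(x) for x in xs)          # Theorem 8.10 upper bound on the exponent
true = -2.0 if a <= -2 <= b else (0.0 if a <= 0 <= b else -np.inf)   # -inf I*(x)
print(bound, true)                          # bound = b = -0.5; true exponent = -inf
```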

  • Extension of Gärtner-Ellis upper bounds II:8-37

    Definition 8.12 Define

    $$\varphi_n(\theta; h) \triangleq \frac{1}{n}\log E\left[\exp\left\{n\,\theta\, h\!\left(\frac{Z_n}{n}\right)\right\}\right]
      \quad\text{and}\quad \bar\varphi_h(\theta) \triangleq \limsup_{n\to\infty}\varphi_n(\theta; h),$$

    where $h(\cdot)$ is a given real-valued continuous function. The twisted sup-large-deviation
    rate function of an arbitrary random sequence $\{Z_n\}_{n=1}^{\infty}$ with respect to a real-valued
    continuous function $h(\cdot)$ is defined as

    $$\bar J_h(x) \triangleq \sup_{\{\theta\in\mathbb{R}\,:\,\bar\varphi_h(\theta)>-\infty\}}\left[\theta\, h(x) - \bar\varphi_h(\theta)\right]. \tag{8.3.7}$$

    Theorem 8.13 Suppose that $h(\cdot)$ is a real-valued continuous function. Then for
    $a, b\in\mathbb{R}$ and $a \le b$,

    $$\limsup_{n\to\infty}\frac{1}{n}\log\Pr\left\{\frac{Z_n}{n}\in[a,b]\right\} \le -\inf_{x\in[a,b]}\bar J_h(x).$$

  • Extension of Gärtner-Ellis upper bounds II:8-38

    Example 8.14 Let us again investigate the $\{Z_n\}_{n=1}^{\infty}$ defined in Example 8.11.
    Take

    $$h(x) = \frac{1}{2}(x+2)^2 - 1.$$

    Then from Definition 8.12, we have

    $$\varphi_n(\theta; h) \triangleq \frac{1}{n}\log E\left[\exp\{n\theta h(Z_n/n)\}\right]
      = \frac{1}{n}\log\left[\exp\{n\theta\} - \exp\{n(\theta-2)\} + \exp\{-n(\theta+2)\}\right],$$

    and

    $$\bar\varphi_h(\theta) \triangleq \limsup_{n\to\infty}\varphi_n(\theta; h)
      = \begin{cases} -(\theta+2), & \text{for } \theta \le -1;\\ \theta, & \text{for } \theta > -1.\end{cases}$$

    Hence, $\{\theta\in\mathbb{R} : \bar\varphi_h(\theta) > -\infty\} = \mathbb{R}$ and

    $$\bar J_h(x) \triangleq \sup_{\theta\in\mathbb{R}}\left[\theta h(x) - \bar\varphi_h(\theta)\right]
      = \begin{cases} -\dfrac{1}{2}(x+2)^2 + 2, & \text{for } x\in[-4,0];\\ \infty, & \text{otherwise.}\end{cases}$$

  • Extension of Gärtner-Ellis upper bounds II:8-39

    Consequently, by Theorem 8.13,

    $$\limsup_{n\to\infty}\frac{1}{n}\log\Pr\left\{\frac{Z_n}{n}\in[a,b]\right\} \le -\inf_{x\in[a,b]}\bar J_h(x)
      = \begin{cases}
        -\min\left\{-\dfrac{(a+2)^2}{2},\,-\dfrac{(b+2)^2}{2}\right\} - 2, & \text{for } -4\le a < b \le 0;\\[1ex]
        -\infty, & \text{for } a > 0 \text{ or } b < -4;\\[0.5ex]
        0, & \text{otherwise.}
      \end{cases} \tag{8.3.8}$$

    For $b\in(-2,0)$ and $a\in\left[-2-\sqrt{2b+4},\, b\right)$, the upper bound attained in the
    previous example is strictly less than that given in Example 8.11, and hence an
    improvement is obtained. However, for $b\in(-2,0)$ and $a < -2-\sqrt{2b+4}$, the upper
    bound in (8.3.8) is actually looser. Accordingly, we combine the two upper bounds
    from Examples 8.11 and 8.14 to get

    $$\limsup_{n\to\infty}\frac{1}{n}\log\Pr\left\{\frac{Z_n}{n}\in[a,b]\right\}
      \le -\max\left\{\inf_{x\in[a,b]}\bar J_h(x),\ \inf_{x\in[a,b]}\bar I(x)\right\}
      = \begin{cases} 0, & \text{for } 0\in[a,b];\\[0.5ex] \dfrac{1}{2}(b+2)^2 - 2, & \text{for } b\in[-2,0);\\[0.5ex] -\infty, & \text{otherwise.}\end{cases}$$

  • Extension of Gärtner-Ellis upper bounds II:8-40

    Theorem 8.15 For $a, b\in\mathbb{R}$ and $a \le b$,

    $$\limsup_{n\to\infty}\frac{1}{n}\log\Pr\left\{\frac{Z_n}{n}\in[a,b]\right\} \le -\inf_{x\in[a,b]}\bar J(x),$$

    where $\bar J(x) \triangleq \sup_{h\in\mathcal{H}}\bar J_h(x)$ and $\mathcal{H}$ is the set of all real-valued continuous functions.

    Example 8.16 Let us again study the $\{Z_n\}_{n=1}^{\infty}$ of Example 8.11 (also used in
    Example 8.14). Suppose $c > 1$. Take $h_c(x) = c_1(x + c_2)^2 - c$, where

    $$c_1 \triangleq \frac{c + \sqrt{c^2-1}}{2} \quad\text{and}\quad c_2 \triangleq \frac{2\sqrt{c+1}}{\sqrt{c+1} + \sqrt{c-1}}.$$

    Then from Definition 8.12, we have

    $$\varphi_n(\theta; h_c) \triangleq \frac{1}{n}\log E\left[\exp\left\{n\theta h_c\!\left(\frac{Z_n}{n}\right)\right\}\right]
      = \frac{1}{n}\log\left[\exp\{n\theta\} - \exp\{n(\theta-2)\} + \exp\{-n(\theta+2)\}\right],$$

    and

    $$\bar\varphi_{h_c}(\theta) \triangleq \limsup_{n\to\infty}\varphi_n(\theta; h_c)
      = \begin{cases} -(\theta+2), & \text{for } \theta \le -1;\\ \theta, & \text{for } \theta > -1.\end{cases}$$

  • Extension of Gärtner-Ellis upper bounds II:8-41

    Hence, $\{\theta\in\mathbb{R} : \bar\varphi_{h_c}(\theta) > -\infty\} = \mathbb{R}$ and

    $$\bar J_{h_c}(x) = \sup_{\theta\in\mathbb{R}}\left[\theta h_c(x) - \bar\varphi_{h_c}(\theta)\right]
      = \begin{cases} -c_1(x+c_2)^2 + c + 1, & \text{for } x\in[-2c_2, 0];\\ \infty, & \text{otherwise.}\end{cases}$$

    From Theorem 8.15,

    $$\bar J(x) = \sup_{h\in\mathcal{H}}\bar J_h(x) \ge \max\left\{\liminf_{c\to\infty}\bar J_{h_c}(x),\ \bar I(x)\right\} = I^*(x),$$

    where $I^*(x)$ is defined in (8.3.6). Consequently,

    $$\limsup_{n\to\infty}\frac{1}{n}\log\Pr\left\{\frac{Z_n}{n}\in[a,b]\right\}
      \le -\inf_{x\in[a,b]}\bar J(x) \le -\inf_{x\in[a,b]} I^*(x)
      = \begin{cases} 0, & \text{if } 0\in[a,b];\\ -2, & \text{if } -2\in[a,b] \text{ and } 0\notin[a,b];\\ -\infty, & \text{otherwise,}\end{cases}$$

    and a tight upper bound is finally obtained!

  • Extension of Gärtner-Ellis upper bounds II:8-42

    Definition 8.17 Define $\underline{\varphi}_h(\theta) \triangleq \liminf_{n\to\infty}\varphi_n(\theta; h)$, where $\varphi_n(\theta; h)$ was defined
    in Definition 8.12. The twisted inf-large-deviation rate function of an arbitrary
    random sequence $\{Z_n\}_{n=1}^{\infty}$ with respect to a real-valued continuous function $h(\cdot)$ is
    defined as

    $$\underline{J}_h(x) \triangleq \sup_{\{\theta\in\mathbb{R}\,:\,\underline{\varphi}_h(\theta)>-\infty\}}\left[\theta\, h(x) - \underline{\varphi}_h(\theta)\right].$$

    Theorem 8.18 For $a, b\in\mathbb{R}$ and $a \le b$,

    $$\liminf_{n\to\infty}\frac{1}{n}\log\Pr\left\{\frac{Z_n}{n}\in[a,b]\right\} \le -\inf_{x\in[a,b]}\underline{J}(x),$$

    where $\underline{J}(x) \triangleq \sup_{h\in\mathcal{H}}\underline{J}_h(x)$ and $\mathcal{H}$ is the set of all real-valued continuous functions.

  • Extension of Gärtner-Ellis lower bounds II:8-43

    • We would like to know when

      $$\limsup_{n\to\infty}\frac{1}{n}\log\Pr\left\{\frac{Z_n}{n}\in(a,b)\right\} \ge -\inf_{x\in(a,b)}\bar J_h(x). \tag{8.3.9}$$

    Definition 8.19 Define the sup-Gärtner-Ellis set with respect to a real-valued
    continuous function $h(\cdot)$ as

    $$\bar D_h \triangleq \bigcup_{\{\theta\in\mathbb{R}\,:\,\bar\varphi_h(\theta)>-\infty\}} \bar D(\theta; h),$$

    where

    $$\bar D(\theta; h) = \left\{ x\in\mathbb{R} : \limsup_{t\downarrow 0}\frac{\bar\varphi_h(\theta+t)-\bar\varphi_h(\theta)}{t} \le h(x) \le \liminf_{t\downarrow 0}\frac{\bar\varphi_h(\theta)-\bar\varphi_h(\theta-t)}{t} \right\}.$$

    Let us briefly remark on the sup-Gärtner-Ellis set defined above.

    • It can be shown that the sup-Gärtner-Ellis set reduces to

      $$\bar D_h \triangleq \bigcup_{\{\theta\in\mathbb{R}\,:\,\bar\varphi_h(\theta)>-\infty\}} \left\{ x\in\mathbb{R} : \bar\varphi_h'(\theta) = h(x) \right\},$$

      if the derivative $\bar\varphi_h'(\theta)$ exists for all $\theta$.

  • Extension of Gärtner-Ellis lower bounds II:8-44

    Theorem 8.20 Suppose that $h(\cdot)$ is a real-valued continuous function. Then if
    $(a,b) \subset \bar D_h$,

    $$\limsup_{n\to\infty}\frac{1}{n}\log\Pr\left\{\frac{Z_n}{n}\in(a,b)\right\} \ge -\inf_{x\in(a,b)}\bar J_h(x).$$

    Example 8.21 Suppose $Z_n = X_1 + \cdots + X_n$, where $\{X_i\}_{i=1}^n$ are i.i.d. Gaussian
    random variables with mean 1 and variance 1 if $n$ is even, and with mean $-1$ and
    variance 1 if $n$ is odd. Then the exact large deviation rate formula $\bar I^*(x)$ that
    satisfies, for all $a < b$,

    $$-\inf_{x\in[a,b]}\bar I^*(x) \ge \limsup_{n\to\infty}\frac{1}{n}\log\Pr\left\{\frac{Z_n}{n}\in[a,b]\right\}
      \ge \limsup_{n\to\infty}\frac{1}{n}\log\Pr\left\{\frac{Z_n}{n}\in(a,b)\right\}
      \ge -\inf_{x\in(a,b)}\bar I^*(x)$$

    is

    $$\bar I^*(x) = \frac{(|x|-1)^2}{2}. \tag{8.3.10}$$

    Case A: $h(x) = x$.

    For the affine $h(\cdot)$, $\varphi_n(\theta) = \theta + \theta^2/2$ when $n$ is even, and $\varphi_n(\theta) = -\theta + \theta^2/2$

  • Extension of Gärtner-Ellis lower bounds II:8-45

    when $n$ is odd. Hence, $\bar\varphi(\theta) = |\theta| + \theta^2/2$, and

    $$\bar D_h = \left(\bigcup_{\theta>0}\{v\in\mathbb{R} : v = 1+\theta\}\right) \cup \left(\bigcup_{\theta<0}\{v\in\mathbb{R} : v = -1+\theta\}\right) = (-\infty,-1)\cup(1,\infty).$$

    Since

    $$\bar I(x) = \bar J_h(x) = \begin{cases} \dfrac{(|x|-1)^2}{2}, & \text{for } |x| > 1;\\[0.5ex] 0, & \text{for } |x| \le 1,\end{cases}$$

    we obtain for any $a\in(-\infty,-1)\cup(1,\infty)$,

    $$\lim_{\varepsilon\downarrow 0}\limsup_{n\to\infty}\frac{1}{n}\log\Pr\left\{\frac{Z_n}{n}\in(a-\varepsilon, a+\varepsilon)\right\}
      \ge -\lim_{\varepsilon\downarrow 0}\inf_{x\in(a-\varepsilon,a+\varepsilon)}\bar I(x) = -\frac{(|a|-1)^2}{2},$$

  • Extension of Gärtner-Ellis lower bounds II:8-46

    which can be shown to be tight by Theorem 8.13 (or directly by (8.3.10)). Note that
    the above inequality does not hold for any $a\in(-1,1)$. To fill the gap, a different
    $h(\cdot)$ must be employed.

    Case B: $h(x) = |x - a|$.

    For $n$ even,

    $$\begin{aligned}
    E\left[e^{n\theta h(Z_n/n)}\right] &= E\left[e^{n\theta|Z_n/n - a|}\right] \\
      &= \int_{-\infty}^{na} e^{-\theta x + n\theta a}\,\frac{1}{\sqrt{2\pi n}}\,e^{-(x-n)^2/(2n)}\,dx
        + \int_{na}^{\infty} e^{\theta x - n\theta a}\,\frac{1}{\sqrt{2\pi n}}\,e^{-(x-n)^2/(2n)}\,dx \\
      &= e^{n\theta(\theta-2+2a)/2}\int_{-\infty}^{na}\frac{1}{\sqrt{2\pi n}}\,e^{-[x-n(1-\theta)]^2/(2n)}\,dx \\
      &\qquad + e^{n\theta(\theta+2-2a)/2}\int_{na}^{\infty}\frac{1}{\sqrt{2\pi n}}\,e^{-[x-n(1+\theta)]^2/(2n)}\,dx \\
      &= e^{n\theta(\theta-2+2a)/2}\cdot\Phi\!\left((\theta+a-1)\sqrt{n}\right)
        + e^{n\theta(\theta+2-2a)/2}\cdot\Phi\!\left((\theta-a+1)\sqrt{n}\right),
    \end{aligned}$$

    where $\Phi(\cdot)$ represents the unit Gaussian cdf.

  • Extension of Gärtner-Ellis lower bounds II:8-47

    Similarly, for $n$ odd,

    $$E\left[e^{n\theta h(Z_n/n)}\right] = e^{n\theta(\theta+2+2a)/2}\cdot\Phi\!\left((\theta+a+1)\sqrt{n}\right)
      + e^{n\theta(\theta-2-2a)/2}\cdot\Phi\!\left((\theta-a-1)\sqrt{n}\right).$$

    Observe that for any $b\in\mathbb{R}$,

    $$\lim_{n\to\infty}\frac{1}{n}\log\Phi(b\sqrt{n}) = \begin{cases} 0, & \text{for } b \ge 0;\\ -\dfrac{b^2}{2}, & \text{for } b < 0.\end{cases}$$

    Hence,

    $$\bar\varphi_h(\theta) = \begin{cases}
      -\dfrac{(|a|-1)^2}{2}, & \text{for } \theta < |a|-1;\\[1ex]
      \dfrac{\theta[\theta+2(1-|a|)]}{2}, & \text{for } |a|-1 \le \theta < 0;\\[1ex]
      \dfrac{\theta[\theta+2(1+|a|)]}{2}, & \text{for } \theta \ge 0.
    \end{cases}$$

  • Extension of Gärtner-Ellis lower bounds II:8-48

    Therefore,

    $$\bar D_h = \left(\bigcup_{\theta>0}\{x\in\mathbb{R} : |x-a| = \theta+1+|a|\}\right)
      \cup \left(\bigcup_{|a|-1<\theta<0}\{x\in\mathbb{R} : |x-a| = \theta+1-|a|\}\right),$$

    and

    $$\bar J_h(x) = \begin{cases}
      \dfrac{(|x-a|-1-|a|)^2}{2}, & \text{for } x > a+1+|a| \text{ or } x < a-1-|a|;\\[1ex]
      \dfrac{(|x-a|-1+|a|)^2}{2}, & \text{for } |x-a| < 1-|a|;\\[1ex]
      0, & \text{otherwise.}
    \end{cases} \tag{8.3.11}$$

  • Extension of Gärtner-Ellis lower bounds II:8-49

    We then apply Theorem 8.20 to obtain

    $$\lim_{\varepsilon\downarrow 0}\limsup_{n\to\infty}\frac{1}{n}\log\Pr\left\{\frac{Z_n}{n}\in(a-\varepsilon, a+\varepsilon)\right\}
      \ge -\lim_{\varepsilon\downarrow 0}\inf_{x\in(a-\varepsilon,a+\varepsilon)}\bar J_h(x)
      = -\lim_{\varepsilon\downarrow 0}\frac{(\varepsilon-1+|a|)^2}{2}
      = -\frac{(|a|-1)^2}{2}.$$

    Note that the above lower bound is valid for any $a\in(-1,1)$, and can again be shown
    to be tight by Theorem 8.13 (or directly by (8.3.10)).

    Finally, by combining the results of Cases A) and B), the true large deviation rate
    of $\{Z_n\}_{n\ge 1}$ is completely characterized.

  • Extension of Gärtner-Ellis lower bounds II:8-50

    Definition 8.22 Define the inf-Gärtner-Ellis set with respect to a real-valued
    continuous function $h(\cdot)$ as

    $$\underline{D}_h \triangleq \bigcup_{\{\theta\in\mathbb{R}\,:\,\underline{\varphi}_h(\theta)>-\infty\}} \underline{D}(\theta; h),$$

    where

    $$\underline{D}(\theta; h) = \left\{ x\in\mathbb{R} : \limsup_{t\downarrow 0}\frac{\underline{\varphi}_h(\theta+t)-\underline{\varphi}_h(\theta)}{t} \le h(x) \le \liminf_{t\downarrow 0}\frac{\underline{\varphi}_h(\theta)-\underline{\varphi}_h(\theta-t)}{t} \right\}.$$

    Theorem 8.23 Suppose that $h(\cdot)$ is a real-valued continuous function. Then if
    $(a,b) \subset \underline{D}_h$,

    $$\liminf_{n\to\infty}\frac{1}{n}\log\Pr\left\{\frac{Z_n}{n}\in(a,b)\right\} \ge -\inf_{x\in(a,b)}\underline{J}_h(x).$$

  • Properties II:8-51

    Property 8.24 Let $\bar I(x)$ and $\underline{I}(x)$ be the sup- and inf-large-deviation rate functions
    of an infinite sequence of arbitrary random variables $\{Z_n\}_{n=1}^{\infty}$, respectively.
    Denote $m_n = (1/n)E[Z_n]$, and let $\bar m \triangleq \limsup_{n\to\infty} m_n$ and $\underline{m} \triangleq \liminf_{n\to\infty} m_n$. Then

    1. $\bar I(x)$ and $\underline{I}(x)$ are both convex.

    2. $\bar I(x)$ is continuous over $\{x\in\mathbb{R} : \bar I(x) < \infty\}$. Likewise, $\underline{I}(x)$ is continuous
       over $\{x\in\mathbb{R} : \underline{I}(x) < \infty\}$.

    3. $\bar I(x)$ attains its minimum value 0 for $\underline{m} \le x \le \bar m$.

    4. $\underline{I}(x) \ge 0$, but $\underline{I}(x)$ does not necessarily attain its minimum value at
       either $x = \bar m$ or $x = \underline{m}$.

  • Properties II:8-52

    Property 8.25 Suppose that $h(\cdot)$ is a real-valued continuous function. Let $\bar J_h(x)$
    and $\underline{J}_h(x)$ be the corresponding twisted sup- and inf-large-deviation rate functions,
    respectively. Denote $m_n(h) \triangleq E[h(Z_n/n)]$, and let

    $$\bar m_h \triangleq \limsup_{n\to\infty} m_n(h) \quad\text{and}\quad \underline{m}_h \triangleq \liminf_{n\to\infty} m_n(h).$$

    Then

    1. $\bar J_h(x) \ge 0$, with equality if $\underline{m}_h \le h(x) \le \bar m_h$.

    2. $\underline{J}_h(x) \ge 0$, but $\underline{J}_h(x)$ does not necessarily attain its minimum value at points
       where $h(x) = \bar m_h$ or $h(x) = \underline{m}_h$.

  • Probabilistic subexponential behavior II:8-53

    • Subexponential behavior: $a_n = (1/n)\exp\{-2n\}$ and $b_n = (1/\sqrt{n})\exp\{-2n\}$ have
      the same exponent, but contain different subexponential factors.

  • Berry-Esseen theorem for compound i.i.d. sequence II:8-54

    • The Berry-Esseen theorem states that the distribution of the sum of independent
      zero-mean random variables $\{X_i\}_{i=1}^n$, normalized by the standard deviation of
      the sum, differs from the Gaussian distribution by at most $C\, r_n/s_n^3$, where
      $s_n^2$ and $r_n$ are respectively the sums of the marginal variances and of the marginal
      absolute third moments, and $C$ is an absolute constant.

    • Specifically, for every $a\in\mathbb{R}$,

      $$\left|\Pr\left\{\frac{1}{s_n}(X_1+\cdots+X_n) \le a\right\} - \Phi(a)\right| \le C\,\frac{r_n}{s_n^3}, \tag{8.4.12}$$

      where $\Phi(\cdot)$ represents the unit Gaussian cdf.

    • The striking feature of this theorem is that the upper bound depends only on the
      variances and the absolute third moments, and hence it provides a good asymptotic
      estimate based on only the first three moments.

    • The absolute constant $C$ is commonly taken as 6. When $\{X_i\}_{i=1}^n$ are identically
      distributed in addition to independent, the absolute constant can be reduced to 3,
      and has been reported to be improved down to 2.05.

    Definition: compound i.i.d. sequence. The samples considered in this section
    actually consist of two i.i.d. subsequences (and are therefore called a compound
    i.i.d. sequence).

  • Berry-Esseen theorem for compound i.i.d. sequence II:8-55

    Lemma 8.26 (smoothing lemma) Fix the bandlimited filtering function

    $$v_T(x) \triangleq \frac{1-\cos(Tx)}{\pi T x^2} = \frac{2\sin^2(Tx/2)}{\pi T x^2}
      = \frac{T}{2\pi}\,\mathrm{sinc}^2\!\left(\frac{Tx}{2\pi}\right)
      = \mathrm{Four}^{-1}\!\left[\left(1 - \frac{|f|}{T/(2\pi)}\right)^{\!\!+}\,\right].$$

    For any cumulative distribution function $H(\cdot)$ on the real line $\mathbb{R}$,

    $$\sup_{x\in\mathbb{R}} |\Delta_T(x)| \ge \frac{1}{2}\eta - \frac{6}{T\pi\sqrt{2\pi}}\, h\!\left(\frac{T}{\sqrt{2\pi}}\right),$$

    where

    $$\Delta_T(t) \triangleq \int_{-\infty}^{\infty} \left[H(t-x) - \Phi(t-x)\right] v_T(x)\,dx,
      \qquad \eta \triangleq \sup_{x\in\mathbb{R}} |H(x) - \Phi(x)|,$$

    and

    $$h(u) \triangleq \begin{cases}
      \displaystyle u\int_u^{\infty}\frac{1-\cos(x)}{x^2}\,dx
        \;=\; \frac{\pi}{2}u + 1 - \cos(u) - u\int_0^u\frac{\sin(x)}{x}\,dx, & \text{if } u \ge 0;\\[1ex]
      0, & \text{otherwise.}
    \end{cases}$$

  • Berry-Esseen theorem for compound i.i.d. sequence II:8-56

    Lemma 8.27 For any cumulative distribution function $H(\cdot)$ with characteristic
    function $\varphi_H(\zeta)$,

    $$\eta \le \frac{1}{\pi}\int_{-T}^{T}\left|\varphi_H(\zeta) - e^{-\zeta^2/2}\right|\frac{d\zeta}{|\zeta|}
      + \frac{12}{T\pi\sqrt{2\pi}}\, h\!\left(\frac{T}{\sqrt{2\pi}}\right),$$

    where $\eta$ and $h(\cdot)$ are defined in Lemma 8.26.

    Theorem 8.28 (BE theorem for compound i.i.d. sequences) Let $Y_n = \sum_{i=1}^n X_i$
    be the sum of independent random variables, among which $\{X_i\}_{i=1}^{d}$ are identically
    Gaussian distributed, and $\{X_i\}_{i=d+1}^{n}$ are identically distributed but not necessarily
    Gaussian.

    • Denote the mean-variance pairs of $X_1$ and $X_{d+1}$ by $(\mu, \sigma^2)$ and $(\hat\mu, \hat\sigma^2)$, respectively.

    • Define

      $$\rho \triangleq E\left[|X_1 - \mu|^3\right], \qquad \hat\rho \triangleq E\left[|X_{d+1} - \hat\mu|^3\right],$$

      and

      $$s_n^2 = \mathrm{Var}[Y_n] = \sigma^2 d + \hat\sigma^2(n-d).$$

    • Also denote the cdf of $(Y_n - E[Y_n])/s_n$ by $H_n(\cdot)$.

  • Berry-Esseen theorem for compound i.i.d. sequence II:8-57

    Then for all $y\in\mathbb{R}$,

    $$|H_n(y) - \Phi(y)| \le C_{n,d}\,\frac{2}{\sqrt{\pi}}\,\frac{n-d-1}{2(n-d)-3\sqrt{2}}\cdot\frac{\hat\rho}{\hat\sigma^2 s_n},$$

    where $C_{n,d}$ is the unique positive number satisfying

    $$\frac{\pi}{6}C_{n,d} - h(C_{n,d}) = \frac{\sqrt{\pi}\,\bigl(2(n-d)-3\sqrt{2}\bigr)}{12(n-d-1)}
      \left[\frac{\sqrt{6\pi}}{2(3-\sqrt{2})^{3/2}} + \frac{9}{2(11-6\sqrt{2})\sqrt{n-d}}\right],$$

    provided that $n - d \ge 3$.

  • Berry-Esseen theorem for compound i.i.d. sequence II:8-58

    [Figure: the function $(\pi/6)u - h(u)$ plotted for $u \in [0, 5]$, with the vertical axis
    ranging from $-1$ to $2.5$.]

    The function $(\pi/6)u - h(u)$.

  • Berry-Esseen theorem for compound i.i.d. sequence II:8-59

    By letting $d = 0$, the Berry-Esseen inequality for i.i.d. sequences can also be
    readily obtained from the previous theorem.

    Corollary 8.29 (Berry-Esseen theorem for i.i.d. sequences) Let

    $$Y_n = \sum_{i=1}^n X_i$$

    be the sum of independent random variables with a common marginal distribution.
    Denote the marginal mean and variance by $(\hat\mu, \hat\sigma^2)$, and define $\hat\rho \triangleq E\left[|X_1-\hat\mu|^3\right]$.
    Also denote the cdf of $(Y_n - n\hat\mu)/(\sqrt{n}\,\hat\sigma)$ by $H_n(\cdot)$. Then for all $y\in\mathbb{R}$,

    $$|H_n(y) - \Phi(y)| \le C_n\,\frac{2}{\sqrt{\pi}}\,\frac{n-1}{2n-3\sqrt{2}}\cdot\frac{\hat\rho}{\hat\sigma^3\sqrt{n}},$$

    where $C_n$ is the unique positive solution of

    $$\frac{\pi}{6}u - h(u) = \frac{\sqrt{\pi}\,(2n-3\sqrt{2})}{12(n-1)}
      \left[\frac{\sqrt{6\pi}}{2(3-\sqrt{2})^{3/2}} + \frac{9}{2(11-6\sqrt{2})\sqrt{n}}\right],$$

    provided that $n \ge 3$.
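    Computing $C_n$ only involves the one-dimensional function $(\pi/6)u - h(u)$ shown in
    the earlier figure. A minimal sketch (assuming Python with SciPy; the right-hand-side
    value used here is a made-up placeholder rather than the exact expression above):

```python
import numpy as np
from scipy.special import sici
from scipy.optimize import brentq

# Evaluate h(u) from Lemma 8.26 via h(u) = (pi/2)u + 1 - cos(u) - u*Si(u), and solve
# (pi/6)u - h(u) = c for the positive root, as required for C_n in Corollary 8.29.
def h(u):
    si, _ = sici(u)                      # Si(u) = integral_0^u sin(x)/x dx
    return np.pi/2*u + 1 - np.cos(u) - u*si

def g(u):
    return np.pi/6*u - h(u)              # the function plotted in the figure

c = 0.5                                  # hypothetical right-hand side, not from the text
root = brentq(lambda u: g(u) - c, 1e-6, 50.0)
print(root, g(root))
```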

  • Berry-Esseen theorem for compound i.i.d. sequence II:8-60

    • Let us briefly remark on the previous corollary. We observe from numerical
      computation that the quantity

      $$C_n\,\frac{2}{\sqrt{\pi}}\,\frac{n-1}{2n-3\sqrt{2}}$$

      is decreasing in $n$, and ranges from 3.628 down to 1.627 (cf. the figure on slide
      II:8-62).

      – We can upper-bound $C_n$ by the unique positive solution $D_n$ of

        $$\frac{\pi}{6}u - h(u) = \frac{\sqrt{\pi}}{6}
          \left[\frac{\sqrt{6\pi}}{2(3-\sqrt{2})^{3/2}} + \frac{9}{2(11-6\sqrt{2})\sqrt{n}}\right],$$

        which is strictly decreasing in $n$. Hence,

        $$C_n\,\frac{2}{\sqrt{\pi}}\,\frac{n-1}{2n-3\sqrt{2}} \le E_n \triangleq D_n\,\frac{2}{\sqrt{\pi}}\,\frac{n-1}{2n-3\sqrt{2}},$$

        and the right-hand side of the above inequality is strictly decreasing in $n$ (since
        both $D_n$ and $(n-1)/(2n-3\sqrt{2})$ are decreasing), ranging from $E_3 = 4.1911$, ...,
        $E_9 = 2.0363$, ..., $E_{100} = 1.6833$ down to $E_\infty = 1.6266$. If strict decrease is
        preferred, one can use $D_n$ instead of $C_n$ in the Berry-Esseen inequality. Note
        that both $C_n$ and $D_n$ converge to $2.8831\ldots$ as $n$ goes to infinity.

  • Berry-Esseen theorem for compound i.i.d. sequence II:8-61

    • Numerical results show that this quantity lies below 2 when $n \ge 9$, and below 1.68
      when $n \ge 100$. In other words, we can upper-bound it by 1.68 for $n \ge 100$, and
      therefore establish a better estimate of the original Berry-Esseen constant.

  • Berry-Esseen theorem for compound i.i.d. sequence II:8-62

    [Figure: the quantity $C_n\,\frac{2}{\sqrt{\pi}}\,\frac{n-1}{2n-3\sqrt{2}}$ plotted against the sample size $n$ for
    $n$ from 3 to 200, with a horizontal reference near 1.68.]

    The Berry-Esseen constant as a function of the sample size n. The sample size n
    is plotted in log-scale.

  • Generalized Neyman-Pearson Hypothesis Testing II:8-63

    The general expression for the Neyman-Pearson type-II error exponent subject to
    a constant bound on the type-I error has been proved for arbitrary observations.
    In this section, we state the results in terms of the ε-inf/sup-divergence rates.

    Theorem 8.30 (Neyman-Pearson type-II error exponent for a fixed test
    level) Consider a sequence of random observations whose probability distribution
    is governed by either $P_X$ (null hypothesis) or $P_{\hat X}$ (alternative hypothesis). Then
    the type-II error exponent satisfies

    $$\lim_{\delta\uparrow\varepsilon} \bar D_\delta(X\|\hat X) \le \limsup_{n\to\infty} -\frac{1}{n}\log\beta_n^*(\varepsilon) \le \bar D_\varepsilon(X\|\hat X),$$

    $$\lim_{\delta\uparrow\varepsilon} \underline{D}_\delta(X\|\hat X) \le \liminf_{n\to\infty} -\frac{1}{n}\log\beta_n^*(\varepsilon) \le \underline{D}_\varepsilon(X\|\hat X),$$

    where $\beta_n^*(\varepsilon)$ represents the minimum type-II error probability subject to a fixed
    type-I error bound $\varepsilon\in[0,1)$.

    The general formula for the Neyman-Pearson type-II error exponent subject to an
    exponential test level has also been proved in terms of the ε-inf/sup-divergence
    rates.

  • Generalized Neyman-Pearson Hypothesis Testing II:8-64

    Theorem 8.31 (Neyman-Pearson type-II error exponent for an exponential
    test level) Fix $s\in(0,1)$ and $\varepsilon\in[0,1)$. It is possible to choose decision regions for a
    binary hypothesis testing problem with arbitrary datawords of blocklength $n$ (which
    are governed by either the null hypothesis distribution $P_X$ or the alternative
    hypothesis distribution $P_{\hat X}$) such that

    $$\liminf_{n\to\infty} -\frac{1}{n}\log\beta_n \ge \bar D_\varepsilon(\hat X^{(s)}\|\hat X)
      \quad\text{and}\quad
      \limsup_{n\to\infty} -\frac{1}{n}\log\alpha_n \ge \underline{D}_{(1-\varepsilon)}(\hat X^{(s)}\|X), \tag{8.5.13}$$

    or

    $$\liminf_{n\to\infty} -\frac{1}{n}\log\beta_n \ge \underline{D}_\varepsilon(\hat X^{(s)}\|\hat X)
      \quad\text{and}\quad
      \limsup_{n\to\infty} -\frac{1}{n}\log\alpha_n \ge \bar D_{(1-\varepsilon)}(\hat X^{(s)}\|X), \tag{8.5.14}$$

    where $\hat X^{(s)}$ has the tilted distributions $\{P^{(s)}_{\hat X^n}\}_{n=1}^{\infty}$ defined by

    $$dP^{(s)}_{\hat X^n}(x^n) = \frac{1}{\Omega_n(s)}\exp\left\{s\log\frac{dP_{X^n}}{dP_{\hat X^n}}(x^n)\right\} dP_{\hat X^n}(x^n),$$

    and

    $$\Omega_n(s) \triangleq \int_{\mathcal{X}^n}\exp\left\{s\log\frac{dP_{X^n}}{dP_{\hat X^n}}(x^n)\right\} dP_{\hat X^n}(x^n).$$

    Here, $\alpha_n$ and $\beta_n$ are the type-I and type-II error probabilities, respectively.