Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization (ISIT 2010)


Page 1: Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization (ISIT 2010)

Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization

Dayu Huang and Sean Meyn

Department of Electrical and Computer Engineering and Coordinated Science Laboratory

University of Illinois, Urbana-Champaign

June 18, 2010


Page 2: Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization (ISIT 2010)

Introduction

Universal Hypothesis Testing

Sequence of observations: Zn1 := (Z1, . . . ,Zn).

i.i.d. π0 under H0, π1 under H1

π0: known; π1: not known.

Observation space Z is finite.

Task: Design a test to decide in favor of H0 or H1.

The Hoeffding test: φHn = 1{D(Γn‖π0) ≥ η}, where Γn is the empirical distribution

Γn(A) = (1/n) ∑_{k=1}^n 1{Zk ∈ A}, A ⊂ Z.
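As a concrete sketch (not part of the slides), the test statistic D(Γn‖π0) can be computed directly from symbol counts; the distribution π0, threshold η, and sample draw below are made-up examples.

```python
import numpy as np

def kl_divergence(mu, pi):
    # D(mu || pi) = sum_z mu(z) * log(mu(z) / pi(z)), with 0 log 0 = 0
    mask = mu > 0
    return float(np.sum(mu[mask] * np.log(mu[mask] / pi[mask])))

def hoeffding_test(samples, pi0, eta):
    # Empirical distribution Gamma_n, then phi_n^H = 1{D(Gamma_n || pi0) >= eta}
    gamma_n = np.bincount(samples, minlength=len(pi0)) / len(samples)
    return int(kl_divergence(gamma_n, pi0) >= eta)

# Hypothetical example: uniform pi0 on 5 symbols, n = 100 observations
rng = np.random.default_rng(0)
pi0 = np.full(5, 0.2)
samples = rng.choice(5, size=100, p=[0.4, 0.3, 0.1, 0.1, 0.1])
print(hoeffding_test(samples, pi0, eta=0.1))  # 1 = decide H1, 0 = decide H0
```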



Page 5: Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization (ISIT 2010)

The Hoeffding Test

Theorem

1. The Hoeffding test achieves the optimal error exponent in the Neyman-Pearson criterion.

2. The asymptotic variance of the Hoeffding test depends on the size of the observation space. When Zn1 has marginal π0, we have

lim_{n→∞} Var[n D(Γn‖π0)] = (1/2)(|Z| − 1).


Large variance when |Z| is large.

1. Hoeffding 1963; 2. Unnikrishnan, Huang, Meyn, Surana & Veeravalli; Wilks 1938; Clarke & Barron 1990.

Page 6: Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization (ISIT 2010)

Performance of the Hoeffding Test

[Figure: ROC curves, probability of detection Pr(φ = 1|H1) versus probability of false alarm Pr(φ = 1|H0), for |Z| = 19 and |Z| = 39.]

Red (|Z| = 39): better error exponent but larger variance.


Page 8: Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization (ISIT 2010)

Mismatched Universal Test

Variational representation of KL divergence:

D(µ‖π) = sup_f ( 〈µ, f 〉 − log〈π, e^f 〉 )

Mismatched divergence [1]:

DMMF(µ‖π) := sup_{f ∈ F} ( 〈µ, f 〉 − log〈π, e^f 〉 )

Mismatched universal test [2]:

φMMn = 1{DMMF(Γn‖π0) ≥ η}


〈µ, f 〉 = ∑_z µ(z)f(z)

1. Abbe, Medard, Meyn & Zheng 2007; 2. Unnikrishnan et al.
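As a sketch of how DMMF(µ‖π0) could be evaluated for the linear function class F = {fr = ∑ ri ψi} introduced on the next slide: the objective is concave in r, so a generic optimizer suffices. The basis ψ and the distributions below are illustrative, not from the paper.

```python
import numpy as np
from scipy.optimize import minimize

def mismatched_divergence(mu, pi, psi):
    # D^MM_F(mu || pi) = sup_r ( <mu, f_r> - log <pi, e^{f_r}> ), with f_r = r @ psi
    d = psi.shape[0]  # psi: d x |Z| matrix whose rows are the basis functions psi_i
    def neg_objective(r):
        f = r @ psi
        return -(mu @ f - np.log(pi @ np.exp(f)))
    res = minimize(neg_objective, np.zeros(d), method="BFGS")
    return -res.fun

# Hypothetical example with |Z| = 3 and a single feature (d = 1)
pi0 = np.array([0.5, 0.3, 0.2])
mu = np.array([0.2, 0.3, 0.5])
psi = np.array([[1.0, 0.0, -1.0]])
print(mismatched_divergence(mu, pi0, psi))  # always <= D(mu || pi0)
```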

Page 9: Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization (ISIT 2010)

Function Class and Performance

Consider a linear function class:

F = { fr := ∑_{i=1}^d ri ψi }

Choice of the function class F determines performance:

The mismatched divergence approximates the KL divergence and determines the error exponent of the mismatched universal test. When d is smaller than |Z|, the test is optimal for a restricted set of alternative distributions.

The dimension d determines the asymptotic variance [1]: under H0,

lim_{n→∞} Var[n DMMF(Γn‖π0)] = (1/2) d

Problem: How to choose the function class F?


1. Unnikrishnan et al.
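A quick Monte Carlo sanity check of the variance formula, reusing mismatched_divergence, pi0, and psi from the sketch above (sample sizes are arbitrary):

```python
# Under H0, Var[n * D^MM_F(Gamma_n || pi0)] should approach d/2
rng = np.random.default_rng(1)
n, trials = 2000, 500
stats = []
for _ in range(trials):
    z = rng.choice(len(pi0), size=n, p=pi0)
    gamma_n = np.bincount(z, minlength=len(pi0)) / n
    stats.append(n * mismatched_divergence(gamma_n, pi0, psi))
print(np.var(stats))  # roughly d/2 = 0.5 here, since d = 1
```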


Page 11: Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization (ISIT 2010)

Our Contribution

1. The mismatched test, even with a small dimension d, is optimal for a large set of alternative distributions.

2. A framework for choosing F for the mismatched test.


Page 12: Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization (ISIT 2010)

How Powerful Is the Mismatched Test?

Example

10 distributions. d = ?

[Figure: bar plots of the ten distributions on Z = {1, . . . , 9}; one is π0.]

Page 13: Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization (ISIT 2010)

When Is the Mismatched Test Optimal?

When does DMMF(π1‖π0) = D(π1‖π0)?

Fact (1)

When F contains the log-likelihood ratio (LLR).

Exponential family: E(F) = {µ : µ(z) ∝ exp(f(z)), f ∈ F}.

Fact (2)

When π0 and π1 are in the same exponential family E(F).

How many distributions are there in a d-dimensional exponential family?



Page 15: Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization (ISIT 2010)

ε-Extremal Distributions

πθ(z) ∝ exp(θf(z)) ∈ E(F)

Extremal distributions: the limits of πθ as θ → ∞, which lie on the boundary of E(F).

Example

F = span(ψ) with ψ = [5, −1, −1], i.e. ψ(z1) = 5, ψ(z2) = ψ(z3) = −1. What are the extremal distributions?

[1, 0, 0]: f = [5, −1, −1]
[0, 0.5, 0.5]: f = [−5, 1, 1]
[1/3, 1/3, 1/3]: f = [0, 0, 0]

Fε(π) := {z : π(z) ≥ max_z π(z) − ε}

Definition

• π is called ε-extremal if π(Fε(π)) ≥ 1 − ε.

Example: [0.004, 0.499, 0.497].
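These definitions translate directly into a short check (a sketch; the distribution is the example above):

```python
import numpy as np

def f_eps(pi, eps):
    # F_eps(pi) = {z : pi(z) >= max_z pi(z) - eps}
    return np.flatnonzero(pi >= pi.max() - eps)

def is_eps_extremal(pi, eps):
    # pi is eps-extremal if pi(F_eps(pi)) >= 1 - eps
    return pi[f_eps(pi, eps)].sum() >= 1 - eps

print(is_eps_extremal(np.array([0.004, 0.499, 0.497]), eps=0.01))  # True
```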



Page 18: Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization (ISIT 2010)

ε-Distinguishable Distributions

Distinguishable: D(π1‖π0) = D(π0‖π1) = ∞ ⇔ π1 ⊀ π0 and π0 ⊀ π1.

Example

π0(z1) = 0.5, π0(z2) = 0.5, π0(z3) = 0
π1(z1) = 0, π1(z2) = 0.5, π1(z3) = 0.5

Approximately distinguishable

Example

π0(z1) = 0.49999, π0(z2) = 0.49999, π0(z3) = 0.00002
π1(z1) = 0.00002, π1(z2) = 0.49999, π1(z3) = 0.49999

Definition

π1, π2 are ε-distinguishable if Fε(π1) ⊈ Fε(π2) and Fε(π2) ⊈ Fε(π1).

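And the corresponding check for ε-distinguishability, reusing f_eps from the previous sketch (the distributions are the "approximately distinguishable" example above):

```python
def are_eps_distinguishable(pi1, pi2, eps):
    # Neither F_eps set may contain the other
    s1, s2 = set(f_eps(pi1, eps)), set(f_eps(pi2, eps))
    return not s1 <= s2 and not s2 <= s1

pi0 = np.array([0.49999, 0.49999, 0.00002])
pi1 = np.array([0.00002, 0.49999, 0.49999])
print(are_eps_distinguishable(pi0, pi1, eps=0.01))  # True
```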


Page 21: Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization (ISIT 2010)

The Number of ε-Distinguishable ε-Extremal Distributions

Definition

N(E): the maximum N such that for any small ε > 0, there exist N distributions in E that are ε-extremal and pairwise ε-distinguishable.

Proposition

Denote N(d) := max{N(E) : E is d-dimensional}.

It admits the following lower and upper bounds:

N(d) ≥ exp( ⌊d/2⌋ [log(|Z|) − log⌊d/2⌋ − 1] )

N(d) ≤ exp( (d + 1)(1 + log(|Z|) − log(d + 1)) )

Many alternative distributions can be distinguished even with small dimension d.




Page 24: Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization (ISIT 2010)

A Framework for Choosing Function Class

Scenario: Alternative distributions are in a set S (not known to the algorithm). Observe p distributions from the set: π1, . . . , πp.

Objective function to be maximized:

max_F (1/p) ∑_{i=1}^p γi DMMF(πi‖π0)
subject to dim(F) ≤ d

Rank-constrained optimization:

max_X (1/p) ∑_{i=1}^p γi ( 〈πi, Xi 〉 − log〈π0, e^{Xi} 〉 )
subject to rank(X) ≤ d


〈µ, f 〉 = ∑_z µ(z)f(z)

Page 25: Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization (ISIT 2010)

Algorithm

Iterative gradient projection:

1. Y^{k+1} = X^k + α^k ∇h(X^k)
2. X^{k+1} = PS(Y^{k+1})

PS(Y) = argmin{‖Y − X‖ : rank(X) ≤ d}.

Provable local convergence.
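A minimal sketch of the iteration, with PS implemented by truncated SVD; the step size, iteration count, and initialization are illustrative placeholders, not the authors' settings:

```python
import numpy as np

def project_rank(Y, d):
    # P_S(Y): nearest matrix of rank <= d in Frobenius norm, via truncated SVD
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s[d:] = 0.0
    return (U * s) @ Vt

def grad_h(X, Pis, pi0, gammas):
    # Gradient of h(X) = (1/p) sum_i gamma_i ( <pi_i, X_i> - log <pi0, e^{X_i}> ),
    # where column X[:, i] plays the role of X_i
    p = X.shape[1]
    G = np.empty_like(X)
    for i in range(p):
        w = pi0 * np.exp(X[:, i])
        G[:, i] = gammas[i] / p * (Pis[:, i] - w / w.sum())
    return G

def gradient_projection(Pis, pi0, gammas, d, alpha=0.5, iters=500):
    # 1) Y^{k+1} = X^k + alpha * grad h(X^k)   2) X^{k+1} = P_S(Y^{k+1})
    X = np.zeros_like(Pis)
    for _ in range(iters):
        X = project_rank(X + alpha * grad_h(X, Pis, pi0, gammas), d)
    return X
```

Under this formulation, the leading left singular vectors of the optimized X can serve as the extracted features ψ1, . . . , ψd spanning F.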


Page 26: Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization (ISIT 2010)

Numerical Experiment

Distributions are drawn randomly from a set S: π0; π1, . . . , πp for feature extraction; π1′ for testing.

Experiment steps:

Feature extraction: extract a d-dimensional function class F based on π0 and π1, . . . , πp.

Test: the alternative distribution is π1′. Estimate the probability of error by simulation.


Page 27: Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization (ISIT 2010)

Numerical Experiment

S: 12-dimensional exponential family. |Z| = 20. n = 30.

[Figure: ROC curves, Pr(φ = 1|H1) versus Pr(φ = 1|H0), for the tests compared.]

Page 32: Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization (ISIT 2010)

Conclusion and Future Work

Conclusions:

Variance is as important as the error exponent.

There is a balance between variance and error exponent.

Feature extraction algorithm: exploit prior information to optimize the performance of the mismatched test.

Future Work:

Bound probability of error based on finer statistics.

Extend to processes with long memory.

Other heuristics (such as the nuclear norm) for algorithm design.
