Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization (ISIT 2010)


Page 1: Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization (ISIT 2010)

Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization

Dayu Huang and Sean Meyn

Department of Electrical and Computer Engineering and Coordinated Science Laboratory

University of Illinois, Urbana-Champaign

June 18, 2010


Page 2: Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization (ISIT 2010)

Introduction

Universal Hypothesis Testing

Sequence of observations: Zn1 := (Z1, . . . ,Zn).

i.i.d. π0 under H0, π1 under H1

π0: known; π1: not known.

Observation space Z is finite.

Task: Design a test to decide in favor of H0 or H1.

The Hoeffding test: φHn = 1{D(Γn‖π0) ≥ η}, where Γn is the empirical distribution

Γn(A) = (1/n) ∑_{k=1}^n 1{Zk ∈ A}, A ⊂ Z.
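As a concrete sketch (not part of the slides), the test statistic D(Γn‖π0) can be computed directly from symbol counts; the distribution π0, threshold η, and sample draw below are made-up examples.

```python
import numpy as np

def kl_divergence(mu, pi):
    # D(mu || pi) = sum_z mu(z) * log(mu(z) / pi(z)), with 0 log 0 = 0
    mask = mu > 0
    return float(np.sum(mu[mask] * np.log(mu[mask] / pi[mask])))

def hoeffding_test(samples, pi0, eta):
    # Empirical distribution Gamma_n, then phi_n^H = 1{D(Gamma_n || pi0) >= eta}
    gamma_n = np.bincount(samples, minlength=len(pi0)) / len(samples)
    return int(kl_divergence(gamma_n, pi0) >= eta)

# Hypothetical example: uniform pi0 on 5 symbols, n = 100 observations
rng = np.random.default_rng(0)
pi0 = np.full(5, 0.2)
samples = rng.choice(5, size=100, p=[0.4, 0.3, 0.1, 0.1, 0.1])
print(hoeffding_test(samples, pi0, eta=0.1))  # 1 = decide H1, 0 = decide H0
```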



Page 5: Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization (ISIT 2010)

The Hoeffding Test

Theorem

1. The Hoeffding test achieves the optimal error exponent in the Neyman-Pearson criterion.

2. The asymptotic variance of the Hoeffding test depends on the size of the observation space. When Zn1 has marginal π0, we have

lim_{n→∞} Var[n D(Γn‖π0)] = (1/2)(|Z| − 1).


Large variance when |Z| is large.

1. Hoeffding 1963; 2. Unnikrishnan, Huang, Meyn, Surana & Veeravalli; Wilks 1938; Clarke & Barron 1990.

Page 6: Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization (ISIT 2010)

Performance of the Hoeffding Test

[Figure: ROC curves, probability of detection Pr(φ = 1|H1) versus probability of false alarm Pr(φ = 1|H0), for |Z| = 19 and |Z| = 39.]

Red (|Z| = 39): better error exponent but larger variance.


Page 8: Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization (ISIT 2010)

Mismatched Universal Test

Variational representation of KL divergence:

D(µ‖π) = sup_f ( 〈µ, f 〉 − log〈π, e^f 〉 )

Mismatched divergence [1]:

DMMF(µ‖π) := sup_{f ∈ F} ( 〈µ, f 〉 − log〈π, e^f 〉 )

Mismatched universal test [2]:

φMMn = 1{DMMF(Γn‖π0) ≥ η}


〈µ, f 〉 = ∑_z µ(z)f(z)

1. Abbe, Medard, Meyn & Zheng 2007; 2. Unnikrishnan et al.
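As a sketch of how DMMF(µ‖π0) could be evaluated for the linear function class F = {fr = ∑ ri ψi} introduced on the next slide: the objective is concave in r, so a generic optimizer suffices. The basis ψ and the distributions below are illustrative, not from the paper.

```python
import numpy as np
from scipy.optimize import minimize

def mismatched_divergence(mu, pi, psi):
    # D^MM_F(mu || pi) = sup_r ( <mu, f_r> - log <pi, e^{f_r}> ), with f_r = r @ psi
    d = psi.shape[0]  # psi: d x |Z| matrix whose rows are the basis functions psi_i
    def neg_objective(r):
        f = r @ psi
        return -(mu @ f - np.log(pi @ np.exp(f)))
    res = minimize(neg_objective, np.zeros(d), method="BFGS")
    return -res.fun

# Hypothetical example with |Z| = 3 and a single feature (d = 1)
pi0 = np.array([0.5, 0.3, 0.2])
mu = np.array([0.2, 0.3, 0.5])
psi = np.array([[1.0, 0.0, -1.0]])
print(mismatched_divergence(mu, pi0, psi))  # always <= D(mu || pi0)
```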

Page 9: Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization (ISIT 2010)

Function Class and Performance

Consider a linear function class:

F = { fr := ∑_{i=1}^d ri ψi }

Choice of the function class F determines performance:

The mismatched divergence approximates the KL divergence and determines the error exponent of the mismatched universal test. When d is smaller than |Z|, the test is optimal for a restricted set of alternative distributions.

The dimension d determines the asymptotic variance [1]: under H0,

lim_{n→∞} Var[n DMMF(Γn‖π0)] = (1/2) d

Problem: How to choose the function class F?


1. Unnikrishnan et al.
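A quick Monte Carlo sanity check of the variance formula, reusing mismatched_divergence, pi0, and psi from the sketch above (sample sizes are arbitrary):

```python
# Under H0, Var[n * D^MM_F(Gamma_n || pi0)] should approach d/2
rng = np.random.default_rng(1)
n, trials = 2000, 500
stats = []
for _ in range(trials):
    z = rng.choice(len(pi0), size=n, p=pi0)
    gamma_n = np.bincount(z, minlength=len(pi0)) / n
    stats.append(n * mismatched_divergence(gamma_n, pi0, psi))
print(np.var(stats))  # roughly d/2 = 0.5 here, since d = 1
```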


Page 11: Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization (ISIT 2010)

Our Contribution

1. The mismatched test, even with a small dimension d, is optimal for a large set of alternative distributions.

2. A framework for choosing F for the mismatched test.


Page 12: Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization (ISIT 2010)

How Powerful Is the Mismatched Test?

Example

10 distributions. d = ?

[Figure: bar plots of the ten distributions on Z = {1, . . . , 9}; one is π0.]

Page 13: Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization (ISIT 2010)

When Is the Mismatched Test Optimal?

When does DMMF(π1‖π0) = D(π1‖π0)?

Fact (1)

When F contains the log-likelihood ratio (LLR).

Exponential family: E(F) = {µ : µ(z) ∝ exp(f(z)), f ∈ F}.

Fact (2)

When π0 and π1 are in the same exponential family E(F).

How many distributions are there in a d-dimensional exponential family?



Page 15: Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization (ISIT 2010)

ε-Extremal Distributions

πθ(z) ∝ exp(θf(z)) ∈ E(F)

Extremal distributions: the limits of πθ as θ → ∞, which lie on the boundary of E(F).

Example

F = span(ψ) with ψ = [5, −1, −1], i.e. ψ(z1) = 5, ψ(z2) = ψ(z3) = −1. What are the extremal distributions?

[1, 0, 0]: f = [5, −1, −1]
[0, 0.5, 0.5]: f = [−5, 1, 1]
[1/3, 1/3, 1/3]: f = [0, 0, 0]

Fε(π) := {z : π(z) ≥ max_z π(z) − ε}

Definition

• π is called ε-extremal if π(Fε(π)) ≥ 1 − ε.

Example: [0.004, 0.499, 0.497].
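These definitions translate directly into a short check (a sketch; the distribution is the example above):

```python
import numpy as np

def f_eps(pi, eps):
    # F_eps(pi) = {z : pi(z) >= max_z pi(z) - eps}
    return np.flatnonzero(pi >= pi.max() - eps)

def is_eps_extremal(pi, eps):
    # pi is eps-extremal if pi(F_eps(pi)) >= 1 - eps
    return pi[f_eps(pi, eps)].sum() >= 1 - eps

print(is_eps_extremal(np.array([0.004, 0.499, 0.497]), eps=0.01))  # True
```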



Page 18: Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization (ISIT 2010)

ε-Distinguishable Distributions

Distinguishable: D(π1‖π0) = D(π0‖π1) = ∞ ⇔ π1 ⊀ π0 and π0 ⊀ π1.

Example

π0(z1) = 0.5, π0(z2) = 0.5, π0(z3) = 0
π1(z1) = 0, π1(z2) = 0.5, π1(z3) = 0.5

Approximately distinguishable

Example

π0(z1) = 0.49999, π0(z2) = 0.49999, π0(z3) = 0.00002
π1(z1) = 0.00002, π1(z2) = 0.49999, π1(z3) = 0.49999

Definition

π1, π2 are ε-distinguishable if Fε(π1) ⊈ Fε(π2) and Fε(π2) ⊈ Fε(π1).

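And the corresponding check for ε-distinguishability, reusing f_eps from the previous sketch (the distributions are the "approximately distinguishable" example above):

```python
def are_eps_distinguishable(pi1, pi2, eps):
    # Neither F_eps set may contain the other
    s1, s2 = set(f_eps(pi1, eps)), set(f_eps(pi2, eps))
    return not s1 <= s2 and not s2 <= s1

pi0 = np.array([0.49999, 0.49999, 0.00002])
pi1 = np.array([0.00002, 0.49999, 0.49999])
print(are_eps_distinguishable(pi0, pi1, eps=0.01))  # True
```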


Page 21: Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization (ISIT 2010)

The Number of ε-Distinguishable ε-Extremal Distributions

Definition

N(E): the maximum N such that for any small ε > 0, there exist N distributions in E that are ε-extremal and pairwise ε-distinguishable.

Proposition

Denote N(d) := max{N(E) : E is d-dimensional}.

It admits the following lower and upper bounds:

N(d) ≥ exp( ⌊d/2⌋ [log(|Z|) − log⌊d/2⌋ − 1] )

N(d) ≤ exp( (d + 1)(1 + log(|Z|) − log(d + 1)) )

Many alternative distributions can be distinguished even with small dimension d.




Page 24: Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization (ISIT 2010)

A Framework for Choosing Function Class

Scenario: Alternative distributions are in a set S (not known to the algorithm). Observe p distributions from the set: π1, . . . , πp.

Objective function to be maximized:

max_F (1/p) ∑_{i=1}^p γi DMMF(πi‖π0)
subject to dim(F) ≤ d

Rank-constrained optimization:

max_X (1/p) ∑_{i=1}^p γi ( 〈πi, Xi 〉 − log〈π0, e^{Xi} 〉 )
subject to rank(X) ≤ d


〈µ, f 〉 = ∑_z µ(z)f(z)

Page 25: Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization (ISIT 2010)

Algorithm

Iterative gradient projection:

1. Y^{k+1} = X^k + α^k ∇h(X^k)
2. X^{k+1} = PS(Y^{k+1})

PS(Y) = argmin{‖Y − X‖ : rank(X) ≤ d}.

Provable local convergence.
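A minimal sketch of the iteration, with PS implemented by truncated SVD; the step size, iteration count, and initialization are illustrative placeholders, not the authors' settings:

```python
import numpy as np

def project_rank(Y, d):
    # P_S(Y): nearest matrix of rank <= d in Frobenius norm, via truncated SVD
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s[d:] = 0.0
    return (U * s) @ Vt

def grad_h(X, Pis, pi0, gammas):
    # Gradient of h(X) = (1/p) sum_i gamma_i ( <pi_i, X_i> - log <pi0, e^{X_i}> ),
    # where column X[:, i] plays the role of X_i
    p = X.shape[1]
    G = np.empty_like(X)
    for i in range(p):
        w = pi0 * np.exp(X[:, i])
        G[:, i] = gammas[i] / p * (Pis[:, i] - w / w.sum())
    return G

def gradient_projection(Pis, pi0, gammas, d, alpha=0.5, iters=500):
    # 1) Y^{k+1} = X^k + alpha * grad h(X^k)   2) X^{k+1} = P_S(Y^{k+1})
    X = np.zeros_like(Pis)
    for _ in range(iters):
        X = project_rank(X + alpha * grad_h(X, Pis, pi0, gammas), d)
    return X
```

Under this formulation, the leading left singular vectors of the optimized X can serve as the extracted features ψ1, . . . , ψd spanning F.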


Page 26: Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization (ISIT 2010)

Numerical Experiment

Distributions are drawn randomly from a set S: π0; π1, . . . , πp for feature extraction; π1′ for testing.

Experiment steps:

Feature extraction: extract a d-dimensional function class F based on π0 and π1, . . . , πp.

Test: the alternative distribution is π1′. Estimate the probability of error by simulation.


Page 27: Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization (ISIT 2010)

Numerical Experiment

S: 12-dimensional exponential family. |Z| = 20. n = 30.

[Figure: ROC curves, Pr(φ = 1|H1) versus Pr(φ = 1|H0), for the tests compared.]

Page 32: Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization (ISIT 2010)

Conclusion and Future Work

Conclusions:

Variance is as important as the error exponent.

There is a balance between variance and error exponent.

Feature extraction algorithm: exploit prior information to optimize the performance of the mismatched test.

Future Work:

Bound probability of error based on finer statistics.

Extend to processes with long memory.

Other heuristics (such as the nuclear norm) for algorithm design.
