Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization
Dayu Huang and Sean Meyn
Department of Electrical and Computer Engineering and Coordinated Science Laboratory
University of Illinois, Urbana-Champaign
June 18, 2010
Huang and Meyn (UIUC) Feature Extraction June 2010 1 / 18
Introduction
Universal Hypothesis Testing
Sequence of observations: Z_1^n := (Z1, . . . , Zn),
i.i.d. with marginal π0 under H0 and π1 under H1.
π0: known; π1: not known.
Observation space Z is finite.
Task: design a test to decide in favor of H0 or H1.
The Hoeffding test: φ_n^H = 1{D(Γn‖π0) ≥ η},
where Γn is the empirical distribution
Γn(A) = (1/n) ∑_{k=1}^n 1{Zk ∈ A}, A ⊂ Z.
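The empirical distribution and the Hoeffding test above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the talk; the function names and the encoding of Z as {0, . . . , |Z|−1} are our own choices.

```python
import numpy as np

def empirical_distribution(z, num_symbols):
    """Empirical distribution Gamma_n of the sample z over {0, ..., num_symbols-1}."""
    counts = np.bincount(z, minlength=num_symbols)
    return counts / len(z)

def kl_divergence(mu, pi):
    """D(mu || pi); terms with mu(z) = 0 contribute zero."""
    mask = mu > 0
    return float(np.sum(mu[mask] * np.log(mu[mask] / pi[mask])))

def hoeffding_test(z, pi0, eta):
    """Return 1 (decide H1) if D(Gamma_n || pi0) >= eta, else 0 (decide H0)."""
    gamma_n = empirical_distribution(z, len(pi0))
    return int(kl_divergence(gamma_n, pi0) >= eta)
```

With a sample that matches π0 exactly the statistic is zero and the test decides H0; a sample concentrated on one symbol pushes the statistic up and trips the threshold.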
The Hoeffding Test
Theorem
1. The Hoeffding test achieves the optimal error exponent under the Neyman-Pearson criterion.
2. The asymptotic variance of the Hoeffding test depends on the size of the observation space. When Z_1^n has marginal π0, we have
lim_{n→∞} Var[n D(Γn‖π0)] = (1/2)(|Z| − 1).
Large variance when |Z| is large.
1. Hoeffding 1963; 2. Unnikrishnan, Huang, Meyn, Surana & Veeravalli; Wilks 1938; Clarke & Barron 1990.
Performance of the Hoeffding Test
[Figure: ROC curves, probability of detection Pr(φ = 1 | H1) versus probability of false alarm Pr(φ = 1 | H0), for |Z| = 39 and |Z| = 19.]
Red: better error exponent but larger variance.
Mismatched Universal Test
Variational representation of KL divergence:
D(µ‖π) = sup_f (〈µ, f〉 − log〈π, e^f〉)
Mismatched divergence¹:
D^MM_F(µ‖π) := sup_{f ∈ F} (〈µ, f〉 − log〈π, e^f〉)
Mismatched universal test²:
φ^MM_n = 1{D^MM_F(Γn‖π0) ≥ η}
Here 〈µ, f〉 = ∑_z µ(z)f(z).
1. Abbe, Medard, Meyn & Zheng 2007; 2. Unnikrishnan et al.
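For finite Z the supremum defining the mismatched divergence can be computed numerically: the objective is concave in the coefficients of f, so plain gradient ascent over those coefficients suffices. The sketch below is ours (function name, step size, and iteration count are assumptions, not from the talk); it takes F to be the span of the rows of psi and uses the fact that the gradient is the difference between feature means under µ and under the exponentially tilted distribution.

```python
import numpy as np

def mismatched_divergence(mu, pi, psi, steps=5000, lr=0.5):
    """D^MM_F(mu || pi) for F = span of the rows of psi (shape d x |Z|),
    computed by gradient ascent over the coefficient vector r."""
    d = psi.shape[0]
    r = np.zeros(d)
    for _ in range(steps):
        f = r @ psi                    # f_r(z) = sum_i r_i psi_i(z)
        w = pi * np.exp(f - f.max())   # numerically stabilized tilting
        tilted = w / w.sum()           # pi(z) e^{f(z)} / <pi, e^f>
        grad = psi @ (mu - tilted)     # gradient of the concave objective
        r += lr * grad
    f = r @ psi
    return float(mu @ f - np.log(pi @ np.exp(f)))
```

When psi spans all functions on Z, the result recovers the full KL divergence; for a smaller class the value can only be smaller, matching the slide's point that D^MM_F approximates D from below.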
Function Class and Performance
Consider a linear function class:
F = { f_r := ∑_{i=1}^d r_i ψ_i }
The choice of function class F determines performance:
The mismatched divergence approximates the KL divergence and determines the error exponent of the mismatched universal test. When d is smaller than |Z|, the test is optimal for a restricted set of alternative distributions.
The dimension d determines the asymptotic variance¹: under H0,
lim_{n→∞} Var[n D^MM_F(Γn‖π0)] = (1/2) d
Problem: how to choose the function class F?
1. Unnikrishnan et al.
Our Contribution
1. The mismatched test, even with a small dimension d, is optimal for a large set of alternative distributions.
2. A framework for choosing F for the mismatched test.
How Powerful Is the Mismatched Test?
Example: 10 distributions. d = ?
[Figure: ten bar-chart distributions on a 9-point alphabet, including π0.]
When Is the Mismatched Test Optimal?
When does D^MM_F(π1‖π0) = D(π1‖π0)?
Fact (1)
When F includes the log-likelihood ratio (LLR).
Exponential family: E(F) = {µ : µ(z) ∝ exp(f(z)), f ∈ F}.
Fact (2)
When π0, π1 are in the same exponential family.
How many distributions are there in a d-dimensional exponential family?
ε-Extremal Distributions
πθ(z) ∝ exp(θf(z)) ∈ E(F)
Extremal distributions: as θ → ∞, πθ converges to a distribution on the boundary of E(F).
Example
F = span(ψ), ψ = [5, −1, −1], i.e. ψ(z1) = 5, ψ(z2) = ψ(z3) = −1.
What are the extremal distributions?
[1, 0, 0]: f = [5, −1, −1]
[0, 0.5, 0.5]: f = [−5, 1, 1]
[1/3, 1/3, 1/3]: f = [0, 0, 0]
F^ε(π) := {z : π(z) ≥ max_z π(z) − ε}
Definition
π is called ε-extremal if π(F^ε(π)) ≥ 1 − ε.
Example: [0.004, 0.499, 0.497].
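The set F^ε(π) and the ε-extremal property can be checked mechanically. A minimal sketch (function names are ours):

```python
import numpy as np

def eps_support(pi, eps):
    """F^eps(pi) = {z : pi(z) >= max_z pi(z) - eps}, as a boolean mask."""
    return pi >= pi.max() - eps

def is_eps_extremal(pi, eps):
    """pi is eps-extremal if it puts mass at least 1 - eps on F^eps(pi)."""
    return float(pi[eps_support(pi, eps)].sum()) >= 1 - eps
```

The slide's example [0.004, 0.499, 0.497] is ε-extremal for ε = 0.01: its ε-support is {z2, z3}, which carries mass 0.996.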
ε-Distinguishable Distributions
Distinguishable: D(π1‖π0) = D(π0‖π1) = ∞ ⇔ π1 ⊀ π0 and π0 ⊀ π1.
Example
π0(z1) = 0.5, π0(z2) = 0.5, π0(z3) = 0
π1(z1) = 0, π1(z2) = 0.5, π1(z3) = 0.5
Approximately distinguishable:
Example
π0(z1) = 0.49999, π0(z2) = 0.49999, π0(z3) = 0.00002
π1(z1) = 0.00002, π1(z2) = 0.49999, π1(z3) = 0.49999
Definition
π1, π2 are ε-distinguishable if F^ε(π1) ⊄ F^ε(π2) and F^ε(π2) ⊄ F^ε(π1).
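The definition above reduces to computing the two ε-supports and testing mutual non-containment. A sketch (function names are ours):

```python
import numpy as np

def eps_support(pi, eps):
    """F^eps(pi) = {z : pi(z) >= max_z pi(z) - eps}, as a boolean mask."""
    return pi >= pi.max() - eps

def eps_distinguishable(p1, p2, eps):
    """True if neither eps-support is contained in the other."""
    s1, s2 = eps_support(p1, eps), eps_support(p2, eps)
    return not bool(np.all(s1 <= s2)) and not bool(np.all(s2 <= s1))
```

On the slide's "approximately distinguishable" pair, the ε-supports for ε = 0.01 are {z1, z2} and {z2, z3}, so the two distributions are ε-distinguishable.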
The Number of ε-Distinguishable ε-Extremal Distributions
Definition
N(E): the maximum N such that for any small ε > 0, there exist N distributions in E that are ε-extremal and pairwise ε-distinguishable.
Proposition
Denote N(d) := max{N(E) : E is d-dimensional}.
It admits the following lower and upper bounds:
N(d) ≥ exp(⌊d/2⌋[log|Z| − log⌊d/2⌋ − 1])
N(d) ≤ exp((d + 1)(1 + log|Z| − log(d + 1)))
Many alternative distributions can be distinguished even with a small dimension d.
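Under our reading of the reconstructed bounds above, they are straightforward to evaluate numerically (a sketch; the function name is ours, and the ⌊d/2⌋ = 0 case is handled with the trivial lower bound of one distribution):

```python
import math

def n_bounds(d, Z):
    """Lower and upper bounds on N(d): the number of pairwise
    eps-distinguishable, eps-extremal distributions achievable by
    a d-dimensional exponential family over an alphabet of size Z."""
    k = d // 2
    lower = math.exp(k * (math.log(Z) - math.log(k) - 1)) if k > 0 else 1.0
    upper = math.exp((d + 1) * (1 + math.log(Z) - math.log(d + 1)))
    return lower, upper
```

For example, with |Z| = 20 and d = 4 the lower bound is (10/e)^2, about 13.5, already more than the 10 distributions in the earlier example.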
A Framework for Choosing the Function Class
Scenario: alternative distributions lie in a set S (not known to the algorithm). We observe p distributions from the set: π1, . . . , πp.
Objective function to be maximized:
max_F (1/p) ∑_{i=1}^p γ_i D^MM_F(π_i‖π0)
subject to dim(F) ≤ d
Rank-constrained optimization:
max_X (1/p) ∑_{i=1}^p γ_i (〈π_i, X_i〉 − log〈π0, e^{X_i}〉)
subject to rank(X) ≤ d
Here 〈µ, f〉 = ∑_z µ(z)f(z).
Algorithm
Iterative gradient projection:
1. Y^{k+1} = X^k + α_k ∇h(X^k)
2. X^{k+1} = P_S(Y^{k+1})
where P_S(Y) = argmin{‖Y − X‖ : rank(X) ≤ d}.
Provable local convergence.
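The two steps above amount to projected gradient ascent, where P_S can be computed by truncated SVD: by the Eckart-Young theorem, zeroing all but the top d singular values gives the nearest matrix of rank at most d in Frobenius norm. A minimal sketch with a generic gradient oracle standing in for ∇h (function names, step size, and iteration count are our assumptions):

```python
import numpy as np

def project_rank(Y, d):
    """P_S(Y): nearest matrix of rank <= d in Frobenius norm (Eckart-Young),
    computed by truncating the SVD of Y."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s[d:] = 0.0
    return (U * s) @ Vt

def gradient_projection(X0, grad_h, d, alpha=0.1, iters=200):
    """Iterate X <- P_S(X + alpha * grad_h(X)) to locally maximize h
    over matrices of rank at most d."""
    X = project_rank(X0, d)
    for _ in range(iters):
        X = project_rank(X + alpha * grad_h(X), d)
    return X
```

For the toy concave objective h(X) = -||X - M||^2 / 2 (gradient M - X) with a rank-1 constraint, the iteration settles at the best rank-1 approximation of M, consistent with the local-convergence claim on the slide.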
Numerical Experiment
Draw randomly from a set S of distributions:
π0 and π1, . . . , πp for feature extraction;
π1′ for testing.
Experiment steps:
Feature extraction: extract a d-dimensional function class F based on π0 and π1, . . . , πp.
Test: the alternative distribution is π1′. Estimate the probability of error by simulation.
Numerical Experiment
[Figure: ROC curves, Pr(φ = 1 | H1) versus Pr(φ = 1 | H0).]
S: 12-dimensional exponential family. |Z| = 20. n = 30.
Conclusion and Future Work
Conclusions:
Variance is as important as the error exponent.
There is a balance between variance and error exponent.
Feature extraction algorithm: exploit prior information to optimize the performance of the mismatched test.
Future Work:
Bound the probability of error based on finer statistics.
Extend to processes with long memory.
Explore other heuristics (such as the nuclear norm) for algorithm design.