Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization
Dayu Huang and Sean Meyn
Department of Electrical and Computer Engineering and Coordinated Science Laboratory
University of Illinois, Urbana-Champaign
June 18, 2010
Huang and Meyn (UIUC) Feature Extraction June 2010 1 / 18
Introduction
Universal Hypothesis Testing
Sequence of observations: Z_1^n := (Z1, . . . , Zn),
i.i.d. with marginal π0 under H0 and π1 under H1.
π0: known; π1: not known.
Observation space Z is finite.
Task: design a test to decide in favor of H0 or H1.
The Hoeffding test: φ_n^H = 1{D(Γn‖π0) ≥ η},
where Γn is the empirical distribution
Γn(A) = (1/n) ∑_{k=1}^n 1{Zk ∈ A}, A ⊂ Z.
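The empirical distribution and the Hoeffding test above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the talk; the function names and the encoding of Z as {0, . . . , |Z|−1} are our own choices.

```python
import numpy as np

def empirical_distribution(z, num_symbols):
    """Empirical distribution Gamma_n of the sample z over {0, ..., num_symbols-1}."""
    counts = np.bincount(z, minlength=num_symbols)
    return counts / len(z)

def kl_divergence(mu, pi):
    """D(mu || pi); terms with mu(z) = 0 contribute zero."""
    mask = mu > 0
    return float(np.sum(mu[mask] * np.log(mu[mask] / pi[mask])))

def hoeffding_test(z, pi0, eta):
    """Return 1 (decide H1) if D(Gamma_n || pi0) >= eta, else 0 (decide H0)."""
    gamma_n = empirical_distribution(z, len(pi0))
    return int(kl_divergence(gamma_n, pi0) >= eta)
```

With a sample that matches π0 exactly the statistic is zero and the test decides H0; a sample concentrated on one symbol pushes the statistic up and trips the threshold.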
The Hoeffding Test
Theorem
1. The Hoeffding test achieves the optimal error exponent under the Neyman-Pearson criterion.
2. The asymptotic variance of the Hoeffding test depends on the size of the observation space. When Z_1^n has marginal π0, we have
lim_{n→∞} Var[n D(Γn‖π0)] = (1/2)(|Z| − 1).
Large variance when |Z| is large.
1. Hoeffding 1963; 2. Unnikrishnan, Huang, Meyn, Surana & Veeravalli; Wilks 1938; Clarke & Barron 1990.
Performance of the Hoeffding Test
[Figure: ROC curves, probability of detection Pr(φ = 1 | H1) versus probability of false alarm Pr(φ = 1 | H0), for |Z| = 39 and |Z| = 19.]
Red: better error exponent but larger variance.
Mismatched Universal Test
Variational representation of KL divergence:
D(µ‖π) = sup_f (〈µ, f〉 − log〈π, e^f〉)
Mismatched divergence¹:
D^MM_F(µ‖π) := sup_{f ∈ F} (〈µ, f〉 − log〈π, e^f〉)
Mismatched universal test²:
φ^MM_n = 1{D^MM_F(Γn‖π0) ≥ η}
Here 〈µ, f〉 = ∑_z µ(z)f(z).
1. Abbe, Medard, Meyn & Zheng 2007; 2. Unnikrishnan et al.
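For finite Z the supremum defining the mismatched divergence can be computed numerically: the objective is concave in the coefficients of f, so plain gradient ascent over those coefficients suffices. The sketch below is ours (function name, step size, and iteration count are assumptions, not from the talk); it takes F to be the span of the rows of psi and uses the fact that the gradient is the difference between feature means under µ and under the exponentially tilted distribution.

```python
import numpy as np

def mismatched_divergence(mu, pi, psi, steps=5000, lr=0.5):
    """D^MM_F(mu || pi) for F = span of the rows of psi (shape d x |Z|),
    computed by gradient ascent over the coefficient vector r."""
    d = psi.shape[0]
    r = np.zeros(d)
    for _ in range(steps):
        f = r @ psi                    # f_r(z) = sum_i r_i psi_i(z)
        w = pi * np.exp(f - f.max())   # numerically stabilized tilting
        tilted = w / w.sum()           # pi(z) e^{f(z)} / <pi, e^f>
        grad = psi @ (mu - tilted)     # gradient of the concave objective
        r += lr * grad
    f = r @ psi
    return float(mu @ f - np.log(pi @ np.exp(f)))
```

When psi spans all functions on Z, the result recovers the full KL divergence; for a smaller class the value can only be smaller, matching the slide's point that D^MM_F approximates D from below.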
Function Class and Performance
Consider a linear function class:
F = { f_r := ∑_{i=1}^d r_i ψ_i }
The choice of function class F determines performance:
The mismatched divergence approximates the KL divergence and determines the error exponent of the mismatched universal test. When d is smaller than |Z|, the test is optimal for a restricted set of alternative distributions.
The dimension d determines the asymptotic variance¹: under H0,
lim_{n→∞} Var[n D^MM_F(Γn‖π0)] = (1/2) d
Problem: how to choose the function class F?
1. Unnikrishnan et al.
Our Contribution
1. The mismatched test, even with a small dimension d, is optimal for a large set of alternative distributions.
2. A framework for choosing F for the mismatched test.
How Powerful Is the Mismatched Test?
Example: 10 distributions. d = ?
[Figure: ten bar-chart distributions on a 9-point alphabet, including π0.]
When Is the Mismatched Test Optimal?
When does D^MM_F(π1‖π0) = D(π1‖π0)?
Fact (1)
When F includes the log-likelihood ratio (LLR).
Exponential family: E(F) = {µ : µ(z) ∝ exp(f(z)), f ∈ F}.
Fact (2)
When π0, π1 are in the same exponential family.
How many distributions are there in a d-dimensional exponential family?
ε-Extremal Distributions
πθ(z) ∝ exp(θf(z)) ∈ E(F)
Extremal distributions: as θ → ∞, πθ converges to a distribution on the boundary of E(F).
Example
F = span(ψ), ψ = [5, −1, −1], i.e. ψ(z1) = 5, ψ(z2) = ψ(z3) = −1.
What are the extremal distributions?
[1, 0, 0]: f = [5, −1, −1]
[0, 0.5, 0.5]: f = [−5, 1, 1]
[1/3, 1/3, 1/3]: f = [0, 0, 0]
F^ε(π) := {z : π(z) ≥ max_z π(z) − ε}
Definition
π is called ε-extremal if π(F^ε(π)) ≥ 1 − ε.
Example: [0.004, 0.499, 0.497].
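The set F^ε(π) and the ε-extremal property can be checked mechanically. A minimal sketch (function names are ours):

```python
import numpy as np

def eps_support(pi, eps):
    """F^eps(pi) = {z : pi(z) >= max_z pi(z) - eps}, as a boolean mask."""
    return pi >= pi.max() - eps

def is_eps_extremal(pi, eps):
    """pi is eps-extremal if it puts mass at least 1 - eps on F^eps(pi)."""
    return float(pi[eps_support(pi, eps)].sum()) >= 1 - eps
```

The slide's example [0.004, 0.499, 0.497] is ε-extremal for ε = 0.01: its ε-support is {z2, z3}, which carries mass 0.996.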
ε-Distinguishable Distributions
Distinguishable: D(π1‖π0) = D(π0‖π1) = ∞ ⇔ π1 ⊀ π0 and π0 ⊀ π1.
Example
π0(z1) = 0.5, π0(z2) = 0.5, π0(z3) = 0
π1(z1) = 0, π1(z2) = 0.5, π1(z3) = 0.5
Approximately distinguishable:
Example
π0(z1) = 0.49999, π0(z2) = 0.49999, π0(z3) = 0.00002
π1(z1) = 0.00002, π1(z2) = 0.49999, π1(z3) = 0.49999
Definition
π1, π2 are ε-distinguishable if F^ε(π1) ⊄ F^ε(π2) and F^ε(π2) ⊄ F^ε(π1).
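The definition above reduces to computing the two ε-supports and testing mutual non-containment. A sketch (function names are ours):

```python
import numpy as np

def eps_support(pi, eps):
    """F^eps(pi) = {z : pi(z) >= max_z pi(z) - eps}, as a boolean mask."""
    return pi >= pi.max() - eps

def eps_distinguishable(p1, p2, eps):
    """True if neither eps-support is contained in the other."""
    s1, s2 = eps_support(p1, eps), eps_support(p2, eps)
    return not bool(np.all(s1 <= s2)) and not bool(np.all(s2 <= s1))
```

On the slide's "approximately distinguishable" pair, the ε-supports for ε = 0.01 are {z1, z2} and {z2, z3}, so the two distributions are ε-distinguishable.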
The Number of ε-Distinguishable ε-Extremal Distributions
Definition
N(E): the maximum N such that for any small ε > 0, there exist N distributions in E that are ε-extremal and pairwise ε-distinguishable.
Proposition
Denote N(d) := max{N(E) : E is d-dimensional}.
It admits the following lower and upper bounds:
N(d) ≥ exp(⌊d/2⌋[log|Z| − log⌊d/2⌋ − 1])
N(d) ≤ exp((d + 1)(1 + log|Z| − log(d + 1)))
Many alternative distributions can be distinguished even with a small dimension d.
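Under our reading of the reconstructed bounds above, they are straightforward to evaluate numerically (a sketch; the function name is ours, and the ⌊d/2⌋ = 0 case is handled with the trivial lower bound of one distribution):

```python
import math

def n_bounds(d, Z):
    """Lower and upper bounds on N(d): the number of pairwise
    eps-distinguishable, eps-extremal distributions achievable by
    a d-dimensional exponential family over an alphabet of size Z."""
    k = d // 2
    lower = math.exp(k * (math.log(Z) - math.log(k) - 1)) if k > 0 else 1.0
    upper = math.exp((d + 1) * (1 + math.log(Z) - math.log(d + 1)))
    return lower, upper
```

For example, with |Z| = 20 and d = 4 the lower bound is (10/e)^2, about 13.5, already more than the 10 distributions in the earlier example.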
A Framework for Choosing the Function Class
Scenario: alternative distributions lie in a set S (not known to the algorithm). We observe p distributions from the set: π1, . . . , πp.
Objective function to be maximized:
max_F (1/p) ∑_{i=1}^p γ_i D^MM_F(π_i‖π0)
subject to dim(F) ≤ d
Rank-constrained optimization:
max_X (1/p) ∑_{i=1}^p γ_i (〈π_i, X_i〉 − log〈π0, e^{X_i}〉)
subject to rank(X) ≤ d
Here 〈µ, f〉 = ∑_z µ(z)f(z).
Algorithm
Iterative gradient projection:
1. Y^{k+1} = X^k + α_k ∇h(X^k)
2. X^{k+1} = P_S(Y^{k+1})
where P_S(Y) = argmin{‖Y − X‖ : rank(X) ≤ d}.
Provable local convergence.
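The two steps above amount to projected gradient ascent, where P_S can be computed by truncated SVD: by the Eckart-Young theorem, zeroing all but the top d singular values gives the nearest matrix of rank at most d in Frobenius norm. A minimal sketch with a generic gradient oracle standing in for ∇h (function names, step size, and iteration count are our assumptions):

```python
import numpy as np

def project_rank(Y, d):
    """P_S(Y): nearest matrix of rank <= d in Frobenius norm (Eckart-Young),
    computed by truncating the SVD of Y."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s[d:] = 0.0
    return (U * s) @ Vt

def gradient_projection(X0, grad_h, d, alpha=0.1, iters=200):
    """Iterate X <- P_S(X + alpha * grad_h(X)) to locally maximize h
    over matrices of rank at most d."""
    X = project_rank(X0, d)
    for _ in range(iters):
        X = project_rank(X + alpha * grad_h(X), d)
    return X
```

For the toy concave objective h(X) = -||X - M||^2 / 2 (gradient M - X) with a rank-1 constraint, the iteration settles at the best rank-1 approximation of M, consistent with the local-convergence claim on the slide.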
Numerical Experiment
Draw randomly from a set S of distributions:
π0 and π1, . . . , πp for feature extraction;
π1′ for testing.
Experiment steps:
Feature extraction: extract a d-dimensional function class F based on π0 and π1, . . . , πp.
Test: the alternative distribution is π1′. Estimate the probability of error by simulation.
Numerical Experiment
[Figure: ROC curves, Pr(φ = 1 | H1) versus Pr(φ = 1 | H0).]
S: 12-dimensional exponential family. |Z| = 20. n = 30.
Conclusion and Future Work
Conclusions:
Variance is as important as the error exponent.
There is a balance between variance and error exponent.
Feature extraction algorithm: exploit prior information to optimize the performance of the mismatched test.
Future Work:
Bound the probability of error based on finer statistics.
Extend to processes with long memory.
Explore other heuristics (such as the nuclear norm) for algorithm design.