Learning a Small Mixture of Trees M. Pawan Kumar Daphne Koller
http://ai.stanford.edu/~pawan
http://ai.stanford.edu/~koller

Aim: To efficiently learn a small mixture of trees that approximates an observed distribution

Results
Mixture of Trees
Minimizing α-Divergence
Problem Formulation
Modifying Fractional Covering
Minimizing the KL Divergence
Meila and Jordan, 2000 (MJ00)
Plotkin et al., 1995
Variables V = {v1, v2, …, vn}. Label xa ∈ Xa for variable va. Labeling x: an assignment of a label to every variable.
[Figure: three example tree structures t1, t2, t3 over the variables v1, v2, v3]
Hidden variable z selects a tree from {t1, t2, t3}
Pr(x | m) = ∑_{t ∈ T} ρt Pr(x | t)

Pr(x | t) = ∏_{(a,b) ∈ t} θab(xa, xb) / ∏_a θa(xa)^(da − 1)

θab(xa, xb): pairwise potentials
θa(xa): unary potentials
da: degree of va
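As a concrete reading of these formulas, here is a minimal sketch (hypothetical potentials over three binary variables in a chain v1–v2–v3, not taken from the poster) that evaluates Pr(x | t) from pairwise and unary tables and then mixes tree components with weights ρ:

```python
import numpy as np
from itertools import product

# Hypothetical chain tree t over binary variables v1 - v2 - v3.
# Pairwise potentials theta_ab are edge marginals, unary potentials are node marginals.
theta_12 = np.array([[0.3, 0.2], [0.1, 0.4]])    # theta_12(x1, x2)
theta_23 = np.array([[0.25, 0.15], [0.2, 0.4]])  # theta_23(x2, x3), consistent with theta_12
theta_2  = theta_12.sum(axis=0)                  # marginal of v2; only v2 has degree d_a = 2

def pr_x_given_t(x):
    x1, x2, x3 = x
    # Pr(x|t) = prod_edges theta_ab / prod_a theta_a^(d_a - 1)
    return theta_12[x1, x2] * theta_23[x2, x3] / theta_2[x2]

# Sanity check: the tree distribution sums to 1 over all labelings
print(sum(pr_x_given_t(x) for x in product((0, 1), repeat=3)))  # ~1.0

# Mixture of trees: Pr(x|m) = sum_t rho_t Pr(x|t)
# (two copies of the same tree here, purely to show the mixing formula)
rho = [0.6, 0.4]
trees = [pr_x_given_t, pr_x_given_t]
print(sum(r * f((0, 1, 0)) for r, f in zip(rho, trees)))
```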
Rényi, 1961
KL(θ1 || θ2) = ∑_x Pr(x | θ1) log [ Pr(x | θ1) / Pr(x | θ2) ]

θ1: observed distribution
θ2: simpler distribution
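As a quick numeric illustration (made-up distributions, not from the poster), the KL divergence can be evaluated directly:

```python
import numpy as np

# Hypothetical observed distribution p and simpler approximation q
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

# KL(p || q) = sum_x p(x) * log(p(x) / q(x))
kl = np.sum(p * np.log(p / q))
print(kl)  # ~0.025 nats
```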
EM Algorithm (relies heavily on initialization)
E-step: Estimate Pr(x | t) for each x and t
M-step: Obtain structure and potentials (Chow-Liu)
Focuses on the dominant mode
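For reference, a rough sketch of the Chow-Liu step (maximum-weight spanning tree over pairwise mutual information). This is illustrative only, not the authors' code: it assumes binary variables, and the per-sample weights stand in for E-step responsibilities.

```python
import numpy as np

def chow_liu_edges(X, weights):
    """Chow-Liu structure step: maximum-weight spanning tree over pairwise
    mutual information, from binary data X (m x n) with per-sample weights."""
    m, n = X.shape
    w = weights / weights.sum()
    mi = np.zeros((n, n))
    for a in range(n):
        for b in range(a + 1, n):
            joint = np.zeros((2, 2))
            for xa in (0, 1):
                for xb in (0, 1):
                    joint[xa, xb] = w[(X[:, a] == xa) & (X[:, b] == xb)].sum()
            pa, pb = joint.sum(1), joint.sum(0)
            nz = joint > 0
            mi[a, b] = mi[b, a] = (joint[nz] * np.log(joint[nz] / np.outer(pa, pb)[nz])).sum()
    # Prim's algorithm for the maximum-weight spanning tree
    in_tree, edges = [0], []
    while len(in_tree) < n:
        _, a, b = max((mi[i, j], i, j) for i in in_tree for j in range(n) if j not in in_tree)
        edges.append((a, b))
        in_tree.append(b)
    return edges

# Tiny synthetic example: v2 is a noisy copy of v1, v3 is independent noise
rng = np.random.default_rng(0)
v1 = rng.integers(0, 2, 500)
X = np.stack([v1, v1 ^ (rng.random(500) < 0.1).astype(int), rng.integers(0, 2, 500)], axis=1)
print(chow_liu_edges(X, np.ones(500)))  # expected to connect columns 0 and 1
```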
Rosset and Segal, 2002 (RS02)
m* = argmax_m min_i [ Pr(xi | m) / p(xi) ] = argmin_m max_i log [ p(xi) / Pr(xi | m) ]
MJ00 uses twice as many trees

Fractional Covering
Standard UCI datasets
Dataset    MJ00          RS02          Our
Agaricus   99.98 (.04)   100 (0)       100 (0)
Nursery    99.2 (.02)    98.35 (0.3)   99.28 (.13)
Splice     95.5 (0.3)    95.6 (.42)    96.1 (.15)
Learning Pictorial Structures
11 characters in an episode of "Buffy"
24,244 faces (first 80% train, last 20% test)
13 facial features (variables) + positions (labels)
Unary: logistic regression, Pairwise: m
Bag of visual words: 65.68%

RS02    Our
66.05   66.05
66.01   66.65
66.01   66.86
66.08   67.25
66.08   67.48
66.16   67.50
66.20   67.68
Dα(θ1 || θ2) = 1/(α − 1) log ∑_x Pr(x | θ1)^α Pr(x | θ2)^(1−α)

D1(θ1 || θ2) = KL(θ1 || θ2)
Generalization of KL Divergence
Fitting q to p: larger α is more inclusive
Minka, 2005
Use α = ∞
[Figure: fits of q to p for α = 1, α = 0.5, α = ∞]
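A small numeric check (hypothetical distributions, not from the poster) that Dα approaches KL as α → 1, and approaches the worst-case log-ratio max_x log(p(x)/q(x)) as α → ∞, which is exactly the quantity the max-min objective above optimizes:

```python
import numpy as np

def alpha_div(p, q, alpha):
    # D_alpha(p || q) = 1/(alpha - 1) * log sum_x p(x)^alpha q(x)^(1 - alpha)
    return np.log(np.sum(p**alpha * q**(1 - alpha))) / (alpha - 1)

p = np.array([0.5, 0.3, 0.2])   # hypothetical observed distribution
q = np.array([0.4, 0.4, 0.2])   # hypothetical approximation

kl = np.sum(p * np.log(p / q))
print(alpha_div(p, q, 1.0001), kl)                     # approximately equal
print(alpha_div(p, q, 200.0), np.max(np.log(p / q)))   # approximately equal
```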
Choose from all possible trees T = {tj} defined over n random variables
Matrix A where A(i,j) = Pr(xi|tj)
Vector b where b(i) = p(xi)
Vector ρ ≥ 0 such that ∑_j ρj = 1 (ρ ∈ P, the probability simplex)

max λ
s.t. aiᵀρ ≥ λ bi for all i
     ρ ∈ P
Constraints are defined over infinitely many variables (one weight per possible tree)
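On a small finite instance the same LP can be written out explicitly and handed to an off-the-shelf solver; the toy numbers below are made up for illustration. In the real problem the number of tree columns is unbounded, which is why the fractional covering machinery that follows is needed.

```python
import numpy as np
from scipy.optimize import linprog

# Toy instance: A[i, j] = Pr(x_i | t_j) for 3 samples and 2 candidate trees, b[i] = p(x_i)
A = np.array([[0.5, 0.1],
              [0.3, 0.5],
              [0.2, 0.4]])
b = np.array([0.4, 0.35, 0.25])

k = A.shape[1]
# Variables: [rho_1, ..., rho_k, lam]; maximize lam  <=>  minimize -lam
c = np.zeros(k + 1)
c[-1] = -1.0
# lam * b_i - a_i^T rho <= 0  for every sample i
A_ub = np.hstack([-A, b[:, None]])
b_ub = np.zeros(len(b))
# sum_j rho_j = 1
A_eq = np.hstack([np.ones((1, k)), np.zeros((1, 1))])
b_eq = np.array([1.0])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * (k + 1))
rho, lam = res.x[:k], res.x[-1]
print(rho, lam)  # mixture weights and the worst-case ratio min_i Pr(x_i|m)/p(x_i)
```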
min ∑_i exp(−α aiᵀρ / bi)
s.t. ρ ∈ P

Parameter α ∝ log(m)
Width w = max_{ρ ∈ P} max_i aiᵀρ / bi
Initial solution ρ0
Define λ0 = min_i aiᵀρ0 / bi
Define σ = ε / (4αw)

Finding an ε-optimal solution?
While λ < 2λ0, iterate:
  Define yi = exp(−α aiᵀρ / bi) / bi
  Find ρ' = argmax_{ρ ∈ P} yᵀAρ
  Update ρ = (1 − σ)ρ + σρ'
(minimizes a first-order approximation of the potential)
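A compact sketch of this iteration on a finite toy instance (same hypothetical A, b as above). The constants are chosen loosely; Plotkin et al. give the precise settings. In the actual algorithm the argmax step ranges over all tree distributions, which is what the modifications below address.

```python
import numpy as np

def fractional_covering(A, b, eps=0.1, max_iters=5000):
    """Plotkin-et-al-style covering iteration on a finite column set (toy sketch)."""
    m, k = A.shape
    rho = np.full(k, 1.0 / k)             # initial solution rho_0: uniform mixture
    w = (A.max(axis=1) / b).max()         # width
    lam0 = (A @ rho / b).min()
    alpha = np.log(m) / (lam0 * eps)      # parameter, grows like log(m)
    sigma = eps / (4 * alpha * w)         # step size
    lam = lam0
    for _ in range(max_iters):
        if lam >= 2 * lam0:               # coverage ratio doubled; start a new phase
            lam0 = lam
            alpha = np.log(m) / (lam * eps)
            sigma = eps / (4 * alpha * w)
        ratios = A @ rho / b
        y = np.exp(-alpha * ratios) / b   # y_i = exp(-alpha a_i.rho / b_i) / b_i
        j = np.argmax(y @ A)              # linear objective: best vertex of P = one column
        rho_prime = np.zeros(k)
        rho_prime[j] = 1.0
        rho = (1 - sigma) * rho + sigma * rho_prime
        lam = (A @ rho / b).min()
    return rho, lam

# Toy data (hypothetical, as above)
A = np.array([[0.5, 0.1], [0.3, 0.5], [0.2, 0.4]])
b = np.array([0.4, 0.35, 0.25])
print(fractional_covering(A, b))
```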
Drawbacks
(1) Slow convergence
(2) Singleton trees (probability = 0 for unseen test examples)
Overview
An intuitive objective function for learning a mixture of trees
Formulate the problem using fractional covering
Identify the drawbacks of fractional covering
Make suitable modifications to the algorithm
(1) Start with α = 1/w. Increase α by a factor of 2 if necessary.
Large step-size; large yi for numerical stability
(2) Minimize using a convex relaxation:

min ∑_i exp(−α Pr(xi|t) / p(xi))
s.t. Pr(xi|t) ≥ 0, ∑_i Pr(xi|t) ≤ 1
Constraint t ∈ T is dropped
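A quick way to sanity-check this relaxed problem on toy numbers is a generic constrained solver (SLSQP here). This only illustrates the relaxation itself, not the log-barrier scheme the poster actually uses (described next); p and α are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

p = np.array([0.4, 0.35, 0.25])   # hypothetical observed probabilities p(x_i)
alpha = 2.0                       # hypothetical covering parameter

def objective(q):                 # q_i plays the role of Pr(x_i | t)
    return np.sum(np.exp(-alpha * q / p))

res = minimize(objective,
               x0=np.full(3, 1.0 / 3),
               method="SLSQP",
               bounds=[(0.0, None)] * 3,
               constraints=[{"type": "ineq", "fun": lambda q: 1.0 - q.sum()}])
print(res.x, res.x.sum())         # relaxed "distribution"; total mass <= 1
```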
Initialize tolerance ε, parameter μ, factor f
Solve for the distribution Pr(· | t):
min f ∑_i exp(−α Pr(xi|t) / p(xi)) − ∑_i log(Pr(xi|t)) − ∑_i log(1 − Pr(xi|t))
Update f = μf until m/f ≤ ε
Log-barrier approach. Use Newton’s method.
To minimize g(z), update z = z − (∇²g(z))⁻¹ ∇g(z)
∇²g(z): Hessian, ∇g(z): gradient
Hessian with uniform off-diagonal elements
Matrix inversion in linear time
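The linear-time claim follows because such a Hessian is a diagonal matrix plus a rank-one term, so the Newton system can be solved with the Sherman-Morrison identity instead of a general matrix inverse. A small illustration with hypothetical values:

```python
import numpy as np

def solve_uniform_offdiag(diag, c, g):
    """Solve H z = g where H has diagonal `diag` and every off-diagonal entry equal to c.
    H = D + c * (1 1^T) with D = diag(diag - c), so Sherman-Morrison gives an O(n) solve."""
    d = diag - c                      # diagonal of D
    Dinv_g = g / d
    Dinv_1 = 1.0 / d
    correction = c * Dinv_1 * Dinv_g.sum() / (1.0 + c * Dinv_1.sum())
    return Dinv_g - correction

# Check against a dense solve on hypothetical numbers
n = 5
diag = np.array([4.0, 5.0, 6.0, 7.0, 8.0])
c = 0.5
g = np.arange(1.0, n + 1)
H = np.full((n, n), c)
np.fill_diagonal(H, diag)
print(np.allclose(solve_uniform_offdiag(diag, c, g), np.linalg.solve(H, g)))  # True
```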
Project to a tree distribution using Chow-Liu (may result in an increase in the objective)
Discard the best-explained sample and recompute t:
Enforce Pr(xi'|t) = 0, where i' = argmax_i Pr(xi|t) / p(xi)
Given an observed distribution p(·), find a mixture of trees by minimizing the α-divergence
Computationally expensive operation?
Use the previous solution: only one log-barrier optimization required
Convergence Properties
Maximum number of increases for α = O(log(log(m)))
Maximum discarded samples = m − 1
Polynomial time per iteration; polynomial-time convergence of the overall algorithm
Future Work
Mixtures in log-probability space?
Connections to Discrete AdaBoost?
Redistribute the discarded sample's mass: Pr(xi|t) = Pr(xi|t) + si Pr(xi'|t), where si = p(xi|t) / ∑_k p(xk|t)
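Read as pseudocode for the discard step above, this redistributes the discarded sample's probability mass over the remaining samples. A minimal sketch with made-up numbers; the exact normalization (whether the sum over k excludes i') is not recoverable from the transcript, and this sketch excludes it so that the total mass is preserved.

```python
import numpy as np

def discard_best_explained(q, p):
    """q[i] stands for Pr(x_i | t), p[i] for p(x_i).  Zero out the best-explained
    sample and redistribute its mass in proportion to the remaining q values."""
    i_star = int(np.argmax(q / p))   # i' = argmax_i Pr(x_i|t) / p(x_i)
    mass = q[i_star]
    q = q.copy()
    q[i_star] = 0.0                  # enforce Pr(x_{i'}|t) = 0
    s = q / q.sum()                  # s_i proportional to the remaining Pr(x_k|t)
    return q + s * mass              # Pr(x_i|t) += s_i * Pr(x_{i'}|t)

print(discard_best_explained(np.array([0.5, 0.3, 0.2]), np.array([0.4, 0.35, 0.25])))
```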
STANFORD