Marginal Inference in MRFs using Frank-Wolfe
David Belanger, Daniel Sheldon, Andrew McCallum
School of Computer Science, University of Massachusetts Amherst
{belanger,sheldon,mccallum}@cs.umass.edu
December 10, 2013
Table of Contents
1 Markov Random Fields
2 Frank-Wolfe for Marginal Inference
3 Optimality Guarantees and Convergence Rate
4 Beyond MRFs
5 Fancier FW
Markov Random Fields

\Phi_\theta(x) = \sum_{c \in C} \theta_c(x_c)

P(x) = \exp\big(\Phi_\theta(x) - \log Z\big)

Lift configurations to mean parameters:

x \to \mu, \qquad \Phi_\theta(x) \to \langle \theta, \mu \rangle
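To make the lift x → µ concrete, here is a minimal numpy sketch (not from the slides) for a toy MRF with a single pairwise clique over two binary variables: the configuration is mapped to an indicator vector µ, and ⟨θ, µ⟩ recovers Φ_θ(x). The array shapes, scores, and helper names are illustrative assumptions.

```python
import numpy as np

# Toy MRF: two binary variables with a single pairwise clique.
# theta holds one score per joint assignment of that clique: theta[x0, x1].
theta = np.array([[0.5, -1.0],
                  [2.0,  0.3]])

def mu_of(x0, x1):
    """Overcomplete indicator vector: 1 on the clique assignment (x0, x1), 0 elsewhere."""
    mu = np.zeros_like(theta)
    mu[x0, x1] = 1.0
    return mu

x = (1, 0)
score_x = theta[x]                     # Phi_theta(x): sum of clique scores (one clique here)
mu = mu_of(*x)
print(score_x, np.sum(theta * mu))     # <theta, mu> gives back the same score
```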
Marginal Inference

\mu_{MARG} = \mathbb{E}_{P_\theta}[\mu]

\mu_{MARG} = \arg\max_{\mu \in \mathcal{M}} \langle \mu, \theta \rangle + H_{\mathcal{M}}(\mu)

\bar\mu_{approx} = \arg\max_{\mu \in \mathcal{L}} \langle \mu, \theta \rangle + H_B(\mu), \qquad H_B(\mu) = \sum_{c \in C} W_c H(\mu_c)
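As a hedged sketch (not the authors' code) of the relaxed objective: the entropy surrogate H_B(µ) is a weighted sum of per-clique entropies, so it can be evaluated directly from the clique marginals. The counting numbers W_c, the two-clique example, and all helper names below are made up for illustration.

```python
import numpy as np

def clique_entropy(mu_c):
    """Shannon entropy H(mu_c) of one clique marginal, ignoring zero entries."""
    p = mu_c[mu_c > 0]
    return -np.sum(p * np.log(p))

def bethe_style_entropy(clique_marginals, counting_numbers):
    """H_B(mu) = sum_c W_c * H(mu_c): a weighted sum of per-clique entropies."""
    return sum(W_c * clique_entropy(mu_c)
               for mu_c, W_c in zip(clique_marginals, counting_numbers))

def relaxed_objective(theta_cliques, clique_marginals, counting_numbers):
    """<mu, theta> + H_B(mu), evaluated block-by-block over the cliques."""
    linear = sum(np.sum(t * m) for t, m in zip(theta_cliques, clique_marginals))
    return linear + bethe_style_entropy(clique_marginals, counting_numbers)

# Two illustrative cliques: one pairwise table and one unary marginal.
mus    = [np.full((2, 2), 0.25), np.array([0.6, 0.4])]
thetas = [np.ones((2, 2)), np.array([0.3, -0.3])]
print(relaxed_objective(thetas, mus, counting_numbers=[1.0, 1.0]))
```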
MAP Inference

\mu_{MAP} = \arg\max_{\mu \in \mathcal{M}} \langle \mu, \theta \rangle

[Diagram: θ fed into a black-box MAP solver, and into a gray-box MAP solver, each returning µ_MAP.]
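Frank-Wolfe only needs MAP as a subroutine, so any solver can sit behind this interface. Below is a deliberately naive, hedged stand-in: a brute-force "black box" MAP oracle for a tiny chain MRF, by exhaustive enumeration. It is illustrative only; the variable names and problem sizes are assumptions, and a real solver (graph cuts, ILP, dual decomposition, etc.) would replace it.

```python
import numpy as np
from itertools import product

def brute_force_map(unary, pairwise):
    """Exhaustive 'black box' MAP oracle for a tiny chain MRF.
    unary: list of length-K score vectors; pairwise: list of K-by-K score matrices
    for consecutive pairs. Returns the highest-scoring configuration and its score."""
    n, K = len(unary), len(unary[0])
    best_x, best_score = None, -np.inf
    for x in product(range(K), repeat=n):
        score = sum(unary[i][x[i]] for i in range(n))
        score += sum(pairwise[i][x[i], x[i + 1]] for i in range(n - 1))
        if score > best_score:
            best_x, best_score = x, score
    return best_x, best_score

rng = np.random.default_rng(0)
unary = [rng.normal(size=2) for _ in range(4)]
pairwise = [rng.normal(size=(2, 2)) for _ in range(3)]
print(brute_force_map(unary, pairwise))
```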
Marginal → MAP Reductions
Hazan and Jaakkola [2012]
Ermon et al. [2013]
Generic FW with Line Search

y_t = \arg\min_{x \in \mathcal{X}} \langle x, -\nabla f(x_{t-1}) \rangle

x_t = \arg\max_{\gamma \in [0,1]} f\big((1-\gamma)\, x_{t-1} + \gamma\, y_t\big)
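A minimal, hedged Python sketch of the generic loop above, assuming the feasible set's linear oracle is handed in as a function and using a simple grid search for the step size. The toy simplex example and all names are illustrative, not the authors' implementation.

```python
import numpy as np

def frank_wolfe(f, grad_f, linear_oracle, x0, iters=50):
    """Generic Frank-Wolfe with line search, written for maximization of a concave f.
    linear_oracle(g) must return argmax_{x in X} <x, g> over the feasible set X."""
    x = x0
    for _ in range(iters):
        y = linear_oracle(grad_f(x))                   # linear subproblem (the 'oracle')
        gammas = np.linspace(0.0, 1.0, 201)            # crude but adequate line search
        vals = [f((1 - g) * x + g * y) for g in gammas]
        best = gammas[int(np.argmax(vals))]
        x = (1 - best) * x + best * y
    return x

# Toy problem: maximize a concave quadratic over the probability simplex,
# whose vertices are the standard basis vectors, so the oracle is a simple argmax.
target = np.array([0.2, 0.5, 0.3])
f      = lambda x: -np.sum((x - target) ** 2)
grad_f = lambda x: -2.0 * (x - target)
oracle = lambda g: np.eye(3)[int(np.argmax(g))]
print(frank_wolfe(f, grad_f, oracle, x0=np.array([1.0, 0.0, 0.0])))   # approaches target
```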
Generic FW with Line Search

[Diagram: compute the gradient, hand -∇f(x_{t-1}) to the linear minimization oracle to get y_t, then run line search to obtain x_t; repeat.]
FW for Marginal Inference

[Diagram: compute the gradient ∇F(µ_t) = θ + ∇H(µ_t), hand the perturbed parameters θ̃ to the MAP inference oracle to get µ̃_MAP, then run line search to obtain µ_{t+1}; repeat.]
Subproblem Parametrization

F(\mu) = \langle \mu, \theta \rangle + \sum_{c \in C} W_c H(\mu_c)

\tilde\theta = \nabla F(\mu_t) = \theta + \sum_{c \in C} W_c \nabla H(\mu_c)
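A hedged sketch of how the MAP subproblem's parameters θ̃ could be formed, assuming a fully factorized entropy so that ∇H(µ_c) is entrywise −(1 + log µ_c); the epsilon guard and all names are illustrative assumptions. Note how the gradient blows up as entries of µ_c approach zero, which foreshadows the curvature issue discussed later.

```python
import numpy as np

def entropy_grad(mu_c, eps=1e-12):
    """Entrywise gradient of H(mu_c) = -sum mu log mu, i.e. -(1 + log mu).
    eps guards against log(0); the gradient still blows up near the boundary."""
    return -(1.0 + np.log(mu_c + eps))

def perturbed_potentials(theta_cliques, clique_marginals, counting_numbers):
    """theta_tilde_c = theta_c + W_c * grad H(mu_c): the parameters passed to the MAP oracle."""
    return [t + W * entropy_grad(m)
            for t, m, W in zip(theta_cliques, clique_marginals, counting_numbers)]

thetas = [np.ones((2, 2)), np.array([0.3, -0.3])]
mus    = [np.full((2, 2), 0.25), np.array([0.9, 0.1])]
for theta_tilde in perturbed_potentials(thetas, mus, counting_numbers=[1.0, 1.0]):
    print(theta_tilde)
```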
Line Search

[Diagram: µ_{t+1} lies on the segment between the current iterate µ_t and the MAP vertex µ̃_MAP.]

Computing the line search objective can scale with:
Bad: the number of possible values in the cliques.
Good: the number of cliques in the graph. (see paper)
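A hedged sketch of the line search step: the next iterate lives on the segment between µ_t and µ̃_MAP, so the step size is a one-dimensional concave maximization over γ, solved here by ternary search on a stand-in single-clique objective. The efficient, clique-decomposed evaluation referred to above is described in the paper; this sketch does not implement it, and all names are assumptions.

```python
import numpy as np

def line_search(F, mu_t, mu_map, tol=1e-6):
    """Ternary search for gamma in [0, 1] maximizing the one-dimensional restriction
    g(gamma) = F((1 - gamma) * mu_t + gamma * mu_map), assumed concave in gamma."""
    g = lambda gamma: F((1 - gamma) * mu_t + gamma * mu_map)
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if g(m1) < g(m2):
            lo = m1          # maximizer lies to the right of m1
        else:
            hi = m2          # maximizer lies to the left of m2
    gamma = 0.5 * (lo + hi)
    return gamma, (1 - gamma) * mu_t + gamma * mu_map

# Stand-in single-clique objective: <mu, theta> + H(mu) on a 3-state marginal.
theta  = np.array([0.3, -0.2, 0.8])
F      = lambda mu: mu @ theta - np.sum(mu[mu > 0] * np.log(mu[mu > 0]))
mu_t   = np.full(3, 1.0 / 3.0)
mu_map = np.array([0.0, 0.0, 1.0])     # a vertex returned by the MAP oracle
print(line_search(F, mu_t, mu_map))
```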
Experiment #1
Convergence Rate

Convergence Rate of Frank-Wolfe [Jaggi, 2013]

F(\mu^*) - F(\mu_t) \le \frac{2 C_F}{t+2}\,(1 + \delta)

where \frac{\delta C_F}{t+2} is the allowed MAP suboptimality at iteration t → but exact MAP is NP-hard.

How to deal with MAP hardness?
Use a MAP solver and hope for the best [Hazan and Jaakkola, 2012].
Relax to the local polytope.
Curvature + Convergence Rate

C_f = \sup_{x, s \in D;\ \gamma \in [0,1];\ y = x + \gamma(s - x)} \frac{2}{\gamma^2}\big(f(y) - f(x) - \langle y - x, \nabla f(x) \rangle\big)

[Plot: entropy versus prob x = 1, with the iterates µ_t, µ_{t+1} and the MAP vertex µ̃_MAP marked on the curve.]
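A hedged numeric illustration (an assumption-laden sketch, not from the paper) of why entropy terms make the curvature constant problematic: for the binary entropy pictured above, the bracketed quantity in the definition of C_f grows without bound as the iterate approaches a vertex of the simplex, because the entropy gradient diverges there.

```python
import numpy as np

def H(p):
    """Binary entropy of a Bernoulli(p) marginal (prob x = 1 equals p)."""
    q, r = max(p, 1e-300), max(1.0 - p, 1e-300)
    return -(q * np.log(q) + r * np.log(r))

def H_prime(p):
    """Derivative of the binary entropy: log((1 - p) / p)."""
    return np.log((1.0 - p) / p)

def curvature_term(x, s, gamma):
    """|2 / gamma^2 * (f(y) - f(x) - (y - x) f'(x))| with y = x + gamma (s - x), f = H."""
    y = x + gamma * (s - x)
    return (2.0 / gamma ** 2) * abs(H(y) - H(x) - (y - x) * H_prime(x))

# Fix the FW direction s at the vertex prob x = 1 and slide the iterate x toward 0:
# the term keeps growing because H'(x) diverges at the boundary of the simplex.
for x in [0.4, 0.1, 1e-3, 1e-6, 1e-9]:
    print(f"x = {x:g}   curvature term = {curvature_term(x, s=1.0, gamma=0.5):.2f}")
```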
Experiment #2
Beyond MRFs

Question
Are MRFs the right Gibbs distributions on which to use Frank-Wolfe?

Problem Family                    | MAP Algorithm        | Marginal Algorithm
tree-structured graphical models  | Viterbi              | Forward-Backward
loopy graphical models            | Max-Product BP       | Sum-Product BP
Directed Spanning Tree            | Chu-Liu-Edmonds      | Matrix-Tree Theorem
Bipartite Matching                | Hungarian Algorithm  | ×
Norm-Regularized Marginal Inference

\mu_{MARG} = \arg\max_{\mu \in \mathcal{M}} \langle \mu, \theta \rangle + H_{\mathcal{M}}(\mu) + \lambda R(\mu)

Harchaoui et al. [2013]

Local linear oracle for MRFs?

\tilde\mu_t = \arg\max_{\mu \in \mathcal{M} \cap B_r(\mu_t)} \langle \mu, \theta \rangle

Garber and Hazan [2013]
Conclusion

We need to figure out how to handle the entropy gradient.
There are plenty of extensions to other Gibbs distributions and regularizers.
Further Reading I
Stefano Ermon, Carla Gomes, Ashish Sabharwal, and Bart Selman. Taming the curse of dimensionality: Discrete integration by hashing and optimization. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 334–342, 2013.
D. Garber and E. Hazan. A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. arXiv e-prints, January 2013.
Zaid Harchaoui, Anatoli Juditsky, and Arkadi Nemirovski. Conditional gradient algorithms for norm-regularized smooth convex optimization. arXiv preprint arXiv:1302.2325, 2013.
Tamir Hazan and Tommi S. Jaakkola. On the partition function and random maximum a-posteriori perturbations. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 991–998, 2012.
Bert Huang and Tony Jebara. Approximating the permanent with belief propagation. arXiv preprint arXiv:0908.1769, 2009.
Further Reading II
Mark Huber. Exact sampling from perfect matchings of dense regular bipartite graphs. Algorithmica, 44(3):183–193, 2006.
Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 427–435, 2013.
James Petterson, Tiberio Caetano, Julian McAuley, and Jin Yu. Exponential family graph matching and ranking. 2009.
Tim Roughgarden and Michael Kearns. Marginals-to-models reducibility. In Advances in Neural Information Processing Systems, pages 1043–1051, 2013.
Maksims Volkovs and Richard S. Zemel. Efficient sampling for bipartite matching problems. In Advances in Neural Information Processing Systems, pages 1322–1330, 2012.
Pascal O. Vontobel. The Bethe permanent of a non-negative matrix. In Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, pages 341–346. IEEE, 2010.
Finding the Marginal Matching

Sampling
Expensive, but doable [Huber, 2006, Volkovs and Zemel, 2012].
Used for maximum-likelihood learning [Petterson et al., 2009].

Sum-Product
Also requires the Bethe approximation.
Works well in practice [Huang and Jebara, 2009] and in theory [Vontobel, 2010].

Frank-Wolfe
Basically the same algorithm as for graphical models.
Same issue with curvature.