Marginal Inference in MRFs using Frank-Wolfe
David Belanger, Daniel Sheldon, Andrew McCallum
School of Computer Science, University of Massachusetts Amherst
{belanger,sheldon,mccallum}@cs.umass.edu
December 10, 2013
Table of Contents
1 Markov Random Fields
2 Frank-Wolfe for Marginal Inference
3 Optimality Guarantees and Convergence Rate
4 Beyond MRFs
5 Fancier FW
Markov Random Fields

\Phi_\theta(x) = \sum_{c \in C} \theta_c(x_c)

P(x) = \exp\big(\Phi_\theta(x) - \log Z\big)

Lift configurations to mean parameters:

x \to \mu, \qquad \Phi_\theta(x) \to \langle \theta, \mu \rangle
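To make the lift x → µ concrete, here is a minimal numpy sketch (not from the slides) for a toy MRF with a single pairwise clique over two binary variables: the configuration is mapped to an indicator vector µ, and ⟨θ, µ⟩ recovers Φ_θ(x). The array shapes, scores, and helper names are illustrative assumptions.

```python
import numpy as np

# Toy MRF: two binary variables with a single pairwise clique.
# theta holds one score per joint assignment of that clique: theta[x0, x1].
theta = np.array([[0.5, -1.0],
                  [2.0,  0.3]])

def mu_of(x0, x1):
    """Overcomplete indicator vector: 1 on the clique assignment (x0, x1), 0 elsewhere."""
    mu = np.zeros_like(theta)
    mu[x0, x1] = 1.0
    return mu

x = (1, 0)
score_x = theta[x]                     # Phi_theta(x): sum of clique scores (one clique here)
mu = mu_of(*x)
print(score_x, np.sum(theta * mu))     # <theta, mu> gives back the same score
```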
Marginal Inference

\mu_{MARG} = \mathbb{E}_{P_\theta}[\mu]

\mu_{MARG} = \arg\max_{\mu \in \mathcal{M}} \langle \mu, \theta \rangle + H_{\mathcal{M}}(\mu)

\bar\mu_{approx} = \arg\max_{\mu \in \mathcal{L}} \langle \mu, \theta \rangle + H_B(\mu), \qquad H_B(\mu) = \sum_{c \in C} W_c H(\mu_c)
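As a hedged sketch (not the authors' code) of the relaxed objective: the entropy surrogate H_B(µ) is a weighted sum of per-clique entropies, so it can be evaluated directly from the clique marginals. The counting numbers W_c, the two-clique example, and all helper names below are made up for illustration.

```python
import numpy as np

def clique_entropy(mu_c):
    """Shannon entropy H(mu_c) of one clique marginal, ignoring zero entries."""
    p = mu_c[mu_c > 0]
    return -np.sum(p * np.log(p))

def bethe_style_entropy(clique_marginals, counting_numbers):
    """H_B(mu) = sum_c W_c * H(mu_c): a weighted sum of per-clique entropies."""
    return sum(W_c * clique_entropy(mu_c)
               for mu_c, W_c in zip(clique_marginals, counting_numbers))

def relaxed_objective(theta_cliques, clique_marginals, counting_numbers):
    """<mu, theta> + H_B(mu), evaluated block-by-block over the cliques."""
    linear = sum(np.sum(t * m) for t, m in zip(theta_cliques, clique_marginals))
    return linear + bethe_style_entropy(clique_marginals, counting_numbers)

# Two illustrative cliques: one pairwise table and one unary marginal.
mus    = [np.full((2, 2), 0.25), np.array([0.6, 0.4])]
thetas = [np.ones((2, 2)), np.array([0.3, -0.3])]
print(relaxed_objective(thetas, mus, counting_numbers=[1.0, 1.0]))
```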
MAP Inference

\mu_{MAP} = \arg\max_{\mu \in \mathcal{M}} \langle \mu, \theta \rangle

[Diagram: θ fed into a black-box MAP solver, and into a gray-box MAP solver, each returning µ_MAP.]
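Frank-Wolfe only needs MAP as a subroutine, so any solver can sit behind this interface. Below is a deliberately naive, hedged stand-in: a brute-force "black box" MAP oracle for a tiny chain MRF, by exhaustive enumeration. It is illustrative only; the variable names and problem sizes are assumptions, and a real solver (graph cuts, ILP, dual decomposition, etc.) would replace it.

```python
import numpy as np
from itertools import product

def brute_force_map(unary, pairwise):
    """Exhaustive 'black box' MAP oracle for a tiny chain MRF.
    unary: list of length-K score vectors; pairwise: list of K-by-K score matrices
    for consecutive pairs. Returns the highest-scoring configuration and its score."""
    n, K = len(unary), len(unary[0])
    best_x, best_score = None, -np.inf
    for x in product(range(K), repeat=n):
        score = sum(unary[i][x[i]] for i in range(n))
        score += sum(pairwise[i][x[i], x[i + 1]] for i in range(n - 1))
        if score > best_score:
            best_x, best_score = x, score
    return best_x, best_score

rng = np.random.default_rng(0)
unary = [rng.normal(size=2) for _ in range(4)]
pairwise = [rng.normal(size=(2, 2)) for _ in range(3)]
print(brute_force_map(unary, pairwise))
```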
Marginal → MAP Reductions
Hazan and Jaakkola [2012]
Ermon et al. [2013]
Generic FW with Line Search

y_t = \arg\min_{x \in \mathcal{X}} \langle x, -\nabla f(x_{t-1}) \rangle

x_t = \arg\max_{\gamma \in [0,1]} f\big((1-\gamma)\, x_{t-1} + \gamma\, y_t\big)
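A minimal, hedged Python sketch of the generic loop above, assuming the feasible set's linear oracle is handed in as a function and using a simple grid search for the step size. The toy simplex example and all names are illustrative, not the authors' implementation.

```python
import numpy as np

def frank_wolfe(f, grad_f, linear_oracle, x0, iters=50):
    """Generic Frank-Wolfe with line search, written for maximization of a concave f.
    linear_oracle(g) must return argmax_{x in X} <x, g> over the feasible set X."""
    x = x0
    for _ in range(iters):
        y = linear_oracle(grad_f(x))                   # linear subproblem (the 'oracle')
        gammas = np.linspace(0.0, 1.0, 201)            # crude but adequate line search
        vals = [f((1 - g) * x + g * y) for g in gammas]
        best = gammas[int(np.argmax(vals))]
        x = (1 - best) * x + best * y
    return x

# Toy problem: maximize a concave quadratic over the probability simplex,
# whose vertices are the standard basis vectors, so the oracle is a simple argmax.
target = np.array([0.2, 0.5, 0.3])
f      = lambda x: -np.sum((x - target) ** 2)
grad_f = lambda x: -2.0 * (x - target)
oracle = lambda g: np.eye(3)[int(np.argmax(g))]
print(frank_wolfe(f, grad_f, oracle, x0=np.array([1.0, 0.0, 0.0])))   # approaches target
```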
Generic FW with Line Search

[Diagram: compute the gradient, hand -∇f(x_{t-1}) to the linear minimization oracle to get y_t, then run line search to obtain x_t; repeat.]
FW for Marginal Inference

[Diagram: compute the gradient ∇F(µ_t) = θ + ∇H(µ_t), hand the perturbed parameters θ̃ to the MAP inference oracle to get µ̃_MAP, then run line search to obtain µ_{t+1}; repeat.]
Subproblem Parametrization

F(\mu) = \langle \mu, \theta \rangle + \sum_{c \in C} W_c H(\mu_c)

\tilde\theta = \nabla F(\mu_t) = \theta + \sum_{c \in C} W_c \nabla H(\mu_c)
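A hedged sketch of how the MAP subproblem's parameters θ̃ could be formed, assuming a fully factorized entropy so that ∇H(µ_c) is entrywise −(1 + log µ_c); the epsilon guard and all names are illustrative assumptions. Note how the gradient blows up as entries of µ_c approach zero, which foreshadows the curvature issue discussed later.

```python
import numpy as np

def entropy_grad(mu_c, eps=1e-12):
    """Entrywise gradient of H(mu_c) = -sum mu log mu, i.e. -(1 + log mu).
    eps guards against log(0); the gradient still blows up near the boundary."""
    return -(1.0 + np.log(mu_c + eps))

def perturbed_potentials(theta_cliques, clique_marginals, counting_numbers):
    """theta_tilde_c = theta_c + W_c * grad H(mu_c): the parameters passed to the MAP oracle."""
    return [t + W * entropy_grad(m)
            for t, m, W in zip(theta_cliques, clique_marginals, counting_numbers)]

thetas = [np.ones((2, 2)), np.array([0.3, -0.3])]
mus    = [np.full((2, 2), 0.25), np.array([0.9, 0.1])]
for theta_tilde in perturbed_potentials(thetas, mus, counting_numbers=[1.0, 1.0]):
    print(theta_tilde)
```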
Line Search

[Diagram: µ_{t+1} lies on the segment between the current iterate µ_t and the MAP vertex µ̃_MAP.]

Computing the line search objective can scale with:
Bad: the number of possible values in the cliques.
Good: the number of cliques in the graph. (see paper)
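A hedged sketch of the line search step: the next iterate lives on the segment between µ_t and µ̃_MAP, so the step size is a one-dimensional concave maximization over γ, solved here by ternary search on a stand-in single-clique objective. The efficient, clique-decomposed evaluation referred to above is described in the paper; this sketch does not implement it, and all names are assumptions.

```python
import numpy as np

def line_search(F, mu_t, mu_map, tol=1e-6):
    """Ternary search for gamma in [0, 1] maximizing the one-dimensional restriction
    g(gamma) = F((1 - gamma) * mu_t + gamma * mu_map), assumed concave in gamma."""
    g = lambda gamma: F((1 - gamma) * mu_t + gamma * mu_map)
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if g(m1) < g(m2):
            lo = m1          # maximizer lies to the right of m1
        else:
            hi = m2          # maximizer lies to the left of m2
    gamma = 0.5 * (lo + hi)
    return gamma, (1 - gamma) * mu_t + gamma * mu_map

# Stand-in single-clique objective: <mu, theta> + H(mu) on a 3-state marginal.
theta  = np.array([0.3, -0.2, 0.8])
F      = lambda mu: mu @ theta - np.sum(mu[mu > 0] * np.log(mu[mu > 0]))
mu_t   = np.full(3, 1.0 / 3.0)
mu_map = np.array([0.0, 0.0, 1.0])     # a vertex returned by the MAP oracle
print(line_search(F, mu_t, mu_map))
```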
Experiment #1
Convergence Rate

Convergence Rate of Frank-Wolfe [Jaggi, 2013]

F(\mu^*) - F(\mu_t) \le \frac{2 C_F}{t+2}\,(1 + \delta)

where \frac{\delta C_F}{t+2} is the allowed MAP suboptimality at iteration t → but exact MAP is NP-hard.

How to deal with MAP hardness?
Use a MAP solver and hope for the best [Hazan and Jaakkola, 2012].
Relax to the local polytope.
Curvature + Convergence Rate

C_f = \sup_{x, s \in D;\ \gamma \in [0,1];\ y = x + \gamma(s - x)} \frac{2}{\gamma^2}\big(f(y) - f(x) - \langle y - x, \nabla f(x) \rangle\big)

[Plot: entropy versus prob x = 1, with the iterates µ_t, µ_{t+1} and the MAP vertex µ̃_MAP marked on the curve.]
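A hedged numeric illustration (an assumption-laden sketch, not from the paper) of why entropy terms make the curvature constant problematic: for the binary entropy pictured above, the bracketed quantity in the definition of C_f grows without bound as the iterate approaches a vertex of the simplex, because the entropy gradient diverges there.

```python
import numpy as np

def H(p):
    """Binary entropy of a Bernoulli(p) marginal (prob x = 1 equals p)."""
    q, r = max(p, 1e-300), max(1.0 - p, 1e-300)
    return -(q * np.log(q) + r * np.log(r))

def H_prime(p):
    """Derivative of the binary entropy: log((1 - p) / p)."""
    return np.log((1.0 - p) / p)

def curvature_term(x, s, gamma):
    """|2 / gamma^2 * (f(y) - f(x) - (y - x) f'(x))| with y = x + gamma (s - x), f = H."""
    y = x + gamma * (s - x)
    return (2.0 / gamma ** 2) * abs(H(y) - H(x) - (y - x) * H_prime(x))

# Fix the FW direction s at the vertex prob x = 1 and slide the iterate x toward 0:
# the term keeps growing because H'(x) diverges at the boundary of the simplex.
for x in [0.4, 0.1, 1e-3, 1e-6, 1e-9]:
    print(f"x = {x:g}   curvature term = {curvature_term(x, s=1.0, gamma=0.5):.2f}")
```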
Experiment #2
Beyond MRFs

Question
Are MRFs the right Gibbs distributions on which to use Frank-Wolfe?

Problem Family                    | MAP Algorithm        | Marginal Algorithm
tree-structured graphical models  | Viterbi              | Forward-Backward
loopy graphical models            | Max-Product BP       | Sum-Product BP
Directed Spanning Tree            | Chu-Liu-Edmonds      | Matrix-Tree Theorem
Bipartite Matching                | Hungarian Algorithm  | ×
Norm-Regularized Marginal Inference

\mu_{MARG} = \arg\max_{\mu \in \mathcal{M}} \langle \mu, \theta \rangle + H_{\mathcal{M}}(\mu) + \lambda R(\mu)

Harchaoui et al. [2013]

Local linear oracle for MRFs?

\tilde\mu_t = \arg\max_{\mu \in \mathcal{M} \cap B_r(\mu_t)} \langle \mu, \theta \rangle

Garber and Hazan [2013]
Conclusion

We need to figure out how to handle the entropy gradient.
There are plenty of extensions to other Gibbs distributions and regularizers.
Further Reading I
Stefano Ermon, Carla Gomes, Ashish Sabharwal, and Bart Selman. Taming the curse of dimensionality: Discrete integration by hashing and optimization. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 334–342, 2013.
D. Garber and E. Hazan. A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. arXiv e-prints, January 2013.
Zaid Harchaoui, Anatoli Juditsky, and Arkadi Nemirovski. Conditional gradient algorithms for norm-regularized smooth convex optimization. arXiv preprint arXiv:1302.2325, 2013.
Tamir Hazan and Tommi S. Jaakkola. On the partition function and random maximum a-posteriori perturbations. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 991–998, 2012.
Bert Huang and Tony Jebara. Approximating the permanent with belief propagation. arXiv preprint arXiv:0908.1769, 2009.
Further Reading II
Mark Huber. Exact sampling from perfect matchings of dense regular bipartite graphs. Algorithmica, 44(3):183–193, 2006.
Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 427–435, 2013.
James Petterson, Tiberio Caetano, Julian McAuley, and Jin Yu. Exponential family graph matching and ranking. 2009.
Tim Roughgarden and Michael Kearns. Marginals-to-models reducibility. In Advances in Neural Information Processing Systems, pages 1043–1051, 2013.
Maksims Volkovs and Richard S. Zemel. Efficient sampling for bipartite matching problems. In Advances in Neural Information Processing Systems, pages 1322–1330, 2012.
Pascal O. Vontobel. The Bethe permanent of a non-negative matrix. In Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, pages 341–346. IEEE, 2010.
Finding the Marginal Matching

Sampling
Expensive, but doable [Huber, 2006, Volkovs and Zemel, 2012].
Used for maximum-likelihood learning [Petterson et al., 2009].

Sum-Product
Also requires the Bethe approximation.
Works well in practice [Huang and Jebara, 2009] and in theory [Vontobel, 2010].

Frank-Wolfe
Basically the same algorithm as for graphical models.
Same issue with curvature.