Optimization perspective on approximate Bayesian inference
Juho Kim
December 6, 2016
Project goals
• Solve an approximate Bayesian inference problem from an optimization perspective.
• Consider variational Bayesian inference based on various divergence measures.
• Analyze the convergence of each optimization method empirically and, where possible, theoretically.
Inference problem
Given a dataset y = {y_1, …, y_n}, Bayes' rule gives the posterior:

p(θ | y) = p(y | θ) p(θ) / p(y)

Computing this posterior distribution is known as the inference problem. But the normalizing constant (the model evidence)

p(y) = ∫ p(y, θ) dθ

is an integral that can be very high-dimensional and difficult to compute.
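To make the evidence integral concrete, here is a toy one-dimensional case (an assumed example, not from the slides): a Bernoulli likelihood with a Beta(2, 2) prior, where θ is a single number and p(y) can be brute-forced on a grid. In realistic models θ is high-dimensional and no such grid is feasible, which is what motivates approximate inference.

```python
import math

# Assumed toy model: y_i ~ Bernoulli(theta), theta ~ Beta(2, 2).
a, b = 2.0, 2.0
y = [1, 1, 0, 1, 0, 1, 1]
k, n = sum(y), len(y)

def beta_pdf(t, a, b):
    c = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return c * t ** (a - 1) * (1.0 - t) ** (b - 1)

# Midpoint-rule approximation of p(y) = \int p(y | theta) p(theta) dtheta.
G = 100_000
h = 1.0 / G
evidence = sum(
    (t ** k) * ((1.0 - t) ** (n - k)) * beta_pdf(t, a, b) * h
    for t in ((i + 0.5) * h for i in range(G))
)
```

Conjugacy gives the exact answer B(a+k, b+n−k) / B(a, b) for comparison; the grid estimate matches it closely here precisely because θ is one-dimensional.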
Approximate Bayesian inference
There are two main approaches to approximate inference, sampling-based methods such as MCMC and optimization-based methods such as variational inference. They have complementary strengths and weaknesses.
Variational Bayesian inference
In variational Bayesian inference,
• Find an approximate and tractable density that is maximally similar to the true posterior distribution.
• Formulate a density estimation problem as an optimization problem.
We can use the Kullback-Leibler (KL) divergence as the similarity measure:

KL(q(θ) || p(θ | y)) = E_q[log q(θ) − log p(θ | y)]

Then we minimize this KL divergence over q. But we still cannot compute it directly, because it depends on the intractable posterior p(θ | y).
Variational lower-bound
We can instead solve an equivalent optimization problem. Removing the intractable term log p(y), which is constant in q, leaves the objective known as the variational lower bound, or evidence lower bound (ELBO).
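The step from the KL objective to the ELBO can be written out explicitly (a standard derivation; log p(y) is constant in q):

```latex
\begin{aligned}
\mathrm{KL}\big(q(\theta)\,\|\,p(\theta \mid y)\big)
  &= \mathbb{E}_{q}\big[\log q(\theta) - \log p(\theta, y)\big] + \log p(y) \\
\Longrightarrow\quad \log p(y)
  &= \underbrace{\mathbb{E}_{q}\big[\log p(\theta, y) - \log q(\theta)\big]}_{\mathrm{ELBO}(q)}
   + \mathrm{KL}\big(q \,\|\, p(\theta \mid y)\big)
  \;\ge\; \mathrm{ELBO}(q).
\end{aligned}
```

Maximizing the ELBO over q is therefore equivalent to minimizing the KL divergence to the posterior, without ever evaluating p(y).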
Stochastic variational inference
Suppose the joint distribution factorizes across data points:

p(θ, D) = p_0(θ) ∏_n p(y_n | θ)

We can then run stochastic (natural) gradient descent on this optimization problem, i.e. stochastic variational inference.
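A minimal sketch of the idea, on an assumed toy model not in the slides: y_i ~ N(θ, 1) with prior θ ~ N(0, 1) and Gaussian variational family q(θ) = N(m, s²). Each step rescales a minibatch log-likelihood by n/B and ascends a reparameterized stochastic gradient of the ELBO (plain SGD here rather than natural gradient).

```python
import math, random

random.seed(0)
n, B = 100, 10
y = [random.gauss(1.0, 1.0) for _ in range(n)]

m, ls = 0.0, 0.0                        # variational mean and log std-dev
for t in range(20000):
    lr = 0.05 / (100.0 + t)             # Robbins-Monro step sizes
    batch = random.sample(y, B)
    eps = random.gauss(0.0, 1.0)
    s = math.exp(ls)
    theta = m + s * eps                 # reparameterization trick
    # d/dtheta of log p0(theta) + (n/B) * sum_i log p(y_i | theta)
    dlogp = -theta + (n / B) * sum(yi - theta for yi in batch)
    m += lr * dlogp
    ls += lr * (dlogp * s * eps + 1.0)  # the +1 comes from q's entropy term

post_mean = sum(y) / (n + 1.0)          # exact conjugate posterior mean
```

By conjugacy the exact posterior is N(Σy / (n+1), 1/(n+1)), so m should approach post_mean and exp(ls) the posterior standard deviation 1/√(n+1) ≈ 0.1.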
When KL divergence does not work well
• Variational inference does not work well for non-smooth potentials.
• KL divergence tends to underestimate the support of the posterior due to its zero-forcing behavior:
→ KL(q || p(θ | y)) becomes infinite wherever p(θ | y) = 0 and q(θ) > 0, so the optimal variational distribution q must be zero wherever p(θ | y) = 0.
• In this example, the result of variational inference will fit a delta function.
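The zero-forcing penalty is easy to see numerically on discrete distributions (an assumed example): p puts zero mass on the last state, so any q with mass there makes KL(q || p) infinite.

```python
import math

p      = [0.5, 0.5, 0.0]   # "posterior" with a hole in its support
q_bad  = [0.4, 0.4, 0.2]   # q > 0 where p = 0
q_good = [0.5, 0.5, 0.0]   # q = 0 wherever p = 0

def kl(q, p):
    total = 0.0
    for qi, pi in zip(q, p):
        if qi == 0.0:
            continue               # 0 * log 0 = 0 by convention
        if pi == 0.0:
            return math.inf        # q > 0 where p = 0: infinite penalty
        total += qi * math.log(qi / pi)
    return total
```

Any minimizer of KL(q || p) must therefore shrink its support to fit inside p's, which is exactly the delta-function collapse described above.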
One possible solution to this issue:
→ Use a different optimization formulation based on another divergence measure: expectation propagation (EP), which minimizes KL(p || q) instead of KL(q || p).
EP also has issues
EP tends to overestimate the support of the original distribution.
→ Try other divergence measures, such as Rényi's alpha divergence, f-divergences, or other operator-based divergences.
Alternative 1 – alpha divergence
• The two forms of KL divergence are members of the family of alpha divergences. In one standard parameterization (consistent with the experiments below),

D_α(p || q) = (1 − ∫ p(θ)^α q(θ)^(1−α) dθ) / (α(1 − α))

the limit α → 0 recovers KL(q || p) and α → 1 recovers KL(p || q).
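The two KL limits can be checked numerically on discrete distributions (assumed example data), using the parameterization D_α(p || q) = (1 − Σ_i p_i^α q_i^(1−α)) / (α(1 − α)):

```python
import math

p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

def alpha_div(p, q, alpha):
    s = sum(pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q))
    return (1.0 - s) / (alpha * (1.0 - alpha))

def kl(a, b):
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))
```

Evaluating alpha_div near α = 0 approaches kl(q, p), and near α = 1 approaches kl(p, q), matching the two endpoints used in the experiments.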
Inference based on alpha divergence
We can solve an equivalent optimization problem following the idea of variational inference.
We can derive a lower bound analogous to the ELBO for α ≠ 1:
Inference based on alpha divergence
• Unfortunately, this lower bound is less tractable than the ELBO.
• Apply Monte Carlo methods to estimate the lower bound: draw θ_k ~ q(θ) for k = 1, …, K.
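The slide's exact estimator is not shown; as a sketch, here is a Monte Carlo estimate in the style of Li and Turner's variational Rényi bound, L_α = (1/(1−α)) log E_q[(p(θ, y)/q(θ))^(1−α)] (note their α convention may differ from the deck's), on an assumed toy conjugate model θ ~ N(0, 1), y | θ ~ N(θ, 1):

```python
import math, random

random.seed(0)
y, alpha, K = 1.0, 0.5, 100_000
qm, qs = 0.4, 0.6                      # a fixed approximate q = N(qm, qs^2)

def log_norm(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

vals = []
for _ in range(K):
    th = random.gauss(qm, qs)
    log_w = log_norm(th, 0.0, 1.0) + log_norm(y, th, 1.0) - log_norm(th, qm, qs)
    vals.append((1.0 - alpha) * log_w)

# log-mean-exp of the weighted samples, for numerical stability
mx = max(vals)
bound = (mx + math.log(sum(math.exp(v - mx) for v in vals) / K)) / (1.0 - alpha)

log_evidence = log_norm(y, 0.0, math.sqrt(2.0))   # exact: p(y) = N(y; 0, sqrt(2))
```

For α in (0, 1) this quantity lower-bounds the log evidence, which the exact Gaussian evidence lets us verify; the variance of such estimators is exactly why the slide flags the need for a stable gradient-based method.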
• Future work: Find a stable gradient-based optimization method.
Simple experiment
1. Estimate a polynomial target function.
2. Estimate a 2D Gaussian distribution.
Simple experiment
Alpha = −1
Alpha = −0.5
Alpha = 0 (the same as KL divergence minimization)
Alpha = 0.5
Alpha = 1 (the same as expectation propagation)
Alternative 2 – chi-square divergence
Minimizing the chi-square divergence χ²(p || q) is equivalent to minimizing a quantity (sometimes called the CUBO) that is an upper bound on the log model evidence log p(y). By maximizing the ELBO and minimizing this chi-square upper bound together, we can sandwich log p(y) and may estimate the distribution more accurately.
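The sandwich can be checked exactly on a discrete toy joint (assumed example), using the chi-square-based upper bound in the style of Dieng et al., CUBO = (1/2) log E_q[(p(θ, y)/q(θ))²]:

```python
import math

joint = [0.2, 0.1, 0.1]        # p(theta, y) for theta in {0, 1, 2}
q     = [0.5, 0.3, 0.2]        # a variational distribution over theta

log_evidence = math.log(sum(joint))                        # log p(y)
elbo = sum(qi * (math.log(ji) - math.log(qi)) for qi, ji in zip(q, joint))
cubo = 0.5 * math.log(sum(ji ** 2 / qi for qi, ji in zip(q, joint)))
```

Both bounds follow from Jensen's inequality applied to the importance weights p(θ, y)/q(θ), and here ELBO ≤ log p(y) ≤ CUBO holds with strict gaps because q differs from the true posterior.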
Alternative 3 – f-divergence
D_f(p || q) = ∫ q(θ) f(p(θ) / q(θ)) dθ, where f: ℝ₊ → ℝ is a convex, lower-semicontinuous function with f(1) = 0.
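Specific choices of f recover the divergences above; a discrete check (assumed example data):

```python
import math

p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

def f_div(p, q, f):
    # D_f(p || q) = sum_i q_i * f(p_i / q_i)
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

kl_pq  = f_div(p, q, lambda t: t * math.log(t))    # f(t) = t log t  -> KL(p || q)
kl_qp  = f_div(p, q, lambda t: -math.log(t))       # f(t) = -log t   -> KL(q || p)
chi_sq = f_div(p, q, lambda t: (t - 1.0) ** 2)     # f(t) = (t-1)^2  -> chi^2(p || q)
```

Each f here is convex on ℝ₊ with f(1) = 0, as the definition requires.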
Conclusion
• Consider optimization-based variational Bayesian inference methods built on statistical divergences other than the KL divergence.
• Observe the behavior of inference methods based on the alpha divergence and the chi-square divergence.
Future work
• Suggest more stable gradient-based optimization methods by reducing the variance of the gradient estimates.
• Consider more general forms of divergence.
• Analyze the convergence of each optimization method theoretically, where possible.
Questions?