
Approximate Inference: Variational Inference

CMSC 678, UMBC

Outline

Recap of graphical models & belief propagation

Posterior inference (Bayesian perspective)

Math: exponential family distributions

Variational Inference
β€’ Basic Technique
β€’ Example: Topic Models

Recap from last time…

Graphical Models

𝑝𝑝 π‘₯π‘₯1, π‘₯π‘₯2, π‘₯π‘₯3, … , π‘₯π‘₯𝑁𝑁 = �𝑖𝑖

𝑝𝑝 π‘₯π‘₯𝑖𝑖 πœ‹πœ‹(π‘₯π‘₯𝑖𝑖))

Directed Models (Bayesian networks)

Undirected Models (Markov random fields)

𝑝𝑝 π‘₯π‘₯1, π‘₯π‘₯2, π‘₯π‘₯3, … , π‘₯π‘₯𝑁𝑁 =1𝑍𝑍�𝐢𝐢

πœ“πœ“πΆπΆ π‘₯π‘₯𝑐𝑐

Markov Blanket

The Markov blanket of a node x is its parents, children, and children's parents: the set of nodes needed to form the complete conditional for a variable $x_i$.

$$p(x_i \mid x_{j \ne i}) = \frac{p(x_1, \dots, x_N)}{\int p(x_1, \dots, x_N)\, dx_i} = \frac{\prod_k p\big(x_k \mid \pi(x_k)\big)}{\int \prod_k p\big(x_k \mid \pi(x_k)\big)\, dx_i} \quad \text{(factorization of the graph)}$$

$$= \frac{\prod_{k:\, k = i \text{ or } i \in \pi(x_k)} p\big(x_k \mid \pi(x_k)\big)}{\int \prod_{k:\, k = i \text{ or } i \in \pi(x_k)} p\big(x_k \mid \pi(x_k)\big)\, dx_i} \quad \text{(factor out terms not dependent on } x_i\text{)}$$

Markov Random Fields with Factor Graph Notation

[Figure: an image-denoising MRF drawn as a factor graph. x: original pixel/state; y: observed (noisy) pixel/state. Factor nodes are added according to maximal cliques, giving unary and binary factors alongside the variable nodes. Factor graphs are bipartite: variable nodes connect only to factor nodes.]

Two Problems for Undirected Models

Finding the normalizer:

$$Z = \sum_x \prod_c \psi_c(x_c)$$

Computing the marginals (sum over all variable combinations, with the $x_n$ coordinate fixed):

$$Z_n(v) = \sum_{x:\, x_n = v} \prod_c \psi_c(x_c)$$

Q: Why are these difficult?
A: The sums range over exponentially many variable combinations.

Example with 3 variables, fixing the 2nd dimension:

$$Z_2(v) = \sum_{x_1} \sum_{x_3} \prod_c \psi_c\big(x = (x_1, v, x_3)\big)$$

Belief propagation algorithms:
β€’ sum-product (forward-backward in HMMs)
β€’ max-product/max-sum (Viterbi)

Sum-Product

From variables to factors:

$$q_{n \to m}(x_n) = \prod_{m' \in M(n) \setminus m} r_{m' \to n}(x_n)$$

where $M(n)$ is the set of factors in which variable $n$ participates, with a default value of 1 if the product is empty.

From factors to variables:

$$r_{m \to n}(x_n) = \sum_{\mathbf{w}_m \setminus n} f_m(\mathbf{w}_m) \prod_{n' \in N(m) \setminus n} q_{n' \to m}(x_{n'})$$

where $\mathbf{w}_m$ is the set of variables that the $m$-th factor depends on, and the sum runs over configurations of those variables with variable $n$ held fixed.
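A minimal sketch of these message equations in code (my example, not the slides'): a two-variable chain with unary factors f1, f2 and a pairwise factor f12, with the resulting marginals checked against brute-force enumeration.

```python
# Sum-product on a tiny factor graph: x1 -- f12 -- x2, plus unary f1, f2.
import numpy as np

f1 = np.array([0.7, 0.3])          # unary factor over x1
f2 = np.array([0.4, 0.6])          # unary factor over x2
f12 = np.array([[0.9, 0.1],
                [0.2, 0.8]])       # pairwise factor f12[x1, x2]

# Variable-to-factor messages: product of incoming factor messages.
q_x1_to_f12 = f1 * 1.0             # x1's only other neighbor is f1, so r_{f1->x1} = f1
q_x2_to_f12 = f2 * 1.0

# Factor-to-variable messages: sum out the factor's other variables.
r_f12_to_x2 = f12.T @ q_x1_to_f12  # sum_{x1} f12[x1, x2] * q_{x1->f12}(x1)
r_f12_to_x1 = f12 @ q_x2_to_f12    # sum_{x2} f12[x1, x2] * q_{x2->f12}(x2)

# Marginals: product of all incoming factor-to-variable messages, normalized.
p_x1 = f1 * r_f12_to_x1
p_x1 /= p_x1.sum()
p_x2 = f2 * r_f12_to_x2
p_x2 /= p_x2.sum()

# Brute-force check: joint[x1, x2] is proportional to f1[x1] * f2[x2] * f12[x1, x2]
joint = f1[:, None] * f2[None, :] * f12
joint /= joint.sum()
assert np.allclose(p_x1, joint.sum(axis=1))
assert np.allclose(p_x2, joint.sum(axis=0))
print(p_x1, p_x2)
```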

Outline

Recap of graphical models & belief propagation

Posterior inference (Bayesian perspective)

Math: exponential family distributions

Variational Inference
β€’ Basic Technique
β€’ Example: Topic Models

Goal: Posterior Inference

Hyperparameters Ξ±; unknown parameters Θ; observed data π’Ÿ.

Likelihood model: p(π’Ÿ | Θ)

Posterior: pα(Θ | π’Ÿ)

We're going to be Bayesian (perform Bayesian inference).

Posterior Classification vs. Posterior Inference

"Frequentist" methods put a prior over labels (maybe), not weights: compute pΞ±,w(y | π’Ÿ).

Bayesian methods include the weight parameters in Θ: compute pα(Θ | π’Ÿ).

(Some) Learning Techniques

MAP/MLE: point estimation, basic EM (what we've already covered)

Variational Inference: functional optimization (today)

Sampling/Monte Carlo (next class)

Outline

Recap of graphical models & belief propagation

Posterior inference (Bayesian perspective)

Math: exponential family distributions

Variational Inference
β€’ Basic Technique
β€’ Example: Topic Models

Exponential Family Form

β€’ Support function: formally necessary, in practice irrelevant
β€’ Distribution parameters: natural parameters; feature weights
β€’ Feature function(s): sufficient statistics
β€’ Log-normalizer
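The form itself appeared only as a figure on the slides; for reference, the standard exponential family density, with the four components labeled as above, is:

$$p(x \mid \eta) = h(x)\, \exp\big(\eta^\top T(x) - A(\eta)\big)$$

where $h(x)$ is the support function, $\eta$ the natural parameters, $T(x)$ the sufficient statistics, and $A(\eta) = \log \int h(x)\, e^{\eta^\top T(x)}\, dx$ the log-normalizer.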

Why? Capture Common Distributions

β€’ Discrete (finite distributions)
β€’ Gaussian
β€’ Dirichlet (distributions over (finite) distributions)
β€’ Gamma, Exponential, Poisson, Negative-Binomial, Laplace, log-Normal, …

Why? "Easy" Gradients

The gradient of the log-likelihood is a difference of feature counts:
β€’ Observed feature counts: counts w.r.t. the empirical distribution
β€’ Expected feature counts: counts w.r.t. the current model parameters

(We've already seen this with maxent models.)
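In symbols (the equation on the slide was a figure; this is the standard statement for an exponential family with sufficient statistics $T$):

$$\nabla_\eta \log p(x_1, \dots, x_N \mid \eta) = \sum_{i=1}^N T(x_i) \;-\; N\, \mathbb{E}_{p(x \mid \eta)}\big[T(x)\big]$$

i.e., observed counts minus expected counts.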

Why? "Easy" Expectations

The expectation of the sufficient statistics is the gradient of the log-normalizer:

$$\mathbb{E}_{p(x \mid \eta)}\big[T(x)\big] = \nabla_\eta A(\eta)$$
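A quick numerical check of this identity (my example): for a Bernoulli in natural parameterization, $T(x) = x$, $A(\eta) = \log(1 + e^\eta)$, and $\mathbb{E}[x] = \sigma(\eta)$, which should equal $A'(\eta)$.

```python
# Verify grad A(eta) = E[T(x)] for the Bernoulli exponential family.
import numpy as np

def A(eta):
    return np.log1p(np.exp(eta))   # log-normalizer of Bernoulli, T(x) = x

eta = 0.37
eps = 1e-6
grad_A = (A(eta + eps) - A(eta - eps)) / (2 * eps)   # numerical gradient
expected_T = 1.0 / (1.0 + np.exp(-eta))              # E[x] = sigmoid(eta)
print(grad_A, expected_T)   # agree to ~1e-10
assert abs(grad_A - expected_T) < 1e-6
```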

Why? "Easy" Posterior Inference

β€’ p is the conjugate prior for q
β€’ The posterior p has the same form as the prior p
β€’ All exponential family models have a conjugate prior (in theory)

Posterior            Likelihood             Prior
Dirichlet (Beta)     Discrete (Bernoulli)   Dirichlet (Beta)
Normal               Normal (fixed var.)    Normal
Gamma                Exponential            Gamma
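Conjugacy in action, as a small sketch (my example, not the slides'): a Beta prior with a Bernoulli likelihood yields a Beta posterior whose parameters are the prior's plus the observed counts.

```python
# Beta-Bernoulli conjugate update: posterior is Beta(a + #1s, b + #0s).
import numpy as np

a, b = 2.0, 2.0                      # Beta(a, b) prior
x = np.array([1, 1, 0, 1, 0, 1, 1])  # Bernoulli observations

heads, tails = x.sum(), len(x) - x.sum()
a_post, b_post = a + heads, b + tails
print(f"posterior: Beta({a_post}, {b_post}), mean = {a_post / (a_post + b_post):.3f}")
```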

Outline

Recap of graphical models & belief propagation

Posterior inference (Bayesian perspective)

Math: exponential family distributions

Variational Inference
β€’ Basic Technique
β€’ Example: Topic Models

Goal: Posterior Inference

Hyperparameters Ξ±; unknown parameters Θ; observed data π’Ÿ.

Likelihood model: p(π’Ÿ | Θ)

Posterior: pα(Θ | π’Ÿ)

(Some) Learning Techniques

MAP/MLE: point estimation, basic EM (what we've already covered)

Variational Inference: functional optimization (today)

Sampling/Monte Carlo (next class)

Variational Inference

β€’ p(ΞΈ | x): difficult to compute
β€’ q(ΞΈ): easy(ier) to compute, controlled by parameters Ξ»
β€’ Minimize the "difference" between them by changing Ξ»

Variational Inference: A Gradient-Based Optimization Technique

Set t = 0. Pick a starting value Ξ»_t.
Until converged:
1. Get value y_t = F(q(β€’; Ξ»_t))
2. Get gradient g_t = F'(q(β€’; Ξ»_t))
3. Get scaling factor ρ_t
4. Set Ξ»_{t+1} = Ξ»_t + ρ_t Β· g_t
5. Set t += 1
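The same loop as a generic code sketch (the function names and toy objective are mine, not the slides'). Note the sign: with F taken to be a KL divergence, as it is below, the step goes downhill.

```python
# Generic gradient loop over variational parameters lam.
import numpy as np

def optimize_variational_params(F, grad_F, lam0, rho=0.1, tol=1e-6, max_iter=1000):
    lam = np.asarray(lam0, dtype=float)
    y_prev = np.inf
    for t in range(max_iter):
        y = F(lam)                     # 1. value of the objective
        g = grad_F(lam)                # 2. gradient
        rho_t = rho / np.sqrt(t + 1)   # 3. scaling factor (decaying step size)
        lam = lam - rho_t * g          # 4. gradient step (minus: minimizing KL)
        if abs(y_prev - y) < tol:      # 5. convergence check on the objective
            break
        y_prev = y
    return lam

# Toy usage: minimize a quadratic "objective" as a stand-in for the KL.
lam_star = optimize_variational_params(lambda l: ((l - 3) ** 2).sum(),
                                       lambda l: 2 * (l - 3),
                                       lam0=[0.0])
print(lam_star)   # -> approximately [3.]
```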

Variational Inference: The Function to Optimize

β€’ p(ΞΈ | x): the posterior of the desired model, with parameters ΞΈ for the desired model
β€’ q(ΞΈ; Ξ»): any easy-to-compute distribution, with variational parameters Ξ» for ΞΈ
β€’ Find the best distribution q (calculus of variations)

The "difference" to minimize is the KL-divergence (an expectation):

$$D_{\mathrm{KL}}\big(q(\theta)\,\big\|\,p(\theta \mid x)\big) = \mathbb{E}_{q(\theta)}\left[\log \frac{q(\theta)}{p(\theta \mid x)}\right]$$
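To make the objective concrete (my example): a Monte Carlo estimate of $D_{\mathrm{KL}}(q \| p)$ for two 1-D Gaussians, checked against the known closed form.

```python
# Monte Carlo KL[q || p] for Gaussians vs. the closed-form answer.
import numpy as np

rng = np.random.default_rng(0)
mu_q, sig_q = 0.0, 1.0     # q = N(0, 1), the "easy" distribution
mu_p, sig_p = 1.0, 2.0     # p = N(1, 2), standing in for the posterior

def log_normal(x, mu, sig):
    return -0.5 * np.log(2 * np.pi * sig**2) - 0.5 * ((x - mu) / sig) ** 2

theta = rng.normal(mu_q, sig_q, size=200_000)          # theta ~ q
kl_mc = np.mean(log_normal(theta, mu_q, sig_q) - log_normal(theta, mu_p, sig_p))

# Closed form: log(sig_p/sig_q) + (sig_q^2 + (mu_q - mu_p)^2) / (2 sig_p^2) - 1/2
kl_exact = np.log(sig_p / sig_q) + (sig_q**2 + (mu_q - mu_p)**2) / (2 * sig_p**2) - 0.5
print(kl_mc, kl_exact)   # the two estimates agree closely
```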

Exponential Family Recap: "Easy" Expectations; "Easy" Posterior Inference (p is the conjugate prior for q)

Variational Inference

Find the best distribution: when p and q have the same exponential family form, the variational update for q(ΞΈ) is (often) computable in closed form.

Variational Inference: A Gradient-Based Optimization Technique

Set t = 0. Pick a starting value Ξ»_t. Let F(q(β€’; Ξ»_t)) = KL[q(β€’; Ξ»_t) || p(β€’)].
Until converged:
1. Get value y_t = F(q(β€’; Ξ»_t))
2. Get gradient g_t = F'(q(β€’; Ξ»_t))
3. Get scaling factor ρ_t
4. Set Ξ»_{t+1} = Ξ»_t + ρ_t Β· g_t
5. Set t += 1

Variational Inference: Maximization or Minimization?

Evidence Lower Bound (ELBO):

$$\log p(x) = \log \int p(x, \theta)\, d\theta = \log \int p(x, \theta)\, \frac{q(\theta)}{q(\theta)}\, d\theta = \log \mathbb{E}_{q(\theta)}\left[\frac{p(x, \theta)}{q(\theta)}\right]$$

$$\ge\; \mathbb{E}_{q(\theta)}\big[\log p(x, \theta)\big] - \mathbb{E}_{q(\theta)}\big[\log q(\theta)\big] \;=\; \mathcal{L}(q)$$

where the inequality is Jensen's (log is concave). Since $\log p(x) = \mathcal{L}(q) + \mathrm{KL}\big[q(\theta)\,\|\,p(\theta \mid x)\big]$ and $\log p(x)$ is fixed, maximizing the ELBO is the same as minimizing the KL divergence: that answers the maximization-or-minimization question.

Outline

Recap of graphical models & belief propagation

Posterior inference (Bayesian perspective)

Math: exponential family distributions

Variational Inference
β€’ Basic Technique
β€’ Example: Topic Models

Bag-of-Items Models

"Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. …"

p(document) is computed from unigram counts: Three: 1, people: 2, attack: 2, …

With global parameters this becomes pΟ†,Ο‰(document): global (corpus-level) parameters interact with local (document-level) parameters.
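A minimal sketch of the unigram scoring idea (my code; the vocabulary probabilities are placeholders):

```python
# Score a document as a product of per-word probabilities,
# i.e. a sum of count * log-prob terms over its unigram counts.
from collections import Counter
import math

doc = "three people have been fatally shot and five people ...".split()
counts = Counter(doc)                             # unigram counts, e.g. people: 2

vocab_probs = {w: 1.0 / 50_000 for w in counts}   # placeholder word probabilities

log_p_doc = sum(c * math.log(vocab_probs[w]) for w, c in counts.items())
print(counts.most_common(3), log_p_doc)
```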

Latent Dirichlet Allocation (Blei et al., 2003)

β€’ Per-document (unigram) word counts: the count of word j in document i (~ Multinomial)
β€’ Per-document (latent) topic usage (~ Dirichlet)
β€’ Per-topic word usage, for K topics (~ Dirichlet)

The Dirichlets regularize (place priors on) the document-topic and topic-word distributions.

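The generative story behind these counts, as a compact code sketch (my code; the symbols follow the next slide: Ξ±, Ξ² hyperparameters, ΞΈ per-document topic usage, Ο† per-topic word usage, z topic assignments, w words):

```python
# Sample a small corpus from LDA's generative process.
import numpy as np

rng = np.random.default_rng(0)
K, V, D, N = 3, 20, 5, 30          # topics, vocab size, documents, words/doc
alpha = np.full(K, 0.5)            # Dirichlet prior on topic usage
beta = np.full(V, 0.1)             # Dirichlet prior on topic-word usage

phi = rng.dirichlet(beta, size=K)          # phi[k]: word distribution of topic k
docs = []
for d in range(D):
    theta = rng.dirichlet(alpha)           # theta^(d): this document's topic usage
    z = rng.choice(K, size=N, p=theta)     # z^(d,n) ~ Discrete(theta^(d))
    w = np.array([rng.choice(V, p=phi[k]) for k in z])   # w^(d,n) ~ Discrete(phi_z)
    docs.append(w)

# Per-document unigram word counts (the observed data matrix in the slides):
counts = np.stack([np.bincount(w, minlength=V) for w in docs])
print(counts.shape)   # (D, V): count of word j in document i
```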

Variational Inference: LDirA

Topic usage, per-document (unigram) word counts, and topic words.

p: true model:

$$\phi_k \sim \mathrm{Dirichlet}(\boldsymbol{\beta}) \qquad \theta^{(d)} \sim \mathrm{Dirichlet}(\boldsymbol{\alpha}) \qquad z^{(d,n)} \sim \mathrm{Discrete}\big(\theta^{(d)}\big) \qquad w^{(d,n)} \sim \mathrm{Discrete}\big(\phi_{z^{(d,n)}}\big)$$

q: mean-field approximation:

$$\phi_k \sim \mathrm{Dirichlet}(\boldsymbol{\lambda}_k) \qquad \theta^{(d)} \sim \mathrm{Dirichlet}(\boldsymbol{\gamma}_d) \qquad z^{(d,n)} \sim \mathrm{Discrete}\big(\psi^{(d,n)}\big)$$

Variational Inference: A Gradient-Based Optimization Technique

(Recall the loop from before: until converged, compute the value and gradient of F(q(β€’; Ξ»_t)) = KL[q(β€’; Ξ»_t) || p(β€’)] and take a scaled gradient step on Ξ».)

Variational Inference: LDirA

p: true model: $\theta^{(d)} \sim \mathrm{Dirichlet}(\boldsymbol{\alpha})$, $z^{(d,n)} \sim \mathrm{Discrete}(\theta^{(d)})$
q: mean-field approximation: $\theta^{(d)} \sim \mathrm{Dirichlet}(\boldsymbol{\gamma}_d)$, $z^{(d,n)} \sim \mathrm{Discrete}(\psi^{(d,n)})$

One term of the objective is the expected log prior, $\mathbb{E}_{q(\theta^{(d)})}\big[\log p(\theta^{(d)} \mid \alpha)\big]$. Using the exponential family form of the Dirichlet,

$$p(\theta) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_k \theta_k^{\alpha_k - 1}, \qquad \text{params} = (\alpha_k - 1)_k, \quad \text{suff. stats} = (\log \theta_k)_k,$$

we get

$$\mathbb{E}_{q(\theta^{(d)})}\big[\log p(\theta^{(d)} \mid \alpha)\big] = \mathbb{E}_{q(\theta^{(d)})}\big[(\alpha - 1)^\top \log \theta^{(d)}\big] + C = (\alpha - 1)^\top\, \mathbb{E}_{q(\theta^{(d)})}\big[\log \theta^{(d)}\big] + C.$$

The remaining expectation is of the sufficient statistics of the q distribution, itself a Dirichlet with params $(\gamma_k - 1)_k$ and the same suff. stats $(\log \theta_k)_k$. Since the expectation of the sufficient statistics is the gradient of the log normalizer,

$$\mathbb{E}_{q(\theta^{(d)})}\big[\log p(\theta^{(d)} \mid \alpha)\big] = (\alpha - 1)^\top\, \nabla_{\gamma_d} A(\gamma_d - 1) + C.$$
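For the Dirichlet this gradient has a known closed form (a standard identity, not spelled out on the slides): E_q[log ΞΈ_k] = digamma(Ξ³_k) βˆ’ digamma(Ξ£_j Ξ³_j). A quick Monte Carlo check:

```python
# Verify E_q[log theta_k] = digamma(gamma_k) - digamma(sum(gamma)).
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)
gamma = np.array([2.0, 5.0, 1.5])            # variational Dirichlet parameters
theta = rng.dirichlet(gamma, size=500_000)   # theta ~ q = Dirichlet(gamma)

mc = np.log(theta).mean(axis=0)              # Monte Carlo E_q[log theta_k]
exact = digamma(gamma) - digamma(gamma.sum())
print(mc)
print(exact)   # the two agree to ~1e-3
```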

Variational Inference: LDirA

Collecting this term with the rest of the objective, as a function of $\gamma_d$:

$$\mathcal{L}(\gamma_d) = (\alpha - 1)^\top\, \nabla_{\gamma_d} A(\gamma_d - 1) + M(\gamma_d)$$

where $M(\gamma_d)$ gathers the remaining terms. There's more math to do!

Variational Inference: A Gradient-Based Optimization Technique

(Recall the loop once more: we need the gradient of the objective with respect to the variational parameters, here Ξ³_d.)

Variational Inference: LDirA

p: true model: $\theta^{(d)} \sim \mathrm{Dirichlet}(\boldsymbol{\alpha})$, $z^{(d,n)} \sim \mathrm{Discrete}(\theta^{(d)})$
q: mean-field approximation: $\theta^{(d)} \sim \mathrm{Dirichlet}(\boldsymbol{\gamma}_d)$, $z^{(d,n)} \sim \mathrm{Discrete}(\psi^{(d,n)})$

$$\mathcal{L}(\gamma_d) = (\alpha - 1)^\top\, \nabla_{\gamma_d} A(\gamma_d - 1) + M(\gamma_d)$$

$$\nabla_{\gamma_d} \mathcal{L}(\gamma_d) = (\alpha - 1)^\top\, \nabla^2_{\gamma_d} A(\gamma_d - 1) + \nabla_{\gamma_d} M(\gamma_d)$$
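Carrying the remaining algebra through (done in Blei et al., 2003, not on these slides) gives a closed-form coordinate-ascent update for the per-document parameters:

$$\gamma_d = \alpha + \sum_n \psi^{(d,n)}$$

i.e., the variational Dirichlet parameters equal the prior plus the expected topic counts under q.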