
EM Algorithm for Latent Variable Models

David Rosenberg

New York University

May 2, 2017


Gaussian Mixture Models (Review)



Example: Old Faithful Geyser

Looks like two clusters. How can we find these clusters algorithmically?


Probabilistic Model for Clustering

Let’s consider a generative model for the data. Suppose:

1. There are k clusters.
2. We have a probability density for each cluster.

Generate a point x as follows:
1. Choose a random cluster z ∈ {1, 2, ..., k}.
2. Choose a point x from the distribution for cluster z.


Gaussian Mixture Model (k = 3)

1. Choose z ∈ {1, 2, 3} with p(1) = p(2) = p(3) = 1/3.
2. Choose x | z ∼ N(x | µ_z, Σ_z).
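As a concrete illustration of this two-step generative process, here is a minimal sampling sketch in Python (the mixture weights, means, and covariances below are made-up placeholder values, not anything from the slides):

import numpy as np

rng = np.random.default_rng(0)

pi = np.array([1/3, 1/3, 1/3])                        # cluster probabilities
mus = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])  # cluster means (placeholders)
Sigmas = np.array([np.eye(2), 0.5 * np.eye(2), 2.0 * np.eye(2)])  # cluster covariances

def sample_gmm(n):
    """Draw n points: step 1 picks z ~ pi, step 2 draws x | z ~ N(mu_z, Sigma_z)."""
    z = rng.choice(len(pi), size=n, p=pi)
    x = np.array([rng.multivariate_normal(mus[j], Sigmas[j]) for j in z])
    return x, z

x, z = sample_gmm(500)   # 500 points from the 3-component mixture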


Gaussian Mixture Model Parameters (k Components)

Cluster probabilities: π = (π_1, ..., π_k)
Cluster means: µ = (µ_1, ..., µ_k)
Cluster covariance matrices: Σ = (Σ_1, ..., Σ_k)

For now, suppose all these parameters are known. We’ll discuss how to learn or estimate them later.


The GMM “Inference” Problem

Suppose we know all the model parameters π, µ, Σ, and thus p(x, z).

The inference problem: We observe x. We want to know its cluster z.

We can get a soft cluster assignment from the conditional distribution:

p(z | x) = p(x, z) / p(x)

A hard cluster assignment is given by

z* = argmax_{z ∈ {1,...,k}} p(z | x).

So if we know the model parameters, we can compute p(z | x), and clustering is trivial.
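As a sketch of this inference step (with placeholder parameter values chosen only for illustration), the soft and hard assignments can be computed directly from Bayes’ rule:

import numpy as np
from scipy.stats import multivariate_normal

pi = np.array([0.5, 0.3, 0.2])                                   # placeholder parameters
mus = [np.zeros(2), np.array([4.0, 0.0]), np.array([0.0, 4.0])]
Sigmas = [np.eye(2), np.eye(2), np.eye(2)]

def soft_assignment(x):
    """p(z | x) is proportional to p(x, z) = pi_z * N(x | mu_z, Sigma_z)."""
    joint = np.array([pi[j] * multivariate_normal.pdf(x, mus[j], Sigmas[j])
                      for j in range(len(pi))])   # p(x, z) for each z
    return joint / joint.sum()                    # normalize by p(x)

x = np.array([3.5, 0.2])
p_z_given_x = soft_assignment(x)       # soft cluster assignment
z_star = int(np.argmax(p_z_given_x))   # hard cluster assignment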


The GMM “Learning” Problem

Given data x_1, ..., x_n drawn i.i.d. from a GMM, estimate the parameters:

Cluster probabilities: π = (π_1, ..., π_k)
Cluster means: µ = (µ_1, ..., µ_k)
Cluster covariance matrices: Σ = (Σ_1, ..., Σ_k)

Traditional approach is maximum [marginal] likelihood:

(π, µ, Σ) = argmax_{π,µ,Σ} p(x_1, ..., x_n)

Unfortunately, this is very difficult.

Note that the fully observed problem is easy. That is:

(π, µ, Σ) = argmax_{π,µ,Σ} p(x_1, ..., x_n, z_1, ..., z_n)
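To see why the fully observed problem is easy, here is a sketch of the complete-data MLE: with the labels z_i observed, the estimates are just per-cluster proportions, sample means, and sample covariances (function and variable names are ours):

import numpy as np

def complete_data_mle(x, z, k):
    """x: (n, d) observed points; z: (n,) observed cluster labels in {0, ..., k-1}."""
    pi = np.array([np.mean(z == c) for c in range(k)])            # cluster proportions
    mu = np.array([x[z == c].mean(axis=0) for c in range(k)])     # per-cluster sample means
    Sigma = np.array([np.cov(x[z == c].T, bias=True) for c in range(k)])  # MLE covariances
    return pi, mu, Sigma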


EM Algorithm for Latent Variable Models


General Latent Variable Model

Two sets of random variables: z and x.
z consists of unobserved hidden variables.
x consists of observed variables.
Joint probability model parameterized by θ ∈ Θ:

p(x, z | θ)

Definition: A latent variable model is a probability model for which certain variables are never observed.

e.g. The Gaussian mixture model is a latent variable model.


Complete and Incomplete Data

Suppose we have a data set D = (x_1, ..., x_n). To simplify notation, take x to represent the entire dataset

x = (x_1, ..., x_n),

and z to represent the corresponding unobserved variables

z = (z_1, ..., z_n).

An observation of x is called an incomplete data set. An observation (x, z) is called a complete data set.


Our Objectives

Learning problem: Given the incomplete dataset D = x = (x_1, ..., x_n), find the MLE

θ̂ = argmax_θ p(D | θ).

Inference problem: Given x, find the conditional distribution over z:

p(z_i | x_i, θ).

For the Gaussian mixture model, learning is hard but inference is easy. For more complicated models, inference can also be hard. (See DS-GA 1005.)


Log-Likelihood and Terminology

Note that argmax_θ p(x | θ) = argmax_θ [log p(x | θ)].

It is often easier to work with this “log-likelihood”. We often call p(x) the marginal likelihood,

because it is p(x, z) with z “marginalized out”:

p(x) = Σ_z p(x, z)

We often call p(x, z) the joint (for “joint distribution”). Similarly, log p(x) is the marginal log-likelihood.


The EM Algorithm Key Idea

Marginal log-likelihood is hard to optimize:

max_θ log p(x | θ)

Typically the complete data log-likelihood is easy to optimize:

max_θ log p(x, z | θ)

What if we had a distribution q(z) for the latent variables z? Then maximize the expected complete data log-likelihood:

max_θ Σ_z q(z) log p(x, z | θ)

EM assumes this maximization is relatively easy.


Lower Bound for Marginal Log-Likelihood

Let q(z) be any PMF on Z, the support of z:

log p(x | θ) = log[ Σ_z p(x, z | θ) ]
             = log[ Σ_z q(z) (p(x, z | θ) / q(z)) ]   (log of an expectation)
             ≥ Σ_z q(z) log(p(x, z | θ) / q(z))       (expectation of log)
             =: L(q, θ)

The inequality is Jensen’s inequality, by concavity of the log.

This inequality is the basis for “variational methods”, of which EM is a basic example.
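As a quick numerical sanity check of the bound (a sketch on a made-up 1-D two-component mixture, with z summed out exactly), the quantity L(q, θ) stays below the marginal log-likelihood for any PMF q:

import numpy as np
from scipy.stats import norm

pi = np.array([0.4, 0.6])          # placeholder mixture weights
mu = np.array([-1.0, 2.0])         # placeholder component means
sigma = np.array([1.0, 0.5])       # placeholder component standard deviations
x = 0.3                            # a single observation

p_xz = pi * norm.pdf(x, mu, sigma)   # p(x, z) for z = 0, 1
log_evidence = np.log(p_xz.sum())    # log p(x | theta)

rng = np.random.default_rng(0)
for _ in range(5):
    q = rng.dirichlet([1.0, 1.0])              # an arbitrary PMF q(z)
    elbo = np.sum(q * np.log(p_xz / q))        # L(q, theta)
    assert elbo <= log_evidence + 1e-12        # Jensen's inequality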


The ELBO

For any PMF q(z), we have a lower bound on the marginal log-likelihood

log p(x | θ) ≥ Σ_z q(z) log(p(x, z | θ) / q(z)) =: L(q, θ)

The marginal log-likelihood log p(x | θ) is also called the evidence.

L(q, θ) is the evidence lower bound, or “ELBO”.

In the EM algorithm (and variational methods more generally), we maximize L(q, θ) over q and θ.


MLE, EM, and the ELBO

For any PMF q(z), we have a lower bound on the marginal log-likelihood

log p(x | θ) ≥ L(q, θ).

The MLE is defined as a maximum over θ:

θ̂_MLE = argmax_θ log p(x | θ).

In the EM algorithm, we maximize the lower bound (ELBO) over θ and q:

θ̂_EM = argmax_θ [ max_q L(q, θ) ]


A Family of Lower Bounds

For each q, we get a lower bound function: log p(x | θ) ≥ L(q, θ) ∀θ. Two lower bounds (blue and green curves), as functions of θ:

Ideally, we’d find the maximum of the red curve; the maximum of the green curve is close. From Bishop’s Pattern recognition and machine learning, Figure 9.14.


EM: Coordinate Ascent on Lower Bound

Choose a sequence of q’s and θ’s by “coordinate ascent”. EM Algorithm (high level):

1. Choose initial θ_old.
2. Let q* = argmax_q L(q, θ_old).
3. Let θ_new = argmax_θ L(q*, θ).
4. Go to step 2 (with θ_old ← θ_new), until converged.

We will show: p(x | θ_new) ≥ p(x | θ_old).

We get a sequence of θ’s with monotonically increasing likelihood.


EM: Coordinate Ascent on Lower Bound

1. Start at θ_old.
2. Find q giving the best lower bound at θ_old ⟹ L(q, θ).
3. θ_new = argmax_θ L(q, θ).

From Bishop’s Pattern recognition and machine learning, Figure 9.14.


EM: Next Steps

We now give two different re-expressions of L(q, θ) that make it easy to compute
argmax_q L(q, θ), for a given θ, and
argmax_θ L(q, θ), for a given q.


ELBO in Terms of KL Divergence and Entropy

Let’s investigate the lower bound:

L(q, θ) = Σ_z q(z) log(p(x, z | θ) / q(z))
        = Σ_z q(z) log(p(z | x, θ) p(x | θ) / q(z))
        = Σ_z q(z) log(p(z | x, θ) / q(z)) + Σ_z q(z) log p(x | θ)
        = −KL[q(z), p(z | x, θ)] + log p(x | θ)

Amazing! We get back an equality for the marginal log-likelihood:

log p(x | θ) = L(q, θ) + KL[q(z), p(z | x, θ)]
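Here is a sketch verifying this decomposition numerically, on the same kind of toy 1-D mixture as in the earlier check (parameter values are placeholders): for any q the lower bound plus the KL term recovers the log evidence, and the bound is tight when q is the posterior.

import numpy as np
from scipy.stats import norm

pi = np.array([0.4, 0.6]); mu = np.array([-1.0, 2.0]); sigma = np.array([1.0, 0.5])
x = 0.3

p_xz = pi * norm.pdf(x, mu, sigma)     # p(x, z | theta)
log_px = np.log(p_xz.sum())            # log p(x | theta)
posterior = p_xz / p_xz.sum()          # p(z | x, theta)

q = np.array([0.8, 0.2])               # an arbitrary PMF q(z)
elbo = np.sum(q * np.log(p_xz / q))    # L(q, theta)
kl = np.sum(q * np.log(q / posterior)) # KL[q(z), p(z | x, theta)]

assert np.isclose(elbo + kl, log_px)   # equality holds for any q
assert np.isclose(np.sum(posterior * np.log(p_xz / posterior)), log_px)  # tight when q = posterior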


Maximizing over q for fixed θ = θ_old

Find q maximizing

L(q, θ_old) = −KL[q(z), p(z | x, θ_old)] + log p(x | θ_old)   (the second term does not involve q)

Recall KL(p‖q) ≥ 0, and KL(p‖p) = 0. The best q is q*(z) = p(z | x, θ_old), and

L(q*, θ_old) = −KL[p(z | x, θ_old), p(z | x, θ_old)] + log p(x | θ_old) = log p(x | θ_old),

since the KL term is 0.

Summary:

log p(x | θ_old) = L(q*, θ_old) (tangent at θ_old).
log p(x | θ) ≥ L(q*, θ) ∀θ


Tight lower bound for any chosen θ

For θ_old, take q(z) = p(z | x, θ_old). Then
1. log p(x | θ) ≥ L(q, θ) ∀θ. [Global lower bound.]
2. log p(x | θ_old) = L(q, θ_old). [Lower bound is tight at θ_old.]

From Bishop’s Pattern recognition and machine learning, Figure 9.14.


Maximizing over θ for fixed q

Consider maximizing the lower bound L(q,θ):

L(q, θ) = Σ_z q(z) log(p(x, z | θ) / q(z))
        = Σ_z q(z) log p(x, z | θ)   [E(complete data log-likelihood)]
          − Σ_z q(z) log q(z)        [no θ here]

Maximizing L(q, θ) is equivalent to maximizing the expected complete data log-likelihood (for fixed q).


General EM Algorithm

1. Choose initial θ_old.
2. Expectation Step
   Let q*(z) = p(z | x, θ_old). [q* gives the best lower bound at θ_old.]
   Let
   J(θ) := L(q*, θ) = Σ_z q*(z) log(p(x, z | θ) / q*(z)),
   an expectation with respect to z ∼ q*(z).
3. Maximization Step
   θ_new = argmax_θ J(θ).
   [Equivalent to maximizing the expected complete log-likelihood.]
4. Go to step 2, until converged.
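Here is a compact sketch of this recipe specialized to a Gaussian mixture, the case where both steps have closed forms. The function and variable names are ours, and the initialization and convergence check are deliberately simplistic:

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(x, k, n_iter=100, tol=1e-6, seed=0):
    """Fit a k-component GMM to x (shape (n, d)) with EM."""
    n, d = x.shape
    rng = np.random.default_rng(seed)
    pi = np.full(k, 1.0 / k)                                  # initial cluster probabilities
    mu = x[rng.choice(n, size=k, replace=False)]              # initial means: random data points
    Sigma = np.array([np.cov(x.T) + 1e-6 * np.eye(d) for _ in range(k)])
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E step: q*(z_i = j) = p(z_i = j | x_i, theta_old), the responsibilities
        dens = np.column_stack([pi[j] * multivariate_normal.pdf(x, mu[j], Sigma[j])
                                for j in range(k)])           # (n, k): pi_j N(x_i | mu_j, Sigma_j)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: theta_new = argmax_theta J(theta); closed form for a GMM
        nk = gamma.sum(axis=0)                                # soft counts
        pi = nk / n
        mu = (gamma.T @ x) / nk[:, None]
        for j in range(k):
            diff = x - mu[j]
            Sigma[j] = (gamma[:, j, None] * diff).T @ diff / nk[j] + 1e-6 * np.eye(d)
        ll = np.log(dens.sum(axis=1)).sum()   # log p(x | theta_old); non-decreasing across iterations
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pi, mu, Sigma, gamma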


Does EM Work?


EM Gives Monotonically Increasing Likelihood: By Picture

From Bishop’s Pattern recognition and machine learning, Figure 9.14.


EM Gives Monotonically Increasing Likelihood: By Math

1. Start at θ_old.
2. Choose q*(z) = argmax_q L(q, θ_old). We’ve shown

   log p(x | θ_old) = L(q*, θ_old).

3. Choose θ_new = argmax_θ L(q*, θ). So

   L(q*, θ_new) ≥ L(q*, θ_old).

Putting it together, we get

log p(x | θ_new) ≥ L(q*, θ_new)    (L is a lower bound)
                ≥ L(q*, θ_old)     (by definition of θ_new)
                = log p(x | θ_old) (bound is tight at θ_old).


Suppose We Maximize the ELBO...

Suppose we have found a global maximum of L(q,θ):

L(q*, θ*) ≥ L(q, θ) ∀q, θ,

where of course q*(z) = p(z | x, θ*).

Claim: θ* is a global maximum of log p(x | θ). Proof: For any θ′, we showed that for q′(z) = p(z | x, θ′) we have

log p(x | θ′) = L(q′, θ′) + KL[q′, p(z | x, θ′)]
             = L(q′, θ′)
             ≤ L(q*, θ*)
             = log p(x | θ*).


Convergence of EM

Let θ_n be the value of the EM algorithm after n steps. Define the “transition function” M(·) such that θ_{n+1} = M(θ_n). Suppose the log-likelihood function ℓ(θ) = log p(x | θ) is differentiable. Let S be the set of stationary points of ℓ(θ) (i.e., ∇_θ ℓ(θ) = 0).

Theorem. Under mild regularity conditions*, for any starting point θ_0,

lim_{n→∞} θ_n = θ* for some stationary point θ* ∈ S, and θ* is a fixed point of the EM algorithm, i.e. M(θ*) = θ*. Moreover, ℓ(θ_n) strictly increases to ℓ(θ*) as n → ∞, unless θ_n ≡ θ*.

*For details, see “Parameter Convergence for EM and MM Algorithms” by Florin Vaida, Statistica Sinica (2005). http://www3.stat.sinica.edu.tw/statistica/oldpdf/a15n316.pdf


Variations on EM


EM Gives Us Two New Problems

The “E” Step: Computing

J(θ) := L(q*, θ) = Σ_z q*(z) log(p(x, z | θ) / q*(z))

The “M” Step: Computing

θ_new = argmax_θ J(θ).

Either of these can be too hard to do in practice.


Generalized EM (GEM)

Addresses the problem of a difficult “M” step. Rather than finding

θ_new = argmax_θ J(θ),

find any θ_new for which J(θ_new) ≥ J(θ_old).

Can use a standard nonlinear optimization strategy, e.g. take a gradient step on J.

We still get monotonically increasing likelihood.
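As one possible illustration of such a partial “M” step (our own sketch, not something from the slides): for a Gaussian mixture, take a single gradient ascent step on J(θ) with respect to the cluster means, which increases J for a small enough step size.

import numpy as np

def gem_mean_step(x, gamma, mu, Sigma, lr=0.1):
    """x: (n, d) data; gamma: (n, k) responsibilities; mu: (k, d); Sigma: (k, d, d)."""
    new_mu = mu.copy()
    for j in range(mu.shape[0]):
        # grad_{mu_j} J(theta) = Sigma_j^{-1} sum_i gamma_{ij} (x_i - mu_j)
        grad = np.linalg.solve(Sigma[j], (gamma[:, j, None] * (x - mu[j])).sum(axis=0))
        new_mu[j] = mu[j] + lr * grad        # one gradient ascent step instead of a full argmax
    return new_mu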


EM and More General Variational Methods

Suppose the “E” step is difficult: it is hard to take the expectation w.r.t. q*(z) = p(z | x, θ_old).

Solution: Restrict to a family of distributions Q that are easy to work with. The lower bound is now looser:

q* = argmin_{q∈Q} KL[q(z), p(z | x, θ_old)]


EM in Bayesian Setting

Suppose we have a prior p(θ). We want to find the MAP estimate θ̂_MAP = argmax_θ p(θ | x):

p(θ | x) = p(x | θ) p(θ) / p(x)

log p(θ | x) = log p(x | θ) + log p(θ) − log p(x)

We can still use our lower bound on log p(x | θ):

J(θ) := L(q*, θ) = Σ_z q*(z) log(p(x, z | θ) / q*(z))

The maximization step becomes

θ_new = argmax_θ [J(θ) + log p(θ)]

Homework: Convince yourself our lower bound is still tight at θ_old.
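To make the modified maximization step concrete, here is a small sketch for one specific prior we choose purely for illustration (a symmetric Dirichlet(α) prior on the mixing weights π; this choice is ours, not from the slides). Maximizing J(θ) + log p(θ) over π then just smooths the soft counts:

import numpy as np

def map_m_step_pi(gamma, alpha=2.0):
    """gamma: (n, k) responsibilities; alpha >= 1 is the symmetric Dirichlet parameter."""
    n, k = gamma.shape
    nk = gamma.sum(axis=0)                                # soft counts
    return (nk + alpha - 1.0) / (n + k * (alpha - 1.0))   # MAP update for pi (alpha = 1 recovers the MLE)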


Summer Homework: Gaussian Mixture Model (Hints)


Homework: Derive EM for GMM from General EM Algorithm

Subsequent slides may help set things up. Key skills:

MLE for multivariate Gaussian distributions.
Lagrange multipliers.


Gaussian Mixture Model (k Components)

GMM Parameters

Cluster probabilities: π = (π_1, ..., π_k)
Cluster means: µ = (µ_1, ..., µ_k)
Cluster covariance matrices: Σ = (Σ_1, ..., Σ_k)

Let θ = (π, µ, Σ).

Marginal log-likelihood

log p(x | θ) = log{ Σ_{z=1}^k π_z N(x | µ_z, Σ_z) }


q∗(z) are “Soft Assignments”

Suppose we observe n points: X = (x_1, ..., x_n) ∈ R^{n×d}.

Let z_1, ..., z_n ∈ {1, ..., k} be the corresponding hidden variables.

The optimal distribution q* is:

q*(z) = p(z | x, θ).

It is convenient to define the conditional distribution for z_i given x_i as

γ_{ji} := p(z_i = j | x_i) = π_j N(x_i | µ_j, Σ_j) / Σ_{c=1}^k π_c N(x_i | µ_c, Σ_c)
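A sketch of computing these soft assignments in log space for numerical stability (here gamma[i, j] plays the role of γ_{ji}; the parameters are whatever the current estimate θ_old is):

import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def responsibilities(x, pi, mus, Sigmas):
    """Return gamma with gamma[i, j] = p(z_i = j | x_i, theta_old)."""
    k = len(pi)
    log_num = np.column_stack([np.log(pi[j]) + multivariate_normal.logpdf(x, mus[j], Sigmas[j])
                               for j in range(k)])   # log pi_j + log N(x_i | mu_j, Sigma_j)
    return np.exp(log_num - logsumexp(log_num, axis=1, keepdims=True))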


Expectation Step

The complete log-likelihood is

log p(x, z | θ) = Σ_{i=1}^n log[π_{z_i} N(x_i | µ_{z_i}, Σ_{z_i})]
               = Σ_{i=1}^n [log π_{z_i} + log N(x_i | µ_{z_i}, Σ_{z_i})]   (simplifies nicely)

Take the expected complete log-likelihood w.r.t. q∗:

J(θ) = Σ_z q*(z) log p(x, z | θ)
     = Σ_{i=1}^n Σ_{j=1}^k γ_{ji} [log π_j + log N(x_i | µ_j, Σ_j)]
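A sketch of evaluating this expected complete log-likelihood from the responsibilities (gamma[i, j] corresponds to γ_{ji} above; names are ours):

import numpy as np
from scipy.stats import multivariate_normal

def expected_complete_ll(x, gamma, pi, mus, Sigmas):
    """J(theta) = sum_i sum_j gamma_{ji} [log pi_j + log N(x_i | mu_j, Sigma_j)]."""
    k = len(pi)
    log_terms = np.column_stack([np.log(pi[j]) + multivariate_normal.logpdf(x, mus[j], Sigmas[j])
                                 for j in range(k)])
    return np.sum(gamma * log_terms)   # weighted sum over i and j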


Maximization Step

Find θ∗ maximizing J(θ):

µ_c^new = (1/n_c) Σ_{i=1}^n γ_{ci} x_i

Σ_c^new = (1/n_c) Σ_{i=1}^n γ_{ci} (x_i − µ_c^new)(x_i − µ_c^new)^T

π_c^new = n_c / n,

for each c = 1, ..., k, where n_c = Σ_{i=1}^n γ_{ci}.
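A sketch of these M-step updates, taking the responsibilities as input (names are ours):

import numpy as np

def m_step(x, gamma):
    """x: (n, d) data; gamma: (n, k) responsibilities. Returns (pi, mu, Sigma)."""
    n, d = x.shape
    nk = gamma.sum(axis=0)                    # soft counts n_c
    pi = nk / n
    mu = (gamma.T @ x) / nk[:, None]
    Sigma = np.empty((gamma.shape[1], d, d))
    for c in range(gamma.shape[1]):
        diff = x - mu[c]
        Sigma[c] = (gamma[:, c, None] * diff).T @ diff / nk[c]
    return pi, mu, Sigma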
