UVA CS 6316/4501 – Fall 2016 Machine Learning
Lecture 21: EM (Extra)

Transcript of the lecture slides (Dr. Yanjun Qi, University of Virginia).

Page 1:

UVA CS 6316/4501 – Fall 2016
Machine Learning
Lecture 21: EM (Extra)
Dr. Yanjun Qi
University of Virginia
Department of Computer Science
11/29/16

Page 2:

Where are we? → Major sections of this course

- Regression (supervised)
- Classification (supervised)
- Feature selection
- Unsupervised models
  - Dimension Reduction (PCA)
  - Clustering (K-means, GMM/EM, Hierarchical)
- Learning theory
- Graphical models (BN and HMM slides shared)

Page 3:

Today's Outline

• Principles for Model Inference
  – Maximum Likelihood Estimation
  – Bayesian Estimation
• Strategies for Model Inference
  – EM Algorithm – simplify difficult MLE
    • Algorithm
    • Application
    • Theory
  – MCMC – samples rather than maximizing

Page 4:

Model Inference through Maximum Likelihood Estimation (MLE)

Assumption: the data comes from a known probability distribution, but the distribution has some parameters that are unknown to you.

Example: the data is distributed as a Gaussian, $y_i \sim N(\mu, \sigma^2)$, so the unknown parameters here are $\theta = (\mu, \sigma^2)$.

MLE is a tool that estimates the unknown parameters of the probability distribution from the data.

Page 5:

MLE: e.g. Single Gaussian Model (when p=1)

• Need to adjust the parameters (→ model inference) so that the resulting distribution fits the observed data well.

Page 6:

Maximum Likelihood revisited

Model: $y_i \sim N(\mu, \sigma^2)$, with observed data $Y = \{y_1, y_2, \ldots, y_N\}$.

$$\ell(\theta) = \log L(\theta; Y) = \sum_{i=1}^{N} \log p(y_i)$$

Page 7:

MLE: e.g. Single Gaussian Model

• Assume the observations $y_i$ are independent, with $Y = \{y_1, y_2, \ldots, y_N\}$.

• Form the likelihood:

$$L(\theta; Y) = \prod_{i=1}^{N} p(y_i) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(y_i - \mu)^2}{2\sigma^2} \right)$$

• Form the log-likelihood:

$$\ell(\theta) = \log \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(y_i - \mu)^2}{2\sigma^2} \right) = -\sum_{i=1}^{N} \frac{(y_i - \mu)^2}{2\sigma^2} - N \log\big(\sqrt{2\pi}\,\sigma\big)$$

Page 8:

MLE: e.g. Single Gaussian Model

• To find the unknown parameter values, maximize the log-likelihood with respect to the unknown parameters:

$$\frac{\partial \ell}{\partial \mu} = 0 \;\Rightarrow\; \hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} y_i; \qquad \frac{\partial \ell}{\partial \sigma^2} = 0 \;\Rightarrow\; \hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{\mu})^2$$
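These closed-form estimates are easy to verify numerically. A minimal sketch (my own illustration, not from the slides), assuming only numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=10_000)   # simulated data; true mu = 2.0, sigma = 1.5

# Closed-form MLE obtained by setting the log-likelihood gradient to zero
mu_hat = y.mean()                         # (1/N) * sum_i y_i
sigma2_hat = np.mean((y - mu_hat) ** 2)   # (1/N) * sum_i (y_i - mu_hat)^2  (divides by N, not N-1)

print(mu_hat, np.sqrt(sigma2_hat))        # close to 2.0 and 1.5
```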

Page 9:

MLE: A Challenging Mixture Example

[Figure: histogram of the observed data.]

The indicator variable $\Delta_i$ tells which component generated observation $i$: $\pi$ is the probability with which the observation is chosen from density model 2, and $(1 - \pi)$ is the probability with which it is chosen from density model 1.

Mixture model:

$$Y_1 \sim N(\mu_1, \sigma_1^2); \quad Y_2 \sim N(\mu_2, \sigma_2^2); \quad Y = (1 - \Delta)\, Y_1 + \Delta\, Y_2, \quad \Delta \in \{0, 1\}$$

where $\theta_1 = (\mu_1, \sigma_1^2)$ and $\theta_2 = (\mu_2, \sigma_2^2)$.

Page 10:

MLE: Gaussian Mixture Example

Maximum likelihood fitting for the parameters $\theta = (\pi, \mu_1, \mu_2, \sigma_1, \sigma_2)$, i.e. maximizing the observed-data log-likelihood

$$\ell(\theta; Y) = \sum_{i=1}^{N} \log\big[(1 - \pi)\,\phi_{\theta_1}(y_i) + \pi\,\phi_{\theta_2}(y_i)\big],$$

is numerically (and of course analytically, too) challenging to solve!

Page 11:

Bayesian Methods & Maximum Likelihood

• Bayesian: Pr(model | data), i.e. the posterior, is proportional to Pr(data | model) · Pr(model), i.e. likelihood × prior.

• If the prior is assumed uniform, the Bayesian (MAP) estimate equals the MLE:

argmax_model Pr(data | model) Pr(model) = argmax_model Pr(data | model)

Page 12:

Today's Outline

• Principles for Model Inference
  – Maximum Likelihood Estimation
  – Bayesian Estimation
• Strategies for Model Inference
  – EM Algorithm – simplify difficult MLE
    • Algorithm
    • Application
    • Theory
  – MCMC – samples rather than maximizing

Page 13:

Here is the problem

Page 14:

All we have is the observed data, from which we need to infer the likelihood function that generated the observations.

[Figure: histogram of the observed data.]

Page 15:

Expectation Maximization: add a latent variable → latent data

EM augments the data space – it assumes latent data exists alongside the observations.

Page 16:

Computing log-likelihood based on complete data

The complete data is $T = \{t_i = (y_i, \Delta_i),\; i = 1, \ldots, N\}$.

Note that we cannot analytically maximize the previous log-likelihood with only the observed $Y = \{y_1, y_2, \ldots, y_N\}$; maximizing this complete-data form of the log-likelihood, shown below, is tractable.
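The complete-data log-likelihood itself is not legible in this transcript; based on the mixture setup above (this example appears to follow the treatment in Hastie et al.'s Elements of Statistical Learning), it should take the standard two-component form:

$$\ell_0(\theta; T) = \sum_{i=1}^{N} \Big[ (1 - \Delta_i) \log \phi_{\theta_1}(y_i) + \Delta_i \log \phi_{\theta_2}(y_i) \Big] + \sum_{i=1}^{N} \Big[ (1 - \Delta_i) \log(1 - \pi) + \Delta_i \log \pi \Big]$$

which decomposes over the two components and over $\pi$, giving the easy closed-form maximizers on the next two slides.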

Page 17:

EM: The Complete Data Likelihood

By simple differentiation we have:

$$\frac{\partial \ell_0}{\partial \mu_1} = 0 \;\Rightarrow\; \hat{\mu}_1 = \frac{\sum_{i=1}^{N} (1 - \Delta_i)\, y_i}{\sum_{i=1}^{N} (1 - \Delta_i)}; \qquad \frac{\partial \ell_0}{\partial \sigma_1^2} = 0 \;\Rightarrow\; \hat{\sigma}_1^2 = \frac{\sum_{i=1}^{N} (1 - \Delta_i)\,(y_i - \hat{\mu}_1)^2}{\sum_{i=1}^{N} (1 - \Delta_i)}$$

So maximization of the complete-data likelihood is much easier! But how do we get the latent variables?

Page 18:

EM: The Complete Data Likelihood

By simple differentiation we have:

$$\frac{\partial \ell_0}{\partial \mu_2} = 0 \;\Rightarrow\; \hat{\mu}_2 = \frac{\sum_{i=1}^{N} \Delta_i\, y_i}{\sum_{i=1}^{N} \Delta_i}; \qquad \frac{\partial \ell_0}{\partial \sigma_2^2} = 0 \;\Rightarrow\; \hat{\sigma}_2^2 = \frac{\sum_{i=1}^{N} \Delta_i\,(y_i - \hat{\mu}_2)^2}{\sum_{i=1}^{N} \Delta_i}; \qquad \frac{\partial \ell_0}{\partial \pi} = 0 \;\Rightarrow\; \hat{\pi} = \frac{\sum_{i=1}^{N} \Delta_i}{N}$$

So maximization of the complete-data likelihood is much easier! But how do we get the latent variables?

Page 19:

Obtaining Latent Variables

The latent variables are computed as expected values given the data and the parameters:

$$\gamma_i(\theta) = E(\Delta_i \mid \theta, y_i) = \Pr(\Delta_i = 1 \mid \theta, y_i)$$

Apply Bayes' rule:

$$\gamma_i(\theta) = \Pr(\Delta_i = 1 \mid \theta, y_i) = \frac{\Pr(y_i \mid \Delta_i = 1, \theta)\, \Pr(\Delta_i = 1 \mid \theta)}{\Pr(y_i \mid \Delta_i = 1, \theta)\, \Pr(\Delta_i = 1 \mid \theta) + \Pr(y_i \mid \Delta_i = 0, \theta)\, \Pr(\Delta_i = 0 \mid \theta)} = \frac{\pi\, \phi_{\theta_2}(y_i)}{(1 - \pi)\, \phi_{\theta_1}(y_i) + \pi\, \phi_{\theta_2}(y_i)}$$
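As a concrete illustration (my own, not from the slides), the responsibilities can be computed with scipy's normal density:

```python
import numpy as np
from scipy.stats import norm

def responsibilities(y, pi, mu1, sigma1, mu2, sigma2):
    """gamma_i = Pr(Delta_i = 1 | theta, y_i), via Bayes' rule."""
    p1 = (1 - pi) * norm.pdf(y, loc=mu1, scale=sigma1)  # weight x density under component 1
    p2 = pi * norm.pdf(y, loc=mu2, scale=sigma2)        # weight x density under component 2
    return p2 / (p1 + p2)

y = np.array([-0.4, 0.1, 4.2, 6.0])
print(responsibilities(y, pi=0.5, mu1=0.0, sigma1=1.0, mu2=5.0, sigma2=1.0))
# points near 0 get gamma close to 0; points near 5 get gamma close to 1
```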

Page 20:

Dilemma Situation

• We need to know the latent variables/data to maximize the complete-data log-likelihood and obtain the parameters.

• We need to know the parameters to calculate the expected values of the latent variables/data.

• → Solve through iterations!

Page 21:

So we iterate → EM for Gaussian Mixtures…

Page 22:

EM for Gaussian Mixtures…

Page 23:

EM for Two-component Gaussian Mixture

• Initialize $\hat{\mu}_1, \hat{\sigma}_1, \hat{\mu}_2, \hat{\sigma}_2, \hat{\pi}$.
• Iterate until convergence:
  – Expectation of the latent variables:

$$\gamma_i(\hat{\theta}) = \frac{\hat{\pi}\, \phi_{\hat{\theta}_2}(y_i)}{(1 - \hat{\pi})\, \phi_{\hat{\theta}_1}(y_i) + \hat{\pi}\, \phi_{\hat{\theta}_2}(y_i)} = \frac{1}{1 + \dfrac{(1 - \hat{\pi})\, \hat{\sigma}_2}{\hat{\pi}\, \hat{\sigma}_1} \exp\!\left( -\dfrac{(y_i - \hat{\mu}_1)^2}{2\hat{\sigma}_1^2} + \dfrac{(y_i - \hat{\mu}_2)^2}{2\hat{\sigma}_2^2} \right)}$$

  – Maximization for finding the parameters:

$$\hat{\mu}_1 = \frac{\sum_{i=1}^{N} (1 - \gamma_i)\, y_i}{\sum_{i=1}^{N} (1 - \gamma_i)}; \qquad \hat{\mu}_2 = \frac{\sum_{i=1}^{N} \gamma_i\, y_i}{\sum_{i=1}^{N} \gamma_i};$$

$$\hat{\sigma}_1^2 = \frac{\sum_{i=1}^{N} (1 - \gamma_i)\,(y_i - \hat{\mu}_1)^2}{\sum_{i=1}^{N} (1 - \gamma_i)}; \qquad \hat{\sigma}_2^2 = \frac{\sum_{i=1}^{N} \gamma_i\,(y_i - \hat{\mu}_2)^2}{\sum_{i=1}^{N} \gamma_i}; \qquad \hat{\pi} = \frac{\sum_{i=1}^{N} \gamma_i}{N}$$
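Putting the two steps together: below is a minimal runnable sketch of this two-component EM (my own illustration, not the course's code; the initialization and stopping rule are simple ad-hoc choices).

```python
import numpy as np
from scipy.stats import norm

def em_two_gaussians(y, n_iter=100, tol=1e-8, seed=0):
    """EM for a two-component univariate Gaussian mixture."""
    rng = np.random.default_rng(seed)
    mu1, mu2 = rng.choice(y, size=2, replace=False)  # ad-hoc init: two random data points
    s1 = s2 = y.std()
    pi = 0.5
    ll_old = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities gamma_i = Pr(Delta_i = 1 | theta, y_i)
        p1 = (1 - pi) * norm.pdf(y, mu1, s1)
        p2 = pi * norm.pdf(y, mu2, s2)
        gamma = p2 / (p1 + p2)
        # M-step: responsibility-weighted MLE updates
        mu1 = np.sum((1 - gamma) * y) / np.sum(1 - gamma)
        mu2 = np.sum(gamma * y) / np.sum(gamma)
        s1 = np.sqrt(np.sum((1 - gamma) * (y - mu1) ** 2) / np.sum(1 - gamma))
        s2 = np.sqrt(np.sum(gamma * (y - mu2) ** 2) / np.sum(gamma))
        pi = gamma.mean()
        # Convergence check on the observed-data log-likelihood
        ll = np.sum(np.log((1 - pi) * norm.pdf(y, mu1, s1) + pi * norm.pdf(y, mu2, s2)))
        if ll - ll_old < tol:
            break
        ll_old = ll
    return pi, mu1, s1, mu2, s2

# Simulated data from a known mixture (pi = 0.55)
rng = np.random.default_rng(1)
delta = rng.random(1000) < 0.55
y = np.where(delta, rng.normal(4.5, 1.0, 1000), rng.normal(1.0, 0.7, 1000))
print(em_two_gaussians(y))
```

The estimates recover the true parameters up to label switching (see the identifiability slide later).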

Page 24:

EM in… simple words

• Given observed data, you need to come up with a generative model.
• You choose a model that comprises some hidden variables $h_i$ (this is your belief!).
• Problem: to estimate the parameters of the model:
  – Assume some initial values for the parameters.
  – Replace the values of the hidden variables with their expectations (given the old parameters).
  – Recompute new values of the parameters (given the expected hidden values).
  – Check for convergence using the log-likelihood.

Page 25:

EM – Example (cont’d)

Selected iterations of the EM algorithm for the mixture example:

Iteration | π̂
    1     | 0.485
    5     | 0.493
   10     | 0.523
   15     | 0.544
   20     | 0.546

[Figure: histogram of the observed data.]

Page 26:

EM Summary

• An iterative approach for MLE.
• A good idea when you have missing or latent data.
• Has a nice convergence property.
• Can get stuck in local minima (try different starting points).
• It is generally hard to calculate the expectation over all possible values of the hidden variables.
• Still not much is known about the rate of convergence.

Page 27:

Today's Outline

• Principles for Model Inference
  – Maximum Likelihood Estimation
  – Bayesian Estimation
• Strategies for Model Inference
  – EM Algorithm – simplify difficult MLE
    • Algorithm
    • Application
    • Theory
  – MCMC – samples rather than maximizing

Page 28:

Applications of EM

– Mixture models
– HMMs
– Latent variable models
– Missing data problems
– …

Page 29:

Applications of EM (1)

• Fitting mixture models

Page 30:

Applications of EM (2)

• Probabilistic Latent Semantic Analysis (pLSA) – a technique from text analysis for topic modeling.

[Figure: pLSA graphical models relating documents D, latent topics Z, and words W, with P(w, d) factored via P(z | d) and P(w | z).]

Page 31:

Applications of EM (3)

• Learning parts-and-structure models

Page 32:

Applications of EM (4)

• Automatic segmentation of layers in video

http://www.psi.toronto.edu/images/figures/cutouts_vid.gif

Page 33:

Expectation Maximization (EM)

• An old idea (late 1950s), but formalized by Dempster, Laird and Rubin in 1977.

• The subject of much investigation. See the McLachlan & Krishnan book (1997).

Page 34:


[Figure: EM worked through for the single-variable + two-cluster case.]

Page 35:


[Figure: EM worked through for the multi-variable + multi-cluster case.]

Page 36:

Today's Outline

• Principles for Model Inference
  – Maximum Likelihood Estimation
  – Bayesian Estimation
• Strategies for Model Inference
  – EM Algorithm – simplify difficult MLE
    • Algorithm
    • Application
    • Theory
  – MCMC – samples rather than maximizing

Page 37:


Why is Learning Harder?

• In fully observed IID settings, the complete log-likelihood decomposes into a sum of local terms:

$$\ell_c(\theta; D) = \log p(x, z \mid \theta) = \log p(z \mid \theta_z) + \log p(x \mid z, \theta_x)$$

• With latent variables, all the parameters become coupled together via marginalization:

$$\ell(\theta; D) = \log p(x \mid \theta) = \log \sum_z p(z \mid \theta_z)\, p(x \mid z, \theta_x)$$

Page 38:

Gradient Learning for Mixture Models

• We can learn mixture densities using gradient descent on the observed-data log-likelihood. The gradients are quite interesting:

$$\ell(\theta) = \log p(x \mid \theta) = \log \sum_k \pi_k\, p_k(x \mid \theta_k)$$

$$\frac{\partial \ell}{\partial \theta_k} = \frac{\pi_k}{p(x \mid \theta)} \frac{\partial p_k(x \mid \theta_k)}{\partial \theta_k} = \frac{\pi_k\, p_k(x \mid \theta_k)}{p(x \mid \theta)} \frac{\partial \log p_k(x \mid \theta_k)}{\partial \theta_k} = \pi_k\, r_k\, \frac{\partial \ell_k}{\partial \theta_k}, \qquad r_k = \frac{p_k(x \mid \theta_k)}{p(x \mid \theta)}$$

• In other words, the gradient is the responsibility-weighted sum of the individual log-likelihood gradients.

• This can be passed to a conjugate gradient routine.
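To make the identity concrete, here is a small numeric check (my own sketch, not from the slides): the responsibility-weighted analytic gradient with respect to each mean should match a finite-difference gradient of the observed-data log-likelihood.

```python
import numpy as np
from scipy.stats import norm

# Fixed mixture: pi_k, mu_k, sigma_k for K = 2 components, single observation x
pis, mus, sigmas = np.array([0.3, 0.7]), np.array([0.0, 4.0]), np.array([1.0, 1.5])
x = 2.5

def obs_ll(mus_):
    """Observed-data log-likelihood as a function of the means."""
    return np.log(np.sum(pis * norm.pdf(x, mus_, sigmas)))

# Analytic gradient: d ell / d mu_k = pi_k * r_k * d log p_k / d mu_k
p_k = norm.pdf(x, mus, sigmas)
r_k = p_k / np.sum(pis * p_k)              # responsibilities r_k = p_k / p(x | theta)
dlogp_dmu = (x - mus) / sigmas ** 2        # gradient of the Gaussian log-density w.r.t. its mean
analytic = pis * r_k * dlogp_dmu

# Finite-difference check
eps = 1e-6
numeric = np.array([
    (obs_ll(mus + eps * np.eye(2)[k]) - obs_ll(mus - eps * np.eye(2)[k])) / (2 * eps)
    for k in range(2)
])
print(analytic, numeric)                   # the two should agree to ~1e-9
```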

Page 39:

Parameter Constraints

• Often we have constraints on the parameters, e.g. $\sum_{j=1}^{K} \pi_j = 1$, or $\Sigma_k$ being symmetric positive definite.
• We can use constrained optimization, or we can re-parameterize in terms of unconstrained values:
  – For normalized weights, use a softmax re-parameterization.
  – For covariance matrices, use the Cholesky decomposition $\Sigma^{-1} = A^T A$, where $A$ is upper triangular with positive diagonal: $A_{ii} = \exp(\lambda_i) > 0$, $A_{ij} = \eta_{ij}$ for $j > i$, and $A_{ij} = 0$ for $j < i$.
  – Use the chain rule to compute $\partial \ell / \partial \pi$ and $\partial \ell / \partial A$.
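A minimal sketch of both re-parameterizations (my own illustration; the function names are hypothetical): softmax maps an unconstrained vector to valid mixture weights, and the Cholesky-style construction maps unconstrained values to a symmetric positive definite precision matrix.

```python
import numpy as np

def softmax(eta):
    """Unconstrained eta -> weights pi with pi_j > 0 and sum(pi) = 1."""
    e = np.exp(eta - eta.max())              # shift for numerical stability
    return e / e.sum()

def precision_from_unconstrained(lam, eta_upper):
    """Build Sigma^{-1} = A^T A with A upper triangular and diag(A) = exp(lam) > 0."""
    d = len(lam)
    A = np.zeros((d, d))
    A[np.triu_indices(d, k=1)] = eta_upper   # free off-diagonal entries
    A[np.diag_indices(d)] = np.exp(lam)      # strictly positive diagonal
    return A.T @ A                           # symmetric positive definite by construction

pi = softmax(np.array([0.2, -1.0, 3.0]))
P = precision_from_unconstrained(np.array([0.1, -0.3]), np.array([0.7]))
print(pi.sum(), np.linalg.eigvalsh(P))       # weights sum to 1; eigenvalues all positive
```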

Page 40:

Identifiability

• A mixture model induces a multi-modal likelihood.
• Hence gradient ascent can only find a local maximum.
• Mixture models are unidentifiable, since we can always switch the hidden labels without affecting the likelihood.
• Hence we should be careful in trying to interpret the "meaning" of latent variables.

Page 41:

Expectation-Maximization (EM) Algorithm

• EM is an iterative algorithm with two linked steps:
  – E-step: fill in hidden values using inference: $p(z \mid x, \theta^t)$.
  – M-step: update the parameters of round $(t+1)$ using standard MLE/MAP methods applied to the completed data.
• We will prove that this procedure monotonically improves the likelihood (or leaves it unchanged). Thus it always converges to a local optimum of the likelihood.

Page 42:

Theory Underlying EM

• What are we doing?
• Recall that according to MLE, we intend to learn the model parameters that would maximize the likelihood of the data.
• But we do not observe z, so computing

$$\ell(\theta; D) = \log \sum_z p(x, z \mid \theta) = \log \sum_z p(z \mid \theta_z)\, p(x \mid z, \theta_x)$$

  is difficult!
• What shall we do?

Page 43:

(1) Incomplete Log-Likelihood

• Incomplete log-likelihood: with z unobserved, our objective becomes the log of a marginal probability:

$$\ell(\theta; x) = \log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta)$$

  – This objective won't decouple.

Page 44:

(2) Complete Log-Likelihood

• Complete log-likelihood: let X denote the observable variable(s) and Z denote the latent variable(s). If Z could be observed, then

$$\ell_c(\theta; x, z) \stackrel{\text{def}}{=} \log p(x, z \mid \theta) = \log p(z \mid \theta_z)\, p(x \mid z, \theta_x)$$

  – Usually, optimizing $\ell_c()$ given both z and x is straightforward (c.f. MLE for fully observed models).
  – Recall that in this case the objective, e.g. for MLE, decomposes into a sum of factors, and the parameter for each factor can be estimated separately.
  – But given that Z is not observed, $\ell_c()$ is a random quantity and cannot be maximized directly.

Page 45:

Three types of log-likelihood over multiple observed samples (x_1, x_2, …, x_N):

• Complete log-likelihood (CLL)
• Log-likelihood [Incomplete log-likelihood (ILL)]
• Expected complete log-likelihood (ECLL)

[Figure notation: observed data, latent variables, iteration index.]

Page 46:

(3) Expected Complete Log-Likelihood

• For any distribution q(z), define the expected complete log-likelihood (ECLL):

$$\text{ECLL} = \langle \ell_c(\theta; x, z) \rangle_q \stackrel{\text{def}}{=} \sum_z q(z \mid x, \theta)\, \log p(x, z \mid \theta)$$

• The CLL is a random variable → the ECLL is a deterministic function of q.
• The ECLL is linear in the CLL, so it inherits its factorizability.
• Does maximizing this surrogate yield a maximizer of the likelihood?

Page 47:

Jensen’sinequality

Page 48:

Jensen's inequality

Recall $\text{ECLL} = \langle \ell_c(\theta; x, z) \rangle_q \stackrel{\text{def}}{=} \sum_z q(z \mid x, \theta) \log p(x, z \mid \theta)$. Then:

$$\text{ILL} = \ell(\theta; x) = \log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta) = \log \sum_z q(z \mid x)\, \frac{p(x, z \mid \theta)}{q(z \mid x)}$$

$$\geq \sum_z q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)} \qquad \text{(Jensen's inequality)}$$

$$= \sum_z q(z \mid x) \log p(x, z \mid \theta) - \sum_z q(z \mid x) \log q(z \mid x) = \text{ECLL} + H_q$$

$$\Rightarrow\; \ell(\theta; x) \,\geq\, \langle \ell_c(\theta; x, z) \rangle_q + H_q, \qquad H_q = \text{the entropy term}$$

Page 49:

Lower Bounds and Free Energy

• For fixed data x, define a functional called the free energy:

$$F(q, \theta) \stackrel{\text{def}}{=} \sum_z q(z \mid x)\, \log \frac{p(x, z \mid \theta)}{q(z \mid x)} \;\leq\; \ell(\theta; x)$$

• The EM algorithm is coordinate ascent on F:
  – E-step: $q^{t+1} = \arg\max_q F(q, \theta^t)$
  – M-step: $\theta^{t+1} = \arg\max_\theta F(q^{t+1}, \theta)$

Page 50:

How does EM optimize the ILL?

Page 51:


E-step: maximization of F w.r.t. q

• Claim:

$$q^{t+1} = \arg\max_q F(q, \theta^t) = p(z \mid x, \theta^t)$$

  – This is the posterior distribution over the latent variables given the data and the parameters. Often we need this at test time anyway (e.g. to perform clustering).

• Proof (easy): this setting attains the bound $\ell(\theta^t; x)$:

$$F\big(p(z \mid x, \theta^t), \theta^t\big) = \sum_z p(z \mid x, \theta^t) \log \frac{p(x, z \mid \theta^t)}{p(z \mid x, \theta^t)} = \sum_z p(z \mid x, \theta^t) \log p(x \mid \theta^t) = \log p(x \mid \theta^t) = \ell(\theta^t; x)$$

• We can also show this result using variational calculus, or from the fact that

$$\ell(\theta; x) - F(q, \theta) = \mathrm{KL}\big(q \,\|\, p(z \mid x, \theta)\big)$$

Page 52:

E-step: Alternative derivation


$$\ell(\theta; x) - F(q, \theta) = \mathrm{KL}\big(q \,\|\, p(z \mid x, \theta)\big)$$
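The derivation on this slide is not legible in the transcript; a standard reconstruction from the definitions above, using $p(x, z \mid \theta) = p(z \mid x, \theta)\, p(x \mid \theta)$:

$$\ell(\theta; x) - F(q, \theta) = \log p(x \mid \theta) - \sum_z q(z \mid x) \log \frac{p(z \mid x, \theta)\, p(x \mid \theta)}{q(z \mid x)} = \sum_z q(z \mid x) \log \frac{q(z \mid x)}{p(z \mid x, \theta)} = \mathrm{KL}\big(q \,\|\, p(z \mid x, \theta)\big)$$

Since KL ≥ 0, with equality iff $q = p(z \mid x, \theta)$, maximizing F over q at fixed $\theta^t$ recovers exactly the posterior, consistent with the claim on the previous slide.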

Page 53:

M-step: maximization of F w.r.t. θ

• Note that the free energy breaks into two terms:

$$F(q, \theta) = \sum_z q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)} = \sum_z q(z \mid x) \log p(x, z \mid \theta) - \sum_z q(z \mid x) \log q(z \mid x) = \langle \ell_c(\theta; x, z) \rangle_q + H_q$$

  – The first term is the expected complete log-likelihood (energy), and the second term, which does not depend on θ, is the entropy.

Page 54:

M-step: maximization of F w.r.t. θ

• Thus, in the M-step, maximizing with respect to θ for fixed q, we only need to consider the first term:

$$\theta^{t+1} = \arg\max_\theta \langle \ell_c(\theta; x, z) \rangle_{q^{t+1}} = \arg\max_\theta \sum_z q^{t+1}(z \mid x) \log p(x, z \mid \theta)$$

  – Under the optimal $q^{t+1}$, this is equivalent to solving a standard MLE of the fully observed model $p(x, z \mid \theta)$, with the sufficient statistics involving z replaced by their expectations w.r.t. $p(z \mid x, \theta^t)$.

Page 55:

Summary: EM Algorithm

• A way of maximizing the likelihood function for latent variable models. It finds the MLE of the parameters when the original (hard) problem can be broken up into two (easy) pieces:
  1. Estimate some "missing" or "unobserved" data from the observed data and the current parameters.
  2. Using this "complete" data, find the maximum likelihood parameter estimates.

• Alternate between filling in the latent variables using the best guess (posterior) and updating the parameters based on this guess:
  – E-step: $q^{t+1} = \arg\max_q F(q, \theta^t)$
  – M-step: $\theta^{t+1} = \arg\max_\theta F(q^{t+1}, \theta)$

• In the M-step we optimize a lower bound on the likelihood. In the E-step we close the gap, making the bound equal to the likelihood.

Page 56:

How does EM optimize the ILL?


Page 57:

A Report Card for EM

• Some good things about EM:
  – no learning rate (step-size) parameter
  – automatically enforces parameter constraints
  – very fast for low dimensions
  – each iteration guaranteed to improve the likelihood
  – calls inference and fully observed learning as subroutines

• Some bad things about EM:
  – can get stuck in local minima
  – can be slower than conjugate gradient (especially near convergence)
  – requires an expensive inference step
  – is a maximum likelihood/MAP method

Page 58:

References

• Big thanks to Prof. Eric Xing @ CMU for allowing me to reuse some of his slides.
• Geoffrey J. McLachlan and Thriyambakam Krishnan, The EM Algorithm and Extensions.