Probabilistic Graphical Models
Parameter Estimation
Eric Xing, Lecture 5, January 29, 2020
Reading: see class homepage
Learning Graphical Models
- The goal: given a set of independent samples (assignments of the random variables), find the best (the most likely?) Bayesian network, i.e., both the DAG and the CPDs.
  (B,E,A,C,R) = (T,F,F,T,F)
  (B,E,A,C,R) = (T,F,T,T,F)
  ...
  (B,E,A,C,R) = (F,T,T,T,F)
[Figure: the Burglary/Earthquake network (nodes E, R, B, A, C) together with the CPT P(A | E, B); structural learning recovers the DAG, parameter learning fills in the CPDs.]
Learning Graphical Models
- Scenarios:
  - completely observed GMs
    - directed
    - undirected
  - partially or unobserved GMs
    - directed
    - undirected (an open research topic)
- Estimation principles:
  - Maximum likelihood estimation (MLE)
  - Bayesian estimation
  - Maximal conditional likelihood
  - Maximal "margin"
  - Maximum entropy
- We use learning as a name for the process of estimating the parameters, and in some cases, the topology of the network, from data.
ML Parameter Est. for completely observed GMs of
given structure
[Figure: a two-node model Z → X.]
- The data: {(z_1, x_1), (z_2, x_2), (z_3, x_3), ..., (z_N, x_N)}
Parameter Learning
- Assume G is known and fixed:
  - from expert design,
  - or from an intermediate outcome of iterative structure learning.
- Goal: estimate the parameters θ from a dataset of N independent, identically distributed (iid) training cases D = {x_1, ..., x_N}.
- In general, each training case x_n = (x_{n,1}, ..., x_{n,M}) is a vector of M values, one per node.
- The model can be completely observable, i.e., every element in x_n is known (no missing values, no hidden variables),
- or partially observable, i.e., ∃i s.t. x_{n,i} is not observed.
- In this lecture we consider learning parameters for a BN that has a given structure and is completely observable.
$$\ell(\theta; D) = \log p(D\mid\theta) = \log \prod_i \Big( \prod_n p(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i) \Big) = \sum_i \Big( \sum_n \log p(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i) \Big)$$
Review of density estimation
- Can be viewed as single-node GMs
- Instances of exponential family distributions
- Building blocks of general GMs
- MLE and Bayesian estimates
- See supplementary slides

GM: [plate model with nodes x_1, x_2, x_3, ..., x_N, i.e., x_i replicated N times]
Example (multinomial density estimation):
$$P(x_n \mid \theta) = P(\{x_{n,k}=1,\ \text{where } k \text{ indexes the die side of the } n\text{-th roll}\}) = \theta_1^{x_{n,1}}\theta_2^{x_{n,2}}\cdots\theta_K^{x_{n,K}} = \prod_k \theta_k^{x_{n,k}}$$
$$P(x_1,\ldots,x_N \mid \theta) = \prod_n P(x_n\mid\theta) = \prod_n \prod_k \theta_k^{x_{n,k}} = \prod_k \theta_k^{\sum_n x_{n,k}} = \prod_k \theta_k^{n_k}$$
MLE (with the constraint $\sum_k\theta_k=1$ enforced by a Lagrange multiplier $\lambda$):
$$\frac{\partial \tilde{\ell}}{\partial\theta_k} = \frac{n_k}{\theta_k} - \lambda = 0 \;\Rightarrow\; n_k = \lambda\theta_k \;\Rightarrow\; \lambda = \sum_k n_k = N \;\Rightarrow\; \hat{\theta}_{k,\mathrm{MLE}} = \frac{n_k}{N}$$
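As a quick sanity check of the multinomial MLE above, here is a minimal sketch in Python (not from the original slides; the synthetic die probabilities and variable names are made up for illustration):

```python
import numpy as np

# N die rolls from a K-sided die, encoded as integers 0..K-1 (synthetic data)
K = 6
rng = np.random.default_rng(0)
true_theta = np.array([0.1, 0.1, 0.2, 0.2, 0.3, 0.1])
rolls = rng.choice(K, size=1000, p=true_theta)

# Sufficient statistics: the counts n_k = sum_n x_{n,k}
counts = np.bincount(rolls, minlength=K)

# MLE from the Lagrange-multiplier derivation above: theta_k = n_k / N
theta_mle = counts / counts.sum()
print(theta_mle)
```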
Estimation of conditional density
- Can be viewed as two-node graphical models
- Instances of GLIMs (generalized linear models)
- Building blocks of general GMs
- MLE and Bayesian estimates
- See supplementary slides

[Figure: two-node graphical models.]
Exponential family, a basic building block
- For a numeric random variable X,
$$p(x\mid\eta) = h(x)\exp\{\eta^T T(x) - A(\eta)\} = \frac{1}{Z(\eta)}\,h(x)\exp\{\eta^T T(x)\}$$
is an exponential family distribution with natural (canonical) parameter η.
- The function T(x) is a sufficient statistic.
- The function A(η) = log Z(η) is the log normalizer.
- Examples: Bernoulli, multinomial, Gaussian, Poisson, gamma, ...
Example: Multivariate Gaussian Distribution
- For a continuous vector random variable X ∈ R^k:
$$p(x\mid\mu,\Sigma) = \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}}\exp\Big\{-\tfrac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\Big\}$$
$$= \frac{1}{(2\pi)^{k/2}}\exp\Big\{-\tfrac{1}{2}\operatorname{tr}(\Sigma^{-1}xx^T) + \mu^T\Sigma^{-1}x - \tfrac{1}{2}\mu^T\Sigma^{-1}\mu - \tfrac{1}{2}\log|\Sigma|\Big\}$$
  (moment parameters: μ, Σ)
- Exponential family representation (natural parameters):
$$\eta = \big[\Sigma^{-1}\mu;\ -\tfrac{1}{2}\operatorname{vec}(\Sigma^{-1})\big], \qquad \eta_1 = \Sigma^{-1}\mu,\ \ \eta_2 = -\tfrac{1}{2}\Sigma^{-1}$$
$$T(x) = \big[x;\ \operatorname{vec}(xx^T)\big], \qquad A(\eta) = \tfrac{1}{2}\mu^T\Sigma^{-1}\mu + \tfrac{1}{2}\log|\Sigma| = -\tfrac{1}{4}\eta_1^T\eta_2^{-1}\eta_1 - \tfrac{1}{2}\log|-2\eta_2|, \qquad h(x) = (2\pi)^{-k/2}$$
- Note: a k-dimensional Gaussian is a (k + k^2)-parameter distribution with a (k + k^2)-element vector of sufficient statistics (but because of symmetry and positive definiteness, the parameters are constrained and have lower degrees of freedom).
Example: Multinomial distribution
- For a binary (indicator) vector random variable x ~ multi(x | π):
$$p(x\mid\pi) = \pi_1^{x_1}\pi_2^{x_2}\cdots\pi_K^{x_K} = \exp\Big\{\sum_k x_k\ln\pi_k\Big\}$$
$$= \exp\Big\{\sum_{k=1}^{K-1}x_k\ln\pi_k + \Big(1-\sum_{k=1}^{K-1}x_k\Big)\ln\Big(1-\sum_{k=1}^{K-1}\pi_k\Big)\Big\}$$
$$= \exp\Big\{\sum_{k=1}^{K-1}x_k\ln\frac{\pi_k}{1-\sum_{k=1}^{K-1}\pi_k} + \ln\Big(1-\sum_{k=1}^{K-1}\pi_k\Big)\Big\}$$
- Exponential family representation:
$$\eta = \Big[\ln\frac{\pi_k}{\pi_K};\ 0\Big], \qquad T(x) = [x], \qquad A(\eta) = -\ln\Big(1-\sum_{k=1}^{K-1}\pi_k\Big) = \ln\Big(\sum_{k=1}^{K}e^{\eta_k}\Big), \qquad h(x) = 1$$
Why exponential family?
- Moment generating property:
$$\frac{dA}{d\eta} = \frac{d}{d\eta}\log Z(\eta) = \frac{1}{Z(\eta)}\frac{dZ(\eta)}{d\eta} = \int T(x)\,h(x)\frac{\exp\{\eta^T T(x)\}}{Z(\eta)}\,dx = E[T(x)]$$
$$\frac{d^2A}{d\eta^2} = \int T^2(x)\,h(x)\frac{\exp\{\eta^T T(x)\}}{Z(\eta)}\,dx - \Big(\int T(x)\,h(x)\frac{\exp\{\eta^T T(x)\}}{Z(\eta)}\,dx\Big)\frac{1}{Z(\eta)}\frac{dZ(\eta)}{d\eta}$$
$$= E[T^2(x)] - E[T(x)]^2 = \operatorname{Var}[T(x)]$$
Moment estimation
- We can easily compute moments of any exponential family distribution by taking derivatives of the log normalizer A(η).
- The q-th derivative gives the q-th centered moment:
$$\frac{dA(\eta)}{d\eta} = \text{mean}, \qquad \frac{d^2A(\eta)}{d\eta^2} = \text{variance}, \qquad \ldots$$
- When the sufficient statistic is a stacked vector, partial derivatives need to be considered.
Moment vs canonical parameters
- The moment parameter μ can be derived from the natural (canonical) parameter:
$$\frac{dA(\eta)}{d\eta} = E[T(x)] \overset{\text{def}}{=} \mu$$
- A(η) is convex since
$$\frac{d^2A(\eta)}{d\eta^2} = \operatorname{Var}[T(x)] > 0$$
- Hence we can invert the relationship and infer the canonical parameter from the moment parameter (1-to-1):
$$\eta \overset{\text{def}}{=} \psi(\mu)$$
- A distribution in the exponential family can be parameterized not only by η (the canonical parameterization) but also by μ (the moment parameterization).

[Figure: A(η) is a convex function of η; its slope at η* gives the corresponding moment parameter.]
MLE for Exponential Family
- For iid data, the log-likelihood is
$$\ell(\eta; D) = \log\prod_n h(x_n)\exp\{\eta^T T(x_n) - A(\eta)\} = \sum_n\log h(x_n) + \eta^T\Big(\sum_n T(x_n)\Big) - N\,A(\eta)$$
- Take derivatives and set to zero:
$$\frac{\partial\ell}{\partial\eta} = \sum_n T(x_n) - N\frac{\partial A(\eta)}{\partial\eta} = 0 \;\Rightarrow\; \frac{\partial A(\eta)}{\partial\eta} = \frac{1}{N}\sum_n T(x_n) \;\Rightarrow\; \hat{\mu}_{\mathrm{MLE}} = \frac{1}{N}\sum_n T(x_n)$$
- This amounts to moment matching.
- We can infer the canonical parameters using $\hat{\eta}_{\mathrm{MLE}} = \psi(\hat{\mu}_{\mathrm{MLE}})$.
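The moment-matching recipe above is easy to exercise numerically. A minimal sketch, assuming a Bernoulli written in exponential family form (T(x) = x, A(η) = log(1 + e^η), so ψ(μ) = log(μ/(1 − μ)); the data here are synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.binomial(1, 0.3, size=5000)          # iid Bernoulli data

# MLE of the moment parameter: the average sufficient statistic
mu_mle = x.mean()

# Invert the moment/canonical map: eta = psi(mu) = log(mu / (1 - mu))
eta_mle = np.log(mu_mle / (1 - mu_mle))

# Check the moment-matching condition dA/deta = sigmoid(eta) = mu at the MLE
assert np.isclose(1 / (1 + np.exp(-eta_mle)), mu_mle)
print(mu_mle, eta_mle)
```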
Sufficiency
- For p(x|θ), T(x) is sufficient for θ if there is no information in X regarding θ beyond that in T(x).
- We can throw away X for the purpose of inference w.r.t. θ.
- Bayesian view: $p(\theta\mid T(x), x) = p(\theta\mid T(x))$
- Frequentist view: $p(x\mid T(x), \theta) = p(x\mid T(x))$
- The Neyman factorization theorem:
  - T(x) is sufficient for θ if
$$p(x, T(x), \theta) = \psi_1(T(x), \theta)\,\psi_2(x, T(x)) \;\Rightarrow\; p(x\mid\theta) = g(T(x), \theta)\,h(x, T(x))$$
Examples
- Gaussian:
$$\eta = \big[\Sigma^{-1}\mu;\ -\tfrac{1}{2}\operatorname{vec}(\Sigma^{-1})\big], \quad T(x) = \big[x;\ \operatorname{vec}(xx^T)\big], \quad A(\eta) = \tfrac{1}{2}\mu^T\Sigma^{-1}\mu + \tfrac{1}{2}\log|\Sigma|, \quad h(x) = (2\pi)^{-k/2}$$
$$\Rightarrow\ \hat{\mu}_{\mathrm{MLE}} = \frac{1}{N}\sum_n T_1(x_n) = \frac{1}{N}\sum_n x_n$$
- Multinomial:
$$\eta = \Big[\ln\frac{\pi_k}{\pi_K};\ 0\Big], \quad T(x) = [x], \quad A(\eta) = \ln\Big(\sum_k e^{\eta_k}\Big), \quad h(x) = 1$$
$$\Rightarrow\ \hat{\mu}_{\mathrm{MLE}} = \frac{1}{N}\sum_n x_n$$
- Poisson:
$$p(x\mid\lambda) = \frac{\lambda^x e^{-\lambda}}{x!}, \quad \eta = \log\lambda, \quad T(x) = x, \quad A(\eta) = \lambda = e^{\eta}, \quad h(x) = \frac{1}{x!}$$
$$\Rightarrow\ \hat{\mu}_{\mathrm{MLE}} = \frac{1}{N}\sum_n x_n$$
Generalized Linear Models (GLIMs)
- The graphical model:
  - Linear regression
  - Discriminative linear classification
  - Commonality: model $E_p(Y) = \mu = f(\theta^T X)$
  - What is p()? The conditional distribution of Y.
  - What is f()? The response function.
- GLIM:
  - The observed input x is assumed to enter into the model via a linear combination of its elements, $\xi = \theta^T x$.
  - The conditional mean μ is represented as a function f(ξ) of ξ, where f is known as the response function.
  - The observed output y is assumed to be characterized by an exponential family distribution with conditional mean μ.

[Figure: plate model X_n → Y_n, replicated N times; the GLIM pipeline x → ξ = θ^T x → μ = f(ξ) → y ~ EXP(μ).]
Recall Linear Regression
- Let us assume that the target variable and the inputs are related by the equation:
$$y_i = \theta^T x_i + \varepsilon_i$$
where ε is an error term of unmodeled effects or random noise.
- Now assume that ε follows a Gaussian N(0, σ); then we have:
$$p(y_i\mid x_i;\theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big(-\frac{(y_i - \theta^T x_i)^2}{2\sigma^2}\Big)$$
- We can use the LMS algorithm, a gradient ascent/descent approach, to estimate the parameter.
Recall: Logistic Regression (sigmoid classifier, perceptron, etc.)
- The conditional distribution is a Bernoulli:
$$p(y\mid x) = \mu(x)^y\,(1-\mu(x))^{1-y}$$
where μ is a logistic function:
$$\mu(x) = \frac{1}{1+e^{-\theta^T x}}$$
- We can use the brute-force gradient method as in LR.
- But we can also apply generic laws by observing that p(y|x) is an exponential family distribution; more specifically, a generalized linear model!
More examples: parameterizing graphical models
- Markov random fields:
$$p(x) = \frac{1}{Z}\exp\Big\{\sum_{c\in C}\phi_c(x_c)\Big\} = \frac{1}{Z}\exp\{-H(x)\}$$
$$p(X) = \frac{1}{Z}\exp\Big\{\sum_{i,\,j\in N_i}\theta_{ij}X_iX_j + \sum_i\theta_{i0}X_i\Big\}$$

Restricted Boltzmann Machines
[Figure: a bipartite graph of hidden units h and visible units x.]
$$p(x, h\mid\theta) = \exp\Big\{\sum_i\theta_i\phi_i(x_i) + \sum_j\theta_j\phi_j(h_j) + \sum_{i,j}\theta_{i,j}\phi_{i,j}(x_i,h_j) - A(\theta)\Big\}$$

Conditional Random Fields
$$p(y\mid x) = \frac{1}{Z(\theta, x)}\exp\Big\{\sum_c\theta_c f_c(x, y_c)\Big\}$$
- Discriminative
- The X_i's are assumed to be features that are inter-dependent
- When labeling X_i, future observations are taken into account

[Figure: chain-structured CRF with labels Y_1, ..., Y_T connected to observations X_1, ..., X_T.]
GLIM, cont.
- The choice of exponential family is constrained by the nature of the data Y.
  - Example: y is a continuous vector → multivariate Gaussian; y is a class label → Bernoulli or multinomial.
- The choice of the response function:
  - Following some mild constraints, e.g., [0,1], positivity, ...
  - Canonical response function: $f = \psi^{-1}(\cdot)$
  - In this case θ^T x directly corresponds to the canonical parameter η:
$$p(y\mid\eta) = h(y)\exp\{\eta\,y - A(\eta)\} \;\Rightarrow\; p(y\mid\theta, x) = h(y)\exp\{(\theta^T x)\,y - A(\theta^T x)\}$$
Example canonical response functions
MLE for GLIMs with natural response
- Log-likelihood:
$$\ell = \sum_n\log h(y_n) + \sum_n\big(\theta^T x_n\,y_n - A(\eta_n)\big)$$
- Derivative of the log-likelihood:
$$\frac{d\ell}{d\theta} = \sum_n\Big(x_n y_n - \frac{dA(\eta_n)}{d\eta_n}\frac{d\eta_n}{d\theta}\Big) = \sum_n(y_n - \mu_n)x_n = X^T(y-\mu)$$
  This is a fixed-point function because μ is a function of θ.
- Online learning for canonical GLIMs
  - Stochastic gradient ascent:
$$\theta^{(t+1)} = \theta^{(t)} + \rho\,(y_n - \mu_n^{(t)})\,x_n, \qquad \text{where } \mu_n^{(t)} = (\theta^{(t)})^T x_n \text{ and } \rho \text{ is a step size}$$
Batch learning for canonical GLIMs
- The Hessian matrix:
$$H = \frac{d^2\ell}{d\theta\,d\theta^T} = \frac{d}{d\theta^T}\sum_n(y_n-\mu_n)x_n = -\sum_n x_n\frac{d\mu_n}{d\theta^T} = -\sum_n x_n\frac{d\mu_n}{d\eta_n}\frac{d\eta_n}{d\theta^T} = -\sum_n x_n x_n^T\frac{d\mu_n}{d\eta_n} = -X^TWX$$
  since $\eta_n = \theta^T x_n$, where
$$X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix} \text{ is the design matrix,} \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}, \qquad W = \operatorname{diag}\Big(\frac{d\mu_1}{d\eta_1},\ldots,\frac{d\mu_N}{d\eta_N}\Big),$$
which can be computed by calculating the 2nd derivative of A(η_n).
Iteratively Reweighted Least Squares (IRLS)
- Recall the Newton-Raphson method with cost function J:
$$\theta^{(t+1)} = \theta^{(t)} - H^{-1}\nabla_\theta J$$
- We now have:
$$\nabla_\theta\ell = X^T(y-\mu), \qquad H = -X^TWX$$
- Now:
$$\theta^{(t+1)} = \theta^{(t)} - H^{-1}\nabla_\theta\ell = (X^TW^tX)^{-1}\big[X^TW^tX\theta^{(t)} + X^T(y-\mu^t)\big] = (X^TW^tX)^{-1}X^TW^tz^t$$
  where the adjusted response is
$$z^t = X\theta^{(t)} + (W^t)^{-1}(y-\mu^t)$$
- This can be understood as solving the following "iteratively reweighted least squares" problem:
$$\theta^{(t+1)} = \arg\min_\theta\,(z - X\theta)^TW(z-X\theta)$$
  (compare with the ordinary least squares solution $\theta^* = (X^TX)^{-1}X^Ty$).
Example 1: logistic regression (sigmoid classifier)
- The conditional distribution is a Bernoulli:
$$p(y\mid x) = \mu(x)^y(1-\mu(x))^{1-y}$$
  where μ is a logistic function: $\mu(x) = \frac{1}{1+e^{-\eta(x)}}$
- p(y|x) is an exponential family function, with
  - mean: $E[y\mid x] = \mu = \frac{1}{1+e^{-\eta}}$
  - and canonical response function: $\eta = \xi = \theta^T x$
- IRLS:
$$\frac{d\mu}{d\eta} = \mu(1-\mu), \qquad W = \operatorname{diag}\big(\mu_1(1-\mu_1),\ldots,\mu_N(1-\mu_N)\big)$$
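A minimal IRLS sketch for this logistic-regression case, following the update θ^{(t+1)} = (XᵀWX)⁻¹XᵀWz with W = diag(μ(1−μ)) and z = Xθ + W⁻¹(y−μ); the synthetic data and the small jitter constant are illustrative assumptions:

```python
import numpy as np

def irls_logistic(X, y, n_iter=20):
    """Iteratively reweighted least squares for logistic regression."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ theta
        mu = 1 / (1 + np.exp(-eta))            # mean response
        w = mu * (1 - mu) + 1e-10              # dmu/deta, jittered to avoid division by zero
        z = eta + (y - mu) / w                 # adjusted response
        WX = X * w[:, None]
        theta = np.linalg.solve(X.T @ WX, WX.T @ z)   # (X^T W X)^{-1} X^T W z
    return theta

rng = np.random.default_rng(3)
X = np.hstack([np.ones((500, 1)), rng.normal(size=(500, 2))])
theta_true = np.array([0.5, -1.0, 2.0])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ theta_true)))
print(irls_logistic(X, y))
```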
Example 2: linear regression
- The conditional distribution is a Gaussian:
$$p(y\mid x) = \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}}\exp\Big\{-\tfrac{1}{2}(y-\mu(x))^T\Sigma^{-1}(y-\mu(x))\Big\} \;\Rightarrow\; h(y)\exp\big\{\eta^T(x)\,y - A(\eta)\big\}$$
  where μ is a linear function: $\mu(x) = \theta^T x = \eta(x)$
- p(y|x) is an exponential family function, with
  - mean: $E[y\mid x] = \mu = \theta^T x$
  - and canonical response function: $\eta = \xi = \theta^T x$
- IRLS:
$$\frac{d\mu}{d\eta} = 1, \qquad W = I$$
$$\theta^{(t+1)} = (X^TWX)^{-1}X^TWz^t = \theta^{(t)} + (X^TX)^{-1}X^T(y-\mu^t)$$
  (a steepest-descent step rescaled by $(X^TX)^{-1}$), and as $t\to\infty$,
$$\theta \to (X^TX)^{-1}X^TY$$
  (the normal equation).
Simple GMs are the building blocks of complex GMs
- Density estimation: parametric and nonparametric methods
- Regression: linear, conditional mixture, nonparametric
- Classification: generative and discriminative approaches
- Clustering

[Figures: the corresponding simple GMs, e.g., a single node X with parameter θ, and a two-node model X → Y with parameters μ, σ.]
MLE for general BNs
- If we assume the parameters for each CPD are globally independent, and all nodes are fully observed, then the log-likelihood function decomposes into a sum of local terms, one per node:
$$\ell(\theta; D) = \log p(D\mid\theta) = \log\prod_i\Big(\prod_n p(x_{n,i}\mid\mathbf{x}_{n,\pi_i},\theta_i)\Big) = \sum_i\Big(\sum_n\log p(x_{n,i}\mid\mathbf{x}_{n,\pi_i},\theta_i)\Big)$$

[Figure: an example count table over the configurations of X_2 and X_5 for one family in the network.]
Decomposable likelihood of a BN
- Consider the distribution defined by the directed acyclic GM:
$$p(x\mid\theta) = p(x_1\mid\theta_1)\,p(x_2\mid x_1,\theta_2)\,p(x_3\mid x_1,\theta_3)\,p(x_4\mid x_2,x_3,\theta_4)$$
- This is exactly like learning four separate small BNs, each of which consists of a node and its parents.

[Figure: the four-node network X_1 → {X_2, X_3} → X_4, decomposed into its four families.]
MLE for BNs with tabular CPDs
- Assume each CPD is represented as a table (multinomial), where
$$\theta_{ijk} \overset{\text{def}}{=} p(X_i = j\mid X_{\pi_i} = k)$$
- Note that in the case of multiple parents, $X_{\pi_i}$ will have a composite state, and the CPD will be a high-dimensional table.
- The sufficient statistics are counts of family configurations:
$$n_{ijk} \overset{\text{def}}{=} \sum_n x_{n,i}^j\,x_{n,\pi_i}^k$$
- The log-likelihood is
$$\ell(\theta; D) = \log\prod_{i,j,k}\theta_{ijk}^{n_{ijk}} = \sum_{i,j,k}n_{ijk}\log\theta_{ijk}$$
- Using a Lagrange multiplier to enforce $\sum_j\theta_{ijk} = 1$, we get:
$$\theta_{ijk}^{\mathrm{ML}} = \frac{n_{ijk}}{\sum_{j'}n_{ij'k}}$$
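The counting formula above is literally a tally of family configurations. A minimal sketch for a toy two-node BN A → B with binary variables (the data rows are made up for illustration):

```python
import numpy as np

# Toy fully observed data for a BN A -> B; each row is (a, b)
data = [(0, 0), (0, 1), (0, 0), (1, 1), (1, 1), (1, 0), (0, 0), (1, 1)]

# Sufficient statistics: counts[k, j] = number of samples with parent A = k and child B = j
counts = np.zeros((2, 2))
for a, b in data:
    counts[a, b] += 1

# MLE of the tabular CPD: theta[k, j] = p(B = j | A = k) = n_kj / sum_j' n_kj'
theta = counts / counts.sum(axis=1, keepdims=True)
print(theta)
```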
Summary: Learning GM
- For a fully observed BN, the log-likelihood function decomposes into a sum of local terms, one per node; thus learning is also factored.
- Learning single-node GMs (density estimation): exponential family distributions
  - Typical discrete distributions
  - Typical continuous distributions
  - Conjugate priors
- Learning two-node BNs: GLIMs
  - Conditional density estimation
  - Classification
- Learning BNs with more nodes
  - Local operations
ML Parameter Est. for partially observed GMs:
EM algorithm
Partially observed GMs
- Speech recognition

[Figure: a hidden Markov model with state variables and observations Y_1, ..., Y_T and X_1, ..., X_T.]
Partially observed GM
- Biological evolution

[Figure: a phylogenetic tree with an unobserved ancestor (?) and observed descendants A and C, branch parameters Q_h and Q_m over T years, and sequences such as AGAGAC at the leaves.]
Mixture Models
Mixture Models, cont'd
- A density model p(x) may be multi-modal.
- We may be able to model it as a mixture of uni-modal distributions (e.g., Gaussians).
- Each mode may correspond to a different sub-population (e.g., male and female).
Unobserved Variables
- A variable can be unobserved (latent) because:
  - it is an imaginary quantity meant to provide a simplified and abstract view of the data generation process, e.g., speech recognition models, mixture models, ...
  - it is a real-world object and/or phenomenon, but difficult or impossible to measure, e.g., the temperature of a star, causes of a disease, evolutionary ancestors, ...
  - it is a real-world object and/or phenomenon, but sometimes wasn't measured, because of faulty sensors, etc.
- Discrete latent variables can be used to partition/cluster data into sub-groups.
- Continuous latent variables (factors) can be used for dimensionality reduction (factor analysis, etc.).
Gaussian Mixture Models (GMMs)
- Consider a mixture of K Gaussian components:
$$p(x_n\mid\mu,\Sigma) = \sum_k \pi_k\,N(x_n\mid\mu_k,\Sigma_k)$$
  (π_k: mixture proportion; N(x|μ_k, Σ_k): mixture component)
- This model can be used for unsupervised clustering.
- This model (fit by AutoClass) has been used to discover new kinds of stars in astronomical data, etc.
Gaussian Mixture Models (GMMs)
- Consider a mixture of K Gaussian components.
- Z is a latent class indicator vector:
$$p(z_n) = \operatorname{multi}(z_n:\pi) = \prod_k(\pi_k)^{z_n^k}$$
- X is a conditional Gaussian variable with a class-specific mean/covariance:
$$p(x_n\mid z_n^k = 1,\mu,\Sigma) = \frac{1}{(2\pi)^{m/2}|\Sigma_k|^{1/2}}\exp\Big\{-\tfrac{1}{2}(x_n-\mu_k)^T\Sigma_k^{-1}(x_n-\mu_k)\Big\}$$
- The likelihood of a sample:
$$p(x_n\mid\mu,\Sigma) = \sum_k p(z^k=1\mid\pi)\,p(x\mid z^k=1,\mu,\Sigma) = \sum_{z_n}\prod_k\big(\pi_k N(x_n\mid\mu_k,\Sigma_k)\big)^{z_n^k} = \sum_k\pi_k N(x_n\mid\mu_k,\Sigma_k)$$
  (mixture proportion, mixture component)

[Figure: the two-node model Z → X.]
Why is Learning Harder?
- In fully observed iid settings, the log-likelihood decomposes into a sum of local terms (at least for directed models):
$$\ell_c(\theta; D) = \log p(x,z\mid\theta) = \log p(z\mid\theta_z) + \log p(x\mid z,\theta_x)$$
- With latent variables, all the parameters become coupled together via marginalization:
$$\ell(\theta; D) = \log\sum_z p(x,z\mid\theta) = \log\sum_z p(z\mid\theta_z)\,p(x\mid z,\theta_x)$$
Toward the EM algorithm
- Recall MLE for completely observed data.
- Data log-likelihood:
$$\ell(\theta; D) = \log\prod_n p(z_n, x_n) = \log\prod_n p(z_n\mid\pi)\,p(x_n\mid z_n,\mu,\sigma) = \sum_n\sum_k z_n^k\log\pi_k - \frac{1}{2\sigma^2}\sum_n\sum_k z_n^k(x_n-\mu_k)^2 + C$$
- MLE:
$$\hat{\pi}_{k,\mathrm{MLE}} = \arg\max_\pi\ell(\theta;D), \qquad \hat{\mu}_{k,\mathrm{MLE}} = \arg\max_\mu\ell(\theta;D) \;\Rightarrow\; \hat{\mu}_{k,\mathrm{MLE}} = \frac{\sum_n z_n^k x_n}{\sum_n z_n^k}, \qquad \hat{\sigma}_{k,\mathrm{MLE}} = \arg\max_\sigma\ell(\theta;D)$$
- What if we do not know z_n?

[Figure: plate model with z_i → x_i, replicated N times.]
Question
- "... We solve problem X using Expectation-Maximization ..." What does it mean?
- E:
  - What do we take the expectation with?
  - What do we take the expectation over?
- M:
  - What do we maximize?
  - What do we maximize with respect to?
Recall: K-means
$$z_n^{(t)} = \arg\min_k\,(x_n-\mu_k^{(t)})^T\Sigma_k^{-1}(x_n-\mu_k^{(t)})$$
$$\mu_k^{(t+1)} = \frac{\sum_n\delta(z_n^{(t)},k)\,x_n}{\sum_n\delta(z_n^{(t)},k)}$$
Expectation-Maximization
- Start: "guess" the centroid μ_k and covariance Σ_k of each of the K clusters
- Loop:
E-step
- We maximize $\langle\ell_c(\theta)\rangle$ iteratively using the following procedure:
  - Expectation step: computing the expected value of the sufficient statistics of the hidden variables (i.e., z) given the current estimate of the parameters (i.e., π and μ):
$$\tau_n^{k(t)} = \langle z_n^k\rangle_{q^{(t)}} = p(z_n^k = 1\mid x_n,\mu^{(t)},\Sigma^{(t)}) = \frac{\pi_k^{(t)}\,N(x_n\mid\mu_k^{(t)},\Sigma_k^{(t)})}{\sum_i\pi_i^{(t)}\,N(x_n\mid\mu_i^{(t)},\Sigma_i^{(t)})}$$
- Here we are essentially doing inference.
M-step
- We maximize $\langle\ell_c(\theta)\rangle$ iteratively using the following procedure:
  - Maximization step: compute the parameters under the current results of the expected values of the hidden variables:
$$\pi_k^* = \arg\max\langle\ell_c(\theta)\rangle,\ \ \frac{\partial}{\partial\pi_k}\langle\ell_c(\theta)\rangle = 0\ \forall k,\ \text{s.t.}\ \sum_k\pi_k = 1 \;\Rightarrow\; \pi_k^{(t+1)} = \frac{\sum_n\langle z_n^k\rangle_{q^{(t)}}}{N} = \frac{\sum_n\tau_n^{k(t)}}{N} = \frac{\langle n_k\rangle}{N}$$
$$\mu_k^* = \arg\max\langle\ell_c(\theta)\rangle \;\Rightarrow\; \mu_k^{(t+1)} = \frac{\sum_n\tau_n^{k(t)}x_n}{\sum_n\tau_n^{k(t)}}$$
$$\Sigma_k^* = \arg\max\langle\ell_c(\theta)\rangle \;\Rightarrow\; \Sigma_k^{(t+1)} = \frac{\sum_n\tau_n^{k(t)}(x_n-\mu_k^{(t+1)})(x_n-\mu_k^{(t+1)})^T}{\sum_n\tau_n^{k(t)}}$$
  (useful facts: $\frac{\partial\log|A|}{\partial A} = A^{-T}$ and $\frac{\partial(x^TAx)}{\partial A} = xx^T$)
- This is isomorphic to MLE except that the variables that are hidden are replaced by their expectations (in general they will be replaced by their corresponding "sufficient statistics").
Compare: K-means and EM
- K-means
  - In the K-means "E-step" we do hard assignment:
$$z_n^{(t)} = \arg\min_k\,(x_n-\mu_k^{(t)})^T\Sigma_k^{-1}(x_n-\mu_k^{(t)})$$
  - In the K-means "M-step" we update the means as the weighted sum of the data, but now the weights are 0 or 1:
$$\mu_k^{(t+1)} = \frac{\sum_n\delta(z_n^{(t)},k)\,x_n}{\sum_n\delta(z_n^{(t)},k)}$$
- EM
  - E-step (soft assignment):
$$\tau_n^{k(t)} = p(z_n^k=1\mid x_n,\mu^{(t)},\Sigma^{(t)}) = \frac{\pi_k^{(t)}N(x_n\mid\mu_k^{(t)},\Sigma_k^{(t)})}{\sum_i\pi_i^{(t)}N(x_n\mid\mu_i^{(t)},\Sigma_i^{(t)})}$$
  - M-step:
$$\mu_k^{(t+1)} = \frac{\sum_n\tau_n^{k(t)}x_n}{\sum_n\tau_n^{k(t)}}$$
The EM algorithm for mixtures of Gaussians is like a "soft version" of the K-means algorithm.
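Putting the E-step and M-step above together, here is a minimal EM loop for a Gaussian mixture; it is a sketch under illustrative assumptions (random initialization at data points, a small ridge on the covariances, synthetic two-cluster data), not the lecture's reference implementation:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, K, replace=False)]            # init means at random data points
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: responsibilities tau_nk proportional to pi_k N(x_n | mu_k, Sigma_k)
        tau = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                        for k in range(K)], axis=1)
        tau /= tau.sum(axis=1, keepdims=True)
        # M-step: weighted MLE updates
        Nk = tau.sum(axis=0)
        pi = Nk / N
        mu = (tau.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (tau[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, Sigma

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 0.5, (200, 2)), rng.normal([3, 3], 0.8, (200, 2))])
pi, mu, Sigma = em_gmm(X, K=2)
print(pi)
print(mu)
```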
The EM Objective for Gaussian mixture model
- A mixture of K Gaussians:
- Z is a latent class indicator vector:
$$p(z_n) = \operatorname{multi}(z_n:\pi) = \prod_k(\pi_k)^{z_n^k}$$
- X is a conditional Gaussian variable with class-specific mean/covariance:
$$p(x_n\mid z_n^k=1,\mu,\Sigma) = \frac{1}{(2\pi)^{m/2}|\Sigma_k|^{1/2}}\exp\Big\{-\tfrac{1}{2}(x_n-\mu_k)^T\Sigma_k^{-1}(x_n-\mu_k)\Big\}$$
- The likelihood of a sample:
$$p(x_n\mid\mu,\Sigma) = \sum_k p(z^k=1\mid\pi)\,p(x\mid z^k=1,\mu,\Sigma) = \sum_k\pi_k\,N(x_n\mid\mu_k,\Sigma_k)$$
- The expected complete log-likelihood:
$$\langle\ell_c(\theta; x,z)\rangle = \sum_n\langle\log p(z_n\mid\pi)\rangle_{p(z\mid x)} + \sum_n\langle\log p(x_n\mid z_n,\mu,\Sigma)\rangle_{p(z\mid x)}$$
$$= \sum_n\sum_k\langle z_n^k\rangle\log\pi_k - \frac{1}{2}\sum_n\sum_k\langle z_n^k\rangle\Big((x_n-\mu_k)^T\Sigma_k^{-1}(x_n-\mu_k) + \log|\Sigma_k|\Big) + C$$

[Figure: plate model with Z_n → X_n, replicated N times.]
Theory underlying EM
- What are we doing?
- Recall that according to MLE, we intend to learn the model parameters that would maximize the likelihood of the data.
- But we do not observe z, so computing
$$\ell(\theta; D) = \log\sum_z p(x,z\mid\theta) = \log\sum_z p(z\mid\theta_z)\,p(x\mid z,\theta_x)$$
is difficult!
- What shall we do?
Complete & Incomplete Log Likelihoods
- Complete log-likelihood: let X denote the observable variable(s), and Z denote the latent variable(s). If Z could be observed, then
$$\ell_c(\theta; x,z) \overset{\text{def}}{=} \log p(x,z\mid\theta)$$
  - Usually, optimizing ℓ_c() given both z and x is straightforward (c.f. MLE for fully observed models).
  - Recall that in this case the objective for, e.g., MLE decomposes into a sum of factors, and the parameter for each factor can be estimated separately.
  - But given that Z is not observed, ℓ_c() is a random quantity and cannot be maximized directly.
- Incomplete log-likelihood: with z unobserved, our objective becomes the log of a marginal probability:
$$\ell(\theta; x) = \log p(x\mid\theta) = \log\sum_z p(x,z\mid\theta)$$
  - This objective won't decouple.
Expected Complete Log Likelihood
- For any distribution q(z), define the expected complete log-likelihood:
$$\langle\ell_c(\theta; x,z)\rangle_q \overset{\text{def}}{=} \sum_z q(z\mid x,\theta)\log p(x,z\mid\theta)$$
  - A deterministic function of θ
  - Linear in ℓ_c(), so it inherits its factorizability
  - Does maximizing this surrogate yield a maximizer of the likelihood?
- Jensen's inequality:
$$\ell(\theta; x) = \log p(x\mid\theta) = \log\sum_z p(x,z\mid\theta) = \log\sum_z q(z\mid x)\frac{p(x,z\mid\theta)}{q(z\mid x)} \ge \sum_z q(z\mid x)\log\frac{p(x,z\mid\theta)}{q(z\mid x)}$$
$$\Rightarrow\quad \ell(\theta; x) \ge \langle\ell_c(\theta; x,z)\rangle_q + H_q$$
Lower Bounds and Free Energy
- For fixed data x, define a functional called the free energy:
$$F(q,\theta) \overset{\text{def}}{=} \sum_z q(z\mid x)\log\frac{p(x,z\mid\theta)}{q(z\mid x)} \le \ell(\theta; x)$$
- The EM algorithm is coordinate ascent on F:
  - E-step: $q^{t+1} = \arg\max_q F(q,\theta^t)$
  - M-step: $\theta^{t+1} = \arg\max_\theta F(q^{t+1},\theta)$
E-step: maximization of expected ℓ_c w.r.t. q
- Claim:
$$q^{t+1} = \arg\max_q F(q,\theta^t) = p(z\mid x,\theta^t)$$
- This is the posterior distribution over the latent variables given the data and the parameters. Often we need this at test time anyway (e.g., to perform classification).
- Proof (easy): this setting attains the bound $\ell(\theta; x)\ge F(q,\theta)$:
$$F(p(z\mid x,\theta^t),\theta^t) = \sum_z p(z\mid x,\theta^t)\log\frac{p(x,z\mid\theta^t)}{p(z\mid x,\theta^t)} = \sum_z p(z\mid x,\theta^t)\log p(x\mid\theta^t) = \log p(x\mid\theta^t) = \ell(\theta^t; x)$$
- Can also show this result using variational calculus or the fact that
$$\ell(\theta; x) - F(q,\theta) = \mathrm{KL}\big(q\,\|\,p(z\mid x,\theta)\big)$$
E-step ≡ plug in posterior expectation of latent variables
- Without loss of generality, assume that p(x,z|θ) is a generalized exponential family distribution:
$$p(x,z\mid\theta) = \frac{1}{Z(\theta)}h(x,z)\exp\Big\{\sum_i\theta_i f_i(x,z)\Big\}$$
- Special case: if p(X|Z) are GLIMs, then $f_i(x,z) = \eta_i^T(z)\,\xi_i(x)$.
- The expected complete log-likelihood under $q^{t+1} = p(z\mid x,\theta^t)$ is
$$\langle\ell_c(\theta; x,z)\rangle_{q^{t+1}} = \sum_z q(z\mid x,\theta^t)\log p(x,z\mid\theta) - A(\theta) = \sum_i\theta_i\langle f_i(x,z)\rangle_{q(z\mid x,\theta^t)} - A(\theta)$$
$$\overset{\text{GLIM}}{=} \sum_i\langle\eta_i(z)\rangle_{q(z\mid x,\theta^t)}\,\xi_i(x) - A(\theta)$$
M-step: maximization of expected ℓ_c w.r.t. θ
- Note that the free energy breaks into two terms:
$$F(q,\theta) = \sum_z q(z\mid x)\log\frac{p(x,z\mid\theta)}{q(z\mid x)} = \sum_z q(z\mid x)\log p(x,z\mid\theta) - \sum_z q(z\mid x)\log q(z\mid x) = \langle\ell_c(\theta; x,z)\rangle_q + H_q$$
- The first term is the expected complete log-likelihood (energy) and the second term, which does not depend on θ, is the entropy.
- Thus, in the M-step, maximizing with respect to θ for fixed q, we only need to consider the first term:
$$\theta^{t+1} = \arg\max_\theta\langle\ell_c(\theta; x,z)\rangle_{q^{t+1}} = \arg\max_\theta\sum_z q^{t+1}(z\mid x)\log p(x,z\mid\theta)$$
- Under the optimal q^{t+1}, this is equivalent to solving a standard MLE of the fully observed model p(x,z|θ), with the sufficient statistics involving z replaced by their expectations w.r.t. p(z|x,θ).
Summary: EM Algorithm
- A way of maximizing the likelihood function for latent variable models. Finds the MLE of parameters when the original (hard) problem can be broken up into two (easy) pieces:
  1. Estimate some "missing" or "unobserved" data from observed data and current parameters.
  2. Using this "complete" data, find the maximum likelihood parameter estimates.
- Alternate between filling in the latent variables using the best guess (posterior) and updating the parameters based on this guess:
  - E-step: $q^{t+1} = \arg\max_q F(q,\theta^t)$
  - M-step: $\theta^{t+1} = \arg\max_\theta F(q^{t+1},\theta)$
- In the M-step we optimize a lower bound on the likelihood. In the E-step we close the gap, making bound = likelihood.
Supplementary materials
I: Review of density estimation
- Can be viewed as single-node graphical models
- Instances of exponential family distributions
- Building blocks of general GMs
- MLE and Bayesian estimates

GM: [plate model with nodes x_1, x_2, x_3, ..., x_N, i.e., x_i replicated N times]
Discrete Distributions
- Bernoulli distribution: Ber(p)
$$P(x) = \begin{cases}1-p & \text{for } x=0\\ p & \text{for } x=1\end{cases} \quad\Rightarrow\quad P(x) = p^x(1-p)^{1-x}$$
- Multinomial distribution: Mult(1, θ)
  - Multinomial (indicator) variable:
$$X = \begin{bmatrix}X_1\\X_2\\X_3\\X_4\\X_5\\X_6\end{bmatrix},\quad X_j\in\{0,1\},\quad \sum_{j\in[1,\ldots,6]}X_j = 1,\quad X_j = 1\ \text{w.p.}\ \theta_j,\quad \sum_{j\in[1,\ldots,6]}\theta_j = 1$$
$$p(x) = P(\{X_j = 1,\ \text{where } j \text{ indexes the dice face}\}) = \theta_A^{x_A}\theta_C^{x_C}\theta_G^{x_G}\theta_T^{x_T} = \prod_k\theta_k^{x_k}$$
Discrete Distributions
- Multinomial distribution: Mult(n, θ)
  - Count variable:
$$n = \begin{bmatrix}n_1\\ \vdots\\ n_K\end{bmatrix},\qquad \sum_j n_j = N$$
$$p(n) = \frac{N!}{n_1!\,n_2!\cdots n_K!}\,\theta_1^{n_1}\theta_2^{n_2}\cdots\theta_K^{n_K}$$
Example: multinomial model
- Data: we observed N iid die rolls (K-sided): D = {5, 1, K, ..., 3}
- Representation: unit basis vectors:
$$x_n = \begin{bmatrix}x_{n,1}\\x_{n,2}\\ \vdots\\ x_{n,K}\end{bmatrix},\qquad x_{n,k}\in\{0,1\},\qquad \sum_{k=1}^K x_{n,k} = 1$$
- Model:
$$X_n\in\{1,\ldots,K\}\ \text{w.p.}\ \theta_k,\qquad \sum_k\theta_k = 1$$
- The likelihood of a single observation x_n:
$$P(x_n\mid\theta) = P(\{x_{n,k}=1,\ \text{where } k \text{ indexes the die side of the } n\text{-th roll}\}) = \theta_1^{x_{n,1}}\theta_2^{x_{n,2}}\cdots\theta_K^{x_{n,K}} = \prod_k\theta_k^{x_{n,k}}$$
- The likelihood of the dataset D = {x_1, ..., x_N}:
$$P(x_1,\ldots,x_N\mid\theta) = \prod_n P(x_n\mid\theta) = \prod_n\prod_k\theta_k^{x_{n,k}} = \prod_k\theta_k^{\sum_n x_{n,k}} = \prod_k\theta_k^{n_k}$$

GM: [plate model with nodes x_1, x_2, x_3, ..., x_N, i.e., x_i replicated N times]
MLE: constrained optimization with Lagrange multipliers
- Objective function:
$$\ell(\theta; D) = \log P(D\mid\theta) = \log\prod_k\theta_k^{n_k} = \sum_k n_k\log\theta_k$$
- We need to maximize this subject to the constraint $\sum_{k=1}^K\theta_k = 1$.
- Constrained cost function with a Lagrange multiplier:
$$\tilde{\ell} = \sum_k n_k\log\theta_k + \lambda\Big(1-\sum_{k=1}^K\theta_k\Big)$$
- Take derivatives w.r.t. θ_k:
$$\frac{\partial\tilde{\ell}}{\partial\theta_k} = \frac{n_k}{\theta_k} - \lambda = 0 \;\Rightarrow\; n_k = \lambda\theta_k \;\Rightarrow\; \lambda = \sum_k n_k = N \;\Rightarrow\; \hat{\theta}_{k,\mathrm{MLE}} = \frac{n_k}{N} \quad\text{or}\quad \hat{\theta}_{k,\mathrm{MLE}} = \frac{1}{N}\sum_n x_{n,k}$$
- Sufficient statistics:
  - The counts $n = (n_1,\ldots,n_K)$, with $n_k = \sum_n x_{n,k}$, are sufficient statistics of data D.
  - Frequency as sample mean.
Bayesian estimation:
- Dirichlet distribution:
$$P(\theta) = \frac{\Gamma(\sum_k\alpha_k)}{\prod_k\Gamma(\alpha_k)}\prod_k\theta_k^{\alpha_k-1} = C(\alpha)\prod_k\theta_k^{\alpha_k-1}$$
- Posterior distribution of θ:
$$P(\theta\mid x_1,\ldots,x_N) = \frac{p(x_1,\ldots,x_N\mid\theta)\,p(\theta)}{p(x_1,\ldots,x_N)} \propto \prod_k\theta_k^{n_k}\prod_k\theta_k^{\alpha_k-1} = \prod_k\theta_k^{\alpha_k+n_k-1}$$
- Notice the isomorphism of the posterior to the prior; such a prior is called a conjugate prior.
- Posterior mean estimation:
$$\langle\theta_k\rangle = \int\theta_k\,p(\theta\mid D)\,d\theta = C\int\theta_k\prod_k\theta_k^{\alpha_k+n_k-1}\,d\theta = \frac{n_k+\alpha_k}{N+\sum_k\alpha_k}$$

GM: [plate model θ → x_i, replicated N times]
Dirichlet parameters can be understood as pseudo-counts
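A minimal numeric illustration of the pseudo-count view (the prior and counts below are made up):

```python
import numpy as np

alpha = np.ones(6)                               # Dir(alpha) prior over a 6-sided die
counts = np.array([12, 9, 11, 8, 40, 20])        # observed counts n_k

alpha_post = alpha + counts                      # posterior is Dir(alpha + n)
theta_post_mean = alpha_post / alpha_post.sum()  # (n_k + alpha_k) / (N + sum_k alpha_k)
theta_mle = counts / counts.sum()                # compare with the MLE n_k / N
print(theta_post_mean)
print(theta_mle)
```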
More on Dirichlet Prior:
- Where does the normalization constant C(α) come from?
$$C(\alpha)^{-1} = \int\theta_1^{\alpha_1-1}\cdots\theta_K^{\alpha_K-1}\,d\theta_1\cdots d\theta_K = \frac{\prod_k\Gamma(\alpha_k)}{\Gamma(\sum_k\alpha_k)}$$
  - Integration by parts.
  - Γ(α) is the gamma function: $\Gamma(\alpha) = \int_0^\infty t^{\alpha-1}e^{-t}\,dt$
  - For integers, $\Gamma(n+1) = n!$
- Marginal likelihood:
$$p(\{x_1,\ldots,x_N\}\mid\alpha) = p(n\mid\alpha) = \int p(n\mid\theta)\,p(\theta\mid\alpha)\,d\theta = \frac{C(\alpha)}{C(\alpha+n)}$$
- Posterior in closed form:
$$P(\theta\mid\{x_1,\ldots,x_N\},\alpha) = \frac{p(n\mid\theta)\,p(\theta\mid\alpha)}{p(n\mid\alpha)} = C(\alpha+n)\prod_k\theta_k^{\alpha_k+n_k-1} = \operatorname{Dir}(\alpha+n)$$
- Posterior predictive rate:
$$p(x_{N+1}=i\mid\{x_1,\ldots,x_N\},\alpha) = \int C(\alpha+n)\,\theta_i\prod_k\theta_k^{\alpha_k+n_k-1}\,d\theta = \frac{\alpha_i+n_i}{|\alpha|+|n|}$$
Sequential Bayesian updating
- Start with a Dirichlet prior: $P(\theta\mid\alpha) = \operatorname{Dir}(\theta:\alpha)$
- Observe N' samples with sufficient statistics n'. The posterior becomes:
$$P(\theta\mid\alpha, n') = \operatorname{Dir}(\theta:\alpha+n')$$
- Observe another N'' samples with sufficient statistics n''. The posterior becomes:
$$P(\theta\mid\alpha, n', n'') = \operatorname{Dir}(\theta:\alpha+n'+n'')$$
- So sequentially absorbing data in any order is equivalent to a batch update.
Hierarchical Bayesian Models
- θ are the parameters for the likelihood p(x|θ).
- α are the parameters for the prior p(θ|α).
- We can have hyper-hyper-parameters, etc.
- We stop when the choice of hyper-parameters makes no difference to the marginal likelihood; typically we make the hyper-parameters constants.
- Where do we get the prior?
  - Intelligent guesses
  - Empirical Bayes (Type-II maximum likelihood): computing point estimates of α:
$$\hat{\alpha}_{\mathrm{MLE}} = \arg\max_\alpha\,p(n\mid\alpha)$$
Limitation of Dirichlet Prior:
[Figure: plate model α → θ → x_i, replicated N times.]
The Logistic Normal Prior
- θ ~ LN(μ, Σ):
$$\gamma\sim N_{K-1}(\mu,\Sigma),\qquad \gamma_K = 0$$
$$\theta_i = \exp\{\gamma_i - C(\gamma)\},\qquad C(\gamma) = \log\Big(1+\sum_{i=1}^{K-1}e^{\gamma_i}\Big)$$
  C(γ) is the log partition function / normalization constant, and it is the source of the "problem" below (non-conjugacy).

[Figure: plate model μ, Σ → θ → x_i, replicated N times.]
- Pro: covariance structure
- Con: non-conjugate (we will discuss how to solve this later)
Logistic Normal Densities
[Figure: example logistic normal densities.]
Continuous Distributions
- Uniform probability density function:
$$p(x) = \frac{1}{b-a}\ \text{for}\ a\le x\le b, \qquad 0\ \text{elsewhere}$$
- Normal (Gaussian) probability density function:
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-(x-\mu)^2/2\sigma^2}$$
  - The distribution is symmetric, and is often illustrated as a bell-shaped curve.
  - Two parameters, μ (mean) and σ (standard deviation), determine the location and shape of the distribution.
  - The highest point on the normal curve is at the mean, which is also the median and mode.
  - The mean can be any numerical value: negative, zero, or positive.
- Multivariate Gaussian:
$$p(\mathbf{x};\mu,\Sigma) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\Big\{-\tfrac{1}{2}(\mathbf{x}-\mu)^T\Sigma^{-1}(\mathbf{x}-\mu)\Big\}$$
MLE for a multivariate-Gaussian
- It can be shown that the MLE for μ and Σ is
$$\mu_{\mathrm{MLE}} = \frac{1}{N}\sum_n x_n, \qquad \Sigma_{\mathrm{MLE}} = \frac{1}{N}\sum_n(x_n-\mu_{\mathrm{ML}})(x_n-\mu_{\mathrm{ML}})^T = \frac{1}{N}S$$
  where the scatter matrix is
$$S = \sum_n(x_n-\mu_{\mathrm{ML}})(x_n-\mu_{\mathrm{ML}})^T = \sum_n x_nx_n^T - N\mu_{\mathrm{ML}}\mu_{\mathrm{ML}}^T$$
- The sufficient statistics are $\sum_n x_n$ and $\sum_n x_nx_n^T$.
- Note that $X^TX = \sum_n x_nx_n^T$ may not be full rank (e.g., if N < D), in which case Σ_ML is not invertible. Here
$$x_n = \begin{bmatrix}x_{n,1}\\x_{n,2}\\ \vdots\\ x_{n,K}\end{bmatrix}, \qquad X = \begin{bmatrix}x_1^T\\x_2^T\\ \vdots\\ x_N^T\end{bmatrix}$$
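A minimal numeric check of the Gaussian MLE formulas above on synthetic data (the mean and covariance used to generate the data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[2.0, 0.6], [0.6, 1.0]], size=500)
N = X.shape[0]

mu_mle = X.mean(axis=0)                # (1/N) sum_n x_n
centered = X - mu_mle
S = centered.T @ centered              # scatter matrix
Sigma_mle = S / N                      # (1/N) sum_n (x_n - mu)(x_n - mu)^T
print(mu_mle)
print(Sigma_mle)
```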
Bayesian parameter estimation for a Gaussian
- There are various reasons to pursue a Bayesian approach:
  - We would like to update our estimates sequentially over time.
  - We may have prior knowledge about the expected magnitude of the parameters.
  - The MLE for Σ may not be full rank if we don't have enough data.
- We will restrict our attention to conjugate priors.
- We will consider various cases, in order of increasing complexity:
  - Known σ, unknown μ
  - Known μ, unknown σ
  - Unknown μ and σ
Bayesian estimation: unknown µ, known σ
- Normal prior:
$$P(\mu) = (2\pi\tau^2)^{-1/2}\exp\big\{-(\mu-\mu_0)^2/2\tau^2\big\}$$
- Joint probability:
$$P(x,\mu) = (2\pi\sigma^2)^{-N/2}\exp\Big\{-\frac{1}{2\sigma^2}\sum_{n=1}^N(x_n-\mu)^2\Big\}\times(2\pi\tau^2)^{-1/2}\exp\big\{-(\mu-\mu_0)^2/2\tau^2\big\}$$
- Posterior:
$$P(\mu\mid x) = (2\pi\tilde{\sigma}^2)^{-1/2}\exp\big\{-(\mu-\tilde{\mu})^2/2\tilde{\sigma}^2\big\}$$
  where
$$\tilde{\mu} = \frac{N/\sigma^2}{N/\sigma^2+1/\tau^2}\,\bar{x} + \frac{1/\tau^2}{N/\sigma^2+1/\tau^2}\,\mu_0, \qquad \tilde{\sigma}^2 = \Big(\frac{N}{\sigma^2}+\frac{1}{\tau^2}\Big)^{-1}$$
  and $\bar{x}$ is the sample mean.

GM: [plate model μ → x_i, replicated N times]
Bayesian estimation: unknown µ, known σ
- The posterior mean is a convex combination of the prior and the MLE, with weights proportional to the relative noise levels:
$$\mu_N = \frac{N/\sigma^2}{N/\sigma^2+1/\sigma_0^2}\,\bar{x} + \frac{1/\sigma_0^2}{N/\sigma^2+1/\sigma_0^2}\,\mu_0, \qquad \sigma_N^2 = \Big(\frac{N}{\sigma^2}+\frac{1}{\sigma_0^2}\Big)^{-1}$$
- The precision of the posterior, 1/σ_N², is the precision of the prior, 1/σ_0², plus one contribution of data precision 1/σ² for each observed data point.
- Sequentially updating the mean: μ* = 0.8 (unknown), (σ²)* = 0.1 (known)
- Effect of a single data point (N = 1):
$$\mu_1 = \mu_0 + (x-\mu_0)\frac{\sigma_0^2}{\sigma_0^2+\sigma^2} = x - (x-\mu_0)\frac{\sigma^2}{\sigma_0^2+\sigma^2}$$
- Uninformative (vague/flat) prior, σ_0² → ∞: the posterior mean reduces to the sample mean.
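The sequential update mentioned above is a one-liner per data point: precisions add, and the posterior mean is a precision-weighted average. A minimal sketch using the slide's example values (true mean 0.8, known variance 0.1); the prior is an illustrative assumption:

```python
import numpy as np

sigma2 = 0.1            # known observation variance
mu0, tau2 = 0.0, 1.0    # assumed N(mu0, tau2) prior on the unknown mean

rng = np.random.default_rng(5)
data = rng.normal(0.8, np.sqrt(sigma2), size=20)    # true (unknown) mean is 0.8

mu_post, var_post = mu0, tau2
for x in data:
    prec = 1 / var_post + 1 / sigma2                # posterior precision = prior + data precision
    mu_post = (mu_post / var_post + x / sigma2) / prec
    var_post = 1 / prec
print(mu_post, var_post)
```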
Other scenarios
- Known μ, unknown λ = 1/σ²:
  - The conjugate prior for λ is a Gamma with shape a_0 and rate (inverse scale) b_0.
  - The conjugate prior for σ² is Inverse-Gamma.
- Unknown μ and unknown σ²:
  - The conjugate prior is Normal-Inverse-Gamma.
  - Semi-conjugate prior.
- Multivariate case:
  - The conjugate prior is Normal-Inverse-Wishart.
II: Two node fully observed BNs
- Conditional mixtures
- Linear/logistic regression
- Classification
  - Generative and discriminative approaches
Classification:
- Goal: wish to learn f: X → Y
- Generative: modeling the joint distribution of all data
- Discriminative: modeling only points at the boundary
Conditional Gaussian
- The data: $\{(x_1,y_1),(x_2,y_2),(x_3,y_3),\ldots,(x_N,y_N)\}$
- Both nodes are observed:
  - Y is a class indicator vector:
$$p(y_n) = \operatorname{multi}(y_n:\pi) = \prod_k\pi_k^{y_n^k}$$
  - X is a conditional Gaussian variable with a class-specific mean:
$$p(x_n\mid y_n^k=1,\mu,\sigma) = \frac{1}{(2\pi\sigma^2)^{1/2}}\exp\Big\{-\frac{1}{2\sigma^2}(x_n-\mu_k)^2\Big\}, \qquad p(x\mid y,\mu,\sigma) = \prod_n\prod_k N(x_n:\mu_k,\sigma)^{y_n^k}$$

GM: [plate model Y_i → X_i, replicated N times]

MLE of conditional Gaussian
- Data log-likelihood:
$$\ell(\theta; D) = \log\prod_n p(x_n,y_n) = \log\prod_n p(y_n\mid\pi)\,p(x_n\mid y_n,\mu,\sigma)$$
- MLE:
$$\hat{\pi}_{k,\mathrm{MLE}} = \arg\max_\pi\ell(\theta; D) \;\Rightarrow\; \hat{\pi}_{k,\mathrm{MLE}} = \frac{\sum_n y_n^k}{N} = \frac{n_k}{N}\quad\text{(the fraction of samples of class k)}$$
$$\hat{\mu}_{k,\mathrm{MLE}} = \arg\max_\mu\ell(\theta; D) \;\Rightarrow\; \hat{\mu}_{k,\mathrm{MLE}} = \frac{\sum_n y_n^k x_n}{\sum_n y_n^k}\quad\text{(the average of samples of class k)}$$
Bayesian estimation of conditional Gaussian
- Prior:
$$P(\pi\mid\alpha) = \operatorname{Dir}(\pi:\alpha), \qquad P(\mu_k\mid\nu,\tau) = \operatorname{Normal}(\mu_k:\nu,\tau^2)$$
- Posterior mean (Bayesian estimate):
$$\hat{\pi}_{k,\mathrm{Bayes}} = \frac{n_k+\alpha_k}{N+\sum_k\alpha_k} = \frac{N}{N+|\alpha|}\hat{\pi}_{k,\mathrm{ML}} + \frac{|\alpha|}{N+|\alpha|}\frac{\alpha_k}{|\alpha|}$$
$$\hat{\mu}_{k,\mathrm{Bayes}} = \frac{n_k/\sigma^2}{n_k/\sigma^2+1/\tau^2}\,\hat{\mu}_{k,\mathrm{ML}} + \frac{1/\tau^2}{n_k/\sigma^2+1/\tau^2}\,\nu, \qquad \tilde{\sigma}_k^2 = \Big(\frac{n_k}{\sigma^2}+\frac{1}{\tau^2}\Big)^{-1}$$

GM: [plate model with priors on π and μ, and Y_i → X_i replicated N times]
Classification
- Gaussian discriminative analysis:
  - The joint probability of a datum and its label is:
$$p(y_n^k=1,x_n\mid\mu,\sigma) = p(y_n^k=1)\,p(x_n\mid y_n^k=1,\mu,\sigma) = \pi_k\frac{1}{(2\pi\sigma^2)^{1/2}}\exp\Big\{-\frac{1}{2\sigma^2}(x_n-\mu_k)^2\Big\}$$
  - Given a datum x_n, we predict its label using the conditional probability of the label given the datum:
$$p(y_n^k=1\mid x_n,\mu,\sigma) = \frac{\pi_k\frac{1}{(2\pi\sigma^2)^{1/2}}\exp\big\{-\frac{1}{2\sigma^2}(x_n-\mu_k)^2\big\}}{\sum_{k'}\pi_{k'}\frac{1}{(2\pi\sigma_{k'}^2)^{1/2}}\exp\big\{-\frac{1}{2\sigma_{k'}^2}(x_n-\mu_{k'})^2\big\}}$$
- This is basic inference: introduce evidence, and then normalize.
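The "introduce evidence, then normalize" step above is just a softmax over the class-conditional joint terms. A minimal sketch with made-up class priors, means, and a shared variance:

```python
import numpy as np

pi = np.array([0.3, 0.7])       # class priors pi_k
mu = np.array([-1.0, 2.0])      # class-conditional means mu_k
sigma2 = 1.0                    # shared variance

def class_posterior(x):
    # p(y_k = 1 | x) is proportional to pi_k exp{-(x - mu_k)^2 / (2 sigma^2)}
    unnorm = pi * np.exp(-(x - mu) ** 2 / (2 * sigma2))
    return unnorm / unnorm.sum()

print(class_posterior(0.5))
```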
Transductive classification
- Given X_n, what is its corresponding Y_n, when we know the answer for a set of training data?
- Frequentist prediction: we fit π, μ and σ from data first, and then compute
$$p(y_n^k=1\mid x_n,\mu,\sigma,\pi) = \frac{\pi_k\,N(x_n\mid\mu_k,\sigma_k)}{\sum_{k'}\pi_{k'}\,N(x_n\mid\mu_{k'},\sigma_{k'})}$$
- Bayesian: we compute the posterior distribution of the parameters first, ...

GM: [plate model with parameters π, μ shared by the training pairs (Y_i, X_i) and the query pair (Y_n, X_n).]
Linear Regression
- The data: $\{(x_1,y_1),(x_2,y_2),(x_3,y_3),\ldots,(x_N,y_N)\}$
- Both nodes are observed:
  - X is an input vector
  - Y is a response vector (we first consider y as a generic continuous response vector, then we consider the special case of classification where y is a discrete indicator)
- A regression scheme can be used to model p(y|x) directly, rather than p(x,y).

GM: [plate model X_i → Y_i, replicated N times]
A discriminative probabilistic model
- Let us assume that the target variable and the inputs are related by the equation:
$$y_i = \theta^T x_i + \varepsilon_i$$
where ε is an error term of unmodeled effects or random noise.
- Now assume that ε follows a Gaussian N(0, σ); then we have:
$$p(y_i\mid x_i;\theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big(-\frac{(y_i-\theta^Tx_i)^2}{2\sigma^2}\Big)$$
- By the independence assumption:
$$L(\theta) = \prod_{i=1}^n p(y_i\mid x_i;\theta) = \Big(\frac{1}{\sqrt{2\pi}\,\sigma}\Big)^n\exp\Big(-\frac{\sum_{i=1}^n(y_i-\theta^Tx_i)^2}{2\sigma^2}\Big)$$
Linear regression
- Hence the log-likelihood is:
$$\ell(\theta) = n\log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-\theta^Tx_i)^2$$
- Do you recognize the last term? Yes, it is:
$$J(\theta) = \frac{1}{2}\sum_{i=1}^n(\theta^Tx_i-y_i)^2$$
- It is the same as the MSE!
A recap:
- LMS update rule:
$$\theta^{(t+1)} = \theta^{(t)} + \alpha\,(y_n-(\theta^{(t)})^Tx_n)\,x_n$$
  - Pros: on-line, low per-step cost
  - Cons: coordinate-wise, maybe slow-converging
- Steepest descent:
$$\theta^{(t+1)} = \theta^{(t)} + \alpha\sum_{n=1}^N(y_n-(\theta^{(t)})^Tx_n)\,x_n$$
  - Pros: fast-converging, easy to implement
  - Cons: a batch method
- Normal equations:
$$\theta^* = (X^TX)^{-1}X^Ty$$
  - Pros: a single-shot algorithm! Easiest to implement.
  - Cons: need to compute the pseudo-inverse (X^TX)^{-1}, expensive, numerical issues (e.g., the matrix is singular ...)
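A minimal side-by-side sketch of the normal equation and the on-line LMS rule on synthetic data (dimensions, step size, and noise level are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
N, d = 200, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d - 1))])
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=N)

# Normal equation: single-shot solution theta* = (X^T X)^{-1} X^T y
theta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# LMS: on-line updates theta <- theta + alpha (y_n - theta^T x_n) x_n
theta_lms = np.zeros(d)
alpha = 0.01
for _ in range(20):
    for n in range(N):
        theta_lms += alpha * (y[n] - theta_lms @ X[n]) * X[n]

print(theta_ne)
print(theta_lms)
```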
Bayesian linear regression
III: How to define parameter prior for general BN?

[Figure: the Earthquake/Burglary network (Earthquake, Radio, Burglary, Alarm, Call).]
$$p(\mathbf{x}) = \prod_{i=1}^M p(x_i\mid\mathbf{x}_{\pi_i})$$
Assumptions (Geiger & Heckerman 97, 99):
- Complete model equivalence
- Global parameter independence
- Local parameter independence
- Likelihood and prior modularity
Which prior p(θ|G) to use?

Factorization: local distributions defined by, e.g., multinomial parameters:
$$p(x_i = k\mid\mathbf{x}_{\pi_i}^j) = \theta_{x_i^k\mid\mathbf{x}_{\pi_i}^j}$$

Global & Local Parameter Independence
- Global parameter independence: for every DAG model,
$$p(\theta_m\mid G) = \prod_{i=1}^M p(\theta_i\mid G)$$
- Local parameter independence: for every node,
$$p(\theta_i\mid G) = \prod_{j=1}^{q_i}p(\theta_{x_i^k\mid\mathbf{x}_{\pi_i}^j}\mid G)$$
  e.g., $\theta_{\mathrm{Call}\mid\mathrm{Alarm}=\mathrm{YES}}$ is independent of $\theta_{\mathrm{Call}\mid\mathrm{Alarm}=\mathrm{NO}}$

[Figure: the Earthquake/Burglary network again.]
Parameter Independence, Graphical View
- Provided all variables are observed in all cases, we can perform the Bayesian update of each parameter independently!

[Figure: a two-sample meta-BN over X_1 → X_2 illustrating global parameter independence (θ_1 vs. θ_{2|1}) and local parameter independence.]
Which PDFs Satisfy Our Assumptions? (Geiger & Heckerman 97,99)
- Discrete DAG models: $x_i\mid\mathbf{x}_{\pi_i}^j \sim \operatorname{Multi}(\theta)$, with a Dirichlet prior:
$$P(\theta) = \frac{\Gamma(\sum_k\alpha_k)}{\prod_k\Gamma(\alpha_k)}\prod_k\theta_k^{\alpha_k-1} = C(\alpha)\prod_k\theta_k^{\alpha_k-1}$$
- Gaussian DAG models: $x_i\mid\mathbf{x}_{\pi_i}^j \sim \operatorname{Normal}(\mu,\Sigma)$, with a Normal prior:
$$p(\mu\mid\bar{\nu},\Psi) = (2\pi)^{-n/2}|\Psi|^{-1/2}\exp\Big\{-\tfrac{1}{2}(\mu-\bar{\nu})'\Psi^{-1}(\mu-\bar{\nu})\Big\}$$
  or a Normal-Wishart prior:
$$p(\mu\mid\bar{\nu},\alpha_\mu,W) = \operatorname{Normal}\big(\bar{\nu},(\alpha_\mu W)^{-1}\big),$$
$$p(W\mid\alpha_w,T) = c(n,\alpha_w)\,|T|^{\alpha_w/2}\,|W|^{(\alpha_w-n-1)/2}\exp\Big\{-\tfrac{1}{2}\operatorname{tr}(TW)\Big\}, \qquad \text{where } W = \Sigma^{-1}.$$
Parameter sharing
- Consider a time-invariant (stationary) 1st-order Markov model:
  - Initial state probability vector: $\pi_k \overset{\text{def}}{=} p(X_1^k = 1)$
  - State transition probability matrix: $A_{ij} \overset{\text{def}}{=} p(X_t^j = 1\mid X_{t-1}^i = 1)$
- The joint:
$$p(X\mid\theta) = p(x_1\mid\pi)\prod_{t=2}^T p(x_t\mid x_{t-1},A)$$
- The log-likelihood:
$$\ell(\theta; D) = \sum_n\log p(x_{n,1}\mid\pi) + \sum_n\sum_{t=2}^T\log p(x_{n,t}\mid x_{n,t-1},A)$$
- Again, we optimize each parameter separately:
  - π is a multinomial frequency vector, and we've seen it before.
  - What about A?

[Figure: the chain X_1 → X_2 → X_3 → ... → X_T with shared parameters π, A.]
Learning a Markov chain transition matrix
- A is a stochastic matrix: $\sum_j A_{ij} = 1$
- Each row of A is a multinomial distribution.
- So the MLE of A_ij is the fraction of transitions from i to j:
$$A_{ij}^{\mathrm{ML}} = \frac{\#(i\to j)}{\#(i\to\cdot)} = \frac{\sum_n\sum_{t=2}^T x_{n,t-1}^i\,x_{n,t}^j}{\sum_n\sum_{t=2}^T x_{n,t-1}^i}$$
- Application: if the states X_t represent words, this is called a bigram language model.
- Sparse data problem:
  - If i → j did not occur in the data, we will have A_ij = 0; then any future sequence with the word pair i → j will have zero probability.
  - A standard hack: backoff smoothing or deleted interpolation,
$$\tilde{A}_{i\to j} = \lambda\eta_j + (1-\lambda)A_{i\to j}^{\mathrm{ML}}$$
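A minimal sketch of the transition-matrix MLE and a simple smoothed version; the state sequences, smoothing weight, and uniform background distribution are illustrative assumptions:

```python
import numpy as np

K = 3
sequences = [[0, 1, 1, 2, 0, 1], [2, 2, 1, 0, 0, 1, 2]]   # toy state sequences

counts = np.zeros((K, K))
for seq in sequences:
    for i, j in zip(seq[:-1], seq[1:]):
        counts[i, j] += 1                                  # count transitions i -> j

A_ml = counts / counts.sum(axis=1, keepdims=True)          # A_ij = #(i->j) / #(i->.)

# Backoff smoothing: mix the MLE with a background distribution eta
lam = 0.1
eta = np.full(K, 1.0 / K)
A_smooth = lam * eta + (1 - lam) * A_ml
print(A_ml)
print(A_smooth)
```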
Bayesian language model
- Global and local parameter independence:
  - The posterior of A_{i→·} and A_{i'→·} is factorized despite the v-structure on X_t, because X_{t-1} acts like a multiplexer.
- Assign a Dirichlet prior β_i to each row of the transition matrix:
$$A_{ij}^{\mathrm{Bayes}} \overset{\text{def}}{=} p(A_{ij}\mid D,\beta_i) = \frac{\#(i\to j)+\beta_{i,j}}{\#(i\to\cdot)+|\beta_i|} = \lambda_i\frac{\beta_{i,j}}{|\beta_i|} + (1-\lambda_i)A_{ij}^{\mathrm{ML}}, \qquad \lambda_i = \frac{|\beta_i|}{\#(i\to\cdot)+|\beta_i|}$$
- We could consider more realistic priors, e.g., mixtures of Dirichlets to account for types of words (adjectives, verbs, etc.).

[Figure: the chain X_1 → X_2 → ... → X_T with priors α → π and β → A_{i→·}, A_{i'→·}.]
IV: More on EM
Unsupervised ML estimation
- Given x = x_1 ... x_N for which the true state path y = y_1 ... y_N is unknown:
- EXPECTATION MAXIMIZATION
  0. Start with our best guess of a model M and parameters θ.
  1. Estimate A_ij and B_ik in the training data. How? Use the expected counts:
$$A_{ij} = \sum_{n,t}\langle y_{n,t-1}^i\,y_{n,t}^j\rangle, \qquad B_{ik} = \sum_{n,t}\langle y_{n,t}^i\rangle\,x_{n,t}^k$$
  2. Update θ according to A_ij, B_ik. Now a "supervised learning" problem.
  3. Repeat 1 & 2, until convergence.
  This is called the Baum-Welch algorithm.
- We can get a provably more (or equally) likely parameter set θ at each iteration.
EM for general BNs
while not converged
  % E-step
  for each node i: ESS_i = 0   % reset expected sufficient statistics
  for each data sample n:
    do inference with X_{n,H}   % the hidden variables of sample n
    for each node i:
      ESS_i += ⟨SS_i(x_{n,i}, x_{n,π_i})⟩_{p(X_{n,H} | x_{n,-H})}
  % M-step
  for each node i: θ_i := MLE(ESS_i)
Conditional mixture model: Mixture of experts
- We will model p(Y|X) using different experts, each responsible for different regions of the input space.
- The latent variable Z chooses an expert using a softmax gating function:
$$P(z^k=1\mid x) = \operatorname{softmax}_k(\xi^Tx)$$
- Each expert can be a linear regression model:
$$P(y\mid z^k=1,x) = N(y;\theta_k^Tx,\sigma_k^2)$$
- The posterior expert responsibilities are
$$P(z^k=1\mid x,y) = \frac{p(z^k=1\mid x)\,p(y\mid z^k=1,x,\theta_k,\sigma_k)}{\sum_j p(z^j=1\mid x)\,p(y\mid z^j=1,x,\theta_j,\sigma_j)}$$
EM for conditional mixture model
- Model:
$$P(y\mid x) = \sum_k p(z^k=1\mid x,\xi)\,p(y\mid z^k=1,x,\theta_k,\sigma)$$
- The objective function (expected complete log-likelihood):
$$\langle\ell_c(\theta; x,y,z)\rangle = \sum_n\langle\log p(z_n\mid x_n,\xi)\rangle_{p(z\mid x,y)} + \sum_n\langle\log p(y_n\mid x_n,z_n,\theta,\sigma)\rangle_{p(z\mid x,y)}$$
$$= \sum_n\sum_k\langle z_n^k\rangle\log\operatorname{softmax}_k(\xi^Tx_n) - \frac{1}{2\sigma^2}\sum_n\sum_k\langle z_n^k\rangle(y_n-\theta_k^Tx_n)^2 + C$$
- EM:
  - E-step:
$$\tau_n^{k(t)} = P(z_n^k=1\mid x_n,y_n,\theta) = \frac{p(z^k=1\mid x_n)\,p(y_n\mid x_n,\theta_k,\sigma_k)}{\sum_j p(z^j=1\mid x_n)\,p(y_n\mid x_n,\theta_j,\sigma_j)}$$
  - M-step:
    - Use the normal equation $\theta = (X^TX)^{-1}X^TY$ for standard LR, but with the data re-weighted by τ (homework).
    - Use the IRLS and/or weighted IRLS algorithm to update {ξ_k, θ_k, σ_k} based on the data pairs (x_n, y_n), with weights $\tau_n^{k(t)}$ (homework?).
Hierarchical mixture of experts
- This is like a soft version of a depth-2 classification/regression tree.
- P(Y|X, G_1, G_2) can be modeled as a GLIM, with parameters dependent on the values of G_1 and G_2 (which specify a "conditional path" to a given leaf in the tree).
Mixture of overlapping experts
- By removing the X → Z arc, we can make the partitions independent of the input, thus allowing overlap.
- This is a mixture of linear regressors; each subpopulation has a different conditional mean:
$$P(z^k=1\mid x,y) = \frac{p(z^k=1)\,p(y\mid z^k=1,x,\theta_k,\sigma_k)}{\sum_j p(z^j=1)\,p(y\mid z^j=1,x,\theta_j,\sigma_j)}$$
Partially Hidden Data
- Of course, we can learn when there are missing (hidden) variables in some cases and not in others.
- In this case the cost function is:
$$\ell_c(\theta; D) = \sum_{n\in\text{Complete}}\log p(x_n,y_n\mid\theta) + \sum_{m\in\text{Missing}}\log\sum_y p(x_m,y\mid\theta)$$
- Note that the Y_m do not have to be the same in each case: the data can have different missing values in each different sample.
- Now you can think of this in a new way: in the E-step we estimate the hidden variables on the incomplete cases only.
- The M-step optimizes the log-likelihood on the complete data plus the expected likelihood on the incomplete data using the E-step.
EM Variants
- Sparse EM: do not re-compute exactly the posterior probability on each data point under all models, because it is almost zero. Instead keep an "active list" which you update every once in a while.
- Generalized (Incomplete) EM: it might be hard to find the ML parameters in the M-step, even given the completed data. We can still make progress by doing an M-step that improves the likelihood a bit (e.g., a gradient step). Recall the IRLS step in the mixture of experts model.
A Report Card for EM
- Some good things about EM:
  - no learning rate (step-size) parameter
  - automatically enforces parameter constraints
  - very fast for low dimensions
  - each iteration guaranteed to improve the likelihood
- Some bad things about EM:
  - can get stuck in local minima
  - can be slower than conjugate gradient (especially near convergence)
  - requires an expensive inference step
  - is a maximum likelihood/MAP method