Probabilistic Graphical Models
Parameter Estimation
Eric Xing, Lecture 5, January 29, 2020
Reading: see class homepage
Learning Graphical Models
- The goal: given a set of independent samples (assignments of the random variables), find the best (the most likely?) Bayesian network, i.e., both the DAG and the CPDs.
  (B,E,A,C,R) = (T,F,F,T,F)
  (B,E,A,C,R) = (T,F,T,T,F)
  ...
  (B,E,A,C,R) = (F,T,T,T,F)
[Figure: the Burglary/Earthquake network (nodes E, R, B, A, C) together with the CPT P(A | E, B); structural learning recovers the DAG, parameter learning fills in the CPDs.]
Learning Graphical Models
- Scenarios:
  - completely observed GMs
    - directed
    - undirected
  - partially or unobserved GMs
    - directed
    - undirected (an open research topic)
- Estimation principles:
  - Maximum likelihood estimation (MLE)
  - Bayesian estimation
  - Maximal conditional likelihood
  - Maximal "margin"
  - Maximum entropy
- We use learning as a name for the process of estimating the parameters, and in some cases, the topology of the network, from data.
ML Parameter Est. for completely observed GMs of
given structure
[Figure: a two-node model Z → X.]
- The data: {(z_1, x_1), (z_2, x_2), (z_3, x_3), ..., (z_N, x_N)}
Parameter Learning
- Assume G is known and fixed:
  - from expert design,
  - or from an intermediate outcome of iterative structure learning.
- Goal: estimate the parameters θ from a dataset of N independent, identically distributed (iid) training cases D = {x_1, ..., x_N}.
- In general, each training case x_n = (x_{n,1}, ..., x_{n,M}) is a vector of M values, one per node.
- The model can be completely observable, i.e., every element in x_n is known (no missing values, no hidden variables),
- or partially observable, i.e., ∃i s.t. x_{n,i} is not observed.
- In this lecture we consider learning parameters for a BN that has a given structure and is completely observable.
$$\ell(\theta; D) = \log p(D\mid\theta) = \log \prod_i \Big( \prod_n p(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i) \Big) = \sum_i \Big( \sum_n \log p(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i) \Big)$$
Review of density estimation
- Can be viewed as single-node GMs
- Instances of exponential family distributions
- Building blocks of general GMs
- MLE and Bayesian estimates
- See supplementary slides

GM: [plate model with nodes x_1, x_2, x_3, ..., x_N, i.e., x_i replicated N times]
Example (multinomial density estimation):
$$P(x_n \mid \theta) = P(\{x_{n,k}=1,\ \text{where } k \text{ indexes the die side of the } n\text{-th roll}\}) = \theta_1^{x_{n,1}}\theta_2^{x_{n,2}}\cdots\theta_K^{x_{n,K}} = \prod_k \theta_k^{x_{n,k}}$$
$$P(x_1,\ldots,x_N \mid \theta) = \prod_n P(x_n\mid\theta) = \prod_n \prod_k \theta_k^{x_{n,k}} = \prod_k \theta_k^{\sum_n x_{n,k}} = \prod_k \theta_k^{n_k}$$
MLE (with the constraint $\sum_k\theta_k=1$ enforced by a Lagrange multiplier $\lambda$):
$$\frac{\partial \tilde{\ell}}{\partial\theta_k} = \frac{n_k}{\theta_k} - \lambda = 0 \;\Rightarrow\; n_k = \lambda\theta_k \;\Rightarrow\; \lambda = \sum_k n_k = N \;\Rightarrow\; \hat{\theta}_{k,\mathrm{MLE}} = \frac{n_k}{N}$$
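As a quick sanity check of the multinomial MLE above, here is a minimal sketch in Python (not from the original slides; the synthetic die probabilities and variable names are made up for illustration):

```python
import numpy as np

# N die rolls from a K-sided die, encoded as integers 0..K-1 (synthetic data)
K = 6
rng = np.random.default_rng(0)
true_theta = np.array([0.1, 0.1, 0.2, 0.2, 0.3, 0.1])
rolls = rng.choice(K, size=1000, p=true_theta)

# Sufficient statistics: the counts n_k = sum_n x_{n,k}
counts = np.bincount(rolls, minlength=K)

# MLE from the Lagrange-multiplier derivation above: theta_k = n_k / N
theta_mle = counts / counts.sum()
print(theta_mle)
```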
Estimation of conditional density
- Can be viewed as two-node graphical models
- Instances of GLIMs (generalized linear models)
- Building blocks of general GMs
- MLE and Bayesian estimates
- See supplementary slides

[Figure: two-node graphical models.]
Exponential family, a basic building block
- For a numeric random variable X,
$$p(x\mid\eta) = h(x)\exp\{\eta^T T(x) - A(\eta)\} = \frac{1}{Z(\eta)}\,h(x)\exp\{\eta^T T(x)\}$$
is an exponential family distribution with natural (canonical) parameter η.
- The function T(x) is a sufficient statistic.
- The function A(η) = log Z(η) is the log normalizer.
- Examples: Bernoulli, multinomial, Gaussian, Poisson, gamma, ...
Example: Multivariate Gaussian Distribution
- For a continuous vector random variable X ∈ R^k:
$$p(x\mid\mu,\Sigma) = \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}}\exp\Big\{-\tfrac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\Big\}$$
$$= \frac{1}{(2\pi)^{k/2}}\exp\Big\{-\tfrac{1}{2}\operatorname{tr}(\Sigma^{-1}xx^T) + \mu^T\Sigma^{-1}x - \tfrac{1}{2}\mu^T\Sigma^{-1}\mu - \tfrac{1}{2}\log|\Sigma|\Big\}$$
  (moment parameters: μ, Σ)
- Exponential family representation (natural parameters):
$$\eta = \big[\Sigma^{-1}\mu;\ -\tfrac{1}{2}\operatorname{vec}(\Sigma^{-1})\big], \qquad \eta_1 = \Sigma^{-1}\mu,\ \ \eta_2 = -\tfrac{1}{2}\Sigma^{-1}$$
$$T(x) = \big[x;\ \operatorname{vec}(xx^T)\big], \qquad A(\eta) = \tfrac{1}{2}\mu^T\Sigma^{-1}\mu + \tfrac{1}{2}\log|\Sigma| = -\tfrac{1}{4}\eta_1^T\eta_2^{-1}\eta_1 - \tfrac{1}{2}\log|-2\eta_2|, \qquad h(x) = (2\pi)^{-k/2}$$
- Note: a k-dimensional Gaussian is a (k + k^2)-parameter distribution with a (k + k^2)-element vector of sufficient statistics (but because of symmetry and positive definiteness, the parameters are constrained and have lower degrees of freedom).
Example: Multinomial distribution
- For a binary (indicator) vector random variable x ~ multi(x | π):
$$p(x\mid\pi) = \pi_1^{x_1}\pi_2^{x_2}\cdots\pi_K^{x_K} = \exp\Big\{\sum_k x_k\ln\pi_k\Big\}$$
$$= \exp\Big\{\sum_{k=1}^{K-1}x_k\ln\pi_k + \Big(1-\sum_{k=1}^{K-1}x_k\Big)\ln\Big(1-\sum_{k=1}^{K-1}\pi_k\Big)\Big\}$$
$$= \exp\Big\{\sum_{k=1}^{K-1}x_k\ln\frac{\pi_k}{1-\sum_{k=1}^{K-1}\pi_k} + \ln\Big(1-\sum_{k=1}^{K-1}\pi_k\Big)\Big\}$$
- Exponential family representation:
$$\eta = \Big[\ln\frac{\pi_k}{\pi_K};\ 0\Big], \qquad T(x) = [x], \qquad A(\eta) = -\ln\Big(1-\sum_{k=1}^{K-1}\pi_k\Big) = \ln\Big(\sum_{k=1}^{K}e^{\eta_k}\Big), \qquad h(x) = 1$$
Why exponential family?
- Moment generating property:
$$\frac{dA}{d\eta} = \frac{d}{d\eta}\log Z(\eta) = \frac{1}{Z(\eta)}\frac{dZ(\eta)}{d\eta} = \int T(x)\,h(x)\frac{\exp\{\eta^T T(x)\}}{Z(\eta)}\,dx = E[T(x)]$$
$$\frac{d^2A}{d\eta^2} = \int T^2(x)\,h(x)\frac{\exp\{\eta^T T(x)\}}{Z(\eta)}\,dx - \Big(\int T(x)\,h(x)\frac{\exp\{\eta^T T(x)\}}{Z(\eta)}\,dx\Big)\frac{1}{Z(\eta)}\frac{dZ(\eta)}{d\eta}$$
$$= E[T^2(x)] - E[T(x)]^2 = \operatorname{Var}[T(x)]$$
Moment estimation
- We can easily compute moments of any exponential family distribution by taking derivatives of the log normalizer A(η).
- The q-th derivative gives the q-th centered moment:
$$\frac{dA(\eta)}{d\eta} = \text{mean}, \qquad \frac{d^2A(\eta)}{d\eta^2} = \text{variance}, \qquad \ldots$$
- When the sufficient statistic is a stacked vector, partial derivatives need to be considered.
Moment vs canonical parameters
- The moment parameter μ can be derived from the natural (canonical) parameter:
$$\frac{dA(\eta)}{d\eta} = E[T(x)] \overset{\text{def}}{=} \mu$$
- A(η) is convex since
$$\frac{d^2A(\eta)}{d\eta^2} = \operatorname{Var}[T(x)] > 0$$
- Hence we can invert the relationship and infer the canonical parameter from the moment parameter (1-to-1):
$$\eta \overset{\text{def}}{=} \psi(\mu)$$
- A distribution in the exponential family can be parameterized not only by η (the canonical parameterization) but also by μ (the moment parameterization).

[Figure: A(η) is a convex function of η; its slope at η* gives the corresponding moment parameter.]
MLE for Exponential Family
- For iid data, the log-likelihood is
$$\ell(\eta; D) = \log\prod_n h(x_n)\exp\{\eta^T T(x_n) - A(\eta)\} = \sum_n\log h(x_n) + \eta^T\Big(\sum_n T(x_n)\Big) - N\,A(\eta)$$
- Take derivatives and set to zero:
$$\frac{\partial\ell}{\partial\eta} = \sum_n T(x_n) - N\frac{\partial A(\eta)}{\partial\eta} = 0 \;\Rightarrow\; \frac{\partial A(\eta)}{\partial\eta} = \frac{1}{N}\sum_n T(x_n) \;\Rightarrow\; \hat{\mu}_{\mathrm{MLE}} = \frac{1}{N}\sum_n T(x_n)$$
- This amounts to moment matching.
- We can infer the canonical parameters using $\hat{\eta}_{\mathrm{MLE}} = \psi(\hat{\mu}_{\mathrm{MLE}})$.
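The moment-matching recipe above is easy to exercise numerically. A minimal sketch, assuming a Bernoulli written in exponential family form (T(x) = x, A(η) = log(1 + e^η), so ψ(μ) = log(μ/(1 − μ)); the data here are synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.binomial(1, 0.3, size=5000)          # iid Bernoulli data

# MLE of the moment parameter: the average sufficient statistic
mu_mle = x.mean()

# Invert the moment/canonical map: eta = psi(mu) = log(mu / (1 - mu))
eta_mle = np.log(mu_mle / (1 - mu_mle))

# Check the moment-matching condition dA/deta = sigmoid(eta) = mu at the MLE
assert np.isclose(1 / (1 + np.exp(-eta_mle)), mu_mle)
print(mu_mle, eta_mle)
```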
Sufficiency
- For p(x|θ), T(x) is sufficient for θ if there is no information in X regarding θ beyond that in T(x).
- We can throw away X for the purpose of inference w.r.t. θ.
- Bayesian view: $p(\theta\mid T(x), x) = p(\theta\mid T(x))$
- Frequentist view: $p(x\mid T(x), \theta) = p(x\mid T(x))$
- The Neyman factorization theorem:
  - T(x) is sufficient for θ if
$$p(x, T(x), \theta) = \psi_1(T(x), \theta)\,\psi_2(x, T(x)) \;\Rightarrow\; p(x\mid\theta) = g(T(x), \theta)\,h(x, T(x))$$
Examples
- Gaussian:
$$\eta = \big[\Sigma^{-1}\mu;\ -\tfrac{1}{2}\operatorname{vec}(\Sigma^{-1})\big], \quad T(x) = \big[x;\ \operatorname{vec}(xx^T)\big], \quad A(\eta) = \tfrac{1}{2}\mu^T\Sigma^{-1}\mu + \tfrac{1}{2}\log|\Sigma|, \quad h(x) = (2\pi)^{-k/2}$$
$$\Rightarrow\ \hat{\mu}_{\mathrm{MLE}} = \frac{1}{N}\sum_n T_1(x_n) = \frac{1}{N}\sum_n x_n$$
- Multinomial:
$$\eta = \Big[\ln\frac{\pi_k}{\pi_K};\ 0\Big], \quad T(x) = [x], \quad A(\eta) = \ln\Big(\sum_k e^{\eta_k}\Big), \quad h(x) = 1$$
$$\Rightarrow\ \hat{\mu}_{\mathrm{MLE}} = \frac{1}{N}\sum_n x_n$$
- Poisson:
$$p(x\mid\lambda) = \frac{\lambda^x e^{-\lambda}}{x!}, \quad \eta = \log\lambda, \quad T(x) = x, \quad A(\eta) = \lambda = e^{\eta}, \quad h(x) = \frac{1}{x!}$$
$$\Rightarrow\ \hat{\mu}_{\mathrm{MLE}} = \frac{1}{N}\sum_n x_n$$
Generalized Linear Models (GLIMs)
- The graphical model:
  - Linear regression
  - Discriminative linear classification
  - Commonality: model $E_p(Y) = \mu = f(\theta^T X)$
  - What is p()? The conditional distribution of Y.
  - What is f()? The response function.
- GLIM:
  - The observed input x is assumed to enter into the model via a linear combination of its elements, $\xi = \theta^T x$.
  - The conditional mean μ is represented as a function f(ξ) of ξ, where f is known as the response function.
  - The observed output y is assumed to be characterized by an exponential family distribution with conditional mean μ.

[Figure: plate model X_n → Y_n, replicated N times; the GLIM pipeline x → ξ = θ^T x → μ = f(ξ) → y ~ EXP(μ).]
Recall Linear Regression
- Let us assume that the target variable and the inputs are related by the equation:
$$y_i = \theta^T x_i + \varepsilon_i$$
where ε is an error term of unmodeled effects or random noise.
- Now assume that ε follows a Gaussian N(0, σ); then we have:
$$p(y_i\mid x_i;\theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big(-\frac{(y_i - \theta^T x_i)^2}{2\sigma^2}\Big)$$
- We can use the LMS algorithm, a gradient ascent/descent approach, to estimate the parameter.
Recall: Logistic Regression (sigmoid classifier, perceptron, etc.)
- The conditional distribution is a Bernoulli:
$$p(y\mid x) = \mu(x)^y\,(1-\mu(x))^{1-y}$$
where μ is a logistic function:
$$\mu(x) = \frac{1}{1+e^{-\theta^T x}}$$
- We can use the brute-force gradient method as in LR.
- But we can also apply generic laws by observing that p(y|x) is an exponential family distribution; more specifically, a generalized linear model!
More examples: parameterizing graphical models
- Markov random fields:
$$p(x) = \frac{1}{Z}\exp\Big\{\sum_{c\in C}\phi_c(x_c)\Big\} = \frac{1}{Z}\exp\{-H(x)\}$$
$$p(X) = \frac{1}{Z}\exp\Big\{\sum_{i,\,j\in N_i}\theta_{ij}X_iX_j + \sum_i\theta_{i0}X_i\Big\}$$

Restricted Boltzmann Machines
[Figure: a bipartite graph of hidden units h and visible units x.]
$$p(x, h\mid\theta) = \exp\Big\{\sum_i\theta_i\phi_i(x_i) + \sum_j\theta_j\phi_j(h_j) + \sum_{i,j}\theta_{i,j}\phi_{i,j}(x_i,h_j) - A(\theta)\Big\}$$

Conditional Random Fields
$$p(y\mid x) = \frac{1}{Z(\theta, x)}\exp\Big\{\sum_c\theta_c f_c(x, y_c)\Big\}$$
- Discriminative
- The X_i's are assumed to be features that are inter-dependent
- When labeling X_i, future observations are taken into account

[Figure: chain-structured CRF with labels Y_1, ..., Y_T connected to observations X_1, ..., X_T.]
GLIM, cont.
- The choice of exponential family is constrained by the nature of the data Y.
  - Example: y is a continuous vector → multivariate Gaussian; y is a class label → Bernoulli or multinomial.
- The choice of the response function:
  - Following some mild constraints, e.g., [0,1], positivity, ...
  - Canonical response function: $f = \psi^{-1}(\cdot)$
  - In this case θ^T x directly corresponds to the canonical parameter η:
$$p(y\mid\eta) = h(y)\exp\{\eta\,y - A(\eta)\} \;\Rightarrow\; p(y\mid\theta, x) = h(y)\exp\{(\theta^T x)\,y - A(\theta^T x)\}$$
Example canonical response functions
MLE for GLIMs with natural response
- Log-likelihood:
$$\ell = \sum_n\log h(y_n) + \sum_n\big(\theta^T x_n\,y_n - A(\eta_n)\big)$$
- Derivative of the log-likelihood:
$$\frac{d\ell}{d\theta} = \sum_n\Big(x_n y_n - \frac{dA(\eta_n)}{d\eta_n}\frac{d\eta_n}{d\theta}\Big) = \sum_n(y_n - \mu_n)x_n = X^T(y-\mu)$$
  This is a fixed-point function because μ is a function of θ.
- Online learning for canonical GLIMs
  - Stochastic gradient ascent:
$$\theta^{(t+1)} = \theta^{(t)} + \rho\,(y_n - \mu_n^{(t)})\,x_n, \qquad \text{where } \mu_n^{(t)} = (\theta^{(t)})^T x_n \text{ and } \rho \text{ is a step size}$$
Batch learning for canonical GLIMs
- The Hessian matrix:
$$H = \frac{d^2\ell}{d\theta\,d\theta^T} = \frac{d}{d\theta^T}\sum_n(y_n-\mu_n)x_n = -\sum_n x_n\frac{d\mu_n}{d\theta^T} = -\sum_n x_n\frac{d\mu_n}{d\eta_n}\frac{d\eta_n}{d\theta^T} = -\sum_n x_n x_n^T\frac{d\mu_n}{d\eta_n} = -X^TWX$$
  since $\eta_n = \theta^T x_n$, where
$$X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix} \text{ is the design matrix,} \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}, \qquad W = \operatorname{diag}\Big(\frac{d\mu_1}{d\eta_1},\ldots,\frac{d\mu_N}{d\eta_N}\Big),$$
which can be computed by calculating the 2nd derivative of A(η_n).
Iteratively Reweighted Least Squares (IRLS)
- Recall the Newton-Raphson method with cost function J:
$$\theta^{(t+1)} = \theta^{(t)} - H^{-1}\nabla_\theta J$$
- We now have:
$$\nabla_\theta\ell = X^T(y-\mu), \qquad H = -X^TWX$$
- Now:
$$\theta^{(t+1)} = \theta^{(t)} - H^{-1}\nabla_\theta\ell = (X^TW^tX)^{-1}\big[X^TW^tX\theta^{(t)} + X^T(y-\mu^t)\big] = (X^TW^tX)^{-1}X^TW^tz^t$$
  where the adjusted response is
$$z^t = X\theta^{(t)} + (W^t)^{-1}(y-\mu^t)$$
- This can be understood as solving the following "iteratively reweighted least squares" problem:
$$\theta^{(t+1)} = \arg\min_\theta\,(z - X\theta)^TW(z-X\theta)$$
  (compare with the ordinary least squares solution $\theta^* = (X^TX)^{-1}X^Ty$).
Example 1: logistic regression (sigmoid classifier)
- The conditional distribution is a Bernoulli:
$$p(y\mid x) = \mu(x)^y(1-\mu(x))^{1-y}$$
  where μ is a logistic function: $\mu(x) = \frac{1}{1+e^{-\eta(x)}}$
- p(y|x) is an exponential family function, with
  - mean: $E[y\mid x] = \mu = \frac{1}{1+e^{-\eta}}$
  - and canonical response function: $\eta = \xi = \theta^T x$
- IRLS:
$$\frac{d\mu}{d\eta} = \mu(1-\mu), \qquad W = \operatorname{diag}\big(\mu_1(1-\mu_1),\ldots,\mu_N(1-\mu_N)\big)$$
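A minimal IRLS sketch for this logistic-regression case, following the update θ^{(t+1)} = (XᵀWX)⁻¹XᵀWz with W = diag(μ(1−μ)) and z = Xθ + W⁻¹(y−μ); the synthetic data and the small jitter constant are illustrative assumptions:

```python
import numpy as np

def irls_logistic(X, y, n_iter=20):
    """Iteratively reweighted least squares for logistic regression."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ theta
        mu = 1 / (1 + np.exp(-eta))            # mean response
        w = mu * (1 - mu) + 1e-10              # dmu/deta, jittered to avoid division by zero
        z = eta + (y - mu) / w                 # adjusted response
        WX = X * w[:, None]
        theta = np.linalg.solve(X.T @ WX, WX.T @ z)   # (X^T W X)^{-1} X^T W z
    return theta

rng = np.random.default_rng(3)
X = np.hstack([np.ones((500, 1)), rng.normal(size=(500, 2))])
theta_true = np.array([0.5, -1.0, 2.0])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ theta_true)))
print(irls_logistic(X, y))
```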
Example 2: linear regression
- The conditional distribution is a Gaussian:
$$p(y\mid x) = \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}}\exp\Big\{-\tfrac{1}{2}(y-\mu(x))^T\Sigma^{-1}(y-\mu(x))\Big\} \;\Rightarrow\; h(y)\exp\big\{\eta^T(x)\,y - A(\eta)\big\}$$
  where μ is a linear function: $\mu(x) = \theta^T x = \eta(x)$
- p(y|x) is an exponential family function, with
  - mean: $E[y\mid x] = \mu = \theta^T x$
  - and canonical response function: $\eta = \xi = \theta^T x$
- IRLS:
$$\frac{d\mu}{d\eta} = 1, \qquad W = I$$
$$\theta^{(t+1)} = (X^TWX)^{-1}X^TWz^t = \theta^{(t)} + (X^TX)^{-1}X^T(y-\mu^t)$$
  (a steepest-descent step rescaled by $(X^TX)^{-1}$), and as $t\to\infty$,
$$\theta \to (X^TX)^{-1}X^TY$$
  (the normal equation).
Simple GMs are the building blocks of complex GMs
- Density estimation: parametric and nonparametric methods
- Regression: linear, conditional mixture, nonparametric
- Classification: generative and discriminative approaches
- Clustering

[Figures: the corresponding simple GMs, e.g., a single node X with parameter θ, and a two-node model X → Y with parameters μ, σ.]
MLE for general BNs
- If we assume the parameters for each CPD are globally independent, and all nodes are fully observed, then the log-likelihood function decomposes into a sum of local terms, one per node:
$$\ell(\theta; D) = \log p(D\mid\theta) = \log\prod_i\Big(\prod_n p(x_{n,i}\mid\mathbf{x}_{n,\pi_i},\theta_i)\Big) = \sum_i\Big(\sum_n\log p(x_{n,i}\mid\mathbf{x}_{n,\pi_i},\theta_i)\Big)$$

[Figure: an example count table over the configurations of X_2 and X_5 for one family in the network.]
Decomposable likelihood of a BN
- Consider the distribution defined by the directed acyclic GM:
$$p(x\mid\theta) = p(x_1\mid\theta_1)\,p(x_2\mid x_1,\theta_2)\,p(x_3\mid x_1,\theta_3)\,p(x_4\mid x_2,x_3,\theta_4)$$
- This is exactly like learning four separate small BNs, each of which consists of a node and its parents.

[Figure: the four-node network X_1 → {X_2, X_3} → X_4, decomposed into its four families.]
MLE for BNs with tabular CPDs
- Assume each CPD is represented as a table (multinomial), where
$$\theta_{ijk} \overset{\text{def}}{=} p(X_i = j\mid X_{\pi_i} = k)$$
- Note that in the case of multiple parents, $X_{\pi_i}$ will have a composite state, and the CPD will be a high-dimensional table.
- The sufficient statistics are counts of family configurations:
$$n_{ijk} \overset{\text{def}}{=} \sum_n x_{n,i}^j\,x_{n,\pi_i}^k$$
- The log-likelihood is
$$\ell(\theta; D) = \log\prod_{i,j,k}\theta_{ijk}^{n_{ijk}} = \sum_{i,j,k}n_{ijk}\log\theta_{ijk}$$
- Using a Lagrange multiplier to enforce $\sum_j\theta_{ijk} = 1$, we get:
$$\theta_{ijk}^{\mathrm{ML}} = \frac{n_{ijk}}{\sum_{j'}n_{ij'k}}$$
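The counting formula above is literally a tally of family configurations. A minimal sketch for a toy two-node BN A → B with binary variables (the data rows are made up for illustration):

```python
import numpy as np

# Toy fully observed data for a BN A -> B; each row is (a, b)
data = [(0, 0), (0, 1), (0, 0), (1, 1), (1, 1), (1, 0), (0, 0), (1, 1)]

# Sufficient statistics: counts[k, j] = number of samples with parent A = k and child B = j
counts = np.zeros((2, 2))
for a, b in data:
    counts[a, b] += 1

# MLE of the tabular CPD: theta[k, j] = p(B = j | A = k) = n_kj / sum_j' n_kj'
theta = counts / counts.sum(axis=1, keepdims=True)
print(theta)
```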
Summary: Learning GM
- For a fully observed BN, the log-likelihood function decomposes into a sum of local terms, one per node; thus learning is also factored.
- Learning single-node GMs (density estimation): exponential family distributions
  - Typical discrete distributions
  - Typical continuous distributions
  - Conjugate priors
- Learning two-node BNs: GLIMs
  - Conditional density estimation
  - Classification
- Learning BNs with more nodes
  - Local operations
ML Parameter Est. for partially observed GMs:
EM algorithm
Partially observed GMs
- Speech recognition

[Figure: a hidden Markov model with state variables and observations Y_1, ..., Y_T and X_1, ..., X_T.]
Partially observed GM
- Biological evolution

[Figure: a phylogenetic tree with an unobserved ancestor (?) and observed descendants A and C, branch parameters Q_h and Q_m over T years, and sequences such as AGAGAC at the leaves.]
Mixture Models
Mixture Models, cont'd
- A density model p(x) may be multi-modal.
- We may be able to model it as a mixture of uni-modal distributions (e.g., Gaussians).
- Each mode may correspond to a different sub-population (e.g., male and female).
Unobserved Variables
- A variable can be unobserved (latent) because:
  - it is an imaginary quantity meant to provide a simplified and abstract view of the data generation process, e.g., speech recognition models, mixture models, ...
  - it is a real-world object and/or phenomenon, but difficult or impossible to measure, e.g., the temperature of a star, causes of a disease, evolutionary ancestors, ...
  - it is a real-world object and/or phenomenon, but sometimes wasn't measured, because of faulty sensors, etc.
- Discrete latent variables can be used to partition/cluster data into sub-groups.
- Continuous latent variables (factors) can be used for dimensionality reduction (factor analysis, etc.).
Gaussian Mixture Models (GMMs)
- Consider a mixture of K Gaussian components:
$$p(x_n\mid\mu,\Sigma) = \sum_k \pi_k\,N(x_n\mid\mu_k,\Sigma_k)$$
  (π_k: mixture proportion; N(x|μ_k, Σ_k): mixture component)
- This model can be used for unsupervised clustering.
- This model (fit by AutoClass) has been used to discover new kinds of stars in astronomical data, etc.
Gaussian Mixture Models (GMMs)
- Consider a mixture of K Gaussian components.
- Z is a latent class indicator vector:
$$p(z_n) = \operatorname{multi}(z_n:\pi) = \prod_k(\pi_k)^{z_n^k}$$
- X is a conditional Gaussian variable with a class-specific mean/covariance:
$$p(x_n\mid z_n^k = 1,\mu,\Sigma) = \frac{1}{(2\pi)^{m/2}|\Sigma_k|^{1/2}}\exp\Big\{-\tfrac{1}{2}(x_n-\mu_k)^T\Sigma_k^{-1}(x_n-\mu_k)\Big\}$$
- The likelihood of a sample:
$$p(x_n\mid\mu,\Sigma) = \sum_k p(z^k=1\mid\pi)\,p(x\mid z^k=1,\mu,\Sigma) = \sum_{z_n}\prod_k\big(\pi_k N(x_n\mid\mu_k,\Sigma_k)\big)^{z_n^k} = \sum_k\pi_k N(x_n\mid\mu_k,\Sigma_k)$$
  (mixture proportion, mixture component)

[Figure: the two-node model Z → X.]
Why is Learning Harder?
- In fully observed iid settings, the log-likelihood decomposes into a sum of local terms (at least for directed models):
$$\ell_c(\theta; D) = \log p(x,z\mid\theta) = \log p(z\mid\theta_z) + \log p(x\mid z,\theta_x)$$
- With latent variables, all the parameters become coupled together via marginalization:
$$\ell(\theta; D) = \log\sum_z p(x,z\mid\theta) = \log\sum_z p(z\mid\theta_z)\,p(x\mid z,\theta_x)$$
Toward the EM algorithm
- Recall MLE for completely observed data.
- Data log-likelihood:
$$\ell(\theta; D) = \log\prod_n p(z_n, x_n) = \log\prod_n p(z_n\mid\pi)\,p(x_n\mid z_n,\mu,\sigma) = \sum_n\sum_k z_n^k\log\pi_k - \frac{1}{2\sigma^2}\sum_n\sum_k z_n^k(x_n-\mu_k)^2 + C$$
- MLE:
$$\hat{\pi}_{k,\mathrm{MLE}} = \arg\max_\pi\ell(\theta;D), \qquad \hat{\mu}_{k,\mathrm{MLE}} = \arg\max_\mu\ell(\theta;D) \;\Rightarrow\; \hat{\mu}_{k,\mathrm{MLE}} = \frac{\sum_n z_n^k x_n}{\sum_n z_n^k}, \qquad \hat{\sigma}_{k,\mathrm{MLE}} = \arg\max_\sigma\ell(\theta;D)$$
- What if we do not know z_n?

[Figure: plate model with z_i → x_i, replicated N times.]
Question
- "... We solve problem X using Expectation-Maximization ..." What does it mean?
- E:
  - What do we take the expectation with?
  - What do we take the expectation over?
- M:
  - What do we maximize?
  - What do we maximize with respect to?
Recall: K-means
$$z_n^{(t)} = \arg\min_k\,(x_n-\mu_k^{(t)})^T\Sigma_k^{-1}(x_n-\mu_k^{(t)})$$
$$\mu_k^{(t+1)} = \frac{\sum_n\delta(z_n^{(t)},k)\,x_n}{\sum_n\delta(z_n^{(t)},k)}$$
Expectation-Maximization
- Start: "guess" the centroid μ_k and covariance Σ_k of each of the K clusters
- Loop:
E-step
- We maximize $\langle\ell_c(\theta)\rangle$ iteratively using the following procedure:
  - Expectation step: computing the expected value of the sufficient statistics of the hidden variables (i.e., z) given the current estimate of the parameters (i.e., π and μ):
$$\tau_n^{k(t)} = \langle z_n^k\rangle_{q^{(t)}} = p(z_n^k = 1\mid x_n,\mu^{(t)},\Sigma^{(t)}) = \frac{\pi_k^{(t)}\,N(x_n\mid\mu_k^{(t)},\Sigma_k^{(t)})}{\sum_i\pi_i^{(t)}\,N(x_n\mid\mu_i^{(t)},\Sigma_i^{(t)})}$$
- Here we are essentially doing inference.
M-step
- We maximize $\langle\ell_c(\theta)\rangle$ iteratively using the following procedure:
  - Maximization step: compute the parameters under the current results of the expected values of the hidden variables:
$$\pi_k^* = \arg\max\langle\ell_c(\theta)\rangle,\ \ \frac{\partial}{\partial\pi_k}\langle\ell_c(\theta)\rangle = 0\ \forall k,\ \text{s.t.}\ \sum_k\pi_k = 1 \;\Rightarrow\; \pi_k^{(t+1)} = \frac{\sum_n\langle z_n^k\rangle_{q^{(t)}}}{N} = \frac{\sum_n\tau_n^{k(t)}}{N} = \frac{\langle n_k\rangle}{N}$$
$$\mu_k^* = \arg\max\langle\ell_c(\theta)\rangle \;\Rightarrow\; \mu_k^{(t+1)} = \frac{\sum_n\tau_n^{k(t)}x_n}{\sum_n\tau_n^{k(t)}}$$
$$\Sigma_k^* = \arg\max\langle\ell_c(\theta)\rangle \;\Rightarrow\; \Sigma_k^{(t+1)} = \frac{\sum_n\tau_n^{k(t)}(x_n-\mu_k^{(t+1)})(x_n-\mu_k^{(t+1)})^T}{\sum_n\tau_n^{k(t)}}$$
  (useful facts: $\frac{\partial\log|A|}{\partial A} = A^{-T}$ and $\frac{\partial(x^TAx)}{\partial A} = xx^T$)
- This is isomorphic to MLE except that the variables that are hidden are replaced by their expectations (in general they will be replaced by their corresponding "sufficient statistics").
Compare: K-means and EM
- K-means
  - In the K-means "E-step" we do hard assignment:
$$z_n^{(t)} = \arg\min_k\,(x_n-\mu_k^{(t)})^T\Sigma_k^{-1}(x_n-\mu_k^{(t)})$$
  - In the K-means "M-step" we update the means as the weighted sum of the data, but now the weights are 0 or 1:
$$\mu_k^{(t+1)} = \frac{\sum_n\delta(z_n^{(t)},k)\,x_n}{\sum_n\delta(z_n^{(t)},k)}$$
- EM
  - E-step (soft assignment):
$$\tau_n^{k(t)} = p(z_n^k=1\mid x_n,\mu^{(t)},\Sigma^{(t)}) = \frac{\pi_k^{(t)}N(x_n\mid\mu_k^{(t)},\Sigma_k^{(t)})}{\sum_i\pi_i^{(t)}N(x_n\mid\mu_i^{(t)},\Sigma_i^{(t)})}$$
  - M-step:
$$\mu_k^{(t+1)} = \frac{\sum_n\tau_n^{k(t)}x_n}{\sum_n\tau_n^{k(t)}}$$
The EM algorithm for mixtures of Gaussians is like a "soft version" of the K-means algorithm.
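Putting the E-step and M-step above together, here is a minimal EM loop for a Gaussian mixture; it is a sketch under illustrative assumptions (random initialization at data points, a small ridge on the covariances, synthetic two-cluster data), not the lecture's reference implementation:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, K, replace=False)]            # init means at random data points
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: responsibilities tau_nk proportional to pi_k N(x_n | mu_k, Sigma_k)
        tau = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                        for k in range(K)], axis=1)
        tau /= tau.sum(axis=1, keepdims=True)
        # M-step: weighted MLE updates
        Nk = tau.sum(axis=0)
        pi = Nk / N
        mu = (tau.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (tau[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, Sigma

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 0.5, (200, 2)), rng.normal([3, 3], 0.8, (200, 2))])
pi, mu, Sigma = em_gmm(X, K=2)
print(pi)
print(mu)
```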
The EM Objective for Gaussian mixture model
- A mixture of K Gaussians:
- Z is a latent class indicator vector:
$$p(z_n) = \operatorname{multi}(z_n:\pi) = \prod_k(\pi_k)^{z_n^k}$$
- X is a conditional Gaussian variable with class-specific mean/covariance:
$$p(x_n\mid z_n^k=1,\mu,\Sigma) = \frac{1}{(2\pi)^{m/2}|\Sigma_k|^{1/2}}\exp\Big\{-\tfrac{1}{2}(x_n-\mu_k)^T\Sigma_k^{-1}(x_n-\mu_k)\Big\}$$
- The likelihood of a sample:
$$p(x_n\mid\mu,\Sigma) = \sum_k p(z^k=1\mid\pi)\,p(x\mid z^k=1,\mu,\Sigma) = \sum_k\pi_k\,N(x_n\mid\mu_k,\Sigma_k)$$
- The expected complete log-likelihood:
$$\langle\ell_c(\theta; x,z)\rangle = \sum_n\langle\log p(z_n\mid\pi)\rangle_{p(z\mid x)} + \sum_n\langle\log p(x_n\mid z_n,\mu,\Sigma)\rangle_{p(z\mid x)}$$
$$= \sum_n\sum_k\langle z_n^k\rangle\log\pi_k - \frac{1}{2}\sum_n\sum_k\langle z_n^k\rangle\Big((x_n-\mu_k)^T\Sigma_k^{-1}(x_n-\mu_k) + \log|\Sigma_k|\Big) + C$$

[Figure: plate model with Z_n → X_n, replicated N times.]
Theory underlying EM
- What are we doing?
- Recall that according to MLE, we intend to learn the model parameters that would maximize the likelihood of the data.
- But we do not observe z, so computing
$$\ell(\theta; D) = \log\sum_z p(x,z\mid\theta) = \log\sum_z p(z\mid\theta_z)\,p(x\mid z,\theta_x)$$
is difficult!
- What shall we do?
Complete & Incomplete Log Likelihoods
- Complete log-likelihood: let X denote the observable variable(s), and Z denote the latent variable(s). If Z could be observed, then
$$\ell_c(\theta; x,z) \overset{\text{def}}{=} \log p(x,z\mid\theta)$$
  - Usually, optimizing ℓ_c() given both z and x is straightforward (c.f. MLE for fully observed models).
  - Recall that in this case the objective for, e.g., MLE decomposes into a sum of factors, and the parameter for each factor can be estimated separately.
  - But given that Z is not observed, ℓ_c() is a random quantity and cannot be maximized directly.
- Incomplete log-likelihood: with z unobserved, our objective becomes the log of a marginal probability:
$$\ell(\theta; x) = \log p(x\mid\theta) = \log\sum_z p(x,z\mid\theta)$$
  - This objective won't decouple.
Expected Complete Log Likelihood
- For any distribution q(z), define the expected complete log-likelihood:
$$\langle\ell_c(\theta; x,z)\rangle_q \overset{\text{def}}{=} \sum_z q(z\mid x,\theta)\log p(x,z\mid\theta)$$
  - A deterministic function of θ
  - Linear in ℓ_c(), so it inherits its factorizability
  - Does maximizing this surrogate yield a maximizer of the likelihood?
- Jensen's inequality:
$$\ell(\theta; x) = \log p(x\mid\theta) = \log\sum_z p(x,z\mid\theta) = \log\sum_z q(z\mid x)\frac{p(x,z\mid\theta)}{q(z\mid x)} \ge \sum_z q(z\mid x)\log\frac{p(x,z\mid\theta)}{q(z\mid x)}$$
$$\Rightarrow\quad \ell(\theta; x) \ge \langle\ell_c(\theta; x,z)\rangle_q + H_q$$
Lower Bounds and Free Energy
- For fixed data x, define a functional called the free energy:
$$F(q,\theta) \overset{\text{def}}{=} \sum_z q(z\mid x)\log\frac{p(x,z\mid\theta)}{q(z\mid x)} \le \ell(\theta; x)$$
- The EM algorithm is coordinate ascent on F:
  - E-step: $q^{t+1} = \arg\max_q F(q,\theta^t)$
  - M-step: $\theta^{t+1} = \arg\max_\theta F(q^{t+1},\theta)$
E-step: maximization of expected ℓ_c w.r.t. q
- Claim:
$$q^{t+1} = \arg\max_q F(q,\theta^t) = p(z\mid x,\theta^t)$$
- This is the posterior distribution over the latent variables given the data and the parameters. Often we need this at test time anyway (e.g., to perform classification).
- Proof (easy): this setting attains the bound $\ell(\theta; x)\ge F(q,\theta)$:
$$F(p(z\mid x,\theta^t),\theta^t) = \sum_z p(z\mid x,\theta^t)\log\frac{p(x,z\mid\theta^t)}{p(z\mid x,\theta^t)} = \sum_z p(z\mid x,\theta^t)\log p(x\mid\theta^t) = \log p(x\mid\theta^t) = \ell(\theta^t; x)$$
- Can also show this result using variational calculus or the fact that
$$\ell(\theta; x) - F(q,\theta) = \mathrm{KL}\big(q\,\|\,p(z\mid x,\theta)\big)$$
E-step ≡ plug in posterior expectation of latent variables
- Without loss of generality, assume that p(x,z|θ) is a generalized exponential family distribution:
$$p(x,z\mid\theta) = \frac{1}{Z(\theta)}h(x,z)\exp\Big\{\sum_i\theta_i f_i(x,z)\Big\}$$
- Special case: if p(X|Z) are GLIMs, then $f_i(x,z) = \eta_i^T(z)\,\xi_i(x)$.
- The expected complete log-likelihood under $q^{t+1} = p(z\mid x,\theta^t)$ is
$$\langle\ell_c(\theta; x,z)\rangle_{q^{t+1}} = \sum_z q(z\mid x,\theta^t)\log p(x,z\mid\theta) - A(\theta) = \sum_i\theta_i\langle f_i(x,z)\rangle_{q(z\mid x,\theta^t)} - A(\theta)$$
$$\overset{\text{GLIM}}{=} \sum_i\langle\eta_i(z)\rangle_{q(z\mid x,\theta^t)}\,\xi_i(x) - A(\theta)$$
M-step: maximization of expected ℓ_c w.r.t. θ
- Note that the free energy breaks into two terms:
$$F(q,\theta) = \sum_z q(z\mid x)\log\frac{p(x,z\mid\theta)}{q(z\mid x)} = \sum_z q(z\mid x)\log p(x,z\mid\theta) - \sum_z q(z\mid x)\log q(z\mid x) = \langle\ell_c(\theta; x,z)\rangle_q + H_q$$
- The first term is the expected complete log-likelihood (energy) and the second term, which does not depend on θ, is the entropy.
- Thus, in the M-step, maximizing with respect to θ for fixed q, we only need to consider the first term:
$$\theta^{t+1} = \arg\max_\theta\langle\ell_c(\theta; x,z)\rangle_{q^{t+1}} = \arg\max_\theta\sum_z q^{t+1}(z\mid x)\log p(x,z\mid\theta)$$
- Under the optimal q^{t+1}, this is equivalent to solving a standard MLE of the fully observed model p(x,z|θ), with the sufficient statistics involving z replaced by their expectations w.r.t. p(z|x,θ).
Summary: EM Algorithm
- A way of maximizing the likelihood function for latent variable models. Finds the MLE of parameters when the original (hard) problem can be broken up into two (easy) pieces:
  1. Estimate some "missing" or "unobserved" data from observed data and current parameters.
  2. Using this "complete" data, find the maximum likelihood parameter estimates.
- Alternate between filling in the latent variables using the best guess (posterior) and updating the parameters based on this guess:
  - E-step: $q^{t+1} = \arg\max_q F(q,\theta^t)$
  - M-step: $\theta^{t+1} = \arg\max_\theta F(q^{t+1},\theta)$
- In the M-step we optimize a lower bound on the likelihood. In the E-step we close the gap, making bound = likelihood.
Supplementary materials
I: Review of density estimation
- Can be viewed as single-node graphical models
- Instances of exponential family distributions
- Building blocks of general GMs
- MLE and Bayesian estimates

GM: [plate model with nodes x_1, x_2, x_3, ..., x_N, i.e., x_i replicated N times]
Discrete Distributions
- Bernoulli distribution: Ber(p)
$$P(x) = \begin{cases}1-p & \text{for } x=0\\ p & \text{for } x=1\end{cases} \quad\Rightarrow\quad P(x) = p^x(1-p)^{1-x}$$
- Multinomial distribution: Mult(1, θ)
  - Multinomial (indicator) variable:
$$X = \begin{bmatrix}X_1\\X_2\\X_3\\X_4\\X_5\\X_6\end{bmatrix},\quad X_j\in\{0,1\},\quad \sum_{j\in[1,\ldots,6]}X_j = 1,\quad X_j = 1\ \text{w.p.}\ \theta_j,\quad \sum_{j\in[1,\ldots,6]}\theta_j = 1$$
$$p(x) = P(\{X_j = 1,\ \text{where } j \text{ indexes the dice face}\}) = \theta_A^{x_A}\theta_C^{x_C}\theta_G^{x_G}\theta_T^{x_T} = \prod_k\theta_k^{x_k}$$
Discrete Distributions
- Multinomial distribution: Mult(n, θ)
  - Count variable:
$$n = \begin{bmatrix}n_1\\ \vdots\\ n_K\end{bmatrix},\qquad \sum_j n_j = N$$
$$p(n) = \frac{N!}{n_1!\,n_2!\cdots n_K!}\,\theta_1^{n_1}\theta_2^{n_2}\cdots\theta_K^{n_K}$$
Example: multinomial model
- Data: we observed N iid die rolls (K-sided): D = {5, 1, K, ..., 3}
- Representation: unit basis vectors:
$$x_n = \begin{bmatrix}x_{n,1}\\x_{n,2}\\ \vdots\\ x_{n,K}\end{bmatrix},\qquad x_{n,k}\in\{0,1\},\qquad \sum_{k=1}^K x_{n,k} = 1$$
- Model:
$$X_n\in\{1,\ldots,K\}\ \text{w.p.}\ \theta_k,\qquad \sum_k\theta_k = 1$$
- The likelihood of a single observation x_n:
$$P(x_n\mid\theta) = P(\{x_{n,k}=1,\ \text{where } k \text{ indexes the die side of the } n\text{-th roll}\}) = \theta_1^{x_{n,1}}\theta_2^{x_{n,2}}\cdots\theta_K^{x_{n,K}} = \prod_k\theta_k^{x_{n,k}}$$
- The likelihood of the dataset D = {x_1, ..., x_N}:
$$P(x_1,\ldots,x_N\mid\theta) = \prod_n P(x_n\mid\theta) = \prod_n\prod_k\theta_k^{x_{n,k}} = \prod_k\theta_k^{\sum_n x_{n,k}} = \prod_k\theta_k^{n_k}$$

GM: [plate model with nodes x_1, x_2, x_3, ..., x_N, i.e., x_i replicated N times]
MLE: constrained optimization with Lagrange multipliers
- Objective function:
$$\ell(\theta; D) = \log P(D\mid\theta) = \log\prod_k\theta_k^{n_k} = \sum_k n_k\log\theta_k$$
- We need to maximize this subject to the constraint $\sum_{k=1}^K\theta_k = 1$.
- Constrained cost function with a Lagrange multiplier:
$$\tilde{\ell} = \sum_k n_k\log\theta_k + \lambda\Big(1-\sum_{k=1}^K\theta_k\Big)$$
- Take derivatives w.r.t. θ_k:
$$\frac{\partial\tilde{\ell}}{\partial\theta_k} = \frac{n_k}{\theta_k} - \lambda = 0 \;\Rightarrow\; n_k = \lambda\theta_k \;\Rightarrow\; \lambda = \sum_k n_k = N \;\Rightarrow\; \hat{\theta}_{k,\mathrm{MLE}} = \frac{n_k}{N} \quad\text{or}\quad \hat{\theta}_{k,\mathrm{MLE}} = \frac{1}{N}\sum_n x_{n,k}$$
- Sufficient statistics:
  - The counts $n = (n_1,\ldots,n_K)$, with $n_k = \sum_n x_{n,k}$, are sufficient statistics of data D.
  - Frequency as sample mean.
Bayesian estimation:
- Dirichlet distribution:
$$P(\theta) = \frac{\Gamma(\sum_k\alpha_k)}{\prod_k\Gamma(\alpha_k)}\prod_k\theta_k^{\alpha_k-1} = C(\alpha)\prod_k\theta_k^{\alpha_k-1}$$
- Posterior distribution of θ:
$$P(\theta\mid x_1,\ldots,x_N) = \frac{p(x_1,\ldots,x_N\mid\theta)\,p(\theta)}{p(x_1,\ldots,x_N)} \propto \prod_k\theta_k^{n_k}\prod_k\theta_k^{\alpha_k-1} = \prod_k\theta_k^{\alpha_k+n_k-1}$$
- Notice the isomorphism of the posterior to the prior; such a prior is called a conjugate prior.
- Posterior mean estimation:
$$\langle\theta_k\rangle = \int\theta_k\,p(\theta\mid D)\,d\theta = C\int\theta_k\prod_k\theta_k^{\alpha_k+n_k-1}\,d\theta = \frac{n_k+\alpha_k}{N+\sum_k\alpha_k}$$

GM: [plate model θ → x_i, replicated N times]
Dirichlet parameters can be understood as pseudo-counts
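A minimal numeric illustration of the pseudo-count view (the prior and counts below are made up):

```python
import numpy as np

alpha = np.ones(6)                               # Dir(alpha) prior over a 6-sided die
counts = np.array([12, 9, 11, 8, 40, 20])        # observed counts n_k

alpha_post = alpha + counts                      # posterior is Dir(alpha + n)
theta_post_mean = alpha_post / alpha_post.sum()  # (n_k + alpha_k) / (N + sum_k alpha_k)
theta_mle = counts / counts.sum()                # compare with the MLE n_k / N
print(theta_post_mean)
print(theta_mle)
```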
More on Dirichlet Prior:
- Where does the normalization constant C(α) come from?
$$C(\alpha)^{-1} = \int\theta_1^{\alpha_1-1}\cdots\theta_K^{\alpha_K-1}\,d\theta_1\cdots d\theta_K = \frac{\prod_k\Gamma(\alpha_k)}{\Gamma(\sum_k\alpha_k)}$$
  - Integration by parts.
  - Γ(α) is the gamma function: $\Gamma(\alpha) = \int_0^\infty t^{\alpha-1}e^{-t}\,dt$
  - For integers, $\Gamma(n+1) = n!$
- Marginal likelihood:
$$p(\{x_1,\ldots,x_N\}\mid\alpha) = p(n\mid\alpha) = \int p(n\mid\theta)\,p(\theta\mid\alpha)\,d\theta = \frac{C(\alpha)}{C(\alpha+n)}$$
- Posterior in closed form:
$$P(\theta\mid\{x_1,\ldots,x_N\},\alpha) = \frac{p(n\mid\theta)\,p(\theta\mid\alpha)}{p(n\mid\alpha)} = C(\alpha+n)\prod_k\theta_k^{\alpha_k+n_k-1} = \operatorname{Dir}(\alpha+n)$$
- Posterior predictive rate:
$$p(x_{N+1}=i\mid\{x_1,\ldots,x_N\},\alpha) = \int C(\alpha+n)\,\theta_i\prod_k\theta_k^{\alpha_k+n_k-1}\,d\theta = \frac{\alpha_i+n_i}{|\alpha|+|n|}$$
Sequential Bayesian updating
- Start with a Dirichlet prior: $P(\theta\mid\alpha) = \operatorname{Dir}(\theta:\alpha)$
- Observe N' samples with sufficient statistics n'. The posterior becomes:
$$P(\theta\mid\alpha, n') = \operatorname{Dir}(\theta:\alpha+n')$$
- Observe another N'' samples with sufficient statistics n''. The posterior becomes:
$$P(\theta\mid\alpha, n', n'') = \operatorname{Dir}(\theta:\alpha+n'+n'')$$
- So sequentially absorbing data in any order is equivalent to a batch update.
Hierarchical Bayesian Models
- θ are the parameters for the likelihood p(x|θ).
- α are the parameters for the prior p(θ|α).
- We can have hyper-hyper-parameters, etc.
- We stop when the choice of hyper-parameters makes no difference to the marginal likelihood; typically we make the hyper-parameters constants.
- Where do we get the prior?
  - Intelligent guesses
  - Empirical Bayes (Type-II maximum likelihood): computing point estimates of α:
$$\hat{\alpha}_{\mathrm{MLE}} = \arg\max_\alpha\,p(n\mid\alpha)$$
Limitation of Dirichlet Prior:
[Figure: plate model α → θ → x_i, replicated N times.]
The Logistic Normal Prior
- θ ~ LN(μ, Σ):
$$\gamma\sim N_{K-1}(\mu,\Sigma),\qquad \gamma_K = 0$$
$$\theta_i = \exp\{\gamma_i - C(\gamma)\},\qquad C(\gamma) = \log\Big(1+\sum_{i=1}^{K-1}e^{\gamma_i}\Big)$$
  C(γ) is the log partition function / normalization constant, and it is the source of the "problem" below (non-conjugacy).

[Figure: plate model μ, Σ → θ → x_i, replicated N times.]
- Pro: covariance structure
- Con: non-conjugate (we will discuss how to solve this later)
Logistic Normal Densities
[Figure: example logistic normal densities.]
Continuous Distributions
- Uniform probability density function:
$$p(x) = \frac{1}{b-a}\ \text{for}\ a\le x\le b, \qquad 0\ \text{elsewhere}$$
- Normal (Gaussian) probability density function:
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-(x-\mu)^2/2\sigma^2}$$
  - The distribution is symmetric, and is often illustrated as a bell-shaped curve.
  - Two parameters, μ (mean) and σ (standard deviation), determine the location and shape of the distribution.
  - The highest point on the normal curve is at the mean, which is also the median and mode.
  - The mean can be any numerical value: negative, zero, or positive.
- Multivariate Gaussian:
$$p(\mathbf{x};\mu,\Sigma) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\Big\{-\tfrac{1}{2}(\mathbf{x}-\mu)^T\Sigma^{-1}(\mathbf{x}-\mu)\Big\}$$
MLE for a multivariate-Gaussian
- It can be shown that the MLE for μ and Σ is
$$\mu_{\mathrm{MLE}} = \frac{1}{N}\sum_n x_n, \qquad \Sigma_{\mathrm{MLE}} = \frac{1}{N}\sum_n(x_n-\mu_{\mathrm{ML}})(x_n-\mu_{\mathrm{ML}})^T = \frac{1}{N}S$$
  where the scatter matrix is
$$S = \sum_n(x_n-\mu_{\mathrm{ML}})(x_n-\mu_{\mathrm{ML}})^T = \sum_n x_nx_n^T - N\mu_{\mathrm{ML}}\mu_{\mathrm{ML}}^T$$
- The sufficient statistics are $\sum_n x_n$ and $\sum_n x_nx_n^T$.
- Note that $X^TX = \sum_n x_nx_n^T$ may not be full rank (e.g., if N < D), in which case Σ_ML is not invertible. Here
$$x_n = \begin{bmatrix}x_{n,1}\\x_{n,2}\\ \vdots\\ x_{n,K}\end{bmatrix}, \qquad X = \begin{bmatrix}x_1^T\\x_2^T\\ \vdots\\ x_N^T\end{bmatrix}$$
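A minimal numeric check of the Gaussian MLE formulas above on synthetic data (the mean and covariance used to generate the data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[2.0, 0.6], [0.6, 1.0]], size=500)
N = X.shape[0]

mu_mle = X.mean(axis=0)                # (1/N) sum_n x_n
centered = X - mu_mle
S = centered.T @ centered              # scatter matrix
Sigma_mle = S / N                      # (1/N) sum_n (x_n - mu)(x_n - mu)^T
print(mu_mle)
print(Sigma_mle)
```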
Bayesian parameter estimation for a Gaussian
- There are various reasons to pursue a Bayesian approach:
  - We would like to update our estimates sequentially over time.
  - We may have prior knowledge about the expected magnitude of the parameters.
  - The MLE for Σ may not be full rank if we don't have enough data.
- We will restrict our attention to conjugate priors.
- We will consider various cases, in order of increasing complexity:
  - Known σ, unknown μ
  - Known μ, unknown σ
  - Unknown μ and σ
Bayesian estimation: unknown µ, known σ
- Normal prior:
$$P(\mu) = (2\pi\tau^2)^{-1/2}\exp\big\{-(\mu-\mu_0)^2/2\tau^2\big\}$$
- Joint probability:
$$P(x,\mu) = (2\pi\sigma^2)^{-N/2}\exp\Big\{-\frac{1}{2\sigma^2}\sum_{n=1}^N(x_n-\mu)^2\Big\}\times(2\pi\tau^2)^{-1/2}\exp\big\{-(\mu-\mu_0)^2/2\tau^2\big\}$$
- Posterior:
$$P(\mu\mid x) = (2\pi\tilde{\sigma}^2)^{-1/2}\exp\big\{-(\mu-\tilde{\mu})^2/2\tilde{\sigma}^2\big\}$$
  where
$$\tilde{\mu} = \frac{N/\sigma^2}{N/\sigma^2+1/\tau^2}\,\bar{x} + \frac{1/\tau^2}{N/\sigma^2+1/\tau^2}\,\mu_0, \qquad \tilde{\sigma}^2 = \Big(\frac{N}{\sigma^2}+\frac{1}{\tau^2}\Big)^{-1}$$
  and $\bar{x}$ is the sample mean.

GM: [plate model μ → x_i, replicated N times]
Bayesian estimation: unknown µ, known σ
- The posterior mean is a convex combination of the prior and the MLE, with weights proportional to the relative noise levels:
$$\mu_N = \frac{N/\sigma^2}{N/\sigma^2+1/\sigma_0^2}\,\bar{x} + \frac{1/\sigma_0^2}{N/\sigma^2+1/\sigma_0^2}\,\mu_0, \qquad \sigma_N^2 = \Big(\frac{N}{\sigma^2}+\frac{1}{\sigma_0^2}\Big)^{-1}$$
- The precision of the posterior, 1/σ_N², is the precision of the prior, 1/σ_0², plus one contribution of data precision 1/σ² for each observed data point.
- Sequentially updating the mean: μ* = 0.8 (unknown), (σ²)* = 0.1 (known)
- Effect of a single data point (N = 1):
$$\mu_1 = \mu_0 + (x-\mu_0)\frac{\sigma_0^2}{\sigma_0^2+\sigma^2} = x - (x-\mu_0)\frac{\sigma^2}{\sigma_0^2+\sigma^2}$$
- Uninformative (vague/flat) prior, σ_0² → ∞: the posterior mean reduces to the sample mean.
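The sequential update mentioned above is a one-liner per data point: precisions add, and the posterior mean is a precision-weighted average. A minimal sketch using the slide's example values (true mean 0.8, known variance 0.1); the prior is an illustrative assumption:

```python
import numpy as np

sigma2 = 0.1            # known observation variance
mu0, tau2 = 0.0, 1.0    # assumed N(mu0, tau2) prior on the unknown mean

rng = np.random.default_rng(5)
data = rng.normal(0.8, np.sqrt(sigma2), size=20)    # true (unknown) mean is 0.8

mu_post, var_post = mu0, tau2
for x in data:
    prec = 1 / var_post + 1 / sigma2                # posterior precision = prior + data precision
    mu_post = (mu_post / var_post + x / sigma2) / prec
    var_post = 1 / prec
print(mu_post, var_post)
```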
Other scenarios
- Known μ, unknown λ = 1/σ²:
  - The conjugate prior for λ is a Gamma with shape a_0 and rate (inverse scale) b_0.
  - The conjugate prior for σ² is Inverse-Gamma.
- Unknown μ and unknown σ²:
  - The conjugate prior is Normal-Inverse-Gamma.
  - Semi-conjugate prior.
- Multivariate case:
  - The conjugate prior is Normal-Inverse-Wishart.
II: Two node fully observed BNs
- Conditional mixtures
- Linear/logistic regression
- Classification
  - Generative and discriminative approaches
Classification:
- Goal: wish to learn f: X → Y
- Generative: modeling the joint distribution of all data
- Discriminative: modeling only points at the boundary
Conditional Gaussian
- The data: $\{(x_1,y_1),(x_2,y_2),(x_3,y_3),\ldots,(x_N,y_N)\}$
- Both nodes are observed:
  - Y is a class indicator vector:
$$p(y_n) = \operatorname{multi}(y_n:\pi) = \prod_k\pi_k^{y_n^k}$$
  - X is a conditional Gaussian variable with a class-specific mean:
$$p(x_n\mid y_n^k=1,\mu,\sigma) = \frac{1}{(2\pi\sigma^2)^{1/2}}\exp\Big\{-\frac{1}{2\sigma^2}(x_n-\mu_k)^2\Big\}, \qquad p(x\mid y,\mu,\sigma) = \prod_n\prod_k N(x_n:\mu_k,\sigma)^{y_n^k}$$

GM: [plate model Y_i → X_i, replicated N times]

MLE of conditional Gaussian
- Data log-likelihood:
$$\ell(\theta; D) = \log\prod_n p(x_n,y_n) = \log\prod_n p(y_n\mid\pi)\,p(x_n\mid y_n,\mu,\sigma)$$
- MLE:
$$\hat{\pi}_{k,\mathrm{MLE}} = \arg\max_\pi\ell(\theta; D) \;\Rightarrow\; \hat{\pi}_{k,\mathrm{MLE}} = \frac{\sum_n y_n^k}{N} = \frac{n_k}{N}\quad\text{(the fraction of samples of class k)}$$
$$\hat{\mu}_{k,\mathrm{MLE}} = \arg\max_\mu\ell(\theta; D) \;\Rightarrow\; \hat{\mu}_{k,\mathrm{MLE}} = \frac{\sum_n y_n^k x_n}{\sum_n y_n^k}\quad\text{(the average of samples of class k)}$$
Bayesian estimation of conditional Gaussian
- Prior:
$$P(\pi\mid\alpha) = \operatorname{Dir}(\pi:\alpha), \qquad P(\mu_k\mid\nu,\tau) = \operatorname{Normal}(\mu_k:\nu,\tau^2)$$
- Posterior mean (Bayesian estimate):
$$\hat{\pi}_{k,\mathrm{Bayes}} = \frac{n_k+\alpha_k}{N+\sum_k\alpha_k} = \frac{N}{N+|\alpha|}\hat{\pi}_{k,\mathrm{ML}} + \frac{|\alpha|}{N+|\alpha|}\frac{\alpha_k}{|\alpha|}$$
$$\hat{\mu}_{k,\mathrm{Bayes}} = \frac{n_k/\sigma^2}{n_k/\sigma^2+1/\tau^2}\,\hat{\mu}_{k,\mathrm{ML}} + \frac{1/\tau^2}{n_k/\sigma^2+1/\tau^2}\,\nu, \qquad \tilde{\sigma}_k^2 = \Big(\frac{n_k}{\sigma^2}+\frac{1}{\tau^2}\Big)^{-1}$$

GM: [plate model with priors on π and μ, and Y_i → X_i replicated N times]
Classification
- Gaussian discriminative analysis:
  - The joint probability of a datum and its label is:
$$p(y_n^k=1,x_n\mid\mu,\sigma) = p(y_n^k=1)\,p(x_n\mid y_n^k=1,\mu,\sigma) = \pi_k\frac{1}{(2\pi\sigma^2)^{1/2}}\exp\Big\{-\frac{1}{2\sigma^2}(x_n-\mu_k)^2\Big\}$$
  - Given a datum x_n, we predict its label using the conditional probability of the label given the datum:
$$p(y_n^k=1\mid x_n,\mu,\sigma) = \frac{\pi_k\frac{1}{(2\pi\sigma^2)^{1/2}}\exp\big\{-\frac{1}{2\sigma^2}(x_n-\mu_k)^2\big\}}{\sum_{k'}\pi_{k'}\frac{1}{(2\pi\sigma_{k'}^2)^{1/2}}\exp\big\{-\frac{1}{2\sigma_{k'}^2}(x_n-\mu_{k'})^2\big\}}$$
- This is basic inference: introduce evidence, and then normalize.
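The "introduce evidence, then normalize" step above is just a softmax over the class-conditional joint terms. A minimal sketch with made-up class priors, means, and a shared variance:

```python
import numpy as np

pi = np.array([0.3, 0.7])       # class priors pi_k
mu = np.array([-1.0, 2.0])      # class-conditional means mu_k
sigma2 = 1.0                    # shared variance

def class_posterior(x):
    # p(y_k = 1 | x) is proportional to pi_k exp{-(x - mu_k)^2 / (2 sigma^2)}
    unnorm = pi * np.exp(-(x - mu) ** 2 / (2 * sigma2))
    return unnorm / unnorm.sum()

print(class_posterior(0.5))
```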
Transductive classification
- Given X_n, what is its corresponding Y_n, when we know the answer for a set of training data?
- Frequentist prediction: we fit π, μ and σ from data first, and then compute
$$p(y_n^k=1\mid x_n,\mu,\sigma,\pi) = \frac{\pi_k\,N(x_n\mid\mu_k,\sigma_k)}{\sum_{k'}\pi_{k'}\,N(x_n\mid\mu_{k'},\sigma_{k'})}$$
- Bayesian: we compute the posterior distribution of the parameters first, ...

GM: [plate model with parameters π, μ shared by the training pairs (Y_i, X_i) and the query pair (Y_n, X_n).]
Linear Regression
- The data: $\{(x_1,y_1),(x_2,y_2),(x_3,y_3),\ldots,(x_N,y_N)\}$
- Both nodes are observed:
  - X is an input vector
  - Y is a response vector (we first consider y as a generic continuous response vector, then we consider the special case of classification where y is a discrete indicator)
- A regression scheme can be used to model p(y|x) directly, rather than p(x,y).

GM: [plate model X_i → Y_i, replicated N times]
A discriminative probabilistic model
- Let us assume that the target variable and the inputs are related by the equation:
$$y_i = \theta^T x_i + \varepsilon_i$$
where ε is an error term of unmodeled effects or random noise.
- Now assume that ε follows a Gaussian N(0, σ); then we have:
$$p(y_i\mid x_i;\theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big(-\frac{(y_i-\theta^Tx_i)^2}{2\sigma^2}\Big)$$
- By the independence assumption:
$$L(\theta) = \prod_{i=1}^n p(y_i\mid x_i;\theta) = \Big(\frac{1}{\sqrt{2\pi}\,\sigma}\Big)^n\exp\Big(-\frac{\sum_{i=1}^n(y_i-\theta^Tx_i)^2}{2\sigma^2}\Big)$$
Linear regression
- Hence the log-likelihood is:
$$\ell(\theta) = n\log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-\theta^Tx_i)^2$$
- Do you recognize the last term? Yes, it is:
$$J(\theta) = \frac{1}{2}\sum_{i=1}^n(\theta^Tx_i-y_i)^2$$
- It is the same as the MSE!
A recap:
- LMS update rule:
$$\theta^{(t+1)} = \theta^{(t)} + \alpha\,(y_n-(\theta^{(t)})^Tx_n)\,x_n$$
  - Pros: on-line, low per-step cost
  - Cons: coordinate-wise, maybe slow-converging
- Steepest descent:
$$\theta^{(t+1)} = \theta^{(t)} + \alpha\sum_{n=1}^N(y_n-(\theta^{(t)})^Tx_n)\,x_n$$
  - Pros: fast-converging, easy to implement
  - Cons: a batch method
- Normal equations:
$$\theta^* = (X^TX)^{-1}X^Ty$$
  - Pros: a single-shot algorithm! Easiest to implement.
  - Cons: need to compute the pseudo-inverse (X^TX)^{-1}, expensive, numerical issues (e.g., the matrix is singular ...)
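A minimal side-by-side sketch of the normal equation and the on-line LMS rule on synthetic data (dimensions, step size, and noise level are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
N, d = 200, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d - 1))])
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=N)

# Normal equation: single-shot solution theta* = (X^T X)^{-1} X^T y
theta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# LMS: on-line updates theta <- theta + alpha (y_n - theta^T x_n) x_n
theta_lms = np.zeros(d)
alpha = 0.01
for _ in range(20):
    for n in range(N):
        theta_lms += alpha * (y[n] - theta_lms @ X[n]) * X[n]

print(theta_ne)
print(theta_lms)
```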
Bayesian linear regression
III: How to define parameter prior for general BN?

[Figure: the Earthquake/Burglary network (Earthquake, Radio, Burglary, Alarm, Call).]
$$p(\mathbf{x}) = \prod_{i=1}^M p(x_i\mid\mathbf{x}_{\pi_i})$$
Assumptions (Geiger & Heckerman 97, 99):
- Complete model equivalence
- Global parameter independence
- Local parameter independence
- Likelihood and prior modularity
Which prior p(θ|G) to use?

Factorization: local distributions defined by, e.g., multinomial parameters:
$$p(x_i = k\mid\mathbf{x}_{\pi_i}^j) = \theta_{x_i^k\mid\mathbf{x}_{\pi_i}^j}$$

Global & Local Parameter Independence
- Global parameter independence: for every DAG model,
$$p(\theta_m\mid G) = \prod_{i=1}^M p(\theta_i\mid G)$$
- Local parameter independence: for every node,
$$p(\theta_i\mid G) = \prod_{j=1}^{q_i}p(\theta_{x_i^k\mid\mathbf{x}_{\pi_i}^j}\mid G)$$
  e.g., $\theta_{\mathrm{Call}\mid\mathrm{Alarm}=\mathrm{YES}}$ is independent of $\theta_{\mathrm{Call}\mid\mathrm{Alarm}=\mathrm{NO}}$

[Figure: the Earthquake/Burglary network again.]
Parameter Independence, Graphical View
- Provided all variables are observed in all cases, we can perform the Bayesian update of each parameter independently!

[Figure: a two-sample meta-BN over X_1 → X_2 illustrating global parameter independence (θ_1 vs. θ_{2|1}) and local parameter independence.]
Which PDFs Satisfy Our Assumptions? (Geiger & Heckerman 97,99)
- Discrete DAG models: $x_i\mid\mathbf{x}_{\pi_i}^j \sim \operatorname{Multi}(\theta)$, with a Dirichlet prior:
$$P(\theta) = \frac{\Gamma(\sum_k\alpha_k)}{\prod_k\Gamma(\alpha_k)}\prod_k\theta_k^{\alpha_k-1} = C(\alpha)\prod_k\theta_k^{\alpha_k-1}$$
- Gaussian DAG models: $x_i\mid\mathbf{x}_{\pi_i}^j \sim \operatorname{Normal}(\mu,\Sigma)$, with a Normal prior:
$$p(\mu\mid\bar{\nu},\Psi) = (2\pi)^{-n/2}|\Psi|^{-1/2}\exp\Big\{-\tfrac{1}{2}(\mu-\bar{\nu})'\Psi^{-1}(\mu-\bar{\nu})\Big\}$$
  or a Normal-Wishart prior:
$$p(\mu\mid\bar{\nu},\alpha_\mu,W) = \operatorname{Normal}\big(\bar{\nu},(\alpha_\mu W)^{-1}\big),$$
$$p(W\mid\alpha_w,T) = c(n,\alpha_w)\,|T|^{\alpha_w/2}\,|W|^{(\alpha_w-n-1)/2}\exp\Big\{-\tfrac{1}{2}\operatorname{tr}(TW)\Big\}, \qquad \text{where } W = \Sigma^{-1}.$$
Parameter sharing
- Consider a time-invariant (stationary) 1st-order Markov model:
  - Initial state probability vector: $\pi_k \overset{\text{def}}{=} p(X_1^k = 1)$
  - State transition probability matrix: $A_{ij} \overset{\text{def}}{=} p(X_t^j = 1\mid X_{t-1}^i = 1)$
- The joint:
$$p(X\mid\theta) = p(x_1\mid\pi)\prod_{t=2}^T p(x_t\mid x_{t-1},A)$$
- The log-likelihood:
$$\ell(\theta; D) = \sum_n\log p(x_{n,1}\mid\pi) + \sum_n\sum_{t=2}^T\log p(x_{n,t}\mid x_{n,t-1},A)$$
- Again, we optimize each parameter separately:
  - π is a multinomial frequency vector, and we've seen it before.
  - What about A?

[Figure: the chain X_1 → X_2 → X_3 → ... → X_T with shared parameters π, A.]
Learning a Markov chain transition matrix
- A is a stochastic matrix: $\sum_j A_{ij} = 1$
- Each row of A is a multinomial distribution.
- So the MLE of A_ij is the fraction of transitions from i to j:
$$A_{ij}^{\mathrm{ML}} = \frac{\#(i\to j)}{\#(i\to\cdot)} = \frac{\sum_n\sum_{t=2}^T x_{n,t-1}^i\,x_{n,t}^j}{\sum_n\sum_{t=2}^T x_{n,t-1}^i}$$
- Application: if the states X_t represent words, this is called a bigram language model.
- Sparse data problem:
  - If i → j did not occur in the data, we will have A_ij = 0; then any future sequence with the word pair i → j will have zero probability.
  - A standard hack: backoff smoothing or deleted interpolation,
$$\tilde{A}_{i\to j} = \lambda\eta_j + (1-\lambda)A_{i\to j}^{\mathrm{ML}}$$
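A minimal sketch of the transition-matrix MLE and a simple smoothed version; the state sequences, smoothing weight, and uniform background distribution are illustrative assumptions:

```python
import numpy as np

K = 3
sequences = [[0, 1, 1, 2, 0, 1], [2, 2, 1, 0, 0, 1, 2]]   # toy state sequences

counts = np.zeros((K, K))
for seq in sequences:
    for i, j in zip(seq[:-1], seq[1:]):
        counts[i, j] += 1                                  # count transitions i -> j

A_ml = counts / counts.sum(axis=1, keepdims=True)          # A_ij = #(i->j) / #(i->.)

# Backoff smoothing: mix the MLE with a background distribution eta
lam = 0.1
eta = np.full(K, 1.0 / K)
A_smooth = lam * eta + (1 - lam) * A_ml
print(A_ml)
print(A_smooth)
```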
Bayesian language model
- Global and local parameter independence:
  - The posterior of A_{i→·} and A_{i'→·} is factorized despite the v-structure on X_t, because X_{t-1} acts like a multiplexer.
- Assign a Dirichlet prior β_i to each row of the transition matrix:
$$A_{ij}^{\mathrm{Bayes}} \overset{\text{def}}{=} p(A_{ij}\mid D,\beta_i) = \frac{\#(i\to j)+\beta_{i,j}}{\#(i\to\cdot)+|\beta_i|} = \lambda_i\frac{\beta_{i,j}}{|\beta_i|} + (1-\lambda_i)A_{ij}^{\mathrm{ML}}, \qquad \lambda_i = \frac{|\beta_i|}{\#(i\to\cdot)+|\beta_i|}$$
- We could consider more realistic priors, e.g., mixtures of Dirichlets to account for types of words (adjectives, verbs, etc.).

[Figure: the chain X_1 → X_2 → ... → X_T with priors α → π and β → A_{i→·}, A_{i'→·}.]
IV: More on EM
Unsupervised ML estimation
- Given x = x_1 ... x_N for which the true state path y = y_1 ... y_N is unknown:
- EXPECTATION MAXIMIZATION
  0. Start with our best guess of a model M and parameters θ.
  1. Estimate A_ij and B_ik in the training data. How? Use the expected counts:
$$A_{ij} = \sum_{n,t}\langle y_{n,t-1}^i\,y_{n,t}^j\rangle, \qquad B_{ik} = \sum_{n,t}\langle y_{n,t}^i\rangle\,x_{n,t}^k$$
  2. Update θ according to A_ij, B_ik. Now a "supervised learning" problem.
  3. Repeat 1 & 2, until convergence.
  This is called the Baum-Welch algorithm.
- We can get a provably more (or equally) likely parameter set θ at each iteration.
EM for general BNs
while not converged
  % E-step
  for each node i: ESS_i = 0   % reset expected sufficient statistics
  for each data sample n:
    do inference with X_{n,H}   % the hidden variables of sample n
    for each node i:
      ESS_i += ⟨SS_i(x_{n,i}, x_{n,π_i})⟩_{p(X_{n,H} | x_{n,-H})}
  % M-step
  for each node i: θ_i := MLE(ESS_i)
Conditional mixture model: Mixture of experts
- We will model p(Y|X) using different experts, each responsible for different regions of the input space.
- The latent variable Z chooses an expert using a softmax gating function:
$$P(z^k=1\mid x) = \operatorname{softmax}_k(\xi^Tx)$$
- Each expert can be a linear regression model:
$$P(y\mid z^k=1,x) = N(y;\theta_k^Tx,\sigma_k^2)$$
- The posterior expert responsibilities are
$$P(z^k=1\mid x,y) = \frac{p(z^k=1\mid x)\,p(y\mid z^k=1,x,\theta_k,\sigma_k)}{\sum_j p(z^j=1\mid x)\,p(y\mid z^j=1,x,\theta_j,\sigma_j)}$$
EM for conditional mixture model
- Model:
$$P(y\mid x) = \sum_k p(z^k=1\mid x,\xi)\,p(y\mid z^k=1,x,\theta_k,\sigma)$$
- The objective function (expected complete log-likelihood):
$$\langle\ell_c(\theta; x,y,z)\rangle = \sum_n\langle\log p(z_n\mid x_n,\xi)\rangle_{p(z\mid x,y)} + \sum_n\langle\log p(y_n\mid x_n,z_n,\theta,\sigma)\rangle_{p(z\mid x,y)}$$
$$= \sum_n\sum_k\langle z_n^k\rangle\log\operatorname{softmax}_k(\xi^Tx_n) - \frac{1}{2\sigma^2}\sum_n\sum_k\langle z_n^k\rangle(y_n-\theta_k^Tx_n)^2 + C$$
- EM:
  - E-step:
$$\tau_n^{k(t)} = P(z_n^k=1\mid x_n,y_n,\theta) = \frac{p(z^k=1\mid x_n)\,p(y_n\mid x_n,\theta_k,\sigma_k)}{\sum_j p(z^j=1\mid x_n)\,p(y_n\mid x_n,\theta_j,\sigma_j)}$$
  - M-step:
    - Use the normal equation $\theta = (X^TX)^{-1}X^TY$ for standard LR, but with the data re-weighted by τ (homework).
    - Use the IRLS and/or weighted IRLS algorithm to update {ξ_k, θ_k, σ_k} based on the data pairs (x_n, y_n), with weights $\tau_n^{k(t)}$ (homework?).
Hierarchical mixture of experts
- This is like a soft version of a depth-2 classification/regression tree.
- P(Y|X, G_1, G_2) can be modeled as a GLIM, with parameters dependent on the values of G_1 and G_2 (which specify a "conditional path" to a given leaf in the tree).
Mixture of overlapping experts
- By removing the X → Z arc, we can make the partitions independent of the input, thus allowing overlap.
- This is a mixture of linear regressors; each subpopulation has a different conditional mean:
$$P(z^k=1\mid x,y) = \frac{p(z^k=1)\,p(y\mid z^k=1,x,\theta_k,\sigma_k)}{\sum_j p(z^j=1)\,p(y\mid z^j=1,x,\theta_j,\sigma_j)}$$
Partially Hidden Data
- Of course, we can learn when there are missing (hidden) variables in some cases and not in others.
- In this case the cost function is:
$$\ell_c(\theta; D) = \sum_{n\in\text{Complete}}\log p(x_n,y_n\mid\theta) + \sum_{m\in\text{Missing}}\log\sum_y p(x_m,y\mid\theta)$$
- Note that the Y_m do not have to be the same in each case: the data can have different missing values in each different sample.
- Now you can think of this in a new way: in the E-step we estimate the hidden variables on the incomplete cases only.
- The M-step optimizes the log-likelihood on the complete data plus the expected likelihood on the incomplete data using the E-step.
EM Variants
- Sparse EM: do not re-compute exactly the posterior probability on each data point under all models, because it is almost zero. Instead keep an "active list" which you update every once in a while.
- Generalized (Incomplete) EM: it might be hard to find the ML parameters in the M-step, even given the completed data. We can still make progress by doing an M-step that improves the likelihood a bit (e.g., a gradient step). Recall the IRLS step in the mixture of experts model.
A Report Card for EM
- Some good things about EM:
  - no learning rate (step-size) parameter
  - automatically enforces parameter constraints
  - very fast for low dimensions
  - each iteration guaranteed to improve the likelihood
- Some bad things about EM:
  - can get stuck in local minima
  - can be slower than conjugate gradient (especially near convergence)
  - requires an expensive inference step
  - is a maximum likelihood/MAP method