Maximum Likelihood (ML), Expectation Maximization (EM)
Pieter Abbeel, UC Berkeley EECS
Many slides adapted from Thrun, Burgard and Fox, Probabilistic Robotics
Outline

Maximum likelihood (ML)
Priors, and maximum a posteriori (MAP)
Cross-validation
Expectation Maximization (EM)
Thumbtack

Let θ = P(up), 1-θ = P(down)
How to determine θ?
Empirical estimate: 8 up, 2 down, so θ ≈ 8/10 = 0.8
[Photo of a thumbtack experiment: http://web.me.com/todd6ton/Site/Classroom_Blog/Entries/2009/10/7_A_Thumbtack_Experiment.html]
Maximum Likelihood

θ = P(up), 1-θ = P(down)
Observe: 8 up, 2 down
Likelihood of the observation sequence depends on θ: P(data; θ) = θ^8 (1-θ)^2
Maximum likelihood finds θML = argmax_θ P(data; θ)
Setting the derivative to zero gives extrema at θ = 0, θ = 1, θ = 0.8
Inspection of each extremum yields θML = 0.8
Maximum Likelihood

More generally, consider a binary-valued random variable with θ = P(1), 1-θ = P(0); assume we observe n1 ones and n0 zeros
Likelihood: L(θ) = θ^n1 (1-θ)^n0
Derivative: dL/dθ = θ^(n1-1) (1-θ)^(n0-1) (n1 (1-θ) - n0 θ)
Hence we have for the extrema: θ ∈ {0, 1, n1/(n0+n1)}
θML = n1/(n0+n1) is the maximum: the empirical frequency of ones
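A quick numerical check of this result (a sketch, not from the slides; the counts reuse the thumbtack data):

```python
import numpy as np

# Likelihood L(theta) = theta^n1 * (1 - theta)^n0, maximized on a grid.
n1, n0 = 8, 2
thetas = np.linspace(1e-6, 1 - 1e-6, 100001)
likelihood = thetas**n1 * (1 - thetas)**n0
print(thetas[np.argmax(likelihood)])  # ~0.8, matching the closed form
print(n1 / (n0 + n1))                 # 0.8
```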
Log-likelihood

The function log(x) is a monotonically increasing function of x
Hence for any (positive-valued) function f: argmax_x log f(x) = argmax_x f(x)
In practice it is often more convenient to optimize the log-likelihood rather than the likelihood itself
Example (thumbtack): log L(θ) = n1 log θ + n0 log(1-θ)
Log-likelihood

Reconsider thumbtacks: 8 up, 2 down
[Plots: the likelihood θ^8 (1-θ)^2 is not concave; the log-likelihood 8 log θ + 2 log(1-θ) is concave]
Definition: A function f is concave if and only if f(λ x1 + (1-λ) x2) ≥ λ f(x1) + (1-λ) f(x2) for all x1, x2 and all λ ∈ [0, 1]
Concave functions are generally easier to maximize than non-concave functions
Concavity and Convexity

f is concave if and only if f(λ x1 + (1-λ) x2) ≥ λ f(x1) + (1-λ) f(x2) for all x1, x2, λ ∈ [0, 1]; concave functions are “easy” to maximize
f is convex if and only if f(λ x1 + (1-λ) x2) ≤ λ f(x1) + (1-λ) f(x2) for all x1, x2, λ ∈ [0, 1]; convex functions are “easy” to minimize
[Illustrations: for a concave f the chord between (x1, f(x1)) and (x2, f(x2)) lies below the graph; for a convex f it lies above]
ML for Multinomial

Consider having received samples x(1), …, x(m), each taking one of K values, with nk occurrences of value k
Maximizing the log-likelihood Σ_k nk log θk subject to Σ_k θk = 1 (e.g., with a Lagrange multiplier) gives θk,ML = nk/m: again the empirical frequencies
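A minimal sketch (the K = 3 sample array is hypothetical) of the multinomial ML estimate as normalized counts:

```python
import numpy as np

# ML for a multinomial: normalized empirical counts.
samples = np.array([0, 2, 1, 0, 0, 2])      # hypothetical K = 3 data
counts = np.bincount(samples, minlength=3)  # n_k for k = 0, 1, 2
theta_ml = counts / counts.sum()
print(theta_ml)                             # [0.5, 0.167, 0.333]
```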
ML for Fully Observed HMM

Given samples of complete trajectories: both states x and observations z
Dynamics model: P(x_{t+1} | x_t)
Observation model: P(z_t | x_t)
The objective decomposes into independent ML problems, one for each P(x_{t+1} | x_t) and one for each P(z_t | x_t); see the counting sketch below
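A hedged sketch (the toy sequences are illustrative, not the slides' notation): with everything observed, ML reduces to normalized transition and emission counts.

```python
import numpy as np

# Fully observed HMM: ML = normalized transition/emission counts.
states = [0, 0, 1, 1, 0, 1]   # hypothetical state sequence
obs    = [1, 0, 1, 1, 0, 1]   # hypothetical observation sequence
S, O = 2, 2
T = np.zeros((S, S))          # transition counts
E = np.zeros((S, O))          # emission counts
for s, s_next in zip(states[:-1], states[1:]):
    T[s, s_next] += 1
for s, o in zip(states, obs):
    E[s, o] += 1
T /= T.sum(axis=1, keepdims=True)   # rows: P(x_{t+1} | x_t)
E /= E.sum(axis=1, keepdims=True)   # rows: P(z_t | x_t)
print(T)
print(E)
```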
ML for Exponential Distribution

Consider having received samples 3.1, 8.2, 1.7 from p(x; λ) = λ e^(-λx)
Log-likelihood: ll(λ) = 3 log λ - λ (3.1 + 8.2 + 1.7)
Setting dll/dλ = 0 gives λML = 3/13 ≈ 0.23
Source: wikipedia
ML for Exponential Distribution

More generally, given samples x(1), …, x(m): ll(λ) = m log λ - λ Σ_i x(i), hence λML = m / Σ_i x(i), the reciprocal of the sample mean
Source: wikipedia
Uniform

Consider having received samples x(1), …, x(m) from Uniform[0, θ]
The likelihood is (1/θ)^m for θ ≥ max_i x(i) and 0 otherwise, so θML = max_i x(i); note this maximum is not found by setting a derivative to zero
ML for Gaussian

Consider having received samples x(1), …, x(m)
Maximizing the log-likelihood gives μML = (1/m) Σ_i x(i) and σ²ML = (1/m) Σ_i (x(i) - μML)²
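A small sketch (reusing the earlier samples as stand-in data):

```python
import numpy as np

# Gaussian ML: sample mean and the 1/m (not 1/(m-1)) sample variance.
x = np.array([3.1, 8.2, 1.7])
mu_ml = x.mean()
var_ml = ((x - mu_ml) ** 2).mean()
print(mu_ml, var_ml)
```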
ML for Conditional Gaussian

Model: P(y | x; a, b) = N(y; a x + b, σ²)
Equivalently: maximizing the log-likelihood over (a, b) is minimizing the sum of squared residuals Σ_i (y(i) - a x(i) - b)², i.e., least squares
More generally: y = θᵀ x + ε with ε ~ N(0, σ²) leads to the same least-squares problem in θ
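A hedged least-squares sketch (synthetic data; the parameter values are illustrative):

```python
import numpy as np

# Conditional Gaussian ML = least squares; solved here with lstsq.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 * x + 1.0 + 0.1 * rng.normal(size=50)
X = np.column_stack([x, np.ones_like(x)])  # features [x, 1] for (a, b)
(a, b), *_ = np.linalg.lstsq(X, y, rcond=None)
print(a, b)                                # close to (2.0, 1.0)
```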
ML for Conditional Gaussian (ctd)
ML for Conditional Multivariate Gaussian
Aside: Key Identities for Derivation on Previous Slide
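The identities themselves did not survive extraction; derivations of this kind typically rest on standard matrix-calculus facts such as the following (an assumption about which ones the slide listed, not a recovered list):

$$\nabla_A \,\mathrm{tr}(AB) = B^\top, \qquad \nabla_A \log|\det A| = (A^{-1})^\top, \qquad \nabla_x \, x^\top B x = (B + B^\top)\,x$$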
ML Estimation in Fully Observed Linear Gaussian Bayes Filter Setting

Consider the Linear Gaussian setting: x_{t+1} = A x_t + B u_t + w_t, z_t = C x_t + d + v_t, with w_t and v_t zero-mean Gaussian noise
Fully observed, i.e., x, u, and z are all given
Two separate ML estimation problems for conditional multivariate Gaussian:
1: estimate the dynamics parameters (A, B and the process-noise covariance) from the (x_t, u_t, x_{t+1}) tuples
2: estimate the observation parameters (C, d and the measurement-noise covariance) from the (x_t, z_t) pairs
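A sketch of problem 1 for a control-free system (the 2-D example is hypothetical): estimating A is a single least-squares solve over consecutive state pairs.

```python
import numpy as np

# Fully observed linear dynamics x_{t+1} = A x_t + w_t: ML for A is
# least squares on consecutive state pairs.
rng = np.random.default_rng(1)
A_true = np.array([[0.9, 0.1], [0.0, 0.8]])
X = [rng.normal(size=2)]
for _ in range(200):
    X.append(A_true @ X[-1] + 0.05 * rng.normal(size=2))
X = np.array(X)
# X[1:] ~= X[:-1] @ A^T, so lstsq returns A^T.
A_hat = np.linalg.lstsq(X[:-1], X[1:], rcond=None)[0].T
print(np.round(A_hat, 2))
```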
Priors --- Thumbtack

Let θ = P(up), 1-θ = P(down)
How to determine θ?
ML estimate: 5 up, 0 down gives θML = 1
Laplace estimate: add a fake count of 1 for each outcome, giving θ = (5+1)/(5+0+2) = 6/7
Priors --- Thumbtack

Alternatively, consider θ to be a random variable
Prior: P(θ) ∝ θ(1-θ)
Measurements: P(x | θ)
Posterior: P(θ | x) ∝ P(θ) P(x | θ) ∝ θ^(n1+1) (1-θ)^(n0+1)
Maximum a posteriori (MAP) estimation: find the θ that maximizes the posterior; here θMAP = (n1+1)/(n1+n0+2)
Priors --- Beta Distribution

Beta(α, β): P(θ) ∝ θ^(α-1) (1-θ)^(β-1); the MAP estimate under this prior corresponds to adding α-1 and β-1 fake counts for the two outcomes
Figure source: Wikipedia
Priors --- Dirichlet Distribution

Generalizes the Beta distribution from 2 to K outcomes
MAP estimate corresponds to adding fake counts n1, …, nK
MAP for Mean of Univariate Gaussian

Assume the variance σ² is known. (Can be extended to also find the MAP for the variance.)
Prior: μ ~ N(μ0, σ0²)
The posterior over μ is again Gaussian, with μMAP = (σ0² Σ_i x(i) + σ² μ0) / (m σ0² + σ²)
MAP for Univariate Conditional Linear Gaussian

Assume the variance is known. (Can be extended to also find the MAP for the variance.)
Prior: a zero-mean Gaussian over the regression coefficients
[Interpret!]
Interpretation: the Gaussian prior adds a squared-magnitude penalty on the coefficients to the least-squares objective, i.e., ridge-regularized least squares; a sketch follows below
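A hedged sketch of that interpretation (function and variable names are illustrative, not from the slides):

```python
import numpy as np

# MAP for linear-Gaussian regression with a N(0, beta2 * I) prior on
# the weights: ridge regression with regularizer sigma2 / beta2.
def map_linear_gaussian(X, y, sigma2, beta2):
    d = X.shape[1]
    lam = sigma2 / beta2
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)
print(map_linear_gaussian(X, y, sigma2=0.01, beta2=1.0))
```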
MAP for Univariate Conditional Linear Gaussian: Example

[Plot comparing the TRUE model, the samples, the ML fit, and the MAP fit]
Cross Validation

Choice of prior will heavily influence the quality of the result
Fine-tune the choice of prior through cross-validation:
1. Split the data into a “training” set and a “validation” set
2. For a range of priors:
   Train: compute θMAP on the training set
   Cross-validate: evaluate performance on the validation set by evaluating the likelihood of the validation data under the θMAP just found
3. Choose the prior with the highest validation score; for this prior, compute θMAP on the (training + validation) set
Typical training / validation splits (see the sketch below):
1-fold: 70/30, random split
10-fold: partition into 10 sets; average the performance over each choice of one set as the validation set with the other 9 as the training set
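A hedged sketch of the procedure for the thumbtack model with a symmetric Beta(α, α) prior (the data, split, and helper names are all illustrative):

```python
import numpy as np

# Pick the prior strength alpha by validation log-likelihood.
rng = np.random.default_rng(3)
flips = rng.random(100) < 0.8            # 1 = up, true theta = 0.8
train, val = flips[:70], flips[70:]      # "1-fold" 70/30 split

def theta_map(data, alpha):              # MAP under Beta(alpha, alpha)
    n1, n0 = data.sum(), (~data).sum()
    return (n1 + alpha - 1) / (n1 + n0 + 2 * alpha - 2)

def loglik(theta, data):
    n1, n0 = data.sum(), (~data).sum()
    return n1 * np.log(theta) + n0 * np.log(1 - theta)

for alpha in [1.5, 2.0, 5.0, 10.0]:
    theta = theta_map(train, alpha)
    print(alpha, round(theta, 3), round(loglik(theta, val), 3))
```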
Outline

Maximum likelihood (ML)
Priors, and maximum a posteriori (MAP)
Cross-validation
Expectation Maximization (EM)
Mixture of Gaussians

Generally: p(z; θ, μ, Σ) = Σ_k θk N(z; μk, Σk), a weighted sum of K Gaussians
Example: a mixture of two Gaussians in one dimension
ML Objective: given data z(1), …, z(m), maximize Σ_i log Σ_k θk N(z(i); μk, Σk)
Setting derivatives w.r.t. θ, μ, Σ equal to zero does not enable us to solve for their ML estimates in closed form
We can evaluate the function, so we can in principle perform local optimization (see future lectures). In this lecture: the “EM” algorithm, which is typically used to efficiently optimize the objective (locally)
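A sketch of evaluating this objective for a 1-D mixture (the parameter values are placeholders):

```python
import numpy as np

# Mixture-of-Gaussians log-likelihood in 1-D: the objective EM optimizes.
def mog_loglik(z, weights, mus, sigmas):
    z = np.asarray(z, dtype=float)[:, None]          # shape (m, 1)
    dens = weights * np.exp(-0.5 * ((z - mus) / sigmas) ** 2) \
           / (np.sqrt(2 * np.pi) * sigmas)           # shape (m, K)
    return np.log(dens.sum(axis=1)).sum()

print(mog_loglik([0.1, 7.4, 8.1],
                 weights=np.array([0.5, 0.5]),
                 mus=np.array([0.0, 7.5]),
                 sigmas=np.array([1.0, 1.0])))
```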
Expectation Maximization (EM)

Example: a mixture of two 1-D Gaussians with known, equal variances
Model: P(x = k) = θk, z | x = k ~ N(μk, σ²)
Goal: given data z(1), …, z(m) (but no x(i) observed), find maximum likelihood estimates of μ1, μ2
EM basic idea: if the x(i) were known, we would have two easy-to-solve separate ML problems
EM iterates over:
E-step: for i = 1, …, m, fill in the missing data x(i) according to what is most likely given the current model μ
M-step: run ML for the completed data, which gives a new model μ
A runnable sketch of these two steps follows below
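A minimal sketch, assuming equal weights and unit variances so only the two means are estimated (data and initialization are illustrative):

```python
import numpy as np

# EM for a mixture of two unit-variance 1-D Gaussians, equal weights.
rng = np.random.default_rng(4)
z = np.concatenate([rng.normal(-5, 1, 100), rng.normal(5, 1, 100)])

mu = np.array([-1.0, 1.0])                          # initial guess
for _ in range(50):
    # E-step: soft assignments q(x(i) = k) under the current means.
    logp = -0.5 * (z[:, None] - mu[None, :]) ** 2
    q = np.exp(logp - logp.max(axis=1, keepdims=True))
    q /= q.sum(axis=1, keepdims=True)
    # M-step: weighted ML update of each mean.
    mu = (q * z[:, None]).sum(axis=0) / q.sum(axis=0)
print(mu)                                           # close to [-5, 5]
```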
EM Derivation

EM solves a Maximum Likelihood problem of the form: max_θ Σ_i log Σ_x p(x, z(i); θ)
θ: parameters of the probabilistic model we try to find
x: unobserved variables
z: observed variables
The key tool in the derivation is Jensen’s Inequality
Jensen’s inequality

For a concave function f: f(E[X]) ≥ E[f(X)]
Illustration: P(X = x1) = 1-λ, P(X = x2) = λ, so E[X] = (1-λ) x1 + λ x2, and f evaluated there lies above the chord value (1-λ) f(x1) + λ f(x2)
EM Derivation (ctd)

EM Algorithm: Iterate
1. E-step: Compute q_{t+1}(x | z(i)) = p(x | z(i); θ_t)
2. M-step: Compute θ_{t+1} = argmax_θ Σ_i Σ_x q_{t+1}(x | z(i)) log p(x, z(i); θ)
Jensen’s Inequality: equality holds when the function inside the expectation is an affine function. This is achieved for q(x) = p(x | z; θ)
M-step optimization can be done efficiently in most cases
The E-step is usually the more expensive step
It does not fill in the missing data x with hard values, but finds a distribution q(x)
EM Derivation (ctd)

M-step objective is upper-bounded by the true objective
M-step objective is equal to the true objective at the current parameter estimate
Hence: the improvement in the true objective is at least as large as the improvement in the M-step objective
EM 1-D Example --- 2 iterations

Estimate a 1-D mixture of two Gaussians with unit variance, using a single parameter μ; μ1 = μ - 7.5, μ2 = μ + 7.5
EM for Mixture of Gaussians

X ~ Multinomial distribution, P(X = k; θ) = θk for k = 1, …, K
Z | X = k ~ N(μk, Σk)
Observed: z(1), z(2), …, z(m); the x(i) are hidden
EM for Mixture of Gaussians

E-step: compute the responsibilities q_i(k) = P(x(i) = k | z(i); θ, μ, Σ) ∝ θk N(z(i); μk, Σk)
M-step: θk = (1/m) Σ_i q_i(k); μk = Σ_i q_i(k) z(i) / Σ_i q_i(k); Σk = Σ_i q_i(k) (z(i) - μk)(z(i) - μk)ᵀ / Σ_i q_i(k)
ML Objective HMM

Given samples of observation sequences z only; the states x are not observed
Dynamics model: P(x_{t+1} | x_t)
Observation model: P(z_t | x_t)
ML objective: maximize the log-likelihood of the observations, which requires summing out the hidden state sequences
No simple decomposition into independent ML problems for each P(x_{t+1} | x_t) and each P(z_t | x_t)
No closed-form solution found by setting derivatives equal to zero
EM for HMM --- M-step

The dynamics parameters θ and observation parameters γ are computed from “soft” counts, i.e., the expected counts under the E-step distribution
EM for HMM --- E-step

No need to find the conditional over the full joint state sequence
Run the smoother to find the state marginals P(x_t | z_{1:T}) and the pairwise marginals P(x_t, x_{t+1} | z_{1:T}); a forward-backward sketch follows below
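A hedged sketch of the forward-backward smoother for a discrete HMM (the transition, emission, and initial distributions are toy assumptions):

```python
import numpy as np

# Forward-backward smoothing: returns P(x_t | z_{1:T}) for each t.
def smooth(pi, T, E, obs):
    n, S = len(obs), len(pi)
    alpha = np.zeros((n, S))
    beta = np.ones((n, S))
    alpha[0] = pi * E[:, obs[0]]
    alpha[0] /= alpha[0].sum()
    for t in range(1, n):                       # forward pass
        alpha[t] = (alpha[t - 1] @ T) * E[:, obs[t]]
        alpha[t] /= alpha[t].sum()
    for t in range(n - 2, -1, -1):              # backward pass
        beta[t] = T @ (E[:, obs[t + 1]] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

pi = np.array([0.5, 0.5])                       # toy parameters
T = np.array([[0.9, 0.1], [0.2, 0.8]])
E = np.array([[0.8, 0.2], [0.3, 0.7]])
print(smooth(pi, T, E, [0, 0, 1, 1]))
```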
ML Objective for Linear Gaussians

Linear Gaussian setting: x_{t+1} = A x_t + B u_t + w_t, z_t = C x_t + d + v_t
Given: the observations z (and controls u); the states x are unobserved
ML objective: maximize the likelihood of the observations over the model parameters, integrating out the hidden state trajectory
EM derivation: same as for the HMM
EM for Linear Gaussians --- E-Step

Forward: run the Kalman filter
Backward: run the Kalman smoother; together these yield the smoothed state estimates the M-step needs
EM for Linear Gaussians --- M-step

[Updates for A, B, C, d. TODO: Fill in once found/derived.]
EM for Linear Gaussians --- The Log-likelihood

When running EM, it can be good to keep track of the log-likelihood score --- with exact E- and M-steps it is guaranteed not to decrease from one iteration to the next
EM for Extended Kalman Filter Setting

As the linearization is only an approximation, performing the updates might end with parameters that result in a lower (rather than higher) log-likelihood score
Solution: instead of updating the parameters to the newly estimated ones, interpolate between the previous parameters and the newly estimated ones, and perform a “line search” to find the setting that achieves the highest log-likelihood score; a sketch follows below
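A hedged sketch of that safeguard (the `loglik` callable and the parameter vectors are illustrative placeholders):

```python
import numpy as np

# Interpolate between old and new parameters, keeping the best score.
def safeguarded_update(theta_old, theta_new, loglik, steps=11):
    best_theta, best_ll = theta_old, loglik(theta_old)
    for lam in np.linspace(0.0, 1.0, steps):
        theta = (1 - lam) * theta_old + lam * theta_new
        ll = loglik(theta)
        if ll > best_ll:
            best_theta, best_ll = theta, ll
    return best_theta

# Toy usage: a concave score peaked between the old and new parameters.
print(safeguarded_update(np.array([0.0]), np.array([1.0]),
                         lambda th: -(th[0] - 0.3) ** 2))
```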