Maximum Likelihood Estimation & Expectation Maximization
Lecture 3 – Oct 5, 2011
CSE 527 Computational Biology, Fall 2011
Instructor: Su-In Lee   TA: Christopher Miles
Monday & Wednesday 12:00–1:20, Johnson Hall (JHN) 022
Outline

- Probabilistic models in biology
- Model selection problem
- Mathematical foundations
- Bayesian networks
- Learning from data: maximum likelihood estimation, maximum a posteriori (MAP), expectation maximization
Parameter Estimation

Assumptions:
- Fixed network structure
- Fully observed instances of the network variables: D = {d[1], …, d[M]}; for example, {i0, d1, g1, l0, s0}

Maximum likelihood estimation (MLE) of the "parameters" of the Bayesian network!

(from Koller & Friedman)
The Thumbtack Example

Parameter learning for a single variable X: the outcome of a thumbtack toss, Val(X) = {head, tail}.

Data: a set of thumbtack tosses x[1], …, x[M].
Maximum Likelihood Estimation

Say that P(X = head) = Θ and P(X = tail) = 1 − Θ. Then

P(HHTTHHH… <Mh heads, Mt tails>; Θ) = Θ^Mh (1 − Θ)^Mt

Definition: the likelihood function is L(Θ : D) = P(D; Θ).

Maximum likelihood estimation (MLE): given data D = HHTTHHH… <Mh heads, Mt tails>, find the Θ that maximizes the likelihood function L(Θ : D).
Likelihood function
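The likelihood function L(Θ : D) = Θ^Mh (1 − Θ)^Mt can be evaluated on a grid to see where its maximum falls; a minimal sketch, assuming NumPy, with illustrative counts Mh and Mt:

```python
import numpy as np

# Bernoulli likelihood L(theta : D) = theta^Mh * (1 - theta)^Mt,
# evaluated on a grid of candidate theta values.
# Mh, Mt are illustrative counts (an assumption, not from the slides).
Mh, Mt = 5, 2
theta = np.linspace(0.0, 1.0, 1001)
L = theta**Mh * (1.0 - theta)**Mt

# The grid maximum lands at Mh / (Mh + Mt), up to the grid resolution.
theta_star = theta[np.argmax(L)]
print(theta_star)
```

The grid search recovers the closed-form MLE derived on the next slide, within the 0.001 grid spacing.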
MLE for the Thumbtack Problem

Given data D = HHTTHHH… <Mh heads, Mt tails>, the MLE solution is Θ* = Mh / (Mh + Mt).

Proof:
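The proof the slide leaves blank is the standard log-likelihood derivation:

```latex
L(\Theta : D) = \Theta^{M_h}(1-\Theta)^{M_t}, \qquad
\log L(\Theta : D) = M_h \log \Theta + M_t \log(1-\Theta)
```

Setting the derivative to zero:

```latex
\frac{d}{d\Theta}\log L = \frac{M_h}{\Theta} - \frac{M_t}{1-\Theta} = 0
\;\Rightarrow\; M_h(1-\Theta) = M_t\,\Theta
\;\Rightarrow\; \Theta^* = \frac{M_h}{M_h + M_t}
```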
Continuous Space

Assuming sample x1, x2, …, xn is from a parametric distribution f(x | Θ), estimate Θ.

Say the n samples are from a normal distribution with mean μ and variance σ².
Continuous Space (cont.)

Let $\Theta_1 = \mu$, $\Theta_2 = \sigma^2$.

$$L(\Theta_1, \Theta_2 : x_1, \ldots, x_n) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\Theta_2}} \exp\!\left(-\frac{(x_i - \Theta_1)^2}{2\Theta_2}\right)$$

$$\log L(\Theta_1, \Theta_2 : x_1, \ldots, x_n) = -\frac{n}{2}\log(2\pi\Theta_2) - \sum_{i=1}^{n} \frac{(x_i - \Theta_1)^2}{2\Theta_2}$$

$$\frac{\partial}{\partial \Theta_1} \log L(\Theta_1, \Theta_2 : x_1, \ldots, x_n) = 0 \;\Rightarrow\; \hat\Theta_1 = \frac{1}{n}\sum_{i=1}^{n} x_i$$

$$\frac{\partial}{\partial \Theta_2} \log L(\Theta_1, \Theta_2 : x_1, \ldots, x_n) = 0 \;\Rightarrow\; \hat\Theta_2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat\Theta_1)^2$$
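The bias of the normal MLE discussed next can be checked numerically; a minimal sketch, assuming NumPy, that averages the variance estimate over many tiny samples:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 2, 200_000                        # tiny samples of size n, repeated
x = rng.normal(0.0, 1.0, size=(trials, n))    # true mu = 0, true sigma^2 = 1

mu_hat = x.mean(axis=1, keepdims=True)        # MLE of the mean, per sample
var_hat = ((x - mu_hat) ** 2).mean(axis=1)    # MLE of the variance (divides by n)

# E[var_hat] = (n - 1)/n * sigma^2, so with n = 2 the average lands near 0.5,
# not the true variance 1.0: the MLE systematically underestimates sigma^2.
print(var_hat.mean())
```

Multiplying the estimate by n/(n−1) gives the familiar unbiased sample variance.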
Any Drawback?

Is the MLE biased? Yes. As an extreme case, when n = 1, $\hat\Theta_2 = 0$. The MLE systematically underestimates Θ2.

Why? A bit harder to see, but think about n = 2. Then $\hat\Theta_1$ lies exactly between the two sample points, the position that exactly minimizes the expression for $\hat\Theta_2$. Any other choice of Θ1, Θ2 makes the likelihood of the observed data slightly lower. But it is actually quite unlikely that two sample points would fall exactly equidistant from, and on opposite sides of, the true mean, so the MLE systematically underestimates Θ2.
Maximum A Posteriori

Incorporating priors. How?

MLE vs. MAP estimation
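One standard way to incorporate a prior, not spelled out on the slide, is to place a Beta(α, β) prior on Θ in the thumbtack problem; the posterior is then

```latex
P(\Theta \mid D) \;\propto\; P(D \mid \Theta)\,P(\Theta)
\;\propto\; \Theta^{M_h+\alpha-1}(1-\Theta)^{M_t+\beta-1}
```

and maximizing it gives the MAP estimate, which adds "pseudo-counts" to the MLE:

```latex
\Theta_{\mathrm{MAP}} = \frac{M_h + \alpha - 1}{M_h + M_t + \alpha + \beta - 2}
```

With α = β = 1 (a uniform prior), MAP coincides with MLE; as the data size M grows, the MAP estimate approaches the MLE regardless of the prior.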
MLE for General Problems

Learning problem setting:
- A set of random variables X from an unknown distribution P*
- Training data D = M instances of X: {d[1], …, d[M]}
- A parametric model P(X; Θ) (a 'legal' distribution)

Define the likelihood function:

$$L(\Theta : D) = \prod_{m=1}^{M} P(d[m]; \Theta)$$

Maximum likelihood estimation: choose parameters Θ* that satisfy

$$\Theta^* = \arg\max_{\Theta} L(\Theta : D)$$
MLE for Bayesian Networks

Structure G: a network over x1, x2, x3, x4, where x1 and x2 are parents of x3, and x1 and x3 are parents of x4. Given D: x[1], …, x[m], …, x[M], where x[m] = (x1[m], x2[m], x3[m], x4[m]), estimate θ.

Parameters θ: Θx1, Θx2, Θx3|x1,x2, Θx4|x1,x3 (more generally, Θxi|pai).

Joint distribution:

$$P_G = P(x_1, x_2, x_3, x_4) = P(x_1)\,P(x_2)\,P(x_3 \mid x_1, x_2)\,P(x_4 \mid x_1, x_3)$$

More generally:

$$P_G = \prod_i P(x_i \mid \mathbf{pa}_i)$$

Likelihood decomposition: the likelihood decomposes into a product of local likelihood functions, one per variable. The local likelihood function for Xi is:

$$L_i(\Theta_{x_i \mid \mathbf{pa}_i} : D) = \prod_{m=1}^{M} P(x_i[m] \mid \mathbf{pa}_i[m];\, \Theta_{x_i \mid \mathbf{pa}_i})$$
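Because the likelihood decomposes per family, MLE for a table CPD reduces to conditional counting: θ_{x|pa} = N(x, pa) / N(pa). A minimal sketch, where the toy data, variable names, and values are illustrative assumptions:

```python
from collections import Counter

# Toy fully observed instances for the Student-style network (assumed data).
data = [
    {"I": "i1", "D": "d0", "G": "g1"},
    {"I": "i1", "D": "d0", "G": "g1"},
    {"I": "i1", "D": "d0", "G": "g2"},
    {"I": "i0", "D": "d1", "G": "g3"},
]

def mle_cpd(data, child, parents):
    """MLE for a table CPD: theta_{child|pa} = N(child, pa) / N(pa)."""
    joint = Counter((tuple(d[p] for p in parents), d[child]) for d in data)
    parent_counts = Counter(tuple(d[p] for p in parents) for d in data)
    return {(pa, x): n / parent_counts[pa] for (pa, x), n in joint.items()}

cpd = mle_cpd(data, "G", ["I", "D"])
print(cpd[(("i1", "d0"), "g1")])  # 2 of the 3 (i1, d0) rows have G = g1
```

Each CPD is estimated independently this way, which is exactly the likelihood decomposition on the slide.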
Bayesian Network with Table CPDs

The Thumbtack example vs. the Student example:

| | The Thumbtack example | The Student example |
|---|---|---|
| Variables | X | Intelligence (I), Difficulty (D), Grade (G) |
| Data | D: {H … x[m] … T} | D: {(i1,d1,g1) … (i[m],d[m],g[m]) …} |
| Joint distribution | P(X) | P(I,D,G) = P(I) P(D) P(G\|I,D) |
| Parameters | θ | θI, θD, θG\|I,D |
| Likelihood function | L(θ:D) = P(D;θ) = θ^Mh (1−θ)^Mt | L(θ:D) = P(D;θ) |
| MLE solution | θ̂ = Mh / (Mh + Mt) | |
Maximum Likelihood Estimation: Review

- Find parameter estimates that make the observed data most likely
- A general approach, as long as a tractable likelihood function exists
- Can use all available information
Example – Gene Expression

The genome contains "coding" regions (instructions for making the proteins) and "regulatory" regions, or regulons (instructions for when and where to make them).

What turns genes on (producing a protein) and off?
- When is a gene turned on or off?
- Where (in which cells) is a gene turned on?
- How many copies of the gene product are produced?

Regulatory regions contain "binding sites" (6–20 bp). Binding sites attract a special class of proteins, known as "transcription factors". Bound transcription factors can initiate transcription (making RNA). Proteins that inhibit transcription can also be bound to their binding sites.
Regulation of Genes

[Figure sequence over four slides: a transcription factor (protein) binds a regulatory element (binding site, AC..TCG..A) on the DNA; RNA polymerase is recruited to the gene; transcription proceeds; a new protein is produced.]

source: M. Tompa, U. of Washington
The Gene Regulation Example

What determines the expression level of a gene? What are the observed and hidden variables?

e.G (the expression level of the gene) and the e.TF's (the expression levels of transcription factors TF1, …, TFN) are observed; Process.G (the biological process the gene is involved in) is a hidden variable we want to infer!

[Figure: a network with nodes e.TF1 = p1, e.TF2 = p2, e.TF3 = p3, e.TF4, …, e.TFN feeding into Process.G, which determines e.G.]
The Gene Regulation Example (cont.)

What determines the expression level of a gene? What are the observed and hidden variables?

e.G and the e.TF's are observed; Process.G is a hidden variable we want to infer! How about the BS.G's (whether the gene has TFi's binding site)? How deterministic is the sequence of a binding site? How much do we know?

[Figure: the same network, extended with nodes BS1.G = Yes, …, BSN.G = Yes also feeding into e.G.]
Not All Data Are Perfect

Most MLE problems are simple to solve with complete data. But available data are often "incomplete" in some way.
Outline

Learning from data:
- Maximum likelihood estimation (MLE)
- Maximum a posteriori (MAP)
- Expectation-maximization (EM) algorithm
Continuous Space Revisited

Assume sample x1, x2, …, xn is from a mixture of parametric distributions.

[Figure: points x1, x2, …, xm from one component and xm+1, …, xn from another, plotted along the x axis.]
A Real Example

CpG content of human gene promoters: "A genome-wide analysis of CpG dinucleotides in the human genome distinguishes two distinct classes of promoters," Saxonov, Berg, and Brutlag, PNAS 2006;103:1412–1417.

[Figure: histogram of GC frequency across promoters, showing two classes.]
Mixture of Gaussians

Parameters θ: means μ1, μ2; variances σ1², σ2²; mixing parameters τ1, τ2 = 1 − τ1.

P.D.F.:

$$f(x;\, \mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \tau_1, \tau_2) = \tau_1 \mathcal{N}(x; \mu_1, \sigma_1^2) + \tau_2 \mathcal{N}(x; \mu_2, \sigma_2^2)$$

Likelihood:

$$L(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \tau_1, \tau_2 : x_1, \ldots, x_n) = \prod_{i=1}^{n} \sum_{j=1}^{2} \tau_j\, \mathcal{N}(x_i; \mu_j, \sigma_j^2)$$
Apply MLE?

$$L(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \tau_1, \tau_2 : x_1, \ldots, x_n) = \prod_{i=1}^{n} \sum_{j=1}^{2} \tau_j\, \mathcal{N}(x_i; \mu_j, \sigma_j^2)$$

No closed-form solution is known for finding the θ maximizing L. However, what if we knew the hidden data?
EM as Chicken vs. Egg

- IF the zij were known, we could estimate the parameters θ; e.g., only points in cluster 2 influence μ2, σ2.
- IF the parameters θ were known, we could estimate the zij; e.g., if |xi − μ1|/σ1 << |xi − μ2|/σ2, then zi1 >> zi2.
- BUT we know neither; (optimistically) iterate:
  - E-step: calculate the expected zij, given the parameters
  - M-step: do "MLE" for the parameters (μ, σ), given E(zij)

Overall, a clever "hill-climbing" strategy. Convergence provable? YES.
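The E/M iteration above can be sketched for the two-Gaussian mixture; a minimal NumPy implementation, where the initialization scheme and the synthetic data are illustrative assumptions:

```python
import numpy as np

def normal_pdf(x, mu, var):
    """Density of N(mu, var) evaluated at x."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_two_gaussians(x, n_iter=100):
    """EM for a two-component 1-D Gaussian mixture (a minimal sketch)."""
    # Crude initialization from the data quartiles (an assumption).
    mu = np.array([np.percentile(x, 25), np.percentile(x, 75)])
    var = np.array([x.var(), x.var()])
    tau = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibilities E[z_ij] = P(z_ij = 1 | x_i, theta).
        w = tau * np.stack([normal_pdf(x, mu[j], var[j]) for j in range(2)], axis=1)
        z = w / w.sum(axis=1, keepdims=True)
        # M-step: weighted MLE of (mu, sigma^2, tau) given E[z_ij].
        Nj = z.sum(axis=0)
        mu = (z * x[:, None]).sum(axis=0) / Nj
        var = (z * (x[:, None] - mu) ** 2).sum(axis=0) / Nj
        tau = Nj / len(x)
    return mu, var, tau

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 1, 500)])
mu, var, tau = em_two_gaussians(x)
print(np.sort(mu))  # close to the true component means, -2 and 3
```

Each iteration alternates exactly the two steps on the slide: responsibilities given parameters, then parameters given responsibilities.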
"Classification EM"

If zij < 0.5, pretend it's 0; if zij > 0.5, pretend it's 1, i.e., classify each point as belonging to component 0 or 1. Now recalculate θ assuming that partition; then recalculate zij assuming that θ; then recalculate θ assuming the new zij; etc., etc.
Full EM

The xi's are known; Θ is unknown. The goal is to find the MLE Θ of

L(Θ : x1, …, xn)   (hidden-data likelihood)

This would be easy if the zij's were known, i.e., consider

L(Θ : x1, …, xn, z11, z12, …, zn2)   (complete-data likelihood)

But the zij's are not known. Instead, maximize the expected likelihood of the observed data,

E[ L(Θ : x1, …, xn, z11, z12, …, zn2) ]

where the expectation is over the distribution of the hidden data (the zij's).
The E-step

Find E(zij), i.e., P(zij = 1). Assume θ is known and fixed. Let
- A: the event that xi was drawn from f1
- B: the event that xi was drawn from f2
- D: the observed datum xi

Then the expected value of zi1 is P(A|D). By Bayes' rule,

$$P(A \mid D) = \frac{P(D \mid A)\,P(A)}{P(D \mid A)\,P(A) + P(D \mid B)\,P(B)} = \frac{\tau_1 f_1(x_i)}{\tau_1 f_1(x_i) + \tau_2 f_2(x_i)}$$
Complete Data Likelihood

Recall that xi is drawn from f1 if zi1 = 1 and from f2 if zi2 = 1. So, correspondingly, the complete-data likelihood contribution of xi is f1(xi) if zi1 = 1 and f2(xi) if zi2 = 1. Formulas with "if's" are messy; can we blend more smoothly? One standard blend uses the 0/1 indicators as exponents:

$$P(x_i \mid z_{i1}, z_{i2}) = f_1(x_i)^{z_{i1}}\, f_2(x_i)^{z_{i2}}$$
M-step

Find the θ maximizing E[ log(Likelihood) ].
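For the two-component Gaussian mixture, the maximizing θ has closed form: the standard M-step updates are weighted versions of the usual MLE formulas, with the responsibilities E[z_ij] as weights:

```latex
\mu_j = \frac{\sum_{i=1}^{n} E[z_{ij}]\, x_i}{\sum_{i=1}^{n} E[z_{ij}]},
\qquad
\sigma_j^2 = \frac{\sum_{i=1}^{n} E[z_{ij}]\,(x_i - \mu_j)^2}{\sum_{i=1}^{n} E[z_{ij}]},
\qquad
\tau_j = \frac{1}{n}\sum_{i=1}^{n} E[z_{ij}]
```

When every E[z_ij] is 0 or 1, these reduce to the ordinary per-cluster MLE formulas.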
EM Summary

- Fundamentally an MLE problem
- Useful if the analysis is more tractable when the 0/1 hidden data z are known
- Iterate:
  - E-step: estimate E(z) for each z, given θ
  - M-step: estimate the θ maximizing E(log likelihood) given E(z), where "E(log L)" is with respect to the random z ~ E(z) = p(z = 1)
EM Issues

EM is guaranteed to increase the likelihood with every E-M iteration, hence it will converge. But it may converge to a local, not global, maximum. This issue is (probably) intrinsic, since EM is often applied to NP-hard problems (including clustering, above, and motif discovery, soon). Nevertheless, EM is widely used and often effective.
Acknowledgement

- Profs. Daphne Koller & Nir Friedman, "Probabilistic Graphical Models"
- Prof. Larry Ruzzo, CSE 527, Autumn 2009
- Prof. Andrew Ng, ML lecture notes