Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W....
-
Upload
gwendolyn-allison -
Category
Documents
-
view
218 -
download
0
Transcript of Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W....
![Page 1: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/1.jpg)
Sep 6th, 2001Copyright © 2001, 2004, Andrew W.
Moore
Learning with Maximum Likelihood
Andrew W. MooreProfessor
School of Computer ScienceCarnegie Mellon University
www.cs.cmu.edu/[email protected]
412-268-7599
Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew’s tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.
![Page 2: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/2.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 2
Maximum Likelihood learning of Gaussians for Data Mining
• Why we should care• Learning Univariate Gaussians• Learning Multivariate Gaussians• What’s a biased estimator?• Bayesian Learning of Gaussians
![Page 3: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/3.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 3
Why we should care• Maximum Likelihood Estimation is a
very very very very fundamental part of data analysis.
• “MLE for Gaussians” is training wheels for our future techniques
• Learning Gaussians is more useful than you might guess…
![Page 4: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/4.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 4
Learning Gaussians from Data
• Suppose you have x1, x2, … xR ~ (i.i.d) N(,2)• But you don’t know
(you do know 2)
MLE: For which is x1, x2, … xR most likely?
MAP: Which maximizes p(|x1, x2, … xR , 2)?
![Page 5: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/5.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 5
Learning Gaussians from Data
• Suppose you have x1, x2, … xR ~(i.i.d) N(,2)• But you don’t know
(you do know 2)
MLE: For which is x1, x2, … xR most likely?
MAP: Which maximizes p(|x1, x2, … xR , 2)?
Sneer
![Page 6: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/6.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 6
Learning Gaussians from Data
• Suppose you have x1, x2, … xR ~(i.i.d) N(,2)• But you don’t know
(you do know 2)
MLE: For which is x1, x2, … xR most likely?
MAP: Which maximizes p(|x1, x2, … xR , 2)?
Sneer
Despite this, we’ll spend 95% of our time on MLE. Why? Wait and see…
![Page 7: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/7.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 7
MLE for univariate Gaussian• Suppose you have x1, x2, … xR ~(i.i.d) N(,2)• But you don’t know (you do know 2)• MLE: For which is x1, x2, … xR most likely?
),|,...,(maxarg 221
R
mle xxxp
![Page 8: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/8.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 8
Algebra Euphoria),|,...,(maxarg 2
21
Rmle xxxp
= (by i.i.d)
= (monotonicity of log)
= (plug in formula for Gaussian)
= (after simplification)
![Page 9: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/9.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 9
= (by i.i.d)
= (monotonicity of log)
= (plug in formula for Gaussian)
= (after simplification)
R
iixp
1
2 ),|(maxarg
Algebra Euphoria),|,...,(maxarg 2
21
Rmle xxxp
R
iixp
1
2 ),|(logmaxarg
R
i
ix
12
2
2
)(
2
1maxarg
R
iix
1
2)(minarg
![Page 10: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/10.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 10
Intermission: A General Scalar MLE strategy
Task: Find MLE assuming known form for p(Data| ,stuff)
1. Write LL = log P(Data| ,stuff)2. Work out LL/ using high-school calculus3. Set LL/=0 for a maximum, creating an equation in
terms of 4. Solve it*5. Check that you’ve found a maximum rather than a
minimum or saddle-point, and be careful if is constrained
*This is a perfect example of something that works perfectly in all textbook examples and usually involves surprising
pain if you need it for something new.
![Page 11: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/11.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 11
The MLE ),|,...,(maxarg 2
21
Rmle xxxp
R
iix
1
2)(minarg
LL0s.t.
= (what?)
![Page 12: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/12.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 12
The MLE ),|,...,(maxarg 2
21
Rmle xxxp
R
iix
1
2)(minarg
LL0s.t.
R
iix
1
2)(
R
iix
1
)(2
R
iixR 1
1 Thus
![Page 13: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/13.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 13
Lawks-a-lawdy!
R
ii
mle xR 1
1
• The best estimate of the mean of a distribution is the mean of the sample!
At first sight:This kind of pedantic, algebra-filled and ultimately unsurprising fact is exactly the reason people throw down their “Statistics” book and
pick up their “Agent Based Evolutionary Data Mining Using The
Neuro-Fuzz Transform” book.
![Page 14: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/14.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 14
A General MLE strategySuppose = (1, 2, …, n)T is a vector of parameters.
Task: Find MLE assuming known form for p(Data| ,stuff)
1. Write LL = log P(Data| ,stuff)2. Work out LL/ using high-school calculus
nθ
LL
θ
LLθ
LL
LL
2
1
θ
![Page 15: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/15.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 15
A General MLE strategySuppose = (1, 2, …, n)T is a vector of parameters.
Task: Find MLE assuming known form for p(Data| ,stuff)
1. Write LL = log P(Data| ,stuff)2. Work out LL/ using high-school calculus3. Solve the set of simultaneous equations
0
0
0
2
1
nθ
LL
θ
LLθ
LL
![Page 16: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/16.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 16
A General MLE strategySuppose = (1, 2, …, n)T is a vector of parameters.
Task: Find MLE assuming known form for p(Data| ,stuff)
1. Write LL = log P(Data| ,stuff)2. Work out LL/ using high-school calculus3. Solve the set of simultaneous equations
0
0
0
2
1
nθ
LL
θ
LLθ
LL
4. Check that you’re at a maximum
![Page 17: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/17.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 17
A General MLE strategySuppose = (1, 2, …, n)T is a vector of parameters.
Task: Find MLE assuming known form for p(Data| ,stuff)
1. Write LL = log P(Data| ,stuff)2. Work out LL/ using high-school calculus3. Solve the set of simultaneous equations
0
0
0
2
1
nθ
LL
θ
LLθ
LL
4. Check that you’re at a maximum
If you can’t solve them, what should
you do?
![Page 18: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/18.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 18
MLE for univariate Gaussian• Suppose you have x1, x2, … xR ~(i.i.d) N(,2)• But you don’t know or 2
• MLE: For which =(,2) is x1, x2,…xR most likely?
2
12
2221 )(
2
1)log
2
1(log),|,...,( log
R
iiR xRxxxp
)(1
12
R
iix
LL
2
1422
)(2
1
2
R
iix
RLL
![Page 19: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/19.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 19
MLE for univariate Gaussian• Suppose you have x1, x2, … xR ~(i.i.d) N(,2)• But you don’t know or 2
• MLE: For which =(,2) is x1, x2,…xR most likely?
2
12
2221 )(
2
1)log
2
1(log),|,...,( log
R
iiR xRxxxp
)(1
01
2
R
iix
2
142
)(2
1
20
R
iix
R
![Page 20: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/20.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 20
MLE for univariate Gaussian• Suppose you have x1, x2, … xR ~(i.i.d) N(,2)• But you don’t know or 2
• MLE: For which =(,2) is x1, x2,…xR most likely?
2
12
2221 )(
2
1)log
2
1(log),|,...,( log
R
iiR xRxxxp
R
ii
R
ii x
Rx
112
1)(
10
what?)(2
1
20 2
142
R
iix
R
![Page 21: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/21.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 21
MLE for univariate Gaussian• Suppose you have x1, x2, … xR ~(i.i.d) N(,2)• But you don’t know or 2
• MLE: For which =(,2) is x1, x2,…xR most likely?
R
ii
mle xR 1
1
2
1
2 )(1 mle
R
iimle x
R
![Page 22: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/22.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 22
Unbiased Estimators• An estimator of a parameter is unbiased if the
expected value of the estimate is the same as the true value of the parameters.
• If x1, x2, … xR ~(i.i.d) N(,2) then
R
ii
mle xR
EE1
1][
mle is unbiased
![Page 23: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/23.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 23
Biased Estimators• An estimator of a parameter is biased if the expected
value of the estimate is different from the true value of the parameters.
• If x1, x2, … xR ~(i.i.d) N(,2) then
2
2
11
2
1
2 11)(
1
R
jj
R
ii
mleR
iimle x
Rx
REx
REE
2mle is biased
![Page 24: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/24.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 24
MLE Variance Bias• If x1, x2, … xR ~(i.i.d) N(,2) then
22
2
11
2 11
11
Rx
Rx
REE
R
jj
R
iimle
Intuition check: consider the case of R=1
Why should our guts expect that 2mle would be an
underestimate of true 2?
How could you prove that?
![Page 25: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/25.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 25
Unbiased estimate of Variance
• If x1, x2, … xR ~(i.i.d) N(,2) then
22
2
11
2 11
11
Rx
Rx
REE
R
jj
R
iimle
So define
R
mle
11
22unbiased
22unbiased So E
![Page 26: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/26.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 26
Unbiased estimate of Variance
• If x1, x2, … xR ~(i.i.d) N(,2) then
22
2
11
2 11
11
Rx
Rx
REE
R
jj
R
iimle
So define
R
mle
11
22unbiased
2
1
2unbiased )(
1
1 mleR
iixR
22unbiased So E
![Page 27: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/27.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 27
Unbiaseditude discussion• Which is best?
2
1
2unbiased )(
1
1 mleR
iixR
2
1
2 )(1 mle
R
iimle x
R
Answer:
•It depends on the task
•And doesn’t make much difference once R--> large
![Page 28: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/28.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 28
Don’t get too excited about being unbiased
• Assume x1, x2, … xR ~(i.i.d) N(,2)
• Suppose we had these estimators for the mean
R
ii
suboptimal xRR 17
1
1xcrap
Are either of these unbiased?
Will either of them asymptote to the correct value as R gets large?
Which is more useful?
![Page 29: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/29.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 29
MLE for m-dimensional Gaussian
• Suppose you have x1, x2, … xR ~(i.i.d) N(,)• But you don’t know or • MLE: For which =(,) is x1, x2, … xR most likely?
R
kk
mle
R 1
1xμ
R
k
Tmlek
mlek
mle
R 1
1μxμxΣ
![Page 30: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/30.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 30
MLE for m-dimensional Gaussian
• Suppose you have x1, x2, … xR ~(i.i.d) N(,)• But you don’t know or • MLE: For which =(,) is x1, x2, … xR most likely?
R
kk
mle
R 1
1xμ
R
k
Tmlek
mlek
mle
R 1
1 xxΣ
R
kki
mlei Rμ
1
1x
Where 1 i m
And xki is value of the ith component of xk (the ith attribute of the kth record)
And imle is the ith
component of mle
![Page 31: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/31.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 31
MLE for m-dimensional Gaussian
• Suppose you have x1, x2, … xR ~(i.i.d) N(,)• But you don’t know or • MLE: For which =(,) is x1, x2, … xR most likely?
R
kk
mle
R 1
1xμ
R
k
Tmlek
mlek
mle
R 1
1 xxΣ
R
k
mlejkj
mleiki
mleij R 1
1 xx
Where 1 i m, 1 j m
And xki is value of the ith component of xk (the ith attribute of the kth record)
And ijmle is the (i,j)th
component of mle
![Page 32: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/32.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 32
MLE for m-dimensional Gaussian
• Suppose you have x1, x2, … xR ~(i.i.d) N(,)• But you don’t know or • MLE: For which =(,) is x1, x2, … xR most likely?
R
kk
mle
R 1
1xμ
R
k
Tmlek
mlek
mle
R 1
1 xxΣ
Q: How would you prove this?
A: Just plug through the MLE recipe.
Note how mle is forced to be symmetric non-negative definite
Note the unbiased case
How many datapoints would you need before the Gaussian has a chance of being non-degenerate?
R
k
Tmlek
mlek
mle
RR
1
unbiased
1
11
1 xx
ΣΣ
![Page 33: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/33.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 33
Confidence intervals
We need to talk
We need to discuss how accurate we expect mle and mle to be as a function of R
And we need to consider how to estimate these accuracies from data…
•Analytically *
•Non-parametrically (using randomization and bootstrapping) *
But we won’t. Not yet.
*Will be discussed in future Andrew lectures…just before we need this technology.
![Page 34: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/34.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 34
Structural error
Actually, we need to talk about something else too..
What if we do all this analysis when the true distribution is in fact not Gaussian?
How can we tell? *
How can we survive? *
*Will be discussed in future Andrew lectures…just before we need this technology.
![Page 35: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/35.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 35
Gaussian MLE in action Using R=392 cars from the “MPG” UCI dataset supplied by Ross Quinlan
![Page 36: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/36.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 36
Data-starved Gaussian MLEUsing three subsets of MPG.
Each subset has 6 randomly-chosen cars.
![Page 37: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/37.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 37
Biv
ari
ate
MLE
in
act
ion
![Page 38: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/38.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 38
Multivariate MLE
Covariance matrices are not exciting to look at
![Page 39: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/39.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 39
Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~(i.i.d) N(,)• But you don’t know or • MAP: Which (,) maximizes p(, |x1, x2, … xR)?
Step 1: Put a prior on (,)
![Page 40: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/40.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 40
Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~(i.i.d) N(,)• But you don’t know or • MAP: Which (,) maximizes p(, |x1, x2, … xR)?
Step 1: Put a prior on (,)
Step 1a: Put a prior on
0-m-1) ~ IW(0, 0-m-1) 0 )
This thing is called the Inverse-Wishart distribution.
A PDF over SPD matrices!
![Page 41: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/41.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 41
Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~(i.i.d) N(,)• But you don’t know or • MAP: Which (,) maximizes p(, |x1, x2, … xR)?
Step 1: Put a prior on (,)
Step 1a: Put a prior on
0-m-1) ~ IW(0, 0-m-1) 0 )
This thing is called the Inverse-Wishart distribution.
A PDF over SPD matrices!
0 small: “I am not sure about my guess
of 0 “
0 large: “I’m pretty sure about my guess
of 0 “
0 : (Roughly) my best guess of
0
![Page 42: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/42.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 42
Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~(i.i.d) N(,)• But you don’t know or • MAP: Which (,) maximizes p(, |x1, x2, … xR)?
Together, “” and “ | ” define a joint distribution on (,)
Step 1: Put a prior on (,)
Step 1a: Put a prior on
0-m-1) ~ IW(0, 0-m-1)0 )
Step 1b: Put a prior on |
| ~ N(0 , / 0)
![Page 43: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/43.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 43
Step 1: Put a prior on (,)
Step 1a: Put a prior on
0-m-1) ~ IW(0, 0-m-1)0 )
Step 1b: Put a prior on |
| ~ N(0 , / 0)
Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~(i.i.d) N(,)• But you don’t know or • MAP: Which (,) maximizes p(, |x1, x2, … xR)?
0 : My best guess of
E0
Together, “” and “ | ” define a joint distribution on (,)
0 small: “I am not sure about my guess
of 0 “
0 large: “I’m pretty sure about my guess
of 0 “
Notice how we are forced to express our ignorance of proportionally to
![Page 44: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/44.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 44
Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~(i.i.d) N(,)• But you don’t know or • MAP: Which (,) maximizes p(, |x1, x2, … xR)?
Why do we use this form of prior?
Step 1: Put a prior on (,)
Step 1a: Put a prior on
0-m-1) ~ IW(0, 0-m-1)0 )
Step 1b: Put a prior on |
| ~ N(0 , / 0)
![Page 45: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/45.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 45
Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~(i.i.d) N(,)• But you don’t know or • MAP: Which (,) maximizes p(, |x1, x2, … xR)?
Step 1: Put a prior on (,)
Step 1a: Put a prior on
0-m-1) ~ IW(0, 0-m-1)0 )
Step 1b: Put a prior on |
| ~ N(0 , / 0)
Why do we use this form of prior?
Actually, we don’t have to
But it is computationally and algebraically convenient…
…it’s a conjugate prior.
![Page 46: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/46.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 46
Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~(i.i.d) N(,)• MAP: Which (,) maximizes p(, |x1, x2, … xR)?
Step 1: Prior: 0-m-1) ~ IW(0, 0-m-1)0 ), | ~ N(0 , / 0)
Step 2:
R
kkR 1
1xx
R
RR
0
00
xμ
μRR 0
RR 0
R
mmTR
k
TkkRR /1/1
)1()1(0
00
100
μxμx
xxxxΣΣ
Step 3: Posterior: (R+m-1) ~ IW(R, (R+m-1) R ),
| ~ N(R , / R)
Result: map = R, E[ |x1, x2, … xR ]= R
![Page 47: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/47.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 47
Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~(i.i.d) N(,)• MAP: Which (,) maximizes p(, |x1, x2, … xR)?
Step 1: Prior: 0-m-1) ~ IW(0, 0-m-1)0 ), | ~ N(0 , / 0)
Step 2:
R
kkR 1
1xx
R
RR
0
00
xμ
μRR 0
RR 0
R
mmTR
k
TkkRR /1/1
)1()1(0
00
100
μxμx
xxxxΣΣ
Step 3: Posterior: (R+m-1) ~ IW(R, (R+m-1) R ),
| ~ N(R , / R)
Result: map = R, E[ |x1, x2, … xR ]= R
•Look carefully at what these formulae are doing. It’s all very sensible.•Conjugate priors mean prior form and posterior form are same and characterized by “sufficient statistics” of the data.•The marginal distribution on is a student-t•One point of view: it’s pretty academic if R > 30
![Page 48: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/48.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 48
Where we’re atIn
pu
ts
ClassifierPredict
category
Inp
uts Density
EstimatorProb-ability
Inp
uts
RegressorPredictreal no.
Categorical inputs only
Mixed Real / Cat okay
Real-valued inputs only
Dec TreeJoint BC
Naïve BC
Joint DE
Naïve DE
Gauss DE
![Page 49: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/49.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 49
What you should know• The Recipe for MLE• What do we sometimes prefer MLE to
MAP?• Understand MLE estimation of
Gaussian parameters• Understand “biased estimator” versus
“unbiased estimator”• Appreciate the outline behind Bayesian
estimation of Gaussian parameters
![Page 50: Sep 6th, 2001Copyright © 2001, 2004, Andrew W. Moore Learning with Maximum Likelihood Andrew W. Moore Professor School of Computer Science Carnegie Mellon.](https://reader036.fdocuments.us/reader036/viewer/2022062519/5697c0031a28abf838cc3980/html5/thumbnails/50.jpg)
Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 50
Useful exercise• We’d already done some MLE in this
class without even telling you!• Suppose categorical arity-n inputs x1,
x2, … xR~(i.i.d.) from a multinomial
M(p1, p2, … pn)
where
P(xk=j|p)=pj
• What is the MLE p=(p1, p2, … pn)?