Transcript of Probability Estimation — 10-601b/slides/mle_map.pdf · 2015. 9. 3.
[Slide 1]
Probability Estimation
Machine Learning 10-601B
Seyoung Kim
Many of these slides are derived from Tom Mitchell, William Cohen, Eric Xing. Thanks!
[Slide 2]
Overview
• Joint probability distribution
– a functional mapping f: X → Y via a probability distribution
• Probability estimation
– maximum likelihood estimation
– maximum a posteriori estimation
[Slide 3]
What does all this have to do with function approximation for f: X → Y?
[Slide 4]
Joint Probability Distribution
Once you have the joint distribution, you can ask for the probability of any logical expression involving your attributes.
[Slide 5]
Using the Joint Distribution
P(Poor, Male) = 0.4654
P(A) = P(A ^ B) + P(A ^ ~B)
[Slide 6]
Using the Joint Distribution
P(Poor) = 0.7604
[Slide 7]
Inference with the Joint
[Slide 8]
Inference with the Joint
P(Male | Poor) = 0.4654 / 0.7604 = 0.612
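The inference above — marginalize to get P(Poor), then divide — can be sketched with a small joint table. The "poor" cells use the slide's numbers; the "rich" cells are hypothetical fillers chosen so the table sums to 1:

```python
# A joint distribution over (wealth, gender).  The "poor" row uses the
# slide's numbers; the "rich" cells are made up so the table sums to 1.
joint = {
    ("poor", "male"):   0.4654,
    ("poor", "female"): 0.2950,  # hypothetical
    ("rich", "male"):   0.1044,  # hypothetical
    ("rich", "female"): 0.1352,  # hypothetical
}

def prob(event):
    """P(event): sum the joint over the rows where the event holds."""
    return sum(p for row, p in joint.items() if event(row))

p_poor = prob(lambda r: r[0] == "poor")              # marginal: 0.7604
p_poor_male = prob(lambda r: r == ("poor", "male"))  # 0.4654
p_male_given_poor = p_poor_male / p_poor             # conditional: ~0.612
```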
[Slide 9]
Learning and the Joint Distribution
Suppose we want to learn the function f: <G, H> → W
Equivalently, P(W | G, H)
Solution: learn joint distribution from data, calculate P(W | G, H)
e.g., P(W=rich | G = female, H = 40.5- ) =
[A. Moore]
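The "learn the joint distribution from data, then condition" recipe can be sketched as follows. The records and their values are entirely hypothetical:

```python
from collections import Counter

# Hypothetical records of (gender, hours-worked bucket, wealth).
data = [
    ("female", "<40.5", "rich"), ("female", "<40.5", "poor"),
    ("female", "<40.5", "poor"), ("male",   ">40.5", "rich"),
    ("male",   "<40.5", "poor"), ("female", ">40.5", "poor"),
]

# Learn the joint distribution: fraction of records matching each row.
joint = {row: c / len(data) for row, c in Counter(data).items()}

# Condition: P(W = rich | G = female, H = <40.5)
num = joint.get(("female", "<40.5", "rich"), 0.0)
den = sum(p for (g, h, w), p in joint.items() if (g, h) == ("female", "<40.5"))
p_rich_given = num / den  # 1/3 on this toy data
```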
[Slide 10]
Copyright © Andrew W. Moore
Density Estimation
• Our Joint Distribution learner is our first example of something called Density Estimation
• A Density Estimator learns a mapping from a set of attribute values to a probability

Input Attributes → Density Estimator → Probability
[Slide 11]
Density Estimation
• Compare it against the two other major kinds of models:

Input Attributes → Regressor → Prediction of real-valued output
Input Attributes → Density Estimator → Probability
Input Attributes → Classifier → Prediction of categorical output or class (one of a few discrete values)
[Slide 12]
Density Estimation → Classification

Input Attributes x → Classifier → Prediction of categorical output: one of y1, …, yk
To classify x:
1. Use your estimator to compute P̂(x, y1), …, P̂(x, yk)
2. Return the class y* with the highest predicted probability

Ideally this is correct with P̂(y* | x) = P̂(x, y*) / (P̂(x, y1) + … + P̂(x, yk))

Input Attributes x, Class y → Density Estimator → P̂(x, y)

Binary case: predict POS if P̂(x, ypos) > 0.5
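The classification rule above — return the y maximizing the estimated joint probability — can be sketched in a few lines. The table of P̂ values standing in for a learned density estimator is made up:

```python
def classify(x, classes, p_hat):
    """Return the class y* with the highest estimated joint probability."""
    return max(classes, key=lambda y: p_hat(x, y))

# Toy density estimator: a hard-coded table of made-up P̂(x, y) values.
table = {("x1", "pos"): 0.30, ("x1", "neg"): 0.10,
         ("x2", "pos"): 0.15, ("x2", "neg"): 0.45}
p_hat = lambda x, y: table[(x, y)]

y_star = classify("x1", ["pos", "neg"], p_hat)  # picks the argmax class
```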
[Slide 13]
Classification vs Density Estimation
[Slide 14]
Classification vs density estimation
[Slide 15]
Modeling Uncertainty with Probabilities
• Y is a Boolean-valued random variable if
– Y denotes an event,
– there is uncertainty as to whether Y occurs.
• More examples:
– Y = you wake up tomorrow with a headache
– Y = the US president in 2023 will be male
– Y = there is intelligent life elsewhere in our galaxy
– Y = the 1,000,000,000,000th digit of π is 7
– Y = I woke up today with a headache
• Define P(Y|X) as "the fraction of possible worlds in which Y is true, given X"
[Slide 16]
Sounds like the solution to learning F: X → Y, or P(Y | X).
Are we done?
[Slide 17]
[C. Guestrin]
[Slide 18]
[C. Guestrin]
D = {D1, D2, D3, D4, D5}

$P(D \mid \theta) = \binom{\alpha_H + \alpha_T}{\alpha_H}\, \theta^{\alpha_H} (1-\theta)^{\alpha_T}$

where αH and αT are the numbers of heads and tails in D.
[Slide 19]
[C. Guestrin]
$P(D \mid \theta) = \binom{\alpha_H + \alpha_T}{\alpha_H}\, \theta^{\alpha_H} (1-\theta)^{\alpha_T}$
[Slide 20]
Maximum Likelihood Estimate for Θ
[C. Guestrin]
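For the coin example with αH heads and αT tails, setting the derivative of the log-likelihood to zero gives the standard closed form:

```latex
\ell(\theta) = \log P(D \mid \theta)
             = \alpha_H \log\theta + \alpha_T \log(1-\theta) + \text{const}

\frac{d\ell}{d\theta} = \frac{\alpha_H}{\theta} - \frac{\alpha_T}{1-\theta} = 0
\quad\Longrightarrow\quad
\hat\theta_{\mathrm{MLE}} = \frac{\alpha_H}{\alpha_H + \alpha_T}
```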
[Slide 21]
[C. Guestrin]
[Slide 22]
[C. Guestrin]
[Slide 23]
Issues with the MLE estimate

I bought a loaded 20-faced die (d20) on eBay… but it didn't come with any specs. How can I find out how it behaves?

1. Collect some data (20 rolls)
2. Estimate P(i) = CountOf(rolls of i) / CountOf(any roll)
[Slide 24]
Issues with the MLE estimate

I bought a loaded d20 on eBay… but it didn't come with any specs. How can I find out how it behaves?
P(1)=0
P(2)=0
P(3)=0
P(4)=0.1
…
P(19)=0.25
P(20)=0.2
But: do I really think it's impossible to roll a 1, 2, or 3?
[Slide 25]
A better solution

I bought a loaded d20 on eBay… but it didn't come with any specs. How can I find out how it behaves?

0. Imagine some data (20 rolls, each i shows up 1×)
1. Collect some data (20 rolls)
2. Estimate P(i)
[Slide 26]
A better solution?
P(1)=1/40
P(2)=1/40
P(3)=1/40
P(4)=(2+1)/40
…
P(19)=(5+1)/40
P(20)=(4+1)/40=1/8
$\hat P(i) = \dfrac{\mathrm{CountOf}(i) + 1}{\mathrm{CountOf}(ANY) + \mathrm{CountOf}(IMAGINED)}$

0.2 vs. 0.125 – really different! Maybe I should "imagine" less data?
MAP = maximum a posteriori estimate
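The "imagined data" estimate above is add-one (Laplace) smoothing. A minimal sketch, with 20 hypothetical rolls chosen to match the slide's counts:

```python
from collections import Counter

def smoothed_estimates(rolls, sides=20, pseudo=1):
    """Add-`pseudo` smoothing: (CountOf(i) + pseudo) / (N + sides * pseudo)."""
    counts = Counter(rolls)
    total = len(rolls) + sides * pseudo
    return {i: (counts[i] + pseudo) / total for i in range(1, sides + 1)}

# 20 hypothetical rolls of the loaded d20, chosen to match the slide's
# counts (no 1s, 2s, or 3s; two 4s; five 19s; four 20s).
rolls = [4, 4] + [19] * 5 + [20] * 4 + [5, 6, 7, 8, 9, 10, 11, 12, 13]
p = smoothed_estimates(rolls)
# p[1] is 1/40 (unseen but not impossible); p[20] is (4 + 1) / 40 = 0.125
```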
[Slide 27]
[C. Guestrin]
[Slide 28]
Beta prior distribution – P(θ)
[C. Guestrin]
[Slide 29]
Beta prior distribution – P(θ)
[C. Guestrin]
$P(D \mid \theta) = \binom{\alpha_H + \alpha_T}{\alpha_H}\, \theta^{\alpha_H} (1-\theta)^{\alpha_T}$
[Slide 30]
[C. Guestrin]
[Slide 31]
[C. Guestrin]

$\hat\theta_{\mathrm{MAP}} = \dfrac{\alpha_H + \beta_H - 1}{\alpha_H + \beta_H - 1 + \alpha_T + \beta_T - 1}$
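The MAP formula above (the posterior mode under a Beta(βH, βT) prior) can be compared directly against the MLE. The counts and the prior here are made up:

```python
def theta_mle(aH, aT):
    """MLE for a coin: fraction of heads."""
    return aH / (aH + aT)

def theta_map(aH, aT, bH, bT):
    """Posterior mode under a Beta(bH, bT) prior (assumes bH, bT >= 1)."""
    return (aH + bH - 1) / (aH + bH - 1 + aT + bT - 1)

# Hypothetical data: 4 heads, 1 tail; a Beta(3, 3) prior pulls the
# estimate from 0.8 toward 1/2.
mle = theta_mle(4, 1)            # 4/5
map_est = theta_map(4, 1, 3, 3)  # (4 + 3 - 1) / (4 + 3 - 1 + 1 + 3 - 1) = 6/9
```

With more data (larger αH, αT) the prior's pull fades and the two estimates converge.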
[Slide 32]
Conjugate priors
[A. Singh]
[Slide 33]
Dirichlet distribution
• The number of heads in N flips of a two-sided coin
– follows a binomial distribution
– Beta is a good prior (conjugate prior for the binomial)
• What if it's not two-sided, but k-sided?
– follows a multinomial distribution
– the Dirichlet distribution is the conjugate prior
[Slide 34]
Conjugate priors
[A. Singh]
[Slide 35]
Estimating Parameters
• Maximum Likelihood Estimate (MLE): choose θ that maximizes the probability of the observed data
• Maximum a Posteriori (MAP) estimate: choose θ that is most probable given the prior probability and the data
[Slide 36]
Expected values
Given a discrete random variable X, the expected value of X, written E[X], is

$E[X] = \sum_x x \, P(X = x)$

We can also talk about the expected value of functions of X:

$E[f(X)] = \sum_x f(x) \, P(X = x)$
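A minimal numeric sketch of both definitions, using a made-up pmf:

```python
# A made-up pmf for a discrete random variable X: value -> P(X = x).
pmf = {1: 0.2, 2: 0.5, 3: 0.3}

# E[X] = sum over x of x * P(X = x)
E_X = sum(x * p for x, p in pmf.items())          # 2.1

# E[f(X)] for f(x) = x**2
E_X2 = sum(x ** 2 * p for x, p in pmf.items())    # 4.9

# A familiar consequence: Var(X) = E[X^2] - E[X]^2
Var_X = E_X2 - E_X ** 2
```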
[Slide 37]
Covariance
Given two discrete r.v.'s X and Y, we define the covariance of X and Y as

$\mathrm{Cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big] = E[XY] - E[X]\,E[Y]$

e.g., X = gender, Y = playsFootball or X = gender, Y = leftHanded

Remember: $\mathrm{Var}(X) = \mathrm{Cov}(X, X)$
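The covariance definition can be checked numerically against the shortcut E[XY] − E[X]E[Y]; the joint pmf over two binary variables is made up:

```python
# A made-up joint pmf over binary X and Y: (x, y) -> P(X = x, Y = y).
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

E_X = sum(x * p for (x, y), p in joint.items())   # marginal mean of X
E_Y = sum(y * p for (x, y), p in joint.items())   # marginal mean of Y

# Cov(X, Y) = E[(X - E[X]) (Y - E[Y])]
cov = sum((x - E_X) * (y - E_Y) * p for (x, y), p in joint.items())

# Shortcut check: Cov(X, Y) = E[XY] - E[X] E[Y]
E_XY = sum(x * y * p for (x, y), p in joint.items())
```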
[Slide 38]
You should know
• Density estimation and its relation to classification
• Estimating parameters from data
– maximum likelihood estimates
– maximum a posteriori estimates
– distributions: binomial, Beta, Dirichlet, …
– conjugate priors