lecture17 · people.csail.mit.edu/dsontag/courses/ml12/slides/lecture... · 2012-11-14
[Page 1]
Naïve Bayes Lecture 17
David Sontag New York University
Slides adapted from Luke Zettlemoyer, Carlos Guestrin, Dan Klein, and Mehryar Mohri
[Page 2]
Bayesian Learning
• Use Bayes' rule!

$$\underbrace{P(\theta \mid D)}_{\text{Posterior}} = \frac{\overbrace{P(D \mid \theta)}^{\text{Data likelihood}}\;\overbrace{P(\theta)}^{\text{Prior}}}{\underbrace{P(D)}_{\text{Normalization}}}$$

• Or equivalently:

$$P(\theta \mid D) \propto P(D \mid \theta)\,P(\theta)$$

• For uniform priors, $P(\theta) \propto 1$, this reduces to maximum likelihood estimation!

$$\hat{\theta} = \arg\max_\theta \ln P(D \mid \theta)$$

Recap of the Bernoulli MLE (with $\alpha_H$ heads and $\alpha_T$ tails):

$$\frac{d}{d\theta}\ln P(D \mid \theta) = \frac{d}{d\theta}\left[\ln \theta^{\alpha_H}(1-\theta)^{\alpha_T}\right] = \frac{d}{d\theta}\left[\alpha_H \ln\theta + \alpha_T \ln(1-\theta)\right] = \frac{\alpha_H}{\theta} - \frac{\alpha_T}{1-\theta} = 0$$

and of the sample-complexity bound:

$$\delta \ge 2e^{-2N\epsilon^2} \ge P(\text{mistake}), \qquad \ln\delta \ge \ln 2 - 2N\epsilon^2, \qquad N \ge \frac{\ln(2/\delta)}{2\epsilon^2}$$

$$N \ge \frac{\ln(2/0.05)}{2 \times 0.1^2} \approx \frac{3.8}{0.02} = 190$$
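The closed-form MLE and the sample-size bound above can be checked numerically. A minimal sketch, with made-up coin-flip counts (the values of `alpha_H` and `alpha_T` are illustrative, not from the slides):

```python
import math

# Illustrative coin-flip counts (not from the slides).
alpha_H, alpha_T = 72, 28

# Setting alpha_H/theta - alpha_T/(1 - theta) = 0
# gives the Bernoulli MLE theta = alpha_H / (alpha_H + alpha_T).
theta_mle = alpha_H / (alpha_H + alpha_T)
print(theta_mle)  # 0.72

# Hoeffding-style bound: P(|theta_mle - theta*| >= eps) <= 2 exp(-2 N eps^2),
# so N >= ln(2/delta) / (2 eps^2) samples suffice for error eps w.p. 1 - delta.
delta, eps = 0.05, 0.1
N = math.log(2 / delta) / (2 * eps**2)
print(math.ceil(N))  # 185 (the slides round ln(2/0.05) up to 3.8, giving 190)
```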
[Page 3]
Prior + Data → Posterior
• Likelihood function:

$$P(D \mid \theta) = \theta^{\alpha_H}(1-\theta)^{\alpha_T}$$

• Posterior:

$$P(\theta \mid D) \propto P(D \mid \theta)\,P(\theta)$$

With a $\text{Beta}(\beta_H, \beta_T)$ prior:

$$P(\theta \mid D) \propto \theta^{\alpha_H}(1-\theta)^{\alpha_T}\,\theta^{\beta_H-1}(1-\theta)^{\beta_T-1} = \theta^{\alpha_H+\beta_H-1}(1-\theta)^{\alpha_T+\beta_T-1} = \text{Beta}(\alpha_H+\beta_H,\; \alpha_T+\beta_T)$$
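The conjugate update above is just count addition. A minimal sketch, with illustrative prior pseudo-counts and data counts (not from the slides):

```python
# Conjugate Beta-Bernoulli update; all counts are illustrative.
beta_H, beta_T = 3, 3       # prior Beta(beta_H, beta_T)
alpha_H, alpha_T = 72, 28   # observed heads / tails

# Posterior is Beta(alpha_H + beta_H, alpha_T + beta_T): just add counts.
a, b = alpha_H + beta_H, alpha_T + beta_T
print(a, b)  # 75 31

post_mean = a / (a + b)            # posterior mean of theta
post_mode = (a - 1) / (a + b - 2)  # MAP estimate (valid for a, b > 1)
```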
[Page 4]
What about continuous variables?
• Billionaire says: If I am measuring a continuous variable, what can you do for me?
• You say: Let me tell you about Gaussians…
[Page 5]
Some properties of Gaussians
• Affine transformations (multiplying by a scalar and adding a constant) of a Gaussian are Gaussian
  – X ~ N(μ, σ²), Y = aX + b  ⟹  Y ~ N(aμ + b, a²σ²)
• Sum of independent Gaussians is Gaussian
  – X ~ N(μ_X, σ²_X), Y ~ N(μ_Y, σ²_Y), Z = X + Y  ⟹  Z ~ N(μ_X + μ_Y, σ²_X + σ²_Y)
• Easy to differentiate, as we will see soon!
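Both closure properties reduce to simple parameter arithmetic. A minimal sketch, with made-up parameter values:

```python
# Closure properties of Gaussians as parameter arithmetic; values illustrative.
mu, var = 2.0, 4.0        # X ~ N(2, 4)

# Affine transform: Y = aX + b  =>  Y ~ N(a*mu + b, a^2 * var).
a, b = 3.0, 1.0
mu_Y, var_Y = a * mu + b, a**2 * var
print(mu_Y, var_Y)  # 7.0 36.0

# Sum of independent Gaussians: Z = X + W  =>  means and variances add.
mu_W, var_W = -1.0, 9.0   # W ~ N(-1, 9), independent of X
mu_Z, var_Z = mu + mu_W, var + var_W
print(mu_Z, var_Z)  # 1.0 13.0
```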
[Page 6]
Learning a Gaussian
• Collect a bunch of data
  – Hopefully, i.i.d. samples
  – e.g., exam scores
• Learn parameters
  – Mean: μ
  – Variance: σ²

Sample $x_i$, $i = 0, \ldots, 99$:

| i  | Exam Score |
|----|------------|
| 0  | 85         |
| 1  | 95         |
| 2  | 100        |
| 3  | 12         |
| …  | …          |
| 99 | 89         |
[Page 7]
MLE for Gaussian:
• Prob. of i.i.d. samples D = {x₁, …, x_N}:

$$P(D \mid \mu, \sigma) = \prod_{i=1}^{N} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$$

• Log-likelihood of data:

$$\ln P(D \mid \mu, \sigma) = -N\ln\!\left(\sigma\sqrt{2\pi}\right) - \sum_{i=1}^{N}\frac{(x_i-\mu)^2}{2\sigma^2}$$

$$\mu_{MLE},\ \sigma_{MLE} = \arg\max_{\mu,\sigma} P(D \mid \mu, \sigma)$$
[Page 8]
Your second learning algorithm: MLE for mean of a Gaussian
• What's MLE for mean?

$$\mu_{MLE},\ \sigma_{MLE} = \arg\max_{\mu,\sigma} P(D \mid \mu, \sigma)$$

Setting the derivative of the log-likelihood with respect to μ to zero gives:

$$-\sum_{i=1}^{N}\frac{(x_i-\mu)}{\sigma^2} = 0 \;\Longrightarrow\; -\sum_{i=1}^{N} x_i + N\mu = 0 \;\Longrightarrow\; \mu_{MLE} = \frac{1}{N}\sum_{i=1}^{N} x_i$$
[Page 9]
MLE for variance
• Again, set derivative to zero:

$$\frac{d}{d\sigma}\ln P(D \mid \mu,\sigma) = -\frac{N}{\sigma} + \sum_{i=1}^{N}\frac{(x_i-\mu)^2}{\sigma^3} = 0 \;\Longrightarrow\; \sigma^2_{MLE} = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu_{MLE})^2$$
[Page 10]
Learning Gaussian parameters
• MLE:

$$\mu_{MLE} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \sigma^2_{MLE} = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu_{MLE})^2$$

• BTW, the MLE for the variance of a Gaussian is biased
  – Expected result of estimation is not the true parameter!
  – Unbiased variance estimator:

$$\sigma^2_{unbiased} = \frac{1}{N-1}\sum_{i=1}^{N}(x_i-\mu_{MLE})^2$$
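A minimal sketch of these estimators, using only the exam-score rows visible in the earlier table (the full 100-sample dataset is not in the slides):

```python
# Gaussian MLE from i.i.d. samples; only the visible exam scores are used.
scores = [85, 95, 100, 12, 89]
N = len(scores)

mu_mle = sum(scores) / N                                     # (1/N) sum x_i
var_mle = sum((x - mu_mle)**2 for x in scores) / N           # biased: / N
var_unbiased = sum((x - mu_mle)**2 for x in scores) / (N - 1)  # unbiased: / (N-1)

print(mu_mle)  # 76.2
```

Note how the single outlying score (12) inflates both variance estimates; the unbiased estimator is always slightly larger, by the factor N/(N-1).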
[Page 11]
Bayesian learning of Gaussian parameters
• Conjugate priors
  – Mean: Gaussian prior
  – Variance: Wishart distribution
• Prior for mean:
[Page 12]
Mehryar Mohri - Introduction to Machine Learning

Bayesian Prediction
Definition: the expected conditional loss of predicting $\hat{y} \in Y$ is

$$L[\hat{y} \mid x] = \sum_{y \in Y} L(\hat{y}, y)\Pr[y \mid x].$$

Bayesian decision: predict the class minimizing the expected conditional loss, that is

$$\hat{y}^* = \arg\min_{\hat{y}} L[\hat{y} \mid x] = \arg\min_{\hat{y}} \sum_{y \in Y} L(\hat{y}, y)\Pr[y \mid x].$$

• Zero-one loss:

$$\hat{y}^* = \arg\max_{\hat{y}} \Pr[\hat{y} \mid x].$$

This is the Maximum a Posteriori (MAP) principle.
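The decision rule can be sketched directly from these definitions. The posterior values and the asymmetric loss matrix below are hypothetical, chosen to show that under a non-zero-one loss the Bayes decision can differ from the MAP class:

```python
# Bayesian decision rule sketch: hypothetical posterior Pr[y|x] and loss L.
posterior = {"spam": 0.7, "ham": 0.3}

# Asymmetric loss: flagging ham as spam costs more (made-up values).
loss = {("spam", "ham"): 5.0, ("ham", "spam"): 1.0,
        ("spam", "spam"): 0.0, ("ham", "ham"): 0.0}

def expected_loss(y_hat):
    # L[y_hat | x] = sum_y L(y_hat, y) Pr[y | x]
    return sum(loss[(y_hat, y)] * p for y, p in posterior.items())

y_star = min(posterior, key=expected_loss)  # argmin expected conditional loss
y_map = max(posterior, key=posterior.get)   # zero-one loss => MAP class

print(y_star, y_map)  # ham spam
```

Even though "spam" has the higher posterior, the expensive false-spam loss makes "ham" the loss-minimizing prediction.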
[Page 13]
Binary Classification - Illustration

[Figure: the posteriors Pr[y1 | x] and Pr[y2 | x] plotted against x, with values between 0 and 1]
[Page 14]
Maximum a Posteriori (MAP)
Definition: the MAP principle consists of predicting according to the rule

$$\hat{y} = \arg\max_{y \in Y} \Pr[y \mid x].$$

Equivalently, by the Bayes formula:

$$\hat{y} = \arg\max_{y \in Y} \frac{\Pr[x \mid y]\Pr[y]}{\Pr[x]} = \arg\max_{y \in Y} \Pr[x \mid y]\Pr[y].$$

How do we determine $\Pr[x \mid y]$ and $\Pr[y]$? Density estimation problem.
[Page 15]
Density Estimation
Data: sample $x_1, \ldots, x_m \in X$ drawn i.i.d. from set $X$ according to some distribution $D$.
Problem: find distribution $p$ out of a set $P$ that best estimates $D$.
• Note: we will study density estimation specifically in a future lecture.

[Slide from Mehryar Mohri]
[Page 16]
Density estimation
• Can make a parametric assumption, e.g. that Pr(x|y) is a multivariate Gaussian distribution
• When the dimension of x is small enough, can use a non-parametric approach (e.g., kernel density estimation)
[Page 17]
Difficulty of (naively) estimating high-dimensional distributions
• Can we directly estimate the data distribution P(X,Y)?
• How do we represent these? How many parameters?
  – Prior, P(Y):
    • Suppose Y is composed of k classes
  – Likelihood, P(X|Y):
    • Suppose X is composed of n binary features
• Complex model → High variance with limited data!!!
[Page 18]
Conditional Independence
• X is conditionally independent of Y given Z if the probability distribution for X is independent of the value of Y, given the value of Z:

$$P(X \mid Y, Z) = P(X \mid Z)$$

• e.g.,
• Equivalent to:

$$P(X, Y \mid Z) = P(X \mid Z)\,P(Y \mid Z)$$
[Page 19]
Naïve Bayes
• Naïve Bayes assumption:
  – Features are independent given class:

$$P(X_1, X_2 \mid Y) = P(X_1 \mid Y)\,P(X_2 \mid Y)$$

  – More generally:

$$P(X_1, \ldots, X_n \mid Y) = \prod_{i=1}^{n} P(X_i \mid Y)$$

• How many parameters now?
  • Suppose X is composed of n binary features
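The slide poses the parameter counts as questions; the arithmetic below gives the standard answers (filled in here, not stated on the slide), for k classes and n binary features:

```python
# Parameter counting: full joint model of P(X|Y) vs. the Naive Bayes model.
k, n = 2, 30  # illustrative: 2 classes, 30 binary features

prior = k - 1                      # P(Y): k probabilities summing to 1
full_likelihood = k * (2**n - 1)   # P(X|Y): a full table of size 2^n per class
nb_likelihood = k * n              # one Bernoulli P(X_i = 1 | Y = y) per feature/class

print(prior + full_likelihood)  # 2147483647 -- infeasible to estimate
print(prior + nb_likelihood)    # 61
```

The exponential gap is the whole point: the conditional-independence assumption shrinks the likelihood model from k(2^n - 1) parameters to kn.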
[Page 20]
The Naïve Bayes Classifier
• Given:
  – Prior P(Y)
  – n conditionally independent features X given the class Y
  – For each Xi, we have the likelihood P(Xi|Y)
• Decision rule:

$$y^* = \arg\max_{y} P(y) \prod_{i=1}^{n} P(x_i \mid y)$$

If the conditional-independence assumptions hold, NB is the optimal classifier! (They typically don't.)

[Graphical model: class node Y with an arrow to each feature X1, X2, …, Xn]
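The full pipeline can be sketched in a few lines. The toy spam data below is made up, and add-one (Laplace) smoothing is included to avoid zero-probability features, a common addition not stated on this slide:

```python
from collections import defaultdict

# Minimal Naive Bayes for binary features; toy data is illustrative.
def train(X, y, n_features):
    classes = set(y)
    prior = {c: y.count(c) / len(y) for c in classes}
    cond = defaultdict(dict)  # cond[c][i] = P(X_i = 1 | Y = c), add-one smoothed
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        for i in range(n_features):
            cond[c][i] = (sum(r[i] for r in rows) + 1) / (len(rows) + 2)
    return prior, cond

def predict(x, prior, cond):
    # Decision rule: argmax_y P(y) * prod_i P(x_i | y)
    def score(c):
        p = prior[c]
        for i, v in enumerate(x):
            p *= cond[c][i] if v else (1 - cond[c][i])
        return p
    return max(prior, key=score)

X = [(1, 1, 0), (1, 0, 1), (0, 0, 0), (0, 1, 0)]
y = ["spam", "spam", "ham", "ham"]
prior, cond = train(X, y, 3)
print(predict((1, 1, 1), prior, cond))  # spam
```

In practice the product of per-feature probabilities is computed as a sum of logs to avoid underflow when n is large.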