Naïve Bayes - HLTRI (vgogate/ml/2015s/lectures/NB-lec5.pdf)
Naïve Bayes
Vibhav Gogate
The University of Texas at Dallas
Supervised Learning of Classifiers: Find f
• Given: Training set {(xi, yi) | i = 1 … n}
• Find: A good approximation to f : X → Y

Examples of classification: what are X and Y?
• Spam Detection
– Map email to {Spam, Ham}
• Digit recognition
– Map pixels to {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
• Stock Prediction
– Map new, historic prices, etc. to ℝ (the real numbers)
Bayesian Categorization/Classification
• Let the set of categories be {c1, c2,…cn}
• Let E be description of an instance.
• Determine category of E by determining for each ci
• P(E) can be ignored (normalization constant)
• Select the class with the max. probability.
P(ci | E) = P(ci) P(E | ci) / P(E)

P(ci | E) ∝ P(ci) P(E | ci)
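The selection rule above can be sketched in a few lines; the two classes and the prior/likelihood numbers below are hypothetical, not from the lecture:

```python
# A minimal sketch of Bayes-rule classification with made-up numbers:
# two classes, and P(E | c_i) given for one observed instance E.
priors = {"spam": 0.4, "ham": 0.6}          # P(c_i), hypothetical values
likelihoods = {"spam": 0.05, "ham": 0.001}  # P(E | c_i), hypothetical values

# Unnormalized posteriors P(c_i) * P(E | c_i); P(E) cancels in the argmax.
scores = {c: priors[c] * likelihoods[c] for c in priors}
prediction = max(scores, key=scores.get)    # class with max posterior

# Normalizing recovers the true posterior P(c_i | E) when it is needed.
total = sum(scores.values())
posteriors = {c: s / total for c, s in scores.items()}
```

Because P(E) is the same for every class, it can be ignored for the argmax and restored only when a calibrated posterior is wanted.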
Text classification
• Classify e-mails
– Y = {Spam,NotSpam}
• Classify news articles
– Y = {what is the topic of the article?}
• Classify webpages
– Y = {Student, professor, project, …}
• What to use for features, X?
Features: X is the sequence of words in the document; Xi is the ith word in the article.
Features for Text Classification
• X is sequence of words in document
• X (and hence P(X|Y)) is huge!!!
– An article has at least 1000 words: X = {X1, …, X1000}
– Xi represents the ith word in the document, i.e., the domain of Xi is the entire vocabulary, e.g., Webster's Dictionary (or more): 10,000 words, etc.
• 10,000^1000 = 10^4000 possible word sequences
• Atoms in the universe: about 10^80
– We may have a problem…
Bag of Words Model
• Typical additional assumption: position in the document doesn't matter
– P(Xi = xi | Y = y) = P(Xk = xi | Y = y) (all positions have the same distribution)
– Ignore the order of words
– Sounds really silly, but often works very well!
• From now on:
– Xi = Boolean: "word i is in the document"
– X = X1 … Xn
Bag of Words Approach
aardvark 0
about 2
all 2
Africa 1
apple 0
anxious 0
...
gas 1
...
oil 1
…
Zaire 0
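A word-count table like the one above takes only a few lines to build; the short vocabulary and document below are hypothetical stand-ins, a sketch of bag-of-words featurization:

```python
from collections import Counter

# Sketch: turning raw text into bag-of-words features, as on the slide.
# The vocabulary and document here are hypothetical; a real vocabulary
# comes from the training corpus.
vocab = ["aardvark", "about", "all", "africa", "apple", "gas", "oil", "zaire"]
doc = "all about oil all about gas in africa"

counts = Counter(doc.split())                    # word -> count, order ignored
count_vec = [counts[w] for w in vocab]           # count-style features
bool_vec = [int(counts[w] > 0) for w in vocab]   # Boolean "word i in document"
```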
Bayesian Categorization
• Need to know:
– Priors: P(yi)
– Conditionals: P(X | yi)
• P(yi) is easily estimated from data:
– If ni of the examples in D are in class yi, then P(yi) = ni / |D|
• Conditionals:
– X = X1 … Xn
– Estimate P(X1 … Xn | yi)
• Too many possible instances to estimate! (exponential in n)
– Even with the bag-of-words assumption!

P(yi | X) ∝ P(yi) P(X | yi)
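The prior estimate P(yi) = ni / |D| is a one-liner; the label list below is a hypothetical toy dataset:

```python
from collections import Counter

# Sketch of the prior estimate P(y_i) = n_i / |D| from labeled data.
labels = ["spam", "ham", "ham", "spam", "ham"]   # hypothetical training labels
n = Counter(labels)                              # n_i per class
priors = {y: n[y] / len(labels) for y in n}      # P(y_i)
```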
Need to Simplify Somehow
• Too many probabilities:
– P(x1, x2, x3 | yi)
• Can we assume some are the same?
– P(x1, x2 | yi) = P(x1 | yi) P(x2 | yi)

P(x1, x2, x3 | spam), P(¬x1, x2, x3 | spam), P(x1, ¬x2, x3 | spam), …, P(¬x1, ¬x2, ¬x3 | spam): one entry for every combination of feature values
Conditional Independence
• X is conditionally independent of Y given Z if the probability distribution of X is independent of the value of Y, given the value of Z:
– ∀x, y, z: P(X = x | Y = y, Z = z) = P(X = x | Z = z)
• e.g., P(Thunder | Rain, Lightning) = P(Thunder | Lightning)
• Equivalent to: P(X, Y | Z) = P(X | Z) P(Y | Z)
Naïve Bayes
• Naïve Bayes assumption:
– Features are independent given the class: P(X1, X2 | Y) = P(X1 | Y) P(X2 | Y)
– More generally: P(X1, …, Xn | Y) = ∏i P(Xi | Y)
• How many parameters now?
– Suppose X is composed of n binary features: n parameters per class instead of 2^n − 1
The Naïve Bayes Classifier
• Given:
– Prior P(Y)
– n conditionally independent features X given the class Y
– For each Xi, we have the likelihood P(Xi | Y)
• Decision rule: y* = argmax_y P(y) ∏i P(xi | y)
• Graphical model: a class node Y with an edge to each feature node X1, X2, …, Xn
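A minimal sketch of this decision rule for Boolean features; the priors and per-feature likelihoods below are hypothetical numbers:

```python
# Sketch of the NB decision rule: y* = argmax_y P(y) * prod_i P(x_i | y),
# for two classes and three Boolean features. All probabilities are
# hypothetical, chosen only for illustration.
priors = {"spam": 0.4, "ham": 0.6}
# P(X_i = 1 | y) for each feature i; P(X_i = 0 | y) is 1 minus that value.
p_true = {"spam": [0.8, 0.1, 0.7], "ham": [0.2, 0.3, 0.4]}

def nb_score(y, x):
    """Unnormalized P(y) * prod_i P(x_i | y) for feature vector x."""
    score = priors[y]
    for p, xi in zip(p_true[y], x):
        score *= p if xi else (1 - p)
    return score

x = (1, 0, 1)
prediction = max(priors, key=lambda y: nb_score(y, x))
```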
MLE for the parameters of NB
• Given a dataset, count occurrences for all pairs:
– Count(Xi = xi, Y = y): how many such pairs are there?
• MLE for discrete NB, simply:
– Prior: P(y) = Count(Y = y) / (number of examples)
– Likelihood: P(Xi = xi | y) = Count(Xi = xi, Y = y) / Count(Y = y)
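The count-and-divide MLE can be sketched as follows, on a hypothetical four-example dataset with two Boolean features:

```python
from collections import Counter

# Sketch of the MLE for discrete NB: count (x_i, y) pairs and divide.
# Tiny hypothetical dataset: ((x1, x2), label) with Boolean features.
data = [((1, 0), "spam"), ((1, 1), "spam"), ((0, 1), "ham"), ((0, 0), "ham")]

label_counts = Counter(y for _, y in data)                # Count(Y = y)
prior = {y: c / len(data) for y, c in label_counts.items()}

# Count(X_i = 1, Y = y) for each feature index i.
ones = Counter()
for x, y in data:
    for i, xi in enumerate(x):
        if xi:
            ones[(i, y)] += 1

# Likelihood P(X_i = 1 | y) = Count(X_i = 1, Y = y) / Count(Y = y).
likelihood = {(i, y): ones[(i, y)] / label_counts[y]
              for i in range(2) for y in label_counts}
```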
NAÏVE BAYES CALCULATIONS
Subtleties of NB Classifier: #1 Violating the NB Assumption
• Usually, features are not conditionally independent:
– P(X1, …, Xn | Y) ≠ ∏i P(Xi | Y)
• The naïve Bayes assumption is often violated, yet it performs surprisingly well in many cases.
• Plausible reason: we only need the probability of the correct class to be the largest!
– Example: in two-way classification, we just need to land on the correct side of 0.5, not get the actual probability right (0.51 is the same as 0.99).
Subtleties of NB Classifier: #2 Insufficient Training Data
• What if you never see a training instance where X1 = a and Y = b?
– e.g., you never saw Y = Spam together with X1 = "Enlargement"
– Then the MLE gives P(X1 = Enlargement | Y = Spam) = 0
• Thus, no matter what values X2, X3, …, Xn take:
– P(X1 = Enlargement, X2 = a2, …, Xn = an | Y = Spam) = 0
– Why? A single zero factor zeroes out the entire product ∏i P(Xi | Y).
For Binary Features: We already know the answer!
• MAP: use the most likely parameter
• A Beta prior is equivalent to extra observations for each feature
• As N → ∞, the prior is "forgotten"
• But for small sample sizes, the prior is important!
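A sketch of the MAP estimate under a Beta(a, b) prior, which behaves like a − 1 extra "true" and b − 1 extra "false" observations; the hyperparameters and counts below are hypothetical:

```python
# Sketch: MAP estimate for a Boolean feature under a Beta(a, b) prior.
# The prior acts like (a - 1) extra "true" and (b - 1) extra "false"
# observations; the values of a, b and the counts are hypothetical.
a, b = 3, 3            # Beta hyperparameters (prior strength)
n_true, n = 0, 4       # observed: feature never seen true with this class

mle = n_true / n                               # 0.0 -> zeroes out the product
map_est = (n_true + a - 1) / (n + a + b - 2)   # stays strictly positive
```

As n grows with the hyperparameters fixed, map_est approaches the MLE, matching the "prior is forgotten" bullet above.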
That’s Great for Binomial
• Works for Spam / Ham
• What about multiple classes?
– E.g., given a Wikipedia page, predicting its type
Multinomials: Laplace Smoothing
• Laplace’s estimate: – Pretend you saw every outcome k
extra times
– What’s Laplace with k = 0?
– k is the strength of the prior
– Can derive this as a MAP estimate for multinomial with Dirichlet priors
• Laplace for conditionals: – Smooth each condition
independently:
H H T
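Laplace's estimate on the coin data H H T shown above can be sketched as:

```python
from collections import Counter

# Sketch of Laplace's estimate on the slide's coin data H H T:
# P(x) = (count(x) + k) / (N + k * |outcomes|), with prior strength k.
data = ["H", "H", "T"]
outcomes = ["H", "T"]
counts = Counter(data)

def laplace(x, k):
    return (counts[x] + k) / (len(data) + k * len(outcomes))

p_mle = laplace("H", 0)   # k = 0 recovers the MLE: 2/3
p_lap = laplace("H", 1)   # k = 1: (2 + 1) / (3 + 2) = 3/5
```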
Probabilities: Important Detail!
Any more potential problems here?
• P(spam | X1 … Xn) ∝ P(spam) ∏i P(Xi | spam)
• We are multiplying lots of small numbers: danger of underflow!
– 0.5^57 ≈ 7 × 10^−18
• Solution? Use logs and add!
– p1 · p2 = e^(log(p1) + log(p2))
– Always keep probabilities in log form
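A quick sketch of the underflow problem and the log-space fix:

```python
import math

# Sketch of the log-space trick: sums of logs instead of products.
probs = [0.5] * 57                        # 57 factors of 0.5, as on the slide
naive = math.prod(probs)                  # ~7e-18 already; longer docs underflow
log_score = sum(math.log(p) for p in probs)

# Compare classes by log-score directly; exponentiate only if a
# probability (rather than an argmax) is actually needed.
recovered = math.exp(log_score)
```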
NB for Text Classification: Learning
• Learning phase: estimate the prior P(Ym) and the likelihoods P(Xi | Ym) from the training documents
NB for Text Classification: Classification
• Given a new document of length L, predict y* = argmax_y P(y) ∏_{i=1..L} P(xi | y)
Example: (Borrowed from Dan Jurafsky)
Bayesian Learning
What if Features are Continuous?

Posterior ∝ Data Likelihood × Prior:
P(Y | X) ∝ P(X | Y) P(Y)

E.g., Character Recognition: Xi is the ith pixel
Bayesian Learning
What if Features are Continuous?
E.g., Character Recognition: Xi is the ith pixel

P(Xi = x | Y = yk) = N(μik, σik²)
P(Y | X) ∝ P(X | Y) P(Y)
N(μik, σik²) = (1 / (σik √(2π))) e^(−(x − μik)² / (2σik²))
Gaussian Naïve Bayes
P(Xi = x | Y = yk) = N(μik, σik²)
P(Y | X) ∝ P(X | Y) P(Y)
N(μik, σik²) = (1 / (σik √(2π))) e^(−(x − μik)² / (2σik²))

• Sometimes assume that the variance:
– is independent of Y (i.e., σi),
– or independent of Xi (i.e., σk),
– or both (i.e., σ)
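The Gaussian class-conditional can be sketched directly from its density; the means and shared variance below are hypothetical:

```python
import math

# Sketch of the Gaussian class-conditional N(mu_ik, sigma_ik^2) used in
# Gaussian NB; the mu and sigma values below are hypothetical.
def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# P(X_i = x | Y = y_k) for one pixel feature under two classes,
# here with a variance independent of both Y and X_i (shared sigma):
p_given_0 = gaussian_pdf(0.2, mu=0.1, sigma=0.3)   # class y_0
p_given_1 = gaussian_pdf(0.2, mu=0.9, sigma=0.3)   # class y_1
```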
Learning Gaussian Parameters
Maximum Likelihood Estimates:
• Mean: μ̂ik = Σj δ(yj = yk) xij / Σj δ(yj = yk)
• Variance: σ̂ik² = Σj δ(yj = yk) (xij − μ̂ik)² / Σj δ(yj = yk)
(xij is the value of Xi in the jth training example; δ(x) = 1 if x is true, else 0)
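These estimates amount to a per-class sample mean and (biased, 1/N) sample variance; the (xi, y) pairs below are hypothetical:

```python
# Sketch of the MLE for Gaussian NB parameters: the per-class sample mean
# and variance of one feature X_i, with delta(y_j = y_k) selecting class k.
data = [(0.1, 0), (0.3, 0), (0.8, 1), (1.0, 1)]   # hypothetical (x_i, y) pairs

def mle_params(k):
    xs = [x for x, y in data if y == k]             # delta picks out class k
    mu = sum(xs) / len(xs)                          # MLE mean
    var = sum((x - mu) ** 2 for x in xs) / len(xs)  # MLE (biased, 1/N) variance
    return mu, var

mu0, var0 = mle_params(0)
```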
What you need to know about Naïve Bayes
• Naïve Bayes classifier
– What's the assumption?
– Why do we use it?
– How do we learn it?
– Why is Bayesian estimation important?
• Text classification
– Bag-of-words model
• Gaussian NB
– Features are still conditionally independent
– Each feature has a Gaussian distribution given the class