Naive Bayes Applied Machine Learning - cs.mcgill.ca
Transcript of Naive Bayes Applied Machine Learning - cs.mcgill.ca
Applied Machine Learning: Naive Bayes
Siamak Ravanbakhsh
COMP 551 (Winter 2020)
Learning objectives
- generative vs. discriminative classifiers
- Naive Bayes classifier: its assumption, different design choices
so far we modeled the conditional distribution p(y ∣ x) (discriminative)
now we learn the joint distribution p(y, x) = p(y) p(x ∣ y) (generative)
how to classify new input x?
Bayes rule:
p(y = c ∣ x) = p(c) p(x ∣ c) / p(x), where p(x) = ∑_{c'=1}^C p(x, c')
- p(y = c ∣ x): posterior probability of a given class
- p(c): prior class probability, the frequency of observing this label
- p(x ∣ c): likelihood of the input features given the class label (input features for each label come from a different distribution)
- p(x) = ∑_{c'=1}^C p(x, c'): marginal probability of the input (evidence)
[figure: class-conditional densities p(x ∣ y = 0) and p(x ∣ y = 1) over x; image: https://rpsychologist.com]
Discriminative vs generative classification
Example: Bayes rule for classification
is the patient having cancer? y ∈ {yes, no}
x ∈ {−, +}: test result, a single binary feature
p(c ∣ x) = p(c) p(x ∣ c) / p(x)
prior: 1% of the population has cancer, p(yes) = .01
likelihood: p(+ ∣ yes) = .9, the TP rate of the test (90%); p(+ ∣ no) = .05, the FP rate of the test (5%)
evidence: p(+) = p(yes) p(+ ∣ yes) + p(no) p(+ ∣ no) = .01 × .9 + .99 × .05 = .0585
posterior: p(yes ∣ +) = .01 × .9 / .0585 ≈ .15
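As a quick arithmetic check, a minimal Python sketch using the numbers above:

prior = 0.01                                        # p(yes)
tp_rate, fp_rate = 0.9, 0.05                        # p(+ | yes), p(+ | no)
evidence = prior * tp_rate + (1 - prior) * fp_rate  # p(+) = 0.0585
posterior = prior * tp_rate / evidence              # p(yes | +), about 0.15
print(evidence, posterior)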
in a generative classifier likelihood & prior class probabilities are learned from data
p(y = c ∣ x) = p(c) p(x ∣ c) / p(x), with p(x) = ∑_{c'=1}^C p(x, c')
- posterior probability of a given class
- prior class probability: frequency of observing this label
- likelihood of the input features given the class label (input features for each label come from a different distribution)
- marginal probability of the input (evidence)
Generative classification
image: https://rpsychologist.com
Some generative classifiers:
Gaussian Discriminant Analysis: the likelihood is multivariate Gaussian
Naive Bayes: decomposed likelihood
Naive Bayes: model
assumption about the likelihood: p(x ∣ y) = ∏_{d=1}^D p(x_d ∣ y), where D is the number of input features
when is this assumption correct? when the features are conditionally independent given the label: x_i ⊥ x_j ∣ y
knowing the label, the value of one input feature gives us no information about the other input features
chain rule of probability (true for any distribution):
p(x ∣ y) = p(x_1 ∣ y) p(x_2 ∣ y, x_1) p(x_3 ∣ y, x_1, x_2) ⋯ p(x_D ∣ y, x_1, …, x_{D−1})
conditional independence assumption: x_1, x_2 give no extra information, so p(x_3 ∣ y, x_1, x_2) = p(x_3 ∣ y), and similarly for the other factors
Naive Bayes: objective
given the training dataset D = {(x^(1), y^(1)), …, (x^(N), y^(N))}
maximize the joint likelihood (contrast with logistic regression):
ℓ(u, w) = ∑_n log p_{u,w}(x^(n), y^(n))
        = ∑_n log p_u(y^(n)) + ∑_n log p_w(x^(n) ∣ y^(n))
        = ∑_n log p_u(y^(n)) + ∑_d ∑_n log p_{w[d]}(x_d^(n) ∣ y^(n))    (using the naive Bayes assumption)
separate MLE estimates for each part
Naive Bayes: train-test
given the training dataset D = {(x^(1), y^(1)), …, (x^(N), y^(N))}
training time:
- learn the prior class probabilities p_u(y)
- learn the likelihood components p_{w[d]}(x_d ∣ y) for all d
test time: find the posterior class probabilities
arg max_c p(c ∣ x) = arg max_c [ p_u(c) ∏_{d=1}^D p_{w[d]}(x_d ∣ c) ] / [ ∑_{c'=1}^C p_u(c') ∏_{d=1}^D p_{w[d]}(x_d ∣ c') ]
Class prior
binary classification: Bernoulli distribution p_u(y) = u^y (1 − u)^{1−y}
maximizing the log-likelihood:
ℓ(u) = ∑_{n=1}^N y^(n) log(u) + (1 − y^(n)) log(1 − u)
     = N_1 log(u) + (N − N_1) log(1 − u)
where N_1 is the frequency of class 1 in the dataset and N − N_1 is the frequency of class 0
setting its derivative to zero:
dℓ/du = N_1/u − (N − N_1)/(1 − u) = 0  ⇒  u* = N_1/N
the maximum-likelihood estimate (MLE) is the frequency of the class labels
multiclass classification: categorical distribution p_u(y) = ∏_{c=1}^C u_c^{y_c}
assuming one-hot coding for the labels, u = [u_1, …, u_C] is now a parameter vector
maximizing the log-likelihood ℓ(u) = ∑_n ∑_c y_c^(n) log(u_c), subject to ∑_c u_c = 1
closed form for the optimal parameter: u* = [N_1/N, …, N_C/N]
where N_c is the number of instances in class c and N is the number of all instances in the dataset
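A minimal numpy sketch of this closed-form estimate, assuming one-hot labels y of shape N x C (the array values here are only illustrative):

import numpy as np

y = np.array([[1,0,0], [0,1,0], [1,0,0], [0,0,1]])  # one-hot labels, N = 4, C = 3
u = y.mean(axis=0)                                  # u* = [N_1/N, ..., N_C/N]
print(u)                                            # [0.5  0.25 0.25]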
the choice of likelihood distribution depends on the type of features (the likelihood encodes our assumption about the "generative process")
Bernoulli: binary features
Categorical: categorical features
Gaussian: continuous features
...
Likelihood terms
w[d]* = arg max_{w[d]} ∑_{n=1}^N log p_{w[d]}(x_d^(n) ∣ y^(n))
each feature x_d may use a different likelihood
separate max-likelihood estimates for each feature
note that these class-conditionals are different from the choice of distribution for the class prior
binary features: likelihood is Bernoulli
Bernoulli Naive Bayes
max-likelihood estimation is similar to what we saw for the prior
p_{w[d]}(x_d ∣ y = 0) = Bernoulli(x_d; w_{[d],0})
p_{w[d]}(x_d ∣ y = 1) = Bernoulli(x_d; w_{[d],1})
short form: p_{w[d]}(x_d ∣ y) = Bernoulli(x_d; w_{[d],y}), one parameter per label
closed-form solution of the MLE: w*_{[d],c} = N(y = c, x_d = 1) / N(y = c)
where N(⋅) is the number of training instances satisfying the condition
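A minimal numpy sketch of this estimate, assuming a binary feature matrix X (N x D) and one-hot labels y (N x C); the function name is illustrative, and in practice a small pseudo-count is often added to avoid zero probabilities:

import numpy as np

def bernoulli_mle(X, y):
    counts = X.T @ y        # D x C: N(y = c, x_d = 1) for each feature and class
    N_c = y.sum(axis=0)     # C: N(y = c)
    return counts / N_c     # D x C: w*_{[d],c}

These per-feature estimates, together with the class prior y.mean(0), are the kind of inputs expected by the prediction code shown on the next slide.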
Example: Bernoulli Naive Bayes
using naive Bayes for document classification:
2 classes (document types)
600 binary features: x_d^(n) = 1 if word d is present in document n (vocabulary of 600)
[figure: estimated likelihoods w*_{[d],0} and w*_{[d],1} of the words in the two document types]
import numpy as np

def BernoulliNaiveBayes(
    prior,       # vector of size 2: class prior
    likelihood,  # 600 x 2: likelihood of each word under each class
    x,           # vector of size 600: binary features for a new document
    ):
    # per-word Bernoulli log-likelihood: x_d log(w) + (1 - x_d) log(1 - w)
    log_p = np.log(prior) + np.sum(np.log(likelihood) * x[:,None], 0) \
            + np.sum(np.log(1 - likelihood) * (1 - x[:,None]), 0)
    log_p -= np.max(log_p)           # numerical stability
    posterior = np.exp(log_p)        # vector of size 2
    posterior /= np.sum(posterior)   # normalize
    return posterior                 # posterior class probabilities
Multinomial Naive Bayes
what if we wanted to use word frequencies in document classification?
Multinomial likelihood: p_w(x ∣ c) = [ (∑_d x_d)! / ∏_{d=1}^D x_d! ] ∏_{d=1}^D w_{d,c}^{x_d}
where x_d^(n) is the number of times word d appears in document n
we have a vector of size D for each class: C × D parameters
MLE estimates: w*_{d,c} = ∑_n x_d^(n) y_c^(n) / ∑_n ∑_{d'} x_{d'}^(n) y_c^(n)
numerator: count of word d in all documents labelled c
denominator: total word count in all documents labelled c
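A minimal numpy sketch of this estimate, assuming a word-count matrix X (N x D) and one-hot labels y (N x C); Laplace smoothing (adding 1 to every count) is a common addition but is not shown on the slide:

import numpy as np

def multinomial_mle(X, y):
    word_counts = X.T @ y               # D x C: count of word d in documents of class c
    total_counts = word_counts.sum(0)   # C: total word count per class
    return word_counts / total_counts   # D x C: w*_{d,c}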
Gaussian Naive Bayes
Gaussian likelihood terms:
p_{w[d]}(x_d ∣ y) = N(x_d; μ_{d,y}, σ²_{d,y}) = (1 / √(2π σ²_{d,y})) exp( −(x_d − μ_{d,y})² / (2 σ²_{d,y}) )
w[d] = (μ_{d,1}, σ_{d,1}, …, μ_{d,C}, σ_{d,C})
one mean and std. parameter for each class-feature pair
writing the log-likelihood and setting its derivative to zero, we get the maximum likelihood estimates:
μ_{d,c} = (1/N_c) ∑_{n=1}^N y_c^(n) x_d^(n)
σ²_{d,c} = (1/N_c) ∑_{n=1}^N y_c^(n) (x_d^(n) − μ_{d,c})²
the empirical mean & std of feature x_d across instances with label c
Example: Gaussian Naive Bayes
classification on Iris flowers dataset:
N_c = 50 samples with D = 4 features, for each of C = 3 species of Iris flower
(a classic dataset originally used by Fisher)
our setting: 3 classes, 2 features (sepal width, petal length)
Example: Gaussian Naive Bayes
categorical class prior & Gaussian likelihood
import numpy as np

def GaussianNaiveBayes(
    X,      # N x D
    y,      # N x C (one-hot)
    Xtest,  # N_test x D
    ):
    N, C = y.shape
    D = X.shape[1]
    mu, s = np.zeros((C,D)), np.zeros((C,D))
    for c in range(C):                        # calculate mean and std per class
        inds = np.nonzero(y[:,c])[0]
        mu[c,:] = np.mean(X[inds,:], 0)
        s[c,:] = np.std(X[inds,:], 0)
    log_prior = np.log(np.mean(y, 0))[:,None]                             # C x 1
    log_likelihood = - np.sum(np.log(s[:,None,:])
        + .5*(((Xtest[None,:,:] - mu[:,None,:])/s[:,None,:])**2), 2)      # C x N_test
    return log_prior + log_likelihood         # C x N_test (unnormalized log-posterior)
decision boundaries are not linear!
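A possible usage sketch on the Iris setting above, assuming scikit-learn is available to load the data (columns 1 and 2 of the Iris feature matrix are sepal width and petal length):

import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data[:, [1, 2]]                  # sepal width, petal length
Y = np.eye(3)[iris.target]                # one-hot labels, N x C
log_post = GaussianNaiveBayes(X, Y, X)    # C x N unnormalized log-posterior
y_hat = np.argmax(log_post, axis=0)       # predicted class per instance
print(np.mean(y_hat == iris.target))      # training accuracy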
Example: Gaussian Naive Bayes
categorical class prior & Gaussian likelihood
posterior class probability for c=1
Example: Gaussian Naive Bayes
using the same variance for all classes
the per-class std no longer needs to be estimated, and only the log-likelihood line of the function above changes (the shared variance drops out of the arg max):
    log_likelihood = - np.sum(.5*((Xtest[None,:,:] - mu[:,None,:])**2), 2)
the decision boundaries are now linear; the value of the shared variance does not make a difference
Decision boundary in generative classifiers
decision boundaries: points where two classes have the same probability, p(y ∣ x) = p(y' ∣ x)
which means
log [ p(y = c ∣ x) / p(y = c' ∣ x) ] = log [ p(c) p(x ∣ c) / (p(c') p(x ∣ c')) ] = log [ p(c) / p(c') ] + log [ p(x ∣ c) / p(x ∣ c') ] = 0
the first term is not a function of x (ignore)
the second ratio is linear (in some bases) for a large family of probabilities (called the linear exponential family):
p(x ∣ c) = exp( w_c^T ϕ(x) ) / Z(w_c)
log [ p(x ∣ c) / p(x ∣ c') ] = (w_c − w_{c'})^T ϕ(x) + g(w_c, w_{c'})
linear using some bases ϕ(x); g(w_c, w_{c'}) is not a function of x
e.g., Bernoulli is a member of this family with ϕ(x) = x, so Bernoulli Naive Bayes has a linear decision boundary
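To see why the Bernoulli case is linear, a quick sketch (not from the slides): with Bernoulli class-conditionals,
log p(x ∣ c) = ∑_d [ x_d log w_{d,c} + (1 − x_d) log(1 − w_{d,c}) ]
so
log [ p(x ∣ c) / p(x ∣ c') ] = ∑_d x_d [ log(w_{d,c} / w_{d,c'}) − log((1 − w_{d,c}) / (1 − w_{d,c'})) ] + ∑_d log((1 − w_{d,c}) / (1 − w_{d,c'}))
which is linear in x, matching ϕ(x) = x.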
Discriminative vs generative classification
discriminative: maximize the conditional likelihood p(y ∣ x)
- makes no assumption about p(x)
- often works better on larger datasets
generative: maximize the joint likelihood p(y, x) = p(y) p(x ∣ y)
- makes assumptions about p(x)
- can deal with missing values
- can learn from unlabelled data
- often works better on smaller datasets
Discriminative vs generative classification
Example: naive Bayes vs logistic regression on UCI datasets (m is the number of instances)
[figure: naive Bayes vs logistic regression as a function of m; from Ng & Jordan 2001]
Summary
generative classification:
- learn the class prior and likelihood
- Bayes rule for the conditional class probability
Naive Bayes:
- assumes conditional independence, e.g., word appearances are independent of each other given the document type
- class prior: Bernoulli or Categorical
- likelihood: Bernoulli, Gaussian, Categorical, ...
- MLE has a closed form (in contrast to logistic regression), estimated separately for each feature and each label
evaluation measures for classification accuracy
Measuring performance
a side note on measuring the performance of classifiers
binary classification: we use the confusion matrix
count the combinations of the true label y and the prediction ŷ
[table: 2 x 2 confusion matrix with Positive and Negative classes]
Measuring performance (binary classification)
marginals:
P = TP + FN, N = FP + TN
RP = TP + FP, RN = FN + TN
Accuracy = (TP + TN) / (P + N)
Error rate = (FP + FN) / (P + N)
Recall = TP / P
Precision = TP / RP
F1 score = 2 (Precision × Recall) / (Precision + Recall)   (harmonic mean of precision and recall)
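A minimal numpy sketch computing the confusion-matrix counts and the metrics above from binary labels y and predictions y_hat (names are illustrative):

import numpy as np

def binary_metrics(y, y_hat):
    TP = np.sum((y == 1) & (y_hat == 1))
    FP = np.sum((y == 0) & (y_hat == 1))
    FN = np.sum((y == 1) & (y_hat == 0))
    TN = np.sum((y == 0) & (y_hat == 0))
    P, N, RP = TP + FN, FP + TN, TP + FP
    accuracy  = (TP + TN) / (P + N)
    recall    = TP / P
    precision = TP / RP
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1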
Measuring performance (binary classification)
use the confusion matrix to quantify different metrics
common: Accuracy = (TP + TN) / (P + N), Recall = TP / P, Precision = TP / RP, F1 score = 2 (Precision × Recall) / (Precision + Recall)
less common:
Miss rate = FN / P
Fallout = FP / N
Selectivity = TN / N
False discovery rate = FP / RP
False omission rate = FN / RN
Negative predictive value = TN / RN
Threshold invariant: ROC & AUC
ROC as a function of the threshold
TPR = TP/P (recall, sensitivity)
FPR = FP/N (fallout, false alarm)
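A minimal numpy sketch of the ROC curve and its area (AUC) from real-valued scores s and binary labels y; this is only a sketch (ties in the scores are not handled carefully), and libraries such as scikit-learn provide sklearn.metrics.roc_curve and roc_auc_score:

import numpy as np

def roc_points(y, s):
    order = np.argsort(-s)                                          # sort by decreasing score
    y = y[order]
    tpr = np.concatenate(([0.], np.cumsum(y) / np.sum(y)))          # TP / P as the threshold is lowered
    fpr = np.concatenate(([0.], np.cumsum(1 - y) / np.sum(1 - y)))  # FP / N
    return fpr, tpr

def auc(fpr, tpr):
    return np.trapz(tpr, fpr)    # area under the ROC curve (trapezoidal rule)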