Gaussian Naive Bayes Classifier & Logistic Regression

Bayesian Learning

Gaussian Naive Bayes Classifier
Bayes Theorem - Example

Sample space for events A and B (seven equally likely outcomes):

A holds:  T T F F T F T
B holds:  T F T F T F F

P(A) = 4/7    P(B) = 3/7    P(B|A) = 2/4    P(A|B) = 2/3

Is Bayes theorem correct?

P(B|A) = P(A|B) P(B) / P(A) = ( (2/3) * (3/7) ) / (4/7) = 2/4    CORRECT
P(A|B) = P(B|A) P(A) / P(B) = ( (2/4) * (4/7) ) / (3/7) = 2/3    CORRECT
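As a quick sanity check, the example can also be verified programmatically. The minimal sketch below (my own encoding: the seven equally likely outcomes as Boolean pairs) recomputes the four probabilities with exact fractions and checks Bayes theorem:

```python
from fractions import Fraction

# The seven equally likely outcomes, as (A holds, B holds) pairs.
outcomes = [(True, True), (True, False), (False, True), (False, False),
            (True, True), (False, False), (True, False)]

n = len(outcomes)
p_a = Fraction(sum(a for a, _ in outcomes), n)         # 4/7
p_b = Fraction(sum(b for _, b in outcomes), n)         # 3/7
p_ab = Fraction(sum(a and b for a, b in outcomes), n)  # 2/7

p_b_given_a = p_ab / p_a                               # 2/4
p_a_given_b = p_ab / p_b                               # 2/3

# Bayes theorem: P(B|A) = P(A|B) P(B) / P(A)
assert p_b_given_a == p_a_given_b * p_b / p_a
print(p_b_given_a, p_a_given_b)                        # 1/2 2/3
```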
Naive Bayes Classifier

• A highly practical Bayesian learning method is the Naive Bayes learner (Naive Bayes classifier).
• The naive Bayes classifier applies to learning tasks where each instance x is described by a conjunction of attribute values and where the target function f(x) can take on any value from some finite set V.
• A set of training examples is provided, and a new instance is presented, described by the tuple of attribute values (a1, a2, …, an).
• The learner is asked to predict the target value (classification) for this new instance.
Naive Bayes Classifier

• The Bayesian approach to classifying the new instance is to assign the most probable target value vMAP, given the attribute values (a1, a2, …, an) that describe the instance:

$$v_{MAP} = \operatorname{argmax}_{v_j \in V} P(v_j \mid a_1, \ldots, a_n)$$

• By Bayes theorem:

$$v_{MAP} = \operatorname{argmax}_{v_j \in V} \frac{P(a_1, \ldots, a_n \mid v_j)\, P(v_j)}{P(a_1, \ldots, a_n)}
= \operatorname{argmax}_{v_j \in V} P(a_1, \ldots, a_n \mid v_j)\, P(v_j)$$
Naive Bayes Classifier

• It is easy to estimate each of the P(vj) simply by counting the frequency with which each target value vj occurs in the training data.
• However, estimating the different P(a1, a2, …, an | vj) terms is not feasible unless we have a very, very large set of training data.
  – The problem is that the number of these terms is equal to the number of possible instances times the number of possible target values.
  – Therefore, we would need to see every instance in the instance space many times in order to obtain reliable estimates.
• The naive Bayes classifier is based on the simplifying assumption that the attribute values are conditionally independent given the target value.
• Given the target value of the instance, the probability of observing the conjunction a1, a2, …, an is just the product of the probabilities for the individual attributes:

$$P(a_1, \ldots, a_n \mid v_j) = \prod_i P(a_i \mid v_j)$$

• Naive Bayes classifier (sketched in code below):

$$v_{NB} = \operatorname{argmax}_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j)$$
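To make the rule concrete, here is a minimal sketch of a discrete naive Bayes learner. The helper names (`train_naive_bayes`, `classify_nb`) are illustrative, and the probability estimates are raw frequency counts with no smoothing:

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (attribute_tuple, target_value) pairs.
    Returns frequency-based estimates of P(v) and P(a_i | v)."""
    class_counts = Counter(v for _, v in examples)
    cond_counts = defaultdict(Counter)  # (i, v) -> counts of values of attribute i
    for attrs, v in examples:
        for i, a in enumerate(attrs):
            cond_counts[(i, v)][a] += 1
    n = len(examples)
    priors = {v: c / n for v, c in class_counts.items()}
    def cond_prob(i, a, v):
        return cond_counts[(i, v)][a] / class_counts[v]
    return priors, cond_prob

def classify_nb(priors, cond_prob, attrs):
    """v_NB = argmax_v P(v) * prod_i P(a_i | v)."""
    def score(v):
        p = priors[v]
        for i, a in enumerate(attrs):
            p *= cond_prob(i, a, v)
        return p
    return max(priors, key=score)
```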
Naive Bayes in a Nutshell

Bayes rule:

$$P(Y = y_k \mid X_1, \ldots, X_n) = \frac{P(Y = y_k)\, P(X_1, \ldots, X_n \mid Y = y_k)}{\sum_j P(Y = y_j)\, P(X_1, \ldots, X_n \mid Y = y_j)}$$

Assuming conditional independence among the Xi's:

$$P(Y = y_k \mid X_1, \ldots, X_n) = \frac{P(Y = y_k) \prod_i P(X_i \mid Y = y_k)}{\sum_j P(Y = y_j) \prod_i P(X_i \mid Y = y_j)}$$

So, the classification rule for a new instance $X^{new} = \langle X_1^{new}, \ldots, X_n^{new} \rangle$ is:

$$Y^{new} \leftarrow \operatorname{argmax}_{y_k}\, P(Y = y_k) \prod_i P(X_i^{new} \mid Y = y_k)$$
Naive Bayes in a Nutshell

• Another way to view Naïve Bayes when Y is a Boolean attribute:

$$\frac{P(Y = 1 \mid X_1, \ldots, X_n)}{P(Y = 0 \mid X_1, \ldots, X_n)} = \frac{P(Y = 1) \prod_i P(X_i \mid Y = 1)}{P(Y = 0) \prod_i P(X_i \mid Y = 0)}$$

• Decision rule: is this quantity greater or less than 1?
Naive Bayes in a Nutshell

• What if we have continuous Xi?
• We still have:

$$P(Y = y_k \mid X_1, \ldots, X_n) = \frac{P(Y = y_k) \prod_i P(X_i \mid Y = y_k)}{\sum_j P(Y = y_j) \prod_i P(X_i \mid Y = y_j)}$$

• We just need to decide how to represent P(Xi | Y).
• Common approach: assume P(Xi | Y = yk) follows a Normal (Gaussian) distribution.
Gaussian Distribution (also called "Normal Distribution")

A Normal Distribution (Gaussian Distribution) is a bell-shaped distribution defined by the probability density function

$$p(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$

• A Normal distribution is fully determined by the two parameters in the formula: μ and σ.
• If the random variable X follows a normal distribution:
  – The probability that X will fall into the interval (a, b) is $\int_a^b p(x)\, dx$
  – The expected, or mean, value of X: E[X] = μ
  – The variance of X: Var(X) = σ²
  – The standard deviation of X: σ_X = σ
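The density formula transcribes directly into code (a minimal sketch using only the standard library; `gaussian_pdf` is an illustrative name):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# The density peaks at the mean: for mu=0, sigma=1 this is ~0.3989.
print(gaussian_pdf(0.0, 0.0, 1.0))
```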
Gaussian Naive Bayes (GNB)

Gaussian Naive Bayes (GNB) assumes

$$P(X_i = x \mid Y = y_k) = \frac{1}{\sigma_{ik}\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - \mu_{ik}}{\sigma_{ik}}\right)^2}$$

where
• μik is the mean of the Xi values of the instances whose Y attribute is yk.
• σik is the standard deviation of the Xi values of the instances whose Y attribute is yk.
• Sometimes σik and μik are assumed to be σi and μi, i.e. independent of Y.
Gaussian Naive Bayes Algorithm: continuous Xi (but still discrete Y)

• Train the Gaussian Naive Bayes classifier: for each value yk
  – Estimate P(Y = yk):

$$P(Y = y_k) = \frac{\#\text{ of } y_k \text{ examples}}{\text{total }\#\text{ of examples}}$$

  – For each attribute Xi, in order to estimate P(Xi | Y = yk):
    • Estimate the class-conditional mean μik and standard deviation σik.

• Classify (X^{new}) (see the code sketch below):

$$Y^{new} \leftarrow \operatorname{argmax}_{y_k}\, P(Y = y_k) \prod_i P(X_i^{new} \mid Y = y_k)
= \operatorname{argmax}_{y_k}\, P(Y = y_k) \prod_i \frac{1}{\sigma_{ik}\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{X_i^{new} - \mu_{ik}}{\sigma_{ik}}\right)^2}$$
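Putting the training and classification steps together, the following is a minimal numpy sketch of GNB. The class name, array shapes, and use of log probabilities are my own choices, not part of the slides:

```python
import numpy as np

class GaussianNB:
    """Gaussian Naive Bayes: per-class priors plus per-class,
    per-attribute Gaussian parameters (mu_ik, sigma_ik)."""

    def fit(self, X, y):
        # X: (n_examples, n_attributes), y: (n_examples,) of class labels
        self.classes = np.unique(y)
        self.priors = np.array([(y == c).mean() for c in self.classes])
        self.mu = np.array([X[y == c].mean(axis=0) for c in self.classes])
        self.sigma = np.array([X[y == c].std(axis=0) for c in self.classes])
        return self

    def predict(self, X):
        # Work in log space: log P(yk) + sum_i log P(xi | yk)
        log_probs = []
        for x in X:
            z = (x - self.mu) / self.sigma            # (n_classes, n_attrs)
            log_pdf = -0.5 * z**2 - np.log(self.sigma * np.sqrt(2 * np.pi))
            log_probs.append(np.log(self.priors) + log_pdf.sum(axis=1))
        return self.classes[np.argmax(log_probs, axis=1)]
```

Working in log space turns the product of densities into a sum, which avoids numerical underflow when there are many attributes.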
Estimating Parameters

• When the Xi are continuous we must choose some other way to represent the distributions P(Xi | Y).
• We assume that for each possible discrete value yk of Y, the distribution of each continuous Xi is Gaussian, defined by a mean and standard deviation specific to Xi and yk.
• In order to train such a Naive Bayes classifier we must therefore estimate the mean and variance of each of these Gaussians.

Estimating Parameters: Y discrete, Xi continuous

For each kth class (Y = yk) and ith attribute (Xi):

$$\mu_{ik} = \frac{\sum_{j:\, Y^j = y_k} X_i^j}{\#\text{ of } y_k \text{ examples}} \qquad
\sigma_{ik}^2 = \frac{\sum_{j:\, Y^j = y_k} \left(X_i^j - \mu_{ik}\right)^2}{\#\text{ of } y_k \text{ examples}}$$

• μik is the mean of the Xi values of the examples whose Y attribute is yk.
• σik² is the variance of the Xi values of the examples whose Y attribute is yk.
Estimating Parameters

• How many parameters must we estimate for Gaussian Naive Bayes if Y has k possible values and examples have n attributes?

  2·n·k parameters (n·k mean values and n·k variances), one pair per attribute-class combination in

$$P(X_i = x \mid Y = y_k) = \frac{1}{\sigma_{ik}\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - \mu_{ik}}{\sigma_{ik}}\right)^2}$$
Logistic Regression

Idea:
• Naive Bayes allows computing P(Y|X) by learning P(Y) and P(X|Y).
• Why not learn P(Y|X) directly? That is exactly what logistic regression does.
Logistic Regression

• Consider learning f: X → Y, where
  – X is a vector of real-valued features, <X1, …, Xn>
  – Y is Boolean (binary logistic regression)
• In general, Y is a discrete attribute that can take k ≥ 2 different values.
• Assume all Xi are conditionally independent given Y.
• Model P(Xi | Y = yk) as Gaussian:

$$P(X_i \mid Y = y_k) = \frac{1}{\sigma_{ik}\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{X_i - \mu_{ik}}{\sigma_{ik}}\right)^2}$$

• Model P(Y) as Bernoulli(π), so the probabilities of the two outcomes sum to 1:

  P(Y = 1) = π    P(Y = 0) = 1 − π

• What do these assumptions imply about the form of P(Y|X)?
Logistic Regression: Derive P(Y|X) for Gaussian P(Xi|Y=yk), assuming σik = σi

• In addition, assume the variance is independent of the class, i.e. σi0 = σi1 = σi.

By Bayes theorem:

$$P(Y = 1 \mid X) = \frac{P(Y = 1)\, P(X \mid Y = 1)}{P(Y = 1)\, P(X \mid Y = 1) + P(Y = 0)\, P(X \mid Y = 0)}$$

Dividing numerator and denominator by the same term:

$$P(Y = 1 \mid X) = \frac{1}{1 + \dfrac{P(Y = 0)\, P(X \mid Y = 0)}{P(Y = 1)\, P(X \mid Y = 1)}}$$

Using the math fact x = e^{ln x} (i.e. x = exp(ln x)):

$$P(Y = 1 \mid X) = \frac{1}{1 + \exp\left(\ln \dfrac{P(Y = 0)\, P(X \mid Y = 0)}{P(Y = 1)\, P(X \mid Y = 1)}\right)}
= \frac{1}{1 + \exp\left(\ln \dfrac{P(Y = 0)}{P(Y = 1)} + \ln \dfrac{P(X \mid Y = 0)}{P(X \mid Y = 1)}\right)}$$
Logistic Regression: Derive P(Y|X) for Gaussian P(Xi|Y=yk), assuming σik = σi

Starting from

$$P(Y = 1 \mid X) = \frac{1}{1 + \exp\left(\ln \dfrac{P(Y = 0)}{P(Y = 1)} + \ln \dfrac{P(X \mid Y = 0)}{P(X \mid Y = 1)}\right)}$$

• P(Y = 1) = π and P(Y = 0) = 1 − π, by modelling P(Y) as Bernoulli, so:

$$P(Y = 1 \mid X) = \frac{1}{1 + \exp\left(\ln \frac{1-\pi}{\pi} + \ln \frac{P(X \mid Y = 0)}{P(X \mid Y = 1)}\right)}$$

• By the conditional independence assumption (math fact):

$$\frac{P(X \mid Y = 0)}{P(X \mid Y = 1)} = \prod_i \frac{P(X_i \mid Y = 0)}{P(X_i \mid Y = 1)}$$

• Therefore, by the assumptions above:

$$P(Y = 1 \mid X) = \frac{1}{1 + \exp\left(\ln \frac{1-\pi}{\pi} + \sum_i \ln \frac{P(X_i \mid Y = 0)}{P(X_i \mid Y = 1)}\right)}$$
Logistic Regression: Derive P(Y|X) for Gaussian P(Xi|Y=yk), assuming σik = σi

Since $P(X_i \mid Y = y_k) = \frac{1}{\sigma_{ik}\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{X_i - \mu_{ik}}{\sigma_{ik}}\right)^2}$ and σi0 = σi1 = σi:

$$\ln \frac{1-\pi}{\pi} + \sum_i \ln \frac{P(X_i \mid Y = 0)}{P(X_i \mid Y = 1)}
= \ln \frac{1-\pi}{\pi} + \sum_i \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2} + \sum_i \frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2}\, X_i$$

This is linear in the Xi, so defining $w_0 = \ln\frac{1-\pi}{\pi} + \sum_i \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2}$ and $w_i = \frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2}$ gives

$$P(Y = 1 \mid X) = \frac{1}{1 + \exp\left(w_0 + \sum_i w_i X_i\right)}$$
Logistic Regression: Derive P(Y|X) for Gaussian P(Xi|Y=yk), assuming σik = σi

$$P(Y = 1 \mid X) = \frac{1}{1 + \exp\left(w_0 + \sum_i w_i X_i\right)}$$

implies

$$P(Y = 0 \mid X) = 1 - P(Y = 1 \mid X) = \frac{\exp\left(w_0 + \sum_i w_i X_i\right)}{1 + \exp\left(w_0 + \sum_i w_i X_i\right)}$$

implies

$$\frac{P(Y = 0 \mid X)}{P(Y = 1 \mid X)} = \exp\left(w_0 + \sum_i w_i X_i\right)$$

implies

$$\ln \frac{P(Y = 0 \mid X)}{P(Y = 1 \mid X)} = w_0 + \sum_i w_i X_i$$

i.e. a linear classification rule.
Logistic Regression

Logistic (sigmoid) function

• In logistic regression, P(Y|X) is assumed to take the following functional form, which is a sigmoid (logistic) function:

$$P(Y = 1 \mid X) = \frac{1}{1 + \exp\left(w_0 + \sum_i w_i X_i\right)}$$

• The sigmoid function $y = \frac{1}{1 + \exp(-z)}$ takes a real value and maps it into the range (0, 1).
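A few evaluations make the squashing behaviour concrete (a minimal sketch):

```python
import math

def sigmoid(z):
    """Maps any real z into (0, 1); sigmoid(0) == 0.5."""
    return 1.0 / (1.0 + math.exp(-z))

print([round(sigmoid(z), 3) for z in (-5, -1, 0, 1, 5)])
# [0.007, 0.269, 0.5, 0.731, 0.993]
```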
Logistic Regression is a Linear Classifier

• In logistic regression, P(Y|X) is assumed to be:

$$P(Y = 1 \mid X) = \frac{1}{1 + \exp\left(w_0 + \sum_i w_i X_i\right)} \qquad
P(Y = 0 \mid X) = \frac{\exp\left(w_0 + \sum_i w_i X_i\right)}{1 + \exp\left(w_0 + \sum_i w_i X_i\right)}$$

• Decision boundary:

  P(Y = 0 | X) > P(Y = 1 | X)  iff  w0 + Σi wi Xi > 0  :  classify as 0
  P(Y = 0 | X) < P(Y = 1 | X)  iff  w0 + Σi wi Xi < 0  :  classify as 1

  – w0 + Σi wi Xi = 0 is a linear decision boundary.
Logistic Regression for More Than 2 Classes

• In the general case of logistic regression, we have M classes and Y ∈ {y1, …, yM}
• For k < M:

$$P(Y = y_k \mid X) = \frac{\exp\left(w_{k0} + \sum_i w_{ki} X_i\right)}{1 + \sum_{j=1}^{M-1} \exp\left(w_{j0} + \sum_i w_{ji} X_i\right)}$$

• For k = M (no weights for the last class):

$$P(Y = y_M \mid X) = \frac{1}{1 + \sum_{j=1}^{M-1} \exp\left(w_{j0} + \sum_i w_{ji} X_i\right)}$$
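These two formulas transcribe directly into code (a minimal numpy sketch; the layout of `W`, one row of weights per non-reference class with the bias first, is my own choice):

```python
import numpy as np

def multiclass_probs(W, x):
    """W: (M-1, d+1) weight rows [w_j0, w_j1, ..., w_jd]; x: (d,) features.
    Returns the M class probabilities, with the last class as reference."""
    scores = np.exp(W[:, 0] + W[:, 1:] @ x)        # exp(w_j0 + sum_i w_ji x_i)
    denom = 1.0 + scores.sum()
    return np.append(scores / denom, 1.0 / denom)  # probabilities sum to 1

# Example: 3 classes, 2 features
W = np.array([[0.1,  1.0, -0.5],
              [0.0, -0.3,  0.8]])
print(multiclass_probs(W, np.array([1.0, 2.0])))
```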
Training Logistic Regression

• We will focus on binary logistic regression.
• Training data: we have n training examples { <X¹,Y¹>, …, <Xⁿ,Yⁿ> }.
• Attributes: we have d attributes, so we have to learn the weights W: w0, w1, …, wd.
• We want to learn the weights that give the training data maximum probability.

Maximum Likelihood Estimate for the parameters W:

$$W_{MLE} = \operatorname{argmax}_W P\left(\{\langle X^1, Y^1\rangle, \ldots, \langle X^n, Y^n\rangle\} \mid W\right)
= \operatorname{argmax}_W \prod_{j=1}^{n} P\left(\langle X^j, Y^j\rangle \mid W\right)$$
Training Logistic Regression: Maximum Conditional Likelihood Estimate

• Rather than maximizing the data likelihood

$$W_{MLE} = \operatorname{argmax}_W \prod_{j=1}^{n} P\left(\langle X^j, Y^j\rangle \mid W\right)$$

we use the Maximum Conditional Likelihood Estimate for the parameters W, maximizing the conditional data likelihood:

$$W_{MCLE} = \operatorname{argmax}_W \prod_{j=1}^{n} P\left(Y^j \mid X^j, W\right)$$

where

$$P(Y = 0 \mid X, W) = \frac{1}{1 + \exp\left(w_0 + \sum_i w_i X_i\right)} \qquad
P(Y = 1 \mid X, W) = \frac{\exp\left(w_0 + \sum_i w_i X_i\right)}{1 + \exp\left(w_0 + \sum_i w_i X_i\right)}$$

(Note that this swaps the roles of Y = 0 and Y = 1 relative to the earlier slides; the two conventions are equivalent up to negating the weights.)
Training Logistic Regression: Expressing Conditional Log Likelihood

• The log of the conditional data likelihood, l(W):

$$l(W) = \ln \prod_{j=1}^{n} P\left(Y^j \mid X^j, W\right) = \sum_{j=1}^{n} \ln P\left(Y^j \mid X^j, W\right)$$

• Y can take only the values 0 or 1, so only one of the two terms in the following expression will be non-zero for any given Y^j:

$$l(W) = \sum_{j=1}^{n} Y^j \ln P(Y^j = 1 \mid X^j, W) + (1 - Y^j) \ln P(Y^j = 0 \mid X^j, W)
= \sum_{j=1}^{n} Y^j \ln \frac{P(Y^j = 1 \mid X^j, W)}{P(Y^j = 0 \mid X^j, W)} + \ln P(Y^j = 0 \mid X^j, W)$$

• Substituting the two probabilities:

$$l(W) = \sum_{j=1}^{n} Y^j \left(w_0 + \sum_i w_i X_i^j\right) - \ln\left(1 + \exp\left(w_0 + \sum_i w_i X_i^j\right)\right)$$
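The final expression translates directly into code (a minimal numpy sketch; `w[0]` plays the role of w0, and `np.logaddexp(0, s)` computes ln(1 + exp(s)) stably):

```python
import numpy as np

def log_likelihood(w, X, y):
    """l(W) = sum_j [ y_j * s_j - ln(1 + exp(s_j)) ],
    where s_j = w0 + sum_i w_i X_i^j."""
    s = w[0] + X @ w[1:]                          # (n,) linear scores
    return np.sum(y * s - np.logaddexp(0.0, s))   # stable log(1 + exp(s))
```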
Training Logistic Regression: Maximizing Conditional Log Likelihood

$$l(W) = \sum_{j=1}^{n} Y^j \left(w_0 + \sum_i w_i X_i^j\right) - \ln\left(1 + \exp\left(w_0 + \sum_i w_i X_i^j\right)\right)$$

Bad news:
• Unfortunately, there is no closed-form solution to maximizing l(W) with respect to W.
Good news:
• l(W) is a concave function of W, and concave functions are easy to optimize.
• Therefore, one common approach is to use gradient ascent.
• Maximizing a concave function is equivalent to minimizing the corresponding convex function (its negation): gradient ascent (concave) / gradient descent (convex).
Gradient Descent

[Figure: illustration of gradient descent steps on an error surface]
Maximizing Conditional Log Likelihood: Gradient Ascent

$$l(W) = \sum_{j=1}^{n} Y^j \left(w_0 + \sum_i w_i X_i^j\right) - \ln\left(1 + \exp\left(w_0 + \sum_i w_i X_i^j\right)\right)$$

Gradient:

$$\nabla l(W) = \left[\frac{\partial l(W)}{\partial w_0}, \ldots, \frac{\partial l(W)}{\partial w_n}\right]$$

Gradient ascent update rule, with learning rate η:

$$w_i \leftarrow w_i + \eta\, \frac{\partial l(W)}{\partial w_i}$$
Maximizing Conditional Log Likelihood: Gradient Ascent

$$l(W) = \sum_{j=1}^{n} Y^j \left(w_0 + \sum_i w_i X_i^j\right) - \ln\left(1 + \exp\left(w_0 + \sum_i w_i X_i^j\right)\right)$$

Differentiating with respect to wi:

$$\frac{\partial l(W)}{\partial w_i} = \sum_{j=1}^{n} X_i^j \left(Y^j - \frac{\exp\left(w_0 + \sum_i w_i X_i^j\right)}{1 + \exp\left(w_0 + \sum_i w_i X_i^j\right)}\right)
= \sum_{j=1}^{n} X_i^j \left(Y^j - P(Y^j = 1 \mid X^j, W)\right)$$

Gradient ascent algorithm: iterate until the change is < ε.
For all i, repeat:

$$w_i \leftarrow w_i + \eta \sum_{j=1}^{n} X_i^j \left(Y^j - P(Y^j = 1 \mid X^j, W)\right)$$
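A complete batch gradient-ascent training loop built from this update rule (a minimal numpy sketch; the learning rate, stopping tolerance, and the column-of-ones trick for w0 are my own choices):

```python
import numpy as np

def train_logistic_regression(X, y, eta=0.1, eps=1e-6, max_iters=10000):
    """Batch gradient ascent on the conditional log likelihood l(W).
    X: (n, d) real-valued features, y: (n,) labels in {0, 1}.
    Returns weights w of shape (d+1,), with w[0] = w0."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend 1s so w[0] is w0
    w = np.zeros(Xb.shape[1])
    for _ in range(max_iters):
        s = Xb @ w                                  # w0 + sum_i w_i X_i^j
        p1 = 1.0 / (1.0 + np.exp(-s))               # P(Y^j = 1 | X^j, W)
        w_new = w + eta * (Xb.T @ (y - p1))         # w_i += eta * dl/dw_i
        if np.max(np.abs(w_new - w)) < eps:         # iterate until change < eps
            return w_new
        w = w_new
    return w
```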
G. Naïve Bayes vs. Logistic Regression

The bottom line:
• GNB2 (GNB with class-independent variances, σik = σi) and LR both use linear decision surfaces; GNB in general need not.
• Given infinite data, LR is better than GNB2 because its training procedure does not make assumption 1 (conditional independence) or assumption 2 (class-independent variances), even though our derivation of the form of P(Y|X) did.
• But GNB2 converges more quickly to its perhaps-less-accurate asymptotic error.
• And GNB is both more biased (assumption 1) and less biased (no assumption 2) than LR, so either might beat the other.