Naïve Bayes Classifier
Pradeep Ravikumar
Co-instructor: Ziv Bar-Joseph
Machine Learning 10-701
Goal: Classification
Features X → Labels Y
Example labels: Sports, Science, News
(Figure: probability of error, on a scale from 0 to 1.)
Optimal Classification
Optimal predictor (Bayes classifier):
f* = arg min_f P(f(X) ≠ Y)
• Even the optimal classifier makes mistakes: R(f*) > 0. This minimum achievable error is called the Bayes risk.
• The optimal classifier depends on the unknown data distribution.
Optimal Classifier
Bayes Rule:
P(Y=y|X=x) = P(X=x|Y=y) P(Y=y) / P(X=x)
Optimal classifier:
f*(x) = arg max_y P(Y=y|X=x) = arg max_y P(X=x|Y=y) P(Y=y)
Here P(X=x|Y=y) is the class conditional density and P(Y=y) is the class prior.
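To make the recipe concrete, here is a minimal sketch of a Bayes classifier when the true distributions are assumed known; the priors and 1-D Gaussian class conditionals below are illustrative values, not from the lecture:

from scipy.stats import norm

# Assumed-known model for illustration: P(Y=y) and P(X=x|Y=y)
priors = {0: 0.6, 1: 0.4}                      # class priors P(Y=y)
cond = {0: norm(loc=-1.0, scale=1.0),          # class conditional P(X=x|Y=0)
        1: norm(loc=1.5, scale=1.0)}           # class conditional P(X=x|Y=1)

def bayes_classifier(x):
    # f*(x) = arg max_y P(X=x|Y=y) P(Y=y)
    return max(priors, key=lambda y: cond[y].pdf(x) * priors[y])

print(bayes_classifier(0.2))  # the label with the larger posterior at x = 0.2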
Model based Approach
We can now consider appropriate models for the two terms: the class probability P(Y=y) and the class conditional distribution of features P(X=x|Y=y).
Modeling the class probability (binary case): P(Y=y) = Bernoulli(θ), like a coin flip:
P(Y=1) = θ,  P(Y=0) = 1 − θ
Modeling Class Conditional Distribution of Features
• Gaussian class conditional densities (1 dimension/feature)
(Figure: two 1-D Gaussian class conditional densities and the resulting decision boundary.)
• Gaussian class conditional densities (2 dimensions/features)
(Figure: two 2-D Gaussian class conditional densities with means µ1 and µ2, and the resulting decision boundary.)
Multi-class classification: Handwritten digit recognition
(Figure: sample digits plotted in 2-D, with axes φ1(X) and φ2(X).)
Note: 8 digits shown out of 10 (0, 1, …, 9); axes are obtained by nonlinear dimensionality reduction (later in course).

Handwritten digit recognition
Training Data: n greyscale images (input X) with n labels (label Y).
Each image is represented as a vector of intensity values at the d pixels (features):
X = (X1, X2, …, Xd)ᵀ
Gaussian Bayes model:
P(Y=y) = p_y for all y in {0, 1, 2, …, 9}, where p0, p1, …, p9 sum to 1.
P(X=x|Y=y) ~ N(µy, Σy) for each y, where µy is a d-dim vector and Σy is a d×d matrix.
Gaussian Bayes classifier
P(Y=y) = p_y for all y in {0, 1, 2, …, 9}, where p0, p1, …, p9 sum to 1.
P(X=x|Y=y) ~ N(µy, Σy) for each y, where µy is a d-dim vector and Σy is a d×d matrix, with density
N(x; µy, Σy) = exp( −(x−µy)ᵀ Σy⁻¹ (x−µy) / 2 ) / √( (2π)^d |Σy| )
Decision Boundary of Gaussian Bayes
• For binary classification with continuous features, the decision boundary is the set of points x where P(Y=1|X=x) = P(Y=0|X=x).
If the class conditional feature distribution P(X=x|Y=y) is a 2-dim Gaussian N(µy, Σy) and P(Y=1) = θ:
P(Y=1|X=x) / P(Y=0|X=x)
  = [ P(X=x|Y=1) P(Y=1) ] / [ P(X=x|Y=0) P(Y=0) ]
  = √( |Σ0| / |Σ1| ) · exp( −(x−µ1)ᵀ Σ1⁻¹ (x−µ1)/2 + (x−µ0)ᵀ Σ0⁻¹ (x−µ0)/2 ) · θ/(1−θ)
Note: In general, this implies a quadratic equation in x. But if Σ1 = Σ0, the quadratic part cancels out and the equation is linear.
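The cancellation in the note can be checked numerically. The sketch below (with made-up means, covariance, and prior θ) evaluates the log-odds along a line; the second difference vanishes exactly when Σ1 = Σ0, i.e., when the boundary is linear:

import numpy as np
from scipy.stats import multivariate_normal

theta = 0.5                                  # P(Y=1), illustrative
mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
S0 = np.array([[1.0, 0.3], [0.3, 1.0]])      # covariance of class 0

def log_odds(x, S1):
    # log [ P(X=x|Y=1) P(Y=1) / (P(X=x|Y=0) P(Y=0)) ]
    p1 = multivariate_normal(mu1, S1).pdf(x) * theta
    p0 = multivariate_normal(mu0, S0).pdf(x) * (1 - theta)
    return np.log(p1 / p0)

xs = [np.array([t, 0.0]) for t in (0.0, 1.0, 2.0)]
for S1 in (S0, 2 * S0):                      # equal vs. unequal covariances
    v = [log_odds(x, S1) for x in xs]
    print(v[2] - 2 * v[1] + v[0])            # ~0 if linear, nonzero if quadratic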
Gaussian Bayes classifier
P(Y=y) = p_y for all y in {0, 1, 2, …, 9}, where p0, p1, …, p9 sum to 1.
P(X=x|Y=y) ~ N(µy, Σy) for each y, where µy is a d-dim vector and Σy is a d×d matrix.
How do we learn the parameters p_y, µy, Σy from data?
How many parameters do we need to learn?
How many parameters do we need to learn?
Class probability: P(Y=y) = p_y for all y in {0, 1, 2, …, 9} → K − 1 parameters if K labels.
Class conditional distribution of features: P(X=x|Y=y) ~ N(µy, Σy), with µy a d-dim vector and Σy a d×d matrix → Kd + Kd(d+1)/2 = O(Kd²) parameters for d features.
Quadratic in dimension d! If d = 256×256 pixels, ~21.5 billion parameters!
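As a quick sanity check of the ~21.5 billion figure (a short illustration, not part of the slides):

K, d = 10, 256 * 256
# K-1 class prior parameters, K mean vectors with d entries each, and
# K symmetric d-by-d covariance matrices with d(d+1)/2 free entries each
n_params = (K - 1) + K * d + K * d * (d + 1) // 2
print(f"{n_params:,}")  # 21,475,819,529 -> about 21.5 billion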
What about discrete features?
Training Data: n black-white images (input X) with n labels (label Y).
Each image is represented as a vector of d binary features (black = 1 or white = 0):
X = (X1, X2, …, Xd)ᵀ
Discrete Bayes model:
P(Y=y) = p_y for all y in {0, 1, 2, …, 9}, where p0, p1, …, p9 sum to 1.
P(X=x|Y=y): for each label y, maintain a probability table with 2^d − 1 entries.
How many parameters do we need to learn?
Class probability: P(Y=y) = p_y for all y in {0, 1, 2, …, 9} → K − 1 parameters if K labels.
Class conditional distribution of features: for each label y, a probability table with 2^d − 1 entries → K(2^d − 1) parameters for d binary features.
Exponential in dimension d!
What's wrong with too many parameters?
• How much training data is needed to learn one parameter (the bias of a coin)?
• We need lots of training data to learn the parameters!
– Training data > number of parameters
Naïve Bayes Classifier
• Bayes Classifier with an additional "naïve" assumption:
– Features are independent given the class. For X = (X1, X2):
P(X1, X2 | Y) = P(X1 | Y) P(X2 | Y)
– More generally, for X = (X1, X2, …, Xd):
P(X1, …, Xd | Y) = P(X1 | Y) · P(X2 | Y) · … · P(Xd | Y)
• If the conditional independence assumption holds, NB is the optimal classifier! But it can be worse otherwise.
Conditional Independence
• X is conditionally independent of Y given Z: the probability distribution governing X is independent of the value of Y, given the value of Z:
P(X | Y, Z) = P(X | Z)
• Equivalent to: P(X, Y | Z) = P(X | Z) P(Y | Z)
• e.g., P(Thunder | Rain, Lightning) = P(Thunder | Lightning)
Note: this does NOT mean Thunder is independent of Rain.
Conditional vs. Marginal Independence
Example: wearing coats is independent of accidents conditioned on the fact that it rained; marginally, though, the two are dependent, since rain makes both more likely.
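A small numerical illustration of this point, with made-up probabilities: by construction, Coat and Accident are independent given Rain, yet they are dependent marginally:

p_rain = {True: 0.3, False: 0.7}             # P(Rain)
p_coat = {True: 0.9, False: 0.1}             # P(Coat=1 | Rain)
p_acc = {True: 0.2, False: 0.05}             # P(Accident=1 | Rain)

def joint(coat, acc, rain):
    # this factorization encodes Coat independent of Accident given Rain
    pc = p_coat[rain] if coat else 1 - p_coat[rain]
    pa = p_acc[rain] if acc else 1 - p_acc[rain]
    return p_rain[rain] * pc * pa

bools = (True, False)
marg_coat = sum(joint(True, a, r) for a in bools for r in bools)   # 0.34
marg_acc = sum(joint(c, True, r) for c in bools for r in bools)    # 0.095
both = sum(joint(True, True, r) for r in bools)                    # 0.0575
print(both, marg_coat * marg_acc)  # unequal -> marginally dependent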
Naïve Bayes Classifier
• Bayes Classifier with an additional "naïve" assumption:
– Features are independent given the class: P(X1, …, Xd | Y) = P(X1 | Y) · … · P(Xd | Y)
• How many parameters now?
Handwritten digit recognition (continuous features)
Training Data: n greyscale images with d pixels (input X) and n labels (label Y), with
X = (X1, X2, …, Xd)ᵀ
How many parameters?
Class probability: P(Y=y) = p_y for all y → K − 1 parameters if K labels.
Class conditional distribution of features, using the Naïve Bayes assumption (which may not hold):
P(Xi=xi|Y=y) ~ N(µi(y), σi²(y)) for each y and each pixel i → 2Kd parameters.
Linear instead of quadratic in d!
Handwritten digit recognition (discrete features)
Training Data: n black-white (1/0) images with d pixels (input X) and n labels (label Y), with
X = (X1, X2, …, Xd)ᵀ
How many parameters?
Class probability: P(Y=y) = p_y for all y → K − 1 parameters if K labels.
Class conditional distribution of features, using the Naïve Bayes assumption (which may not hold):
P(Xi=xi|Y=y) – one probability value for each y and pixel i → Kd parameters.
Linear instead of exponential in d!
Naïve Bayes Classifier
• Bayes Classifier with an additional "naïve" assumption:
– Features are independent given the class: P(X1, …, Xd | Y) = P(X1 | Y) · … · P(Xd | Y)
• Has fewer parameters, and hence requires less training data, even though the assumption may be violated in practice.
Naïve Bayes Algo – Discrete features
• Training Data: n examples (x_j, y_j), j = 1, …, n
• Maximum Likelihood Estimates
– For the class probability: P̂(Y=y) = (#examples with label y) / n
– For the class conditional distribution: P̂(Xi=xi|Y=y) = (#examples with Xi=xi and label y) / (#examples with label y)
• NB prediction for test data x: ŷ = arg max_y P̂(Y=y) · P̂(X1=x1|Y=y) · … · P̂(Xd=xd|Y=y)
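A minimal runnable sketch of this algorithm for binary features (the helper names and toy data are illustrative, not from the lecture):

import numpy as np

def train_nb(X, y, K):
    # MLE: class frequencies and per-class feature frequencies
    prior = np.array([(y == k).mean() for k in range(K)])        # P(Y=k)
    cond = np.array([X[y == k].mean(axis=0) for k in range(K)])  # P(X_i=1|Y=k)
    return prior, cond

def predict_nb(x, prior, cond):
    # arg max_y log P(y) + sum_i log P(x_i|y), by the NB factorization
    like = np.where(x == 1, cond, 1 - cond)
    return int(np.argmax(np.log(prior) + np.log(like).sum(axis=1)))

X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0]])  # toy data
y = np.array([1, 1, 0, 0])
prior, cond = train_nb(X, y, K=2)
print(predict_nb(np.array([1, 0, 1]), prior, cond))  # -> 1
# Note: P(X1=1|Y=0) is estimated as 0 here (no class-0 example has the first
# feature on), so class 0 scores log(0) = -inf (NumPy warns but still runs).
# This is exactly the zero-count MLE problem discussed next.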
Issues with Naïve Bayes
• Issue 1: Usually, features are not conditionally independent:
P(X1, …, Xd | Y) ≠ P(X1 | Y) · … · P(Xd | Y)
Nonetheless, NB is the single most used classifier; particularly when data is limited, it works well.
• Issue 2: Typically use MAP estimates instead of MLE, since insufficient data may cause the MLE to be zero.
Insufficient data for MLE
• What if you never see a training instance where X1 = a when Y = b?
– e.g., b = {Spam Email}, a = {'Earn'}
– Then P̂(X1=a|Y=b) = 0
• Thus, no matter what values X2, …, Xd take:
P̂(Y=b | X1=a, X2, …, Xd) = 0
• What now???
Naïve Bayes Algo – Discrete features
• Training Data: n examples
• Maximum A Posteriori (MAP) Estimates – add m "virtual" data points
Assume a given prior distribution q over feature values (typically uniform).
MAP Estimate:
P̂(Xi=xi|Y=b) = ( #examples with Xi=xi and Y=b + m·q(xi) ) / ( #examples with Y=b + m )
where m is the number of virtual examples with Y = b.
Now, even if you never observe a feature value in some class, the estimated probability is never zero.
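Continuing the sketch above, the MAP estimate with m virtual examples and a uniform prior (q = 1/2 for a binary feature) is a one-line change to the counting; m = 2 is an illustrative choice:

import numpy as np

def train_nb_map(X, y, K, m=2):
    # Smoothed counts: add m "virtual" examples per class, spread
    # uniformly over the two values of each binary feature
    prior = np.array([(y == k).mean() for k in range(K)])
    cond = np.array([(X[y == k].sum(axis=0) + m * 0.5) / ((y == k).sum() + m)
                     for k in range(K)])
    return prior, cond

X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0]])  # same toy data
y = np.array([1, 1, 0, 0])
prior, cond = train_nb_map(X, y, K=2)
print(cond)  # every estimate is now strictly between 0 and 1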
Case Study: Text Classification
• Classify e-mails: Y = {Spam, NotSpam}
• Classify news articles: Y = {what is the topic of the article?}
• Classify webpages: Y = {Student, professor, project, …}
• What about the features X? – The text!
Bag of words approach
Example word-count vector for one document:
aardvark 0
about 2
all 2
Africa 1
apple 0
anxious 0
...
gas 1
...
oil 1
…
Zaire 0
NB for Text Classification
• Features X are the counts of how many times each word in the vocabulary appears in the document
• The probability table for P(X|Y) is huge!!!
• The NB assumption helps a lot!!!
• Bag of words + Naïve Bayes assumption imply that P(X|Y=y) is just the product of the probability of each word, raised to its count, in a document on topic y
Bag of words model
• Typical additional assumption: position in the document doesn't matter
– "Bag of words" model: the order of words on the page is ignored
– Sounds really silly, but often works very well!
As a bag of words: in is lecture lecture next over person remember room sitting the the the to to up wake when you
Original sentence: When the lecture is over, remember to wake up the person sitting next to you in the lecture room.
NB with Bag of Words for text classification
• Learning phase:
– Class prior P(Y): fraction of times topic Y appears in the collection of documents
– P(w|Y): fraction of times word w appears in documents with topic Y
• Test phase:
– For each document, use the bag of words + naïve Bayes decision rule:
ŷ = arg max_y P(y) · ∏_w P(w|y)^count(w)
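A compact sketch of both phases on a toy corpus (the documents, topics, and the smoothing constant alpha are illustrative choices, not from the lecture):

from collections import Counter
import math

docs = [("the game was a great win", "sports"),
        ("the team lost the game", "sports"),
        ("the election results were announced", "politics"),
        ("the senate passed the bill", "politics")]

# Learning phase: class prior and smoothed word probabilities P(w|Y)
topics = {y for _, y in docs}
prior = {y: sum(1 for _, t in docs if t == y) / len(docs) for y in topics}
vocab = {w for d, _ in docs for w in d.split()}
counts = {y: Counter(w for d, t in docs if t == y for w in d.split())
          for y in topics}

def p_word(w, y, alpha=1.0):
    # fraction of times w appears in topic-y documents, Laplace-smoothed
    return (counts[y][w] + alpha) / (sum(counts[y].values()) + alpha * len(vocab))

# Test phase: arg max_y log P(y) + sum_w count(w) * log P(w|y)
def classify(doc):
    bag = Counter(doc.split())
    return max(topics, key=lambda y: math.log(prior[y]) +
               sum(c * math.log(p_word(w, y)) for w, c in bag.items()))

print(classify("a great game"))  # -> sports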
Twenty newsgroups results
(Results figure omitted.)
What if features are continuous?
E.g., character recognition: Xi is the intensity at the i-th pixel.
Gaussian Naïve Bayes (GNB):
P(Xi=xi|Y=k) ~ N(µik, σik²)
Different mean and variance for each class k and each pixel i.
Sometimes assume the variance
• is independent of Y (i.e., σi),
• or independent of Xi (i.e., σk),
• or both (i.e., σ)
Estimating parameters: Y discrete, Xi continuous
Maximum likelihood estimates:
µ̂ik = average of x_ij over training images j in the k-th class
σ̂ik² = average of (x_ij − µ̂ik)² over training images j in the k-th class
where x_ij is the i-th pixel in the j-th training image and k indexes the class.
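A minimal GNB estimation-and-prediction sketch along these lines (the toy data and the small variance floor are illustrative additions):

import numpy as np

def train_gnb(X, y, K, var_floor=1e-6):
    # MLE: per-class prior, and per-class per-feature mean and variance
    prior = np.array([(y == k).mean() for k in range(K)])
    mu = np.array([X[y == k].mean(axis=0) for k in range(K)])
    var = np.array([X[y == k].var(axis=0) for k in range(K)])
    return prior, mu, np.maximum(var, var_floor)  # floor avoids dividing by 0

def predict_gnb(x, prior, mu, var):
    # arg max_k log p_k + sum_i log N(x_i; mu_ik, sigma^2_ik)
    log_like = -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var).sum(axis=1)
    return int(np.argmax(np.log(prior) + log_like))

X = np.array([[1.0, 2.0], [1.2, 1.8], [3.0, 0.5], [2.8, 0.7]])  # toy data
y = np.array([0, 0, 1, 1])
prior, mu, var = train_gnb(X, y, K=2)
print(predict_gnb(np.array([1.1, 1.9]), prior, mu, var))  # -> 0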
Example: GNB for classifying mental states (fMRI)
• ~1 mm resolution
• ~2 images per sec.
• 15,000 voxels/image
• non-invasive, safe
• measures the Blood Oxygen Level Dependent (BOLD) response
[Mitchell et al.]
Gaussian Naïve Bayes: Learned µ(voxel, word)
• 15,000 voxels or features
• 10 training examples or subjects per class (12 word categories)
[Mitchell et al.]

Learned Naïve Bayes Models – Means for P(BrainActivity | WordCategory)
(Figure: learned means for Animal words vs. People words.)
Pairwise classification accuracy: 85% [Mitchell et al.]
What you should know…
• Optimal decision using the Bayes Classifier
• Naïve Bayes classifier
– What's the assumption
– Why we use it
– How do we learn it
– Why MAP estimation is important
• Text classification
– Bag of words model
• Gaussian NB
– Features are still conditionally independent
– Each feature has a Gaussian distribution given the class
Gaussian Naïve Bayes vs. Logistic Regression
• Representation equivalence (both yield linear decision boundaries)
– But only in a special case!!! (GNB with class-independent variances)
– LR makes no assumptions about P(X|Y) in learning!!!
– They optimize different functions (MLE/MCLE or MAP/MCAP) and obtain different solutions.
(Figure: the set of Gaussian Naïve Bayes parameters with feature variance independent of the class label sits inside the set of Logistic Regression parameters.)
Discriminative vs. Generative Classifiers
Recall the optimal classifier: f*(x) = arg max_y P(X=x|Y=y) P(Y=y)

Generative (model based) approach: e.g., Naïve Bayes
• Assume some probability model for P(Y) and P(X|Y)
• Estimate the parameters of the probability models from training data

Discriminative (model free) approach: e.g., Logistic Regression
Why not learn P(Y|X) directly? Or better yet, why not learn the decision boundary directly?
• Assume some functional form for P(Y|X) or for the decision boundary
• Estimate the parameters of the functional form directly from training data