
Machine Learning

The Naïve Bayes Classifier

Today’slecture

• ThenaïveBayesClassifier

• LearningthenaïveBayesClassifier

• Practicalconcerns

2

Today’slecture

• ThenaïveBayesClassifier

• LearningthenaïveBayesClassifier

• Practicalconcerns

3

Where are we?

We have seen Bayesian learning
– Using a probabilistic criterion to select a hypothesis
– Maximum a posteriori and maximum likelihood learning

You should know the difference between them.

We could also learn functions that predict probabilities of outcomes
– Different from using a probabilistic criterion to learn

This is maximum a posteriori (MAP) prediction, as opposed to MAP learning.

MAP prediction

Use the Bayes rule for predicting $y$ given an input $\mathbf{x}$:

$$P(Y = y \mid X = \mathbf{x}) = \frac{P(X = \mathbf{x} \mid Y = y)\, P(Y = y)}{P(X = \mathbf{x})}$$

The left-hand side is the posterior probability of the label being $y$ for this input $\mathbf{x}$.

Predict the label $y$ for the input $\mathbf{x}$ using

$$\arg\max_y \frac{P(X = \mathbf{x} \mid Y = y)\, P(Y = y)}{P(X = \mathbf{x})}$$

Because the denominator $P(X = \mathbf{x})$ does not depend on $y$, this is the same as

$$\arg\max_y P(X = \mathbf{x} \mid Y = y)\, P(Y = y)$$

Don't confuse this with MAP learning, which finds a hypothesis by maximizing its posterior probability.

Here $P(X = \mathbf{x} \mid Y = y)$ is the likelihood of observing this input $\mathbf{x}$ when the label is $y$, and $P(Y = y)$ is the prior probability of the label being $y$. All we need are these two sets of probabilities.

Example: Tennis again

Likelihood:

Temperature   Wind     P(T, W | Tennis = Yes)
Hot           Strong   0.15
Hot           Weak     0.4
Cold          Strong   0.1
Cold          Weak     0.35

Temperature   Wind     P(T, W | Tennis = No)
Hot           Strong   0.4
Hot           Weak     0.1
Cold          Strong   0.3
Cold          Weak     0.2

Prior:

Play tennis   P(Play tennis)
Yes           0.3
No            0.7

The prior answers: without any other information, what is the prior probability that I should play tennis? The likelihoods answer: on days that I do play tennis, what is the probability that the temperature is T and the wind is W? And on days that I don't play tennis, what is the probability that the temperature is T and the wind is W?

Input: Temperature = Hot (H), Wind = Weak (W)

Should I play tennis?

argmax_y P(H, W | play?) P(play?)

P(H, W | Yes) P(Yes) = 0.4 × 0.3 = 0.12
P(H, W | No) P(No) = 0.1 × 0.7 = 0.07

MAP prediction = Yes
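A minimal sketch of this computation in code, with the tables above stored as dictionaries (the variable names are illustrative):

```python
prior = {"Yes": 0.3, "No": 0.7}

# Joint likelihood P(Temperature, Wind | Play) from the tables above
likelihood = {
    "Yes": {("Hot", "Strong"): 0.15, ("Hot", "Weak"): 0.4,
            ("Cold", "Strong"): 0.1, ("Cold", "Weak"): 0.35},
    "No":  {("Hot", "Strong"): 0.4, ("Hot", "Weak"): 0.1,
            ("Cold", "Strong"): 0.3, ("Cold", "Weak"): 0.2},
}

def map_predict(x):
    # argmax over labels y of P(x | y) P(y)
    return max(prior, key=lambda y: likelihood[y][x] * prior[y])

print(map_predict(("Hot", "Weak")))  # "Yes" (0.12 vs 0.07)
```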

How hard is it to learn probabilistic models?

      O   T   H   W   Play?
1     S   H   H   W   -
2     S   H   H   S   -
3     O   H   H   W   +
4     R   M   H   W   +
5     R   C   N   W   +
6     R   C   N   S   -
7     O   C   N   S   +
8     S   M   H   W   -
9     S   C   N   W   +
10    R   M   N   W   +
11    S   M   N   S   +
12    O   M   H   S   +
13    O   H   N   W   +
14    R   M   H   S   -

Outlook: S(unny), O(vercast), R(ainy)
Temperature: H(ot), M(edium), C(ool)
Humidity: H(igh), N(ormal), L(ow)
Wind: S(trong), W(eak)

We need to learn
1. The prior P(Play?)
2. The likelihoods P(x | Play?)

Prior P(play?)
• A single number (Why only one?)

Likelihood P(X | Play?)
• There are 4 features
• For each value of Play? (+/−), we need a value for each possible assignment: P(O, T, H, W | Play?)
• If every feature were binary, that would be (2⁴ − 1) parameters in each case, one for each assignment
• Here the features take 3, 3, 3 and 2 values respectively, so we need (3 ⋅ 3 ⋅ 3 ⋅ 2 − 1) = 53 parameters for each label

In general

Prior P(Y)
• If there are k labels, then k − 1 parameters (why not k?)

Likelihood P(X | Y)
• If there are d Boolean features:
• We need a value for each possible P(x_1, x_2, …, x_d | y) for each y
• k(2^d − 1) parameters

Need a lot of data to estimate these many numbers!
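To make these counts concrete, here is a quick numeric check of the formulas above (the d = 10 and d = 20 cases are illustrative numbers, not from the slides):

```python
from math import prod

k = 2  # two labels: + and -

# Weather features take 3, 3, 3 and 2 values; the full joint likelihood
# needs one parameter per assignment (minus one, since they sum to 1),
# for each label:
print(k * (prod([3, 3, 3, 2]) - 1))  # 2 * 53 = 106

# General formula for d Boolean features: k(2^d - 1) parameters.
for d in (4, 10, 20):
    print(d, k * (2 ** d - 1))  # d = 20 already needs about 2.1 million
```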

High model complexity! If there is very limited data, there will be high variance in the parameters.

How can we deal with this?

Answer: Make independence assumptions

Recall: Conditional independence

Suppose X, Y and Z are random variables.

X is conditionally independent of Y given Z if the probability distribution of X is independent of the value of Y when Z is observed:

$$P(X \mid Y, Z) = P(X \mid Z)$$

Or equivalently,

$$P(X, Y \mid Z) = P(X \mid Z)\, P(Y \mid Z)$$

Modeling the features

$P(x_1, x_2, \cdots, x_d \mid y)$ required $k(2^d - 1)$ parameters.

What if all the features were conditionally independent given the label? That is,

$$P(x_1, x_2, \cdots, x_d \mid y) = P(x_1 \mid y)\, P(x_2 \mid y) \cdots P(x_d \mid y)$$

This is the Naïve Bayes assumption.

It requires only d numbers for each label, kd parameters overall. Not bad!

The Naïve Bayes Classifier

Assumption: Features are conditionally independent given the label Y

To predict, we need two sets of probabilities
– Prior P(y)
– For each x_j, we have the likelihood P(x_j | y)

Decision rule

$$h_{NB}(\mathbf{x}) = \arg\max_y P(y)\, P(x_1, x_2, \cdots, x_d \mid y) = \arg\max_y P(y) \prod_j P(x_j \mid y)$$
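A minimal sketch of this rule in code, assuming the probabilities are given as dictionaries (the layout is illustrative). Summing log-probabilities selects the same label as multiplying probabilities, and avoids numerical underflow when there are many features:

```python
import math

def h_nb(x, prior, likelihood):
    """Naive Bayes decision rule: argmax_y P(y) * prod_j P(x_j | y).

    prior[y] = P(y); likelihood[y][j][v] = P(x_j = v | y).
    Sums of logs give the same argmax as products of probabilities.
    """
    def log_score(y):
        return math.log(prior[y]) + sum(
            math.log(likelihood[y][j][v]) for j, v in enumerate(x))
    return max(prior, key=log_score)
```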

Decision boundaries of naïve Bayes

What is the decision boundary of the naïve Bayes classifier?

Consider the two class case. We predict the label to be + if

$$P(y = +) \prod_j P(x_j \mid y = +) > P(y = -) \prod_j P(x_j \mid y = -)$$

or equivalently, if

$$\frac{P(y = +) \prod_j P(x_j \mid y = +)}{P(y = -) \prod_j P(x_j \mid y = -)} > 1$$

Taking log and simplifying, we get

$$\log \frac{P(y = + \mid \mathbf{x})}{P(y = - \mid \mathbf{x})} = \mathbf{w}^T \mathbf{x} + b$$

This is a linear function of the feature space! Easy to prove; see the note on the course website.
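A sketch of why the boundary is linear for binary features, using the Bernoulli notation introduced later in the lecture: write $P(y = +) = p$, $P(x_j = 1 \mid y = +) = a_j$ and $P(x_j = 1 \mid y = -) = b_j$. Since $P(X = \mathbf{x})$ cancels in the ratio of posteriors,

$$\log \frac{P(y = + \mid \mathbf{x})}{P(y = - \mid \mathbf{x})} = \log \frac{p}{1 - p} + \sum_j \left( x_j \log \frac{a_j}{b_j} + (1 - x_j) \log \frac{1 - a_j}{1 - b_j} \right)$$

Collecting the coefficient of each $x_j$ gives $w_j = \log \frac{a_j}{b_j} - \log \frac{1 - a_j}{1 - b_j}$, and the terms that do not depend on $\mathbf{x}$ make up the bias $b$.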

Today’slecture

• ThenaïveBayesClassifier

• LearningthenaïveBayesClassifier

• PracticalConcerns

39

Learning the naïve Bayes Classifier

• What is the hypothesis function h defined by?
  – A collection of probabilities
    • Prior for each label: P(y)
    • Likelihoods for feature x_j given a label: P(x_j | y)

Suppose we have a dataset D = {(x_i, y_i)} with m examples, and we want to learn the classifier in a probabilistic way.
– What is a probabilistic criterion to select the hypothesis?

A note on convention for this section:
• Examples in the dataset are indexed by the subscript i (e.g. x_i)
• Features within an example are indexed by the subscript j
• The j-th feature of the i-th example will be x_ij

Learning the naïve Bayes Classifier

Maximum likelihood estimation. Here h is defined by all the probabilities used to construct the naïve Bayes decision.

Given a dataset D = {(x_i, y_i)} with m examples: each example in the dataset is independent and identically distributed, so we can represent P(D | h) as a product over the examples:

$$h_{ML} = \arg\max_h P(D \mid h) = \arg\max_h \prod_{i=1}^{m} P(\mathbf{x}_i, y_i \mid h)$$

Each factor asks: "What probability would this particular h assign to the pair (x_i, y_i)?"

Expanding each factor and applying the naïve Bayes assumption (recall that x_ij is the j-th feature of x_i):

$$P(\mathbf{x}_i, y_i \mid h) = P(y_i \mid h)\, P(\mathbf{x}_i \mid y_i, h) = P(y_i \mid h) \prod_j P(x_{ij} \mid y_i, h)$$

How do we proceed?
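One standard step before anything else: since the log is monotone, maximizing the likelihood is the same as maximizing its log, which turns the products into sums:

$$h_{ML} = \arg\max_h \sum_{i=1}^{m} \left( \log P(y_i \mid h) + \sum_j \log P(x_{ij} \mid y_i, h) \right)$$

Beyond that, we need to pin down the form of each probability, which is what comes next.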

Learning the naïve Bayes Classifier

What next? We need to make a modeling assumption about the functional form of these probability distributions.

For simplicity, suppose there are two labels 1 and 0 and all features are binary.

• Prior: P(y = 1) = p and P(y = 0) = 1 − p
  That is, the prior probability is from the Bernoulli distribution.

• Likelihood for each feature given a label:
  • P(x_j = 1 | y = 1) = a_j and P(x_j = 0 | y = 1) = 1 − a_j
  • P(x_j = 1 | y = 0) = b_j and P(x_j = 0 | y = 0) = 1 − b_j
  That is, the likelihood of each feature is also from the Bernoulli distribution.

h consists of p and all the a's and b's.

Using the indicator function, the prior can be written compactly as

$$P(y) = p^{[y = 1]} (1 - p)^{[y = 0]}$$

[z] is called the indicator function or the Iverson bracket. Its value is 1 if the argument z is true and zero otherwise.

Similarly for the likelihoods:

$$P(x_j \mid y = 1) = a_j^{[x_j = 1]} (1 - a_j)^{[x_j = 0]}, \qquad P(x_j \mid y = 0) = b_j^{[x_j = 1]} (1 - b_j)^{[x_j = 0]}$$

Learning the naïve Bayes Classifier

Substituting and deriving the argmax, we get the maximum likelihood estimates:

$$P(y = 1) = p = \frac{1}{m} \sum_i [y_i = 1]$$

$$P(x_j = 1 \mid y = 1) = a_j = \frac{\sum_i [x_{ij} = 1 \text{ and } y_i = 1]}{\sum_i [y_i = 1]}$$

$$P(x_j = 1 \mid y = 0) = b_j = \frac{\sum_i [x_{ij} = 1 \text{ and } y_i = 0]}{\sum_i [y_i = 0]}$$

In words: each probability is just the corresponding fraction of counts in the data.
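As a check, here is a sketch of the derivation for p (the a's and b's follow the same pattern). Only the prior terms of the log-likelihood involve p, so set the derivative to zero:

$$\frac{\partial}{\partial p} \sum_i \Big( [y_i = 1] \log p + [y_i = 0] \log(1 - p) \Big) = \frac{\sum_i [y_i = 1]}{p} - \frac{\sum_i [y_i = 0]}{1 - p} = 0$$

Solving gives $p = \frac{1}{m} \sum_i [y_i = 1]$, the fraction of examples labeled 1.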

Let’slearnanaïveBayesclassifier

O T H W Play?1 S H H W -2 S H H S -3 O H H W +4 R M H W +5 R C N W +6 R C N S -7 O C N S +8 S M H W -9 S C N W +10 R M N W +11 S M N S +12 O M H S +13 O H N W +14 R M H S -

64

WiththeassumptionthatallourprobabilitiesarefromtheBernoullidistribution

Let’slearnanaïveBayesclassifier

O T H W Play?1 S H H W -2 S H H S -3 O H H W +4 R M H W +5 R C N W +6 R C N S -7 O C N S +8 S M H W -9 S C N W +10 R M N W +11 S M N S +12 O M H S +13 O H N W +14 R M H S -

65

𝑃 𝑃𝑙𝑎𝑦 = + =914 𝑃 𝑃𝑙𝑎𝑦 = − =

514

Let’slearnanaïveBayesclassifier

O T H W Play?1 S H H W -2 S H H S -3 O H H W +4 R M H W +5 R C N W +6 R C N S -7 O C N S +8 S M H W -9 S C N W +10 R M N W +11 S M N S +12 O M H S +13 O H N W +14 R M H S -

66

𝑃(𝑶 = 𝑆|𝑃𝑙𝑎𝑦 = +) =29

𝑃 𝑃𝑙𝑎𝑦 = + =914 𝑃 𝑃𝑙𝑎𝑦 = − =

514

Let’slearnanaïveBayesclassifier

67

O T H W Play?1 S H H W -2 S H H S -3 O H H W +4 R M H W +5 R C N W +6 R C N S -7 O C N S +8 S M H W -9 S C N W +10 R M N W +11 S M N S +12 O M H S +13 O H N W +14 R M H S -

𝑃(𝑶 = 𝑅|𝑃𝑙𝑎𝑦 = +) = 39

𝑃(𝑶 = 𝑆|𝑃𝑙𝑎𝑦 = +) =29

𝑃 𝑃𝑙𝑎𝑦 = + =914 𝑃 𝑃𝑙𝑎𝑦 = − =

514

Let’slearnanaïveBayesclassifier

68

O T H W Play?1 S H H W -2 S H H S -3 O H H W +4 R M H W +5 R C N W +6 R C N S -7 O C N S +8 S M H W -9 S C N W +10 R M N W +11 S M N S +12 O M H S +13 O H N W +14 R M H S -

𝑃(𝑶 = 𝑂|𝑃𝑙𝑎𝑦 = +) = 49

Andsoon,forotherattributesandalsoforPlay=-

𝑃(𝑶 = 𝑅|𝑃𝑙𝑎𝑦 = +) = 39

𝑃(𝑶 = 𝑆|𝑃𝑙𝑎𝑦 = +) =29

𝑃 𝑃𝑙𝑎𝑦 = + =914 𝑃 𝑃𝑙𝑎𝑦 = − =

514

Naïve Bayes: Learning and Prediction

• Learning (see the sketch below)
  – Count how often features occur with each label. Normalize to get likelihoods
  – Priors from fraction of examples with each label
  – Generalizes to multiclass

• Prediction
  – Use learned probabilities to find the highest scoring label
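A minimal sketch of this count-and-normalize recipe on the dataset above (variable names are illustrative); the printed values match the fractions computed on the previous slide:

```python
from collections import Counter, defaultdict

# Feature order: Outlook, Temperature, Humidity, Wind; last entry is the label.
data = [
    ("S", "H", "H", "W", "-"), ("S", "H", "H", "S", "-"),
    ("O", "H", "H", "W", "+"), ("R", "M", "H", "W", "+"),
    ("R", "C", "N", "W", "+"), ("R", "C", "N", "S", "-"),
    ("O", "C", "N", "S", "+"), ("S", "M", "H", "W", "-"),
    ("S", "C", "N", "W", "+"), ("R", "M", "N", "W", "+"),
    ("S", "M", "N", "S", "+"), ("O", "M", "H", "S", "+"),
    ("O", "H", "N", "W", "+"), ("R", "M", "H", "S", "-"),
]

label_counts = Counter(row[-1] for row in data)
prior = {y: c / len(data) for y, c in label_counts.items()}

# counts[(j, v, y)] = how often feature j takes value v together with label y
counts = defaultdict(int)
for *features, y in data:
    for j, v in enumerate(features):
        counts[(j, v, y)] += 1

def likelihood(j, v, y):
    # Normalize the counts to get P(x_j = v | y)
    return counts[(j, v, y)] / label_counts[y]

print(prior["+"])               # 9/14 ~= 0.643
print(likelihood(0, "S", "+"))  # P(O = S | Play = +) = 2/9 ~= 0.222

def predict(x):
    # Highest scoring label under P(y) * prod_j P(x_j | y)
    def score(y):
        s = prior[y]
        for j, v in enumerate(x):
            s *= likelihood(j, v, y)
        return s
    return max(prior, key=score)

print(predict(("O", "M", "N", "W")))  # "+"
```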

Today’slecture

• ThenaïveBayesClassifier

• LearningthenaïveBayesClassifier

• Practicalconcerns+anexample

70

Important caveats with Naïve Bayes

1. Features need not be conditionally independent given the label
   – Just because we assume that they are doesn't mean that that's how they behave in nature
   – We made a modeling assumption because it makes computation and learning easier

2. Not enough training data to get good estimates of the probabilities from counts

Important caveats with Naïve Bayes

1. Features are not conditionally independent given the label

All bets are off if the naïve Bayes assumption is not satisfied. And yet it is very often used in practice because of its simplicity, and it works reasonably well even when the assumption is violated.

Important caveats with Naïve Bayes

2. Not enough training data to get good estimates of the probabilities from counts

The basic operation for learning likelihoods is counting how often a feature occurs with a label. What if we never see a particular feature with a particular label? E.g.: suppose we never observe Temperature = cold with PlayTennis = Yes.

Should we treat those counts as zero? But that will make the probabilities zero.

Answer: Smoothing (a small sketch follows below)
• Add fake counts (very small numbers so that the counts are not zero)
• The Bayesian interpretation of smoothing: priors on the hypothesis (MAP learning)
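A sketch of one common smoothing scheme, add-one (Laplace) smoothing; the function name and the example counts are illustrative:

```python
def smoothed_likelihood(count_xj_and_y, count_y, num_values, alpha=1.0):
    # alpha fake counts per possible value of the feature, so that no
    # estimate of P(x_j = v | y) is ever exactly zero
    return (count_xj_and_y + alpha) / (count_y + alpha * num_values)

# Suppose Temperature = cold were never seen with Play = Yes (0 of 9 examples),
# and Temperature takes 3 values:
print(smoothed_likelihood(0, 9, 3))  # 1/12 ~= 0.083 instead of 0
```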

Example: Classifying text

• Instance space: Text documents
• Labels: Spam or Not Spam

• Goal: To learn a function that can predict whether a new document is Spam or Not Spam

How would you build a Naïve Bayes classifier? Let us brainstorm:
• How to represent documents?
• How to estimate probabilities?
• How to classify?

Example: Classifying text

1. Represent documents by a vector of words
   A sparse vector consisting of one feature per word

2. Learning from N labeled documents (see the sketch after this list)
   1. Priors: the fraction of the N documents that have each label
   2. For each word w in the vocabulary: count how often the word occurs with each label, normalize to get the likelihoods, and apply smoothing
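A minimal sketch of this recipe with binary word-presence features and add-one smoothing; the estimator choice, the function names, and the toy corpus are illustrative assumptions, not the slide's exact formulas:

```python
from collections import Counter

def learn(docs):
    # docs: list of (set_of_words, label) pairs
    n = len(docs)
    label_counts = Counter(y for _, y in docs)
    prior = {y: c / n for y, c in label_counts.items()}  # fraction of documents per label
    doc_freq = {y: Counter() for y in label_counts}      # how often each word occurs with each label
    for words, y in docs:
        doc_freq[y].update(words)
    def likelihood(w, y):
        # P(word w present | y), with one fake count each for "present" and "absent"
        return (doc_freq[y][w] + 1) / (label_counts[y] + 2)
    return prior, likelihood

# Hypothetical toy corpus:
docs = [({"buy", "cheap", "pills"}, "Spam"),
        ({"meeting", "at", "noon"}, "NotSpam"),
        ({"cheap", "flights"}, "Spam")]
prior, likelihood = learn(docs)
print(prior["Spam"])                # 2/3
print(likelihood("cheap", "Spam"))  # (2 + 1) / (2 + 2) = 0.75
```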

Continuous features

• So far, we have been looking at discrete features
  – P(x_j | y) is a Bernoulli trial (i.e. a coin toss)

• We could model P(x_j | y) with other distributions too
  – This is a separate assumption from the independence assumption that naive Bayes makes
  – E.g.: for real valued features, (X_j | Y) could be drawn from a normal distribution

• Exercise: Derive the maximum likelihood estimate when the features are assumed to be drawn from the normal distribution
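For checking your answer to the exercise: with $(X_j \mid Y = y) \sim \mathcal{N}(\mu_{jy}, \sigma_{jy}^2)$, the maximum likelihood estimates are the per-class sample mean and (biased) sample variance:

$$\hat{\mu}_{jy} = \frac{\sum_i [y_i = y]\, x_{ij}}{\sum_i [y_i = y]}, \qquad \hat{\sigma}_{jy}^2 = \frac{\sum_i [y_i = y]\, (x_{ij} - \hat{\mu}_{jy})^2}{\sum_i [y_i = y]}$$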

Summary: Naïve Bayes

• Independence assumption
  – All features are independent of each other given the label

• Maximum likelihood learning: learning is simple
  – Generalizes to real valued features

• Prediction via MAP estimation
  – Generalizes beyond binary classification

• Important caveats to remember
  – Smoothing
  – Independence assumption may not be valid

• Decision boundary is linear for binary classification