
Machine Learning

The Naïve Bayes Classifier

Today’slecture

• ThenaïveBayesClassifier

• LearningthenaïveBayesClassifier

• Practicalconcerns

2

Today’slecture

• ThenaïveBayesClassifier

• LearningthenaïveBayesClassifier

• Practicalconcerns

3

Where are we?

We have seen Bayesian learning
– Using a probabilistic criterion to select a hypothesis
– Maximum a posteriori and maximum likelihood learning

You should know the difference between them.

We could also learn functions that predict probabilities of outcomes
– Different from using a probabilistic criterion to learn

This is maximum a posteriori (MAP) prediction, as opposed to MAP learning.

MAP prediction

Use the Bayes rule for predicting $y$ given an input $\mathbf{x}$:

$$P(Y = y \mid X = \mathbf{x}) = \frac{P(X = \mathbf{x} \mid Y = y)\, P(Y = y)}{P(X = \mathbf{x})}$$

The left-hand side is the posterior probability of the label being $y$ for this input $\mathbf{x}$.

Predict the label $y$ for the input $\mathbf{x}$ using

$$\arg\max_y \frac{P(X = \mathbf{x} \mid Y = y)\, P(Y = y)}{P(X = \mathbf{x})}$$

Because the denominator $P(X = \mathbf{x})$ does not depend on $y$, this is the same as

$$\arg\max_y P(X = \mathbf{x} \mid Y = y)\, P(Y = y)$$

Don't confuse this with MAP learning, which finds a hypothesis by maximizing its posterior probability.

Here $P(X = \mathbf{x} \mid Y = y)$ is the likelihood of observing this input $\mathbf{x}$ when the label is $y$, and $P(Y = y)$ is the prior probability of the label being $y$. All we need are these two sets of probabilities.

Example: Tennis again

Likelihood:

Temperature   Wind     P(T, W | Tennis = Yes)
Hot           Strong   0.15
Hot           Weak     0.4
Cold          Strong   0.1
Cold          Weak     0.35

Temperature   Wind     P(T, W | Tennis = No)
Hot           Strong   0.4
Hot           Weak     0.1
Cold          Strong   0.3
Cold          Weak     0.2

Prior:

Play tennis   P(Play tennis)
Yes           0.3
No            0.7

The prior answers: without any other information, what is the prior probability that I should play tennis? The likelihoods answer: on days that I do play tennis, what is the probability that the temperature is T and the wind is W? And on days that I don't play tennis, what is the probability that the temperature is T and the wind is W?

Input: Temperature = Hot (H), Wind = Weak (W)

Should I play tennis?

argmax_y P(H, W | play?) P(play?)

P(H, W | Yes) P(Yes) = 0.4 × 0.3 = 0.12
P(H, W | No) P(No) = 0.1 × 0.7 = 0.07

MAP prediction = Yes
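A minimal sketch of this computation in code, with the tables above stored as dictionaries (the variable names are illustrative):

```python
prior = {"Yes": 0.3, "No": 0.7}

# Joint likelihood P(Temperature, Wind | Play) from the tables above
likelihood = {
    "Yes": {("Hot", "Strong"): 0.15, ("Hot", "Weak"): 0.4,
            ("Cold", "Strong"): 0.1, ("Cold", "Weak"): 0.35},
    "No":  {("Hot", "Strong"): 0.4, ("Hot", "Weak"): 0.1,
            ("Cold", "Strong"): 0.3, ("Cold", "Weak"): 0.2},
}

def map_predict(x):
    # argmax over labels y of P(x | y) P(y)
    return max(prior, key=lambda y: likelihood[y][x] * prior[y])

print(map_predict(("Hot", "Weak")))  # "Yes" (0.12 vs 0.07)
```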

How hard is it to learn probabilistic models?

      O   T   H   W   Play?
1     S   H   H   W   -
2     S   H   H   S   -
3     O   H   H   W   +
4     R   M   H   W   +
5     R   C   N   W   +
6     R   C   N   S   -
7     O   C   N   S   +
8     S   M   H   W   -
9     S   C   N   W   +
10    R   M   N   W   +
11    S   M   N   S   +
12    O   M   H   S   +
13    O   H   N   W   +
14    R   M   H   S   -

Outlook: S(unny), O(vercast), R(ainy)
Temperature: H(ot), M(edium), C(ool)
Humidity: H(igh), N(ormal), L(ow)
Wind: S(trong), W(eak)

We need to learn
1. The prior P(Play?)
2. The likelihoods P(x | Play?)

Prior P(play?)
• A single number (Why only one?)

Likelihood P(X | Play?)
• There are 4 features
• For each value of Play? (+/−), we need a value for each possible assignment: P(O, T, H, W | Play?)
• If every feature were binary, that would be (2⁴ − 1) parameters in each case, one for each assignment
• Here the features take 3, 3, 3 and 2 values respectively, so we need (3 ⋅ 3 ⋅ 3 ⋅ 2 − 1) = 53 parameters for each label

In general

Prior P(Y)
• If there are k labels, then k − 1 parameters (why not k?)

Likelihood P(X | Y)
• If there are d Boolean features:
• We need a value for each possible P(x_1, x_2, …, x_d | y) for each y
• k(2^d − 1) parameters

Need a lot of data to estimate these many numbers!
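To make these counts concrete, here is a quick numeric check of the formulas above (the d = 10 and d = 20 cases are illustrative numbers, not from the slides):

```python
from math import prod

k = 2  # two labels: + and -

# Weather features take 3, 3, 3 and 2 values; the full joint likelihood
# needs one parameter per assignment (minus one, since they sum to 1),
# for each label:
print(k * (prod([3, 3, 3, 2]) - 1))  # 2 * 53 = 106

# General formula for d Boolean features: k(2^d - 1) parameters.
for d in (4, 10, 20):
    print(d, k * (2 ** d - 1))  # d = 20 already needs about 2.1 million
```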

High model complexity! If there is very limited data, there will be high variance in the parameters.

How can we deal with this?

Answer: Make independence assumptions

Recall: Conditional independence

Suppose X, Y and Z are random variables.

X is conditionally independent of Y given Z if the probability distribution of X is independent of the value of Y when Z is observed:

$$P(X \mid Y, Z) = P(X \mid Z)$$

Or equivalently,

$$P(X, Y \mid Z) = P(X \mid Z)\, P(Y \mid Z)$$

Modeling the features

$P(x_1, x_2, \cdots, x_d \mid y)$ required $k(2^d - 1)$ parameters.

What if all the features were conditionally independent given the label? That is,

$$P(x_1, x_2, \cdots, x_d \mid y) = P(x_1 \mid y)\, P(x_2 \mid y) \cdots P(x_d \mid y)$$

This is the Naïve Bayes assumption.

It requires only d numbers for each label, kd parameters overall. Not bad!

The Naïve Bayes Classifier

Assumption: Features are conditionally independent given the label Y

To predict, we need two sets of probabilities
– Prior P(y)
– For each x_j, we have the likelihood P(x_j | y)

Decision rule

$$h_{NB}(\mathbf{x}) = \arg\max_y P(y)\, P(x_1, x_2, \cdots, x_d \mid y) = \arg\max_y P(y) \prod_j P(x_j \mid y)$$
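A minimal sketch of this rule in code, assuming the probabilities are given as dictionaries (the layout is illustrative). Summing log-probabilities selects the same label as multiplying probabilities, and avoids numerical underflow when there are many features:

```python
import math

def h_nb(x, prior, likelihood):
    """Naive Bayes decision rule: argmax_y P(y) * prod_j P(x_j | y).

    prior[y] = P(y); likelihood[y][j][v] = P(x_j = v | y).
    Sums of logs give the same argmax as products of probabilities.
    """
    def log_score(y):
        return math.log(prior[y]) + sum(
            math.log(likelihood[y][j][v]) for j, v in enumerate(x))
    return max(prior, key=log_score)
```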

Decision boundaries of naïve Bayes

What is the decision boundary of the naïve Bayes classifier?

Consider the two class case. We predict the label to be + if

$$P(y = +) \prod_j P(x_j \mid y = +) > P(y = -) \prod_j P(x_j \mid y = -)$$

or equivalently, if

$$\frac{P(y = +) \prod_j P(x_j \mid y = +)}{P(y = -) \prod_j P(x_j \mid y = -)} > 1$$

Taking log and simplifying, we get

$$\log \frac{P(y = + \mid \mathbf{x})}{P(y = - \mid \mathbf{x})} = \mathbf{w}^T \mathbf{x} + b$$

This is a linear function of the feature space! Easy to prove; see the note on the course website.
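A sketch of why the boundary is linear for binary features, using the Bernoulli notation introduced later in the lecture: write $P(y = +) = p$, $P(x_j = 1 \mid y = +) = a_j$ and $P(x_j = 1 \mid y = -) = b_j$. Since $P(X = \mathbf{x})$ cancels in the ratio of posteriors,

$$\log \frac{P(y = + \mid \mathbf{x})}{P(y = - \mid \mathbf{x})} = \log \frac{p}{1 - p} + \sum_j \left( x_j \log \frac{a_j}{b_j} + (1 - x_j) \log \frac{1 - a_j}{1 - b_j} \right)$$

Collecting the coefficient of each $x_j$ gives $w_j = \log \frac{a_j}{b_j} - \log \frac{1 - a_j}{1 - b_j}$, and the terms that do not depend on $\mathbf{x}$ make up the bias $b$.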

Today’slecture

• ThenaïveBayesClassifier

• LearningthenaïveBayesClassifier

• PracticalConcerns

39

Learning the naïve Bayes Classifier

• What is the hypothesis function h defined by?
  – A collection of probabilities
    • Prior for each label: P(y)
    • Likelihoods for feature x_j given a label: P(x_j | y)

Suppose we have a dataset D = {(x_i, y_i)} with m examples, and we want to learn the classifier in a probabilistic way.
– What is a probabilistic criterion to select the hypothesis?

A note on convention for this section:
• Examples in the dataset are indexed by the subscript i (e.g. x_i)
• Features within an example are indexed by the subscript j
• The j-th feature of the i-th example will be x_ij

Learning the naïve Bayes Classifier

Maximum likelihood estimation. Here h is defined by all the probabilities used to construct the naïve Bayes decision.

Given a dataset D = {(x_i, y_i)} with m examples: each example in the dataset is independent and identically distributed, so we can represent P(D | h) as a product over the examples:

$$h_{ML} = \arg\max_h P(D \mid h) = \arg\max_h \prod_{i=1}^{m} P(\mathbf{x}_i, y_i \mid h)$$

Each factor asks: "What probability would this particular h assign to the pair (x_i, y_i)?"

Expanding each factor and applying the naïve Bayes assumption (recall that x_ij is the j-th feature of x_i):

$$P(\mathbf{x}_i, y_i \mid h) = P(y_i \mid h)\, P(\mathbf{x}_i \mid y_i, h) = P(y_i \mid h) \prod_j P(x_{ij} \mid y_i, h)$$

How do we proceed?
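One standard step before anything else: since the log is monotone, maximizing the likelihood is the same as maximizing its log, which turns the products into sums:

$$h_{ML} = \arg\max_h \sum_{i=1}^{m} \left( \log P(y_i \mid h) + \sum_j \log P(x_{ij} \mid y_i, h) \right)$$

Beyond that, we need to pin down the form of each probability, which is what comes next.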

Learning the naïve Bayes Classifier

What next? We need to make a modeling assumption about the functional form of these probability distributions.

For simplicity, suppose there are two labels 1 and 0 and all features are binary.

• Prior: P(y = 1) = p and P(y = 0) = 1 − p
  That is, the prior probability is from the Bernoulli distribution.

• Likelihood for each feature given a label:
  • P(x_j = 1 | y = 1) = a_j and P(x_j = 0 | y = 1) = 1 − a_j
  • P(x_j = 1 | y = 0) = b_j and P(x_j = 0 | y = 0) = 1 − b_j
  That is, the likelihood of each feature is also from the Bernoulli distribution.

h consists of p and all the a's and b's.

Using the indicator function, the prior can be written compactly as

$$P(y) = p^{[y = 1]} (1 - p)^{[y = 0]}$$

[z] is called the indicator function or the Iverson bracket. Its value is 1 if the argument z is true and zero otherwise.

Similarly for the likelihoods:

$$P(x_j \mid y = 1) = a_j^{[x_j = 1]} (1 - a_j)^{[x_j = 0]}, \qquad P(x_j \mid y = 0) = b_j^{[x_j = 1]} (1 - b_j)^{[x_j = 0]}$$

Learning the naïve Bayes Classifier

Substituting and deriving the argmax, we get the maximum likelihood estimates:

$$P(y = 1) = p = \frac{1}{m} \sum_i [y_i = 1]$$

$$P(x_j = 1 \mid y = 1) = a_j = \frac{\sum_i [x_{ij} = 1 \text{ and } y_i = 1]}{\sum_i [y_i = 1]}$$

$$P(x_j = 1 \mid y = 0) = b_j = \frac{\sum_i [x_{ij} = 1 \text{ and } y_i = 0]}{\sum_i [y_i = 0]}$$

In words: each probability is just the corresponding fraction of counts in the data.
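As a check, here is a sketch of the derivation for p (the a's and b's follow the same pattern). Only the prior terms of the log-likelihood involve p, so set the derivative to zero:

$$\frac{\partial}{\partial p} \sum_i \Big( [y_i = 1] \log p + [y_i = 0] \log(1 - p) \Big) = \frac{\sum_i [y_i = 1]}{p} - \frac{\sum_i [y_i = 0]}{1 - p} = 0$$

Solving gives $p = \frac{1}{m} \sum_i [y_i = 1]$, the fraction of examples labeled 1.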

Let’slearnanaïveBayesclassifier

O T H W Play?1 S H H W -2 S H H S -3 O H H W +4 R M H W +5 R C N W +6 R C N S -7 O C N S +8 S M H W -9 S C N W +10 R M N W +11 S M N S +12 O M H S +13 O H N W +14 R M H S -

64

WiththeassumptionthatallourprobabilitiesarefromtheBernoullidistribution

Let’slearnanaïveBayesclassifier

O T H W Play?1 S H H W -2 S H H S -3 O H H W +4 R M H W +5 R C N W +6 R C N S -7 O C N S +8 S M H W -9 S C N W +10 R M N W +11 S M N S +12 O M H S +13 O H N W +14 R M H S -

65

𝑃 𝑃𝑙𝑎𝑦 = + =914 𝑃 𝑃𝑙𝑎𝑦 = − =

514

Let’slearnanaïveBayesclassifier

O T H W Play?1 S H H W -2 S H H S -3 O H H W +4 R M H W +5 R C N W +6 R C N S -7 O C N S +8 S M H W -9 S C N W +10 R M N W +11 S M N S +12 O M H S +13 O H N W +14 R M H S -

66

𝑃(𝑶 = 𝑆|𝑃𝑙𝑎𝑦 = +) =29

𝑃 𝑃𝑙𝑎𝑦 = + =914 𝑃 𝑃𝑙𝑎𝑦 = − =

514

Let’slearnanaïveBayesclassifier

67

O T H W Play?1 S H H W -2 S H H S -3 O H H W +4 R M H W +5 R C N W +6 R C N S -7 O C N S +8 S M H W -9 S C N W +10 R M N W +11 S M N S +12 O M H S +13 O H N W +14 R M H S -

𝑃(𝑶 = 𝑅|𝑃𝑙𝑎𝑦 = +) = 39

𝑃(𝑶 = 𝑆|𝑃𝑙𝑎𝑦 = +) =29

𝑃 𝑃𝑙𝑎𝑦 = + =914 𝑃 𝑃𝑙𝑎𝑦 = − =

514

Let’slearnanaïveBayesclassifier

68

O T H W Play?1 S H H W -2 S H H S -3 O H H W +4 R M H W +5 R C N W +6 R C N S -7 O C N S +8 S M H W -9 S C N W +10 R M N W +11 S M N S +12 O M H S +13 O H N W +14 R M H S -

𝑃(𝑶 = 𝑂|𝑃𝑙𝑎𝑦 = +) = 49

Andsoon,forotherattributesandalsoforPlay=-

𝑃(𝑶 = 𝑅|𝑃𝑙𝑎𝑦 = +) = 39

𝑃(𝑶 = 𝑆|𝑃𝑙𝑎𝑦 = +) =29

𝑃 𝑃𝑙𝑎𝑦 = + =914 𝑃 𝑃𝑙𝑎𝑦 = − =

514

Naïve Bayes: Learning and Prediction

• Learning (see the sketch below)
  – Count how often features occur with each label. Normalize to get likelihoods
  – Priors from fraction of examples with each label
  – Generalizes to multiclass

• Prediction
  – Use learned probabilities to find the highest scoring label
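A minimal sketch of this count-and-normalize recipe on the dataset above (variable names are illustrative); the printed values match the fractions computed on the previous slide:

```python
from collections import Counter, defaultdict

# Feature order: Outlook, Temperature, Humidity, Wind; last entry is the label.
data = [
    ("S", "H", "H", "W", "-"), ("S", "H", "H", "S", "-"),
    ("O", "H", "H", "W", "+"), ("R", "M", "H", "W", "+"),
    ("R", "C", "N", "W", "+"), ("R", "C", "N", "S", "-"),
    ("O", "C", "N", "S", "+"), ("S", "M", "H", "W", "-"),
    ("S", "C", "N", "W", "+"), ("R", "M", "N", "W", "+"),
    ("S", "M", "N", "S", "+"), ("O", "M", "H", "S", "+"),
    ("O", "H", "N", "W", "+"), ("R", "M", "H", "S", "-"),
]

label_counts = Counter(row[-1] for row in data)
prior = {y: c / len(data) for y, c in label_counts.items()}

# counts[(j, v, y)] = how often feature j takes value v together with label y
counts = defaultdict(int)
for *features, y in data:
    for j, v in enumerate(features):
        counts[(j, v, y)] += 1

def likelihood(j, v, y):
    # Normalize the counts to get P(x_j = v | y)
    return counts[(j, v, y)] / label_counts[y]

print(prior["+"])               # 9/14 ~= 0.643
print(likelihood(0, "S", "+"))  # P(O = S | Play = +) = 2/9 ~= 0.222

def predict(x):
    # Highest scoring label under P(y) * prod_j P(x_j | y)
    def score(y):
        s = prior[y]
        for j, v in enumerate(x):
            s *= likelihood(j, v, y)
        return s
    return max(prior, key=score)

print(predict(("O", "M", "N", "W")))  # "+"
```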

Today’slecture

• ThenaïveBayesClassifier

• LearningthenaïveBayesClassifier

• Practicalconcerns+anexample

70

Important caveats with Naïve Bayes

1. Features need not be conditionally independent given the label
   – Just because we assume that they are doesn't mean that that's how they behave in nature
   – We made a modeling assumption because it makes computation and learning easier

2. Not enough training data to get good estimates of the probabilities from counts

Important caveats with Naïve Bayes

1. Features are not conditionally independent given the label

All bets are off if the naïve Bayes assumption is not satisfied. And yet it is very often used in practice because of its simplicity, and it works reasonably well even when the assumption is violated.

Important caveats with Naïve Bayes

2. Not enough training data to get good estimates of the probabilities from counts

The basic operation for learning likelihoods is counting how often a feature occurs with a label. What if we never see a particular feature with a particular label? E.g.: suppose we never observe Temperature = cold with PlayTennis = Yes.

Should we treat those counts as zero? But that will make the probabilities zero.

Answer: Smoothing (a small sketch follows below)
• Add fake counts (very small numbers so that the counts are not zero)
• The Bayesian interpretation of smoothing: priors on the hypothesis (MAP learning)
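A sketch of one common smoothing scheme, add-one (Laplace) smoothing; the function name and the example counts are illustrative:

```python
def smoothed_likelihood(count_xj_and_y, count_y, num_values, alpha=1.0):
    # alpha fake counts per possible value of the feature, so that no
    # estimate of P(x_j = v | y) is ever exactly zero
    return (count_xj_and_y + alpha) / (count_y + alpha * num_values)

# Suppose Temperature = cold were never seen with Play = Yes (0 of 9 examples),
# and Temperature takes 3 values:
print(smoothed_likelihood(0, 9, 3))  # 1/12 ~= 0.083 instead of 0
```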

Example: Classifying text

• Instance space: Text documents
• Labels: Spam or Not Spam

• Goal: To learn a function that can predict whether a new document is Spam or Not Spam

How would you build a Naïve Bayes classifier? Let us brainstorm:
• How to represent documents?
• How to estimate probabilities?
• How to classify?

Example: Classifying text

1. Represent documents by a vector of words
   A sparse vector consisting of one feature per word

2. Learning from N labeled documents (see the sketch after this list)
   1. Priors: the fraction of the N documents that have each label
   2. For each word w in the vocabulary: count how often the word occurs with each label, normalize to get the likelihoods, and apply smoothing
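A minimal sketch of this recipe with binary word-presence features and add-one smoothing; the estimator choice, the function names, and the toy corpus are illustrative assumptions, not the slide's exact formulas:

```python
from collections import Counter

def learn(docs):
    # docs: list of (set_of_words, label) pairs
    n = len(docs)
    label_counts = Counter(y for _, y in docs)
    prior = {y: c / n for y, c in label_counts.items()}  # fraction of documents per label
    doc_freq = {y: Counter() for y in label_counts}      # how often each word occurs with each label
    for words, y in docs:
        doc_freq[y].update(words)
    def likelihood(w, y):
        # P(word w present | y), with one fake count each for "present" and "absent"
        return (doc_freq[y][w] + 1) / (label_counts[y] + 2)
    return prior, likelihood

# Hypothetical toy corpus:
docs = [({"buy", "cheap", "pills"}, "Spam"),
        ({"meeting", "at", "noon"}, "NotSpam"),
        ({"cheap", "flights"}, "Spam")]
prior, likelihood = learn(docs)
print(prior["Spam"])                # 2/3
print(likelihood("cheap", "Spam"))  # (2 + 1) / (2 + 2) = 0.75
```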

Continuous features

• So far, we have been looking at discrete features
  – P(x_j | y) is a Bernoulli trial (i.e. a coin toss)

• We could model P(x_j | y) with other distributions too
  – This is a separate assumption from the independence assumption that naive Bayes makes
  – E.g.: for real valued features, (X_j | Y) could be drawn from a normal distribution

• Exercise: Derive the maximum likelihood estimate when the features are assumed to be drawn from the normal distribution
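For checking your answer to the exercise: with $(X_j \mid Y = y) \sim \mathcal{N}(\mu_{jy}, \sigma_{jy}^2)$, the maximum likelihood estimates are the per-class sample mean and (biased) sample variance:

$$\hat{\mu}_{jy} = \frac{\sum_i [y_i = y]\, x_{ij}}{\sum_i [y_i = y]}, \qquad \hat{\sigma}_{jy}^2 = \frac{\sum_i [y_i = y]\, (x_{ij} - \hat{\mu}_{jy})^2}{\sum_i [y_i = y]}$$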

Summary: Naïve Bayes

• Independence assumption
  – All features are independent of each other given the label

• Maximum likelihood learning: learning is simple
  – Generalizes to real valued features

• Prediction via MAP estimation
  – Generalizes beyond binary classification

• Important caveats to remember
  – Smoothing
  – Independence assumption may not be valid

• Decision boundary is linear for binary classification