UVA CS 6316/4501 – Fall 2016
Machine Learning

Lecture 13: Naïve Bayes Classifier for Text Classification

Dr. Yanjun Qi
University of Virginia
Department of Computer Science

11/8/16

Where are we? → Five major sections of this course

❑ Regression (supervised)
❑ Classification (supervised)
❑ Unsupervised models
❑ Learning theory
❑ Graphical models

Where are we? → Three major sections for classification

• We can divide the large variety of classification approaches into roughly three major types:

1. Discriminative: directly estimate a decision rule/boundary (e.g., support vector machine, decision tree)
2. Generative: build a generative statistical model (e.g., naïve Bayes classifier, Bayesian networks)
3. Instance-based classifiers: use observations directly, with no models (e.g., K nearest neighbors)

A dataset for classification

• Data/points/instances/examples/samples/records: [rows]
• Features/attributes/dimensions/independent variables/covariates/predictors/regressors: [columns, except the last]
• Target/outcome/response/label/dependent variable: special column to be predicted [last column]

Output as discrete class label: C ∈ {C1, C2, ..., CL}

Today: Naïve Bayes Classifier for Text

✓ Dictionary-based vector space representation of text article
✓ Multivariate Bernoulli vs. Multinomial
✓ Multivariate Bernoulli
   § Testing
   § Training with Maximum Likelihood Estimation for estimating parameters
✓ Multinomial naïve Bayes classifier
   § Testing
   § Multinomial naïve Bayes classifier as conditional stochastic language models
   § Training with Maximum Likelihood Estimation for estimating parameters

Text document classification, e.g., spam email filtering

• Input: document D
• Output: the predicted class C, where c is from {c1, ..., cL}

• Spam filtering task: classify email as 'Spam' or 'Other'.

P(C = spam | D)

Text classification tasks

• Input: document D
• Output: the predicted class C, where c is from {c1, ..., cL}

Text classification examples:
• Classify email as 'Spam', 'Other'.
• Classify web pages as 'Student', 'Faculty', 'Other'.
• Classify news stories into topics 'Sports', 'Politics', ...
• Classify business names by industry.
• Classify movie reviews as 'Favorable', 'Unfavorable', 'Neutral'.
• ... and many more.

Text classification: examples

• Classify shipment articles into one of 93 categories.
• An example category: 'wheat'

ARGENTINE 1986/87 GRAIN/OILSEED REGISTRATIONS
BUENOS AIRES, Feb 26. Argentine grain board figures show crop registrations of grains, oilseeds and their products to February 11, in thousands of tonnes, showing those for future shipments month, 1986/87 total and 1985/86 total to February 12, 1986, in brackets:
Bread wheat prev 1,655.8, Feb 872.0, March 164.6, total 2,692.4 (4,161.0). Maize Mar 48.0, total 48.0 (nil). Sorghum nil (nil). Oilseed export registrations were: Sunflowerseed total 15.0 (7.9). Soybean May 20.0, total 20.0 (nil). The board also detailed export registrations for sub-products, as follows...

Representing text: a list of words → Dictionary

argentine, 1986, 1987, grain, oilseed, registrations, buenos, aires, feb, 26, argentine, grain, board, figures, show, crop, registrations, of, grains, oilseeds, and, their, products, to, february, 11, in, ...

Common refinements: remove stopwords, stemming, collapsing multiple occurrences of words into one, ...

‘Bagofwords’representaFonoftext

ARGENTINE1986/87GRAIN/OILSEEDREGISTRATIONSBUENOSAIRES,Feb26ArgenFnegrainboardfiguresshowcropregistraFonsofgrains,oilseedsandtheirproductsto

February11,inthousandsoftonnes,showingthoseforfutureshipmentsmonth,1986/87totaland1985/86totaltoFebruary12,1986,inbrackets:

Breadwheatprev1,655.8,Feb872.0,March164.6,total2,692.4(4,161.0).MaizeMar48.0,total48.0(nil).Sorghumnil(nil)OilseedexportregistraFonswere:Sunflowerseedtotal15.0(7.9)SoybeanMay20.0,total20.0(nil)TheboardalsodetailedexportregistraFonsforsub-products,asfollows....

grain(s) 3 oilseed(s) 2 total 3 wheat 1 maize 1 soybean 1 tonnes 1

... ...

word frequency

BagofwordrepresentaFon:

Representtextasavectorofwordfrequencies.

11/8/16 10

Dr.YanjunQi/UVACS6316/f16
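To make this concrete, here is a minimal Python sketch of the frequency representation. The tokenizer, the seven-word vocabulary, and the toy document (engineered to reproduce the counts above) are illustrative assumptions, not from the slides:

```python
from collections import Counter
import re

def bag_of_words(text, vocabulary):
    """Represent a document as a vector of word frequencies over a fixed dictionary."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # crude tokenizer; real pipelines also stem, drop stopwords
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]           # 0 for dictionary words absent from the document

vocab = ["grain", "oilseed", "total", "wheat", "maize", "soybean", "tonnes"]
doc = "bread wheat prev total grain maize total soybean grain oilseed total grain oilseed tonnes"
print(bag_of_words(doc, vocab))  # [3, 2, 3, 1, 1, 1, 1]
```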

Another 'bag of words' representation of text → each dictionary word as a Boolean

[Same 'ARGENTINE 1986/87 GRAIN/OILSEED REGISTRATIONS' article as above.]

word        Boolean
grain(s)    Yes
oilseed(s)  Yes
total       Yes
wheat       Yes
maize       Yes
Panda       No
Tiger       No
...         ...

Bag-of-words representation: represent text as a vector of Boolean word occurrences (present or absent).
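The Boolean variant records only presence or absence. A minimal sketch, under the same illustrative tokenizer assumption as the previous snippet:

```python
import re

def boolean_bag_of_words(text, vocabulary):
    """Represent a document as a vector of Booleans: does each dictionary word appear?"""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return [w in tokens for w in vocabulary]

vocab = ["grain", "oilseed", "total", "wheat", "maize", "panda", "tiger"]
print(boolean_bag_of_words("wheat total grain oilseed maize", vocab))
# [True, True, True, True, True, False, False]
```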

Bag-of-words representation of a collection of documents

X(i, j) = frequency of word j in document i

[Matrix view: rows index documents in the collection, columns index dictionary words; an extra column holds the class label C.]

Bag of words

• What simplifying assumption are we making?

We assume word order is not important.

Unknown words

• How do we handle words in the test corpus that did not occur in the training data, i.e., out-of-vocabulary (OOV) words?

• Train a model that includes an explicit symbol for an unknown word (<UNK>). Either:
  – Choose a vocabulary in advance and replace other (i.e., not-in-vocabulary) words in the training corpus with <UNK>, or
  – Replace the first occurrence of each word in the training data with <UNK>.
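A minimal sketch of the first strategy (fixing a vocabulary in advance); the min_count threshold and the toy corpus are illustrative choices, not from the slides:

```python
from collections import Counter

UNK = "<UNK>"

def build_vocab(train_docs, min_count=2):
    """Fix a vocabulary in advance: keep words seen at least min_count times."""
    counts = Counter(w for doc in train_docs for w in doc)
    return {w for w, c in counts.items() if c >= min_count}

def replace_oov(doc, vocab):
    """Map every out-of-vocabulary token to the explicit unknown-word symbol."""
    return [w if w in vocab else UNK for w in doc]

train = [["the", "boy", "likes", "the", "dog"], ["the", "dog", "barks"]]
vocab = build_vocab(train)                                 # {'the', 'dog'}
print(replace_oov(["the", "cat", "likes", "dog"], vocab))  # ['the', '<UNK>', '<UNK>', 'dog']
```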


‘Bagofwords’representaFonoftext

ARGENTINE1986/87GRAIN/OILSEEDREGISTRATIONSBUENOSAIRES,Feb26ArgenFnegrainboardfiguresshowcropregistraFonsofgrains,oilseedsandtheirproductsto

February11,inthousandsoftonnes,showingthoseforfutureshipmentsmonth,1986/87totaland1985/86totaltoFebruary12,1986,inbrackets:

Breadwheatprev1,655.8,Feb872.0,March164.6,total2,692.4(4,161.0).MaizeMar48.0,total48.0(nil).Sorghumnil(nil)OilseedexportregistraFonswere:Sunflowerseedtotal15.0(7.9)SoybeanMay20.0,total20.0(nil)TheboardalsodetailedexportregistraFonsforsub-products,asfollows....

grain(s) 3 oilseed(s) 2 total 3 wheat 1 maize 1 soybean 1 tonnes 1

... ...

word frequency

Pr(D |C = c)?

11/8/16 16

Dr.YanjunQi/UVACS6316/f16

‘Bagofwords’representaFonoftext

Pr(D |C = c)?

Pr(W1 = n1,W2 = n2 ,...,Wk = nk | C = c)

Pr(W1 = true,W2 = false...,Wk = true | C = c)

17

Dr.YanjunQi/UVACS6316/f16

TwoPreviousmodels

11/8/16

ProbabilisFcModelsoftextdocuments

Pr(D |C = c)?

Pr(W1 = n1,W2 = n2 ,...,Wk = nk | C = c)

Pr(W1 = true,W2 = false...,Wk = true | C = c)

18

Dr.YanjunQi/UVACS6316/f16

TwoPreviousmodels

11/8/16

MulFvariateBernoulliDistribuFon

MulFnomialDistribuFon

Text classification with naïve Bayes classifier

• Multinomial vs. multivariate Bernoulli?

• The multinomial model is almost always more effective in text applications!

(Adapted from Manning's text categorization tutorial)

Experiment: multinomial vs. multivariate Bernoulli

• McCallum & Nigam (1998) did some experiments to see which is better.
• Task: determine whether a university web page is {student, faculty, other_stuff}.
• Train on ~5,000 hand-labeled web pages (Cornell, Washington, U. Texas, Wisconsin).
• Crawl and classify a new site (CMU).

(Adapted from Manning's text categorization tutorial)

Multinomial vs. multivariate Bernoulli


Model 1: Multivariate Bernoulli

• One feature Xw for each word in the dictionary
• Xw = true in document d if w appears in d
• Naive Bayes assumption: given the document's topic class label, the appearance of one word in the document tells us nothing about the chances that another word appears

Model 1: Multivariate Bernoulli naïve Bayes classifier

• Conditional independence assumption: features (word presence) are independent of each other given the class variable:

Pr(W1 = true, W2 = false, ..., Wk = true | C = c)
  = P(W1 = true | C) · P(W2 = false | C) · ... · P(Wk = true | C)

(this factorization is what is naïve)

• The multivariate Bernoulli model is appropriate for binary feature variables, e.g.:

word        True/False
grain(s)    True
oilseed(s)  True
total       True
wheat       True
chemical    False
...         ...

[Graphical model: class node C = Flu with Bernoulli word features W1, ..., W6 = runny nose, sinus, fever, muscle-ache, running, ...]
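To illustrate how this factorization is used at test time, here is a toy sketch of multivariate Bernoulli scoring; the flu example's probability values are made up for illustration. Note that absent words also contribute a factor, 1 − p:

```python
import math

# Toy P(W_w = true | C = c) values for the flu example; numbers are made up.
p_word_given_class = {
    "flu":     {"fever": 0.8, "cough": 0.6, "headache": 0.5},
    "not_flu": {"fever": 0.1, "cough": 0.2, "headache": 0.3},
}
prior = {"flu": 0.3, "not_flu": 0.7}

def bernoulli_nb_log_score(present_words, c):
    """log P(c) + sum over ALL modeled words of log P(W_w = x_w | c);
    absent words contribute log(1 - p), not nothing."""
    score = math.log(prior[c])
    for w, p in p_word_given_class[c].items():
        score += math.log(p if w in present_words else 1.0 - p)
    return score

doc = {"fever", "cough"}                       # 'headache' is absent
print(max(prior, key=lambda c: bernoulli_nb_log_score(doc, c)))  # 'flu'
```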

Review: Bernoulli distribution, e.g., coin flips

• You flip a coin:
  – Head with probability p
  – Binary random variable
  – Bernoulli trial with success probability p
• You flip k coins:
  – How many heads would you expect?
  – Number of heads X: discrete random variable
  – Binomial distribution with parameters k and p

Pr(Wi = true | C = c)


Maximum Likelihood Estimation

A general statement: consider a sample set T = (X1, ..., Xn) drawn from a probability distribution P(X | θ), where θ are the parameters. If the Xi are independent, with probability density function P(Xi | θ), the joint probability of the whole set is

P(X1, ..., Xn | θ) = ∏_{i=1}^{n} P(Xi | θ)

This may be maximised with respect to θ to give the maximum likelihood estimates.

It is often convenient to work with the log of the likelihood function:

log(L(θ)) = Σ_{i=1}^{n} log(P(Xi | θ))

The idea is:
✓ assume a particular model with unknown parameters θ;
✓ we can then define the probability P(Xi | θ) of observing a given event conditional on a particular set of parameters;
✓ we have observed a set of outcomes in the real world;
✓ it is then possible to choose the set of parameters which is most likely to have produced the observed results:

θ̂ = argmax_θ P(X1, ..., Xn | θ)

This is maximum likelihood. In most cases it is both consistent and efficient, and it provides a standard against which to compare other estimation techniques.

Review: Bernoulli distribution, e.g., coin flips

• You flip n coins:
  – How many heads would you expect?
  – Head with probability p
  – Number of heads X out of n trials
  – Each trial follows a Bernoulli distribution with parameter p

Defining likelihood for Bernoulli

• Likelihood = p(data | parameter)

PMF (a function of xi):
f(xi | p) = p^{xi} (1 − p)^{1 − xi}

LIKELIHOOD (a function of p):
L(p) = ∏_{i=1}^{n} p^{xi} (1 − p)^{1 − xi} = p^x (1 − p)^{n − x}

→ e.g., for n independent tosses of a coin with unknown p.
Observed data → x heads in n trials, where x = Σ_{i=1}^{n} xi.

Deriving the Maximum Likelihood Estimate for Bernoulli

Maximize the likelihood: L(p) = p^x (1 − p)^{n − x}

Equivalently, maximize the log-likelihood: log(L(p)) = log[ p^x (1 − p)^{n − x} ]

Or minimize the negative log-likelihood: −l(p) = −log[ p^x (1 − p)^{n − x} ]

[Three plots over p ∈ [0.0, 1.0] (x-axis labeled 'potential HIV prevalences'): the likelihood, the log-likelihood, and the negative log-likelihood; all three are optimized at the same value of p.]

Deriving the Maximum Likelihood Estimate for Bernoulli

Minimize the negative log-likelihood:

−l(p) = −log(L(p)) = −log[ p^x (1 − p)^{n − x} ]
      = −log(p^x) − log((1 − p)^{n − x})
      = −x log(p) − (n − x) log(1 − p)

Deriving the Maximum Likelihood Estimate for Bernoulli

Minimize the negative log-likelihood → MLE parameter estimation:

−l(p) = −log C(n, x) − x log(p) − (n − x) log(1 − p)

Set the derivative to zero:

dl(p)/dp = 0 = x/p̂ − (n − x)/(1 − p̂)

0 = [ −x(1 − p̂) + p̂(n − x) ] / [ p̂(1 − p̂) ]
0 = −x + p̂x + p̂n − p̂x
0 = −x + p̂n
p̂ = x/n

The proportion of positives! (i.e., the relative frequency of a binary event)
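A quick numerical sanity check that p̂ = x/n maximizes the likelihood; the data (37 heads in 100 tosses) is made up:

```python
import numpy as np

# Numerical check that p_hat = x/n maximizes the Bernoulli likelihood.
n, x = 100, 37                                     # 37 heads in 100 tosses (made-up data)
p = np.linspace(0.001, 0.999, 999)                 # grid of candidate parameters
log_lik = x * np.log(p) + (n - x) * np.log(1 - p)  # log L(p), dropping the constant C(n, x)
print(p[np.argmax(log_lik)])                       # 0.37 == x / n
```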

Parameter estimation

• Multivariate Bernoulli model:

P̂(Xw = true | cj) = fraction of documents of topic cj in which word w appears

(Adapted from Manning's text categorization tutorial)
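A minimal sketch of this estimator, without smoothing; the toy corpus is illustrative:

```python
from collections import defaultdict

def estimate_bernoulli_params(docs, labels):
    """P_hat(X_w = true | c_j) = fraction of class-c_j documents in which w appears.
    docs are sets of words; a minimal sketch without smoothing."""
    n_docs = defaultdict(int)                        # documents per class
    n_docs_with_word = defaultdict(lambda: defaultdict(int))
    for words, c in zip(docs, labels):
        n_docs[c] += 1
        for w in words:                              # sets => each word counted once per document
            n_docs_with_word[c][w] += 1
    return {c: {w: k / n_docs[c] for w, k in n_docs_with_word[c].items()}
            for c in n_docs}

docs = [{"grain", "wheat"}, {"grain", "oilseed"}, {"disk", "drive"}]
params = estimate_bernoulli_params(docs, ["wheat", "wheat", "tech"])
print(params["wheat"]["grain"])  # 1.0: 'grain' appears in 2 of 2 'wheat' documents
```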


Model 2: Multinomial Naïve Bayes, 'bag of words' representation of text

word        frequency
grain(s)    3
oilseed(s)  2
total       3
wheat       1
maize       1
soybean     1
tonnes      1
...         ...

Pr(W1 = n1, ..., Wk = nk | C = c) can be represented as a multinomial distribution.

Words = like colored balls; there are K possible types of them (i.e., from a dictionary of K words).
Document = contains N words; each word occurs ni times (like a bag of N colored balls).

In a document class of 'wheat', 'grain' is more likely, whereas in a 'hard drive' shipment class the parameter for 'grain' is going to be smaller.

Multinomial distribution

• The multinomial distribution is a generalization of the binomial distribution.
• The binomial distribution counts successes of an event (for example, heads in coin tosses). The parameters:
  – N (number of trials)
  – p (the probability of success of the event)
• The multinomial counts the number of occurrences of each of a set of events (for example, how many times each side of a die comes up in a set of rolls). The parameters:
  – N (number of trials)
  – θ1, ..., θk (the probability of success for each category)

Multinomial distribution

• W1, W2, ..., Wk are variables:

P(W1 = n1, ..., Wk = nk | ci, N, θ1, ..., θk) = [ N! / (n1! n2! ... nk!) ] θ1^n1 θ2^n2 ... θk^nk

where Σ_{i=1}^{k} ni = N and Σ_{i=1}^{k} θi = 1.

N! / (n1! n2! ... nk!) is the number of possible orderings of the N balls (order-invariant selections). Note that the trials are independent.

A binomial distribution is the multinomial distribution with k = 2, θ1 = p, and θ2 = 1 − θ1.
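A small sketch of this PMF, with a sanity check that k = 2 reduces to the binomial:

```python
from math import factorial, prod

def multinomial_pmf(counts, thetas):
    """P(W1=n1, ..., Wk=nk | N, theta) = N!/(n1!...nk!) * theta1^n1 * ... * thetak^nk."""
    coeff = factorial(sum(counts))
    for n in counts:
        coeff //= factorial(n)                  # multinomial coefficient stays integral
    return coeff * prod(t ** n for t, n in zip(thetas, counts))

# Sanity check: k = 2 reduces to the binomial PMF, C(4,3) * 0.7^3 * 0.3^1.
print(multinomial_pmf([3, 1], [0.7, 0.3]))      # 0.4116
```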

Model 2: Multinomial Naïve Bayes, 'bag of words' representation of text (continued)

Pr(W1 = n1, ..., Wk = nk | C = c) is modeled as

P(W1 = n1, ..., Wk = nk | c, N, θ1, ..., θk) = [ N! / (n1! n2! ... nk!) ] θ1^n1 θ2^n2 ... θk^nk

The multinomial coefficient can normally be left out in practical calculations (it does not depend on the class).


The Big Picture

• Probability: from the model (i.e., the data-generating process) to observed data.
• Estimation / learning / inference / data mining: from observed data back to the model.

But how to specify a model? Build a generative model that approximates how the data is produced.

Model 2: Multinomial Naïve Bayes, 'bag of words' representation of text

P(W1 = n1, ..., Wk = nk | c, N, θ1, ..., θk) = [ N! / (n1! n2! ... nk!) ] θ1^n1 θ2^n2 ... θk^nk

(multinomial coefficient left out in practical calculations)

Main question: WHY is this naïve? WHY is the multinomial on text naïve probabilistic modeling?

Multinomial Naïve Bayes as → a generative model that approximates how a text string is produced

• Stochastic language models: model the probability of generating strings in the language (each word in turn, following the sequential ordering in the string); commonly, all strings over dictionary Σ.
• E.g., a unigram model C_1:

word   P(word | C_1)
the    0.2
a      0.1
boy    0.01
dog    0.01
said   0.03
likes  0.02

the   boy   likes  the   dog
0.2   0.01  0.02   0.2   0.01

Multiply all five terms: P(s | C_1) = 0.2 × 0.01 × 0.02 × 0.2 × 0.01 = 0.00000008

(Adapted from Manning's text categorization tutorial)
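A sketch scoring the slide's string under model C_1; summing logs and exponentiating only at the end avoids underflow (see the underflow-prevention slide later):

```python
import math

# Unigram model C_1 with the per-word probabilities from the slide.
unigram_c1 = {"the": 0.2, "a": 0.1, "boy": 0.01, "dog": 0.01,
              "said": 0.03, "likes": 0.02}

def string_log_prob(words, model):
    """log P(s | model): each word generated independently (unigram assumption)."""
    return sum(math.log(model[w]) for w in words)

s = ["the", "boy", "likes", "the", "dog"]
print(math.exp(string_log_prob(s, unigram_c1)))  # ~0.00000008, matching the slide
```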

Multinomial Naïve Bayes as conditional stochastic language models

• Model the conditional probability of generating any string from two possible models:

word     Model C1   Model C2
the      0.2        0.2
boy      0.01       0.0001
said     0.0001     0.03
likes    0.0001     0.02
black    0.0001     0.1
dog      0.0005     0.01
garden   0.01       0.0001

String d:  dog     boy     likes   black   the
Under C1:  0.0005  0.01    0.0001  0.0001  0.2
Under C2:  0.01    0.0001  0.02    0.1     0.2

With equal priors P(C1) = P(C2): P(d | C2) P(C2) > P(d | C1) P(C1)
→ d is more likely to be from class C2.

A physical metaphor

• Colored balls are randomly drawn (with replacement).

A string-of-words model: P(sequence of balls) = P(ball 1) · P(ball 2) · P(ball 3) · P(ball 4)

[In the original slide, each factor P(·) shows a picture of a colored ball.]

Unigram language model → more general: generating a language string from a probabilistic model

• Unigram language models:

Chain rule: P(w1 w2 w3 w4) = P(w1) P(w2 | w1) P(w3 | w1 w2) P(w4 | w1 w2 w3)

NAÏVE: conditional independence at each position of the string:

P(w1 w2 w3 w4) = P(w1) P(w2) P(w3) P(w4)

• Could also be a bigram (or, more generally, n-gram) language model:

P(w1 w2 w3 w4) = P(w1) P(w2 | w1) P(w3 | w2) P(w4 | w3)

Easy. Effective!

(Adapted from Manning's text categorization tutorial)

Multinomial Naïve Bayes = a class-conditional unigram language model

• Think of Xi as the word in the i-th position of the document string.
• Effectively, the probability of each class is computed as a class-specific unigram language model.

[Graphical model: class node with children X1, X2, X3, X4, X5, X6.]

(Adapted from Manning's text categorization tutorial)


Using multinomial naïve Bayes classifiers to classify text: basic method

• Attributes are text positions; values are words.

cNB = argmax_{cj ∈ C} P(cj) ∏i P(xi | cj)
    = argmax_{cj ∈ C} P(cj) · P(x1 = "the" | cj) · ... · P(xn = "the" | cj)

• Still too many possibilities.
• Use the same parameters for a word across positions.
• The result is the bag-of-words model (over word tokens).

Multinomial naïve Bayes: classifying step

• positions ← all word positions in the current document that contain tokens found in Vocabulary
• Return cNB, where

cNB = argmax_{cj ∈ C} P(cj) ∏_{i ∈ positions} P(xi | cj)

This is equal to Pr(W1 = n1, ..., Wk = nk | C = cj), leaving out the multinomial coefficient.

Easy to implement; no need to construct the bag-of-words vector explicitly!

(Adapted from Manning's text categorization tutorial)

Underflow prevention: log space

• Multiplying lots of probabilities, which are between 0 and 1, can result in floating-point underflow.
• Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.
• The class with the highest final un-normalized log-probability score is still the most probable:

cNB = argmax_{cj ∈ C} [ log P(cj) + Σ_{i ∈ positions} log P(xi | cj) ]

• Note that the model is now just a max of a sum of weights.

(Adapted from Manning's text categorization tutorial)
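A minimal sketch of the log-space classifying step. Following the slide above, tokens absent from a class's conditional table are skipped; with smoothed training (next slides) every vocabulary word has a finite log-probability:

```python
import math

def classify_log_space(doc_tokens, log_prior, log_cond):
    """c_NB = argmax_c [ log P(c) + sum_i log P(x_i | c) ].
    log_prior: {class: log P(c)}; log_cond: {class: {word: log P(w | c)}}."""
    def score(c):
        # Sum over word positions whose token is in the vocabulary for class c.
        return log_prior[c] + sum(log_cond[c][w] for w in doc_tokens if w in log_cond[c])
    return max(log_prior, key=score)
```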



Generative model & MLE

• A language model can be seen as a probabilistic automaton for generating text strings.
• Relative-frequency estimates can be proven to be maximum likelihood estimates (MLE), since they maximize the probability that the model M will generate the training corpus T:

θ̂ = argmax_θ P(Train | M(θ))

• For multinomial naïve Bayes, P(W1 = n1, ..., Wk = nk | cj, N, θ1, ..., θk) ∝ θ1^n1 θ2^n2 ... θk^nk, with a separate parameter set {θ1, ..., θk} for each class cj.

Deriving the Maximum Likelihood Estimate for the multinomial distribution

LIKELIHOOD (a function of θ):

argmax_{θ1,...,θk} P(d1, ..., dT | θ1, ..., θk)
= argmax_{θ1,...,θk} ∏_{t=1}^{T} P(dt | θ1, ..., θk)
= argmax_{θ1,...,θk} ∏_{t=1}^{T} [ Ndt! / (n1,dt! n2,dt! ... nk,dt!) ] θ1^{n1,dt} θ2^{n2,dt} ... θk^{nk,dt}
= argmax_{θ1,...,θk} ∏_{t=1}^{T} θ1^{n1,dt} θ2^{n2,dt} ... θk^{nk,dt}

s.t. Σ_{i=1}^{k} θi = 1

Deriving the Maximum Likelihood Estimate for the multinomial distribution (continued)

argmax_{θ1,...,θk} log(L(θ))
= argmax_{θ1,...,θk} log( ∏_{t=1}^{T} θ1^{n1,dt} θ2^{n2,dt} ... θk^{nk,dt} )
= argmax_{θ1,...,θk} Σ_{t=1,...,T} n1,dt log(θ1) + Σ_{t=1,...,T} n2,dt log(θ2) + ... + Σ_{t=1,...,T} nk,dt log(θk)

s.t. Σ_{i=1}^{k} θi = 1

Constrained optimization → MLE estimator:

θi = Σ_{t=1,...,T} ni,dt / ( Σ_t n1,dt + Σ_t n2,dt + ... + Σ_t nk,dt ) = Σ_{t=1,...,T} ni,dt / Σ_{t=1,...,T} Ndt

→ i.e., we can create a mega-document by concatenating all documents d_1 to d_T, and use the relative frequency of w in the mega-document.

(How to optimize? See handout - EXTRA.)

Multinomial naïve Bayes: learning algorithm for parameter estimation with MLE

• From the training corpus, extract Vocabulary
• Calculate the required P(cj) and P(wk | cj) terms:
  – For each cj in C do:
    • docsj ← subset of documents for which the target class is cj
    • P(cj) ← |docsj| / |total # documents|
    • Textj ← a single document containing all docs in docsj concatenated; nj is the length of Textj
    • For each word wk in Vocabulary:
      – nk,j ← number of occurrences of wk in Textj
      – P(wk | cj) ← (nk,j + α) / (nj + α · |Vocabulary|), e.g., α = 1 (Laplace smoothing)

This is the (smoothed) relative frequency with which word wk appears across all documents of class cj.
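A minimal sketch of this training loop with add-α smoothing; its output plugs directly into the classify_log_space sketch from the underflow-prevention slide:

```python
from collections import Counter, defaultdict
import math

def train_multinomial_nb(docs, labels, alpha=1.0):
    """MLE with add-alpha smoothing (alpha=1 is Laplace), per the pseudocode above.
    docs: token lists. Returns log P(c_j) and log P(w_k | c_j)."""
    vocab = {w for d in docs for w in d}
    class_docs = defaultdict(list)
    for d, c in zip(docs, labels):
        class_docs[c].append(d)
    log_prior, log_cond = {}, {}
    for c, ds in class_docs.items():
        log_prior[c] = math.log(len(ds) / len(docs))  # |docs_j| / |total documents|
        counts = Counter(w for d in ds for w in d)    # word counts in Text_j (mega-document)
        n_j = sum(counts.values())                    # length of Text_j
        denom = n_j + alpha * len(vocab)
        log_cond[c] = {w: math.log((counts[w] + alpha) / denom) for w in vocab}
    return log_prior, log_cond

docs = [["grain", "wheat", "grain"], ["oilseed", "grain"], ["disk", "drive"]]
log_prior, log_cond = train_multinomial_nb(docs, ["wheat", "wheat", "tech"])
```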

Multinomial naïve Bayes: time complexity

(T: number of documents; |V|: dictionary size; |C|: number of classes)

• Training time: O(T·Ld + |C|·|V|), where Ld is the average length of a document in D.
  – Assumes V and all Di, ni, and nk,j are pre-computed in O(T·Ld) time during one pass through all of the data.
  – |C|·|V| = complexity of computing all probability values (loop over words and classes).
  – Generally just O(T·Ld), since usually |C|·|V| < T·Ld.
• Test time: O(|C|·Lt), where Lt is the average length of a test document.
• Very efficient overall: linearly proportional to the time needed just to read in all the words. Plus, robust in practice.

(Adapted from Manning's text categorization tutorial)

Parameter estimation

• Multinomial model:

P̂(Xi = w | cj) = fraction of times word w appears across all documents of topic cj

Can create a mega-document for topic j by concatenating all documents on this topic, and use the frequency of w in the mega-document.

(Adapted from Manning's text categorization tutorial)

Naïve Bayes is not so naïve

• Naïve Bayes took first and second place in the KDD-CUP 97 competition, among 16 (then) state-of-the-art algorithms. Goal: financial services industry direct-mail response prediction model: predict whether the recipient of mail will actually respond to the advertisement. 750,000 records.
• Robust to irrelevant features: irrelevant features cancel each other out without affecting results. Decision trees, in contrast, can suffer heavily from this.
• Very good in domains with many equally important features. Decision trees suffer from fragmentation in such cases, especially if there is little data.
• A good, dependable baseline for text classification (but not the best)!
• Optimal if the independence assumptions hold: if the assumed independence is correct, then it is the Bayes-optimal classifier for the problem.
• Very fast: learning requires one pass of counting over the data; testing is linear in the number of attributes and the document-collection size.
• Low storage requirements.

References

❑ Prof. Andrew Moore's review tutorial
❑ Prof. Ke Chen's NB slides
❑ Prof. Carlos Guestrin's recitation slides
❑ Prof. Raymond J. Mooney and Jimmy Lin's slides about language models
❑ Prof. Manning's text categorization tutorial