Transcript of Dr. Yanjun Qi / UVA CS 6316/4501 – Fall 2016 Machine Learning
UVA CS 6316/4501 – Fall 2016
Machine Learning
Lecture 13: Naïve Bayes Classifier for Text Classification
Dr. Yanjun Qi
University of Virginia, Department of Computer Science
11/8/16
Where are we? → Five major sections of this course
• Regression (supervised)
• Classification (supervised)
• Unsupervised models
• Learning theory
• Graphical models
Where are we? → Three major sections for classification
• We can divide the large variety of classification approaches into roughly three major types:
1. Discriminative: directly estimate a decision rule/boundary (e.g., support vector machine, decision tree)
2. Generative: build a generative statistical model (e.g., naïve Bayes classifier, Bayesian networks)
3. Instance-based classifiers: use the observations directly, with no model (e.g., k-nearest neighbors)
A Dataset for classification
• Data/points/instances/examples/samples/records: [rows]
• Features/attributes/dimensions/independent variables/covariates/predictors/regressors: [columns, except the last]
• Target/outcome/response/label/dependent variable: special column to be predicted [last column]
• Output is a discrete class label: C1, C2, ..., CL
Today: Naïve Bayes Classifier for Text
• Dictionary-based vector space representation of a text article
• Multivariate Bernoulli vs. Multinomial
• Multivariate Bernoulli
  - Testing
  - Training with Maximum Likelihood Estimation for estimating parameters
• Multinomial naïve Bayes classifier
  - Testing
  - Multinomial naïve Bayes classifier as conditional stochastic language models
  - Training with Maximum Likelihood Estimation for estimating parameters
Text document classification, e.g., spam email filtering
• Input: document D
• Output: the predicted class C, where c is from {c1, ..., cL}
• Spam filtering task: classify email as 'Spam' or 'Other'.
P(C = spam | D)
Text classification tasks
• Input: document D
• Output: the predicted class C, where c is from {c1, ..., cL}
Text classification examples:
• Classify email as 'Spam' or 'Other'.
• Classify web pages as 'Student', 'Faculty', 'Other'.
• Classify news stories into topics: 'Sports', 'Politics', ...
• Classify business names by industry.
• Classify movie reviews as 'Favorable', 'Unfavorable', 'Neutral'.
• ... and many more.
Text classification: examples
• Classify shipment articles into one of 93 categories.
• An example category: 'wheat'

ARGENTINE 1986/87 GRAIN/OILSEED REGISTRATIONS
BUENOS AIRES, Feb 26: Argentine grain board figures show crop registrations of grains, oilseeds and their products to February 11, in thousands of tonnes, showing those for future shipments month, 1986/87 total and 1985/86 total to February 12, 1986, in brackets:
Bread wheat prev 1,655.8, Feb 872.0, March 164.6, total 2,692.4 (4,161.0). Maize Mar 48.0, total 48.0 (nil). Sorghum nil (nil). Oilseed export registrations were: Sunflower seed total 15.0 (7.9). Soybean May 20.0, total 20.0 (nil). The board also detailed export registrations for sub-products, as follows....
Representing text: a list of words → Dictionary
argentine, 1986, 1987, grain, oilseed, registrations, buenos, aires, feb, 26, argentine, grain, board, figures, show, crop, registrations, of, grains, oilseeds, and, their, products, to, february, 11, in, ...
Common refinements: remove stopwords, stemming, collapsing multiple occurrences of words into one, ...
'Bag of words' representation of text
(Same wheat article as above.)
word → frequency:
grain(s) 3, oilseed(s) 2, total 3, wheat 1, maize 1, soybean 1, tonnes 1, ...
Bag of words representation: represent text as a vector of word frequencies.
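As an illustration of the word-frequency vector above, here is a minimal Python sketch; the crude tokenizer and the toy document are assumptions for illustration, not from the slides:

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, crudely strip punctuation, and count word frequencies."""
    tokens = [w.strip(".,:()") for w in text.lower().split()]
    return Counter(t for t in tokens if t)

doc = "grain oilseed grain total wheat maize soybean grain total tonnes total"
counts = bag_of_words(doc)
print(counts["grain"], counts["total"], counts["wheat"])  # 3 3 1
```

A real pipeline would also apply the refinements mentioned earlier (stopword removal, stemming); this sketch only shows the counting step.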
Another 'Bag of words' representation of text → each dictionary word as a Boolean
(Same wheat article as above.)
word → Boolean:
grain(s) Yes, oilseed(s) Yes, total Yes, wheat Yes, maize Yes, Panda No, Tiger No, ...
Bag of words representation: represent text as a vector of Boolean word occurrences (present / absent).
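The Boolean variant keeps only presence/absence per dictionary word. A minimal sketch (the dictionary and the document words are made-up illustrations):

```python
# One Boolean feature X_w per dictionary word: True iff w appears in the document.
dictionary = ["grain", "oilseed", "total", "wheat", "maize", "panda", "tiger"]
doc_words = set("grain oilseed total wheat maize soybean".split())

bool_vector = {w: (w in doc_words) for w in dictionary}
print(bool_vector["grain"], bool_vector["panda"])  # True False
```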
Bag of words representation of document i
X(i, j) = frequency of word j in document i
(Rows: a collection of documents; columns: word j.)
Bag of words
• What simplifying assumption are we taking?
We assume word order is not important.
Unknown Words
• How to handle words in the test corpus that did not occur in the training data, i.e., out-of-vocabulary (OOV) words?
• Train a model that includes an explicit symbol for an unknown word (<UNK>). Two common options:
  - Choose a vocabulary in advance and replace other (i.e., not-in-vocabulary) words in the training corpus with <UNK>.
  - Replace the first occurrence of each word in the training data with <UNK>.
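The first OOV strategy above (fix a vocabulary in advance, map everything else to <UNK>) can be sketched as follows; choosing the vocabulary by a frequency cutoff is an assumption for illustration:

```python
from collections import Counter

train_tokens = "the boy likes the dog the dog likes the boy garden".split()

# Assumed vocabulary rule: keep words occurring at least twice in training.
counts = Counter(train_tokens)
vocab = {w for w, c in counts.items() if c >= 2} | {"<UNK>"}

def map_oov(tokens, vocab):
    """Replace any token outside the vocabulary with the <UNK> symbol."""
    return [t if t in vocab else "<UNK>" for t in tokens]

print(map_oov("the cat likes the garden".split(), vocab))
# ['the', '<UNK>', 'likes', 'the', '<UNK>']
```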
'Bag of words' representation of text
(Same wheat article and word-frequency table as above.)
How do we model Pr(D | C = c)?
Probabilistic models of text documents
How to model Pr(D | C = c)? Two models:
• Multivariate Bernoulli distribution: Pr(W1 = true, W2 = false, ..., Wk = true | C = c)
• Multinomial distribution: Pr(W1 = n1, W2 = n2, ..., Wk = nk | C = c)
Text Classification with Naïve Bayes Classifier
• Multinomial vs. multivariate Bernoulli?
• The multinomial model is almost always more effective in text applications!
(Adapted from Manning's text categorization tutorial.)
Experiment: multinomial vs. multivariate Bernoulli
• M&N (1998) (McCallum & Nigam) did some experiments to see which is better
• Determine if a university web page is {student, faculty, other_stuff}
• Train on ~5,000 hand-labeled web pages (Cornell, Washington, U. Texas, Wisconsin)
• Crawl and classify a new site (CMU)
Model 1: Multivariate Bernoulli
• One feature Xw for each word in the dictionary
• Xw = true in document d if w appears in d
• Naive Bayes assumption: given the document's topic class label, the appearance of one word in the document tells us nothing about the chances that another word appears.
Model 1: Multivariate Bernoulli Naïve Bayes Classifier
• Conditional independence assumption: features (word presence) are independent of each other given the class variable:
Pr(W1 = true, W2 = false, ..., Wk = true | C = c)
• The multivariate Bernoulli model is appropriate for binary feature variables.
word → true/false: grain(s) True, oilseed(s) True, total True, wheat True, chemical False, ...
Model 1: Multivariate Bernoulli
• Conditional independence assumption (this is what is naïve): features (word presence) are independent of each other given the class variable:
Pr(W1 = true, W2 = false, ..., Wk = true | C = c) = P(W1 = true | C) · P(W2 = false | C) · ... · P(Wk = true | C)
• The multivariate Bernoulli model is appropriate for binary feature variables.
(Graphical model on the slide: class node C = Flu with children W1 ... W6, labeled runny nose, sinus, fever, muscle ache, running, ...)
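The factorized class-conditional probability above can be computed directly. A minimal sketch with made-up per-word parameters P(Wi = true | C) for a hypothetical 'flu' class:

```python
# Assumed (hypothetical) parameters P(W_w = true | C = flu), one per feature.
p_true = {"runny_nose": 0.8, "sinus": 0.6, "fever": 0.7, "muscle_ache": 0.5}
# Observed presence pattern for one document.
presence = {"runny_nose": True, "sinus": False, "fever": True, "muscle_ache": True}

likelihood = 1.0
for w, p in p_true.items():
    # Each feature contributes P(W=true|C) if present, else P(W=false|C) = 1 - p.
    likelihood *= p if presence[w] else (1.0 - p)

print(likelihood)  # 0.8 * 0.4 * 0.7 * 0.5 = 0.112
```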
Review: Bernoulli distribution, e.g., coin flips
• You flip a coin:
  - Head with probability p
  - Binary random variable
  - Bernoulli trial with success probability p
• You flip k coins:
  - How many heads would you expect?
  - Number of heads X: discrete random variable
  - Binomial distribution with parameters k and p
Pr(Wi = true | C = c)
Maximum Likelihood Estimation: a general statement
Consider a sample set T = (X1 ... Xn) which is drawn from a probability distribution P(X | θ), where θ are parameters. If the Xi are independent, with probability density function P(Xi | θ), the joint probability of the whole set is
P(X1 ... Xn | θ) = Π_{i=1}^{n} P(Xi | θ)
and this may be maximised with respect to θ to give the maximum likelihood estimates.
It is often convenient to work with the log of the likelihood function:
log(L(θ)) = Σ_{i=1}^{n} log(P(Xi | θ))
The idea is to:
• assume a particular model with unknown parameters θ;
• define the probability P(Xi | θ) of observing a given event conditional on a particular set of parameters;
• observe a set of outcomes in the real world;
• then choose the set of parameters which is most likely to have produced the observed results:
θ̂ = argmax_θ P(X1 ... Xn | θ)
This is maximum likelihood. In most cases it is both consistent and efficient. It provides a standard to compare other estimation techniques.
Review: Bernoulli distribution, e.g., coin flips
• You flip n coins:
  - Head with probability p
  - How many heads would you expect?
  - Number of heads X out of n trials
  - Each trial follows a Bernoulli distribution with parameter p
Defining likelihood for Bernoulli
• Likelihood = p(data | parameter)
PMF (a function of xi):
f(xi | p) = p^xi (1 - p)^(1 - xi)
LIKELIHOOD (a function of p), e.g., for n independent tosses of a coin with unknown p:
L(p) = Π_{i=1}^{n} p^xi (1 - p)^(1 - xi) = p^x (1 - p)^(n - x)
Observed data → x heads up from n trials, where x = Σ_{i=1}^{n} xi.
Deriving the Maximum Likelihood Estimate for Bernoulli
• Maximize the likelihood: L(p) = p^x (1 - p)^(n - x)
• Equivalently, maximize the log-likelihood: log(L(p)) = log[ p^x (1 - p)^(n - x) ]
• Equivalently, minimize the negative log-likelihood: -l(p) = -log[ p^x (1 - p)^(n - x) ]
(Figure: the likelihood, log-likelihood, and negative log-likelihood plotted against p ∈ [0, 1], axis label 'potential HIV prevalences'; all three are optimized at the same value of p.)
Deriving the Maximum Likelihood Estimate for Bernoulli
Minimize the negative log-likelihood:
-l(p) = -log(L(p)) = -log[ p^x (1 - p)^(n - x) ]
      = -log(p^x) - log((1 - p)^(n - x))
      = -x log(p) - (n - x) log(1 - p)
Deriving the Maximum Likelihood Estimate for Bernoulli
-l(p) = -log( C(n, x) ) - x log(p) - (n - x) log(1 - p)
Set the derivative to zero (the constant term drops out):
d(-l(p))/dp = -x/p + (n - x)/(1 - p) = 0
Multiply through by p̂(1 - p̂):
0 = -x(1 - p̂) + p̂(n - x)
0 = -x + p̂x + p̂n - p̂x
0 = -x + p̂n
p̂ = x / n
The proportion of positives! Minimizing the negative log-likelihood gives the MLE parameter estimate, i.e., the relative frequency of a binary event.
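The closed form p̂ = x/n can be checked numerically by brute force over candidate p values; this grid search is just a sketch that stands in for the calculus:

```python
import math

x, n = 7, 10  # observed 7 heads in 10 flips

def neg_log_lik(p):
    """Negative log-likelihood -x*log(p) - (n-x)*log(1-p), constant dropped."""
    return -(x * math.log(p) + (n - x) * math.log(1 - p))

# Evaluate on a fine grid over (0, 1) and pick the minimizer.
grid = [i / 1000 for i in range(1, 1000)]
p_hat = min(grid, key=neg_log_lik)
print(p_hat)  # 0.7, matching x/n
```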
Parameter estimation
• Multivariate Bernoulli model:
P̂(Xw = true | cj) = fraction of documents of topic cj in which word w appears
Model 2: Multinomial Naïve Bayes - 'Bag of words' representation of text
word → frequency: grain(s) 3, oilseed(s) 2, total 3, wheat 1, maize 1, soybean 1, tonnes 1, ...
Pr(W1 = n1, ..., Wk = nk | C = c) can be represented as a multinomial distribution.
• Words are like colored balls; there are K possible types of them (i.e., from a dictionary of K words).
• A document contains N words, and word i occurs ni times (like a bag of N colored balls).
• In a document class of 'wheat', 'grain' is more likely, whereas in a 'hard drive' shipment class the parameter for 'grain' is going to be smaller.
Multinomial distribution
• The multinomial distribution is a generalization of the binomial distribution.
• The binomial distribution counts successes of an event (for example, heads in coin tosses). The parameters:
  - N (number of trials)
  - p (the probability of success of the event)
• The multinomial counts the number of occurrences of each of a set of events (for example, how many times each side of a die comes up in a set of rolls). The parameters:
  - N (number of trials)
  - θ1, ..., θk (the probability of success for each category)
Multinomial Distribution
• W1, W2, ..., Wk are variables:
P(W1 = n1, ..., Wk = nk | ci, N, θ1, ..., θk) = [ N! / (n1! n2! ... nk!) ] · θ1^n1 · θ2^n2 · ... · θk^nk
where Σ_{i=1}^{k} ni = N and Σ_{i=1}^{k} θi = 1.
• N! / (n1! ... nk!) is the number of possible orderings of the N balls (order-invariant selections). Note the events are independent.
• A binomial distribution is the multinomial distribution with k = 2, θ1 = p, and θ2 = 1 - θ1.
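The multinomial PMF above can be evaluated directly. A small sketch; the die example is an illustration, not from the slides:

```python
from math import factorial, prod

def multinomial_pmf(ns, thetas):
    """P(n1, ..., nk | N, theta) = N!/(n1!..nk!) * prod_i theta_i^{n_i}."""
    n_total = sum(ns)
    coef = factorial(n_total)
    for n_i in ns:
        coef //= factorial(n_i)  # exact at every step: partial multinomial coefficients are integers
    return coef * prod(t ** n_i for t, n_i in zip(thetas, ns))

# A fair six-sided die rolled 6 times, each face appearing exactly once:
p = multinomial_pmf([1] * 6, [1 / 6] * 6)
print(p)  # 6! / 6**6 ≈ 0.01543
```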
Model 2: Multinomial Naïve Bayes - 'Bag of words' representation of text
Pr(W1 = n1, ..., Wk = nk | C = c) can be represented as a multinomial distribution:
P(W1 = n1, ..., Wk = nk | c, N, θ1, ..., θk) = [ N! / (n1! n2! ... nk!) ] · θ1^n1 θ2^n2 ... θk^nk
The multinomial coefficient N! / (n1! ... nk!) can normally be left out in practical calculations.
The Big Picture
• A model (i.e., a data-generating process) assigns probability to observed data.
• Estimation / learning / inference / data mining goes the other way: from observed data back to the model.
• But how to specify a model? Build a generative model that approximates how data is produced.
Model 2: Multinomial Naïve Bayes - 'Bag of words' representation of text
WHY is this naïve?
Pr(W1 = n1, ..., Wk = nk | C = c) is represented as a multinomial distribution:
P(W1 = n1, ..., Wk = nk | c, N, θ1, ..., θk) = [ N! / (n1! n2! ... nk!) ] · θ1^n1 θ2^n2 ... θk^nk
(The multinomial coefficient can normally be left out in practical calculations.)
Multinomial Naïve Bayes as → a generative model that approximates how a text string is produced
• Stochastic language models: model the probability of generating strings (each word in turn, following the sequential ordering in the string) in the language (commonly, all strings over dictionary Σ).
• E.g., a unigram model C1:
  the 0.2, a 0.1, boy 0.01, dog 0.01, said 0.03, likes 0.02, ...
• For the string 'the boy likes the dog', multiply all five terms:
  0.2 × 0.01 × 0.02 × 0.2 × 0.01
  P(s | C1) = 0.00000008
(Adapted from Manning's text categorization tutorial.)
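The unigram score above can be reproduced directly; the probability table is the one on the slide:

```python
from math import prod

# Unigram model C1 from the slide.
unigram_c1 = {"the": 0.2, "a": 0.1, "boy": 0.01, "dog": 0.01,
              "said": 0.03, "likes": 0.02}

def string_prob(sentence, model):
    """Unigram language model: multiply the per-word probabilities."""
    return prod(model[w] for w in sentence.split())

p = string_prob("the boy likes the dog", unigram_c1)
print(p)  # ≈ 8e-8, the slide's P(s | C1) = 0.00000008
```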
Multinomial Naïve Bayes as Conditional Stochastic Language Models
• Model the conditional probability of generating any string from two possible models:
Model C1: the 0.2, boy 0.01, said 0.0001, likes 0.0001, black 0.0001, dog 0.0005, garden 0.01
Model C2: the 0.2, boy 0.0001, said 0.03, likes 0.02, black 0.1, dog 0.01, garden 0.0001
For the string d = 'dog boy likes black the':
  under C1: 0.0005 × 0.01 × 0.0001 × 0.0001 × 0.2
  under C2: 0.01 × 0.0001 × 0.02 × 0.1 × 0.2
With equal priors, P(C2) = P(C1), we get P(d | C2) P(C2) > P(d | C1) P(C1)
→ d is more likely to be from class C2.
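The two-model comparison can be computed directly with the slide's probability tables (equal priors assumed, as on the slide, so the priors cancel):

```python
from math import prod

model_c1 = {"the": 0.2, "boy": 0.01, "said": 0.0001, "likes": 0.0001,
            "black": 0.0001, "dog": 0.0005, "garden": 0.01}
model_c2 = {"the": 0.2, "boy": 0.0001, "said": 0.03, "likes": 0.02,
            "black": 0.1, "dog": 0.01, "garden": 0.0001}

d = "dog boy likes black the".split()
p1 = prod(model_c1[w] for w in d)  # class-conditional likelihood under C1
p2 = prod(model_c2[w] for w in d)  # class-conditional likelihood under C2
print(p2 > p1)  # True: d is more likely under C2
```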
A Physical Metaphor
• Colored balls are randomly drawn (with replacement); a string of words is modeled the same way.
• The probability of the string is the product of the probabilities of the individual draws. (The original slide shows this as P( ) = P( ) P( ) P( ) P( ), with pictures of the balls.)
Unigram language model → more generally: generating a language string from a probabilistic model
(The original slide shows each word as a picture; wi below stands in for the i-th word.)
• Chain rule (exact): P(w1 w2 w3 w4) = P(w1) P(w2 | w1) P(w3 | w1 w2) P(w4 | w1 w2 w3)
• Unigram language models: P(w1 w2 w3 w4) ≈ P(w1) P(w2) P(w3) P(w4)
  NAÏVE: conditionally independent on each position of the string. Easy. Effective!
• Could also be bigram (or generally, n-gram) language models: P(w1) P(w2 | w1) P(w3 | w2) P(w4 | w3)
Multinomial Naïve Bayes = a class-conditional unigram language model
• Think of Xi as the word in the i-th position of the document string.
• Effectively, the probability of each class is computed with a class-specific unigram language model.
(Graphical model on the slide: a class node with children X1, X2, X3, X4, X5, X6.)
Using Multinomial Naive Bayes Classifiers to Classify Text: Basic method
• Attributes are text positions, values are words:
cNB = argmax_{cj ∈ C} P(cj) · Π_i P(xi | cj)
    = argmax_{cj ∈ C} P(cj) · P(x1 = "the" | cj) · ... · P(xn = "the" | cj)
• Still too many possibilities
• Use the same parameters for a word across positions
• The result is a bag-of-words model (over word tokens)
Multinomial Naïve Bayes: Classifying Step
• positions ← all word positions in the current document which contain tokens found in Vocabulary
• Return cNB, where
cNB = argmax_{cj ∈ C} P(cj) · Π_{i ∈ positions} P(xi | cj)
• This product equals Pr(W1 = n1, ..., Wk = nk | C = cj), leaving out the multinomial coefficient.
• Easy to implement: no need to construct the bag-of-words vector explicitly!
(Adapted from Manning's text categorization tutorial.)
Underflow Prevention: log space
• Multiplying lots of probabilities, which are between 0 and 1, can result in floating-point underflow.
• Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.
• The class with the highest final un-normalized log probability score is still the most probable.
• Note that the model is now just a max of a sum of weights:
cNB = argmax_{cj ∈ C} [ log P(cj) + Σ_{i ∈ positions} log P(xi | cj) ]
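The log-space classification rule can be sketched as follows. The two-class parameters are made-up illustrations; log of zero never occurs here because all listed probabilities are positive:

```python
import math

# Assumed (hypothetical) priors and per-class word probabilities.
priors = {"spam": 0.4, "other": 0.6}
cond = {
    "spam":  {"free": 0.05,  "money": 0.04,  "meeting": 0.001},
    "other": {"free": 0.005, "money": 0.004, "meeting": 0.03},
}

def classify(tokens):
    scores = {}
    for c in priors:
        # Sum of logs instead of a product of probabilities: no underflow.
        scores[c] = math.log(priors[c]) + sum(math.log(cond[c][t]) for t in tokens)
    return max(scores, key=scores.get)

print(classify(["free", "money"]))  # spam
print(classify(["meeting"]))        # other
```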
Generative Model & MLE
• A language model can be seen as a probabilistic automaton for generating text strings.
• Relative frequency estimates can be proven to be maximum likelihood estimates (MLE), since they maximize the probability that the model M will generate the training corpus Train:
θ̂ = argmax_θ P(Train | M(θ))
• For class cj, the class-conditional likelihood is the multinomial
P(W1 = n1, ..., Wk = nk | cj, N, θ1, ..., θk) ∝ θ1^n1 θ2^n2 ... θk^nk
Deriving the Maximum Likelihood Estimate for the multinomial distribution
LIKELIHOOD (a function of θ), over training documents d1, ..., dT:
argmax_{θ1,..,θk} P(d1, ..., dT | θ1, .., θk)
= argmax_{θ1,..,θk} Π_{t=1}^{T} P(dt | θ1, .., θk)
= argmax_{θ1,..,θk} Π_{t=1}^{T} [ N_dt! / (n_{1,dt}! n_{2,dt}! .. n_{k,dt}!) ] · θ1^{n_{1,dt}} θ2^{n_{2,dt}} .. θk^{n_{k,dt}}
= argmax_{θ1,..,θk} Π_{t=1}^{T} θ1^{n_{1,dt}} θ2^{n_{2,dt}} .. θk^{n_{k,dt}}
subject to Σ_{i=1}^{k} θi = 1
Deriving the Maximum Likelihood Estimate for the multinomial distribution
argmax_{θ1,..,θk} log(L(θ))
= argmax_{θ1,..,θk} log( Π_{t=1}^{T} θ1^{n_{1,dt}} θ2^{n_{2,dt}} .. θk^{n_{k,dt}} )
= argmax_{θ1,..,θk} [ Σ_{t=1..T} n_{1,dt} log(θ1) + Σ_{t=1..T} n_{2,dt} log(θ2) + ... + Σ_{t=1..T} n_{k,dt} log(θk) ]
subject to Σ_{i=1}^{k} θi = 1
Solving this constrained optimization (how to optimize: see handout - EXTRA) gives the MLE estimator:
θi = Σ_{t=1..T} n_{i,dt} / ( Σ_{t=1..T} n_{1,dt} + Σ_{t=1..T} n_{2,dt} + ... + Σ_{t=1..T} n_{k,dt} ) = ( Σ_{t=1..T} n_{i,dt} ) / ( Σ_{t=1..T} N_dt )
→ i.e., we can create a mega-document by concatenating all documents d_1 to d_T, and use the relative frequency of w in the mega-document.
Naïve Multinomial: Learning algorithm for parameter estimation with MLE
• From the training corpus, extract Vocabulary
• Calculate the required P(cj) and P(wk | cj) terms. For each cj in C do:
  - docsj ← subset of documents for which the target class is cj
  - P(cj) ← |docsj| / |total # documents|
  - Textj ← a single document of length nj containing all docs in docsj
  - for each word wk in Vocabulary:
    - nk,j ← number of occurrences of wk in Textj
    - P(wk | cj) ← (nk,j + α) / (nj + α · |Vocabulary|), e.g., α = 1
• This is the (smoothed) relative frequency with which word wk appears across all documents of class cj.
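The learning algorithm above, with α = 1 smoothing, can be sketched on a toy two-class corpus (the documents and labels are made up for illustration):

```python
from collections import Counter

def train_multinomial_nb(docs, alpha=1.0):
    """docs: list of (token_list, class_label). Returns priors and smoothed P(w|c)."""
    vocab = {w for tokens, _ in docs for w in tokens}
    classes = {c for _, c in docs}
    priors, cond = {}, {}
    for c in classes:
        class_docs = [tokens for tokens, label in docs if label == c]
        priors[c] = len(class_docs) / len(docs)
        # Mega-document: concatenate all docs of class c, then count word occurrences.
        counts = Counter(w for tokens in class_docs for w in tokens)
        n_c = sum(counts.values())
        # Smoothed estimate: (n_{k,c} + alpha) / (n_c + alpha * |Vocabulary|).
        cond[c] = {w: (counts[w] + alpha) / (n_c + alpha * len(vocab))
                   for w in vocab}
    return priors, cond

docs = [("wheat grain grain".split(), "wheat"),
        ("drive disk drive".split(), "hardware")]
priors, cond = train_multinomial_nb(docs)
print(priors["wheat"], cond["wheat"]["grain"] > cond["hardware"]["grain"])
```

Note that the smoothing also gives every vocabulary word a nonzero probability under every class, so an unseen word cannot zero out a whole product at test time.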
Multinomial Bayes: Time Complexity
(Notation: T = number of documents, |V| = dictionary size, |C| = number of classes.)
• Training time: O(T·Ld + |C|·|V|), where Ld is the average length of a document in D.
  - Assumes V and all Di, ni, and nk,j are pre-computed in O(T·Ld) time during one pass through all of the data.
  - |C|·|V| is the complexity of computing all probability values (a loop over words and classes).
  - Generally just O(T·Ld), since usually |C|·|V| < T·Ld.
• Test time: O(|C|·Lt), where Lt is the average length of a test document.
• Very efficient overall: linearly proportional to the time needed to just read in all the words. Plus, robust in practice.
Parameter estimation
• Multinomial model:
P̂(Xi = w | cj) = fraction of times word w appears across all documents of topic cj
• We can create a mega-document for topic j by concatenating all documents on this topic, and use the frequency of w in the mega-document.
Naive Bayes is Not So Naive
• Naïve Bayes took first and second place in the KDD-CUP 97 competition, among 16 (then) state-of-the-art algorithms. Goal: a financial services industry direct mail response prediction model: predict whether the recipient of mail will actually respond to the advertisement. 750,000 records.
• Robust to irrelevant features: irrelevant features cancel each other without affecting results. Decision trees, in contrast, can heavily suffer from this.
• Very good in domains with many equally important features: decision trees suffer from fragmentation in such cases, especially if there is little data.
• A good dependable baseline for text classification (but not the best)!
• Optimal if the independence assumptions hold: if the assumed independence is correct, then it is the Bayes optimal classifier for the problem.
• Very fast: learning takes one pass of counting over the data; testing is linear in the number of attributes and document collection size.
• Low storage requirements.