UVA CS 6316/4501 – Fall 2016
Machine Learning

Lecture 13: Naïve Bayes Classifier for Text Classification

Dr. Yanjun Qi
University of Virginia
Department of Computer Science

11/8/16

Where are we? → Five major sections of this course

❑ Regression (supervised)
❑ Classification (supervised)
❑ Unsupervised models
❑ Learning theory
❑ Graphical models

Where are we? → Three major sections for classification

• We can divide the large variety of classification approaches into roughly three major types:

1. Discriminative: directly estimate a decision rule/boundary (e.g., support vector machine, decision tree)
2. Generative: build a generative statistical model (e.g., naïve Bayes classifier, Bayesian networks)
3. Instance-based classifiers: use observations directly, with no models (e.g., K nearest neighbors)

A dataset for classification

• Data/points/instances/examples/samples/records: [rows]
• Features/attributes/dimensions/independent variables/covariates/predictors/regressors: [columns, except the last]
• Target/outcome/response/label/dependent variable: special column to be predicted [last column]

Output as discrete class label: C ∈ {C1, C2, ..., CL}

Today: Naïve Bayes Classifier for Text

✓ Dictionary-based vector space representation of text article
✓ Multivariate Bernoulli vs. Multinomial
✓ Multivariate Bernoulli
   § Testing
   § Training with Maximum Likelihood Estimation for estimating parameters
✓ Multinomial naïve Bayes classifier
   § Testing
   § Multinomial naïve Bayes classifier as conditional stochastic language models
   § Training with Maximum Likelihood Estimation for estimating parameters

Text document classification, e.g., spam email filtering

• Input: document D
• Output: the predicted class C, where c is from {c1, ..., cL}

• Spam filtering task: classify email as 'Spam' or 'Other'.

P(C = spam | D)

Text classification tasks

• Input: document D
• Output: the predicted class C, where c is from {c1, ..., cL}

Text classification examples:
• Classify email as 'Spam', 'Other'.
• Classify web pages as 'Student', 'Faculty', 'Other'.
• Classify news stories into topics 'Sports', 'Politics', ...
• Classify business names by industry.
• Classify movie reviews as 'Favorable', 'Unfavorable', 'Neutral'.
• ... and many more.

Text classification: examples

• Classify shipment articles into one of 93 categories.
• An example category: 'wheat'

ARGENTINE 1986/87 GRAIN/OILSEED REGISTRATIONS
BUENOS AIRES, Feb 26. Argentine grain board figures show crop registrations of grains, oilseeds and their products to February 11, in thousands of tonnes, showing those for future shipments month, 1986/87 total and 1985/86 total to February 12, 1986, in brackets:
Bread wheat prev 1,655.8, Feb 872.0, March 164.6, total 2,692.4 (4,161.0). Maize Mar 48.0, total 48.0 (nil). Sorghum nil (nil). Oilseed export registrations were: Sunflowerseed total 15.0 (7.9). Soybean May 20.0, total 20.0 (nil). The board also detailed export registrations for sub-products, as follows...

Representing text: a list of words → Dictionary

argentine, 1986, 1987, grain, oilseed, registrations, buenos, aires, feb, 26, argentine, grain, board, figures, show, crop, registrations, of, grains, oilseeds, and, their, products, to, february, 11, in, ...

Common refinements: remove stopwords, stemming, collapsing multiple occurrences of words into one, ...

‘Bagofwords’representaFonoftext

ARGENTINE1986/87GRAIN/OILSEEDREGISTRATIONSBUENOSAIRES,Feb26ArgenFnegrainboardfiguresshowcropregistraFonsofgrains,oilseedsandtheirproductsto

February11,inthousandsoftonnes,showingthoseforfutureshipmentsmonth,1986/87totaland1985/86totaltoFebruary12,1986,inbrackets:

Breadwheatprev1,655.8,Feb872.0,March164.6,total2,692.4(4,161.0).MaizeMar48.0,total48.0(nil).Sorghumnil(nil)OilseedexportregistraFonswere:Sunflowerseedtotal15.0(7.9)SoybeanMay20.0,total20.0(nil)TheboardalsodetailedexportregistraFonsforsub-products,asfollows....

grain(s) 3 oilseed(s) 2 total 3 wheat 1 maize 1 soybean 1 tonnes 1

... ...

word frequency

BagofwordrepresentaFon:

Representtextasavectorofwordfrequencies.

11/8/16 10

Dr.YanjunQi/UVACS6316/f16
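To make this concrete, here is a minimal Python sketch of the frequency representation. The tokenizer, the seven-word vocabulary, and the toy document (engineered to reproduce the counts above) are illustrative assumptions, not from the slides:

```python
from collections import Counter
import re

def bag_of_words(text, vocabulary):
    """Represent a document as a vector of word frequencies over a fixed dictionary."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # crude tokenizer; real pipelines also stem, drop stopwords
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]           # 0 for dictionary words absent from the document

vocab = ["grain", "oilseed", "total", "wheat", "maize", "soybean", "tonnes"]
doc = "bread wheat prev total grain maize total soybean grain oilseed total grain oilseed tonnes"
print(bag_of_words(doc, vocab))  # [3, 2, 3, 1, 1, 1, 1]
```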

Another 'bag of words' representation of text → each dictionary word as a Boolean

[Same 'ARGENTINE 1986/87 GRAIN/OILSEED REGISTRATIONS' article as above.]

word        Boolean
grain(s)    Yes
oilseed(s)  Yes
total       Yes
wheat       Yes
maize       Yes
Panda       No
Tiger       No
...         ...

Bag-of-words representation: represent text as a vector of Boolean word occurrences (present or absent).
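The Boolean variant records only presence or absence. A minimal sketch, under the same illustrative tokenizer assumption as the previous snippet:

```python
import re

def boolean_bag_of_words(text, vocabulary):
    """Represent a document as a vector of Booleans: does each dictionary word appear?"""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return [w in tokens for w in vocabulary]

vocab = ["grain", "oilseed", "total", "wheat", "maize", "panda", "tiger"]
print(boolean_bag_of_words("wheat total grain oilseed maize", vocab))
# [True, True, True, True, True, False, False]
```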

Bag-of-words representation of a collection of documents

X(i, j) = frequency of word j in document i

[Matrix view: rows index documents in the collection, columns index dictionary words; an extra column holds the class label C.]

Bag of words

• What simplifying assumption are we making?

We assume word order is not important.

Unknown words

• How do we handle words in the test corpus that did not occur in the training data, i.e., out-of-vocabulary (OOV) words?

• Train a model that includes an explicit symbol for an unknown word (<UNK>). Either:
  – Choose a vocabulary in advance and replace other (i.e., not-in-vocabulary) words in the training corpus with <UNK>, or
  – Replace the first occurrence of each word in the training data with <UNK>.
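A minimal sketch of the first strategy (fixing a vocabulary in advance); the min_count threshold and the toy corpus are illustrative choices, not from the slides:

```python
from collections import Counter

UNK = "<UNK>"

def build_vocab(train_docs, min_count=2):
    """Fix a vocabulary in advance: keep words seen at least min_count times."""
    counts = Counter(w for doc in train_docs for w in doc)
    return {w for w, c in counts.items() if c >= min_count}

def replace_oov(doc, vocab):
    """Map every out-of-vocabulary token to the explicit unknown-word symbol."""
    return [w if w in vocab else UNK for w in doc]

train = [["the", "boy", "likes", "the", "dog"], ["the", "dog", "barks"]]
vocab = build_vocab(train)                                 # {'the', 'dog'}
print(replace_oov(["the", "cat", "likes", "dog"], vocab))  # ['the', '<UNK>', '<UNK>', 'dog']
```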


‘Bagofwords’representaFonoftext

ARGENTINE1986/87GRAIN/OILSEEDREGISTRATIONSBUENOSAIRES,Feb26ArgenFnegrainboardfiguresshowcropregistraFonsofgrains,oilseedsandtheirproductsto

February11,inthousandsoftonnes,showingthoseforfutureshipmentsmonth,1986/87totaland1985/86totaltoFebruary12,1986,inbrackets:

Breadwheatprev1,655.8,Feb872.0,March164.6,total2,692.4(4,161.0).MaizeMar48.0,total48.0(nil).Sorghumnil(nil)OilseedexportregistraFonswere:Sunflowerseedtotal15.0(7.9)SoybeanMay20.0,total20.0(nil)TheboardalsodetailedexportregistraFonsforsub-products,asfollows....

grain(s) 3 oilseed(s) 2 total 3 wheat 1 maize 1 soybean 1 tonnes 1

... ...

word frequency

Pr(D |C = c)?

11/8/16 16

Dr.YanjunQi/UVACS6316/f16

‘Bagofwords’representaFonoftext

Pr(D |C = c)?

Pr(W1 = n1,W2 = n2 ,...,Wk = nk | C = c)

Pr(W1 = true,W2 = false...,Wk = true | C = c)

17

Dr.YanjunQi/UVACS6316/f16

TwoPreviousmodels

11/8/16

ProbabilisFcModelsoftextdocuments

Pr(D |C = c)?

Pr(W1 = n1,W2 = n2 ,...,Wk = nk | C = c)

Pr(W1 = true,W2 = false...,Wk = true | C = c)

18

Dr.YanjunQi/UVACS6316/f16

TwoPreviousmodels

11/8/16

MulFvariateBernoulliDistribuFon

MulFnomialDistribuFon

Text classification with naïve Bayes classifier

• Multinomial vs. multivariate Bernoulli?

• The multinomial model is almost always more effective in text applications!

(Adapted from Manning's text categorization tutorial)

Experiment: multinomial vs. multivariate Bernoulli

• McCallum & Nigam (1998) did some experiments to see which is better.
• Task: determine whether a university web page is {student, faculty, other_stuff}.
• Train on ~5,000 hand-labeled web pages (Cornell, Washington, U. Texas, Wisconsin).
• Crawl and classify a new site (CMU).

(Adapted from Manning's text categorization tutorial)

Multinomial vs. multivariate Bernoulli


Model 1: Multivariate Bernoulli

• One feature Xw for each word in the dictionary
• Xw = true in document d if w appears in d
• Naive Bayes assumption: given the document's topic class label, the appearance of one word in the document tells us nothing about the chances that another word appears

Model 1: Multivariate Bernoulli naïve Bayes classifier

• Conditional independence assumption: features (word presence) are independent of each other given the class variable:

Pr(W1 = true, W2 = false, ..., Wk = true | C = c)
  = P(W1 = true | C) · P(W2 = false | C) · ... · P(Wk = true | C)

(this factorization is what is naïve)

• The multivariate Bernoulli model is appropriate for binary feature variables, e.g.:

word        True/False
grain(s)    True
oilseed(s)  True
total       True
wheat       True
chemical    False
...         ...

[Graphical model: class node C = Flu with Bernoulli word features W1, ..., W6 = runny nose, sinus, fever, muscle-ache, running, ...]
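To illustrate how this factorization is used at test time, here is a toy sketch of multivariate Bernoulli scoring; the flu example's probability values are made up for illustration. Note that absent words also contribute a factor, 1 − p:

```python
import math

# Toy P(W_w = true | C = c) values for the flu example; numbers are made up.
p_word_given_class = {
    "flu":     {"fever": 0.8, "cough": 0.6, "headache": 0.5},
    "not_flu": {"fever": 0.1, "cough": 0.2, "headache": 0.3},
}
prior = {"flu": 0.3, "not_flu": 0.7}

def bernoulli_nb_log_score(present_words, c):
    """log P(c) + sum over ALL modeled words of log P(W_w = x_w | c);
    absent words contribute log(1 - p), not nothing."""
    score = math.log(prior[c])
    for w, p in p_word_given_class[c].items():
        score += math.log(p if w in present_words else 1.0 - p)
    return score

doc = {"fever", "cough"}                       # 'headache' is absent
print(max(prior, key=lambda c: bernoulli_nb_log_score(doc, c)))  # 'flu'
```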

Review: Bernoulli distribution, e.g., coin flips

• You flip a coin:
  – Head with probability p
  – Binary random variable
  – Bernoulli trial with success probability p
• You flip k coins:
  – How many heads would you expect?
  – Number of heads X: discrete random variable
  – Binomial distribution with parameters k and p

Pr(Wi = true | C = c)


Maximum Likelihood Estimation

A general statement: consider a sample set T = (X1, ..., Xn) drawn from a probability distribution P(X | θ), where θ are the parameters. If the Xi are independent, with probability density function P(Xi | θ), the joint probability of the whole set is

P(X1, ..., Xn | θ) = ∏_{i=1}^{n} P(Xi | θ)

This may be maximised with respect to θ to give the maximum likelihood estimates.

It is often convenient to work with the log of the likelihood function:

log(L(θ)) = Σ_{i=1}^{n} log(P(Xi | θ))

The idea is:
✓ assume a particular model with unknown parameters θ;
✓ we can then define the probability P(Xi | θ) of observing a given event conditional on a particular set of parameters;
✓ we have observed a set of outcomes in the real world;
✓ it is then possible to choose the set of parameters which is most likely to have produced the observed results:

θ̂ = argmax_θ P(X1, ..., Xn | θ)

This is maximum likelihood. In most cases it is both consistent and efficient, and it provides a standard against which to compare other estimation techniques.

Review: Bernoulli distribution, e.g., coin flips

• You flip n coins:
  – How many heads would you expect?
  – Head with probability p
  – Number of heads X out of n trials
  – Each trial follows a Bernoulli distribution with parameter p

Defining likelihood for Bernoulli

• Likelihood = p(data | parameter)

PMF (a function of xi):
f(xi | p) = p^{xi} (1 − p)^{1 − xi}

LIKELIHOOD (a function of p):
L(p) = ∏_{i=1}^{n} p^{xi} (1 − p)^{1 − xi} = p^x (1 − p)^{n − x}

→ e.g., for n independent tosses of a coin with unknown p.
Observed data → x heads in n trials, where x = Σ_{i=1}^{n} xi.

Deriving the Maximum Likelihood Estimate for Bernoulli

Maximize the likelihood: L(p) = p^x (1 − p)^{n − x}

Equivalently, maximize the log-likelihood: log(L(p)) = log[ p^x (1 − p)^{n − x} ]

Or minimize the negative log-likelihood: −l(p) = −log[ p^x (1 − p)^{n − x} ]

[Three plots over p ∈ [0.0, 1.0] (x-axis labeled 'potential HIV prevalences'): the likelihood, the log-likelihood, and the negative log-likelihood; all three are optimized at the same value of p.]

Deriving the Maximum Likelihood Estimate for Bernoulli

Minimize the negative log-likelihood:

−l(p) = −log(L(p)) = −log[ p^x (1 − p)^{n − x} ]
      = −log(p^x) − log((1 − p)^{n − x})
      = −x log(p) − (n − x) log(1 − p)

Deriving the Maximum Likelihood Estimate for Bernoulli

Minimize the negative log-likelihood → MLE parameter estimation:

−l(p) = −log C(n, x) − x log(p) − (n − x) log(1 − p)

Set the derivative to zero:

dl(p)/dp = 0 = x/p̂ − (n − x)/(1 − p̂)

0 = [ −x(1 − p̂) + p̂(n − x) ] / [ p̂(1 − p̂) ]
0 = −x + p̂x + p̂n − p̂x
0 = −x + p̂n
p̂ = x/n

The proportion of positives! (i.e., the relative frequency of a binary event)
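A quick numerical sanity check that p̂ = x/n maximizes the likelihood; the data (37 heads in 100 tosses) is made up:

```python
import numpy as np

# Numerical check that p_hat = x/n maximizes the Bernoulli likelihood.
n, x = 100, 37                                     # 37 heads in 100 tosses (made-up data)
p = np.linspace(0.001, 0.999, 999)                 # grid of candidate parameters
log_lik = x * np.log(p) + (n - x) * np.log(1 - p)  # log L(p), dropping the constant C(n, x)
print(p[np.argmax(log_lik)])                       # 0.37 == x / n
```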

Parameter estimation

• Multivariate Bernoulli model:

P̂(Xw = true | cj) = fraction of documents of topic cj in which word w appears

(Adapted from Manning's text categorization tutorial)
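A minimal sketch of this estimator, without smoothing; the toy corpus is illustrative:

```python
from collections import defaultdict

def estimate_bernoulli_params(docs, labels):
    """P_hat(X_w = true | c_j) = fraction of class-c_j documents in which w appears.
    docs are sets of words; a minimal sketch without smoothing."""
    n_docs = defaultdict(int)                        # documents per class
    n_docs_with_word = defaultdict(lambda: defaultdict(int))
    for words, c in zip(docs, labels):
        n_docs[c] += 1
        for w in words:                              # sets => each word counted once per document
            n_docs_with_word[c][w] += 1
    return {c: {w: k / n_docs[c] for w, k in n_docs_with_word[c].items()}
            for c in n_docs}

docs = [{"grain", "wheat"}, {"grain", "oilseed"}, {"disk", "drive"}]
params = estimate_bernoulli_params(docs, ["wheat", "wheat", "tech"])
print(params["wheat"]["grain"])  # 1.0: 'grain' appears in 2 of 2 'wheat' documents
```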


Model 2: Multinomial Naïve Bayes, 'bag of words' representation of text

word        frequency
grain(s)    3
oilseed(s)  2
total       3
wheat       1
maize       1
soybean     1
tonnes      1
...         ...

Pr(W1 = n1, ..., Wk = nk | C = c) can be represented as a multinomial distribution.

Words = like colored balls; there are K possible types of them (i.e., from a dictionary of K words).
Document = contains N words; each word occurs ni times (like a bag of N colored balls).

In a document class of 'wheat', 'grain' is more likely, whereas in a 'hard drive' shipment class the parameter for 'grain' is going to be smaller.

Multinomial distribution

• The multinomial distribution is a generalization of the binomial distribution.
• The binomial distribution counts successes of an event (for example, heads in coin tosses). The parameters:
  – N (number of trials)
  – p (the probability of success of the event)
• The multinomial counts the number of occurrences of each of a set of events (for example, how many times each side of a die comes up in a set of rolls). The parameters:
  – N (number of trials)
  – θ1, ..., θk (the probability of success for each category)

Multinomial distribution

• W1, W2, ..., Wk are variables:

P(W1 = n1, ..., Wk = nk | ci, N, θ1, ..., θk) = [ N! / (n1! n2! ... nk!) ] θ1^n1 θ2^n2 ... θk^nk

where Σ_{i=1}^{k} ni = N and Σ_{i=1}^{k} θi = 1.

N! / (n1! n2! ... nk!) is the number of possible orderings of the N balls (order-invariant selections). Note that the trials are independent.

A binomial distribution is the multinomial distribution with k = 2, θ1 = p, and θ2 = 1 − θ1.
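A small sketch of this PMF, with a sanity check that k = 2 reduces to the binomial:

```python
from math import factorial, prod

def multinomial_pmf(counts, thetas):
    """P(W1=n1, ..., Wk=nk | N, theta) = N!/(n1!...nk!) * theta1^n1 * ... * thetak^nk."""
    coeff = factorial(sum(counts))
    for n in counts:
        coeff //= factorial(n)                  # multinomial coefficient stays integral
    return coeff * prod(t ** n for t, n in zip(thetas, counts))

# Sanity check: k = 2 reduces to the binomial PMF, C(4,3) * 0.7^3 * 0.3^1.
print(multinomial_pmf([3, 1], [0.7, 0.3]))      # 0.4116
```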

Model 2: Multinomial Naïve Bayes, 'bag of words' representation of text (continued)

Pr(W1 = n1, ..., Wk = nk | C = c) is modeled as

P(W1 = n1, ..., Wk = nk | c, N, θ1, ..., θk) = [ N! / (n1! n2! ... nk!) ] θ1^n1 θ2^n2 ... θk^nk

The multinomial coefficient can normally be left out in practical calculations (it does not depend on the class).


The Big Picture

• Probability: from the model (i.e., the data-generating process) to observed data.
• Estimation / learning / inference / data mining: from observed data back to the model.

But how to specify a model? Build a generative model that approximates how the data is produced.

Model 2: Multinomial Naïve Bayes, 'bag of words' representation of text

P(W1 = n1, ..., Wk = nk | c, N, θ1, ..., θk) = [ N! / (n1! n2! ... nk!) ] θ1^n1 θ2^n2 ... θk^nk

(multinomial coefficient left out in practical calculations)

Main question: WHY is this naïve? WHY is the multinomial on text naïve probabilistic modeling?

Multinomial Naïve Bayes as → a generative model that approximates how a text string is produced

• Stochastic language models: model the probability of generating strings in the language (each word in turn, following the sequential ordering in the string); commonly, all strings over dictionary Σ.
• E.g., a unigram model C_1:

word   P(word | C_1)
the    0.2
a      0.1
boy    0.01
dog    0.01
said   0.03
likes  0.02

the   boy   likes  the   dog
0.2   0.01  0.02   0.2   0.01

Multiply all five terms: P(s | C_1) = 0.2 × 0.01 × 0.02 × 0.2 × 0.01 = 0.00000008

(Adapted from Manning's text categorization tutorial)
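A sketch scoring the slide's string under model C_1; summing logs and exponentiating only at the end avoids underflow (see the underflow-prevention slide later):

```python
import math

# Unigram model C_1 with the per-word probabilities from the slide.
unigram_c1 = {"the": 0.2, "a": 0.1, "boy": 0.01, "dog": 0.01,
              "said": 0.03, "likes": 0.02}

def string_log_prob(words, model):
    """log P(s | model): each word generated independently (unigram assumption)."""
    return sum(math.log(model[w]) for w in words)

s = ["the", "boy", "likes", "the", "dog"]
print(math.exp(string_log_prob(s, unigram_c1)))  # ~0.00000008, matching the slide
```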

Multinomial Naïve Bayes as conditional stochastic language models

• Model the conditional probability of generating any string from two possible models:

word     Model C1   Model C2
the      0.2        0.2
boy      0.01       0.0001
said     0.0001     0.03
likes    0.0001     0.02
black    0.0001     0.1
dog      0.0005     0.01
garden   0.01       0.0001

String d:  dog     boy     likes   black   the
Under C1:  0.0005  0.01    0.0001  0.0001  0.2
Under C2:  0.01    0.0001  0.02    0.1     0.2

With equal priors P(C1) = P(C2): P(d | C2) P(C2) > P(d | C1) P(C1)
→ d is more likely to be from class C2.

A physical metaphor

• Colored balls are randomly drawn (with replacement).

A string-of-words model: P(sequence of balls) = P(ball 1) · P(ball 2) · P(ball 3) · P(ball 4)

[In the original slide, each factor P(·) shows a picture of a colored ball.]

Unigram language model → more general: generating a language string from a probabilistic model

• Unigram language models:

Chain rule: P(w1 w2 w3 w4) = P(w1) P(w2 | w1) P(w3 | w1 w2) P(w4 | w1 w2 w3)

NAÏVE: conditional independence at each position of the string:

P(w1 w2 w3 w4) = P(w1) P(w2) P(w3) P(w4)

• Could also be a bigram (or, more generally, n-gram) language model:

P(w1 w2 w3 w4) = P(w1) P(w2 | w1) P(w3 | w2) P(w4 | w3)

Easy. Effective!

(Adapted from Manning's text categorization tutorial)

Multinomial Naïve Bayes = a class-conditional unigram language model

• Think of Xi as the word in the i-th position of the document string.
• Effectively, the probability of each class is computed as a class-specific unigram language model.

[Graphical model: class node with children X1, X2, X3, X4, X5, X6.]

(Adapted from Manning's text categorization tutorial)


Using multinomial naïve Bayes classifiers to classify text: basic method

• Attributes are text positions; values are words.

cNB = argmax_{cj ∈ C} P(cj) ∏i P(xi | cj)
    = argmax_{cj ∈ C} P(cj) · P(x1 = "the" | cj) · ... · P(xn = "the" | cj)

• Still too many possibilities.
• Use the same parameters for a word across positions.
• The result is the bag-of-words model (over word tokens).

Multinomial naïve Bayes: classifying step

• positions ← all word positions in the current document that contain tokens found in Vocabulary
• Return cNB, where

cNB = argmax_{cj ∈ C} P(cj) ∏_{i ∈ positions} P(xi | cj)

This is equal to Pr(W1 = n1, ..., Wk = nk | C = cj), leaving out the multinomial coefficient.

Easy to implement; no need to construct the bag-of-words vector explicitly!

(Adapted from Manning's text categorization tutorial)

Underflow prevention: log space

• Multiplying lots of probabilities, which are between 0 and 1, can result in floating-point underflow.
• Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.
• The class with the highest final un-normalized log-probability score is still the most probable:

cNB = argmax_{cj ∈ C} [ log P(cj) + Σ_{i ∈ positions} log P(xi | cj) ]

• Note that the model is now just a max of a sum of weights.

(Adapted from Manning's text categorization tutorial)
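A minimal sketch of the log-space classifying step. Following the slide above, tokens absent from a class's conditional table are skipped; with smoothed training (next slides) every vocabulary word has a finite log-probability:

```python
import math

def classify_log_space(doc_tokens, log_prior, log_cond):
    """c_NB = argmax_c [ log P(c) + sum_i log P(x_i | c) ].
    log_prior: {class: log P(c)}; log_cond: {class: {word: log P(w | c)}}."""
    def score(c):
        # Sum over word positions whose token is in the vocabulary for class c.
        return log_prior[c] + sum(log_cond[c][w] for w in doc_tokens if w in log_cond[c])
    return max(log_prior, key=score)
```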



Generative model & MLE

• A language model can be seen as a probabilistic automaton for generating text strings.
• Relative-frequency estimates can be proven to be maximum likelihood estimates (MLE), since they maximize the probability that the model M will generate the training corpus T:

θ̂ = argmax_θ P(Train | M(θ))

• For multinomial naïve Bayes, P(W1 = n1, ..., Wk = nk | cj, N, θ1, ..., θk) ∝ θ1^n1 θ2^n2 ... θk^nk, with a separate parameter set {θ1, ..., θk} for each class cj.

Deriving the Maximum Likelihood Estimate for the multinomial distribution

LIKELIHOOD (a function of θ):

argmax_{θ1,...,θk} P(d1, ..., dT | θ1, ..., θk)
= argmax_{θ1,...,θk} ∏_{t=1}^{T} P(dt | θ1, ..., θk)
= argmax_{θ1,...,θk} ∏_{t=1}^{T} [ Ndt! / (n1,dt! n2,dt! ... nk,dt!) ] θ1^{n1,dt} θ2^{n2,dt} ... θk^{nk,dt}
= argmax_{θ1,...,θk} ∏_{t=1}^{T} θ1^{n1,dt} θ2^{n2,dt} ... θk^{nk,dt}

s.t. Σ_{i=1}^{k} θi = 1

Deriving the Maximum Likelihood Estimate for the multinomial distribution (continued)

argmax_{θ1,...,θk} log(L(θ))
= argmax_{θ1,...,θk} log( ∏_{t=1}^{T} θ1^{n1,dt} θ2^{n2,dt} ... θk^{nk,dt} )
= argmax_{θ1,...,θk} Σ_{t=1,...,T} n1,dt log(θ1) + Σ_{t=1,...,T} n2,dt log(θ2) + ... + Σ_{t=1,...,T} nk,dt log(θk)

s.t. Σ_{i=1}^{k} θi = 1

Constrained optimization → MLE estimator:

θi = Σ_{t=1,...,T} ni,dt / ( Σ_t n1,dt + Σ_t n2,dt + ... + Σ_t nk,dt ) = Σ_{t=1,...,T} ni,dt / Σ_{t=1,...,T} Ndt

→ i.e., we can create a mega-document by concatenating all documents d_1 to d_T, and use the relative frequency of w in the mega-document.

(How to optimize? See handout - EXTRA.)

Multinomial naïve Bayes: learning algorithm for parameter estimation with MLE

• From the training corpus, extract Vocabulary
• Calculate the required P(cj) and P(wk | cj) terms:
  – For each cj in C do:
    • docsj ← subset of documents for which the target class is cj
    • P(cj) ← |docsj| / |total # documents|
    • Textj ← a single document containing all docs in docsj concatenated; nj is the length of Textj
    • For each word wk in Vocabulary:
      – nk,j ← number of occurrences of wk in Textj
      – P(wk | cj) ← (nk,j + α) / (nj + α · |Vocabulary|), e.g., α = 1 (Laplace smoothing)

This is the (smoothed) relative frequency with which word wk appears across all documents of class cj.
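A minimal sketch of this training loop with add-α smoothing; its output plugs directly into the classify_log_space sketch from the underflow-prevention slide:

```python
from collections import Counter, defaultdict
import math

def train_multinomial_nb(docs, labels, alpha=1.0):
    """MLE with add-alpha smoothing (alpha=1 is Laplace), per the pseudocode above.
    docs: token lists. Returns log P(c_j) and log P(w_k | c_j)."""
    vocab = {w for d in docs for w in d}
    class_docs = defaultdict(list)
    for d, c in zip(docs, labels):
        class_docs[c].append(d)
    log_prior, log_cond = {}, {}
    for c, ds in class_docs.items():
        log_prior[c] = math.log(len(ds) / len(docs))  # |docs_j| / |total documents|
        counts = Counter(w for d in ds for w in d)    # word counts in Text_j (mega-document)
        n_j = sum(counts.values())                    # length of Text_j
        denom = n_j + alpha * len(vocab)
        log_cond[c] = {w: math.log((counts[w] + alpha) / denom) for w in vocab}
    return log_prior, log_cond

docs = [["grain", "wheat", "grain"], ["oilseed", "grain"], ["disk", "drive"]]
log_prior, log_cond = train_multinomial_nb(docs, ["wheat", "wheat", "tech"])
```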

Multinomial naïve Bayes: time complexity

(T: number of documents; |V|: dictionary size; |C|: number of classes)

• Training time: O(T·Ld + |C|·|V|), where Ld is the average length of a document in D.
  – Assumes V and all Di, ni, and nk,j are pre-computed in O(T·Ld) time during one pass through all of the data.
  – |C|·|V| = complexity of computing all probability values (loop over words and classes).
  – Generally just O(T·Ld), since usually |C|·|V| < T·Ld.
• Test time: O(|C|·Lt), where Lt is the average length of a test document.
• Very efficient overall: linearly proportional to the time needed just to read in all the words. Plus, robust in practice.

(Adapted from Manning's text categorization tutorial)

Parameter estimation

• Multinomial model:

P̂(Xi = w | cj) = fraction of times word w appears across all documents of topic cj

Can create a mega-document for topic j by concatenating all documents on this topic, and use the frequency of w in the mega-document.

(Adapted from Manning's text categorization tutorial)

Naïve Bayes is not so naïve

• Naïve Bayes took first and second place in the KDD-CUP 97 competition, among 16 (then) state-of-the-art algorithms. Goal: financial services industry direct-mail response prediction model: predict whether the recipient of mail will actually respond to the advertisement. 750,000 records.
• Robust to irrelevant features: irrelevant features cancel each other out without affecting results. Decision trees, in contrast, can suffer heavily from this.
• Very good in domains with many equally important features. Decision trees suffer from fragmentation in such cases, especially if there is little data.
• A good, dependable baseline for text classification (but not the best)!
• Optimal if the independence assumptions hold: if the assumed independence is correct, then it is the Bayes-optimal classifier for the problem.
• Very fast: learning requires one pass of counting over the data; testing is linear in the number of attributes and the document-collection size.
• Low storage requirements.

References

❑ Prof. Andrew Moore's review tutorial
❑ Prof. Ke Chen's NB slides
❑ Prof. Carlos Guestrin's recitation slides
❑ Prof. Raymond J. Mooney and Jimmy Lin's slides about language models
❑ Prof. Manning's text categorization tutorial