Transcript of Dan Jurafsky's "Text Classification" slides (used in Wuwei Lan's course, SP19 3521)

Dan Jurafsky

Text Classification

• Assigning subject categories, topics, or genres
• Spam detection
• Authorship identification
• Age/gender identification
• Language identification
• Sentiment analysis
• …

Text Classification: definition

• Input:
  • a document d
  • a fixed set of classes C = {c1, c2, …, cJ}

• Output: a predicted class c ∈ C

Classification Methods: Supervised Machine Learning

• Input:
  • a document d
  • a fixed set of classes C = {c1, c2, …, cJ}
  • a training set of m hand-labeled documents (d1, c1), …, (dm, cm)

• Output:
  • a learned classifier γ: d → c

Classification Methods: Supervised Machine Learning

• Any kind of classifier:
  • Naïve Bayes
  • Logistic regression
  • Support-vector machines
  • k-Nearest Neighbors
  • …

Naïve Bayes Intuition

• Simple ("naïve") classification method based on Bayes rule
• Relies on a very simple representation of the document:
  • Bag of words

The bag of words representation

I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun… It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet.

γ(document) = c

The bag of words representation

γ(word counts) = c

great      2
love       2
recommend  1
laugh      1
happy      1
…          …
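The counts above can be produced with a few lines of Python; this is a minimal sketch in which lowercasing and whitespace splitting stand in for real tokenization, and the snippet text is illustrative.

```python
from collections import Counter

# Illustrative review snippet; a bag of words keeps only word counts
# and discards all position information.
doc = "I love this movie. It's sweet, and I love the great humor."
bag = Counter(doc.lower().split())

print(bag["love"])  # 2
```

A `Counter` is exactly a multiset of tokens, which is all the bag-of-words model needs.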

Multinomial Naïve Bayes Independence Assumptions

P(x1, x2, …, xn | c)

• Bag of Words assumption: assume position doesn't matter
• Conditional independence: assume the feature probabilities P(xi | cj) are independent given the class c:

P(x1, …, xn | c) = P(x1 | c) · P(x2 | c) · P(x3 | c) · … · P(xn | c)

Learning the Multinomial Naïve Bayes Model                                Sec. 13.3

• First attempt: maximum likelihood estimates
  • simply use the frequencies in the data

P̂(wi | cj) = count(wi, cj) / Σ_{w∈V} count(w, cj)

P̂(cj) = doccount(C = cj) / Ndoc
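These maximum-likelihood estimates are just ratios of counts; a small sketch with two made-up training documents:

```python
from collections import Counter

# Unsmoothed MLE estimates from raw counts; the two documents and
# their labels are illustrative only.
docs = [("chinese beijing chinese", "c"), ("tokyo japan chinese", "j")]

class_counts = Counter(label for _, label in docs)
word_counts = {c: Counter() for c in class_counts}
for text, label in docs:
    word_counts[label].update(text.split())

p_c = class_counts["c"] / len(docs)  # P̂(c): fraction of docs with class c
p_chinese_c = (word_counts["c"]["chinese"]
               / sum(word_counts["c"].values()))  # P̂(chinese | c) = 2/3

print(p_c, p_chinese_c)
```

Note that any word unseen in a class gets probability 0 here, which is exactly the problem smoothing fixes on the next slides.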

Multinomial Naïve Bayes: Learning

• From the training corpus, extract Vocabulary

• Calculate the P(cj) terms
  • For each cj in C do
    • docsj ← all docs with class = cj
    • P(cj) ← |docsj| / |total # documents|

• Calculate the P(wk | cj) terms
  • Textj ← single doc containing all of docsj
  • For each word wk in Vocabulary
    • nk ← # of occurrences of wk in Textj
    • P(wk | cj) ← (nk + α) / (n + α·|Vocabulary|), where n is the total number of word tokens in Textj
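The learning procedure above can be sketched directly in Python; a minimal trainer with add-α smoothing, where the function name and the example documents are illustrative.

```python
from collections import Counter

def train_nb(docs, alpha=1.0):
    """Train a multinomial Naive Bayes model with add-alpha smoothing.

    docs: list of (text, class_label) pairs; tokens are whitespace-split.
    Returns (priors, cond): P(c) and P(w | c) dictionaries.
    """
    classes = {label for _, label in docs}
    vocab = {w for text, _ in docs for w in text.split()}
    priors, cond = {}, {}
    for c in classes:
        class_texts = [text for text, label in docs if label == c]
        priors[c] = len(class_texts) / len(docs)
        # "Textj": one mega-document holding all docs of class c
        mega = Counter(w for text in class_texts for w in text.split())
        n = sum(mega.values())  # total word tokens in class c
        cond[c] = {w: (mega[w] + alpha) / (n + alpha * len(vocab))
                   for w in vocab}
    return priors, cond

priors, cond = train_nb([("chinese beijing chinese", "c"),
                         ("tokyo japan chinese", "j")])
print(cond["c"]["chinese"])  # (2+1)/(3+4) = 3/7
```

With α = 1 this is the add-1 (Laplace) smoothing used in the worked example below.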

Choosing a class: P(c | d5)

With add-1 smoothing:

P̂(w | c) = (count(w, c) + 1) / (count(c) + |V|)
P̂(c) = Nc / N

          Doc  Words                                 Class
Training   1   Chinese Beijing Chinese               c
           2   Chinese Chinese Shanghai              c
           3   Chinese Macao                         c
           4   Tokyo Japan Chinese                   j
Test       5   Chinese Chinese Chinese Tokyo Japan   ?

Priors:
P(c) = 3/4    P(j) = 1/4

Conditional probabilities:
P(Chinese | c) = (5+1)/(8+6) = 6/14 = 3/7
P(Tokyo | c)   = (0+1)/(8+6) = 1/14
P(Japan | c)   = (0+1)/(8+6) = 1/14
P(Chinese | j) = (1+1)/(3+6) = 2/9
P(Tokyo | j)   = (1+1)/(3+6) = 2/9
P(Japan | j)   = (1+1)/(3+6) = 2/9

Choosing a class:
P(c | d5) ∝ 3/4 · (3/7)³ · 1/14 · 1/14 ≈ 0.0003
P(j | d5) ∝ 1/4 · (2/9)³ · 2/9 · 2/9 ≈ 0.0001
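The arithmetic in the worked example above is easy to check by hand in Python; a sketch that just plugs in the smoothed estimates from the slide:

```python
# Probabilities from the worked example (add-1 smoothing).
p_c, p_j = 3/4, 1/4
p_chinese_c, p_tokyo_c, p_japan_c = 3/7, 1/14, 1/14
p_chinese_j, p_tokyo_j, p_japan_j = 2/9, 2/9, 2/9

# Test doc d5 = "Chinese Chinese Chinese Tokyo Japan"
score_c = p_c * p_chinese_c**3 * p_tokyo_c * p_japan_c
score_j = p_j * p_chinese_j**3 * p_tokyo_j * p_japan_j
print(round(score_c, 4), round(score_j, 4))  # ≈ 0.0003 vs ≈ 0.0001
```

Since score_c > score_j, the classifier picks class c, matching the slide.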

Underflow Prevention: log space

• Multiplying lots of probabilities can result in floating-point underflow.
• Since log(xy) = log(x) + log(y):
  • better to sum logs of probabilities instead of multiplying probabilities.
• The class with the highest un-normalized log probability score is still the most probable.
• The model is now just a max of a sum of weights:

cNB = argmax_{cj∈C} [ log P(cj) + Σ_{i∈positions} log P(xi | cj) ]
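A minimal sketch of the log-space version, reusing the worked-example probabilities; because log is monotonic, the argmax decision is unchanged.

```python
import math

def log_score(prior, cond_probs):
    """log P(c) + sum of log P(xi | c) over the document's positions."""
    return math.log(prior) + sum(math.log(p) for p in cond_probs)

# d5 = "Chinese Chinese Chinese Tokyo Japan", probabilities from the example
score_c = log_score(3/4, [3/7] * 3 + [1/14, 1/14])
score_j = log_score(1/4, [2/9] * 5)
print(score_c > score_j)  # True: same decision, no underflow risk
```

Sums of moderate negative numbers stay well inside floating-point range even for very long documents, which is the whole point of the transformation.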

Summary: Naive Bayes is Not So Naive

• Very fast, low storage requirements
• Robust to irrelevant features
  • Irrelevant features cancel each other without affecting results
• Very good in domains with many equally important features
  • Decision trees suffer from fragmentation in such cases, especially with little data
• Optimal if the independence assumptions hold: if the assumed independence is correct, then it is the Bayes Optimal Classifier for the problem
• A good, dependable baseline for text classification
  • But we will see other classifiers that give better accuracy

Text Classification: Evaluation

The 2-by-2 contingency table

               correct   not correct
selected       tp        fp
not selected   fn        tn

Precision and recall

• Precision: % of selected items that are correct
• Recall: % of correct items that are selected

               correct   not correct
selected       tp        fp
not selected   fn        tn
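From the table, precision and recall are simple ratios; a sketch with illustrative counts:

```python
# Illustrative contingency-table counts.
tp, fp, fn, tn = 30, 10, 20, 940

precision = tp / (tp + fp)  # fraction of selected items that are correct
recall = tp / (tp + fn)     # fraction of correct items that are selected
print(precision, recall)    # 0.75 0.6
```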

A combined measure: F

• A combined measure that assesses the P/R tradeoff is the F measure (weighted harmonic mean):

F = 1 / (α·(1/P) + (1−α)·(1/R)) = (β² + 1)·P·R / (β²·P + R)

• The harmonic mean is a very conservative average; see IIR §8.3
• People usually use the balanced F1 measure
  • i.e., with β = 1 (that is, α = ½): F = 2PR/(P + R)
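The β-form of the formula is a one-liner; a sketch, with the precision and recall values chosen for illustration:

```python
def f_beta(p, r, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    return (beta**2 + 1) * p * r / (beta**2 * p + r)

# With beta = 1 this reduces to the balanced F1 = 2PR/(P+R).
print(f_beta(0.75, 0.6))  # 2*0.75*0.6/1.35 = 2/3
```

β > 1 weights recall more heavily, β < 1 weights precision more heavily.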

More Than Two Classes: Sets of binary classifiers                         Sec. 14.5

• Dealing with any-of or multivalue classification
  • A document can belong to 0, 1, or more than 1 classes.
• For each class c ∈ C
  • Build a classifier γc to distinguish c from all other classes c′ ∈ C
• Given test doc d,
  • Evaluate it for membership in each class using each γc
  • d belongs to any class for which γc returns true

More Than Two Classes: Sets of binary classifiers                         Sec. 14.5

• One-of or multinomial classification
  • Classes are mutually exclusive: each document is in exactly one class
• For each class c ∈ C
  • Build a classifier γc to distinguish c from all other classes c′ ∈ C
• Given test doc d,
  • Evaluate it for membership in each class using each γc
  • d belongs to the one class with the maximum score
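The one-of decision rule is just an argmax over per-class scores; a sketch in which the scorer functions and class names are stand-ins for trained binary classifiers.

```python
def classify_one_of(doc, scorers):
    """Pick the single class whose scorer gives doc the maximum score.

    scorers: dict mapping class name -> function(doc) -> score.
    """
    return max(scorers, key=lambda c: scorers[c](doc))

# Toy scorers: raw keyword counts stand in for real classifier scores.
scorers = {"sports": lambda d: d.count("game"),
           "politics": lambda d: d.count("vote")}
print(classify_one_of("the game was a great game", scorers))  # sports
```

For the any-of case on the previous slide, one would instead threshold each score independently and return the set of classes that pass.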

Confusion matrix c

• For each pair of classes <c1, c2>, how many documents from c1 were incorrectly assigned to c2?
  • c3,2: 90 wheat documents incorrectly assigned to poultry

Docs in test set   Assigned UK  Assigned poultry  Assigned wheat  Assigned coffee  Assigned interest  Assigned trade
True UK                 95            1                13               0                 1                 0
True poultry             0            1                 0               0                 0                 0
True wheat              10           90                 0               1                 0                 0
True coffee              0            0                 0              34                 3                 7
True interest            -            1                 2              13                26                 5
True trade               0            0                 2              14                 5                10

Per-class evaluation measures                                             Sec. 15.2.4

Recall: fraction of docs in class i classified correctly:

    cii / Σj cij

Precision: fraction of docs assigned class i that are actually about class i:

    cii / Σj cji

Accuracy (1 − error rate): fraction of docs classified correctly:

    Σi cii / Σi Σj cij
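These per-class measures fall out of the confusion matrix directly; a sketch with a small illustrative 2-class matrix, using the convention c[i][j] = docs truly in class i assigned to class j.

```python
# Illustrative confusion matrix: rows = true class, columns = assigned class.
c = [[8, 2],
     [1, 9]]

def recall(i):
    # diagonal entry over the row sum (all docs truly in class i)
    return c[i][i] / sum(c[i])

def precision(i):
    # diagonal entry over the column sum (all docs assigned class i)
    return c[i][i] / sum(row[i] for row in c)

accuracy = sum(c[i][i] for i in range(len(c))) / sum(map(sum, c))
print(recall(0), precision(0), accuracy)  # 0.8, 8/9, 0.85
```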

Micro- vs. Macro-Averaging                                                Sec. 15.2.4

• If we have more than one class, how do we combine multiple performance measures into one quantity?
• Macroaveraging: compute performance for each class, then average.
• Microaveraging: collect decisions for all classes, compute one contingency table, evaluate.

Micro- vs. Macro-Averaging: Example                                       Sec. 15.2.4

Class 1:
                 Truth: yes   Truth: no
Classifier: yes      10           10
Classifier: no       10          970

Class 2:
                 Truth: yes   Truth: no
Classifier: yes      90           10
Classifier: no       10          890

Micro Ave. Table (pooled):
                 Truth: yes   Truth: no
Classifier: yes     100           20
Classifier: no       20         1860

• Macroaveraged precision: (0.5 + 0.9)/2 = 0.7
• Microaveraged precision: 100/120 ≈ 0.83
• The microaveraged score is dominated by the score on the common classes
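The two averages in this example can be reproduced directly from the (tp, fp) counts in the tables; a minimal sketch:

```python
# (tp, fp) per class, read off the two class tables above.
class1 = (10, 10)   # precision 10/20 = 0.5
class2 = (90, 10)   # precision 90/100 = 0.9

# Macro: average the per-class precisions.
macro = sum(tp / (tp + fp) for tp, fp in (class1, class2)) / 2

# Micro: pool the counts into one table, then compute precision once.
tp_all = class1[0] + class2[0]
fp_all = class1[1] + class2[1]
micro = tp_all / (tp_all + fp_all)

print(macro, micro)  # macro ≈ 0.7, micro = 100/120 ≈ 0.83
```

Class 2 has ten times the positives of class 1, so its strong precision dominates the micro average but counts only once in the macro average.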

Development Test Sets and Cross-validation

• Metric: P/R/F1 or accuracy
• Unseen test set
  • avoids overfitting ("tuning to the test set")
  • more conservative estimate of performance
• Cross-validation over multiple splits
  • handles sampling errors from different datasets
  • pool results over each split
  • compute pooled dev set performance

[Figure: the data is divided into a training set, a development test set, and a held-out test set; under cross-validation, the training/dev split rotates across multiple folds while the test set stays untouched.]
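The rotating train/dev splits in the figure can be sketched as a simple k-fold generator; the function name and the toy data are illustrative, and a real setup would shuffle the data first and keep a separate held-out test set.

```python
def kfold(items, k=3):
    """Yield (train, dev) splits: each fold serves once as the dev set."""
    folds = [items[i::k] for i in range(k)]  # round-robin assignment
    for i in range(k):
        dev = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, dev

data = list(range(6))
for train, dev in kfold(data):
    print(dev, train)
```

Per-split dev results are then pooled into one overall performance estimate, as described above.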