Text Classification and Naïve Bayes - ecology lab, faculty.cse.tamu.edu
Text Classification and Naïve Bayes
The Task of Text Classification
Many slides are adapted from slides by Dan Jurafsky
Who wrote which Federalist papers?
• 1787-8: anonymous essays try to convince New York to ratify U.S. Constitution: Jay, Madison, Hamilton.
• Authorship of 12 of the letters in dispute
• 1963: solved by Mosteller and Wallace using Bayesian methods
James Madison, Alexander Hamilton
Male or female author?
1. By 1925 present-day Vietnam was divided into three parts under French colonial rule. The southern region embracing Saigon and the Mekong delta was the colony of Cochin-China; the central area with its imperial capital at Hue was the protectorate of Annam…
2. Clara never failed to be astonished by the extraordinary felicity of her own name. She found it hard to trust herself to the mercy of fate, which had managed over the years to convert her greatest shame into one of her greatest assets…
S. Argamon, M. Koppel, J. Fine, A. R. Shimoni, 2003. "Gender, Genre, and Writing Style in Formal Written Texts," Text, volume 23, number 3, pp. 321–346
Positive or negative movie review?
• unbelievably disappointing
• Full of zany characters and richly applied satire, and some great plot twists
• this is the greatest screwball comedy ever filmed
• It was pathetic. The worst part about it was the boxing scenes.
What is the subject of this article?
• Antagonists and Inhibitors
• Blood Supply
• Chemistry
• Drug Therapy
• Embryology
• Epidemiology
• …
MEDLINE Article → ? → MeSH Subject Category Hierarchy
Text Classification
• Assigning subject categories, topics, or genres
• Spam detection
• Authorship identification
• Age/gender identification
• Language identification
• Sentiment analysis
• …
Text Classification: definition
• Input:
– a document d
– a fixed set of classes C = {c1, c2, …, cJ}
• Output: a predicted class c ∈ C
Classification Methods: Hand-coded rules
• Rules based on combinations of words or other features
– spam: black-list-address OR ("dollars" AND "have been selected")
• Accuracy can be high
– If rules carefully refined by expert
• But building and maintaining these rules is expensive
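The spam rule above can be sketched as code. This is a minimal illustration only; `BLACKLIST` and the sample address are hypothetical, not from the slides:

```python
# Hand-coded rule from the slide:
# spam if sender is black-listed OR ("dollars" AND "have been selected").
BLACKLIST = {"spammer@example.com"}  # hypothetical example list

def is_spam(sender: str, body: str) -> bool:
    text = body.lower()
    if sender in BLACKLIST:
        return True
    # Both phrases must co-occur for the second clause to fire.
    return "dollars" in text and "have been selected" in text
```

Rules like this can reach high accuracy, but every new spam pattern means another hand edit.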
Classification Methods: Supervised Machine Learning
• Input:
– a document d
– a fixed set of classes C = {c1, c2, …, cJ}
– A training set of m hand-labeled documents (d1,c1), …, (dm,cm)
• Output:
– a learned classifier γ: d → c
Classification Methods: Supervised Machine Learning
• Any kind of classifier
– Naïve Bayes
– Logistic regression, maxent
– Support-vector machines
– k-Nearest Neighbors
– …
Precision and recall
• Precision: % of selected items that are correct
• Recall: % of correct items that are selected

                 correct   not correct
selected           tp          fp
not selected       fn          tn
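The two definitions above can be written directly from the contingency table (a minimal sketch):

```python
# Precision and recall from the 2x2 contingency table counts.
def precision(tp: int, fp: int) -> float:
    # Fraction of selected items that are correct.
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    # Fraction of correct items that are selected.
    return tp / (tp + fn)
```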
A combined measure: F
• A combined measure that assesses the P/R tradeoff is F measure (weighted harmonic mean):

F = 1 / ( α·(1/P) + (1−α)·(1/R) ) = ( (β² + 1)·P·R ) / ( β²·P + R )

• People usually use balanced F1 measure
– i.e., with β = 1 (that is, α = ½): F = 2PR/(P+R)
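The weighted-harmonic-mean form can be sketched as follows; with the default β = 1 it reduces to the balanced F1 = 2PR/(P+R):

```python
# F measure as on the slide: F = (beta^2 + 1) * P * R / (beta^2 * P + R).
def f_measure(p: float, r: float, beta: float = 1.0) -> float:
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)
```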
Confusion matrix c
• For each pair of classes <c1,c2> how many documents from c1 were incorrectly assigned to c2?
– c3,2: 90 wheat documents incorrectly assigned to poultry
Docs in test set   Assigned UK   Assigned poultry   Assigned wheat   Assigned coffee   Assigned interest   Assigned trade
True UK                95              1                 13                0                  1                  0
True poultry            0              1                  0                0                  0                  0
True wheat             10             90                  0                1                  0                  0
True coffee             0              0                  0               34                  3                  7
True interest           -              1                  2               13                 26                  5
True trade              0              0                  2               14                  5                 10
Per class evaluation measures
Recall: Fraction of docs in class i classified correctly:
    c_ii / Σ_j c_ij
Precision: Fraction of docs assigned class i that are actually about class i:
    c_ii / Σ_j c_ji
Accuracy (1 − error rate): Fraction of docs classified correctly:
    Σ_i c_ii / ( Σ_i Σ_j c_ij )
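The per-class formulas translate directly to code over a confusion matrix c, where c[i][j] counts docs of true class i assigned class j (a minimal sketch):

```python
# Per-class recall/precision and overall accuracy from a confusion matrix.
def recall_i(c, i):
    # c_ii over the row sum: docs of class i classified correctly.
    return c[i][i] / sum(c[i][j] for j in range(len(c)))

def precision_i(c, i):
    # c_ii over the column sum: docs assigned class i that are truly class i.
    return c[i][i] / sum(c[j][i] for j in range(len(c)))

def accuracy(c):
    n = len(c)
    correct = sum(c[i][i] for i in range(n))
    total = sum(c[i][j] for i in range(n) for j in range(n))
    return correct / total
```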
Sec. 15.2.4
Micro- vs. Macro-Averaging
– If we have more than one class, how do we combine multiple performance measures into one quantity?
• Macroaveraging: Compute performance for each class, then average. Average on classes
• Microaveraging: Collect decisions for each instance from all classes, compute contingency table, evaluate. Average on instances
Sec. 15.2.4
Micro- vs. Macro-Averaging: Example

Class 1              Truth: yes   Truth: no
Classifier: yes          10           10
Classifier: no           10          970

Class 2              Truth: yes   Truth: no
Classifier: yes          90           10
Classifier: no           10          890

Micro Ave. Table     Truth: yes   Truth: no
Classifier: yes         100           20
Classifier: no           20         1860
Sec. 15.2.4
• Macroaveraged precision: (0.5 + 0.9)/2 = 0.7
• Microaveraged precision: 100/120 = .83
• Microaveraged score is dominated by score on common classes
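The arithmetic above can be checked with a short sketch, reading the (tp, fp) pairs off the "Classifier: yes" rows of the two class tables:

```python
# (tp, fp) for each class, from the "Classifier: yes" rows above.
tables = [(10, 10), (90, 10)]  # class 1, class 2

# Macroaveraging: per-class precision, then the mean over classes.
macro_p = sum(tp / (tp + fp) for tp, fp in tables) / len(tables)

# Microaveraging: pool counts into one table, then compute precision once.
tp_sum = sum(tp for tp, _ in tables)  # 100
fp_sum = sum(fp for _, fp in tables)  # 20
micro_p = tp_sum / (tp_sum + fp_sum)  # 100/120
```

The micro average sits near the large class's 0.9 because that class contributes far more decisions.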
Development Test Sets and Cross-validation
• Metric: P/R/F1 or Accuracy
• Unseen test set
– avoid overfitting ('tuning to the test set')
– more conservative estimate of performance
– Cross-validation over multiple splits
• Handle sampling errors from different datasets
– Pool results over each split
– Compute pooled dev set performance

Training Set | Dev Test | Test Set  (in cross-validation the Dev Test slice rotates across splits; the Test Set stays held out)
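The rotating splits can be sketched as simple k-fold index generation (a minimal illustration, assuming n is divisible by k; real setups also shuffle and stratify):

```python
# k-fold cross-validation splits: each fold serves once as the dev set
# while the remaining indices form the training set.
def kfold_indices(n: int, k: int):
    fold_size = n // k
    for f in range(k):
        dev = list(range(f * fold_size, (f + 1) * fold_size))
        dev_set = set(dev)
        train = [i for i in range(n) if i not in dev_set]
        yield train, dev
```

Performance is then computed on each dev fold and pooled over the splits.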
Naïve Bayes Intuition
• Simple ("naïve") classification method based on Bayes rule
• Relies on very simple representation of document
– Bag of words
Naïve Bayes Classifier (I)

c_MAP = argmax_{c∈C} P(c | d)                MAP is "maximum a posteriori" = most likely class
      = argmax_{c∈C} P(d | c)P(c) / P(d)     Bayes Rule
      = argmax_{c∈C} P(d | c)P(c)            Dropping the denominator
Naïve Bayes Classifier (II)

c_MAP = argmax_{c∈C} P(d | c)P(c)
      = argmax_{c∈C} P(x1, x2, …, xn | c)P(c)

Document d represented as features x1..xn
Naïve Bayes Classifier (III)

c_MAP = argmax_{c∈C} P(x1, x2, …, xn | c)P(c)

How often does this class occur? We can just count the relative frequencies in a corpus.
P(x1, x2, …, xn | c) has O(|X|ⁿ·|C|) parameters: could only be estimated if a very, very large number of training examples was available.
The bag of words representation
I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun… It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet.
γ( document ) = c

Bag of words for document classification
Classes and their characteristic words:
• Machine Learning: learning, training, algorithm, shrinkage, network, …
• Planning: planning, temporal, reasoning, plan, language, …
• NLP: parser, tag, training, translation, language, …
• Garbage Collection: garbage, collection, memory, optimization, region, …
• GUI
Test document: parser, language, label, translation, … → ?
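A bag of words can be built with Python's Counter; word order is thrown away and only counts remain (a minimal sketch, not the slides' code):

```python
from collections import Counter

# Bag-of-words representation: lowercase, split on whitespace, count.
def bag_of_words(text: str) -> Counter:
    return Counter(text.lower().split())

bow = bag_of_words("Chinese Beijing Chinese")
```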
Multinomial Naïve Bayes Independence Assumptions

P(x1, x2, …, xn | c)

• Bag of Words assumption: Assume position doesn't matter
• Conditional Independence: Assume the feature probabilities P(xi|cj) are independent given the class c.

P(x1, …, xn | c) = P(x1 | c) · P(x2 | c) · P(x3 | c) · … · P(xn | c)
Applying Multinomial Naive Bayes Classifiers to Text Classification

c_NB = argmax_{cj∈C} P(cj) · ∏_{i∈positions} P(xi | cj)

positions ← all word positions in test document
Learning the Multinomial Naïve Bayes Model
• First attempt: maximum likelihood estimates
– simply use the frequencies in the data
Sec. 13.3

P̂(wi | cj) = count(wi, cj) / Σ_{w∈V} count(w, cj)

P̂(cj) = doccount(C = cj) / N_doc
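The two maximum-likelihood estimates can be sketched on a tiny hypothetical two-document corpus (the `docs` data below is illustrative, not from the slides):

```python
from collections import Counter

# Hypothetical two-document training set: (text, class) pairs.
docs = [("chinese beijing chinese", "c"), ("tokyo japan chinese", "j")]

# P-hat(c_j): fraction of training documents with class c_j.
class_counts = Counter(c for _, c in docs)

# Per-class word counts over the class's "mega-document".
word_counts = {c: Counter() for c in class_counts}
for text, c in docs:
    word_counts[c].update(text.split())

def p_class(c):
    return class_counts[c] / len(docs)

def p_word(w, c):
    # P-hat(w | c): count(w, c) over total tokens in class c.
    return word_counts[c][w] / sum(word_counts[c].values())
```

Note that any word unseen in a class gets probability 0, which is exactly the problem the next slides address.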
Parameter estimation

P̂(wi | cj) = count(wi, cj) / Σ_{w∈V} count(w, cj)
(fraction of times word wi appears among all words in documents of topic cj)

• Create mega-document for topic j by concatenating all docs in this topic
– Use frequency of w in mega-document
Problem with Maximum Likelihood
• What if we have seen no training documents with the word fantastic and classified in the topic positive (thumbs-up)?

P̂("fantastic" | positive) = count("fantastic", positive) / Σ_{w∈V} count(w, positive) = 0

• Zero probabilities cannot be conditioned away, no matter the other evidence!

c_MAP = argmax_c P̂(c) ∏_i P̂(xi | c)
Sec. 13.3
Laplace (add-1) smoothing: unknown words

Add one extra word to the vocabulary, the "unknown word" wu:

P̂(wu | c) = ( count(wu, c) + 1 ) / ( (Σ_{w∈V} count(w, c)) + |V| + 1 )
           = 1 / ( (Σ_{w∈V} count(w, c)) + |V| + 1 )

(since count(wu, c) = 0 for the unknown word)
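The smoothed estimate can be sketched directly from the formula; note the denominator is total tokens in the class plus |V| plus 1 for the unknown-word slot:

```python
from collections import Counter

# Add-1 (Laplace) smoothing with an extra "unknown word" in the vocabulary.
def smoothed_p(w, class_tokens, vocab):
    counts = Counter(class_tokens)
    denom = len(class_tokens) + len(vocab) + 1  # + |V| + 1
    return (counts[w] + 1) / denom  # unseen words get 1/denom, never 0
```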
Underflow Prevention: log space
• Multiplying lots of probabilities can result in floating-point underflow.
• Since log(xy) = log(x) + log(y)
– Better to sum logs of probabilities instead of multiplying probabilities.
• Class with highest un-normalized log probability score is still most probable.
• Model is now just max of sum of weights

c_NB = argmax_{cj∈C} [ log P(cj) + Σ_{i∈positions} log P(xi | cj) ]
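The log-space trick can be demonstrated in a few lines: a product of 1000 small probabilities underflows to 0.0 in double precision, while the sum of their logs stays finite and still orders the classes correctly:

```python
import math

# Log-space class score: log P(c) + sum of log P(x_i | c) over positions.
def log_score(prior, likelihoods):
    return math.log(prior) + sum(math.log(p) for p in likelihoods)
```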
Worked example

         Doc   Words                                  Class
Training  1    Chinese Beijing Chinese                c
          2    Chinese Chinese Shanghai               c
          3    Chinese Macao                          c
          4    Tokyo Japan Chinese                    j
Test      5    Chinese Chinese Chinese Tokyo Japan    ?

Priors: P̂(c) = Nc / N
P(c) = 3/4    P(j) = 1/4

Conditional probabilities: P̂(w | c) = ( count(w, c) + 1 ) / ( count(c) + |V| )
(with the unknown word added, the 6 vocabulary words give a denominator of count(c) + 7)
P(Chinese | c) = (5+1)/(8+7) = 6/15
P(Tokyo | c)   = (0+1)/(8+7) = 1/15
P(Japan | c)   = (0+1)/(8+7) = 1/15
P(Chinese | j) = (1+1)/(3+7) = 2/10
P(Tokyo | j)   = (1+1)/(3+7) = 2/10
P(Japan | j)   = (1+1)/(3+7) = 2/10

Choosing a class:
P(c | d5) ∝ 3/4 · (6/15)³ · 1/15 · 1/15 ≈ 0.0002
P(j | d5) ∝ 1/4 · (2/10)³ · 2/10 · 2/10 ≈ 0.00008
→ choose class c
Summary: Naive Bayes is Not So Naive
• Robust to Irrelevant Features
– Irrelevant features cancel each other without affecting results
• Very good in domains with many equally important features
– Decision Trees suffer from fragmentation in such cases, especially if little data
• Optimal if the independence assumptions hold: If assumed independence is correct, then it is the Bayes Optimal Classifier for the problem
• A good dependable baseline for text classification
– But we will see other classifiers that give better accuracy