Text Classification - Princeton NLP · adventure scenes are fun... It manages to be whimsical and...
Transcript of Text Classification - Princeton NLP · adventure scenes are fun... It manages to be whimsical and...
![Page 1: Text Classification - Princeton NLP · adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend](https://reader033.fdocuments.us/reader033/viewer/2022060313/5f0b60927e708231d4303817/html5/thumbnails/1.jpg)
Text Classification
Fall 2019
COS 484: Natural Language Processing
![Page 2: Text Classification - Princeton NLP · adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend](https://reader033.fdocuments.us/reader033/viewer/2022060313/5f0b60927e708231d4303817/html5/thumbnails/2.jpg)
Why classify?
• Authorship attribution
• Language detection
• News categorization
Spam detection Sentiment analysis
![Page 3: Text Classification - Princeton NLP · adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend](https://reader033.fdocuments.us/reader033/viewer/2022060313/5f0b60927e708231d4303817/html5/thumbnails/3.jpg)
Classification: The Task
• Inputs:
• A document d
• A set of classes C = {c1, c2, c3, … , cm}
• Output:
• Predicted class c for document d
Movie was terrible
Amazing acting
Classify
Classify
Negative
Positive
![Page 4: Text Classification - Princeton NLP · adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend](https://reader033.fdocuments.us/reader033/viewer/2022060313/5f0b60927e708231d4303817/html5/thumbnails/4.jpg)
Rule-based classification
• Combinations of features on words in document, meta-data IF there exists word w in document d such that w in [good, great, extra-ordinary, …], THEN output Positive IF email address ends in [ithelpdesk.com, makemoney.com, spinthewheel.com, …] THEN output SPAM
• Can be very accurate
• Rules may be hard to define (and some even unknown to us!)
• Expensive
• Not easily generalizable
![Page 5: Text Classification - Princeton NLP · adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend](https://reader033.fdocuments.us/reader033/viewer/2022060313/5f0b60927e708231d4303817/html5/thumbnails/5.jpg)
Supervised Learning: Let’s use statistics!
• Data-driven approach
• Let the machine figure out the best patterns to use
• Inputs:
• Set of m classes C = {c1, c2, …, cm}
• Set of n ‘labeled’ documents: {(d1, c1), (d2, c2), …, (dn, cn)}
• Output:
• Trained classifier, F : d -> c
![Page 6: Text Classification - Princeton NLP · adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend](https://reader033.fdocuments.us/reader033/viewer/2022060313/5f0b60927e708231d4303817/html5/thumbnails/6.jpg)
Types of supervised classifiers
Naive Bayes Logistic regression
Support vector machines k-nearest neighbors
![Page 7: Text Classification - Princeton NLP · adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend](https://reader033.fdocuments.us/reader033/viewer/2022060313/5f0b60927e708231d4303817/html5/thumbnails/7.jpg)
Multinomial Naive Bayes
• Simple classification model making use of Bayes rule
• Bayes Rule:
• Makes strong (naive) independence assumptions
![Page 8: Text Classification - Princeton NLP · adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend](https://reader033.fdocuments.us/reader033/viewer/2022060313/5f0b60927e708231d4303817/html5/thumbnails/8.jpg)
Predicting a class
• Best class,
![Page 9: Text Classification - Princeton NLP · adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend](https://reader033.fdocuments.us/reader033/viewer/2022060313/5f0b60927e708231d4303817/html5/thumbnails/9.jpg)
How to represent P(d | c)?
• Option 1: represent the entire sequence of words
• P(w1, w2, w3, …, wk | c) (too many sequences!)
• Option 2: Bag of words
• Assume position of each word is irrelevant (both absolute and relative)
• P(w1, w2, w3, …, wk | c) = P(w1|c) P(w2|c) … P(wk|c)
• Probability of each word is conditionally independent given class c
![Page 10: Text Classification - Princeton NLP · adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend](https://reader033.fdocuments.us/reader033/viewer/2022060313/5f0b60927e708231d4303817/html5/thumbnails/10.jpg)
Bag of wordsThe%Bag%of%Words%Representation
15
it
it
itit
it
it
I
I
I
I
I
love
recommend
movie
thethe
the
the
to
to
to
and
andand
seen
seen
yet
would
with
who
whimsical
whilewhenever
times
sweet
several
scenes
satirical
romanticof
manages
humor
have
happy
fun
friend
fairy
dialogue
but
conventions
areanyone
adventure
always
again
about
I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet!
it Ithetoandseenyetwouldwhimsicaltimessweetsatiricaladventuregenrefairyhumorhavegreat…
6 54332111111111111…
it
it
itit
it
it
I
I
I
I
I
love
recommend
movie
thethe
the
the
to
to
to
and
andand
seen
seen
yet
would
with
who
whimsical
whilewhenever
times
sweet
several
scenes
satirical
romanticof
manages
humor
have
happy
fun
friend
fairy
dialogue
but
conventions
areanyone
adventure
always
again
about
I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet!
it Ithetoandseenyetwouldwhimsicaltimessweetsatiricaladventuregenrefairyhumorhavegreat…
6 54332111111111111…
it
it
itit
it
it
I
I
I
I
I
love
recommend
movie
thethe
the
the
to
to
to
and
andand
seen
seen
yet
would
with
who
whimsical
whilewhenever
times
sweet
several
scenes
satirical
romanticof
manages
humor
have
happy
fun
friend
fairy
dialogue
but
conventions
areanyone
adventure
always
again
about
I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet!
it Ithetoandseenyetwouldwhimsicaltimessweetsatiricaladventuregenrefairyhumorhavegreat…
6 54332111111111111…
![Page 11: Text Classification - Princeton NLP · adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend](https://reader033.fdocuments.us/reader033/viewer/2022060313/5f0b60927e708231d4303817/html5/thumbnails/11.jpg)
Predicting with Naive Bayes
• We now have:
![Page 12: Text Classification - Princeton NLP · adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend](https://reader033.fdocuments.us/reader033/viewer/2022060313/5f0b60927e708231d4303817/html5/thumbnails/12.jpg)
Naive Bayes as a generative classifier
![Page 13: Text Classification - Princeton NLP · adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend](https://reader033.fdocuments.us/reader033/viewer/2022060313/5f0b60927e708231d4303817/html5/thumbnails/13.jpg)
Naive Bayes as a generative model
![Page 14: Text Classification - Princeton NLP · adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend](https://reader033.fdocuments.us/reader033/viewer/2022060313/5f0b60927e708231d4303817/html5/thumbnails/14.jpg)
Naive Bayes as a generative model
![Page 15: Text Classification - Princeton NLP · adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend](https://reader033.fdocuments.us/reader033/viewer/2022060313/5f0b60927e708231d4303817/html5/thumbnails/15.jpg)
Naive Bayes as a generative model
Generate the entire data set one document at a time
![Page 16: Text Classification - Princeton NLP · adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend](https://reader033.fdocuments.us/reader033/viewer/2022060313/5f0b60927e708231d4303817/html5/thumbnails/16.jpg)
Estimating probabilities
• Maximum likelihood estimates:
•
![Page 17: Text Classification - Princeton NLP · adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend](https://reader033.fdocuments.us/reader033/viewer/2022060313/5f0b60927e708231d4303817/html5/thumbnails/17.jpg)
Data sparsity
• What is count(‘amazing’, positive) = 0?
• Implies P(‘amazing’ | positive) = 0
• Given a review document, d = “…. most amazing movie ever …”
![Page 18: Text Classification - Princeton NLP · adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend](https://reader033.fdocuments.us/reader033/viewer/2022060313/5f0b60927e708231d4303817/html5/thumbnails/18.jpg)
Solution: Smoothing!
• Laplace smoothing:
• Simple, easy to use
• Effective in practice
![Page 19: Text Classification - Princeton NLP · adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend](https://reader033.fdocuments.us/reader033/viewer/2022060313/5f0b60927e708231d4303817/html5/thumbnails/19.jpg)
Overall process
• Input: Set of annotated documents
A. Compute vocabulary V of all words
B. Calculate
C. Calculate
D. (Prediction) Given document
{(di, ci)}ni=1
d = (w1, w2, . . . , wk)
![Page 20: Text Classification - Princeton NLP · adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend](https://reader033.fdocuments.us/reader033/viewer/2022060313/5f0b60927e708231d4303817/html5/thumbnails/20.jpg)
Naive Bayes Example
Choosing%a%class:P(c|d5)$
P(j|d5)$ 1/4$*$(2/9)3 *$2/9$*$2/9$≈$0.0001
Doc Words ClassTraining 1 Chinese Beijing$Chinese c
2 Chinese$Chinese$Shanghai c3 Chinese$Macao c4 Tokyo$Japan$Chinese j
Test 5 Chinese$Chinese$Chinese$Tokyo Japan ?
41
Conditional%Probabilities:P(Chinese|c)$=P(Tokyo|c)$$$$=P(Japan|c)$$$$$=P(Chinese|j)$=P(Tokyo|j)$$$$$=P(Japan|j)$$$$$$=$
Priors:P(c)=$P(j)=$
34 1
4
P̂(w | c) = count(w,c)+1count(c)+ |V |
P̂(c) = Nc
N
(5+1)$/$(8+6)$=$6/14$=$3/7(0+1)$/$(8+6)$=$1/14
(1+1)$/$(3+6)$=$2/9$(0+1)$/$(8+6)$=$1/14
(1+1)$/$(3+6)$=$2/9$(1+1)$/$(3+6)$=$2/9$
3/4$*$(3/7)3 *$1/14$*$1/14$≈$0.0003
∝
∝
![Page 21: Text Classification - Princeton NLP · adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend](https://reader033.fdocuments.us/reader033/viewer/2022060313/5f0b60927e708231d4303817/html5/thumbnails/21.jpg)
Features
• In general, Naive Bayes can use any set of features, not just words
• URLs, email addresses, Capitalization, …
• Domain knowledge crucial to performance
Top features for
Spam detection
![Page 22: Text Classification - Princeton NLP · adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend](https://reader033.fdocuments.us/reader033/viewer/2022060313/5f0b60927e708231d4303817/html5/thumbnails/22.jpg)
Naive Bayes and Language Models
Each%class%=%a%unigram%language%model• Assigning$each$word:$P(word$|$c)• Assigning$each$sentence:$P(s|c)=Π P(word|c)
0.1 I
0.1 love
0.01 this
0.05 fun
0.1 film
…
I love this fun film
0.1 0.1 .05 0.01 0.1
Class$pos
P(s$|$pos)$=$0.0000005$
Sec.13.2.1• If features = bag of words, each class is a unigram language model!
• For class c, assigning each word: assigning sentence:
![Page 23: Text Classification - Princeton NLP · adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend](https://reader033.fdocuments.us/reader033/viewer/2022060313/5f0b60927e708231d4303817/html5/thumbnails/23.jpg)
Naive Bayes as a language modelNaïve Bayes%as%a%Language%Model
• Which$class$assigns$the$higher$probability$to$s?
0.1 I
0.1 love
0.01 this
0.05 fun
0.1 film
Model$pos Model$neg
filmlove this funI
0.10.1 0.01 0.050.10.10.001 0.01 0.0050.2
P(s|pos)$$>$$P(s|neg)
0.2 I
0.001 love
0.01 this
0.005 fun
0.1 film
Sec.13.2.1
![Page 24: Text Classification - Princeton NLP · adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend](https://reader033.fdocuments.us/reader033/viewer/2022060313/5f0b60927e708231d4303817/html5/thumbnails/24.jpg)
Evaluation
• Consider binary classification
• Table of predictions
• Ideally, we want:
Positive Negative
Positive 100 5
Negative 45 100
Positive Negative
Positive 145 0
Negative 0 105
Truth
Predicted
![Page 25: Text Classification - Princeton NLP · adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend](https://reader033.fdocuments.us/reader033/viewer/2022060313/5f0b60927e708231d4303817/html5/thumbnails/25.jpg)
Evaluation Metrics
• True positive: Predicted + and actual +
• True negative: Predicted - and actual -
• False positive: Predicted + and actual -
• False negative: Predicted - and actual +
Accuracy =TP + TN
Total=
200250
= 80 %
Positive Negative
Positive 100 5
Negative 45 100
Truth
Predicted
![Page 26: Text Classification - Princeton NLP · adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend](https://reader033.fdocuments.us/reader033/viewer/2022060313/5f0b60927e708231d4303817/html5/thumbnails/26.jpg)
Evaluation Metrics
• True positive: Predicted + and actual +
• True negative: Predicted - and actual -
• False positive: Predicted + and actual -
• False negative: Predicted - and actual +
Accuracy =TP + TN
Total=
200250
= 80 %
Positive Negative
Positive 100 5
Negative 45 100
Truth
Predicted
Positive Negative
Positive 100 25
Negative 25 100
![Page 27: Text Classification - Princeton NLP · adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend](https://reader033.fdocuments.us/reader033/viewer/2022060313/5f0b60927e708231d4303817/html5/thumbnails/27.jpg)
Precision and Recall
• Precision: % of selected classes that are correct
• Recall: % of correct items selected
Precision( + ) =TP
TP + FPPrecision( − ) =
TNTN + FN
Recall( + ) =TP
TP + FNRecall( − ) =
TNTN + FP
![Page 28: Text Classification - Princeton NLP · adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend](https://reader033.fdocuments.us/reader033/viewer/2022060313/5f0b60927e708231d4303817/html5/thumbnails/28.jpg)
F-Score
• Combined measure
• Harmonic mean of Precision and Recall
• Or more generally,
F1 =2 ⋅ Precision ⋅ RecallPrecision + Recall
Fβ =(1 + β2) ⋅ Precision ⋅ Recall
β2 ⋅ Precision + Recall
![Page 29: Text Classification - Princeton NLP · adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend](https://reader033.fdocuments.us/reader033/viewer/2022060313/5f0b60927e708231d4303817/html5/thumbnails/29.jpg)
Choosing Beta
• Which value of Beta maximizes for positive class?
A.
B.
C.
Positive Negative
Positive 200 100
Negative 50 100
Fβ =(1 + β2) ⋅ Precision ⋅ Recall
β2 ⋅ Precision + Recall
Fβ
β = 0.5
β = 1
β = 2
Truth
Predicted
![Page 30: Text Classification - Princeton NLP · adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend](https://reader033.fdocuments.us/reader033/viewer/2022060313/5f0b60927e708231d4303817/html5/thumbnails/30.jpg)
Aggregating scores
• We have Precision, Recall, F1 for each class
• How to combine them for an overall score?
• Macro-average: Compute for each class, then average
• Micro-average: Collect predictions for all classes and jointly evaluate
![Page 31: Text Classification - Princeton NLP · adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend](https://reader033.fdocuments.us/reader033/viewer/2022060313/5f0b60927e708231d4303817/html5/thumbnails/31.jpg)
Macro vs Micro average
58
MicroG vs.%MacroGAveraging:%Example
Truth:$yes
Truth:$no
Classifier:$yes 10 10
Classifier:$no 10 970
Truth:$yes
Truth:$no
Classifier:$yes 90 10
Classifier:$no 10 890
Truth:$yes
Truth:$no
Classifier:$yes 100 20
Classifier:$no 20 1860
Class$1 Class$2 Micro$Ave.$Table
Sec.$15.2.4
• Macroaveraged precision:$(0.5$+$0.9)/2$=$0.7• Microaveraged precision:$100/120$=$.83• Microaveraged score$is$dominated$by$score$on$common$classes
![Page 32: Text Classification - Princeton NLP · adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend](https://reader033.fdocuments.us/reader033/viewer/2022060313/5f0b60927e708231d4303817/html5/thumbnails/32.jpg)
Validation
• Choose a metric: Precision/Recall/F1
• Optimize for metric on Validation (aka Development) set
• Finally evaluate on ‘unseen’ test set
• Cross-validation:
• Repeatedly sample several train-val splits
• Reduces bias due to sampling errors
Train Validation Test
Train Valid
Train Valid
Train Valid
![Page 33: Text Classification - Princeton NLP · adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend](https://reader033.fdocuments.us/reader033/viewer/2022060313/5f0b60927e708231d4303817/html5/thumbnails/33.jpg)
Advantages of Naive BayesSummary:%Naive%Bayes%is%Not%So%Naive
• Very$Fast,$low$storage$requirements• Robust$to$Irrelevant$Features
Irrelevant$Features$cancel$each$other$without$affecting$results
• Very$good$in$domains$with$many$equally$important$featuresDecision$Trees$suffer$from$fragmentation in$such$cases$– especially$if$little$data
• Optimal$if$the$independence$assumptions$hold:$If$assumed$independence$is$correct,$then$it$is$the$Bayes$Optimal$Classifier$for$problem
• A$good$dependable$baseline$for$text$classification• But%we%will%see%other%classifiers%that%give%better%accuracy
![Page 34: Text Classification - Princeton NLP · adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend](https://reader033.fdocuments.us/reader033/viewer/2022060313/5f0b60927e708231d4303817/html5/thumbnails/34.jpg)
Practical Naive Bayes
• Small data sizes:
• Naive Bayes is great! (high bias)
• Rule-based classifiers might work well too
• Medium size datasets:
• More advanced classifiers might perform better (e.g. SVM, logistic regression)
• Large datasets:
• Naive Bayes becomes competitive again (although most classifiers work well)
![Page 35: Text Classification - Princeton NLP · adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend](https://reader033.fdocuments.us/reader033/viewer/2022060313/5f0b60927e708231d4303817/html5/thumbnails/35.jpg)
Failings of Naive Bayes (1)
• Independence assumptions are too strong
• XOR problem: Naive Bayes cannot learn a decision boundary
• Both variables are jointly required to predict class
![Page 36: Text Classification - Princeton NLP · adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend](https://reader033.fdocuments.us/reader033/viewer/2022060313/5f0b60927e708231d4303817/html5/thumbnails/36.jpg)
Failings of Naive Bayes (2)
• Class imbalance:
• One or more classes have more instances than others
• Data skew causes NB to prefer one class over the other
• Solution: Complement Naive Bayes (Rennie et al., 2003)
![Page 37: Text Classification - Princeton NLP · adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend](https://reader033.fdocuments.us/reader033/viewer/2022060313/5f0b60927e708231d4303817/html5/thumbnails/37.jpg)
Failings of Naive Bayes (3)
• Weight magnitude errors:
• Classes with larger weights are preferred
• 10 documents with class=MA and “Boston” occurring once each
• 10 documents with class=CA and “San Francisco” occurring once each
• New document: “Boston Boston Boston San Francisco San Francisco”
![Page 38: Text Classification - Princeton NLP · adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend](https://reader033.fdocuments.us/reader033/viewer/2022060313/5f0b60927e708231d4303817/html5/thumbnails/38.jpg)
Practical text classification
• Domain knowledge is crucial to selecting good features
• Handle class imbalance by re-weighting classes
• Use log scale operations instead of multiplying probabilities
•
Underflow%Prevention:%log%space• Multiplying$lots$of$probabilities$can$result$in$floating4point$underflow.• Since$log(xy)$=$log(x)$+$log(y)
• Better$to$sum$logs$of$probabilities$instead$of$multiplying$probabilities.• Class$with$highest$un4normalized$log$probability$score$is$still$most$probable.
• Model$is$now$just$max$of$sum$of$weights
cNB = argmaxc j∈C
logP(cj )+ logP(xi | cj )i∈positions∑
![Page 39: Text Classification - Princeton NLP · adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend](https://reader033.fdocuments.us/reader033/viewer/2022060313/5f0b60927e708231d4303817/html5/thumbnails/39.jpg)
![Page 40: Text Classification - Princeton NLP · adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend](https://reader033.fdocuments.us/reader033/viewer/2022060313/5f0b60927e708231d4303817/html5/thumbnails/40.jpg)