A Survey on Text Categorization with Machine Learning
Transcript of A Survey on Text Categorization with Machine Learning
Page 1

A Survey on Text Categorization with Machine Learning
Dai Saito, Chikayama Lab.
Page 2

Introduction: Text Categorization

Many digital texts are available: e-mail, online news, blogs, …
The need for automatic text categorization, without human effort, is increasing: it saves both time and cost.
Page 3

Introduction: Text Categorization

Applications: spam filtering, topic categorization.
Page 4

Introduction: Machine Learning

Builds categorization rules automatically from features of the text.
Types of machine learning (ML):
Supervised learning: labeling.
Unsupervised learning: clustering.
Page 5

Introduction: Flow of ML

1. Prepare training text data with labels (features of the text)
2. Learn
3. Categorize new text

(figure: training texts labeled "Label 1" and "Label 2", and a new text marked "?")
Page 6

Outline

Introduction
Text Categorization
Feature of Text
Learning Algorithm
Conclusion
Page 7

Number of labels

Binary-label: true or false (e.g., spam or not); the other label types can be reduced to this case.
Multi-label: many labels, but one text has exactly one label.
Overlapping-label: one text can have several labels.

(figure: a binary Yes/No split; a multi-label choice among L1-L4; an overlapping assignment over L1-L4)
Page 8

Types of labels

Topic categorization: the basic task; compares individual words.
Author categorization.
Sentiment categorization: e.g., reviews of products; needs more linguistic information.
Page 9

Outline

Introduction
Text Categorization
Feature of Text
Learning Algorithm
Conclusion
Page 10

Feature of Text

How can we express the features of a text? "Bag of Words": ignore the order and structure of the words.
Ex) "I like this car." vs. "I don't like this car.": bag of words alone will not work well here.
(d: document = text, t: term = word)
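The bag-of-words idea can be sketched in a few lines of Python; the tokenizer here is deliberately naive (lowercasing and whitespace splitting only):

```python
from collections import Counter

def bag_of_words(text):
    """Map a text to term counts, ignoring word order and structure."""
    return Counter(text.lower().rstrip(".").split())

a = bag_of_words("I like this car.")
b = bag_of_words("I don't like this car.")
# The two bags differ only in the single term "don't", which is why a
# pure bag-of-words feature struggles with this pair of sentences.
```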
Page 11

Preprocessing

Remove stop words: "the", "a", "for", …
Stemming: relational -> relate, truly -> true
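Both steps can be sketched as follows; the stemmer is a toy suffix-stripper built only for the slide's two examples, not a real algorithm such as Porter's:

```python
STOP_WORDS = {"the", "a", "an", "for", "of", "to"}

def naive_stem(word):
    # Toy suffix stripping matching the slide's examples
    # ("relational" -> "relate", "truly" -> "true"); real systems use a
    # proper stemmer such as Porter's algorithm.
    for suffix, repl in (("ational", "ate"), ("uly", "ue"), ("ing", ""), ("s", "")):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)] + repl
    return word

def preprocess(tokens):
    """Drop stop words, then stem what remains."""
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]
```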
Page 12

Term Weighting

Term frequency (tf): the number of occurrences of a term in a document. Terms that are frequent in a document seem to be important for categorization.
tf·idf: terms appearing in many documents are not useful for categorization, so tf is scaled down by document frequency.
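A sketch of one common tf·idf variant, tf(t, d) · log(N / df(t)); the slide does not pin down an exact formula:

```python
import math

def tf_idf(docs):
    """Weight term t in document d as tf(t, d) * log(N / df(t))."""
    n = len(docs)
    df = {}
    for d in docs:
        for t in set(d):                     # document frequency
            df[t] = df.get(t, 0) + 1
    weights = []
    for d in docs:
        w = {}
        for t in d:                          # term frequency
            w[t] = w.get(t, 0) + 1
        for t in w:
            w[t] *= math.log(n / df[t])      # damp terms that appear everywhere
        weights.append(w)
    return weights

docs = [["apple", "apple", "the"], ["the", "market"], ["apple", "the"]]
w = tf_idf(docs)
# "the" appears in every document, so its weight is log(3/3) = 0 everywhere.
```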
Page 13

Sentiment Weighting

For sentiment classification, weight a word as positive or negative.
Constructing a sentiment dictionary with WordNet [04 Kamps et al.]: WordNet is a synonym database, and a word's polarity is estimated from its distances to 'good' and 'bad'.
Ex) d(good, happy) = 2, d(bad, happy) = 4
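The distance idea can be sketched with breadth-first search over a toy synonym graph; the edges below are invented for illustration (chosen so the slide's example distances hold) and are not real WordNet links:

```python
from collections import deque

# Toy synonym graph standing in for WordNet (edges invented for illustration).
EDGES = [("good", "glad"), ("glad", "happy"), ("glad", "fine"),
         ("fine", "mediocre"), ("mediocre", "bad")]
GRAPH = {}
for u, v in EDGES:
    GRAPH.setdefault(u, []).append(v)
    GRAPH.setdefault(v, []).append(u)

def distance(src, dst):
    """Shortest-path length between two words (BFS)."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        word, d = queue.popleft()
        if word == dst:
            return d
        for nxt in GRAPH.get(word, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None

def orientation(word):
    """Positive if the word sits closer to 'good' than to 'bad'."""
    return distance(word, "bad") - distance(word, "good")
```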
Page 14

Dimension Reduction

The size of the feature matrix is (#terms) × (#documents), and #terms ≈ the size of the dictionary.
High calculation cost.
Risk of overfitting: the best fit for the training data is not the best for real data.
Choosing effective features improves both accuracy and calculation cost.
Page 15

Dimension Reduction

df-threshold: terms appearing in very few documents (e.g., only one) are not important.
Score-based selection: score each (term, category) pair; if t and c_j are independent, the score is zero.
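The slide does not name its score function; mutual information between term presence and category membership is one common choice with exactly the stated property (zero under independence):

```python
import math

def mutual_information(n11, n10, n01, n00):
    """I(T; C) from a 2x2 contingency table: n11 = docs containing term t
    and belonging to category c_j, n10 = containing t but not in c_j, etc.
    Zero when t and c_j are independent."""
    n = n11 + n10 + n01 + n00
    total = 0.0
    for cell, row, col in ((n11, n11 + n10, n11 + n01),
                           (n10, n11 + n10, n10 + n00),
                           (n01, n01 + n00, n11 + n01),
                           (n00, n01 + n00, n10 + n00)):
        if cell:
            total += (cell / n) * math.log((cell * n) / (row * col))
    return total
```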
Page 16

Outline

Introduction
Text Categorization
Feature of Text
Learning Algorithm
Conclusion
Page 17

Learning Algorithm

Many (almost all?) ML algorithms have been used for text categorization.
Simple approaches: Naïve Bayes, k-Nearest Neighbor.
High-performance approaches: Boosting, Support Vector Machine.
Hierarchical learning.
Page 18

Naïve Bayes

Bayes' rule: P(c|d) = P(c) P(d|c) / P(d) ∝ P(c) P(d|c).
P(d|c) is hard to calculate directly.
Assumption: each term occurs independently, so P(d|c) = P(t1|c) · P(t2|c) · … · P(tn|c).
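A minimal multinomial Naïve Bayes along these lines; add-one smoothing is a standard addition not mentioned on the slide, and the tiny spam/ham corpus is invented:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label). Collect priors and per-class term counts."""
    priors, counts, totals, vocab = Counter(), defaultdict(Counter), Counter(), set()
    for tokens, label in docs:
        priors[label] += 1
        for t in tokens:
            counts[label][t] += 1
            totals[label] += 1
            vocab.add(t)
    return priors, counts, totals, vocab

def classify_nb(tokens, model):
    """argmax_c log P(c) + sum_t log P(t|c), with add-one smoothing."""
    priors, counts, totals, vocab = model
    n = sum(priors.values())
    best, best_score = None, float("-inf")
    for c in priors:
        score = math.log(priors[c] / n)
        for t in tokens:
            score += math.log((counts[c][t] + 1) / (totals[c] + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

model = train_nb([(["cheap", "pills", "buy"], "spam"),
                  (["meeting", "schedule", "today"], "ham"),
                  (["buy", "cheap", "now"], "spam"),
                  (["project", "meeting", "notes"], "ham")])
```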
Page 19

k-Nearest Neighbor

Define a "distance" (similarity) between two texts, e.g., Sim(d1, d2) = d1 · d2 / (|d1| |d2|) = cos θ.
Check the k texts with the highest similarity and categorize by majority vote (e.g., k = 3).
The larger the training data, the higher the memory and search costs.
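Cosine similarity plus majority vote can be sketched as follows; the toy training texts are invented for illustration:

```python
import math
from collections import Counter

def cosine(d1, d2):
    """Sim(d1, d2) = d1 . d2 / (|d1| |d2|) over term-count dicts."""
    dot = sum(d1[t] * d2.get(t, 0) for t in d1)
    n1 = math.sqrt(sum(v * v for v in d1.values()))
    n2 = math.sqrt(sum(v * v for v in d2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def knn_classify(query, training, k=3):
    """Majority vote over the k most similar training texts.
    Note: every query scans the whole training set, which is the
    memory/search cost the slide mentions."""
    ranked = sorted(training, key=lambda dc: cosine(query, dc[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

train = [(Counter(["win", "money"]), "spam"),
         (Counter(["win", "prize", "money"]), "spam"),
         (Counter(["meeting", "agenda"]), "ham"),
         (Counter(["agenda", "notes"]), "ham")]
```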
Page 20

Boosting

BoosTexter [00 Schapire et al.] is based on AdaBoost:
Make many "weak learners" with different parameters.
The k-th weak learner looks at the performance of learners 1..k-1 and tries to correctly classify the training data they scored worst on.
BoosTexter uses decision stumps as its weak learners.
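A minimal AdaBoost sketch with decision stumps as weak learners; this is the generic binary AdaBoost scheme, not the actual BoosTexter implementation, and the AND-shaped dataset is invented:

```python
import math

def stump_classify(x, feature, threshold, polarity):
    """A decision stump: one feature, one threshold."""
    return polarity if x[feature] > threshold else -polarity

def train_adaboost(X, y, rounds=10):
    """Each round fits the stump with the lowest weighted error, then
    raises the weight of misclassified examples so the next stump
    focuses on the currently hardest training data."""
    n, d = len(X), len(X[0])
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        best = None  # (error, feature, threshold, polarity)
        for f in range(d):
            for thr in sorted({x[f] for x in X}):
                for pol in (1, -1):
                    err = sum(wi for xi, yi, wi in zip(X, y, w)
                              if stump_classify(xi, f, thr, pol) != yi)
                    if best is None or err < best[0]:
                        best = (err, f, thr, pol)
        err, f, thr, pol = best
        err = max(err, 1e-10)
        if err >= 0.5:
            break
        alpha = 0.5 * math.log((1 - err) / err)
        # Reweight: misclassified examples get more weight next round.
        w = [wi * math.exp(-alpha * yi * stump_classify(xi, f, thr, pol))
             for xi, yi, wi in zip(X, y, w)]
        z = sum(w)
        w = [wi / z for wi in w]
        ensemble.append((alpha, f, thr, pol))
    return ensemble

def predict(ensemble, x):
    s = sum(a * stump_classify(x, f, thr, pol) for a, f, thr, pol in ensemble)
    return 1 if s >= 0 else -1

# No single stump separates this AND-shaped data, but three rounds do.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [-1, -1, -1, 1]
ensemble = train_adaboost(X, y, rounds=3)
```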
Page 21

Simple example of Boosting

(figure: the same set of + and - points shown over three boosting rounds, 1. to 3.; each round's weak learner draws a new split and the points it misclassifies are emphasized for the next round)
Page 22

Support Vector Machine

Text Categorization with SVM [98 Joachims].
Maximize the margin: find the separating hyperplane farthest from the nearest training examples.
Page 23

Text Categorization with SVM

SVM works well for text categorization:
Robustness to high dimensionality.
Robustness to overfitting.
Most text categorization problems are linearly separable: all of OHSUMED (a MEDLINE collection) and most of Reuters-21578 (a news collection).
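Margin maximization can be sketched with Pegasos-style subgradient descent on the primal SVM objective; this illustrates the idea in general and is not Joachims' SVMlight, and the four training points are invented:

```python
def train_linear_svm(X, y, lam=0.01, epochs=200):
    """Subgradient descent on lam/2 * |w|^2 + mean_i max(0, 1 - y_i (w.x_i + b)).
    A minimal sketch: shrink w (regularization), then push examples that
    fall inside the margin to the correct side."""
    d = len(X[0])
    w, b, t = [0.0] * d, 0.0, 0
    for _ in range(epochs):
        for i in range(len(X)):
            t += 1
            eta = 1.0 / (lam * t)                      # decaying step size
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            w = [wj * (1 - eta * lam) for wj in w]     # regularization step
            if margin < 1:                             # inside the margin
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
                b += eta * y[i]
    return w, b

def svm_predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

X = [[2, 2], [1, 2], [-1, -1], [-2, -1]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
```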
Page 24

Comparison of these methods

[02 Sebastiani], on Reuters-21578 (2 versions; the difference is the number of categories):

| Method      | Ver.1 (90) | Ver.2 (10) |
|-------------|------------|------------|
| k-NN        | .860       | .823       |
| Naïve Bayes | .795       | .815       |
| Boosting    | .878       | -          |
| SVM         | .870       | .920       |
Page 25

Hierarchical Learning

TreeBoost [06 Esuli et al.]: a boosting algorithm for hierarchical labels.
Training data: the label hierarchy plus texts with labels.
Applies AdaBoost recursively down the hierarchy.
A better classifier than 'flat' AdaBoost: accuracy up 2-3%, and both training and categorization time go down.
Hierarchical SVM [04 Cai et al.].
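Top-down hierarchical categorization can be sketched as recursive routing over the label tree; the per-node "classifiers" here are toy keyword rules standing in for boosted or SVM classifiers, and the tree and keywords are invented:

```python
# Each internal node routes a document to one child; applied recursively,
# this yields a leaf label, in the spirit of TreeBoost / hierarchical SVM.
TREE = {
    "root": ["sports", "science"],
    "science": ["physics", "biology"],
}
KEYWORDS = {
    "sports": {"game", "team", "score"},
    "science": {"theory", "cell", "quantum"},
    "physics": {"quantum"},
    "biology": {"cell"},
}

def route(tokens, node="root"):
    children = TREE.get(node)
    if not children:
        return node  # reached a leaf label
    # Toy per-node classifier: pick the child with the most keyword overlap.
    best = max(children, key=lambda c: len(KEYWORDS[c] & set(tokens)))
    return route(tokens, best)
```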
Page 26

TreeBoost: example label hierarchy

root
  L1
    L11
    L12
  L2
  L3
  L4
    L41
    L42
      L421
      L422
    L43
Page 27

Outline

Introduction
Text Categorization
Feature of Text
Learning Algorithm
Conclusion
Page 28

Conclusion

An overview of text categorization with machine learning: features of text and learning algorithms.
Future work: natural language processing with machine learning, especially in Japanese; calculation cost.