![Page 1: Text Document Categorization by Term Association Maria-luiza Antonie Osmar R. Zaiane University of Alberta, Canada 2002 IEEE International Conference on.](https://reader035.fdocuments.us/reader035/viewer/2022062217/5697bfdf1a28abf838cb2b83/html5/thumbnails/1.jpg)
Text Document Categorization by Term Association
Maria-Luiza Antonie, Osmar R. Zaiane, University of Alberta, Canada
2002 IEEE International Conference on Data Mining (ICDM’02)
Presentation by Yu-Kai Lin
![Page 2: Text Document Categorization by Term Association Maria-luiza Antonie Osmar R. Zaiane University of Alberta, Canada 2002 IEEE International Conference on.](https://reader035.fdocuments.us/reader035/viewer/2022062217/5697bfdf1a28abf838cb2b83/html5/thumbnails/2.jpg)
Outline
Introduction
Related Work
Building an Associative Text Classifier
Experimental Results
Conclusion
![Page 3: Text Document Categorization by Term Association Maria-luiza Antonie Osmar R. Zaiane University of Alberta, Canada 2002 IEEE International Conference on.](https://reader035.fdocuments.us/reader035/viewer/2022062217/5697bfdf1a28abf838cb2b83/html5/thumbnails/3.jpg)
Introduction
Text categorization is a necessity because of the very large number of text documents that we have to deal with daily.
A text categorization system can be used to index documents in support of information retrieval tasks, as well as to classify e-mails, memos, or web pages in a Yahoo-like manner.
![Page 4: Text Document Categorization by Term Association Maria-luiza Antonie Osmar R. Zaiane University of Alberta, Canada 2002 IEEE International Conference on.](https://reader035.fdocuments.us/reader035/viewer/2022062217/5697bfdf1a28abf838cb2b83/html5/thumbnails/4.jpg)
Introduction (cont.)
The data classification process:
(a) Learning: Training data are analyzed by a classification algorithm, and the learned model is represented in the form of classification rules. (Figure 1)
(b) Classification: Test data are used to estimate the accuracy of the classification rules, which are then applied to new data. (Figure 2)
![Page 5: Text Document Categorization by Term Association Maria-luiza Antonie Osmar R. Zaiane University of Alberta, Canada 2002 IEEE International Conference on.](https://reader035.fdocuments.us/reader035/viewer/2022062217/5697bfdf1a28abf838cb2b83/html5/thumbnails/5.jpg)
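Figure 1

Training data:

| name | age | income | credit_rating |
|---|---|---|---|
| Jones | <= 30 | Low | Fair |
| Bill Lee | <= 30 | Low | Excellent |
| Fox | 31..40 | High | Excellent |
| Lake | > 40 | Med | Fair |
| … | … | … | … |

Flow: Training data, Classification algorithm, Classification rules

Example rule: If age = “31…40” and income = high, then credit_rating = excellent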
![Page 6: Text Document Categorization by Term Association Maria-luiza Antonie Osmar R. Zaiane University of Alberta, Canada 2002 IEEE International Conference on.](https://reader035.fdocuments.us/reader035/viewer/2022062217/5697bfdf1a28abf838cb2b83/html5/thumbnails/6.jpg)
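Figure 2

Test data:

| name | age | income | credit_rating |
|---|---|---|---|
| Frank | > 30 | high | fair |
| Sylvia | <= 30 | low | fair |
| Anne | 31..40 | high | excellent |
| … | … | … | … |

Flow: Test data and New data are both passed to the Classification rules

New data: (John, 31…40, high), credit rating?
Result: excellent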
![Page 7: Text Document Categorization by Term Association Maria-luiza Antonie Osmar R. Zaiane University of Alberta, Canada 2002 IEEE International Conference on.](https://reader035.fdocuments.us/reader035/viewer/2022062217/5697bfdf1a28abf838cb2b83/html5/thumbnails/7.jpg)
Related Work
Text classifier
Association Rule Mining
![Page 8: Text Document Categorization by Term Association Maria-luiza Antonie Osmar R. Zaiane University of Alberta, Canada 2002 IEEE International Conference on.](https://reader035.fdocuments.us/reader035/viewer/2022062217/5697bfdf1a28abf838cb2b83/html5/thumbnails/8.jpg)
Related Work (cont.)
Text classifier
  Naïve Bayesian classifier (chapter 7.4)
  ID3 (decision tree, chapter 7.3)
  C4.5 (chapter 7.6)
  K-NN (chapter 7.7.1)
  Neural Networks
  Support Vector Machines (SVM)
![Page 9: Text Document Categorization by Term Association Maria-luiza Antonie Osmar R. Zaiane University of Alberta, Canada 2002 IEEE International Conference on.](https://reader035.fdocuments.us/reader035/viewer/2022062217/5697bfdf1a28abf838cb2b83/html5/thumbnails/9.jpg)
Related Work (cont.)
Association Rule Mining
  Association Rules Generation
  Associative classifiers
![Page 10: Text Document Categorization by Term Association Maria-luiza Antonie Osmar R. Zaiane University of Alberta, Canada 2002 IEEE International Conference on.](https://reader035.fdocuments.us/reader035/viewer/2022062217/5697bfdf1a28abf838cb2b83/html5/thumbnails/10.jpg)
Related Work (cont.)
Association Rules Generation
  A rule has the form “X => Y”, with support s and confidence c
  Strong rules: rules whose support and confidence are greater than given thresholds
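
The two measures can be made concrete with a short sketch. It assumes the usual definitions (support as the fraction of transactions containing X and Y together, confidence as the fraction of transactions containing X that also contain Y); the function names and the toy transactions are hypothetical and are not taken from the paper.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing the antecedent that also
    contain the consequent."""
    s_x = support(transactions, antecedent)
    if s_x == 0:
        return 0.0
    both = frozenset(antecedent) | frozenset(consequent)
    return support(transactions, both) / s_x


def is_strong(transactions, antecedent, consequent, min_sup, min_conf):
    """Strict '>' follows the slide's wording ("greater than given
    thresholds"); many formulations use '>=' instead."""
    both = frozenset(antecedent) | frozenset(consequent)
    return (support(transactions, both) > min_sup
            and confidence(transactions, antecedent, consequent) > min_conf)


# Hypothetical toy transactions, for illustration only.
toy = [
    {"wheat", "export", "grain"},
    {"wheat", "grain"},
    {"oil", "export"},
    {"wheat", "price"},
]

print(support(toy, {"wheat", "grain"}))
print(confidence(toy, {"wheat"}, {"grain"}))
print(is_strong(toy, {"wheat"}, {"grain"}, min_sup=0.4, min_conf=0.6))
```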
![Page 11: Text Document Categorization by Term Association Maria-luiza Antonie Osmar R. Zaiane University of Alberta, Canada 2002 IEEE International Conference on.](https://reader035.fdocuments.us/reader035/viewer/2022062217/5697bfdf1a28abf838cb2b83/html5/thumbnails/11.jpg)
Related Work (cont.)
Associative classifiers
  The learning method is association rule mining
  Discover strong patterns that are associated with the class labels
  New objects are categorized by these patterns (the classifier)
![Page 12: Text Document Categorization by Term Association Maria-luiza Antonie Osmar R. Zaiane University of Alberta, Canada 2002 IEEE International Conference on.](https://reader035.fdocuments.us/reader035/viewer/2022062217/5697bfdf1a28abf838cb2b83/html5/thumbnails/12.jpg)
Building an Associative Text Classifier
Training Set
Preprocessing Phase
Association Rule Mining
Associative Classifier
Model Validation
Testing Set
![Page 13: Text Document Categorization by Term Association Maria-luiza Antonie Osmar R. Zaiane University of Alberta, Canada 2002 IEEE International Conference on.](https://reader035.fdocuments.us/reader035/viewer/2022062217/5697bfdf1a28abf838cb2b83/html5/thumbnails/13.jpg)
Building an Associative Text Classifier (cont.)
Data collection
Preprocessing
Association Rules Generation
Pruning the Set of Association Rules
Prediction of Classes Associated with New Documents
![Page 14: Text Document Categorization by Term Association Maria-luiza Antonie Osmar R. Zaiane University of Alberta, Canada 2002 IEEE International Conference on.](https://reader035.fdocuments.us/reader035/viewer/2022062217/5697bfdf1a28abf838cb2b83/html5/thumbnails/14.jpg)
Building an Associative Text Classifier (cont.)
Data collection
Preprocessing
  Weed out uninteresting words
    stopwording
    stemming
  Transform documents into transactions
    category set C = {c1, c2, … , cm}
    term set T = {t1, t2, … , tn}
    document Di = {cc1, cc2, … , ccm, tt1, tt2, … , ttn}
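
The preprocessing step can be sketched as follows. This is a minimal illustration under stated assumptions: the stop-word list, the suffix-stripping stemmer, and the `cat:` / `term:` item prefixes are all hypothetical stand-ins, and the slides do not say which stop list or stemmer the authors used.

```python
import re

# Hypothetical stop-word list and suffix rules, for illustration only;
# a real system would use a full stop list and a proper stemmer.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "are", "for"}
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip one common suffix; a stand-in for a real stemming algorithm."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def to_transaction(text, categories):
    """Return one document as a set of category items and term items."""
    tokens = re.findall(r"[a-z]+", text.lower())
    terms = {crude_stem(tok) for tok in tokens if tok not in STOP_WORDS}
    # Prefix the two kinds of items so a term can never be confused
    # with a category label of the same spelling.
    return {f"cat:{c}" for c in categories} | {f"term:{t}" for t in terms}


doc = to_transaction("Wheat exports are rising in the region", ["grain"])
print(sorted(doc))
```

Representing each document as a single set of items is what lets a standard association rule miner treat the collection as a transaction database.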
![Page 15: Text Document Categorization by Term Association Maria-luiza Antonie Osmar R. Zaiane University of Alberta, Canada 2002 IEEE International Conference on.](https://reader035.fdocuments.us/reader035/viewer/2022062217/5697bfdf1a28abf838cb2b83/html5/thumbnails/15.jpg)
Building an Associative Text Classifier (cont.)
Association Rules Generation
Apriori
  Advantage
    Performance studies show its efficiency and scalability
  Drawback when used on our transactions
    Generates a large number of association rules
    Most of them are irrelevant for classification
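
For readers unfamiliar with Apriori, a compact level-wise sketch of its frequent-itemset phase follows. It is a textbook-style illustration, not the authors' implementation; the function name, the `>=` threshold test, and the toy transactions are assumptions.

```python
from itertools import combinations


def apriori(transactions, min_sup):
    """Return {frozenset: support} for every itemset with support >= min_sup."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent(candidates):
        # Count each candidate and keep those that reach the threshold.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: k / n for c, k in counts.items() if k / n >= min_sup}

    items = {item for t in transactions for item in t}
    level = frequent({frozenset([i]) for i in items})
    result = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: merge frequent (k-1)-itemsets into k-itemset candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        level = frequent(candidates)
        result.update(level)
        k += 1
    return result


# Hypothetical toy transactions.
freq = apriori([{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}], min_sup=0.5)
print(len(freq))
```

Even on this four-transaction toy input, three items yield six frequent itemsets, which hints at why the rule set grows quickly on real term vocabularies.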
![Page 16: Text Document Categorization by Term Association Maria-luiza Antonie Osmar R. Zaiane University of Alberta, Canada 2002 IEEE International Conference on.](https://reader035.fdocuments.us/reader035/viewer/2022062217/5697bfdf1a28abf838cb2b83/html5/thumbnails/16.jpg)
ARC-BC
Association Rule-based Categorizer By Category algorithm
  Apriori-based
  Interested in rules that indicate a category label (T => ci): strong rules
  Prune the rules that are of no use for categorization
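
The by-category idea can be sketched as below: rules of the form T => ci are mined separately from the documents of each category. Several choices here are assumptions rather than details from the slides: support is measured within the category's own documents, confidence over the whole collection, and a brute-force enumeration capped at `max_terms` stands in for the Apriori-based search.

```python
from itertools import combinations


def rules_by_category(docs, min_sup, min_conf, max_terms=2):
    """docs: list of (set_of_terms, set_of_categories) pairs.

    Returns (term_set, category, support, confidence) tuples. Support is
    measured inside the category's own documents and confidence over the
    whole collection (both are assumptions of this sketch).
    """
    rules = []
    categories = sorted({c for _, cats in docs for c in cats})
    for cat in categories:
        in_cat = [terms for terms, cats in docs if cat in cats]
        vocab = sorted({t for terms in in_cat for t in terms})
        for size in range(1, max_terms + 1):
            # Brute-force candidate enumeration stands in for Apriori here.
            for combo in combinations(vocab, size):
                term_set = frozenset(combo)
                hits_in_cat = sum(1 for terms in in_cat if term_set <= terms)
                sup = hits_in_cat / len(in_cat)
                if hits_in_cat == 0 or sup < min_sup:
                    continue
                hits_all = sum(1 for terms, _ in docs if term_set <= terms)
                conf = hits_in_cat / hits_all
                if conf >= min_conf:
                    rules.append((term_set, cat, sup, conf))
    return rules


# Hypothetical toy corpus.
docs = [
    ({"wheat", "export"}, {"grain"}),
    ({"wheat", "price"}, {"grain"}),
    ({"oil", "export"}, {"crude"}),
    ({"oil", "price"}, {"crude"}),
]
rules = rules_by_category(docs, min_sup=0.5, min_conf=0.9)
for term_set, cat, sup, conf in rules:
    print(sorted(term_set), "=>", cat, sup, conf)
```

Because every generated rule already has a category on its right-hand side, rules that relate terms only to other terms are never produced.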
![Page 17: Text Document Categorization by Term Association Maria-luiza Antonie Osmar R. Zaiane University of Alberta, Canada 2002 IEEE International Conference on.](https://reader035.fdocuments.us/reader035/viewer/2022062217/5697bfdf1a28abf838cb2b83/html5/thumbnails/17.jpg)
ARC-BC Algorithm
![Page 18: Text Document Categorization by Term Association Maria-luiza Antonie Osmar R. Zaiane University of Alberta, Canada 2002 IEEE International Conference on.](https://reader035.fdocuments.us/reader035/viewer/2022062217/5697bfdf1a28abf838cb2b83/html5/thumbnails/18.jpg)
ARC-BC Algorithm
![Page 19: Text Document Categorization by Term Association Maria-luiza Antonie Osmar R. Zaiane University of Alberta, Canada 2002 IEEE International Conference on.](https://reader035.fdocuments.us/reader035/viewer/2022062217/5697bfdf1a28abf838cb2b83/html5/thumbnails/19.jpg)
ARC-BC
category 1
category i
category n
association rules for category 1
association rules for category i
association rules for category n
classifier
put the new documents in the correct class
![Page 20: Text Document Categorization by Term Association Maria-luiza Antonie Osmar R. Zaiane University of Alberta, Canada 2002 IEEE International Conference on.](https://reader035.fdocuments.us/reader035/viewer/2022062217/5697bfdf1a28abf838cb2b83/html5/thumbnails/20.jpg)
Examples of association rules composing the classifier
![Page 21: Text Document Categorization by Term Association Maria-luiza Antonie Osmar R. Zaiane University of Alberta, Canada 2002 IEEE International Conference on.](https://reader035.fdocuments.us/reader035/viewer/2022062217/5697bfdf1a28abf838cb2b83/html5/thumbnails/21.jpg)
Building an Associative Text Classifier (cont.)
Pruning the Set of Association Rules
  The number of rules generated in the association rule mining phase can be very large
    Noisy information misleads the classification process
    A large rule set makes classification time longer
  Pruning method
    Eliminate the specific rules and keep only those that are more general and have high confidence
    Prune unnecessary rules by database coverage
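
The two pruning ideas can be sketched as follows. The exact definition and algorithm appear only as slide images, so this is an interpretation under assumptions: a rule is treated as more general when its term set is a proper subset of another rule's term set for the same category, rules are ranked by confidence and then by length, and database coverage keeps a rule only if it correctly classifies a not-yet-covered training document. All names and the toy data are hypothetical.

```python
def prune_rules(rules, training_docs):
    """rules: list of (term_set, category, confidence).
    training_docs: list of (set_of_terms, set_of_categories).
    """
    # Rank rules: higher confidence first, then fewer terms (more general).
    ranked = sorted(rules, key=lambda r: (-r[2], len(r[0])))

    # Step 1: drop a rule when a higher-ranked rule for the same category
    # is more general, i.e. its term set is a proper subset.
    kept = []
    for terms, cat, conf in ranked:
        dominated = any(k_cat == cat and k_terms < terms
                        for k_terms, k_cat, _ in kept)
        if not dominated:
            kept.append((terms, cat, conf))

    # Step 2: database coverage. Keep a rule only if it correctly
    # classifies at least one training document not yet covered.
    uncovered = list(training_docs)
    final = []
    for terms, cat, conf in kept:
        covered = [d for d in uncovered if terms <= d[0] and cat in d[1]]
        if covered:
            final.append((terms, cat, conf))
            uncovered = [d for d in uncovered if d not in covered]
    return final


# Hypothetical toy rules and training documents.
toy_rules = [
    (frozenset({"wheat"}), "grain", 1.0),
    (frozenset({"wheat", "export"}), "grain", 1.0),
    (frozenset({"oil"}), "crude", 1.0),
    (frozenset({"price"}), "crude", 0.5),
]
toy_docs = [
    ({"wheat", "export"}, {"grain"}),
    ({"wheat", "price"}, {"grain"}),
    ({"oil", "export"}, {"crude"}),
    ({"oil", "price"}, {"crude"}),
]
pruned = prune_rules(toy_rules, toy_docs)
print(pruned)
```

In the toy run, the specific rule on {wheat, export} is dropped in favor of the more general {wheat} rule, and the low-confidence {price} rule is dropped because the documents it matches are already covered.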
![Page 22: Text Document Categorization by Term Association Maria-luiza Antonie Osmar R. Zaiane University of Alberta, Canada 2002 IEEE International Conference on.](https://reader035.fdocuments.us/reader035/viewer/2022062217/5697bfdf1a28abf838cb2b83/html5/thumbnails/22.jpg)
Building an Associative Text Classifier (cont.)
Pruning the Set of Association Rules: Definition
![Page 23: Text Document Categorization by Term Association Maria-luiza Antonie Osmar R. Zaiane University of Alberta, Canada 2002 IEEE International Conference on.](https://reader035.fdocuments.us/reader035/viewer/2022062217/5697bfdf1a28abf838cb2b83/html5/thumbnails/23.jpg)
Pruning the Set of Association Rules Algorithm
![Page 24: Text Document Categorization by Term Association Maria-luiza Antonie Osmar R. Zaiane University of Alberta, Canada 2002 IEEE International Conference on.](https://reader035.fdocuments.us/reader035/viewer/2022062217/5697bfdf1a28abf838cb2b83/html5/thumbnails/24.jpg)
Building an Associative Text Classifier (cont.)
Prediction of Classes Associated with New Documents: Algorithm
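
The prediction algorithm itself is shown only as a slide image and is not reproduced in this transcript. Purely as an illustration of how a rule-based classifier can label a new document, the sketch below averages the confidences of the matching rules per category and returns the best-scoring category. This scoring scheme, the function name, and the toy rules are assumptions and should not be read as the authors' algorithm.

```python
from collections import defaultdict


def predict(doc_terms, rules, default=None):
    """Return the category whose matching rules have the highest mean
    confidence, or `default` when no rule applies."""
    by_cat = defaultdict(list)
    for terms, cat, conf in rules:
        if terms <= doc_terms:  # every term of the rule occurs in the document
            by_cat[cat].append(conf)
    if not by_cat:
        return default
    scores = {cat: sum(c) / len(c) for cat, c in by_cat.items()}
    # Break ties alphabetically so the result is deterministic.
    return min(scores, key=lambda cat: (-scores[cat], cat))


# Hypothetical toy classifier.
toy_classifier = [
    (frozenset({"wheat"}), "grain", 0.9),
    (frozenset({"export"}), "grain", 0.5),
    (frozenset({"export"}), "crude", 0.6),
]
print(predict({"wheat", "export", "ship"}, toy_classifier))
```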
![Page 25: Text Document Categorization by Term Association Maria-luiza Antonie Osmar R. Zaiane University of Alberta, Canada 2002 IEEE International Conference on.](https://reader035.fdocuments.us/reader035/viewer/2022062217/5697bfdf1a28abf838cb2b83/html5/thumbnails/25.jpg)
Experimental Results
9,603 training documents and 3,299 testing documents
![Page 26: Text Document Categorization by Term Association Maria-luiza Antonie Osmar R. Zaiane University of Alberta, Canada 2002 IEEE International Conference on.](https://reader035.fdocuments.us/reader035/viewer/2022062217/5697bfdf1a28abf838cb2b83/html5/thumbnails/26.jpg)
Conclusion
The effectiveness of the associative classifier is comparable to that of most well-known text classifiers
Relatively fast training time
The generated rules are understandable and can easily be updated manually
When retraining with a new document, only the categories concerned are adjusted, and the rules can be updated incrementally