[IEEE 2009 First International Conference on Networked Digital Technologies (NDT) - Ostrava, Czech Republic]

Improving Arabic Text Categorization using Decision Trees

Fouzi Harrag
Computer Science Department, Farhat ABBAS University, Setif, 19000, Algeria
[email protected]

Eyas El-Qawasmeh
Computer Science Department, JUST University, Amman, 25000, Jordan
[email protected]

Abstract

This paper presents the results of classifying Arabic text documents using a decision tree algorithm. Experiments are performed over two self-collected corpora, and the results show that the suggested hybrid approach of Document Frequency Thresholding combined with the embedded information gain criterion of the decision tree algorithm is the preferable feature selection criterion. The study concluded that the effectiveness of the improved classifier is very good, with a generalization accuracy of about 0.93 for the scientific corpus and 0.91 for the literary corpus. We also conclude that the effectiveness of the decision tree classifier increased as the training size increased, and that the nature of the corpus has a considerable influence on classifier performance.

Keywords: Text Mining, Decision Tree Algorithm, Text Categorization, Arabic Corpus, Feature Selection.

1. Introduction

The rapid growth of the Internet has increased the number of online documents available. This has led to the development of automated text and document classification systems that are capable of automatically organizing and classifying documents [15]. Automatic text classification (also known as text categorization or topic spotting) is the process of assigning a text document to one or more predefined categories based on its content [11].

Automatic text classification has been used in many applications such as real-time sorting of files into folder hierarchies, topic identification, dynamic task-based interests, automatic meta-data organization, text filtering, and document organization for databases and web pages [23][25][29]. Several methods have been used for text classification [24][25][28], such as Support Vector Machines (SVM), K-Nearest Neighbor (KNN), Artificial Neural Networks, the Naïve Bayes probabilistic classifier, and Decision Trees. However, the majority of work has been done in automatic text classification for documents written in English. Work on Arabic text classification is very limited. This may be due to the complex nature of the Arabic language [5].

This paper uses the decision tree technique to classify Arabic documents. It shows that this approach outperforms some other existing systems.

The rest of the paper is organized as follows: Section 2 summarizes related work in document classification. Section 3 gives a general description of decision trees as a method used to classify Arabic documents. Section 4 gives a detailed description of the classification procedure. Section 5 reports our experiments with the proposed method and compares it with other existing systems. Finally, we conclude the paper with a summary and an outlook on future work.

2. Related work in Arabic text categorization

Many researchers have been working on text categorization in English and other European languages; however, few researchers have worked on text categorization for the Arabic language. El-Kourdi et al. [12] used the Naive Bayes algorithm for automatic Arabic document classification. The average accuracy reported was about 68.78%. Sawaf et al. [24] used statistical classification methods such as maximum entropy to classify and cluster news articles. The best classification accuracy they reported was 62.7%.

El-Halees [11] described a method based on association rules to classify Arabic documents. The classification accuracy reported was 74.41%. Duwairi [10] proposed a distance-based classifier for categorizing Arabic text. The average accuracy reported was 0.62 for recall and 0.74 for precision. The experimental results of Syiam et al. [26]

Eyas El-Qawasmeh, Pit Pichappan
Dept. of Computer Sciences, Al Imam University, Riyadh, Saudi Arabia
[email protected]

978-1-4244-4615-5/09/$25.00 ©2009 IEEE

show that the suggested hybrid method of statistical and light stemmers is the most suitable stemming algorithm for the Arabic language and gives a generalization accuracy of about 98%. Mesleh [19] described a support vector machine based text classification system for Arabic language articles. The system's effectiveness on an Arabic data set in terms of F-measure is 88.11. Al-Harbi et al. [1] evaluated the performance of two popular classification algorithms (SVM and C5.0) on classifying Arabic text using seven Arabic corpora. The SVM average accuracy is 68.65%, while the average accuracy for C5.0 is 78.42%.

3. Categorization using Decision Trees

A decision tree classifier [6][8] is a tree in which internal nodes are labeled by attributes (word occurrences, in the case of text categorization), branches departing from them are labeled by tests on the weight that the attribute has in the test document, and leaves are labeled by categories. A decision tree categorizes a test document by recursively testing the weights that the attributes labeling the internal nodes have in the document vector, until a leaf is reached.

The most common approach to inducing a decision tree is to partition the labeled examples recursively until a stopping criterion is met. The partition is defined by selecting a test that divides all examples into the disjoint subsets assigned to the test's branches, passing each example to the corresponding branch, and treating each block of the partition as a sub-problem for which a sub-tree is built recursively. A common stopping criterion for a subset of examples is that they all have the same class. The skeleton of the ID3 [8] algorithm is given in Figure 1.

// Input: the training samples
// Output: a decision tree
// Method:
//   - The tree is constructed in a top-down recursive
//     divide-and-conquer manner.
//   - At the start, all the training examples are at the root.
//   - Attributes are categorical (if continuous-valued, they are
//     discretized in advance).
//   - Examples are partitioned recursively based on selected attributes.
//   - Test attributes are selected on the basis of a heuristic or
//     statistical measure (e.g., information gain).
//
// Conditions for stopping partitioning:
//   - All samples for a given node belong to the same class.
//   - There are no remaining attributes for further partitioning
//     (majority voting is employed for classifying the leaf).
//   - There are no samples left.

Fig 1. ID3 Decision trees algorithm
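As an illustration of the skeleton in Figure 1 (ours, not the authors' implementation), the following minimal Python sketch induces a tree over binary word-occurrence attributes and classifies a document by walking the tree; the function names and toy corpus are our own, and the attribute-selection heuristic is pluggable.

```python
from collections import Counter

def build_tree(examples, labels, attrs, choose_attr):
    # Stopping condition: all samples belong to the same class -> class leaf.
    if len(set(labels)) == 1:
        return labels[0]
    # Stopping condition: no attributes (or samples) left -> majority vote leaf.
    if not attrs or not examples:
        return Counter(labels).most_common(1)[0][0]
    attr = choose_attr(examples, labels, attrs)  # heuristic, e.g. information gain
    node = {"attr": attr, "branches": {}}
    remaining = [a for a in attrs if a != attr]
    for value in (0, 1):  # word absent / word present
        part = [(ex, lab) for ex, lab in zip(examples, labels)
                if ex[attr] == value]
        if part:
            exs, labs = zip(*part)
            node["branches"][value] = build_tree(
                list(exs), list(labs), remaining, choose_attr)
        else:
            node["branches"][value] = Counter(labels).most_common(1)[0][0]
    return node

def classify(tree, doc):
    # Recursively test the attributes labeling internal nodes until a leaf.
    while isinstance(tree, dict):
        tree = tree["branches"][doc[tree["attr"]]]
    return tree

# Toy corpus: occurrence of "goal" separates Sport from History.
docs = [{"goal": 1, "king": 0}, {"goal": 1, "king": 1},
        {"goal": 0, "king": 1}, {"goal": 0, "king": 0}]
cats = ["Sport", "Sport", "History", "History"]
tree = build_tree(docs, cats, ["goal", "king"],
                  choose_attr=lambda exs, labs, attrs: attrs[0])
```

A real system would pass an information-gain-based `choose_attr`, as discussed next.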

The critical decision in such a top-down decision tree generation algorithm is the choice of attribute at a node. Attribute selection in the ID3 and C4.5 algorithms [20] is based on minimizing an information entropy measure applied to the examples at a node.

The approach based on information theory insists on minimizing the number of tests needed to classify a sample into its true category. The attribute-selection part of ID3 is based on the assumption that the complexity of the decision tree is strongly related to the amount of information conveyed by the value of the given attribute.

An information-based heuristic selects the attribute providing the highest information gain, i.e., the attribute that minimizes the information needed in the resulting sub-tree to classify the sample. The measure favors attributes that partition the data into subsets with low class entropy, i.e., subsets in which the majority of examples belong to a single class. The algorithm basically chooses the attribute that provides the maximum degree of discrimination between classes locally.
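The entropy and information gain computations described above can be sketched as follows (a minimal illustration of the standard formulas, with a toy corpus of our own; not the authors' code):

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon class entropy (in bits) of a set of example labels.
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attr):
    # Entropy before the split minus the weighted entropy of the subsets.
    remainder = 0.0
    for value in (0, 1):  # binary word-occurrence attribute
        subset = [lab for ex, lab in zip(examples, labels) if ex[attr] == value]
        if subset:
            remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

# "w_sport" splits the two classes perfectly, so it has the maximum gain.
docs = [{"w_sport": 1, "w_faith": 0}, {"w_sport": 1, "w_faith": 1},
        {"w_sport": 0, "w_faith": 1}, {"w_sport": 0, "w_faith": 0}]
cats = ["Sport", "Sport", "Faith", "Faith"]
best = max(["w_sport", "w_faith"],
           key=lambda a: information_gain(docs, cats, a))
```

Here `w_sport` yields gain 1 bit (a perfect split) while `w_faith` yields gain 0, so the heuristic selects `w_sport`.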

4. Research methodology

The purpose of this paper is to apply machine learning techniques commonly used in text categorization to the Arabic language. The proposed model contains a set of phases that describe the document preprocessing routines, document representation techniques, and the classification process. Figure 2 summarizes the proposed model for Arabic text categorization.

Fig 2. Overview of the categorization process: input documents → text preprocessing → document indexing → feature selection → categorization algorithm → evaluation.


4.1. Automatic text categorization

A document in a text categorization system must pass through a set of steps. Document conversion converts different types of documents into plain text. Stop words, such as prepositions and particles, are considered insignificant and must be removed. Words must then be stemmed; stemming is the process of removing the affixes from a word and extracting its root [2][4][9][16][17][26]. After applying these preprocessing routines, the document passes through the document indexing process, which involves creating an internal representation of the document. The first phase of the indexing process is the construction of the super vector containing all terms that appear in the documents of the corpus. The second phase is term selection, which can be seen as a form of dimensionality reduction: a subset of terms is selected from the full original set of terms in the super vector according to some criterion, and this subset is expected to yield the best effectiveness, or the best compromise between effectiveness and efficiency [18][21][27]. The third phase is term weighting, in which, for every term selected in the second phase and for every document, a weight is computed that represents how much the term contributes to the discriminative semantics of the document [22].
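The three indexing phases above can be sketched as follows (our own minimal illustration with simple term-frequency weights; not the paper's implementation):

```python
from collections import Counter

def super_vector(corpus):
    # Phase 1: every distinct term appearing in any document of the corpus.
    return sorted({t for doc in corpus for t in doc})

def select_terms(corpus, min_df=2):
    # Phase 2: dimensionality reduction via a document-frequency threshold.
    df = Counter(t for doc in corpus for t in set(doc))
    return [t for t in super_vector(corpus) if df[t] >= min_df]

def weight(doc, terms):
    # Phase 3: simple term-frequency weights over the selected terms.
    tf = Counter(doc)
    return [tf[t] for t in terms]

corpus = [["ball", "goal", "team", "goal"],
          ["goal", "match"],
          ["king", "war"]]
terms = select_terms(corpus, min_df=2)  # only "goal" appears in >= 2 docs
vec = weight(corpus[0], terms)
```

More elaborate weighting schemes (e.g., tf-idf [22]) replace the raw counts in phase 3.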

Finally, the classifier is constructed by learning the characteristics of every category from a training set of documents. Once a classifier has been built, its effectiveness may be tested by applying it to the test set and checking the degree of correspondence between the decisions of the classifier and those encoded in the corpus.

4.2. Data set

To better evaluate our classifier, we opted to use two different corpora. The first corpus is a set of Arabic texts from different domains collected from the Arabian scientific encyclopedia (Hal Taâlam) [7]. It contains 373 documents distributed over 8 categories. Table 1 gives the number of documents for each category of this corpus.

Table 1. Number of documents per category for the scientific corpus

Category name     # of documents
Invention         35
Geography         35
Sport             35
Famous Men        35
Islamic Science   35
History           35
Human Body        35
Cosmology         35

The second corpus is a set of prophetic traditions, or Hadiths (sayings of the Prophet, Peace Be Upon Him), collected from the prophetic encyclopedia (Alkotob Altissâa, the Nine Books) [13]. It is characterized by its specialized domain, Hadith. It includes 453 documents distributed over 14 categories. Table 2 gives the number of documents for each category of this corpus.

Table 2. Number of documents per category for the Hadith corpus

Category name      # of documents
Faith              23
Koran              24
Knowledge          22
Crimes             22
Al-Djihad          24
Good Manners       31
Past Generations   12
Bibliography       11
Judgments          24
Worships           23
Behaviors          25
Food               31
Clothes            34
Personal States    24

4.3. Evaluation criteria

For the text categorization task, we use standard evaluation criteria where possible, so that the methods and tasks can be compared with other work.

Precision:  P = (# correct classes found) / (# classes found)      (1)

Recall:     R = (# correct classes found) / (# correct classes)    (2)

Both quality measures in combination define the so-called F-measure:

F-measure:  F = 2PR / (P + R)                                      (3)
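Equations (1)-(3) translate directly into code. The counts below are hypothetical, chosen only to illustrate the formulas:

```python
def precision(correct_found, found):
    # Eq. (1): fraction of the assigned classes that were correct.
    return correct_found / found

def recall(correct_found, correct):
    # Eq. (2): fraction of the true classes that were found.
    return correct_found / correct

def f_measure(p, r):
    # Eq. (3): harmonic mean of precision and recall.
    return 2 * p * r / (p + r)

p = precision(7, 10)  # hypothetical: 7 of 10 assigned classes were correct
r = recall(7, 14)     # hypothetical: 7 of the 14 true classes were found
f = f_measure(p, r)
```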

5. Experimental Results

The goal of this experiment is to evaluate the performance of the decision tree classification algorithm (ID3) on classifying Arabic text using the two Arabic corpora described in Section 4. Following the steps of the text classification model, we removed the Arabic stop words [14], filtered out non-Arabic letters and symbols, and removed the digits. We applied a light stemming process.1 We used one third of the Arabic data set for testing the classifier and two thirds for training it.

1 We have used the K. Darwish Al-Stem program, http://www.glue.umd.edu/~kareem/research/.
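The preprocessing steps above can be sketched as follows. This is our own illustration: the stop-word set is a tiny placeholder (the paper uses the list of [14]) and the stemmer is a pluggable no-op standing in for Al-Stem:

```python
import re

# Basic Arabic-letter block; anything outside it (digits, Latin letters,
# symbols) is discarded, as in the paper's filtering step.
ARABIC_WORD = re.compile(r"[\u0621-\u064A]+")
STOP_WORDS = {"في", "من", "على", "إلى"}  # placeholder stop list

def preprocess(text, stem=lambda w: w):
    # Keep Arabic words only, remove stop words, then apply a
    # (pluggable) stemming function such as a light stemmer.
    return [stem(t) for t in ARABIC_WORD.findall(text)
            if t not in STOP_WORDS]

tokens = preprocess("ذهب الولد إلى المدرسة 123 abc")
```

The example sentence keeps its three content words and drops the stop word, the digits, and the Latin text.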


5.1. Impact of Feature Selection

Classification algorithms cannot deal directly with texts. Instead, texts are represented as vectors with m elements, where m is the number of features, which are mostly the text's words. This kind of text representation typically leads to a high-dimensional input space, which affects the efficiency of classification algorithms [1]. For this reason, term selection techniques are used to select from the super vector a subset of terms that are deemed most useful for compactly representing the meaning of the documents [26]. The first experiment evaluates the influence of the feature selection criterion on the performance of the classification system. We tested several threshold values for the term frequency (TF), document frequency (DF), and combined frequency (TF/DF). Figure 3 shows that for the Hadith corpus, the best threshold criterion is (TF=2), which yields an accuracy of about 0.38. The vocabulary size after applying this criterion is about 1938 terms.

Fig 3. Choice of the best feature selection threshold for the Hadith corpus

For the scientific corpus, the size of the vocabulary after applying the best threshold is about 1107 terms. From Figure 4 we note that using the term frequency criterion with a threshold of (TF=3) gives the best accuracy (0.70).

Fig 4. Choice of the best feature selection threshold for the scientific corpus

The classification results with and without the feature selection criterion for the two corpora are compared in Figure 5 and Figure 6. From Figure 5, we note an improvement of +12% for the average precision, +10% for the average recall, and +11% for the average F1 measure. The global improvement in performance for the Hadith corpus is about +11%.

Fig 5. Influence of the feature selection criterion on the performance of the classifier for the Hadith corpus

From Figure 6, we note an improvement of +27% for the average precision, +26% for the average recall, and +28% for the average F1 measure. The global improvement in performance for the scientific corpus is about +26.5%.

Fig 6. Influence of the feature selection criterion on the performance of the classifier for the scientific corpus

The results of this experiment confirm that, for the dimensionality reduction phase, the suggested hybrid approach of term frequency thresholding combined with the embedded information gain criterion of the decision tree algorithm is the preferable feature selection criterion. This approach considerably improves the results of the classification system; the mean improvement over the two corpora is about +16%.
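The TF and DF thresholding criteria compared in this experiment can be sketched as follows (our own illustration on a toy corpus; the thresholds TF=2 and TF=3 above were found empirically on the real corpora):

```python
from collections import Counter

def tf_filter(corpus, min_tf):
    # Keep terms whose total collection frequency reaches the threshold.
    tf = Counter(t for doc in corpus for t in doc)
    return {t for t, c in tf.items() if c >= min_tf}

def df_filter(corpus, min_df):
    # Keep terms that occur in at least min_df distinct documents.
    df = Counter(t for doc in corpus for t in set(doc))
    return {t for t, c in df.items() if c >= min_df}

corpus = [["goal", "goal", "goal"],
          ["team", "match"],
          ["team", "war"]]
vocab_tf = tf_filter(corpus, min_tf=2)  # frequent terms: goal, team
vocab_df = df_filter(corpus, min_df=2)  # widespread terms: team only
```

Note how the two criteria disagree: a term repeated heavily in a single document (`goal`) survives the TF threshold but not the DF threshold.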

5.2. Impact of the training set size

The second experiment evaluates the influence of the training set size on the performance of the classification system. We tested the system repeatedly at the best threshold found above; the evaluation is based on 5 separate trials run on training sets that varied in the number of documents. The tested sizes are 150, 191, 239, 266, and 340 training documents for the Hadith corpus, and 100, 140, 180, 200, and 280 training documents for the scientific corpus. Figure 7 and Figure 8 show that as the number of training documents increases, the accuracy and effectiveness of the system also increase. From Figure 7, we note that for the Hadith corpus, the global improvement in performance is about +15%. We also note that the recall measure is stable from size 191 onward.

Fig 7. Influence of the training set size on the performance of the classifier for the Hadith corpus.

From Figure 8, we note that for the scientific corpus, the global improvement in performance is about +26.5%.

Fig 8. Influence of the training set size on the performance of the classifier for the scientific corpus.

The results of this experiment confirm that the size of the training set has a significant influence on the performance of the classifier.
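The evaluation protocol of this experiment — re-train on growing subsets of the training set and measure accuracy each time — can be sketched as follows. The classifier here is a deliberately trivial majority-class stand-in of our own, used only to make the protocol runnable:

```python
from collections import Counter

def accuracy(predict, test_set):
    # Fraction of test documents assigned their true category.
    return sum(predict(doc) == cat for doc, cat in test_set) / len(test_set)

def learning_curve(train_set, test_set, sizes, fit):
    # Re-train on growing prefixes of the training set, as in Sec. 5.2.
    return [(n, accuracy(fit(train_set[:n]), test_set)) for n in sizes]

# Toy stand-in classifier: always predict the majority training category.
def fit_majority(train_set):
    majority = Counter(cat for _, cat in train_set).most_common(1)[0][0]
    return lambda doc: majority

train = [("doc%d" % i, "Sport" if i % 3 else "Faith") for i in range(9)]
test = [("t1", "Sport"), ("t2", "Sport"), ("t3", "Faith")]
curve = learning_curve(train, test, sizes=(3, 9), fit=fit_majority)
```

In the paper's experiments, `fit` would train the ID3 classifier and `sizes` would be the document counts listed above.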

5.3. Comparison of the two corpora

The comparison of the evaluation results shows that the best scores are those of the scientific corpus, as shown in Figure 9.

Fig 9. Evaluation results comparison.

The weakness of the evaluation results for the Hadith corpus led us to revise the classification rules extracted from the constructed decision trees. This revision showed that most misclassification errors are in fact due to the documents' nature and features. We noticed that misclassified documents contain a large number of words that are representative of other categories. In other words, documents that are known to belong to one category contain numerous words that have a higher frequency in other categories; these words therefore have a greater influence on the prediction made by the classifier. Other factors can also influence the performance of the classification system, such as the size of the documents and the thematic divergence between the different categories of the corpus.

5.4. Comparison with other classification systems

The results of comparing our decision tree classification system with other Arabic text classifiers are shown in Table 3. The comparison was done with a probabilistic system based on the Naive Bayes algorithm [12], a statistical system based on the maximum entropy algorithm [24], and linear systems based on the vector space model using the Cosine, Dice, Jaccard, and Euclidean measures [3].

This comparison shows that our classifier is one of the best systems in terms of global performance; it reports better values for the F1 and precision measures, which shows that our system is more accurate. In terms of classification time, classification systems based on the decision tree algorithm also rank among the best. Indeed, the classification time for the two test sets (93 documents for the scientific corpus and 113 documents for the Hadith corpus) is 120 seconds, very short compared with the classification time of the K-NN (K-PPV) classifier mentioned in [26], which reached 3004 seconds.

Table 3. Comparison of the evaluation results of thedecision trees classifier with other Arabic text Classifiers.

System            Precision   Recall   F1
Decision trees    73.00       70.00    70.00
Naive Bayes       67.88       71.96    67.83
Maximum entropy   50.00       84.20    62.70
VSM (Dice)        41.00       44.00    42.00
VSM (Jaccard)     54.00       61.00    57.00
VSM (Euclidian)   54.00       57.00    55.00

6. Conclusion

In this paper, we evaluated a classification system based on the decision tree algorithm. Our study was based on several experiments that sought to study the impact of the feature selection and training set size criteria on the performance of the presented classification system. The use of two different corpora allowed us to conclude that a set of factors can influence the classifier's performance, in particular the nature and the specificity of the corpus documents. This study also confirmed that the classification method is largely independent of the language used.

In future work, we will consider moving to multiclass classification through the use of fuzzy decision trees or the topic segmentation technique. A further perspective is to conduct other experiments to study the impact of using root-based stemmers and external Arabic thesauri or ontologies on the performance of our classification system.

7. References

[1] S. Al-Harbi, A. Almuhareb, A. Al-Thubaity, M. S. Khorsheed, and A. Al-Rajeh, "Automatic Arabic Text Classification", 9es Journées internationales, JADT'08, France, 2008, pp. 77-83.

[2] M. Aljlayl, and O. Frieder, "On Arabic Search: Improving the Retrieval Effectiveness Via a Light Stemming Approach", In International Conference on Information and Knowledge Management, CIKM'02, ACM, McLean, VA, USA, 2002, pp. 340-347.

[3] M.N. Al-Kabi, and S.I. Al-Sinjilawi, "A Comparative Study of the Efficiency of Different Measures to Classify Arabic Text", University of Sharjah Journal of Pure & Applied Sciences, 2007, 4(2), pp. 13-26.

[4] R. Al-Shalabi, and M. Evens, "A Computational Morphology System for Arabic", In Workshop on Computational Approaches to Semitic Languages, COLING-ACL'98, August 1998.

[5] R. Al-Shalabi, and R. Obeidat, "Improving KNN Arabic Text Classification with N-Grams Based Document Indexing", INFOS'08, Cairo, Egypt, March 2008, pp. 108-112.

[6] C. Apté, F. Damerau, and S. Weiss, "Automated Learning of Decision Rules for Text Categorization", ACM Transactions on Information Systems, 1994, 12(3), pp. 233-251.

[7] Arriss Computer Society, "The Arabian Scientific Encyclopedia: Hal Taâlam", 2001.

[8] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone, "Classification and Regression Trees", Technical Report, Wadsworth International, Monterey, CA, 1984.

[9] A. Chen, and F. Gey, "Building an Arabic Stemmer for Information Retrieval", In Proceedings of the 11th Text Retrieval Conference, TREC'02, National Institute of Standards and Technology, 2002.

[10] R.M. Duwairi, "A Distance-based Classifier for Arabic Text Categorization", In Proceedings of the International Conference on Data Mining, Las Vegas, USA, 2005.

[11] A. El-Halees, "Arabic Text Classification Using Maximum Entropy", The Islamic University Journal (Series of Natural Studies and Engineering), 2007, 15(1), pp. 157-167.

[12] M. El-Kourdi, A. Bensaid, and T. Rachidi, "Automatic Arabic Document Categorization Based on the Naïve Bayes Algorithm", 20th International Conference on Computational Linguistics, Geneva, August 2004.

[13] "The Encyclopedia of the Nine Books for the Honorable Prophetic Traditions", Sakhr Company, 1997, http://www.Harf.com.

[14] C. Fox, "Lexical Analysis and Stop Lists", in W.B. Frakes and R. Baeza-Yates (eds.), "Information Retrieval: Data Structures & Algorithms", Prentice-Hall Inc., 1992, pp. 102-130.

[15] L. Khreisat, "Arabic Text Classification Using N-Gram Frequency Statistics: A Comparative Study", DMIN'06, 2006, pp. 78-82.

[16] S. Khoja, "Stemming Arabic Text", Lancaster, U.K., Computing Department, Lancaster University, 1999.

[17] L. Larkey, L. Ballesteros, and M.E. Connell, "Improving Stemming for Arabic Information Retrieval: Light Stemming and Co-occurrence Analysis", Proceedings of SIGIR'02, 2002, pp. 275-282.

[18] T. Liu, S. Liu, Z. Chen, and W.-Y. Ma, "An Evaluation on Feature Selection for Text Clustering", Proceedings of the 12th International Conference ICML'03, Washington, DC, USA, 2003, pp. 488-495.

[19] A. A. Mesleh, "Chi Square Feature Extraction Based SVMs Arabic Language Text Categorization System", Journal of Computer Science, 2007, 3(6), pp. 430-435.

[20] J.R. Quinlan, "C4.5: Programs for Machine Learning", Morgan Kaufmann, San Mateo, CA, 1993.

[21] M. Rogati, and Y. Yang, "High-Performing Feature Selection for Text Classification", CIKM'02, ACM, 2002.

[22] G. Salton, and C. Buckley, "Term-weighting Approaches in Automatic Text Retrieval", Information Processing and Management, 1988, 24(5), pp. 513-523.

[23] M. Sauban, and B. Pfahringer, "Text Categorization Using Document Profiling", Principles of Data Mining and Knowledge Discovery, 2003.

[24] H. Sawaf, J. Zaplo, and H. Ney, "Statistical Classification Methods for Arabic News Articles", Workshop on Arabic Natural Language Processing, ACL'01, Toulouse, France, July 2001.

[25] F. Sebastiani, "Machine Learning in Automated Text Categorization", ACM Computing Surveys, 2002, 34(1), pp. 1-47.

[26] M.M. Syiam, Z.T. Fayed, and M.B. Habib, "An Intelligent System for Arabic Text Categorization", IJICIS, 2006, 6(1), pp. 1-19.

[27] Y. Yang, and J.O. Pedersen, "A Comparative Study on Feature Selection in Text Categorization", Proceedings of ICML'97, 1997, pp. 412-420.

[28] Y. Yang, and X. Liu, "A Re-examination of Text Categorization Methods", Proceedings of the 22nd ACM International Conference on Research and Development in Information Retrieval, SIGIR'99, ACM Press, New York, USA, 1999, pp. 42-49.

[29] Y. Yang, S. Slattery, and R. Ghani, "A Study of Approaches to Hypertext Categorization", Journal of Intelligent Information Systems, 2002.
