
Application of an ant colony algorithm for text indexing

Nadia Lachetar
1 Computer Science Department, University 20 aout 1955 Skikda, Skikda, Algeria
Email: [email protected]

Halima Bahi
2 LabGED Laboratory, Computer Science Department, University Badji Mokhtar Annaba, Annaba, Algeria
Email: [email protected]

Abstract: Every day, the mass of information available to us increases. This information would be of little use if our ability to access it efficiently did not increase as well. For maximum benefit, we need tools that allow us to search, sort, index, store, and analyze the available data, and tools that help us find the desired information in a reasonable time by performing certain tasks for us. One of the promising areas is automatic text categorization. Imagine ourselves in the presence of a considerable number of texts, which are more easily accessible if they are organized into categories according to their theme. Of course, one could ask a human to read the texts and classify them manually, but this task is hard if done for hundreds, or even thousands, of texts. So it seems necessary to have an automated application that indexes text databases. In this article, we present our experiments in automated text categorization, where we suggest the use of an ant colony algorithm. A Naive Bayes algorithm is used as a baseline in our tests.

Keywords: Information Retrieval; Text Categorization; Naive Bayes Algorithm; Ant Colony Algorithm.

    I. INTRODUCTION

Research in the field of automatic categorization remains relevant today since the results are still subject to improvement. For some tasks, automatic classifiers perform almost as well as humans, but for others the gap is still large. At first glance, the main problem is easy to grasp. On one hand, we are dealing with a bank of text documents, and on the other with a set of categories. The goal is to build a computer application which can determine to which category a text belongs based on its contents [2].

Despite this simplified definition, the solution is not straightforward and several factors must be considered. First, we need to select an adequate representation of the texts to be treated; this is an essential step in machine learning. We should opt for consistent and sensible attributes to abstract the data before submitting them to an algorithm. Subsequently, we discuss the selection of the attributes almost always involved in automated text categorization, and the elimination of the attributes considered unnecessary for classification [2]. Once this pretreatment is completed, we perform the classification using both the Naive Bayes algorithm [1] and our proposed ant colony algorithm.

The remainder of the paper is organized as follows: Section II presents the various aspects of automatic text categorization; in particular, it addresses the main modes of document representation. Section III introduces the Naive Bayes algorithm. Section IV presents our approach, which is the application of an ant colony algorithm to text categorization. Section V presents the obtained results and a discussion.

    II. TEXT CATEGORIZATION

The purpose of automatic text categorization is to teach a machine to classify a text into the correct category based on its content; the categories refer to topics (subjects). We may wish that a text is associated with only one category, or allow it to belong to a number of categories. The set of categories is determined in advance. The problem is to group the texts by their similarity. In text categorization, the classification is similar to the problem of extracting the semantics of texts, since the membership of a text in a category is closely related to the meaning of the text. This is partly what makes the task difficult, since the treatment of the semantics of words written in natural language is not yet a solved problem.

A. How to categorize a text?

The categorization process includes the construction of a prediction model that receives the text as input and associates one or more labels with it as output. To identify the category associated with a text, the following steps are required:

1) Learning, which includes several steps and leads to a prediction model:

a) We have a set of labeled texts (for every text we know its class).

b) From this corpus, we extract the k descriptors (t1, ..., tk) which are most relevant in the sense of the problem to be solved.

c) We then have a "descriptors X individuals" table, and for every text we know the values of its descriptors and its label.

2) The classification of a new text dx includes two stages:


a) Search for, and weighting of, the instances of the terms t1, ..., tk in the text dx to classify.

b) Application of a learning algorithm to these instances and to the previous table in order to predict the label of the text dx [1].

Note that the k most relevant descriptors (t1, ..., tk) are extracted during the first phase by analyzing the texts of the training corpus. In the second phase, the classification of a new text, we simply seek the frequencies of these k descriptors (t1, ..., tk) in the text to be classified.
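As a minimal sketch of this two-phase scheme (illustrative Python only, not the authors' implementation; the toy corpus is invented and a simple frequency criterion stands in for the relevance measure, which is not specified at this point):

    # Phase 1 (sketch): pick the k most frequent tokens of a labeled corpus as descriptors.
    # Phase 2 (sketch): count how often those descriptors occur in a new text to classify.
    from collections import Counter

    def tokenize(text):
        return text.lower().split()

    def select_descriptors(corpus, k):
        # corpus: list of (text, label) pairs; keep the k most frequent tokens.
        counts = Counter(tok for text, _ in corpus for tok in tokenize(text))
        return [tok for tok, _ in counts.most_common(k)]

    def describe(text, descriptors):
        # Frequency of each retained descriptor in the text to classify.
        toks = Counter(tokenize(text))
        return [toks[d] for d in descriptors]

    corpus = [("the economy grows", "Economy"), ("schools teach students", "Education")]
    descriptors = select_descriptors(corpus, k=4)
    print(descriptors, describe("the economy of schools", descriptors))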

B. Representation and coding of a text

Prior coding of the text is necessary because there is currently no learning method that can directly handle unstructured data, either in the model construction stage or when the model is used for classification.

For most learning methods, we must convert all the texts into an "individuals-variables" table.

An individual is a text dj; it is labeled during the learning stage, and it is what will be classified in the prediction phase.

The variables are the descriptors (terms) tk extracted from the data of the text.

The content wkj of the table represents the weight of term k in document j.

Different methods have been proposed for the selection of the descriptors and of the weights associated with them. Some researchers use words as descriptors, while others prefer to use lemmas (lexical roots) or even stems (words with affixes removed) [1].

C. Approaches for text representation

Learning algorithms are not able to treat texts directly, and more generally unstructured data such as images, sounds and video clips. Therefore a preliminary step called representation is required. This step aims to represent each document by a vector whose components are, for instance, the words of the text, so as to make it usable by the learning algorithms. A collection of texts can then be represented by a matrix whose columns are the documents [1].

Many researchers have chosen to use a vector representation in which each text is represented by a vector of n weighted terms. The n terms are simply the n different words occurring in the texts.

1) Choice of terms: In text categorization, we transform the document into a vector dj = (w1j, w2j, ..., w|T|j), where T is the set of terms (descriptors) that appear at least once in the learning corpus (the collection). The weight wkj corresponds to the contribution of the term tk to the semantics of the text dj [1].

2) Bag of words representation: The simplest representation of a text is a vector model called "bag of words". The idea is to transform each text into a vector where each component corresponds to a word. Words have the advantage of having an explicit sense. However, several problems arise. One must first define what a "word" is in order to be able to process it automatically. A word can be regarded as a sequence of characters belonging to a dictionary or, more practically, as a sequence of non-delimiter characters framed by delimiter characters. The components of the vector are a function of the occurrences of the words in the text. This representation excludes any grammatical analysis and any notion of distance between words, which is why it is called a "bag of words"; other authors speak of a "set of words" when the weights associated with the words are binary [1].
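To make the representation concrete, the following minimal Python sketch (not taken from the paper; the vocabulary and sentence are invented) builds the frequency-based and the binary bag-of-words vectors described above:

    def bag_of_words(text, vocabulary, binary=False):
        # Components are occurrence counts (or 0/1 indicators) of each vocabulary word.
        tokens = text.lower().split()
        counts = [tokens.count(w) for w in vocabulary]
        return [1 if c >= 1 else 0 for c in counts] if binary else counts

    vocabulary = ["ant", "colony", "text", "indexing"]
    print(bag_of_words("Ant colony algorithms index text", vocabulary))        # counts
    print(bag_of_words("Ant colony algorithms index text", vocabulary, True))  # binary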

3) Representation of texts by sentences: Despite the simplicity of using words as units of representation, some authors propose to use sentences as units. Sentences are more informative than words, because they preserve information on the position of a word in the sentence. Logically, such a representation should give better results than those obtained with words alone. However, while the semantic qualities are preserved, the statistical qualities are largely degraded [1].

4) Representation of texts by lexical roots and lemmas: In the "bag of words" representation model, each form of a word is considered as a different descriptor. For example, the words "movers", "removals", "move", etc. are considered as different descriptors although they share the same root "move". Suffix-stripping (stemming) techniques, which recover the lexical roots, may resolve this difficulty. Several algorithms have been proposed for the detection of lexical roots; the best known for the English language is Porter's algorithm [7]. Lemmatization consists of replacing verbs with their infinitive form and nouns with their singular form; the TreeTagger lemmatizer was developed for English, French, German and Italian [6].
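As an illustration, the sketch below assumes the NLTK library, which the paper does not mention; it applies Porter's suffix-stripping algorithm so that inflected forms share a single descriptor:

    # A sketch assuming the NLTK library (not used in the paper): pip install nltk
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["moving", "moved", "moves", "move"]:
        print(word, "->", stemmer.stem(word))
    # The inflected forms are all reduced to the stem "move", so they count as one descriptor.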

5) Coding of the terms: Once we have chosen the components of the vector representing the text dj, we must decide how to encode each coordinate of this vector. There are different methods to calculate the weight wkj. These methods are based on two observations:

a) The more frequently the term tk occurs in a document dj, the more relevant it is to the subject of this document.

b) The more often the term tk occurs throughout the collection, the less it can be used to discriminate between documents.

We note:

#(tk, dj): the number of occurrences of the term tk in the text dj;

|Tr|: the number of documents in the training corpus;

#Tr(tk): the number of documents of this set in which the term tk appears at least once.

According to the two previous observations, a term tk is therefore assigned a weight that is larger the more frequently it appears in the document. The vector component is coded as f(#(tk, dj)), where the function f remains to be determined [1]. Two approaches can be used.

The first is to take as weight the number of occurrences of the term in the document:

    wkj = # (tk, dj) (1)

The second approach is simply to assign the binary value 1 if the word appears in the text and 0 otherwise:


wkj = 1 if #(tk, dj) >= 1, and wkj = 0 otherwise (2)
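For instance, if the term tk occurs three times in a document dj, coding (1) gives wkj = 3, whereas coding (2) gives wkj = 1; a term absent from dj receives the weight 0 under both codings.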

6) Coding by term frequency X inverse document frequency: The two functions (1) and (2) above are rarely used because they impoverish the encoded information:

Function (1) does not take into account the distribution of the term across the other texts of the collection, while function (2) does not take into account the frequency of the term within the text (which can often be an important element for the decision) [1].

TF x IDF encoding was introduced in the vector model; it gives much importance to words that appear often within the same text, which corresponds to the intuitive idea that these words are more representative of the document. Its particularity is that it also gives less weight to words that belong to several documents, to reflect the fact that these words have little ability to discriminate between classes [2]. The weight of the term tk in the document dj is calculated as:

wkj = TFIDF(tk, dj) = #(tk, dj) x log(|Tr| / #Tr(tk)) (3)

where:

#(tk, dj): the number of occurrences of the term tk in the document dj;

|Tr|: the number of documents in the training corpus;

#Tr(tk): the number of documents of this set in which the term tk appears at least once.

7) TFC coding: TF x IDF encoding does not take the length of the documents into account. For this purpose, TFC coding is similar to TF x IDF, but it corrects for the length of the texts by a cosine normalization, in order not to favor the long documents [1]:

TFC(tk, dj) = TFIDF(tk, dj) / sqrt( sum over ts in T of TFIDF(ts, dj)^2 ) (4)
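The following Python sketch (illustrative only; the tokenized toy corpus is invented and this is not the authors' implementation) computes the TF x IDF weights of equation (3) and the cosine-normalized TFC weights of equation (4):

    import math

    def tfidf_weights(doc_tokens, corpus_tokens):
        # TF x IDF of equation (3): #(tk, dj) * log(|Tr| / #Tr(tk)).
        n_docs = len(corpus_tokens)
        weights = {}
        for term in set(doc_tokens):
            tf = doc_tokens.count(term)
            df = sum(1 for d in corpus_tokens if term in d)
            weights[term] = tf * math.log(n_docs / df)
        return weights

    def tfc_weights(doc_tokens, corpus_tokens):
        # TFC of equation (4): TF x IDF divided by the Euclidean norm of the document vector.
        w = tfidf_weights(doc_tokens, corpus_tokens)
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        return {t: v / norm for t, v in w.items()}

    corpus = [["ant", "colony", "text"], ["text", "indexing"], ["naive", "bayes", "text"]]
    print(tfc_weights(corpus[0], corpus))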

III. NAIVE BAYES ALGORITHM

In machine learning, different types of classifiers have been developed to achieve a maximum degree of precision and efficiency, each with its advantages and disadvantages, but they share common characteristics [8].

Among the learning algorithms we can cite: Naive Bayes, which is the best known algorithm, the Rocchio method, neural networks, the method of the k nearest neighbors, decision trees and support vector machines [8].

The Naive Bayes classifier is the most commonly used algorithm; it is based on Bayes' theorem for calculating conditional probabilities. In a general context, this theorem provides a way to calculate the conditional probability of a cause knowing the presence of an effect.

When we apply naive Bayes to a text categorization task, we look for the class that maximizes the probability of observing the words of the document.

During the training phase, the classifier calculates, for each category, the probability that a document belongs to it from the proportion of training documents belonging to this category. It also calculates the probability that a given word is present in a text, knowing that this text belongs to this category. Then, when a new document has to be classified, we calculate the probability that it belongs to each class using Bayes' rule and the probabilities calculated in the previous step. The quantity to be estimated is:

p(cj | a1, a2, a3, ..., an)

where cj is a category and ai is an attribute.

Using Bayes' theorem, we obtain:

p(cj | a1, ..., an) = p(a1, ..., an | cj) p(cj) / p(a1, ..., an) (5)

The naive assumption is that the attributes are independent given the class, so that:

p(a1, ..., an | cj) = product over i of p(ai | cj) (6)

In other words, the probability that a word appears in a text is assumed to be independent of the presence of the other words in the text. This is not strictly true: for example, the probability of occurrence of the word "artificial" depends partly on the presence of the word "intelligence". However, this assumption does not prevent such a classifier from providing satisfactory results and, more importantly, it greatly reduces the necessary calculations. Without it, we would have to consider all possible combinations of words in a text, which on the one hand involves a large number of calculations, and on the other hand reduces the quality of the statistical estimation, since the frequency of occurrence of each combination would be much lower than the frequency of occurrence of the words alone [1].

To estimate the probability p(ai | cj), we could directly compute, among the training documents, the proportion of those belonging to the class cj that contain the word ai. In the extreme case where a word is never met in a class, its probability of 0 dominates the others in the above product and would cancel the overall probability. To overcome this problem, a good way is to use the m-estimate, calculated as:

p(ai | cj) = (nk + 1) / (n + |Vocabulary|) (7)

where:

nk is the number of occurrences of the word in the class cj;

n is the total count of words in the training corpus;

|Vocabulary| is the number of keywords.
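The sketch below (a simplified illustration under the bag-of-words independence assumption, using a smoothed estimate in the spirit of equation (7) computed per class; the toy corpus is invented and this is not the authors' code) shows the counts gathered at training time and the classification rule:

    import math
    from collections import Counter, defaultdict

    def train(corpus):
        # corpus: list of (tokens, category). Collect word counts per category and class priors.
        word_counts, class_docs, vocab = defaultdict(Counter), Counter(), set()
        for tokens, cat in corpus:
            class_docs[cat] += 1
            word_counts[cat].update(tokens)
            vocab.update(tokens)
        return word_counts, class_docs, vocab

    def classify(tokens, word_counts, class_docs, vocab):
        total_docs = sum(class_docs.values())
        best_cat, best_logp = None, float("-inf")
        for cat in class_docs:
            n = sum(word_counts[cat].values())             # words counted for this class
            logp = math.log(class_docs[cat] / total_docs)  # prior p(cj)
            for w in tokens:
                nk = word_counts[cat][w]
                logp += math.log((nk + 1) / (n + len(vocab)))  # smoothed p(ai|cj), cf. eq. (7)
            if logp > best_logp:
                best_cat, best_logp = cat, logp
        return best_cat

    corpus = [(["market", "growth"], "Economy"), (["school", "teacher"], "Education")]
    model = train(corpus)
    print(classify(["school", "growth", "teacher"], *model))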


Figure 1. Cosine similarity algorithm

E. Ant colony optimization

To find the category of a text, we adopt the ant colony optimization (ACO) algorithm proposed in [5]. Although the ant colony algorithm was originally designed for the traveling salesman problem, it offers great flexibility. Our choice is motivated by the flexibility of this metaheuristic, which makes its application possible to different problems that are commonly NP-hard. Moreover, the use of a parallel model (colonies of ants) reduces the computing time and improves the quality of the solutions for categorization.

Formalization of the problem: In our context, the problem of classifying a text reduces to a subset selection problem [5], and we can formalize it as a pair (S, f) such that:

S contains all the cosine similarities calculated between the documents of the graph and the text to classify; it is the "similarity matrix" mat_sim.

f is the score function defined in [5], computed over the documents of the graph and the document to classify.

The sought set S' is the subset of nodes of the graph which are the most similar to the document to classify; the result is thus a consistent subset S' of nodes that maximizes the score function.

F. Description of the algorithm

At each cycle of the algorithm, each ant constructs a subset. Starting from an empty subset Sk, at each iteration the ant adds a pair of nodes from the similarity matrix, chosen among all the pairs not yet selected. The pair of nodes to add to Sk is chosen with a probability which depends on the pheromone trails and on a heuristic: the one encourages the pairs that have the greatest similarity, and the other encourages the pairs that most increase the score function. Once each ant has built its subset, a local search procedure starts to improve the quality of the best subset found during this cycle. Pheromone trails are subsequently updated based on the improved subset. Ants stop their construction when every pair of candidate nodes would decrease the score of the subset, or when the three latest additions have failed to increase the score.

Construction of a solution by an ant: The following pseudocode describes the procedure followed by the ants to construct a subset. The first object is selected randomly; the following ones are selected among the candidates.

Figure 2. Construction of a solution by an ant

V. RESULTS AND DISCUSSION

To evaluate the performance of our proposal, we carried out experiments using two corpora, one for the training and the other for the test. We also use the Naive Bayes classifier as a baseline.

TABLE I. CLASSES OF THE CORPUS

Class     | # documents (training) | # documents (test)
Economy   | 29                     | 18
Education | 10                     | 9
Religion  | 19                     | 17
Sociology | 30                     | 14
Sport     | 4                      | 2

The results of the classification stage are reported below for the ant colony algorithm and for the naive Bayes algorithm.

TABLE II. RESULTS OF THE TESTS WITH THE ANT COLONY ALGORITHM

Class  | Eco. | Educ. | Relig. | Socio. | Sport | Total
Eco.   | 17   | 0     | 0      | 1      | 0     | 18
Educ.  | 0    | 7     | 1      | 1      | 0     | 9
Relig. | 0    | 1     | 16     | 0      | 0     | 17
Socio. | 0    | 3     | 8      | 3      | 0     | 14
Sport  | 0    | 0     | 0      | 0      | 2     | 2

Algorithm Cosine_Similarity (Figure 1)
Input: doc_Graph, doc_class   // graph of documents, document to classify
Output: Mat_Sim               // similarity matrix based on the relevant attributes
begin
  Mat_Sim := 0
  For each node of doc_Graph
    // extract the set of attributes of the node of the graph
    Sim := Calcul_Sim(node, doc_class)
    Mat_Sim := Mat_Sim + Sim(node, doc_class)
  Return Mat_Sim
end.
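As one possible concrete reading of this pseudocode (a sketch assuming that each node of the document graph is already represented by a dictionary of term weights, for example the TFC weights of Section II; it is not the authors' implementation), the entries of Mat_Sim can be computed as follows:

    import math

    def cosine_similarity(weights_a, weights_b):
        # weights_*: dict mapping term -> weight for one document.
        common = set(weights_a) & set(weights_b)
        dot = sum(weights_a[t] * weights_b[t] for t in common)
        norm_a = math.sqrt(sum(v * v for v in weights_a.values()))
        norm_b = math.sqrt(sum(v * v for v in weights_b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    def similarity_matrix(doc_graph, doc_to_classify):
        # One similarity value per node of the document graph, as in Figure 1.
        return {node: cosine_similarity(vec, doc_to_classify) for node, vec in doc_graph.items()}

    graph = {"d1": {"ant": 0.7, "colony": 0.7}, "d2": {"text": 1.0}}
    print(similarity_matrix(graph, {"ant": 0.5, "text": 0.5}))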

Procedure Construction-Subset (Figure 2)
Input: a subset-selection problem (S, f), an associated heuristic function S x P(S) -> IR+,
       a strategy and a pheromone factor
Output: a subset of objects

Initialize the pheromone trails to tau_max
begin
  Repeat
    For each ant k in 1 .. nbAnts, construct a solution Sk as follows:
      1. Randomly select a first node oi
      2. Sk := {oi}
      3. Candidates := the nodes oj of S such that Sk union {oj} remains consistent
      4. While Candidates is not empty do
      5.   Choose a node oi from Candidates with a probability that depends on the
           pheromone trail and on the heuristic
           // where T is the set of attributes (terms),
           // pt(a) is the weight of term t in the document node a of the graph,
           // pt(b) is the weight of term t in the document b to be classified
      6.   Sk := Sk union {oi}
      7.   Remove oi from Candidates
      8.   Remove from Candidates every node oj such that Sk union {oj} is no longer consistent
      9. End while
    End for
    Update the pheromone trails according to {S1, ..., SnbAnts}
    If a pheromone trail is lower than tau_min then set it to tau_min
    Else if a pheromone trail is greater than tau_max then set it to tau_max
  Until the maximum number of cycles is reached or a solution is found
end.
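The following Python sketch gives one possible reading of an ant's construction step (the pheromone and heuristic weighting, the score function and the stopping rule are simplified assumptions on our part; the paper relies on the formalization of [5], which is not fully reproduced here):

    import random

    def construct_subset(candidates, similarity, pheromone, score, alpha=1.0, beta=2.0):
        # candidates: node identifiers; similarity[o]: cosine similarity of node o to the
        # document to classify; pheromone[o]: current trail on o; score(subset): function f.
        subset = [random.choice(candidates)]              # step 1: first node chosen at random
        remaining = [o for o in candidates if o not in subset]
        failures = 0
        while remaining and failures < 3:                 # stop after three non-improving additions
            # selection probability proportional to pheromone^alpha * similarity^beta
            weights = [max(pheromone[o] ** alpha * similarity[o] ** beta, 1e-9) for o in remaining]
            o = random.choices(remaining, weights=weights, k=1)[0]
            if score(subset + [o]) > score(subset):
                subset.append(o)
                failures = 0
            else:
                failures += 1
            remaining.remove(o)
        return subset

    # Toy usage: the score here is simply the summed similarity of the selected nodes.
    sim = {"d1": 0.9, "d2": 0.1, "d3": 0.8}
    tau = {"d1": 1.0, "d2": 1.0, "d3": 1.0}
    print(construct_subset(list(sim), sim, tau, score=lambda s: sum(sim[o] for o in s)))

A complete implementation would run nbAnts such constructions per cycle, apply the local search step, and keep the pheromone trails within the [tau_min, tau_max] bounds, as indicated in the pseudocode above.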


TABLE III. RESULTS OF THE TESTS WITH THE NAIVE BAYES ALGORITHM

Class  | Eco. | Educ. | Relig. | Socio. | Sport | Total
Eco.   | 14   | 2     | 0      | 2      | 0     | 18
Educ.  | 0    | 7     | 1      | 1      | 0     | 9
Relig. | 0    | 2     | 14     | 0      | 0     | 17
Socio. | 0    | 4     | 8      | 2      | 0     | 14
Sport  | 0    | 0     | 0      | 0      | 2     | 2

Precision and recall are the measurements most used to evaluate information retrieval systems; they are defined as follows.

TABLE IV. CONTINGENCY TABLE FOR THE EVALUATION OF THE CLASSIFIERS

                                            | Document belonging to the category | Document not belonging to the category
Document assigned to the class by classifier| a                                  | b
Document rejected from class by classifier  | c                                  | d

According to this table, we define:

Precision = a/(a+b), the number of correct assignments over the total number of assignments.

Recall = a/(a+c), the number of correct assignments over the number of assignments that should have been made.

When evaluating the performance of a classifier, precision or recall is not considered separately. So the F1 measure, which is used extensively, is defined by the formula:

F1 = 2*r*p / (p + r), where r is the recall and p is the precision. It is a function which is maximized when the recall and the precision are close.

Table V and Table VI present the performances of the ant colony and of naive Bayes in terms of recall, precision and F1.

TABLE V. RECALL, PRECISION AND F1 FOR EACH CLASS (ANT COLONY)

Class     | Recall (%) | Precision (%) | F1 (%)
Economy   | 94.44      | 100           | 97.14
Education | 77.77      | 63.63         | 69.99
Religion  | 94.11      | 64            | 76.18
Sociology | 21.42      | 60            | 31.56
Sport     | 100        | 100           | 100

TABLE VI. RECALL, PRECISION AND F1 FOR EACH CLASS (NAIVE BAYES)

Class     | Recall (%) | Precision (%) | F1 (%)
Economy   | 77.77      | 100           | 87.49
Education | 77.77      | 46.66         | 58.32
Religion  | 82.35      | 60.86         | 69.99
Sociology | 14.28      | 40            | 21.04
Sport     | 100        | 100           | 100
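As a worked check against Table II, consider the Economy class for the ant colony classifier: a = 17 documents are correctly assigned to Economy, b = 0 other documents are wrongly assigned to it, and c = 1 Economy document is assigned elsewhere. Hence Precision = 17/17 = 100%, Recall = 17/18 = 94.44% and F1 = 2 x 100 x 94.44 / (100 + 94.44) ≈ 97.14%, in agreement with the first row of Table V.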


Figure 3. Classification rates for each category and for both classifiers (F1 of the ant colony algorithm vs. F1 of naive Bayes).

While the results for several classes seem to be acceptable, the results for the Sociology class are dramatic; this is due to the small size of the learning corpus.

The histogram shows that the suggested ant colony algorithm outperforms the Naive Bayes algorithm in terms of recall and precision. This is not a surprise, since the graphical representation of the problem handles the relationships between similar documents better than Naive Bayes does.

REFERENCES

[1] Jalame R. Machine learning and multilingual text classification. Université Lumière Lyon 2, June 2003.

[2] Rhel S. Automatic text categorization and co-occurrence of words from unlabeled documents. Submitted to the Graduate Faculty of the University Laval, Quebec, January 2005.

[3] Hacid H. and Zighed D. An effective method for locally neighborhood graphs updating. In DEXA 2005, pp. 930-939.

[4] Valette M. Application of classification algorithms for automatic detection of racist content on the Internet. June 2003.

[5] Solnon C. Contributions to practical combinatorial problem solving: graphs and ants. Thesis for the Habilitation Research, University Claude Bernard Lyon 1, December 2005.

[6] Schmid H. Probabilistic part-of-speech tagging using decision trees. In International Conference on New Methods in Language Processing, Manchester, UK, 1994.

[7] Porter M. An algorithm for suffix stripping. Program 14(3), pp. 130-137, 1980.

[8] Sebastiani F. Automated text categorization: tools, techniques and applications. Rennes, France, April 3, 2002.
