ANT Colony for Text Indexing
Transcript retrieved 8/11/2019
Application of an ant colony algorithm for text
indexing
Nadia Lachetar¹
¹Computer Science Department, University 20 août 1955 Skikda
Skikda, Algeria; Email: [email protected]

Halima Bahi²
²LabGED Laboratory, Computer Science Department, University Badji Mokhtar Annaba
Annaba, Algeria; Email: [email protected]
Abstract: Every day, the mass of information available to us increases. This information would be of little use if our ability to access it efficiently did not increase as well. For maximum benefit, we need tools that allow us to search, sort, index, store, and analyze the available data, and tools that help us find the desired information in a reasonable time by performing certain tasks for us. One promising area is automatic text categorization. Imagine ourselves in the presence of a considerable number of texts, which are more easily accessible if they are organized into categories according to their theme. Of course, one could ask a human to read the texts and classify them manually, but this task is hard for hundreds, even thousands, of texts. So it seems necessary to have an automated application that indexes text databases. In this article, we present our experiments in automated text categorization, where we suggest the use of an ant colony algorithm. A Naive Bayes algorithm is used as a baseline in our tests.
Keywords: Information Retrieval; Text Categorization; Naive Bayes Algorithm; Ant Colony Algorithm.
I. INTRODUCTION
Research in the field of automatic categorization remains relevant today since the results are still subject to improvements. For some tasks, automatic classifiers perform almost as well as humans, but for others the gap is still large. At first glance, the main problem is easy to grasp. On one hand, we are dealing with a bank of text documents, and on the other with a set of categories. The goal is to build a computer application which can determine to which category a text belongs based on its contents [2].
Despite this simplified definition, the solution is not straightforward and several factors must be considered. First, we need to select an adequate representation of the texts to be treated; this is an essential step in machine learning. We should opt for consistent and sensible attributes to abstract the data before submitting them to an algorithm. Subsequently, we discuss the selection of attributes, a step almost always involved in automated text categorization, and eliminate attributes considered unnecessary for classification [2]. Once this pretreatment is completed, we perform classification using both the Naive Bayes algorithm [1] and our proposed ant colony algorithm. The remainder of the paper is organized as follows:
In Section II, we present the various aspects of automatic text categorization; in particular, it addresses the main modes of document representation. Then, Section III introduces the Naive Bayes algorithm. In Section IV, we present our approach, which is the application of an ant colony algorithm to text categorization. Section V presents the obtained results and a discussion.
II. TEXT CATEGORIZATION
The purpose of automatic text categorization is to teach a machine to classify a text into the correct category based on its content; the categories refer to topics (subjects). We may wish that the same text is associated with only one category, or it may belong to a number of categories. The set of categories is determined in advance. The problem is to group the texts by their similarity. In text categorization, the classification is similar to the problem of extracting the semantics of the texts, since the membership of a text in a category is closely related to the meaning of the text. This is partly what makes the task difficult, since the treatment of the semantics of words written in natural language is not yet a solved problem.
A. How to categorize a text?
The categorization process includes the construction of a prediction model that receives the text as input and associates one or more labels with it as output. To identify the category associated with a text, the following steps are required:
1) Learning includes several steps and leads to a prediction model.
a) We have a set of labeled texts (for every text we know its class).
b) From this corpus, we extract the k descriptors (t1, ..., tk) which are most relevant in the sense of the problem solving.
c) We then have a table "descriptors X individuals", and for every text we know the values of the descriptors and its label.
2) The classification of a new text dx includes two stages:
978-1-61284-732-0/11/$26.00 2010 IEEE
a) Search for and weighting of the instances t1, ..., tk of the terms in the text dx to classify.
b) Application of a learning algorithm to these instances and the previous table to predict the label of the text dx [1].
Note that the k most relevant descriptors (t1, ..., tk) are extracted during the first phase by analyzing the texts of the training corpus. In the second phase, the classification of a new text, we simply look up the frequency of these k descriptors (t1, ..., tk) in the text to be classified.
B. Representation and coding of a text
Prior coding of the text is necessary because there is currently no learning method that can directly handle unstructured data, either in the model construction stage or when used in classification.
For most learning methods, we must convert all texts into a pivot table "individuals-variables":
- An individual is a text dj, labeled during the learning stage; it will be classified in the prediction phase.
- Variables are the descriptors (terms) tk which are extracted from the data of the text.
The content of a cell wkj represents the weight of term k in document j.
Different methods are proposed for the selection of descriptors and the weights associated with these descriptors. Some researchers use words as descriptors, while others prefer to use lemmas (lexical roots) or even stems (words with affixes deleted) [1].
C. Approaches for text representation
Learning algorithms are not able to treat texts and, more generally, unstructured data such as images, sounds, and video clips. Therefore a preliminary step called representation is required. This step aims to represent each document by a vector whose components are, for instance, the words in the text, so as to make it usable by the learning algorithms. A collection of texts can then be represented by a matrix whose columns are the documents [1].
Many researchers have chosen to use a vector representation in which each text is represented by a vector of n weighted terms. The n terms are simply the n different words occurring in the texts.
1) Choice of terms: In text categorization, we transform the document into a vector dj = (w1j, w2j, ..., w|T|j), where T is the set of terms (descriptors) that appear at least once in the learning corpus (the collection). The weight wkj corresponds to the contribution of term tk to the semantics of text dj [1].
2) Bag of words representation: The simplest representation of a text is a vector model called "bag of words". The idea is to transform the text into a vector where each component is a word. Words have the advantage of having an explicit sense. However, several problems arise. We must first define what a "word" is in order to process it automatically. A word can be regarded as a sequence of characters from a dictionary or, more practically, as a sequence of non-delimiter characters framed by delimiter characters. The components of the vector are a function of the occurrences of the words in the text. This representation excludes any grammatical analysis and any notion of distance between words, which is why it is called "bag of words"; other authors speak of "set of words" when the weights are binary [1].
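As a minimal sketch in Python, the bag-of-words construction described above can be written as follows (the toy corpus, the tokenization rule, and the function names are ours, for illustration only):

```python
import re
from collections import Counter

def tokenize(text):
    """Split a text into lowercase word tokens: runs of letters framed
    by delimiter (non-letter) characters."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(text, vocabulary):
    """Represent a text as a vector of word counts over a fixed vocabulary."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]

corpus = ["the ant colony finds food", "the colony indexes the text"]
vocabulary = sorted(set(t for doc in corpus for t in tokenize(doc)))
vectors = [bag_of_words(doc, vocabulary) for doc in corpus]
```

Each document becomes one column of the collection matrix; word order and grammar are discarded, exactly as the "bag of words" name suggests.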
3) Representation of texts by sentences: Despite the simplicity of using words as units of representation, some authors propose to use sentences as units. Sentences are more informative than words, because they have the advantage of preserving information on the position of a word within the sentence. Logically, such a representation should give better results than those obtained with words. However, if the semantic qualities are preserved, the statistical qualities are largely degraded [1].
4) Representation of texts by lexical roots and lemmas: In the "bag of words" representation, each form of a word is considered a different descriptor. For example, the words "movers", "removals", "move", etc. are considered different descriptors although they share the same root "move". Techniques of suffix stripping (or stemming), which search for the lexical roots, may resolve this difficulty. For the detection of lexical roots, several algorithms have been proposed; the best known for the English language is the Porter algorithm [7]. Lemmatization replaces verbs by their infinitive form and nouns by their singular form. The TreeTagger algorithm was developed for English, French, German and Italian [6].
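For illustration only, the suffix-stripping idea can be sketched with a deliberately crude Python function (the suffix list is invented; a real system would use the full Porter algorithm [7], which applies ordered, condition-guarded rewrite rules):

```python
def crude_stem(word, suffixes=("ing", "ers", "als", "s", "e")):
    """Very crude suffix stripping for illustration: remove the longest
    matching suffix, but keep a stem of at least three characters.
    Not the Porter algorithm."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

With this sketch, "movers", "moving" and "move" all collapse to the same descriptor "mov", which is exactly the conflation effect stemming aims at.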
5) Coding terms: Once we have chosen the components of the vector representing text j, we must decide how to encode each coordinate of the vector dj. There are different methods to calculate the weight wkj. These methods are based on two observations:
a) The more frequently a term tk appears in a document dj, the more relevant it is to the subject of this document.
b) The more often a term tk occurs throughout the collection, the less it discriminates between documents.
We note:
#(tk, dj): the number of occurrences of term tk in the text dj;
|Tr|: the number of documents in the training corpus;
#Tr(tk): the number of documents of this set in which the term tk appears at least once.
According to the two previous observations, a term tk is therefore assigned a weight that is stronger the more frequently it appears in the document.
The vector component is coded f(#(tk, dj)), where the function f remains to be determined [1]. Two approaches can be used:
The first is to take as weight the number of occurrences of the term in the document:
wkj = #(tk, dj) (1)
The second approach is simply to assign a binary value: 1 if the word appears in the text, 0 otherwise.
wkj = 1 if #(tk, dj) >= 1
wkj = 0 otherwise (2)
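Codings (1) and (2) amount to the following minimal Python sketch (the function names are ours):

```python
def term_frequency_weight(occurrences):
    """Equation (1): the weight is the raw occurrence count #(tk, dj)."""
    return occurrences

def binary_weight(occurrences):
    """Equation (2): 1 if the term appears at least once, else 0."""
    return 1 if occurrences >= 1 else 0
```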
6) Coding terms with TF X IDF (term frequency X inverse document frequency): The two codings (1) and (2) above are rarely used because they impoverish the encoded information:
Coding (2) does not take into account the frequency of occurrence of the word in the text (which can often be an important clue).
Coding (1) does not take into account the frequency of the term in the other texts [1].
The TF X IDF encoding was introduced in the vector model; it gives much importance to words that appear often within the same text, which corresponds to the intuitive idea that these words are more representative of the document. Its particularity is that it also gives less weight to words that belong to several documents, to reflect the fact that these words have little ability to discriminate between classes [2]. The weight of term tk in document dj is calculated as:
wkj = TFIDF(tk, dj) = #(tk, dj) x log(|Tr| / #Tr(tk)) (3)
where:
#(tk, dj): number of occurrences of term tk in document dj;
|Tr|: number of documents in the training corpus;
#Tr(tk): number of documents of this set in which the term tk appears at least once.
7) TFC coding: TF X IDF coding does not correct for the length of the documents. For this purpose, the TFC coding is similar to TF X IDF, but it normalizes the length of the texts by cosine normalization, in order not to favor long documents [1]:
TFC(tk, dj) = TFIDF(tk, dj) / sqrt( sum over s in T of TFIDF(ts, dj)^2 ) (4)
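Equations (3) and (4) can be sketched in Python as follows (the function names and argument names are ours):

```python
import math

def tf_idf(count_k_j, num_docs, doc_freq_k):
    """Equation (3): #(tk, dj) * log(|Tr| / #Tr(tk))."""
    return count_k_j * math.log(num_docs / doc_freq_k)

def tfc(weights_j):
    """Equation (4): cosine-normalize a document's TF-IDF weight vector
    so that long documents are not favored."""
    norm = math.sqrt(sum(w * w for w in weights_j))
    return [w / norm for w in weights_j] if norm > 0 else weights_j
```

After `tfc`, every document vector has unit Euclidean length, so only the direction of the vector (the relative importance of terms) matters.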
III. NAIVE BAYES ALGORITHM
In machine learning, different types of classifiers have been developed to achieve the maximum degree of precision and efficiency, each with its advantages and disadvantages, but they share common characteristics [8].
Among the learning algorithms we cite: Naive Bayes, which is the best-known algorithm, the Rocchio method, neural networks, the k nearest neighbors method, decision trees, and support vector machines [8].
The Naive Bayes classifier is the most commonly used algorithm; it is based on the Bayes theorem for calculating conditional probabilities. In a general context, this theorem provides a way to calculate the conditional probability of a cause knowing the presence of an effect.
When we apply naive Bayes to a text categorization task, we look for the class that maximizes the probability of observing the words of the document.
During the training phase, the classifier calculates the probability that a document belongs to a given category from the proportion of training documents belonging to this category. It also calculates the probability that a given word is present in a text, knowing that this text belongs to this category. Then, when a new document must be classified, we calculate the probability that it belongs to each class using the Bayes rule and the probabilities calculated in the previous step. The quantity to be estimated is:
p(cj | a1, a2, a3, ..., an)
where cj is a category and ai is an attribute. Using the Bayes theorem, we obtain:
p(cj | a1, ..., an) = p(a1, ..., an | cj) * p(cj) / p(a1, ..., an) (5)
The naive assumption is that the probability that a word appears in a text is independent of the presence of the other words in the text, so that:
p(a1, ..., an | cj) = p(a1 | cj) * p(a2 | cj) * ... * p(an | cj) (6)
This assumption is obviously simplistic: the probability of occurrence of the word "artificial", for example, depends partly on the presence of the word "intelligence". However, it does not preclude such a classifier from providing satisfactory results and, more importantly, it greatly reduces the necessary calculations. Without it, we would have to consider all possible combinations of words in a text, which involves a large number of calculations, and it would also reduce the quality of the statistical estimation, since the frequency of occurrence of each combination would be much lower than the frequency of occurrence of the words alone [1].
To estimate the probability p(ai | cj), we could directly compute, among the training documents, the proportion of those belonging to class cj that contain the word ai.
In the extreme case where a word is never met in a class, its probability of 0 dominates the others in the above product and voids the overall probability. To overcome this problem, a good way is to use the m-estimate, calculated as:
p(ai | cj) = (nk + 1) / (n + |Vocabulary|) (7)
where:
nk is the number of occurrences of the word ai in class cj;
n is the total count of words in the training corpus;
|Vocabulary| is the number of keywords.
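A minimal Python sketch of this training and classification scheme (the toy corpus and the function names are ours; the smoothing follows equation (7), applied per class):

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (list_of_words, class_label) pairs. Returns the class
    priors, per-class word counts, and the vocabulary."""
    class_docs = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)   # class -> word -> occurrences
    vocabulary = set()
    for words, label in docs:
        word_counts[label].update(words)
        vocabulary.update(words)
    priors = {c: class_docs[c] / len(docs) for c in class_docs}
    return priors, word_counts, vocabulary

def classify(words, priors, word_counts, vocabulary):
    """Pick the class maximizing log p(cj) + sum_i log p(ai|cj), with the
    m-estimate p(ai|cj) = (nk + 1) / (n + |Vocabulary|) of equation (7)."""
    best, best_score = None, float("-inf")
    for c, prior in priors.items():
        n = sum(word_counts[c].values())
        score = math.log(prior)
        for w in words:
            nk = word_counts[c][w]
            score += math.log((nk + 1) / (n + len(vocabulary)))
        if score > best_score:
            best, best_score = c, score
    return best
```

Working in log-space avoids numerical underflow when the product of many small probabilities is taken.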
Figure 1. Cosine similarity algorithm
E. Ant colony optimization
To find the text category, we adopt the ant colony optimization (ACO) algorithm proposed in [5]. Although the ant colony algorithm was originally designed for the traveling salesman problem, it offers great flexibility. Our choice is motivated by the flexibility of this metaheuristic, which makes possible its application to different problems that are commonly NP-hard. Moreover, the use of a parallel model (colonies of ants) reduces the computing time and improves the quality of the solutions for categorization.
Formalization of the problem: In our context, the problem of classifying a text reduces to a subset selection problem [5], and we can formalize it as a pair (S, f) such that:
- S contains all the cosine similarities calculated between the documents of the graph and the text to classify; it is the "similarity matrix" mat_sim.
- f is the score function defined in [5].
The result is a consistent subset S' of the nodes of the graph which are most similar to the document to classify, such that the score function is maximized.
F. Description of the algorithm
At each cycle of the algorithm, each ant constructs a subset Sk. Starting from an empty subset, at each iteration the ant adds a couple of nodes from the similarity matrix, chosen among all the couples not yet selected. The pair of nodes to add to Sk is chosen with a probability which depends on the pheromone trails and on a heuristic: one aim is to favor couples which have the greatest similarity, and the other is to favor couples which most increase the score function. Once each ant has built its subset, a local search procedure starts to improve the quality of the best subset found during this cycle. The pheromone trails are subsequently updated based on the improved subset. Ants stop their construction when all pairs of candidate nodes would decrease the score of the subset, or when the three latest additions failed to increase the score.
Construction of a solution by an ant: The following code describes the procedure followed by the ants to construct a subset. The first object is selected randomly; the following objects are selected among the candidates.
Figure 2. Construction of a solution by an ant
V. RESULTS AND DISCUSSION
To evaluate the performance of our suggestion, we ran experiments using two corpora, one for training and the other for testing. We also use the Naive Bayes classifier as a baseline.
TABLE I. CLASSES OF CORPUS

Class       # docs (training)   # docs (test)
Economy     29                  18
Education   10                  09
Religion    19                  17
Sociology   30                  14
Sport       4                   2
The results of the classification stage are reported below for the ant colony algorithm and the Naive Bayes algorithm.
TABLE II. RESULTS OF TESTS WITH ANT COLONY ALGORITHM

Class    Eco.  Educ.  Relig.  Socio.  Sport  Total
Eco.     17    0      0       1       0      18
Educ.    0     7      1       1       0      09
Relig.   0     1      16      0       0      17
Socio.   0     3      8       3       0      14
Sport    0     0      0       0      2      2
Algorithm Cosine_Similarity
Input: doc_Graph, doc_class    // graph of documents, document to classify
Output: Mat_Sim                // similarity matrix based on the relevant attributes
begin
  Mat_Sim <- 0
  For each node of doc_Graph
    // extract the set of attributes of the node of the graph
    Sim = Calcul_Sim(node, doc_class)
    Mat_Sim = Mat_Sim + Sim(node, doc_class)
  Return Mat_Sim
end
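The procedure amounts to computing cosine similarities between the document to classify and each node of the graph. A Python sketch (Calcul_Sim is realized here as the standard cosine over term-weight vectors; the function names are ours):

```python
import math

def cosine_similarity(weights_a, weights_b):
    """Cosine of the angle between two term-weight vectors:
    dot(a, b) / (||a|| * ||b||)."""
    dot = sum(a * b for a, b in zip(weights_a, weights_b))
    norm_a = math.sqrt(sum(a * a for a in weights_a))
    norm_b = math.sqrt(sum(b * b for b in weights_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def similarity_matrix(graph_docs, doc_to_classify):
    """Mat_Sim: similarity of each graph document to the document to classify."""
    return [cosine_similarity(d, doc_to_classify) for d in graph_docs]
```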
Procedure Construction-subset
Input: a subset selection problem (S, f), an associated heuristic function S x P(S) -> IR+,
       a pheromone strategy and a pheromone factor
Output: a subset Sk of objects
begin
  Initialize the pheromone trails to tau_max
  Repeat
    For each ant k in 1..nbAnts, construct a solution Sk as follows:
      1. Randomly select a first node oi
      2. Sk <- {oi}
      3. Candidates <- {oj in S | Sk U {oj} is consistent}
      4. While Candidates is not empty do
      5.   Choose a node oi in Candidates with a probability depending on
           the pheromone trail and on the heuristic
           // T is the set of attributes (terms)
           // pt(a) is the weight of term t in document a, a node of the graph
           // pt(b) is the weight of term t in document b, the document to be classified
      6.   Sk <- Sk U {oi}
      7.   Remove oi from Candidates
      8.   Remove from Candidates every node oj such that Sk U {oj} is not consistent
      9. End while
    End for
    Update the pheromone trails according to {S1, ..., SnbAnts}
    If a pheromone trail is less than tau_min then set it to tau_min
    Else if a pheromone trail is greater than tau_max then set it to tau_max
  Until the maximum number of cycles is reached or a solution is found
end
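The probabilistic node choice and the bounded pheromone update can be sketched in Python as follows (the parameters alpha, beta, the evaporation rate, the deposit and the bounds tau_min/tau_max are illustrative values, not taken from [5]):

```python
import random

def choose_node(candidates, pheromone, heuristic, alpha=1.0, beta=2.0):
    """ACO transition rule: pick a candidate o with probability
    proportional to pheromone[o]^alpha * heuristic[o]^beta."""
    weights = [pheromone[o] ** alpha * heuristic[o] ** beta for o in candidates]
    total = sum(weights)
    r = random.random() * total
    acc = 0.0
    for o, w in zip(candidates, weights):
        acc += w
        if r <= acc:
            return o
    return candidates[-1]

def update_pheromone(pheromone, best_subset, evaporation=0.1,
                     deposit=1.0, tau_min=0.01, tau_max=6.0):
    """Evaporate all trails, reward the nodes of the best subset, and
    clamp every trail to [tau_min, tau_max] as in the procedure above."""
    for o in pheromone:
        pheromone[o] *= (1.0 - evaporation)
        if o in best_subset:
            pheromone[o] += deposit
        pheromone[o] = min(tau_max, max(tau_min, pheromone[o]))
```

Bounding the trails to [tau_min, tau_max] keeps every candidate selectable and prevents premature convergence on one subset.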
TABLE III. RESULTS OF TESTS WITH NAIVE BAYES ALGORITHM

Class    Eco.  Educ.  Relig.  Socio.  Sport  Total
Eco.     14    2      0       2       0      18
Educ.    0     7      1       1       0      09
Relig.   0     2      14      0       0      17
Socio.   0     4      8       2       0      14
Sport    0     0      0       0      2      2

Precision and recall are the most used measurements to evaluate information retrieval systems; they are defined as follows:

TABLE IV. CONTINGENCY TABLE BASED EVALUATION OF THE CLASSIFIERS

                                           Document belonging   Document not belonging
                                           to the category      to the category
Document assigned to the class
by the classifier                          a                    b
Document rejected from the class
by the classifier                          c                    d

According to this table, we define:
Precision = a/(a+b), the number of correct assignments over the total number of assignments.
Recall = a/(a+c), the number of correct assignments over the number of assignments that should have been made.
When evaluating the performance of a classifier, precision or recall is not considered separately. So the F1 measure, which is used extensively, is defined by the formula:
F1 = 2*r*p/(p+r) (where r is the recall and p is the precision). It is a function which is maximized when the recall and the precision are close.
Table V and Table VI present the performances of the ant colony and of Naive Bayes in terms of recall, precision and F1.

TABLE V. RECALL, PRECISION, F1 FOR EACH CLASS (ANT COLONY)

Class      Recall   Precision   F1
Economy    94.44    100         97.14
Education  77.77    63.63       69.99
Religion   94.11    64          76.18
Sociology  21.42    60          31.56
Sport      100      100         100

TABLE VI. RECALL, PRECISION, F1 FOR EACH CLASS (NAIVE BAYES)

Class      Recall   Precision   F1
Economy    77.77    100         87.49
Education  77.77    46.66       58.32
Religion   82.35    60.86       69.99
Sociology  14.28    40          21.04
Sport      100      100         100
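The measures above follow directly from the contingency counts of Table IV (a Python sketch; the example counts correspond to the Economy column of Table II: a = 17 correct assignments, b = 0 other documents assigned to Economy, c = 1 Economy document assigned elsewhere):

```python
def precision_recall_f1(a, b, c):
    """Precision a/(a+b), recall a/(a+c) and F1 = 2*r*p/(p+r),
    computed from the contingency counts of Table IV."""
    p = a / (a + b) if (a + b) else 0.0
    r = a / (a + c) if (a + c) else 0.0
    f1 = 2 * r * p / (p + r) if (p + r) else 0.0
    return p, r, f1

p, r, f1 = precision_recall_f1(17, 0, 1)
```

This reproduces the Economy row of Table V: precision 100, recall 94.44, F1 97.14.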
Figure 3. Classification rates (F1) for each category (economy, education, religion, sociology, sport) and for both classifiers

While the results of several classes seem to be acceptable, the results of the Sociology class are dramatic; this is due to the small size of the learning corpus.

The histogram shows that the suggested ant colony algorithm outperforms the Naive Bayes algorithm in terms of recall and precision. This is not a surprise, since the graphical representation of the problem handles the relationship between similar documents better than Naive Bayes.

REFERENCES
[1] Jalam R. Machine learning and multilingual text classification. Thesis, Université Lumière Lyon 2, June 2003.
[2] Rhel S. Automatic text categorization and co-occurrence of words from unlabeled documents. Submission to the Graduate Faculty, Université Laval, Quebec, January 2005.
[3] Hacid H. and Zighed D. An effective method for locally neighborhood graphs updating. pp. 930-939, in DEXA 2005.
[4] Valette M. Application of classification algorithms for automatic detection of racist content on the Internet. June 2003.
[5] Solnon C. Contributions to the practical solving of combinatorial problems: graphs and ants. Habilitation thesis, Université Claude Bernard Lyon 1, December 2005.
[6] Schmid H. Probabilistic part-of-speech tagging using decision trees. In Conference on New Methods in Language Processing, Manchester, UK, 1994.
[7] Porter M. An algorithm for suffix stripping. Program 14(3), pp. 130-137, 1980.
[8] Sebastiani F. Automated text categorization: tools, techniques and applications. Rennes, France, April 3, 2002.