
Application of an ant colony algorithm for text indexing

Nadia Lachetar
1 Computer Science Department, University 20 aout 1955 Skikda, Skikda, Algeria
Email: [email protected]

Halima Bahi
2 LabGED Laboratory, Computer Science Department, University Badji Mokhtar Annaba, Annaba, Algeria
Email: [email protected]

Abstract: Every day, the mass of information available to us increases. This information would be of little use if our ability to access it efficiently did not increase as well. For maximum benefit, we need tools that allow us to search, sort, index, store, and analyze the available data, and tools that help us find the desired information in a reasonable time by performing certain tasks for us. One of the promising areas is automatic text categorization. Imagine ourselves in the presence of a considerable number of texts, which are more easily accessible if they are organized into categories according to their theme. Of course, one could ask a human to read the texts and classify them manually, but this task is hard if done for hundreds, or even thousands, of texts. So it seems necessary to have an automated application that indexes text databases. In this article, we present our experiments in automated text categorization, where we suggest the use of an ant colony algorithm. A Naive Bayes algorithm is used as a baseline in our tests.

Keywords: Information Retrieval; Text Categorization; Naive Bayes Algorithm; Ant Colony Algorithm.

    I. INTRODUCTION

Research in the field of automatic categorization remains relevant today since the results are still subject to improvement. For some tasks, automatic classifiers perform almost as well as humans, but for others the gap is still large. At first glance, the main problem is easy to grasp. On one hand, we are dealing with a bank of text documents, and on the other with a set of categories. The goal is to build a computer application which can determine to which category a text belongs based on its contents [2].

Despite this simplified definition, the solution is not straightforward and several factors must be considered. First, we need to select an adequate representation of the texts to be treated; this is an essential step in machine learning. We should opt for consistent and sensible attributes to abstract the data before submitting them to an algorithm. Subsequently, we discuss the selection of the attributes almost always involved in automated text categorization, and the elimination of the attributes considered unnecessary for classification [2]. Once this pretreatment is completed, we perform the classification using both the Naive Bayes algorithm [1] and our proposed ant colony algorithm.

The remainder of the paper is organized as follows: Section II presents the various aspects of automatic text categorization; in particular, it addresses the main modes of document representation. Section III introduces the Naive Bayes algorithm. Section IV presents our approach, which is the application of an ant colony algorithm to text categorization. Section V presents the obtained results and a discussion.

    II. TEXT CATEGORIZATION

The purpose of automatic text categorization is to teach a machine to classify a text into the correct category based on its content; the categories refer to topics (subjects). We may wish that a text is associated with only one category, or allow it to belong to a number of categories. The set of categories is determined in advance. The problem is to group the texts by their similarity. In text categorization, the classification is similar to the problem of extracting the semantics of texts, since the membership of a text in a category is closely related to the meaning of the text. This is partly what makes the task difficult, since the treatment of the semantics of words written in natural language is not yet a solved problem.

A. How to categorize a text?

The categorization process includes the construction of a prediction model that receives the text as input and associates one or more labels with it as output. To identify the category associated with a text, the following steps are required:

1) Learning, which includes several steps and leads to a prediction model:

a) We have a set of labeled texts (for every text we know its class).

b) From this corpus, we extract the k descriptors (t1, ..., tk) which are most relevant in the sense of the problem to be solved.

c) We then have a "descriptors X individuals" table, and for every text we know the values of its descriptors and its label.

2) The classification of a new text dx includes two stages:


a) Search for, and weighting of, the instances of the terms t1, ..., tk in the text dx to classify.

b) Application of a learning algorithm to these instances and to the previous table in order to predict the label of the text dx [1].

Note that the k most relevant descriptors (t1, ..., tk) are extracted during the first phase by analyzing the texts of the training corpus. In the second phase, the classification of a new text, we simply seek the frequencies of these k descriptors (t1, ..., tk) in the text to be classified.
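As a minimal sketch of this two-phase scheme (illustrative Python only, not the authors' implementation; the toy corpus is invented and a simple frequency criterion stands in for the relevance measure, which is not specified at this point):

    # Phase 1 (sketch): pick the k most frequent tokens of a labeled corpus as descriptors.
    # Phase 2 (sketch): count how often those descriptors occur in a new text to classify.
    from collections import Counter

    def tokenize(text):
        return text.lower().split()

    def select_descriptors(corpus, k):
        # corpus: list of (text, label) pairs; keep the k most frequent tokens.
        counts = Counter(tok for text, _ in corpus for tok in tokenize(text))
        return [tok for tok, _ in counts.most_common(k)]

    def describe(text, descriptors):
        # Frequency of each retained descriptor in the text to classify.
        toks = Counter(tokenize(text))
        return [toks[d] for d in descriptors]

    corpus = [("the economy grows", "Economy"), ("schools teach students", "Education")]
    descriptors = select_descriptors(corpus, k=4)
    print(descriptors, describe("the economy of schools", descriptors))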

B. Representation and coding of a text

Prior coding of the text is necessary because there is currently no learning method that can directly handle unstructured data, either in the model construction stage or when the model is used for classification.

For most learning methods, we must convert all the texts into an "individuals-variables" table.

An individual is a text dj; it is labeled during the learning stage, and it is what will be classified in the prediction phase.

The variables are the descriptors (terms) tk extracted from the data of the text.

The content wkj of the table represents the weight of term k in document j.

Different methods have been proposed for the selection of the descriptors and of the weights associated with them. Some researchers use words as descriptors, while others prefer to use lemmas (lexical roots) or even stems (words with affixes removed) [1].

C. Approaches for text representation

Learning algorithms are not able to treat texts directly, and more generally unstructured data such as images, sounds and video clips. Therefore a preliminary step called representation is required. This step aims to represent each document by a vector whose components are, for instance, the words of the text, so as to make it usable by the learning algorithms. A collection of texts can then be represented by a matrix whose columns are the documents [1].

Many researchers have chosen to use a vector representation in which each text is represented by a vector of n weighted terms. The n terms are simply the n different words occurring in the texts.

1) Choice of terms: In text categorization, we transform the document into a vector dj = (w1j, w2j, ..., w|T|j), where T is the set of terms (descriptors) that appear at least once in the learning corpus (the collection). The weight wkj corresponds to the contribution of the term tk to the semantics of the text dj [1].

2) Bag of words representation: The simplest representation of a text is a vector model called "bag of words". The idea is to transform each text into a vector where each component corresponds to a word. Words have the advantage of having an explicit sense. However, several problems arise. One must first define what a "word" is in order to be able to process it automatically. A word can be regarded as a sequence of characters belonging to a dictionary or, more practically, as a sequence of non-delimiter characters framed by delimiter characters. The components of the vector are a function of the occurrences of the words in the text. This representation excludes any grammatical analysis and any notion of distance between words, which is why it is called a "bag of words"; other authors speak of a "set of words" when the weights associated with the words are binary [1].
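To make the representation concrete, the following minimal Python sketch (not taken from the paper; the vocabulary and sentence are invented) builds the frequency-based and the binary bag-of-words vectors described above:

    def bag_of_words(text, vocabulary, binary=False):
        # Components are occurrence counts (or 0/1 indicators) of each vocabulary word.
        tokens = text.lower().split()
        counts = [tokens.count(w) for w in vocabulary]
        return [1 if c >= 1 else 0 for c in counts] if binary else counts

    vocabulary = ["ant", "colony", "text", "indexing"]
    print(bag_of_words("Ant colony algorithms index text", vocabulary))        # counts
    print(bag_of_words("Ant colony algorithms index text", vocabulary, True))  # binary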

3) Representation of texts by sentences: Despite the simplicity of using words as units of representation, some authors propose to use sentences as units. Sentences are more informative than words, because they preserve information on the position of a word in the sentence. Logically, such a representation should give better results than those obtained with words alone. However, while the semantic qualities are preserved, the statistical qualities are largely degraded [1].

4) Representation of texts by lexical roots and lemmas: In the "bag of words" representation model, each form of a word is considered as a different descriptor. For example, the words "movers", "removals", "move", etc. are considered as different descriptors although they share the same root "move". Suffix-stripping (stemming) techniques, which recover the lexical roots, may resolve this difficulty. Several algorithms have been proposed for the detection of lexical roots; the best known for the English language is Porter's algorithm [7]. Lemmatization consists of replacing verbs with their infinitive form and nouns with their singular form; the TreeTagger lemmatizer was developed for English, French, German and Italian [6].
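As an illustration, the sketch below assumes the NLTK library, which the paper does not mention; it applies Porter's suffix-stripping algorithm so that inflected forms share a single descriptor:

    # A sketch assuming the NLTK library (not used in the paper): pip install nltk
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["moving", "moved", "moves", "move"]:
        print(word, "->", stemmer.stem(word))
    # The inflected forms are all reduced to the stem "move", so they count as one descriptor.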

5) Coding of the terms: Once we have chosen the components of the vector representing the text dj, we must decide how to encode each coordinate of this vector. There are different methods to calculate the weight wkj. These methods are based on two observations:

a) The more frequently the term tk occurs in a document dj, the more relevant it is to the subject of this document.

b) The more often the term tk occurs throughout the collection, the less it can be used to discriminate between documents.

We note:

#(tk, dj): the number of occurrences of the term tk in the text dj;

|Tr|: the number of documents in the training corpus;

#Tr(tk): the number of documents of this set in which the term tk appears at least once.

According to the two previous observations, a term tk is therefore assigned a weight that is larger the more frequently it appears in the document. The vector component is coded as f(#(tk, dj)), where the function f remains to be determined [1]. Two approaches can be used.

The first is to take as weight the number of occurrences of the term in the document:

    wkj = # (tk, dj) (1)

The second approach is simply to assign the binary value 1 if the word appears in the text and 0 otherwise:


wkj = 1 if #(tk, dj) >= 1, and wkj = 0 otherwise (2)
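For instance, if the term tk occurs three times in a document dj, coding (1) gives wkj = 3, whereas coding (2) gives wkj = 1; a term absent from dj receives the weight 0 under both codings.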

6) Coding by term frequency X inverse document frequency: The two functions (1) and (2) above are rarely used because they impoverish the encoded information:

Function (1) does not take into account the distribution of the term across the other texts of the collection, while function (2) does not take into account the frequency of the term within the text (which can often be an important element for the decision) [1].

TF x IDF encoding was introduced in the vector model; it gives much importance to words that appear often within the same text, which corresponds to the intuitive idea that these words are more representative of the document. Its particularity is that it also gives less weight to words that belong to several documents, to reflect the fact that these words have little ability to discriminate between classes [2]. The weight of the term tk in the document dj is calculated as:

wkj = TFIDF(tk, dj) = #(tk, dj) x log(|Tr| / #Tr(tk)) (3)

where:

#(tk, dj): the number of occurrences of the term tk in the document dj;

|Tr|: the number of documents in the training corpus;

#Tr(tk): the number of documents of this set in which the term tk appears at least once.

7) TFC coding: TF x IDF encoding does not take the length of the documents into account. For this purpose, TFC coding is similar to TF x IDF, but it corrects for the length of the texts by a cosine normalization, in order not to favor the long documents [1]:

TFC(tk, dj) = TFIDF(tk, dj) / sqrt( sum over ts in T of TFIDF(ts, dj)^2 ) (4)
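The following Python sketch (illustrative only; the tokenized toy corpus is invented and this is not the authors' implementation) computes the TF x IDF weights of equation (3) and the cosine-normalized TFC weights of equation (4):

    import math

    def tfidf_weights(doc_tokens, corpus_tokens):
        # TF x IDF of equation (3): #(tk, dj) * log(|Tr| / #Tr(tk)).
        n_docs = len(corpus_tokens)
        weights = {}
        for term in set(doc_tokens):
            tf = doc_tokens.count(term)
            df = sum(1 for d in corpus_tokens if term in d)
            weights[term] = tf * math.log(n_docs / df)
        return weights

    def tfc_weights(doc_tokens, corpus_tokens):
        # TFC of equation (4): TF x IDF divided by the Euclidean norm of the document vector.
        w = tfidf_weights(doc_tokens, corpus_tokens)
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        return {t: v / norm for t, v in w.items()}

    corpus = [["ant", "colony", "text"], ["text", "indexing"], ["naive", "bayes", "text"]]
    print(tfc_weights(corpus[0], corpus))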

III. NAIVE BAYES ALGORITHM

In machine learning, different types of classifiers have been developed to achieve a maximum degree of precision and efficiency, each with its advantages and disadvantages, but they share common characteristics [8].

Among the learning algorithms we can cite: Naive Bayes, which is the best known algorithm, the Rocchio method, neural networks, the method of the k nearest neighbors, decision trees and support vector machines [8].

The Naive Bayes classifier is the most commonly used algorithm; it is based on Bayes' theorem for calculating conditional probabilities. In a general context, this theorem provides a way to calculate the conditional probability of a cause knowing the presence of an effect.

When we apply naive Bayes to a text categorization task, we look for the class that maximizes the probability of observing the words of the document.

During the training phase, the classifier calculates, for each category, the probability that a document belongs to it from the proportion of training documents belonging to this category. It also calculates the probability that a given word is present in a text, knowing that this text belongs to this category. Then, when a new document has to be classified, we calculate the probability that it belongs to each class using Bayes' rule and the probabilities calculated in the previous step. The quantity to be estimated is:

p(cj | a1, a2, a3, ..., an)

where cj is a category and ai is an attribute.

Using Bayes' theorem, we obtain:

p(cj | a1, ..., an) = p(a1, ..., an | cj) p(cj) / p(a1, ..., an) (5)

The naive assumption is that the attributes are independent given the class, so that:

p(a1, ..., an | cj) = product over i of p(ai | cj) (6)

In other words, the probability that a word appears in a text is assumed to be independent of the presence of the other words in the text. This is not strictly true: for example, the probability of occurrence of the word "artificial" depends partly on the presence of the word "intelligence". However, this assumption does not prevent such a classifier from providing satisfactory results and, more importantly, it greatly reduces the necessary calculations. Without it, we would have to consider all possible combinations of words in a text, which on the one hand involves a large number of calculations, and on the other hand reduces the quality of the statistical estimation, since the frequency of occurrence of each combination would be much lower than the frequency of occurrence of the words alone [1].

To estimate the probability p(ai | cj), we could directly compute, among the training documents, the proportion of those belonging to the class cj that contain the word ai. In the extreme case where a word is never met in a class, its probability of 0 dominates the others in the above product and would cancel the overall probability. To overcome this problem, a good way is to use the m-estimate, calculated as:

p(ai | cj) = (nk + 1) / (n + |Vocabulary|) (7)

where:

nk is the number of occurrences of the word in the class cj;

n is the total count of words in the training corpus;

|Vocabulary| is the number of keywords.
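The sketch below (a simplified illustration under the bag-of-words independence assumption, using a smoothed estimate in the spirit of equation (7) computed per class; the toy corpus is invented and this is not the authors' code) shows the counts gathered at training time and the classification rule:

    import math
    from collections import Counter, defaultdict

    def train(corpus):
        # corpus: list of (tokens, category). Collect word counts per category and class priors.
        word_counts, class_docs, vocab = defaultdict(Counter), Counter(), set()
        for tokens, cat in corpus:
            class_docs[cat] += 1
            word_counts[cat].update(tokens)
            vocab.update(tokens)
        return word_counts, class_docs, vocab

    def classify(tokens, word_counts, class_docs, vocab):
        total_docs = sum(class_docs.values())
        best_cat, best_logp = None, float("-inf")
        for cat in class_docs:
            n = sum(word_counts[cat].values())             # words counted for this class
            logp = math.log(class_docs[cat] / total_docs)  # prior p(cj)
            for w in tokens:
                nk = word_counts[cat][w]
                logp += math.log((nk + 1) / (n + len(vocab)))  # smoothed p(ai|cj), cf. eq. (7)
            if logp > best_logp:
                best_cat, best_logp = cat, logp
        return best_cat

    corpus = [(["market", "growth"], "Economy"), (["school", "teacher"], "Education")]
    model = train(corpus)
    print(classify(["school", "growth", "teacher"], *model))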


Figure 1. Cosine similarity algorithm

E. Ant colony optimization

To find the category of a text, we adopt the ant colony optimization (ACO) algorithm proposed in [5]. Although the ant colony algorithm was originally designed for the traveling salesman problem, it offers great flexibility. Our choice is motivated by the flexibility of this metaheuristic, which makes its application possible to different problems that are commonly NP-hard. Moreover, the use of a parallel model (colonies of ants) reduces the computing time and improves the quality of the solutions for categorization.

Formalization of the problem: In our context, the problem of classifying a text reduces to a subset selection problem [5], and we can formalize it as a pair (S, f) such that:

S contains all the cosine similarities calculated between the documents of the graph and the text to classify; it is the "similarity matrix" mat_sim.

f is the score function defined in [5], computed over the documents of the graph and the document to classify.

The sought set S' is the subset of nodes of the graph which are the most similar to the document to classify; the result is thus a consistent subset S' of nodes that maximizes the score function.

F. Description of the algorithm

At each cycle of the algorithm, each ant constructs a subset. Starting from an empty subset Sk, at each iteration the ant adds a pair of nodes from the similarity matrix, chosen among all the pairs not yet selected. The pair of nodes to add to Sk is chosen with a probability which depends on the pheromone trails and on a heuristic: the one encourages the pairs that have the greatest similarity, and the other encourages the pairs that most increase the score function. Once each ant has built its subset, a local search procedure starts to improve the quality of the best subset found during this cycle. Pheromone trails are subsequently updated based on the improved subset. Ants stop their construction when every pair of candidate nodes would decrease the score of the subset, or when the three latest additions have failed to increase the score.

Construction of a solution by an ant: The following pseudocode describes the procedure followed by the ants to construct a subset. The first object is selected randomly; the following ones are selected among the candidates.

Figure 2. Construction of a solution by an ant

V. RESULTS AND DISCUSSION

To evaluate the performance of our proposal, we carried out experiments using two corpora, one for the training and the other for the test. We also use the Naive Bayes classifier as a baseline.

TABLE I. CLASSES OF THE CORPUS

Class     | # documents (training) | # documents (test)
Economy   | 29                     | 18
Education | 10                     | 9
Religion  | 19                     | 17
Sociology | 30                     | 14
Sport     | 4                      | 2

The results of the classification stage are reported below for the ant colony algorithm and for the naive Bayes algorithm.

TABLE II. RESULTS OF THE TESTS WITH THE ANT COLONY ALGORITHM

Class  | Eco. | Educ. | Relig. | Socio. | Sport | Total
Eco.   | 17   | 0     | 0      | 1      | 0     | 18
Educ.  | 0    | 7     | 1      | 1      | 0     | 9
Relig. | 0    | 1     | 16     | 0      | 0     | 17
Socio. | 0    | 3     | 8      | 3      | 0     | 14
Sport  | 0    | 0     | 0      | 0      | 2     | 2

Algorithm Cosine_Similarity (Figure 1)
Input: doc_Graph, doc_class   // graph of documents, document to classify
Output: Mat_Sim               // similarity matrix based on the relevant attributes
begin
  Mat_Sim := 0
  For each node of doc_Graph
    // extract the set of attributes of the node of the graph
    Sim := Calcul_Sim(node, doc_class)
    Mat_Sim := Mat_Sim + Sim(node, doc_class)
  Return Mat_Sim
end.
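As one possible concrete reading of this pseudocode (a sketch assuming that each node of the document graph is already represented by a dictionary of term weights, for example the TFC weights of Section II; it is not the authors' implementation), the entries of Mat_Sim can be computed as follows:

    import math

    def cosine_similarity(weights_a, weights_b):
        # weights_*: dict mapping term -> weight for one document.
        common = set(weights_a) & set(weights_b)
        dot = sum(weights_a[t] * weights_b[t] for t in common)
        norm_a = math.sqrt(sum(v * v for v in weights_a.values()))
        norm_b = math.sqrt(sum(v * v for v in weights_b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    def similarity_matrix(doc_graph, doc_to_classify):
        # One similarity value per node of the document graph, as in Figure 1.
        return {node: cosine_similarity(vec, doc_to_classify) for node, vec in doc_graph.items()}

    graph = {"d1": {"ant": 0.7, "colony": 0.7}, "d2": {"text": 1.0}}
    print(similarity_matrix(graph, {"ant": 0.5, "text": 0.5}))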

Procedure Construction-Subset (Figure 2)
Input: a subset-selection problem (S, f), an associated heuristic function S x P(S) -> IR+,
       a strategy and a pheromone factor
Output: a subset of objects

Initialize the pheromone trails to tau_max
begin
  Repeat
    For each ant k in 1 .. nbAnts, construct a solution Sk as follows:
      1. Randomly select a first node oi
      2. Sk := {oi}
      3. Candidates := the nodes oj of S such that Sk union {oj} remains consistent
      4. While Candidates is not empty do
      5.   Choose a node oi from Candidates with a probability that depends on the
           pheromone trail and on the heuristic
           // where T is the set of attributes (terms),
           // pt(a) is the weight of term t in the document node a of the graph,
           // pt(b) is the weight of term t in the document b to be classified
      6.   Sk := Sk union {oi}
      7.   Remove oi from Candidates
      8.   Remove from Candidates every node oj such that Sk union {oj} is no longer consistent
      9. End while
    End for
    Update the pheromone trails according to {S1, ..., SnbAnts}
    If a pheromone trail is lower than tau_min then set it to tau_min
    Else if a pheromone trail is greater than tau_max then set it to tau_max
  Until the maximum number of cycles is reached or a solution is found
end.
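The following Python sketch gives one possible reading of an ant's construction step (the pheromone and heuristic weighting, the score function and the stopping rule are simplified assumptions on our part; the paper relies on the formalization of [5], which is not fully reproduced here):

    import random

    def construct_subset(candidates, similarity, pheromone, score, alpha=1.0, beta=2.0):
        # candidates: node identifiers; similarity[o]: cosine similarity of node o to the
        # document to classify; pheromone[o]: current trail on o; score(subset): function f.
        subset = [random.choice(candidates)]              # step 1: first node chosen at random
        remaining = [o for o in candidates if o not in subset]
        failures = 0
        while remaining and failures < 3:                 # stop after three non-improving additions
            # selection probability proportional to pheromone^alpha * similarity^beta
            weights = [max(pheromone[o] ** alpha * similarity[o] ** beta, 1e-9) for o in remaining]
            o = random.choices(remaining, weights=weights, k=1)[0]
            if score(subset + [o]) > score(subset):
                subset.append(o)
                failures = 0
            else:
                failures += 1
            remaining.remove(o)
        return subset

    # Toy usage: the score here is simply the summed similarity of the selected nodes.
    sim = {"d1": 0.9, "d2": 0.1, "d3": 0.8}
    tau = {"d1": 1.0, "d2": 1.0, "d3": 1.0}
    print(construct_subset(list(sim), sim, tau, score=lambda s: sum(sim[o] for o in s)))

A complete implementation would run nbAnts such constructions per cycle, apply the local search step, and keep the pheromone trails within the [tau_min, tau_max] bounds, as indicated in the pseudocode above.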


TABLE III. RESULTS OF THE TESTS WITH THE NAIVE BAYES ALGORITHM

Class  | Eco. | Educ. | Relig. | Socio. | Sport | Total
Eco.   | 14   | 2     | 0      | 2      | 0     | 18
Educ.  | 0    | 7     | 1      | 1      | 0     | 9
Relig. | 0    | 2     | 14     | 0      | 0     | 17
Socio. | 0    | 4     | 8      | 2      | 0     | 14
Sport  | 0    | 0     | 0      | 0      | 2     | 2

Precision and recall are the measurements most used to evaluate information retrieval systems; they are defined as follows.

TABLE IV. CONTINGENCY TABLE FOR THE EVALUATION OF THE CLASSIFIERS

                                            | Document belonging to the category | Document not belonging to the category
Document assigned to the class by classifier| a                                  | b
Document rejected from class by classifier  | c                                  | d

According to this table, we define:

Precision = a/(a+b), the number of correct assignments over the total number of assignments.

Recall = a/(a+c), the number of correct assignments over the number of assignments that should have been made.

When evaluating the performance of a classifier, precision or recall is not considered separately. So the F1 measure, which is used extensively, is defined by the formula:

F1 = 2*r*p / (p + r), where r is the recall and p is the precision. It is a function which is maximized when the recall and the precision are close.

Table V and Table VI present the performances of the ant colony and of naive Bayes in terms of recall, precision and F1.

TABLE V. RECALL, PRECISION AND F1 FOR EACH CLASS (ANT COLONY)

Class     | Recall (%) | Precision (%) | F1 (%)
Economy   | 94.44      | 100           | 97.14
Education | 77.77      | 63.63         | 69.99
Religion  | 94.11      | 64            | 76.18
Sociology | 21.42      | 60            | 31.56
Sport     | 100        | 100           | 100

TABLE VI. RECALL, PRECISION AND F1 FOR EACH CLASS (NAIVE BAYES)

Class     | Recall (%) | Precision (%) | F1 (%)
Economy   | 77.77      | 100           | 87.49
Education | 77.77      | 46.66         | 58.32
Religion  | 82.35      | 60.86         | 69.99
Sociology | 14.28      | 40            | 21.04
Sport     | 100        | 100           | 100
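As a worked check against Table II, consider the Economy class for the ant colony classifier: a = 17 documents are correctly assigned to Economy, b = 0 other documents are wrongly assigned to it, and c = 1 Economy document is assigned elsewhere. Hence Precision = 17/17 = 100%, Recall = 17/18 = 94.44% and F1 = 2 x 100 x 94.44 / (100 + 94.44) ≈ 97.14%, in agreement with the first row of Table V.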


Figure 3. Classification rates for each category and for both classifiers (F1 of the ant colony algorithm vs. F1 of naive Bayes).

While the results for several classes seem to be acceptable, the results for the Sociology class are dramatic; this is due to the small size of the learning corpus.

The histogram shows that the suggested ant colony algorithm outperforms the Naive Bayes algorithm in terms of recall and precision. This is not a surprise, since the graphical representation of the problem handles the relationships between similar documents better than Naive Bayes does.

REFERENCES

[1] Jalame R. Machine learning and multilingual text classification. Université Lumière Lyon 2, June 2003.

[2] Rhel S. Automatic text categorization and co-occurrence of words from unlabeled documents. Submitted to the Graduate Faculty of the University Laval, Quebec, January 2005.

[3] Hacid H. and Zighed D. An effective method for locally neighborhood graphs updating. In DEXA 2005, pp. 930-939.

[4] Valette M. Application of classification algorithms for automatic detection of racist content on the Internet. June 2003.

[5] Solnon C. Contributions to practical combinatorial problem solving: graphs and ants. Thesis for the Habilitation Research, University Claude Bernard Lyon 1, December 2005.

[6] Schmid H. Probabilistic part-of-speech tagging using decision trees. In International Conference on New Methods in Language Processing, Manchester, UK, 1994.

[7] Porter M. An algorithm for suffix stripping. Program 14(3), pp. 130-137, 1980.

[8] Sebastiani F. Automated text categorization: tools, techniques and applications. Rennes, France, April 3, 2002.
