
Transcript of: IEEE Comput. Soc SCCC 2001. 21st International Conference of the Chilean Computer Science Society -...

Genetic Algorithm Restricted by Tabu Lists in Data Mining

Fábio M. Lopes, Aurora T. R. Pozo Federal University of Paraná - Computer Science Department

[email protected], [email protected]

Abstract.

The present work shows an implementation of a

genetic algorithm (GA) integrated with tabu lists to generate a classifier tool for a Data Mining task. The choice of the GA paradigm is partially justified by its capacity to deal with noisy, invalid, or inexact data, and by its easy adaptation to different data domains. The GA uses tabu lists to restrict the selection process. This restriction allows the creation of a set of potential rules for the classifier tool. This strategy was proposed in [1] for multimodal and multiobjective function optimization and represents an alternative to sharing methods. In this work, the behavior of this approach in a Data Mining task was analyzed. Experiments were performed on five databases and the results were compared with 34 other classification algorithms. Noise was then added to the databases and a new set of experiments was performed. The results show that the algorithm proposed herein is efficient and robust, and the strategy used to maintain diversity was considered valid, since the algorithm was able to keep its classification accuracy even for smaller populations.

Keywords: Genetic Algorithm, Tabu Search, Data Mining.

1. Introduction

Research in different areas indicates that a new generation of intelligent tools for automated data mining is needed to deal with large databases. The approach used in this paper is to create a classifier induction tool based on genetic algorithms. Reasons to use genetic algorithms include their scalability, their ability to handle noisy data, the configuration of parameters according to the task, and the possibility of parallelization [2].

The GA algorithm uses Tabu List to restrict the selection process. This restriction allows the creation of a set of potential rules for the classifier tool. This strategy was proposed in [1], for multimodal and multiobjective function optimization and is an alternative to sharing methods. In this work we analyze the behavior of this approach in Data Mining tasks. Experiments were performed on five databases and compared with 34 other classifying algorithms.

The following section gives an overview of genetic

algorithms in classification tasks and related work. Section 3 gives an overview of Tabu Search. The implemented GA is presented in Section 4. Section 5 compares results with other classification algorithms and, finally, Section 6 draws some conclusions.

2. Genetic Algorithms and Data Mining

The present work deals with the classification task performed in Data Mining. Given a set of classified examples, the goal of the classification task is to find a logical description that correctly classifies new cases. In that sense, the logical description found can be considered a classifier. To better visualize the classification task, Table 1 and Figure 1 show an example of the input received by a classification system and the group of rules generated from it [2].

Table 1 – Input received by a classification system

Sex   Country   Age   Buy (goal)
M     France    25    Yes
M     England   21    Yes
F     France    23    Yes
F     England   34    Yes
F     France    30    No
M     Germany   21    No
M     Germany   20    No
F     Germany   18    No
F     France    34    No
M     France    55    No

IF (Country = “Germany”) THEN (Buy = “no”) IF (Country = “England”) THEN (Buy = “yes”) IF (Country = “France” and Age ≤ 25) THEN (Buy = “yes”) IF (Country = “France” and Age > 25 ) THEN (Buy = “no”)

Figure 1 – Rules discovered from Table 1

A traditional approach is the induction of rules with Genetic Algorithms following the Michigan approach [3]. In this approach, each individual of the population corresponds to a single rule, that is, it classifies only a subset of the data. That makes it necessary to develop some strategy to extract a group of non-redundant rules from the GA population [3] [4]. To this end, different research directions have been pursued.

The first group of studies [5] [6] [7] [8] [9] [10] focuses on maintaining diversity in the population and includes (1) sharing methods, which use sharing functions to avoid convergence to similar individuals; (2) crowding methods, which constrain the replacement of individuals by new ones; and (3) crossover restrictions. However, these methods are difficult to apply to practical problems: (1) unsuitable sharing functions often prevent individuals from exploiting optimal regions; (2) crowding often fails to avoid early convergence; and (3) crossover restrictions can seem too artificial.

The second group aims at improving the performance of the GA's search capabilities through hybridization [11] [12] [13]. In this approach, the GA is used alongside one of the following paradigms: simulated annealing, tabu search, artificial neural networks, or expert systems. Most of these studies have focused on global search via GAs, while the local search is performed by one of the aforementioned techniques.

The last group of studies focuses either on function optimization problems or on finding Pareto-optimal solutions [14]. These studies include: (1) methods that divide individuals into subgroups, where each subgroup corresponds to one objective function; (2) methods that rank Pareto-optimal individuals so that they are not dominated by other individuals; (3) combinations of tournament and sharing methods; and (4) methods that divide the Pareto solutions into ranges.

These previous studies share the common goal of improving GAs with methods that maintain diversity in the population. Recently, [1] proposed a new method that uses multiple tabu lists acting as restrictions on the selection process. Their work presented tests performed on multimodal and multiobjective function optimization.

This paper presents an implementation of an algorithm with the same philosophy as [1], but with the goal of creating classifier systems.

3. Tabu Search

This section gives a brief overview of Tabu search, to provide a better understanding of the strategy adopted in this work. Tabu search is a heuristic procedure presented by [15] to solve combinatorial optimization problems. The basic idea is to prevent the search from stopping at a local minimum.

A general outline of the Tabu search process is illustrated in Figure 2. According to [16], a list of forbidden movements, called the Tabu list (T), is maintained during the whole search process. The process begins by selecting a random solution (x). Then, a local search takes place, examining all neighboring solutions N(x), and the best among them is selected (x'). This solution (x') is moved to the list of restrictions (the Tabu list), and the algorithm continues the search for another best solution starting from the last solution found (x'). The local search is repeated and the best neighboring solution is selected as the candidate for the next movement.

However, to avoid reverse movements, the list of restrictions is checked before every movement. If the movement is not in the Tabu list, or if it is in the Tabu list but the Aspiration Criterion is satisfied, the movement is accepted. Otherwise, the next best solution is tested. The whole process is repeated until the stop criterion is satisfied.

The Aspiration Criterion represents one extra restriction to the algorithm that will not be considered in the context of this work.

[Flowchart: from an initial solution x, find the best x' ∈ N(x); if x' ∈ T and the Aspiration Criterion is not satisfied, remove x' and try the next best neighbor; otherwise set x = x' and add x to T; when the stop criterion is met, output x, the best solution found so far.]

Figure 2 – Tabu Search Algorithm
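The loop of Figure 2 can be sketched as follows. This is a minimal illustration on a toy bit-string problem, not the authors' implementation: the objective and neighborhood are our own assumptions, and the Aspiration Criterion is omitted, as in this work.

```python
import random

def tabu_search(objective, x0, neighbors, tabu_size=10, iterations=100):
    """Minimal tabu search: keep a bounded list T of visited solutions
    and always move to the best neighbor that is not in T."""
    x = x0
    best = x0
    tabu = [x0]                        # list of forbidden solutions (T)
    for _ in range(iterations):
        candidates = [n for n in neighbors(x) if n not in tabu]
        if not candidates:             # every neighbor is tabu: stop
            break
        x = max(candidates, key=objective)   # best x' in N(x) not in T
        tabu.append(x)
        if len(tabu) > tabu_size:      # bounded list: forget the oldest move
            tabu.pop(0)
        if objective(x) > objective(best):
            best = x                   # best solution found so far
    return best

# Toy problem: maximize the number of 1-bits in a 16-bit string;
# the neighborhood of a solution is every single-bit flip.
def flip_neighbors(bits):
    return [bits[:i] + (1 - bits[i],) + bits[i + 1:] for i in range(len(bits))]

random.seed(0)
start = tuple(random.randint(0, 1) for _ in range(16))
result = tabu_search(sum, start, flip_neighbors)
print(sum(result))  # prints 16: the global optimum is reached
```

Note that, unlike plain hill climbing, the search keeps moving after reaching the optimum, but the tabu list prevents it from immediately undoing its last moves.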

4. Overview of the Implemented Algorithm

Based on the work of [1], we implemented a Genetic Algorithm restricted by Tabu lists to create a classifier system. The basic structure of the GA is given in Figure 3. It uses two lists of restrictions: a Long List of size m and a Short List of size n. The values of m and n are adjusted according to the problem. The lists (1) store the best individuals of the previous generations; (2) maintain the elitism and the diversity of the population; and (3) avoid convergence to a local minimum.

The idea behind the algorithm is that, at the end of each generation, the best individual is stored in both lists. When the next generation begins and new individuals are selected for reproduction, the lists of restrictions prevent two similar individuals from being selected at the same time. The similarity among individuals is measured by a distance function, either in the phenotypic or the genotypic space.

The Short List, with size n, stores only the individuals of the most recent iterations. When the list is completely filled, a new individual replaces the oldest one. Individuals belonging to this list can have the same phenotype. The Long List stores individuals from all previous generations, and its individuals cannot have identical or similar phenotypes.

If an individual with a similar phenotype needs to be added to the Long List, it is added only if its fitness value is higher than that of the stored individual. The displaced individual is then removed from the list, undergoes mutation, and is put back in the population to participate in the next generations.

In this way, solutions are gradually stored in the Long List, and at the end of the process the individuals in the Long List form the solution. The standard GA parameters considered in the implementation are shown below:
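The bookkeeping of the two lists can be sketched as follows. This is an illustrative reading of the scheme, not the authors' code: the genotypic Hamming distance and the similarity threshold `min_dist` are assumptions.

```python
from collections import deque

def hamming(a, b):
    """Genotypic distance between two equal-length rule encodings."""
    return sum(x != y for x, y in zip(a, b))

class TabuLists:
    def __init__(self, short_size, long_size, min_dist=2):
        self.short = deque(maxlen=short_size)  # recent bests; duplicates allowed
        self.long = []                         # (individual, fitness); no near-duplicates
        self.long_size = long_size
        self.min_dist = min_dist               # similarity threshold (assumed)

    def add_best(self, individual, fitness):
        """Store the generation's best individual in both lists.
        Returns a displaced individual that should be mutated and
        returned to the population, or None."""
        self.short.append(individual)          # the deque drops the oldest entry itself
        for i, (stored, f) in enumerate(self.long):
            if hamming(individual, stored) < self.min_dist:
                if fitness > f:                # newcomer is fitter: replace
                    self.long[i] = (individual, fitness)
                    return stored
                return individual              # newcomer rejected
        if len(self.long) < self.long_size:
            self.long.append((individual, fitness))
        return None

    def is_tabu(self, individual):
        """Selection restriction: forbid candidates similar to list members."""
        members = list(self.short) + [s for s, _ in self.long]
        return any(hamming(individual, m) < self.min_dist for m in members)

lists = TabuLists(short_size=2, long_size=5)
lists.add_best((1, 0, 1, 1), 0.90)
displaced = lists.add_best((1, 0, 1, 0), 0.95)  # similar but fitter: replaces
print(displaced)  # prints (1, 0, 1, 1)
```

During selection, `is_tabu` would be consulted so that no two list-similar individuals are chosen for reproduction at the same time.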

• Fitness function – Laplace function [17], described in Equation 1:

Fitness = (TP + 1) / (TP + FP + K)   (Equation 1)

where K is the number of classes in the domain, TP is the number of true positives, and FP is the number of false positives.

• Selection method – Tournament selection [18].
• Crossover – Uniform crossover.
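These three components can be illustrated as follows; the helper functions and the example counts are our own, shown only to make the parameters concrete.

```python
import random

def laplace_fitness(tp, fp, k):
    """Laplace accuracy (Equation 1): (TP + 1) / (TP + FP + K),
    where K is the number of classes in the domain."""
    return (tp + 1) / (tp + fp + k)

def tournament_select(population, fitness, size=2, rng=random):
    """Draw `size` individuals at random; the fittest one wins."""
    return max(rng.sample(population, size), key=fitness)

def uniform_crossover(a, b, rng=random):
    """Each gene is copied from either parent with equal probability."""
    return tuple(x if rng.random() < 0.5 else y for x, y in zip(a, b))

# A rule covering 45 true positives and 3 false positives
# in a two-class domain (K = 2):
print(round(laplace_fitness(45, 3, 2), 2))  # prints 0.92
```

The +1 and +K terms smooth the estimate, so rules covering few examples are not rewarded with a perfect score.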

[Diagram: candidates for generation (t+1) produced by the GA from population (t) pass through the tabu lists; candidates similar to list members undergo mutation before re-entering the population, and the lists are renewed each generation.]

Figure 3 – Tabu-GA

5. Experiments and Results

Experiments were conducted to verify the algorithm's behavior, analyzing three main aspects: (1) the effect of the Tabu lists on the algorithm, (2) the comparison with other algorithms, and (3) the impact of a reduction in database size on the quality of the solutions.

5.1. Verification of the Tabu Lists' Effect

In this experiment the “vot” database was used (described in section 5.2) and three sets of rules were considered: Short List, Long List and GA populations. Figures 4, 5 and 6 show the rules on each set, respectively, obtained in the 40th generation for class 2.

In Figure 4 we observe two identical rules in the Short List, which shows that these two individuals were the best rules obtained in the last two generations. However, analyzing the rules of the whole population (Figure 6), we can see that only one occurrence of that rule is present. This shows that the Short List is acting as a local restriction to prevent the replication of rules.

Moreover, Figure 5 shows that the rules of the Short List are not in the Long List. This is explained by the Long List addition rule: an individual is not added if the Long List already contains a similar individual with a greater fitness value (according to the distance measure used).

Fitness Rule Set

0.989 IF ( v3 = 2 ) and ( v4 = 1 ) and ( v11 = 2 ) and ( v14 = 1 ) and ( v15 = 2 ) then class = 2

0.989 IF ( v3 = 2 ) and ( v4 = 1 ) and ( v11 = 2 ) and ( v14 = 1 ) and ( v15 = 2 ) then class = 2

Figure 4 – Short List - Rules Set.

Fitness

Rule Set

0.983 IF ( v3 = 2 ) and ( v4 = 1 ) and ( v10 = 1 ) and ( v12 = 1 ) then class = 2

0.980 IF ( v1 = 2 ) and ( v3 = 2 ) and ( v4 = 1 ) and ( v14 = 1 ) then class = 2

0.989 IF ( v3 = 2 ) and ( v4 = 1 ) and ( v9 = 2 ) and ( v11 = 2 ) then class = 2

0.990 IF ( v3 = 2 ) and ( v4 = 1 ) and ( v11 = 2 ) then class = 2

0.989 IF ( v1 = 2 ) and ( v3 = 2 ) and ( v4 = 1 ) and ( v11 = 2 ) then class = 2

0.989 IF ( v1 = 2 ) and ( v3 = 2 ) and ( v4 = 1 ) and ( v5 = 1 ) and ( v13 = 1 ) then class = 2

0.987 IF (v3 = 2) and (v4 = 1) and (v7 = 2) and (v9 = 2) and (v12 = 1) and (v13 = 1) and (v14 = 1) then class = 2

0.989 IF (v3 = 2) and (v4 = 1) and (v6 = 1) and (v11 = 2) and (v14 = 1) then class = 2

0.987 IF (v3 = 2) and (v4 = 1) and (v6 = 1) and (v7 = 2) and (v12 = 1) and (v13 = 1) and (v14 = 1) then class = 2

0.987 IF (v1 = 2) and (v3 = 2) and (v4 = 1) and (v12 = 1) and (v13 = 1) and (v14 = 1) and (v15 = 2) then class = 2

Figure 5 - Long List – Rules Set.

Fitness

Rule Set

0.920 IF ( v3 = 2) and (v6 = 1) and (v8 = 2 ) and (v10 = 1) and ( v16 = 2 ) then class = 2

0.947 IF ( v1 = 2 ) and (v4 = 1) and (v8 = 2 ) and (v10 = 2) and ( v11 = 2 ) and ( v13 = 1 ) and ( v15 = 2 ) then class = 2

0.978 IF ( v1 = 2 ) and (v3 = 2) and (v4 = 1 ) and (v5 = 1) and ( v8 = 2 ) and ( v14 = 1 ) then class = 2

............................

............................

............................

0.960 IF ( v4 = 1 ) and ( v9 = 2 ) and ( v13 = 1 ) and ( v14 = 1 ) and ( v15 = 2 ) then class = 2

Figure 6 – Population – Partial rule set.

5.2. Comparison with Other Algorithms

To allow comparisons, the methodology of [19] was used. In that paper, 34 classification algorithms were compared on different datasets in terms of classification accuracy and several other characteristics. A brief description of the 5 datasets used in this work is shown in Table 2. For each dataset, a variation was created by the addition of noise, that is, extra attributes with random values. Datasets with noise are marked with a '+' in their names.

For each dataset, one of the following methods was used to estimate the error rate:

• For larger datasets (more than 1000 examples), the test set was used to estimate the error rate.
• For smaller datasets, the error rate was estimated using 10-fold cross-validation: the dataset is randomly divided into 10 disjoint subsets, each containing approximately the same number of records and roughly the same proportion of records from each class. For each subset, a classifier is constructed using the records that are not in it; the classifier is then tested on the withheld subset to obtain a cross-validation estimate of its error rate. The final error rate is the average of the 10 cross-validation estimates.
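The cross-validation procedure can be sketched as follows; the toy majority-class "classifier" is our own, used only to exercise the splitting and averaging logic.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Split record indices into k disjoint folds with roughly
    the same proportion of records from each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            folds[j % k].append(i)     # deal each class out round-robin
    return folds

def cv_error(train, predict, records, labels, k=10):
    """Average error rate over k train/test splits."""
    errors = []
    for fold in stratified_folds(labels, k):
        held = set(fold)
        model = train([records[i] for i in range(len(records)) if i not in held],
                      [labels[i] for i in range(len(labels)) if i not in held])
        wrong = sum(predict(model, records[i]) != labels[i] for i in fold)
        errors.append(wrong / len(fold))
    return sum(errors) / k

# Toy classifier: always predict the majority class of the training set.
train = lambda xs, ys: max(set(ys), key=ys.count)
predict = lambda model, x: model
records = list(range(100))
labels = [0] * 70 + [1] * 30
print(cv_error(train, predict, records, labels))  # prints 0.3
```

With a 70/30 class split, the majority-class predictor is wrong on exactly the minority records of every fold, so the estimate lands on 0.3 as expected.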

The results of this experiment are shown in Table 3. The implemented algorithm was tested on all datasets, obtaining a mean error rate of 0.2591. Compared with the mean rates of the 34 algorithms presented in [19] and [20], this value represents an average difference of 2.5% and 1.6%, respectively. According to [19], this result is not significantly different (at the 10% level) from the best of the other 34 classifiers, which has a mean error rate of 0.195. The classifier did not obtain the worst result on any dataset.

Table 2 – Analyzed data

Name   Size   Classes   Attributes*      Description
led    2000   10        7                Artificial domain representing seven light-emitting diodes
led+   2000   10        24
bld    345    2         6                Predict whether or not a male patient has a liver disorder, based on blood tests and alcohol consumption
bld+   345    2         15
pid    532    2         7                Predict whether a female patient has diabetes
pid+   532    2         15
smo    1855   3         3 (N) + 5 (C)    Predict attitude toward restrictions on smoking in the workplace
smo+   1855   3         10 (N) + 5 (C)
vot    435    2         16               Classify a Congressman as Democrat or Republican
vot+   435    2         30

* (N) Numerical, (C) Categorical

Table 3 – Comparison between the 34 algorithms and the implemented algorithm (error rates)

       Lim, Loh & Shih (1999)      Hasse (2000)   Implemented Algorithm
Data   Min      Max      Median    Median         Median
led    0.2680   0.8160   0.2750    0.2670         0.2850
led+   0.2650   0.8130   0.2900    0.2760         0.3165
bld    0.2790   0.4320   0.3220    0.2750         0.3775
bld+   0.2860   0.4410   0.3440    0.3130         0.3475
pid    0.2210   0.3100   0.2370    0.2370         0.2593
pid+   0.2170   0.3180   0.2460    0.2620         0.2778
smo    0.3040   0.4500   0.3050    0.3040         0.3008
smo+   0.3050   0.4500   0.3080    0.3090         0.3068
vot    0.0364   0.0617   0.0480    0.0435         0.0582
vot+   0.0412   0.0662   0.0480    0.0460         0.0611
Mean                     0.2423    0.2333         0.2591

5.3. Experiment with Different Database Sizes

One of the most important issues in Data Mining is the amount of data required to achieve good classification precision. The tests presented below consider different database sizes; the results are summarized in Figure 7. For the algorithm without Tabu lists, the error rate increases as the database size is reduced. On the other hand, when the Tabu lists are used, the algorithm maintains its accuracy level without significant changes.

Figure 8 shows the behavior of the algorithm with Tabu lists for varying database sizes (500, 1000, 1500 and 2000 records). It can be observed that this variation produced only a small change of 4.6% in the error rate. This indicates that an algorithm using Tabu lists requires less data to achieve an acceptable accuracy level and, consequently, has a lower computational cost.

These tests were not performed in the works of [19] and [20], which prevents a direct comparison.

[Chart: error rate (%) for each dataset (led, led+, bld, bld+, pid, pid+, smo, smo+, vot, vot+), with series min, median, max, and GA-TL; y-axis from 0.00 to 0.90; '+' denotes noisy data.]

Figure 7 – Error Rate x Database Size

[Chart omitted.]

Figure 8 – Database size effect.

6. Conclusions

Tools that analyze large amounts of data are becoming more important every day, as the amount of information generated in organizations increases. In the present work, a new Genetic Algorithm restricted by Tabu lists in the context of Data Mining was described, and different experiments were performed with this technique.

The algorithm was compared with 34 other classification algorithms, and none of the differences in classification rates was significant (at the 10% level). Moreover, the use of Tabu search was shown to be an efficient strategy for maintaining population diversity, allowing it to be used effectively in Data Mining tasks.

In the experiments performed, we observed that the algorithm is able to work with smaller datasets, which reduces processing time. Another important result was the robust behavior of the algorithm when handling noise-contaminated data.

The Genetic Algorithm restricted by Tabu lists is a new approach that improves GAs, making it possible to solve complex problems. Future work includes (1) applying the algorithm to a larger number of domains, (2) a sensitivity analysis of the algorithm parameters, such as the sizes of the Short and Long Lists, and (3) the use of other distance measures.

References

[1] Kurahashi, S.; Terano, T. A genetic algorithm with tabu search for multimodal and multiobjective function optimization. In: Genetic and Evolutionary Computation Conference, 2000. Proceedings. 2000. p. 291-298.
[2] Freitas, A.; Lavington, S. H. Mining very large databases with parallel processing. Boston: Kluwer Academic, 1998. 208 p.
[3] Holland, J. H. Escaping brittleness: the possibilities of general-purpose learning algorithms applied to parallel rule-based systems. In: Michalski, R.; Carbonell, J.; Mitchell, T. (Eds.), Machine Learning: An AI Approach. Los Altos: Morgan Kaufmann, 1986. v. 2, p. 593-623.
[4] Wilson, S. Classifier systems and the animat problem. Machine Learning, 2. Kluwer Academic, 1987. p. 199-228.
[5] Eshelman, L.; Schaffer, J. Preventing premature convergence in genetic algorithms by preventing incest. In: Proceedings of the Fourth ICGA. 1991. p. 115-122.
[6] DeJong, K.; Spears, W. Using genetic algorithms to solve NP-complete problems. In: Proceedings of the Third ICGA. 1989.
[7] Goldberg, D. E. Genetic algorithms in search, optimization and machine learning. Addison-Wesley, 1989. 413 p.
[8] Goldberg, D. E. A note on Boltzmann tournament selection for genetic algorithms and population-oriented simulated annealing. Complex Systems, 4. 1990. p. 445-460.
[9] Tsutsui, S.; Fujimoto, Y. Forking genetic algorithm with blocking and shrinking modes. In: Proceedings of the Fifth ICGA. 1993. p. 206-213.
[10] Tsutsui, S.; Fujimoto, Y. Extended forking genetic algorithm for order representation. In: Proceedings of the First IEEE Conference on Evolutionary Computation. 1994. p. 170-175.
[11] Costa, A. An evolutionary tabu search algorithm and the NHL scheduling problem. Information Systems and Operational Research, 33 (3), 1995. p. 161-178.
[12] Glover, F. Tabu search for nonlinear and parametric optimization (with links to genetic algorithms). Discrete Applied Mathematics, 49, 1994. p. 231-255.
[13] Glover, F.; Kelly, J.; Laguna, M. Genetic algorithms and tabu search: hybrids for optimization. Computers and Operations Research, 22 (1), 1995. p. 111-134.
[14] Fonseca, C.; Fleming, P. Genetic algorithms for multiobjective optimization: formulation, discussion and generalization. In: Proceedings of the Fifth International Conference on Genetic Algorithms. 1993. p. 416-423.
[15] Glover, F. Future paths for integer programming and links to artificial intelligence. Computers and Operations Research, v. 13, p. 533-549, 1986.
[16] Horne, S.; Macbeth, C. A comparison of global optimisation methods for near-offset VSP inversion. Computers and Geosciences, v. 24, n. 8, p. 563-572, 1998.
[17] Niblett, T. Constructing decision trees in noisy domains. In: Proceedings of the Second European Working Session on Learning, 1987. Wilmslow, 1987. p. 67-78.
[18] Mitchell, M. An introduction to genetic algorithms. Cambridge: MIT Press, 1997. 207 p.
[19] Lim, T. S.; Loh, W. Y.; Shih, Y. S. A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning Journal, Boston, 1999.
[20] Hasse, M.; Pozo, A. R. Using phenotypic sharing in a classifier tool. In: Genetic and Evolutionary Computation Conference, Proceedings. 2000. p. 392.