Genetic Programming in Classifying Large-Scale Data: An Ensemble Method

Yifeng Zhang

Siddhartha Bhattacharyya1

Information and Decision Sciences College of Business Administration

University of Illinois at Chicago 601 S. Morgan Street (MC 294)

Chicago, IL 60607

[email protected], [email protected]

Abstract

This study demonstrates the potential of genetic programming (GP) as a base classifier algorithm in building ensembles in the context of large-scale data classification. An ensemble built upon base classifiers trained with GP was found to significantly outperform its counterparts built upon base classifiers trained with decision trees and logistic regression. The superiority of the GP ensemble is attributed to the higher diversity, both in the functional form and in the variables defining the models, among its base classifiers.

Keywords: Genetic Programming, Ensemble, Classification, Large Scale Data

Please do not quote without permission.

1 Author for correspondence.


1. INTRODUCTION

The classification problem of assigning observations into one of several groups

plays a key role in decision-making. The binary classification problem, where the

data are restricted to one of two groups, has wide applicability in problems

ranging from credit scoring, default prediction and direct marketing to

applications in biology and medical domains. It has been studied extensively by

statisticians, and in more recent years a number of machine learning and data

mining approaches have been proposed. Amongst the latter group of techniques

are those that can broadly be categorized as “soft computing” methods, noted to

trade off optimality and precision for advantages of representational power and

applicability under wider conditions. The form of the solution varies with the

technique employed, and is expressed in some form of classification rule or

function based on the multivariate data defining each observation.

Techniques from the statistics realm beginning with Fisher’s seminal work

include linear, quadratic and logistic discriminant models, and are amongst the

most commonly used [1]. Logistic regression remains, even today, amongst the

most widely used of data mining methods in industry. The major drawback with

these methods arises from the fact that real-world data often do not satisfy their

underlying assumptions. Nonparametric methods are less restrictive, and popular

amongst them are the k-nearest neighbor and various mathematical programming

methods [2] [3]. Bradley et al. provide a comprehensive review of various

mathematical programming approaches for data mining [4]. Various inductive

learning approaches have been developed for classification, including decision


trees, neural networks, and support vector machines. More recently, evolutionary

computation techniques like genetic algorithms have been shown to be attractive for

classification and data mining.

Genetic algorithms (GAs), modeled loosely on principles of natural evolution,

provide a powerful general-purpose search approach, and have been noted to offer

advantages for data mining. These advantages stem from the representational

flexibility allowed on the form that a classification function may take, and from

the largely open formulation of the search objective (fitness function).

Classification functions studied range from a set of data-attribute weights as in

traditional regression models [5] [6], condition-action type rules with

conjunction/disjunction of terms [7] [8], to the parse-tree expressions of genetic

programming [9]. The flexibility in fitness function formulation allows the

development of classification models that are tailored to specific domain

constraints and objectives. Bhattacharyya [10] [6], for example, shows how GA-

based models developed with the fitness function explicitly seeking the direct

marketing objective of maximizing lifts at specific mailing-depths can yield

significant improvements over the traditional classification objective of

minimizing errors. Various recent studies report on the application of

evolutionary search in data mining [11] [12] [13].

Given the massive amounts of data accumulated by organizations today, a

key challenge facing the data-mining field is the development of methods for

large-scale data mining. A focus on problems arising from large data is, in fact, a

defining characteristic of data mining (being defined as the process of gleaning


useful and actionable knowledge from large data [14]). Most traditional

classification methods, however, require the simultaneous use of all the training

data, thereby posing a severe challenge in their direct application to large data

problems – it is, after all, desirable that all available data be used in developing

the classifier, even when all the data cannot be accommodated in memory.

Further, many learning algorithms are of an iterative nature, necessitating several

passes through the data – attempting to develop models on the entire data may

then take too long. Moreover, it may not always be possible to have all the data in

the same place at the same time due to, for example, security or privacy reasons, or

because the data may be present in naturally distributed settings. In short,

traditional classification methods are not readily amenable to handling large-scale

distributed data.

In recent years, a number of approaches have been suggested for overcoming

such large-data limitations. While some seek to incrementally develop a model

from subsets of the data [15], others obtain classification from a team or ensemble

of classifiers. Various studies have shown that ensembling is a useful method

for addressing the problem of large-scale and potentially distributed data [16] [17]

[18] [19]. Ideally, different base classifiers in an ensemble capture different

patterns or aspects of a pattern embedded in the whole data; then, through

ensembling, these different patterns or aspects are incorporated into a final

prediction. A number of studies have examined techniques like bagging and

boosting for combining predictions from multiple classifiers, and these have been

shown to improve the performance of ensembles over individual classifiers.


Comparative evaluations of different variants of these are given in [20] and [21].

Note that data may be ‘large’ not only from the perspective of number of

observations, but also in carrying a large number of attributes. Various

dimensionality reduction and feature selection methods are applicable for this.

The focus of this paper is on the classification of data having a large number of

observations, and the term large-scale is used only in this context.

This paper proposes the use of genetic programming (GP) as a technique for

developing base classifiers for an ensemble. A key advantage of GP arises from

the ability of its parse-tree representation to model almost arbitrary non-linear

forms. A GP model can thus seek to capture various non-linear relationships in

the data. The genetic search procedure is also ideally suited to explore the large,

and, given the usually scant prior knowledge on data relationships, often poorly

understood search spaces that typify real-life data-mining problems. Different

models developed on data subsets can thus incorporate a diversity of data-patterns,

which can then be combined as an ensemble for final classification. The utility of

GP as an ensembling approach is examined using a real-life large dataset of about

five million observations from the KDD-Cup 1999 competition (the data is available at http://www.kdnuggets.com/datasets/). The

experimental study compares the performance of GP-ensembling to ensembles

based on decision trees as well as logistic regression. Results demonstrate the

advantages of the proposed GP based ensembling scheme. As expected, the GP

models are found to capture diverse patterns in the data, while the developed

decision tree models look similar. The ensemble of GP models is seen to yield

significantly superior performance over the decision tree ensemble. Our findings


also show that GP is able to discern data patterns using a small subset of variables

in the data, indicating its utility for feature selection and exploratory data analysis.

The paper is organized as follows: the next section provides a brief

background on genetic search and the non-linear representations of genetic

programming, and presents some relevant literature on ensembling approaches for

data mining. Section 3 details the data used for the experimental study, and

presents the experiments and results. Concluding remarks and future research

directions are given in the last section.

2. BACKGROUND

2.1 Ensembles for Large Scale Classification

The notion of aggregating information from multiple models has been

addressed in the statistics and decision-making literature (see, for example, [22]

[23] [24]), mostly in the context of combining forecasts from multiple models

based on different methods, but using the same data. For data mining on large-

scale data, the focus is on using smaller data subsets so as to ease the

computational burden. The essence of ensembling is to take small subsets from a

large data set to build a number of base classifiers, and then combine these base

classifiers into an ensemble. Since base classifiers are built using a small part of

the whole data, learning requires less memory, as compared to traditional

approaches of considering all the data together, and also proceeds faster in terms

of computational speed. Besides, it can readily be implemented in a parallel

fashion.
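To make this scheme concrete, the following is a minimal sketch of the subset-and-combine idea, assuming scikit-learn-style base learners and a simple majority vote; the function names, the subset size, and the use of decision trees as base learners are illustrative assumptions, not the implementation used in this study.

```python
# Sketch of the subset-and-combine scheme: draw small random subsets from a large
# dataset, train one base classifier per subset, and combine them by majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_ensemble(X, y, n_subsets=10, subset_size=28_000, seed=0):
    """Train one base classifier per random subset drawn from the large dataset."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_subsets):
        idx = rng.choice(len(X), size=subset_size, replace=True)
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def majority_vote(models, X):
    """Combine base-classifier predictions by simple majority voting (0/1 labels)."""
    votes = np.stack([m.predict(X) for m in models])   # shape: (n_models, n_samples)
    return (votes.mean(axis=0) >= 0.5).astype(int)
```

Because each base learner sees only its own subset, the individual fits can also be run in parallel, which is the source of the memory and speed advantages noted above.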


Bagging [25] and boosting [26] are popular methods for combining the

predictions from multiple base classifiers in creating ensembles [27]. In bagging,

different training sets for learning the base classifiers are obtained by resampling

the training data with replacement, and the classifiers are combined using a voting

scheme. While a simple average of the predictions of individual classifiers has

been found to yield a good ensemble, more complex voting mechanisms have also

been proposed; while these can show improved performance, they can also result

in an overfitted ensemble [28]. Boosting is based on weighted resampling where

classifiers are obtained as a series, with the later classifiers being trained on data

containing more of the observations that were incorrectly classified by the earlier

classifiers. Bagging has been found to outperform individual classifiers in several

studies, while boosting, though showing improved performance over bagging in

certain datasets, is noted to be susceptible to noise in the data [21]. It is noted in

[25] that bagging is effective with learning algorithms that are “unstable” in the

sense that small changes in the data can produce large differences in classifier

performance; decision tree and neural network learning procedures are mentioned

as examples of such algorithms.
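The sketch below illustrates only the reweighting idea behind boosting described above: later classifiers are trained on resamples that emphasize the observations earlier classifiers misclassified. The doubling rule and the decision-tree base learner are simplifying assumptions; this is not the exact weight-update or combination scheme of algorithms such as AdaBoost.

```python
# Boosting-style weighted resampling: misclassified observations get larger weights,
# so later base classifiers concentrate on the harder cases.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost_like(X, y, n_rounds=5, seed=0):
    rng = np.random.default_rng(seed)
    weights = np.full(len(X), 1.0 / len(X))    # start with uniform sampling weights
    models = []
    for _ in range(n_rounds):
        idx = rng.choice(len(X), size=len(X), replace=True, p=weights)
        model = DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx])
        models.append(model)
        wrong = model.predict(X) != y
        weights[wrong] *= 2.0                  # emphasize misclassified observations
        weights /= weights.sum()               # renormalize to a probability distribution
    return models
```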

Although the speedup advantage of ensembling is obvious and easy to

obtain, preventing a significant decrease in classification accuracy, compared to

that of a single model trained on the whole data, is not an

easy task. However, [19] and [16] showed that reaching such a goal is possible.

[19] examined the number of subsets that an original dataset can be divided into

before significant performance decrease occurs. They found that, although the


number varies from dataset to dataset, most of the 15 datasets they studied could

be divided into four or more subsets with less than 2% decrease in classification

accuracy. Given that the 15 datasets studied by [19] are relatively small (all less

than 43,500 examples), one would expect that much larger datasets (millions or

more examples) can be divided into many more subsets, with each still not being

too small. [16] also studied the impact of number of subsets on an ensemble’s

performance. However, instead of dividing the original dataset, [16] took subsets

from it with replacement, which makes it possible to take an infinite number of

subsets. [16] showed that an ensemble's performance increases with the number of

subsets. The increase is fast in the beginning and after a certain point reaches an

asymptotic value close to the performance of a single model built on the entire

data. In the five datasets that [16] studied, the ensemble’s performance reaches

the asymptotic value within 100 subsets. [16] also showed that the bigger the

subset (e.g., 800 examples vs. 100 examples), the faster the performance of the

ensemble approaches the asymptotic value.

Besides the number and size of subsets, studies have shown that the

combining of predictions from an ensemble tends to be effective when the

individual predictors have errors that are independent of each other [29] [30] [31].

Diversity amongst the individual classifiers is thus beneficial to an ensemble’s

performance. In [30], diversity among base classifiers (neural network models)

was increased by constructing each model on a different data subset. In [31],

diversity among base classifiers (neural network models) was increased by using

a different subset of attributes in constructing each model. Thus each model only


captures patterns embedded in the subset of attributes on which it was built. [31]

calls this method “ensemble feature selection”. The notion of diversity is also

intuitive, considering that combining a number of very similar models is not

likely to help much in performance. [29], lending theoretical support for this

notion, demonstrated that integrating independent inputs produces better results

than integrating dependent inputs.

Other aspects of ensembling have also been studied. [17] examined

further mechanisms for combining base classifiers. They proposed two “meta-

learning” strategies that learn how to combine base classifiers, just as base

classifiers are learned from raw data. Both strategies, called arbitration and

combining, use the base classifiers' predictions as input for "meta-learning", with

the arbitration strategy also taking the prediction of an arbiter as input.

Ensembling has also been studied on a special type of large-scale data: streaming

data, where new data continuously flows into a dataset at a high speed. The

strategy proposed by [18] is to train a new base classifier on a new chunk of data

of a certain size and to evaluate the new classifier against all those in a committee of

classifiers. If the new classifier performs better than some classifier in the

committee, one classifier in the committee is replaced by the new classifier. In

this way, the classifier committee is kept updated as new data arrives. A similar

strategy was proposed by [32] to update a decision tree.
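The following is a sketch of this committee-update idea for streaming data. The committee size, the evaluation set, and the rule of replacing the weakest member are simplifying assumptions made for illustration; they are not the exact algorithm of [18].

```python
# Streaming-ensemble sketch: train a classifier on each new data chunk and let it
# replace a weaker committee member, so the committee tracks the incoming stream.
from sklearn.tree import DecisionTreeClassifier

def update_committee(committee, X_chunk, y_chunk, X_eval, y_eval, max_size=10):
    new_model = DecisionTreeClassifier().fit(X_chunk, y_chunk)
    if len(committee) < max_size:
        committee.append(new_model)            # committee not full yet
        return committee
    scores = [m.score(X_eval, y_eval) for m in committee]
    worst = scores.index(min(scores))
    if new_model.score(X_eval, y_eval) > scores[worst]:
        committee[worst] = new_model           # replace the weakest member
    return committee
```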


2.2 Genetic Search and Genetic Programming

The term genetic algorithm describes a class of stochastic search procedures

inspired by principles of natural genetics and survival of the fittest. They operate

through a simulated evolution process on a population of solution structures that

represent candidate solutions in the search space. Evolution occurs through (1) a

selection mechanism that implements a survival of the fittest strategy, and (2)

genetic recombination of the selected solutions to produce offspring for the next

generation. GAs are considered suitable for application to complex search spaces

not easily amenable to traditional techniques, and are noted to provide an

effective tradeoff between exploitation of currently known solutions and a robust

exploration of the entire search space. The selection scheme operationalizes

exploitation and recombination effects exploration. Crossover and mutation are

the two basic recombination operators. Crossover implements a mating scheme

between pairs of “parents” to produce “progeny” that carry characteristics of both

parents. Mutation is a random operator applied to guard against premature

convergence of the population; mutation also maintains the possibility that any

population representative can be ultimately generated. A detailed account of

genetic search may be found in [33].
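A minimal sketch of the generic evolutionary loop just described is given below: select the fitter members, recombine them via crossover, and occasionally mutate offspring. The truncation-style selection and all operator and fitness arguments are placeholders to be supplied by the problem; they are illustrative assumptions, not a specific GA from the literature cited here.

```python
# Generic evolutionary loop: selection (survival of the fittest), crossover, mutation.
import random

def evolve(population, fitness, crossover, mutate,
           generations=100, mutation_prob=0.2, rng=random):
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        parents = scored[: len(scored) // 2]              # keep the fitter half
        offspring = []
        while len(offspring) < len(population):
            p1, p2 = rng.sample(parents, 2)
            c1, c2 = crossover(p1, p2)                    # recombination
            offspring += [c1, c2]
        population = [mutate(c) if rng.random() < mutation_prob else c
                      for c in offspring[: len(population)]]
    return max(population, key=fitness)                   # best solution found
```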

In the context of classification, each population member can specify a

symbolic classification rule as in [7], or can represent a weight vector defining a

traditional regression-like function [5], or general non-linear expressions as in

genetic programming (GP) [9]. In this paper, we consider the non-linear GP

representation for classifiers.


Genetic programming [9] is a GA-variant that uses hierarchical

representations (Figure 1a gives an example). Solutions in GP can be represented

as parse trees where internal nodes define some function (with its subtree

branches as function arguments) and leaf nodes define some problem-related

constants (terminals). Each population member thus represents a function f(x) of

the data attributes (the independent variables) that can be depicted as a parse tree, thus

allowing arbitrarily complex classifiers. The function f(x) can be interpreted as a

classifier in the usual way:

f(x) ≤ 0 implies Group 1;   f(x) > 0 implies Group 2.

The functional form of the classifier is determined by (1) a function set F defining

the functions to be used, (for example, the arithmetic operators (+, -, *, /), together

with any allowed functions like log(.), exp(.), sin(.), etc., logical operators (And,

Or) etc.), and (2) a terminal set T defining the variables and values allowed at the

leaf nodes of the tree (in this case, the independent variables in the data, together

with numerical constants).
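The sketch below shows how such a parse tree can act as a classifier: the tree is evaluated on an observation x, and the sign of f(x) determines the group. The nested-tuple representation, the dictionary of observations, and the protected division are illustrative assumptions, not the representation used in the authors' software.

```python
# A GP parse tree as a classifier: evaluate f(x) recursively, then apply the rule
# f(x) <= 0 -> Group 1, f(x) > 0 -> Group 2.
OPS = {'+': lambda a, b: a + b,
       '-': lambda a, b: a - b,
       '*': lambda a, b: a * b,
       '/': lambda a, b: a / b if b != 0 else 1.0,   # protected division
       '>=': lambda a, b: 1.0 if a >= b else 0.0}    # comparison returns 1 or 0

def evaluate(tree, x):
    """Recursively evaluate a parse tree on one observation x (a dict of attributes)."""
    if isinstance(tree, tuple):          # internal node: (operator, left_subtree, right_subtree)
        op, left, right = tree
        return OPS[op](evaluate(left, x), evaluate(right, x))
    if isinstance(tree, str):            # terminal: an attribute name
        return x[tree]
    return float(tree)                   # terminal: a numeric constant

def classify(tree, x):
    """f(x) <= 0 implies Group 1; f(x) > 0 implies Group 2."""
    return 1 if evaluate(tree, x) <= 0 else 2

# Example: f(x) = (v23 * 0.4) - v37
example_tree = ('-', ('*', 'v23', 0.4), 'v37')
print(classify(example_tree, {'v23': 1.0, 'v37': 2.0}))   # prints 1
```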

Crossover is defined as an operator that exchanges randomly chosen

subtrees of two parents to create two new offspring. This is illustrated in Figure

1b. Mutation randomly changes a subtree. Since the regular GP operators are

inadequate for learning numeric constants [34], a non-uniform mutation is used

here [35].
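For illustration, the following sketch implements subtree crossover on the nested-tuple trees used above: a randomly chosen subtree of one parent is swapped with a randomly chosen subtree of the other. The path-based addressing of subtrees is a simplification assumed here, not the operators of any particular GP package.

```python
# Subtree crossover on nested-tuple parse trees.
import random

def all_subtree_paths(tree, path=()):
    """List paths to every node; a path is a sequence of child indices (1 or 2)."""
    paths = [path]
    if isinstance(tree, tuple):
        paths += all_subtree_paths(tree[1], path + (1,))
        paths += all_subtree_paths(tree[2], path + (2,))
    return paths

def get_subtree(tree, path):
    return tree if not path else get_subtree(tree[path[0]], path[1:])

def set_subtree(tree, path, new):
    if not path:
        return new
    node = list(tree)
    node[path[0]] = set_subtree(node[path[0]], path[1:], new)
    return tuple(node)

def crossover(parent1, parent2, rng=random):
    p1 = rng.choice(all_subtree_paths(parent1))
    p2 = rng.choice(all_subtree_paths(parent2))
    child1 = set_subtree(parent1, p1, get_subtree(parent2, p2))
    child2 = set_subtree(parent2, p2, get_subtree(parent1, p1))
    return child1, child2
```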

With its ability to model various non-linear terms, GP holds much promise

for data mining. [9] illustrates the use of GP for ‘symbolic regression’ showing

how a GP can learn a non-linear function to fit given data points. The use of


genetic search in data mining problems is, however, relatively new. [5] examined

the use of genetic algorithms for learning linear discriminant functions, and the

comparative study in [36] includes both linear-GA and GP. [37] examined the impact of

different GP settings on classification accuracy; [11] proposed the use of genetic

search for feature selection; [10] [6] applied genetic algorithms to directly

optimize lifts, and [38] examined genetic algorithms and GP for multi-objective

data mining. [39] proposed a method for using GP in the context of a relational

database, where a GP tree and its fitness function (classification accuracy) were

converted into a relational query.

(insert figure 1a here)

(insert figure 1b here)

3. EXPERIMENTS AND RESULTS

3.1 Data and Experiments

The dataset used in this study is from the KDD Cup’99 competition,

which consists of connection data of a simulated U.S. Air Force LAN. The

dataset is large, with the training dataset containing about five million examples and

the test dataset about 300,000 examples. Each example is labeled "normal" or one of

25 attack types. Since multi-class classification is not the focus of this study,

different attack types are not differentiated. That is, each example is only

classified as “normal” or “attack”. Besides the class variable, each example is

described by 41 attributes, which can be classified into three groups: basic


features (e.g., duration of the connection), content features (e.g., number of

failed login attempts within a connection), and traffic features (e.g., number

of connections to the same host as the current connection in the past two seconds,

etc.). Most examples in the datasets are of “attack” type, which accounts for 80%

of the training data and 80.52% of the test data.

To construct base classifiers, ten relatively small (0.6% of the training

data, about 28,000 examples each) subsets were randomly taken with replacement from the

training dataset. Subsets of larger size were examined in pilot experiments, but

these did not show advantages in terms of base classifiers’ classification accuracy

over the chosen size.

A post hoc experiment showed that the number of subsets chosen in this

study (10) is also appropriate. Specifically, the relationship between number of

subsets and performance of the GP ensemble on test data was examined. The

results showed that the GP ensemble's classification accuracy increases

significantly when the number of subsets goes from 1 to 10. However, the

performance improvement is marginal when the number of subsets is more than

10 (see figure 2).

(insert figure 2 here)

Prior to training, two data preparation steps were performed: a) non-

numerical attributes were converted to numeric values, and b) all attributes were

standardized to zero mean and unit standard deviation.
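A minimal sketch of these two preparation steps is shown below, assuming the data are held in a pandas DataFrame; the column handling and the integer encoding of non-numeric attributes are illustrative assumptions rather than the exact preprocessing pipeline used in the study.

```python
# a) convert non-numeric attributes to numeric codes; b) standardize every attribute
# to zero mean and unit standard deviation.
import pandas as pd

def prepare(df, class_column="label"):
    X = df.drop(columns=[class_column]).copy()
    for col in X.columns:
        if X[col].dtype == object:                   # non-numeric attribute
            X[col] = X[col].astype("category").cat.codes
    X = (X - X.mean()) / X.std().replace(0, 1)       # zero mean, unit std (guard /0)
    return X, df[class_column]
```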


On each subset, one GP model, one decision tree, and one

logistic regression model were trained. The decision tree and logistic regression

models were built for comparison purposes. A uniform cost model was adopted in

all model training. That is, the same unit cost was assigned to both false positives

(“normal” being classified as “attack”) and false negatives (“attack” being

classified as “normal”). The decision tree and logistic models were developed

using the standard implementations in the SAS Enterprise Miner data-mining tool.

Since the software automatically prunes decision trees based on performance on a

validation dataset, a separate (different from the ten subsets used for training the

base classifiers) data subset was used for this purpose.

Running times of the ten individual GP models varied greatly because the

complexities of the models differed considerably. On a PC with a 400 MHz CPU and

about 200 MB of RAM, the running time varied from about 10 to about 40

minutes. Running times of individual decision tree and logistic regression models

were only a couple of minutes on the same PC. Since computational cost is low

and continues to fall, the heavier computational burden of

training GP models is considered tolerable.

Commonly used parameter values and GP settings, as found in the literature,

were used in the experimental study. The function-set for GP included four

arithmetic (+, -, *, /), two comparisons (>=, <=), and six mathematical (exp, ln,

log, sin, cos, tan) operators. The result of a comparison operation is 1 or 0, so that

it can be combined with arithmetic and mathematical operations without

adopting strongly typed GP. The terminal set included the independent variables in


the data, together with numeric real constants in the [-1, 1] range. Classification

accuracy was used as the fitness function. Crossover and mutation probabilities

were set to 1 and 0.2, respectively. A population size of 51 was used. All

models were trained for 100 generations; monitoring of the training process

showed that all models converged within 100 generations. The GP settings were

kept unchanged across all ten subsets, except for different random seeds used for

the subsets.
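For reference, the settings reported above are gathered in one place below, with classification accuracy as the fitness function. This is a descriptive sketch of the reported configuration, not the software actually used in the study; the dictionary keys and the fitness helper are illustrative.

```python
# Summary of the reported GP configuration and the accuracy-based fitness function.
import numpy as np

GP_SETTINGS = {
    "function_set": ["+", "-", "*", "/", ">=", "<=", "exp", "ln", "log", "sin", "cos", "tan"],
    "terminal_set": "independent variables plus constants in [-1, 1]",
    "population_size": 51,
    "generations": 100,
    "crossover_probability": 1.0,
    "mutation_probability": 0.2,
}

def fitness(predictions, y_true):
    """Fitness = classification accuracy on the training subset."""
    return float(np.mean(np.asarray(predictions) == np.asarray(y_true)))
```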

After the base classifiers were built, they were combined into ensembles: one

GP ensemble, one decision tree ensemble, and one logistic regression ensemble.

The combining mechanism used was simple majority voting; that is, the class

predicted by the majority of the base classifiers was taken as the ensemble's prediction.

3.2 Results

The GP ensemble was found to perform much better than the decision tree

and logistic regression ensembles, with classification accuracies of the three

ensembles being 90.55%, 80.52%, and 80.52%, respectively. Table 1 gives the

confusion matrix of the GP ensemble on the test data. Considering that 80.52% of

the examples in the test data are of the "attack" type, the decision tree and logistic regression

ensembles simply classified all examples as "attack". Amongst the 10 base

decision-tree classifiers, only one was observed to show better performance on the

test data (see Table 2); the ensemble, as a whole, however, performed rather poorly.

(insert table 1 here)


The poor performance of the decision tree ensemble is surprising because two

of the three winning entries of KDD Cup'99 used decision trees as the base algorithm.

However, results of the current study cannot be directly compared to those of the

three winning entries of KDD Cup’99 because, as mentioned in the beginning of

section 3.1, the original problem was simplified in this study. Specifically, the

original five-class (one normal and four attack classes) classification problem was

simplified into a binary classification problem by aggregating the four attack

classes into a single class called “attack”. This simplification is appropriate

because multi-class classification is not the focus of this study.

The difference in the three ensembles’ performance can be attributed to

differences in their base classifiers. Two obvious differences can be observed

between the GP and decision tree classifiers: a) as expected, GP classifiers

exhibited greater diversity; b) in general, GP classifiers exhibited less overfitting.

Diversity of GP classifiers was exhibited in two aspects: performance and

structure.

Table 2 shows the performance of the individual classifiers in the GP,

decision tree, and logistic regression ensembles, on both the training and the test

data. While the individual decision tree classifiers attained uniformly higher

accuracies than GP classifiers on training data subsets, only four of them kept this

superiority on the test data. A paired t-test showed that the average

classification accuracy of individual decision tree classifiers on training data

(99.5%) is significantly higher than that of GP classifiers (94.5%, p = 0.01), but

the average accuracies on test data of the two groups of classifiers are not


significantly different. This demonstrates the decision tree classifiers' greater

overfitting. Another observation that we can make from Table 2 is the variation in the

individual classifiers' performance. All three groups of individual classifiers

showed greater performance variation on test data than on training data. GP

classifiers also showed higher performance variation than decision tree and logistic

regression classifiers, on both training and test data. This can be verified by the

disparate variances of classification accuracy of the three types of classifiers (see

table 2).
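The paired comparison described above can be reproduced along the following lines, assuming the per-subset accuracies are available as lists; the scipy call and the variable names are illustrative assumptions, and the training-accuracy values are taken from Table 2.

```python
# Paired t-test across the ten training subsets, comparing decision tree and GP
# base-classifier training accuracies (values from Table 2).
from scipy import stats

gp_train = [98.88, 91.52, 89.88, 94.51, 99.30, 84.97, 97.13, 98.70, 91.69, 99.03]
dt_train = [99.39, 99.50, 99.56, 99.40, 99.60, 99.40, 99.58, 99.26, 99.82, 99.48]

t_stat, p_value = stats.ttest_rel(dt_train, gp_train)   # paired over the same subsets
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```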

Notice that in Table 2, the classification accuracies of two GP base

classifiers (subsets 7 and 9) on the test data were exceptionally low: 19.48% and

41.15%. However, these two classifiers did not have a significant detrimental

effect on the final ensemble’s performance. When these two classifiers were

excluded from the ensemble, the ensemble’s performance only increased from

90.55% to 91.06%. This reinforces that ensembling is a robust method, especially

given the fact that only ten base classifiers were constructed in this study. This

robustness can be valuable when using a higher number of base classifiers; that is,

the ensemble's performance will not be greatly influenced by a few poorly

performing base classifiers.

(insert table 2 here)

The high diversity of GP base classifiers can also be observed in the

structure of the models obtained. An important aspect of a GP model’s structure

is its complexity, which is defined here as the number of operators and terminal


terms that are contained in the model. As mentioned earlier, the terminal terms of

a GP model are comprised of data attributes and constants. For the purpose of

comparison of the diversity of an ensemble’s base classifiers, the complexity of a

decision tree model is represented here by its depth and number of leaves.
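The complexity measure defined above can be computed directly on the nested-tuple parse trees used in the earlier sketches; the small helper below is illustrative only.

```python
# Complexity of a GP model = number of operators plus terminal terms in the tree.
def complexity(tree):
    """Count every node (operators plus terminals) in a GP parse tree."""
    if isinstance(tree, tuple):                 # operator node with two children
        return 1 + complexity(tree[1]) + complexity(tree[2])
    return 1                                    # terminal: attribute or constant

# Example: ('-', ('*', 'v23', 0.4), 'v37') has complexity 5 (2 operators + 3 terminals).
print(complexity(('-', ('*', 'v23', 0.4), 'v37')))   # prints 5
```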

The complexities of the ten GP and decision tree base classifiers are

presented in Table 3. It can be seen from the table that the complexity of GP base

classifiers varies much more than that of decision tree classifiers. For instance,

the complexity of the GP classifiers varies widely from 3 to 23, while the depths of the

decision tree base classifiers were uniformly 3 and the number of leaves varied

within a narrow range of (11, 16). The structural diversity of GP base classifiers

can also be observed in the attributes that are included in each model -- these are

shown in Table 4. It can be seen from the table that attributes included in

different GP models overlap with each other to a much lesser extent than the

attributes included in the different decision tree models. For instance, no attribute

appeared in more than four GP base classifiers, but three attributes (v4, v23, v37)

appeared in all ten decision-tree base classifiers.

Previous studies have indicated that slight overfit in base classifiers is

beneficial to ensembling, as this overfit is neutralized when base classifiers are

combined [28]. However, an opposite result was observed in our study. In

general, the decision tree base classifiers exhibited greater overfit than GP base

classifiers. Classification accuracies of the ten decision tree base classifiers on

training data were all very high (>99%) and higher than those of the corresponding

GP base classifiers (see Table 2). However, on the test data, the majority of GP


base classifiers had higher classification accuracies than the corresponding

decision tree base classifiers. This contrary result arises

because the decision tree base classifiers in this study are too similar to each other,

which prevents their overfit from being neutralized during combining.

(insert table 3 here)

(insert table 4 here)

4. CONCLUSION

This study demonstrates the utility of GP as a method for creating and

using ensembles in data mining. Given its representational power in modeling

complex non-linearities in the data, GP is seen to be effective at learning diverse

patterns in the data. With different models capturing varied data relationships, GP

models are ideally suited for combination in ensembles. Experimental results

show that different GP models are dissimilar both in terms of the functional form

as well as with respect to the variables defining the models.

The observed diversity in different GP models can be attributed to a local

search conducted in different invocations of the genetic search. Note that

relatively small population sizes (of 51 population members) were used. Genetic

search with small populations is known to lead to local search; given different

(here, random) initial points, small populations effect local search in different

regions of the search space. While this study, with its focus on large data,

considered different GP models obtained from different data subsets, a diversity

of models can also be sought from the same dataset, by initiating genetic search


using different random number seeds. In this way, GP can be generally useful as

a modeling technique for use with various bagging and boosting schemes, and this

forms a topic of our continuing investigation.

Results obtained in this study also indicate that GP models are able to

‘select’ a subset of relevant attributes – notice that many of the GP models are

comprised of a handful of the total variables in the dataset (Table 4). This points

to the potential usefulness of GP as a method for variable selection and feature

extraction. Related studies by the authors show promise in that GP models are

seen to yield comparable, if not better, classification performance (over traditional

data mining methods) with fewer variables; GP here is able to discern useful non-

linear interactions consisting of fewer variables. Future studies can explore this

capability of GP more thoroughly. For example, datasets that contain a large

number of attributes are most suitable for this kind of study. GP can be used to

select a small subset of useful attributes out of all attributes first; other methods

can then be used to work on the attribute subset.

Notice that the size of the data subsets used for building the base

classifiers in this study was relatively small compared to the entire dataset. It is

valuable to find that GP is able to learn informative models from these small data

subsets. While the resulting GP-ensemble was found to significantly outperform

the other techniques, the use of a higher number of data-subsets to consider more

base classifiers for a GP-ensemble is an interesting topic for further study. Given

the positive results of GP’s usefulness in ensembling, as evidenced in this study, it

will also be worthwhile to consider extending this work towards multi-group


classification (the same data as used in this study, which defines, in general,

multiple "attack" types, can be useful for this purpose).

Decision-makers often seek models that are interpretable. Interpretability

of obtained models presents an advantage of GP over techniques like neural

networks, where interpretation of different weights and network structures is

usually not obvious. However, large tree-structured GP models may also not be

easily interpretable. GP trees often exhibit ‘bloat’, with repeating and redundant

terms in the tree [9] [40] [41]. While it is recognized that some redundancy can

aid in the search (introns) and may be desirable [40], this tendency of GP to

develop large trees is usually addressed by limiting the trees to some specified

depth and size (as done in this study); alternative methods have also been

proposed for controlling such bloat (for example, [41] [42]). Another

representation related issue with regular GP arises from the closure requirement

over the function set. The closure property requires that all elements of a tree

return the same data type, so as to allow arbitrary sub-trees to be recombined by

the crossover and mutation operators. Where such closure is enforced by treating,

say, Boolean and Real values identically, the resulting trees can make

interpretation harder. To overcome this, [43] describes strongly typed GP, where

elements of a tree can be of any pre-defined data type, with the tree initialization

routines and search operators generating only syntactically correct tree structures.

In the context of data mining, the typing mechanism can be defined to handle

binary, nominal and real-valued attributes suitably, and also to specify appropriate


combinations of different function types so that the resulting model is readily

interpretable.

The use of GP as a data mining technique has not received much attention

in the literature. For practitioners, too, no GP-based software tool, designed for

data mining, is currently available (in fact, no genetic search based tool was

noticed in the software section in the KDD website – www.kdnuggets.com).

The findings in this study provide only an initial step in exploring the data-mining

potential of GP. Various issues remain for further research. The design of fitness

functions tailored to different data-mining tasks and objectives, the structuring of

functional primitives and the model representation, susceptibility to and

avoidance of overfit and noise are amongst the important issues for investigation.


Figure 1a: GP tree example

Figure 1b: Crossover in GP


Figure 2: Number of subsets and classification accuracy of GP ensembles (test-data classification accuracy plotted for ensembles built from 1, 2, 3, 4, 5, 10, 20, and 50 subsets)

                        Predicted
                  Normal      Attack       Total
Actual  Normal    44,591      16,002       60,593
        Attack    13,405      237,031      250,436
        Total     57,996      253,033      311,029

Table 1: Confusion Matrix of GP Ensemble


                     Accuracy on Test Data (%)                Accuracy on Training Data (%)
Training Subsets     GP       Decision Tree  Logistic Reg.    GP       Decision Tree  Logistic Reg.
S1                   90.48    80.52          80.52            98.88    99.39          79.89
S2                   89.26    80.52          80.52            91.52    99.50          80.53
S3                   88.81    81.19          80.52            89.88    99.56          79.91
S4                   90.64    80.52          80.52            94.51    99.40          80.36
S5                   80.46    80.52          80.52            99.30    99.60          80.17
S6                   74.64    81.93          80.52            84.97    99.40          79.80
S7                   19.48    81.32          80.52            97.13    99.58          80.16
S8                   90.67    80.52          80.52            98.70    99.26          80.30
S9                   41.15    90.95          80.52            91.69    99.82          80.21
S10                  91.85    78.66          80.52            99.03    99.48          79.59
Standard Deviation   25       3.4            0                4.9      0.15           0.29
Ensemble Accuracy    90.55    80.52          80.52

Table 2: Classification Accuracy of Base Classifiers

           GP Trees        Decision Trees
Subsets    Complexity      Depth    Number of Leaves
S1         13              3        11
S2         3               3        15
S3         5               3        12
S4         5               3        12
S5         23              3        12
S6         3               3        12
S7         15              3        12
S8         7               3        11
S9         13              3        13
S10        3               3        16

Table 3: Complexity of GP and Decision Tree Base Classifiers


Subsets    GP Trees                        Decision Trees
S1         v1, v5, v12, v23, v32, v37      v3, v4, v23, v37
S2         v30, v32                        v4, v5, v23, v37
S3         v37                             v4, v6, v23, v34, v37
S4         v12, v22, v28                   v3, v4, v23, v37
S5         v14, v23, v31, v38, v41         v2, v4, v23, v34, v37
S6         v6, v40                         v3, v4, v6, v23, v37
S7         v1, v12, v32                    v3, v4, v6, v23, v37
S8         v12, v23                        v4, v23, v37
S9         v19, v32, v34                   v3, v4, v6, v23, v37
S10        v2, v30                         v4, v23, v34, v37

Table 4: Attributes Included in GP and Decision Tree Base Classifiers


References:

[1] R.-A. Fisher, The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics, 7(1936), 179-188.

[2] N. Freed and F. Glover, A Linear Programming Approach to the Discriminant Problem. Decision Sciences, 12(1981), 68-74.

[3] D.-J. Hand, Discrimination and Classification. New York: John Wiley and Sons, 1981.

[4] P.-S. Bradley, U.-M. Fayyad, and O.-L. Mangasarian, Mathematical programming for data mining: Formulations and challenges. INFORMS Journal on Computing, 11(1999), 217-238.

[5] G.-J. Koehler, Linear Discriminant Functions Determined through Genetic Search. ORSA Journal on Computing, 3(1991), 345-357.

[6] S. Bhattacharyya, Direct Marketing Performance Modeling using Genetic Algorithms. INFORMS Journal on Computing, 11(1999).

[7] D.-P. Greene and S.F. Smith, A Genetic System for Learning Models of Consumer Choice, in J.J. Grefenstette: Genetic Algorithms and their Applications: Proceedings of the Second International Conference on Genetic Algorithms, Hillsdale, NJ: L. Erlbaum Associates, 1987.

[8] K. DeJong, W.-M. Spears, and D.-F. Gordon, Using Genetic Algorithms for Concept Learning. Machine Learning, 13(1993), 161-188.

[9] J.-R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, 1992.

[10] S. Bhattacharyya, Direct Marketing Response Models using Genetic Search. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD-98), New York, AAAI Press, 1998.

[11] Y. Kim, W. Street, and F. Menczer, Feature selection in unsupervised learning via evolutionary search. Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2000), 2000, 365-369.

[12] Y. Kim, W. Street, and F. Menczer, An evolutionary multi-objective local selection algorithm for customer targeting. Congress on Evolutionary Computation (CEC2001), Seoul, Korea, 2001, 759-766.


[13] S. Bhattacharyya, Evolutionary Algorithms in Data Mining: Multi-Objective Performance Modeling for Direct Marketing. Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2000), Boston, AAAI Press, 2000.

[14] U.-M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Advances in Knowledge Discovery and Data Mining. AAAI Press, 1996.

[15] P. Domingos and G. Hulten, Mining High-Speed Data Streams. Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining (KDD-2000), Boston, AAAI Press, 2000.

[16] L. Breiman, Pasting bites together for prediction in large data sets and on-line. Machine Learning, 36(1999), 85-103.

[17] P. Chan and S. Stolfo, On the accuracy of meta-learning for scalable data mining. Journal of Intelligent Information Systems, 8(1997), 5-28.

[18] W. Street and Y. Kim, A streaming ensemble algorithm (SEA) for large-scale classification. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2001), San Francisco, CA, 2001, 377-382.

[19] L. Hall, K. Bowyer, W. Kegelmeyer, T. Moore, C. Chao. Distributed learning on very large data sets. Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2000), Boston, AAAI Press. 2000.

[20] E. Bauer and R. Kohavi, An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants. Machine Learning, 36(1999), 105-142.

[21] D. Opitz and R. Maclin, Popular Ensemble Methods: An Empirical Study. Journal of Artificial Intelligence Research, 11(1999), 169-198.

[22] J.-M. Bates and C.-J. Granger, The Combination of Forecasts. Operational Research Quarterly, 20(1969), 451-468.

[23] R.-L. Winkler and S. Makridakis, The Combination of Forecasts. Journal of the Royal Statistical Society, Series A, 146(1983), 150-157.

[24] R. Clemen, Combining forecasts: A review and annotated bibliography. International Journal of Forecasting, 5(1989), 559-583.

[25] L. Breiman, Bagging predictors. Machine Learning, 24(1996), 123-140.


[26] R. Schapire, The Strength of Weak Learnability. Machine Learning, 5(1990), 197-227.

[27] H. Iba, Bagging, boosting, and bloating in genetic programming. Proceedings of the Genetic and Evolutionary Computation Conference, 1999, 1053-1060.

[28] P. Sollich and A. Krogh, Learning with ensembles: How overfitting can be useful, in: D. Touretzky, M. Mozer, and M. Hasselmo (eds.), Advances in Neural Information Processing Systems, 8(1996). MIT Press, Cambridge, MA.

[29] R. Jacobs, Methods for combining experts' probability assessments. Neural Computation, 7(1995), 867-888.

[30] A. Krogh and J. Vedelsby, Neural network ensembles, cross validation, and active learning, in: G. Tesauro, D. Touretzky, and T. Leen (eds.), Advances in Neural Information Processing Systems, 7(1995), MIT Press, Cambridge, MA, 231-238.

[31] D. Opitz, Feature selection for ensembles. Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI), 1999, 379-384.

[32] J. Gehrke and R. Ganti, BOAT – Optimistic decision tree construction. Proceedings of the 1999 SIGMOD Conference, Philadelphia, PA, 1999.

[33] M. Mitchell, An Introduction to Genetic Algorithms. MIT Press, Cambridge, Massachusetts, 1996.

[34] M. Evett and T. Fernandez, Numeric Mutation Improves the Discovery of Numeric Constants in Genetic Programming. In J.R. Koza: Proceedings of the Third Annual Genetic Programming Conference, Madison, Wisconsin, Morgan Kaufmann, 1998.

[35] Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs, 2nd Edition, Springer-Verlag, 1994.

[36] S. Bhattacharyya and P. Pendharkar, Inductive, Evolutionary, and Neural Techniques for Discrimination: A Comparative Study. Decision Sciences, 29(1998), 871-900.

[37] J. Eggermont, A. Eiben, and J. Hemert, A comparison of genetic programming variants for data classification, in: D. Hand, J. Kok, and M. Berthold (eds.), Advances in Intelligent Data Analysis (Third International Symposium), Amsterdam, The Netherlands, 1999.


[38] S. Bhattacharyya, Multi-Objective Data-Mining using Genetic Search, Proceedings of the GECCO-2000 Workshop on Data Mining with Evolutionary Algorithms, AAAI press, 2000.

[39] A. Freitas, A genetic programming framework for two data mining tasks: Classification and generalized rule induction. Genetic Programming 1997: Proceedings of the Second Annual Conference, Morgan Kaufmann, 1997, 96-101.

[40] L. Altenberg, Emergent phenomena in genetic programming, in: A. V. Sebald and L. J. Fogel (eds.), Evolutionary Programming -- Proceedings of the Third Annual Conference, World Scientific Pub., 1994, 233-241.

[41] P.-J. Angeline, Subtree crossover causes bloat, in: J. R. Koza et al. (eds.), Genetic Programming 1998: Proceedings of the Third Annual Conference, Morgan Kaufmann Pub., 1998, 745-752.

[42] W.-B. Langdon, Size fair and homologous tree genetic programming crossovers. Genetic Programming and Evolvable Machines, 1(2000), 95-119.

[43] D.-J. Montana, Strongly Typed Genetic Programming. Evolutionary Computation, 3(1995), 199-230.