
Advanced Review

Large-scale data mining using genetics-based machine learning

Jaume Bacardit1∗ and Xavier Llorà2

In the last decade, genetics-based machine learning methods have shown their competence in large-scale data mining tasks because of the scalability capacity that these techniques have demonstrated. This capacity goes beyond the innate massive parallelism of evolutionary computation methods by the proposal of a variety of mechanisms specifically tailored for machine learning tasks, including knowledge representations that exploit regularities in the datasets, hardware accelerations or data-intensive computing methods, among others. This paper reviews different classes of methods that alone or (in many cases) combined accelerate genetics-based machine learning methods. © 2013 Wiley Periodicals, Inc.

How to cite this article:
WIREs Data Mining Knowl Discov 2013, 3: 37–61 doi: 10.1002/widm.1078

INTRODUCTION

We are living in the petabyte era. We have larger and larger data to analyze, process, and transform into useful answers for the domain experts. Robust data mining tools, able to cope with petascale volumes and/or high dimensionality while producing human-understandable solutions, are crucial to numerous aspects of science and society. Genetics-based machine learning (GBML) techniques1,2 are perfect candidates for this task. Recent advances in representations,3–7 learning mechanisms,7–13 and theoretical modeling14–18 along with experimental comparisons against other machine learning techniques realized across benchmark datasets2,19–21 have shown the potential of evolutionary computation (EC) techniques in herding large-scale data analysis.

If GBML techniques aspire to be a relevant player in this context, they need to have the capacity of processing vast amounts of data. They need to process these data within reasonable time. Moreover, massive computation cycles are getting cheaper, allowing researchers to have access to unprecedented computational resources on the edge of petascale computing. Several topics are interlaced in these two requirements: (1) having the proper learning paradigms and knowledge representations, (2) understanding them and knowing when they are suitable for the problem at hand, (3) using efficiency enhancement techniques, and (4) transforming and visualizing the produced solutions to give back as much insight as possible to the domain experts.

∗Correspondence to: [email protected]

1Interdisciplinary Computing and Complex Systems (ICOS) Research Group, School of Computer Science, University of Nottingham, Nottingham, UK

2Google Inc., 1600 Amphitheatre Pkwy, Mountain View, CA 94043

DOI: 10.1002/widm.1078

This review paper aims at shedding light on the above-mentioned questions, following a road map that starts by describing what large scale means, and why large is a challenge and opportunity for GBML methods. Afterwards, we show how to tackle this opportunity through the use/combination of multiple facets: parallel models, representations able to cope with large dimensionality spaces, advanced exploration mechanisms or hardware implementations, among others. Each class of components that contributes to large-scale data mining tasks is described in detail. Furthermore, we show a few examples of how we can visualize the results of GBML systems. Then, we describe how we can model the scalability of the components of GBML systems via a better engineering that will make embracing large datasets routine. Next, we illustrate how all these ideas fit by reviewing a case study of GBML applied to a large-scale problem. Finally, we summarize the broad variety of mechanisms described throughout the manuscript, indicating how each of them contributes to the functioning of the GBML methods and discussing potential risks/benefits of combining them.


WHAT IS LARGE SCALE?

This section aims at describing the broad spectrum of problems that can potentially be considered 'large scale', not in terms of their respective domains, but in terms of the characteristics of the problem. The most relevant of these dimensions of large scale is, of course, the number of records. For instance, GenBank22 (the genetic sequences database of the US National Institutes of Health) as of October 2010 contains more than 125 million gene sequences (82 million in 2008) and more than 118 billion nucleotides (85 billion in 2008). The Large Hadron Collider is forecasted to generate up to 700 megabytes of data per second.23 Next-generation sequencing technologies can sequence up to 1 billion base pairs in a single day.24

Moreover, large numbers of records do not only come from the natural sciences. The Netflix prize25 has probably been one of the most famous data mining challenges of recent years (and a predecessor of what later became known as crowdsourcing). Netflix is a video rental company which in 2006 released an anonymized version of their movie ratings database of over 100 million ratings from 480,000 customers and 18,000 movies. They challenged the public to provide a recommendation algorithm that was 10% better than their own proprietary method. In 2007, the best participant method was 8.43% better, 9.44% in 2008 and, finally, in 2009 a team managed to break the 10% barrier.26

However, large sets of records are not the only facet of the term large scale. Another challenging aspect is the number of variables of the problem. As an example, microarray analysis27 (one of the most widespread technologies of molecular biology research) usually generates records with tens of thousands of variables. Microarrays generally have an added difficulty: generating each record is very expensive, thus datasets tend to have very few samples, which creates a very high imbalance between records and variables. In this scenario, many machine learning methods tend to overfit the data. Thus, very sophisticated statistical validation mechanisms are employed to assure that the results of the data analysis process are statistically sound.28 Other natural sciences domains, such as protein structure prediction (PSP),29 routinely generate datasets with hundreds (if not more) of variables, in this case coupled with a number of records at least in the hundred thousand range.

The (large) number of classes is another source of difficulty for machine learning methods. There are well-known information retrieval/text mining datasets with an extremely high number of classes,30 datasets where the classes are not just a set of labels but have a hierarchical structure,31 or datasets where the frequency of certain classes is very low, that is, that present class imbalance. This phenomenon has been studied for a long time in the data mining community,32 and probably on its own cannot be considered as one of the facets of difficulty in large-scale data mining. However, as datasets grow and tend to become more heterogeneous, the class imbalance problem becomes a recurrent issue, and it cannot be ignored. The GBML field has also seen some specific and principled work toward tackling the class imbalance problem.16

Given the different facets that large-scale datasets can take, it is necessary to study their characteristics in a principled and quantitative way. Complexity metrics33 have been proposed that assess the datasets in many different ways, such as their sparseness, the shape of their decision frontier, dimensionality, and so on. Furthermore, these metrics have been applied to study the domains of competence of GBML methods.34,35 However, their widespread application to large-scale domains is still limited, given that the calculation of most of these metrics does not scale well for large numbers of instances or attributes.

When does a dataset become large scale? This is a very difficult question to address, because it depends on the type of learning and the resources available at each organization for performing data analysis. The simplest definition, albeit vague, is that a dataset becomes large scale when its size starts becoming an issue that must be taken into account explicitly in the data mining process.36 If an algorithm just needs to perform a single pass through a training set, it may easily process hundreds of millions of records, but if the training set needs to be used again and again, the algorithm may start struggling with just a few tens of thousands of instances. The same situation happens with the number of attributes. Depending on the number of parameters involved in the knowledge representation, the tractability frontier can be moved around.

THE ADVANTAGES AND CHALLENGES OF GBML FOR LARGE-SCALE DATA MINING

As said in the Introduction, GBML methods are perfect candidates for large-scale data mining for many reasons, but the main one comes from their core working principle: evolution. Evolutionary algorithms are innately parallel because they are population based, and this parallelization capacity has been exploited for many years to improve their efficiency, in what is known as parallel genetic algorithms37 or, in a more general perspective, parallel metaheuristics.38 The different stages of an evolutionary algorithm require more or less synchronization, going from no synchronization and hence maximum potential parallelism (the evaluation of the population) to other stages that are highly sequential (e.g., the selection algorithm). Considering these constraints, several paradigms of parallelism exist with different strategies for distributing the computation effort of the evolutionary algorithm, each attempting to boost its efficiency in a different way: the master–slave paradigm (just parallelizing the evaluation of the individuals), the island model (with distributed populations and carefully defined policies for the migration of individuals between them), or cellular evolutionary algorithms, where the population has a specific topology that defines the neighbors of each individual and hence the potential channels of communication. The choice of paradigm depends on many different factors, and theoretical models exist37 to inform this decision. The section of the paper on parallel models will describe several specific examples of the application of these paradigms to GBML.

However, as the focus of this paper is not just large-scale problems but large-scale data mining problems, we have to be aware of some challenges as well, and the main one is dealing with large volumes of data. Generally, EC methods are purely computational, that is, a very expensive fitness function needs to be run millions (or more) of times, but this function does not need to access large volumes of information. Our context is totally different, and without dealing carefully with data management, all the potential parallelism of EC can be in jeopardy.

The choices and strategies for data management totally depend on the chosen hardware platform. The strategy used for experiments run on a single CPU or a traditional (small) cluster (e.g., HDF539) probably will not scale to data-intensive computing environments with hundreds or thousands of computing nodes, where other alternative methods (for instance, the Hadoop Distributed File System40 or MongoDB41) would be needed, and neither of them would be suitable for general-purpose graphics processing units (GPGPUs). The aim of this paper is to focus on the data mining methods and not on the data management itself, thus we will not discuss these options in detail, with the exception of the GPGPU case, as handling the data and mining it are so intimately bound that they cannot be described separately. What we will discuss in depth in several sections is what to do once the data has been parsed and loaded.

KALEIDOSCOPIC LARGE-SCALE DATA MINING

The aim of this section is to present the four different types of techniques that have been used to adapt GBML systems for large-scale data mining tasks. As with any taxonomy, we are probably not going to be able to cover every possible scenario, but we believe that this classification covers most published works of GBML for large-scale datasets. Moreover, these four categories are not mutually exclusive. There are many cases where the actual method is a mix of a few of these types of scaling-up techniques, as we will show later in the manuscript. We would also like to clarify that the techniques included in this classification are (mostly) specifically tailored for GBML methods. There are many efficiency enhancement techniques for EC methods reported in the literature,42 which we will not cover in this paper.

The four types of methods are

1. Software solutions. In this category, we include methods that modify the data mining methods to improve their efficiency without special/parallel hardware.

2. Hardware acceleration techniques. This category includes methods that employ customized hardware such as vectorial instructions of modern standard processors [MultiMedia eXtension (MMX), Streaming SIMD Extensions (SSE), etc.] or specialized computing hardware such as GPGPUs.

3. Parallelization models. This category includes all classic methods (in opposition to the next one) to use multiple computing nodes to perform a single data mining experiment.

4. Data-intensive computing. This category includes parallel scenarios that go beyond traditional high-performance computing clusters, where the number of computing cores goes beyond tens of thousands and the nodes are heterogeneous, distributed, and sometimes fail. This scenario has become very popular in the last few years thanks to the appearance of actors such as Google with their MapReduce data analysis methodology43 and its open-source implementation, Hadoop.44

The following sections describe each of these four categories of scaling-up methods.


SOFTWARE SOLUTIONS

This section covers methods that modify the inner workings of the data mining algorithm to improve its efficiency without relying on parallel implementations or special hardware. Because of the broad diversity of such mechanisms, this section will be split into four different categories of methods:

1. Windowing mechanisms, where a subset of examples from the training set is used for fitness computations; how the subset(s) are created and how often they are changed varies across methods.

2. Exploiting regularities in the data, to precompute classifications or to avoid irrelevant computational effort.

3. Hybrid methods that substitute or extend the traditional exploration operators of EC methods with smarter alternatives.

4. Fitness surrogates that generate a cheaper estimation of the fitness function.

Windowing Mechanisms

By windowing algorithms we understand methods that, instead of evaluating candidate solutions on the whole training set, use only a subset of examples. This type of mechanism has been used for many years across the whole machine learning community45–48 as well as in the specific context of GBML. Alex Freitas49 proposed a taxonomy for this kind of methods depending on how often the system changes the chosen subset of examples:

• Individual-wise, changing the subset of the training examples used for each fitness computation.

• Run-wise, selecting a static subset of examples for the whole evolutionary process.

• Generation-wise, changing the subset of the training examples used at each generation of the evolutionary process.

For a genetic algorithm (GA) using a flat population, it is quite unfair to change the fitness function among individuals in the same generation. When the population has specific neighborhood relations (e.g., a cellular GA), this policy has shown some successful results.50 Selecting a static subset of examples before the learning process is what is known as prototype selection, and there are many successful GBML methods that perform such a process.51,52 In the rest of this subsection, we will focus on the third option, namely the generation-wise methods.

How can this generation-wise training examples subset be selected? Most systems reported in the literature perform a pure random sampling process to select the subset. One example of this approach, applied to attribute selection (but able to be extrapolated to rule induction), is Ref 53. Other approaches refine the random sampling in several ways. Difficult instances (misclassified ones) can be used more frequently than correctly classified instances,54 or 'aged' instances (instances not used for some time) can have a higher probability of being selected.54

Also, Ref 53 discusses the issue of choosing the final solution of the learning process from all the individuals in the final population. If the sample used is small, maybe the best individual of the last iteration is not representative enough of the whole training set. The authors propose two approaches: the first one is to use the whole set in the last iteration. The other approach is to perform a kind of voting process among all individuals in the population.

ILAS (Incremental Learning with Alternating Strata)55 is a windowing method that was originally proposed for the GAssist Pittsburgh GBML method56 but has later been adapted to other learning systems.3 This mechanism first divides the training set into a set of strata. Each stratum has the same class distribution as the overall training set, in an effort to make it a good representative of the overall dataset. Afterward, each iteration of the GA will use each of these strata in a round-robin fashion. When using such a mechanism the practitioner would expect to obtain speedups equivalent to the number of strata, but the reality is that many times, in the specific case of GAssist (with variable-length individuals), the obtained speedups were better than the number of strata. The reason was that with a moderate number of strata the windowing mechanism managed to introduce generalization pressure that created more compact (i.e., short) individuals, which were faster to evaluate56 and also obtained in most cases better test accuracy. ILAS has been successfully applied in many real-world problems, both small20,56 and large.3,8,57–59
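To make the stratification concrete, here is a minimal sketch in plain Python (not the original GAssist/BioHEL code; all function and parameter names are illustrative) of building class-balanced strata and selecting one of them per GA iteration in round-robin fashion:

```python
import numpy as np

# Minimal sketch (not the original GAssist/BioHEL code) of ILAS-style
# stratification: split the training data into num_strata class-balanced
# strata, then use one stratum per GA iteration in round-robin fashion.
def build_strata(y, num_strata, seed=0):
    rng = np.random.default_rng(seed)
    strata = [[] for _ in range(num_strata)]
    for cls in np.unique(y):
        members = rng.permutation(np.where(y == cls)[0])
        for i, instance in enumerate(members):   # deal class members like cards
            strata[i % num_strata].append(instance)
    return [np.array(s) for s in strata]

def stratum_for_iteration(strata, iteration):
    # Round-robin selection: iteration t computes fitness only on stratum t mod s.
    return strata[iteration % len(strata)]

y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
strata = build_strata(y, num_strata=2)
print(stratum_for_iteration(strata, 0), stratum_for_iteration(strata, 1))
```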

RSS-DSS (random subset selection–dynamic subset selection)60 is a much more complex windowing method applied on top of a steady-state genetic programming classifier. The windowing process is split into two hierarchically organized stages. The first stage, RSS, is similar to ILAS: it partitions the whole training set into a series of B blocks. The authors determine the number of blocks so each of them can fit in memory. Within each block, a second windowing process called DSS is performed. DSS dynamically samples with replacement examples from its corresponding RSS block given two constraints: the age of an example (the last time it was used) and its difficulty (based on training error). The examples within a DSS sample are used for a fixed number of GP iterations, which is set in proportion to the population size, and the number of DSS samples generated per block is variable (but with an upper limit). Blocks that are easier to learn (based on the fitness of the best individual) use a smaller number of DSS samples. Finally, the blocks are picked at random with uniform probability, instead of the round-robin method of ILAS. This method was evaluated on a half a million examples dataset, the popular KDDCUP'99 intrusion detection dataset held at the UCI repository.61

FIGURE 1 | Rule generated from a Bioinformatics dataset with 300 attributes.

Windowing mechanisms in GBML have not been applied only to classification tasks, but also to other types of problems such as prototype selection51 or subgroup discovery.62

A crucial issue about most windowing mechanisms is how much we can increase the number of strata without constraining the success of the learning process. Intuitively, we can think that the number of strata can be increased while each stratum is still a good representative of the overall training set. Initial steps toward answering this question have been taken in the particular context of ILAS. In Ref 55, a simple theoretical model for the success of ILAS was proposed, based on the assumption that the number of rules in the solution is known and that each rule covers a similar number of examples. Under those assumptions, a closed-form equation can be estimated for the probability of success of the stratification, defined as the probability that each stratum has at least one instance covered by each rule in the solution.

P(success | s) = e^{-rs·e^{-pD/s}}    (1)

where s is the number of strata, r is the number of rules, D is the training set size, and p is the probability that a random problem instance represents a particular rule.
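As a quick numerical reading of Eq. (1), the snippet below evaluates the model for some made-up values of D, r, and p, showing how the estimated probability of success degrades once the strata become too small:

```python
import math

# Illustrative evaluation of Eq. (1); the numbers below are invented for the
# example, they do not come from any dataset in the paper.
def ilas_success_probability(s, r, D, p):
    return math.exp(-r * s * math.exp(-p * D / s))

D, r = 10_000, 50     # training set size and number of rules (assumed)
p = 1.0 / r           # each rule covers a similar share of the data
for s in (5, 10, 25, 50):
    print(s, ilas_success_probability(s, r, D, p))
# prints roughly 1.0, 1.0, 0.66 and 0.0: beyond some point the strata become
# too small to keep every rule represented.
```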

Exploiting Regularities in the Data

This subsection covers methods that manage to alleviate a portion of the run-time of the system by precomputing some parts of the evaluation process or by skipping computations in parts of the data that are identified as irrelevant. These methods exploit the fact that in most problems the data does not uniformly cover the whole search space, or that the individual subparts of a solution (e.g., a rule set, a decision tree, and so on) do not consider all the variables of the problem at the same time.

Some methods63 are able to precompute the instances that match a rule for problems with discrete/discretized attributes by aggregating together examples that present the same value for the attributes of the problem. The result of this process is a tree structure that is able to retrieve very efficiently all the instances of the training set that have a specific value of a specific attribute. Afterward, the match process of a given rule is just the intersection of the sets generated for each attribute of the problem. The retrieval process is fast because the tree is rather sparse, as the data does not cover the search space uniformly. Other methods64 use a very similar approach but for the dual problem: given an instance, identify the set of rules that match it. The latter approach is more challenging, as a population of rules changes more frequently than a training set (unless we are in a stream data mining context), thus the challenge is not only to construct the tree efficiently, but also to update it.
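A minimal flat-dictionary version of this idea can be sketched as follows (the cited work organizes the index as a tree; the data and names here are illustrative only):

```python
from collections import defaultdict

# Sketch of precomputed matching for discrete attributes: index which training
# instances hold each (attribute, value) pair, then match a rule by
# intersecting the sets of its conditions.
def build_index(instances):
    index = defaultdict(set)
    for row, attributes in enumerate(instances):
        for att, value in enumerate(attributes):
            index[(att, value)].add(row)
    return index

def matching_rows(rule_conditions, index, all_rows):
    # rule_conditions: {attribute: set of allowed values}; unmentioned
    # attributes are irrelevant and do not restrict the match.
    rows = set(all_rows)
    for att, allowed in rule_conditions.items():
        covered = set().union(*(index[(att, v)] for v in allowed))
        rows &= covered
    return rows

data = [(0, 1, 2), (0, 0, 2), (1, 1, 0)]
idx = build_index(data)
print(matching_rows({0: {0}, 2: {2}}, idx, range(len(data))))   # {0, 1}
```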

A second family of methods exploits another regularity in the data, which is that not all attributes are equally important. To exemplify this issue, Figure 1 shows a rule generated from a Bioinformatics dataset3 with 300 attributes. Only 9 attributes out of the 300 were present in this rule, hence the rest were irrelevant. In most systems, the match process for such a rule would look like the algorithm shown in Figure 2. This algorithm would essentially waste 291 out of the 300 iterations of the ForEach loop because those attributes are not relevant. Thus, what can be done to alleviate this difficulty?

FIGURE 2 | Pseudocode of the match algorithm for a hyperrectangle rule representation.
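For reference, a naive match loop along the lines of the pseudocode of Figure 2 could be sketched as below (hypothetical rule encoding; not the code of any particular system):

```python
# Sketch of the naive match loop for a hyperrectangle rule: every attribute of
# the domain is visited, even though only a handful of them actually constrain
# the rule.
def matches_naive(rule, instance):
    # rule: per-attribute (lower, upper, relevant) triplets
    for (lower, upper, relevant), value in zip(rule, instance):
        if relevant and not (lower <= value <= upper):
            return False        # early exit, but only on relevant attributes
    return True                 # irrelevant attributes were still iterated over
```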

Different alternatives have been proposed in the literature. Some methods65 opt for reordering the attributes in the problem from specific to general, based on statistics extracted from the current population of rules. The rationale for this decision is that the most specific attributes are the most likely to make the match function return false. Hence, if these attributes are evaluated first, it is quite probable that the match algorithm performs just a few iterations of the ForEach loop in most situations. Still, in the cases where a whole rule matches, the irrelevant attributes need to be evaluated.

FIGURE 3 | Example of a rule in the attribute list knowledge representation with four expressed attributes: 1, 3, 4, and 7. ln = lower bound of attribute n, un = upper bound of attribute n, c1 = Class 1 of the domain. © Memetic Computing by Springer-Verlag Berlin/Heidelberg. Reproduced with permission of Springer-Verlag Berlin/Heidelberg in the format Journal via Copyright Clearance Center.

Therefore, could we simply get rid of these irrelevant attributes? This is the aim of the attribute list knowledge representation (ALKR).3 In this representation, each rule only holds information for the attributes of the problem that it considers to be relevant. The way of achieving this consists of (1) updating the representation and (2) adding two new exploration operators. The representation for a rule is exemplified in Figure 3. The most important element of the representation (and the one that gives it its name) is the vector that contains the attributes considered to be relevant for this rule. The rule will only hold intervals for these attributes, and the crossover, mutation, and match functions will only use them and no other attributes. A key question remains: how to identify the correct attributes for each rule. This is the job of two extra operators, called generalize and specialize, which are in charge of updating the list by either removing or adding randomly chosen attributes, respectively. As the list of attributes is variable-length, it may be subject to the bloat effect.66 To control the bloat, it is necessary to use the representation within a system that employs a fitness function that rewards compact (i.e., using few attributes) rules, such as the BioHEL system,3 which will be described in a bit more detail in Real Examples. ALKR uses a 'virtual one-point crossover operator' that has been adapted to cope with individuals that may contain different sets of attributes. This is done by simply considering that the missing attributes were momentarily present and applying a simple one-point crossover to the virtual complete rules. The match function remains unchanged, but it iterates over the list of attributes of the rule, not over all the attributes of the domain; thus, the number of wasted iterations of the match function is kept to a minimum.
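A rough sketch of an ALKR-style rule is shown below. It is an illustration of the idea rather than BioHEL's actual implementation, and the class and operator names are ours:

```python
import random

# Illustrative ALKR-style rule: intervals are stored only for the attributes
# the rule considers relevant, and two extra operators add or drop attributes
# from that list. Not BioHEL's actual code.
class ALKRRule:
    def __init__(self, intervals, predicted_class):
        self.intervals = dict(intervals)        # expressed att index -> (lo, hi)
        self.predicted_class = predicted_class

    def matches(self, instance):
        # Iterate only over the expressed attributes, not the whole domain.
        return all(lo <= instance[a] <= hi
                   for a, (lo, hi) in self.intervals.items())

    def generalize(self):
        # Drop a randomly chosen expressed attribute.
        if self.intervals:
            del self.intervals[random.choice(list(self.intervals))]

    def specialize(self, num_attributes, domain=(0.0, 1.0)):
        # Add a randomly chosen, currently unexpressed attribute.
        unused = [a for a in range(num_attributes) if a not in self.intervals]
        if unused:
            a = random.choice(unused)
            lo = random.uniform(*domain)
            self.intervals[a] = (lo, random.uniform(lo, domain[1]))

rule = ALKRRule({1: (0.2, 0.6), 3: (0.1, 0.9)}, predicted_class=1)
print(rule.matches([0.5, 0.3, 0.99, 0.5]))   # True: only attributes 1 and 3 checked
```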

The representation was evaluated on both small and large datasets,3 showing not only that it was more efficient than traditional representations but also that it learned better, obtaining higher test accuracy in most datasets. The reason for this is that, given that each rule only contains relevant attributes, the exploration operators are more focused and hence the learning process becomes more efficient.

Hybrid Methods

Large-scale datasets very often present an extremely vast search space. In these cases, the main challenge becomes to explore this vast space efficiently using smart/directed exploration mechanisms, either in combination with or replacing the traditional operators of EC methods. This trend is not specific to GBML; it has affected the overall EC field. The two most popular classes of these methods probably are estimation of distribution algorithms (EDAs)67 and memetic algorithms (MAs).68 The former methods estimate a model of the structure of the problem and then generate offspring according to this model, whereas the latter perform a combination of local search (LS) and global search.

There are many examples of the application of both classes of advanced exploration mechanisms to GBML methods. Several flavors of EDA have been hybridized with both Pittsburgh and Michigan learning classifier system (LCS) methods. For instance, the eXtended Classifier System (XCS) Michigan LCS was integrated9 with two different types of EDA: the extended compact genetic algorithm (eCGA)69 and the Bayesian Optimization Algorithm.70 These two methods derive global structural information from the best rules in the population, which later is used to inform the crossover operator when generating new offspring by preventing crossover from breaking good building blocks (BBs) in the individuals it is crossing.

The compact classifier system (CCS)71 is a recent integration of EDAs within the framework of a Pittsburgh LCS, using the compact genetic algorithm (CGA).72 CGA is run iteratively to generate different rules. Different perturbations of the initial solution of CGA are needed to generate different rules, and the individuals in CCS store a set of such perturbations. The objective of CCS is to determine the minimum set of rules that creates a maximally general solution. Later, Llorà et al.73 proposed χeCCS, an extension of CCS. χeCCS evolves a population of rules using the model building and recombination mechanisms of eCGA. A niching method using restricted tournament selection74 was employed to guarantee that the population learns (all at once) all the rules needed to solve the domain. Experiments showed that this method scales quadratically in relation to the problem size for the Multiplexer family of problems. EDAs have also been used for other machine learning tasks such as feature or prototype selection.75,76

Memetic-like exploration methods have also been employed in GBML systems. One such example was the Memetic Pittsburgh Learning Classifier System (MPLCS).8 This method extracted statistics from the evaluation process of the GAssist56 Pittsburgh GBML system, such as which parts of which rules were performing well or not, or what part of the input space was covered by each rule. GAssist was then extended with different types of LS operators that performed directed search based on these statistics. Two kinds of LS operators were proposed: (1) rule-wise operators that edited individual rules to either specialize them to remove misclassifications or generalize them to add more positive matches, and (2) a rule set-wise operator that, given a set of N parents (where generally N > 2), constructed a single offspring by taking the rules of all the parents and choosing the best subset of them and their correct order (as GAssist generates ordered rule sets), with the objective of generating a rule set with the highest possible training accuracy and as few rules as possible. This type of rule set-wise exploration operator has also been used in other contexts, such as multistep problems77 or genetic fuzzy systems,78 but always using a traditional crossover framework of two parents–two offspring. Moreover, memetic operators have also been employed in Michigan LCS to explore not just the predicate of a rule but also its performance parameters.11,79 Also, memetic search has been used for machine learning tasks other than classification, such as in prototype selection.51
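As an illustration of the rule set-wise idea, the sketch below pools the rules of several parents and greedily assembles one ordered child rule set by training accuracy. The greedy policy and the default-class handling are our own simplifications, not the exact MPLCS operator; it assumes rule objects with a matches(x) method and a predicted_class field (as in the ALKR sketch above):

```python
# Hedged sketch of a rule set-wise local search step: pool the rules of N
# parent rule sets and greedily build one ordered child (decision list) that
# maximizes training accuracy. Illustrative simplification of the idea only.
def decision_list_accuracy(rules, default_class, X, y):
    correct = 0
    for x, target in zip(X, y):
        pred = next((r.predicted_class for r in rules if r.matches(x)),
                    default_class)
        correct += (pred == target)
    return correct / len(y)

def rule_setwise_operator(parents, default_class, X, y):
    pool = [r for parent in parents for r in parent]      # all parents' rules
    child, best = [], decision_list_accuracy([], default_class, X, y)
    improved = True
    while improved and pool:
        improved = False
        for r in list(pool):
            acc = decision_list_accuracy(child + [r], default_class, X, y)
            if acc > best:                  # keep the rule only if it helps
                best, improved = acc, True
                child.append(r)
                pool.remove(r)
    return child
```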

Fitness Surrogates

Fitness surrogates have been extensively used in the EC context for efficiency enhancement purposes. These methods construct a cheap estimation of the fitness function, and use it instead of the original, exact fitness formula whenever possible. Generally, the success of these methods comes when the proper regime of combination between the surrogate and the original fitness formula is identified. Theory exists to determine the optimal regime of combination of both fitness functions based on the bias, variance, and cost of the original fitness and the surrogate.42,80

Recent works have applied the principle of fitness surrogates to GBML.81,82 Both works use a methodology recently proposed83 for the construction of fitness surrogates based on the model-building process of EDAs.67 EDAs, as mentioned above, generate models of the structure of the problem to bias the exploration process. Some EDAs estimate the BBs of the problem, each block consisting of a set of the problem's attributes that interact with each other. Afterward, the fitness contribution of each BB is estimated from the individuals in the population using regression, and the surrogate function for a given individual is constructed as the addition of the estimated fitness contributions of the BBs it contains. In Ref 81, two different model building methods were evaluated for GBML purposes: eCGA,69 which generates a set of non-overlapped BBs; and DSMGA,84 which generates a list of (possibly) overlapped BBs. The latter method was able to generate more reliable fitness surrogates than the former for problems where the same variable was involved in more than one BB, which is a very usual situation in the context of machine learning.18 The scalability of this methodology in relation to the size of the BBs was the focus of Ref 82, where fitness surrogates for various classes of hierarchical decomposable problems with varying sizes were evaluated.
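The following sketch illustrates the surrogate construction for the simple case of known, non-overlapping BBs over binary variables (as eCGA would provide); the regression setup and the toy fitness function are our own illustrative choices, not the cited methodology verbatim:

```python
import numpy as np

# Hedged sketch of a building-block-based fitness surrogate: estimate each BB
# configuration's fitness contribution by least squares over the evaluated
# population, then predict fitness as the sum of the contributions.
def build_surrogate(population, fitness, building_blocks, cardinality=2):
    def features(individual):
        cols = []
        for bb in building_blocks:          # one indicator per BB configuration
            config = sum(individual[v] * cardinality**i for i, v in enumerate(bb))
            one_hot = np.zeros(cardinality ** len(bb))
            one_hot[config] = 1.0
            cols.append(one_hot)
        return np.concatenate(cols)

    X = np.array([features(ind) for ind in population])
    w, *_ = np.linalg.lstsq(X, np.asarray(fitness), rcond=None)
    return lambda ind: float(features(ind) @ w)   # cheap surrogate evaluation

# Toy usage: 6 binary variables, BBs {0,1,2} and {3,4,5}, additive true fitness.
rng = np.random.default_rng(1)
pop = rng.integers(0, 2, size=(200, 6))
true_f = pop[:, :3].sum(axis=1) * 2.0 + pop[:, 3:].prod(axis=1) * 5.0
surrogate = build_surrogate(pop, true_f, building_blocks=[(0, 1, 2), (3, 4, 5)])
print(round(surrogate(np.array([1, 1, 1, 1, 1, 1])), 2))   # close to 11.0
```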

HARDWARE ACCELERATION TECHNIQUES

This section covers GBML methods that use specialized hardware and includes two parts: techniques using vectorial instructions and techniques using GPGPUs.


Usage of Vectorial Instructions to Boost the Matching Process

One of the main characteristics of the early supercomputers such as the Cray-1 was the use of vectorial processing units, where the same operation was applied in parallel and with high efficiency to a whole array of variables, in what is known as the single-instruction multiple-data (SIMD) paradigm. Despite its efficiency, this paradigm was also complex to implement, and its use faded in time in favor of simpler architectures that could run at faster clock speeds. SIMD reappeared in the mid 1990s, not in the high-performance computing market but rather in the consumer market, when Intel introduced the MMX instruction set for their Pentium processors, which allowed each of the registers in the floating point unit of the processor to pack several integers (e.g., 2 × 32-bit integers, 4 × 16-bit integers or 8 × 8-bit integers) and perform SIMD operations on them. Later, this capacity was expanded in types of operations and sizes of the registers with Intel's own SSE (Streaming SIMD Extensions), while other hardware vendors proposed their own alternative instruction sets (e.g., PowerPC's AltiVec or AMD's 3DNow!).

As shown in Software Solutions, the process of matching a rule with a given instance generally consists of iteratively checking whether each attribute in the problem matches, and stopping the process when any such test fails. Using vectorial instructions this loop can be unrolled, and several attribute tests can be performed in parallel. An example of this technique was presented in Ref 85 for problems with binary attributes. That approach was based on the use of vectorial instructions coupled with a specific encoding for rules and instances. Two bits were employed to represent both the condition associated to a certain attribute in the rule and the value of the attribute in the instance, as follows:

Predicate            Encoding    Value    Encoding
Att is 0             10          0        10
Att is 1             01          1        01
Att is irrelevant    00/11       –        –

Given this encoding, checking (for a given attribute) if a condition matches an instance is as simple as computing the logical AND operation between the encodings for condition and value and then comparing the result with the original instance attribute value. If both values are equal, then the condition has been matched. However, a rule contains multiple conditions that need to be checked. Current CPUs provide logical operations for 32-bit unsigned integers. This has two advantages: (1) up to 16 conditions can be packed per integer, not wasting any memory, and (2) performing the matching process described above at the unsigned-integer level provides the parallel matching of 16 conditions at once. The use of 128-bit vectorial instructions, which can apply the same operation in parallel to four 32-bit words, allows the extension of the exact same mechanism to match up to 64 conditions at once. Figure 4 presents a sample implementation of both the 16-condition CPU matching and the 64-condition SSE matching. This vectorial instructions-based matching algorithm was later extended to deal with continuous attributes.86

FIGURE 4 | Parallel match algorithm using binary encoding (top) and Streaming SIMD Extensions (SSE) vectorial operations (bottom). Reprinted with permission from Ref 85. Copyright 2006 Association for Computing Machinery, Inc.
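In the same spirit as the code in Figure 4, a minimal Python sketch of the encoding and the AND-and-compare test is given below (plain Python integers stand in for the CPU/SSE registers; we use the 11 code for irrelevant attributes so that the test described above holds):

```python
# Sketch (not the authors' C/SSE code) of the 2-bit encoding and the
# AND-and-compare match: one logical AND plus one comparison tests all the
# conditions packed in a word at once.
PRED = {"0": 0b10, "1": 0b01, "#": 0b11}   # '#' = irrelevant attribute
VAL  = {0: 0b10, 1: 0b01}

def pack_rule(conditions):
    """conditions: list of '0', '1' or '#', one per attribute."""
    word = 0
    for i, c in enumerate(conditions):
        word |= PRED[c] << (2 * i)
    return word

def pack_instance(values):
    """values: list of 0/1 attribute values (same length as the rule)."""
    word = 0
    for i, v in enumerate(values):
        word |= VAL[v] << (2 * i)
    return word

def matches(rule_word, inst_word):
    # A 32-bit register would hold 16 such conditions; an SSE register, 64.
    return (rule_word & inst_word) == inst_word

rule = pack_rule(["1", "#", "0", "#"])               # att0 must be 1, att2 must be 0
print(matches(rule, pack_instance([1, 0, 0, 1])))    # True
print(matches(rule, pack_instance([1, 0, 1, 1])))    # False (att2 is 1)
```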

High-Performance Fitness Computation Using GPGPUs

GPGPUs are a relatively new technology, but one that has created a huge impact in all fields of scientific computation because it has enabled researchers to access very large amounts of raw computational power for a small fraction of what an equivalent amount of power costs in a traditional high-performance computing cluster. This is achieved by exploiting, from a different angle, hardware originally designed for the video game industry. Within a GPGPU, a very high number of threads, all executing the same code, can be run in parallel, each of them operating on different data. Thus, it changes the classic SIMD paradigm into a single-program multiple-data one. However, the communication between threads and from the threads to the main memory of the GPGPU is far from trivial, and in many cases it requires a significant change of the programming model to find the perfect equilibrium between usage of raw computing power and access to memory.

Probably the most popular of the GPGPU platforms is CUDA (compute unified device architecture),87 a parallel computing architecture developed by NVIDIA. Threads in CUDA are organized in geometrical lattices at two different levels. First, threads are aggregated in blocks, and then blocks are aggregated to compose the overall data distribution. Threads within a block can communicate between themselves at high speed through the usage of shared memory. Threads can also access other types of memory. Global memory is the main memory of the GPU. It can be accessed by all threads, but it is slow, especially if all threads are trying to read/write at the same time, unless this access is coalesced (consecutive threads in a block accessing consecutive positions in memory). Constant memory can hold data that does not change throughout the whole execution of the program and can be accessed at high speed, but it is very limited in size. Local memory is used to store data specific to each thread. This memory is not used explicitly by the programmer but rather by the CUDA compiler. Texture memory is a read-only (from the thread's point of view) type of memory that is optimized for memory access patterns following a two-dimensional spatial locality. The proper use of each type of memory is crucial for obtaining good performance out of CUDA.

GPGPUs have been widely used in recent years to boost the performance of all kinds of EC methods88,89 as well as in the specific context of GBML.90,91 Moreover, GPGPUs have been integrated into many different machine learning methods.92–95

In this paper, we will describe one specific example for illustrative purposes: the CUDA-based fitness computation96 of the BioHEL3 GBML system.

BioHEL is a rule-based GBML method that applies the iterative rule learning (IRL) paradigm.97 Its fitness computations are the result of matching each rule in its population against a training set. Moreover, BioHEL also contains other efficiency enhancement methods, such as the ILAS windowing scheme55 or the ALKR rule representation,3 described in the previous sections. The design of the GPGPU-based fitness evaluation was made so it could be integrated with these other methods and their respective speedups could become cumulative.

Specifically, the most costly part of BioHEL's fitness formula is computing three metrics:

1. The number of instances in the training set that match the condition of the rule.

2. The number of instances in the training set that match the class of the rule.

3. The number of instances in the training set that match the class and the condition at the same time.

In most situations, it is not really necessary to know which specific instances were matched by each rule (unless memetic operators, such as those described earlier in this paper, are used). This is very important in this situation, as extracting data from a GPGPU is very costly (much more costly than sending data there). Thus, if the specific matches are not needed, the GPGPU can count the matches itself (doing what is generally known as reducing the data) and return a data structure proportional to the number of rules in the population, instead of being proportional to rules × instances.
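A CPU reference of the three counts (not the CUDA kernels themselves) can be written as follows, assuming hyperrectangle rules over numeric attributes; on the GPU each per-instance match bit is computed by one thread and then reduced, so only these three integers travel back to the host:

```python
import numpy as np

# CPU reference for the three per-rule counts that the GPU computes by
# match + reduction. Rule and data values below are invented for the example.
def rule_counts(lower, upper, rule_class, X, y):
    matched = np.all((X >= lower) & (X <= upper), axis=1)   # condition matches
    same_class = (y == rule_class)                          # class matches
    return (int(matched.sum()), int(same_class.sum()),
            int((matched & same_class).sum()))

X = np.array([[0.2, 0.7], [0.9, 0.1], [0.3, 0.5]])
y = np.array([1, 0, 1])
print(rule_counts(np.array([0.0, 0.4]), np.array([0.5, 0.8]), 1, X, y))  # (2, 2, 2)
```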

The GPGPU-based solution worked in two stages (called kernels):

1. Kernel 1: Match and initial reduction. Each thread in the GPGPU is in charge of performing an individual match operation involving one rule and one instance. Afterward, it computes three bits corresponding to the three metrics described above. Then the reduction (counting) stage starts. This process can be performed at high speed if the threads are able to communicate between themselves through the shared memory.98 The drawback is that shared memory is limited to the threads that belong to the same block (up to 1024 threads in the latest versions of CUDA). Thus, to maximize the power of this initial reduction, it is necessary to design the grid (how the blocks in an experiment are distributed) so all the threads of a block operate on the same rule. At the end of the kernel, the rule's performance counts are saved to the device's global memory. This has a large impact on the kernel's performance; however, this impact is greatly reduced thanks to the initial reduction.

2. Kernel 2: Full reduction. For most problems (those with more than 1024 examples), it is necessary to perform a second reduction stage (or, for problems with more than 2^20 examples, a third stage, and so on) to get the final values for these three metrics. This is the aim of the second kernel. This kernel will be applied iteratively to reduce the data for each of the three metrics, one at a time, as this is more efficient than a single kernel reducing all three metrics at once.

Moreover, the training set had to be flattened (into simple data structures) and uploaded to the device's global memory before launching the kernels, and the final result of the reduction downloaded from the device into the computer's memory so BioHEL could continue its normal working cycle. The overall process is represented in Figure 5.

FIGURE 5 | Diagram of the compute unified device architecture based fitness computation pipeline of the BioHEL genetics-based machine learning system.

The speedups that can be obtained in CUDA greatly depend on being able to keep all the computational resources (threads) in the GPU occupied most of the time. This is a challenge, as many factors can contribute to stalling the processors in the GPU. We have already mentioned the most important of these factors: memory access. If all threads try to write to main memory at once there is a bottleneck, hence the need to use the shared memory and the reduction methods efficiently to minimize the volume of global memory writes. Moreover, to read efficiently from the main memory, consecutive threads (in CUDA's grid) should be reading consecutive positions of memory, in what is called coalesced memory access. Another important aspect is the flow of the code being run in the threads. If an algorithm is essentially serial, it will be difficult to obtain any kind of speedup. This issue especially affects LS methods, such as the memetic operators mentioned above in Hybrid Methods. Moreover, even if an algorithm can be parallelized, if the code has many branches (e.g., ifs, loops), parts of the GPU will be stalled because blocks are split into warps, and all threads within each warp are synchronized so all run the same instruction at once in a true SIMD fashion. Hence, if different threads within a warp take different branches of an algorithm, some of them will be stalled waiting for the rest. Finally, all threads within a block share a single register bank; so if a kernel needs to use a large number of registers, it can constrain the total number of threads being run at once. In summary, while the potential benefits of using CUDA are enormous, not all GBML methods are suitable to be adapted to this architecture because of their patterns of memory access or flow of code.

The consequence of these constraints is that we should only expect to obtain good performance out of GPGPUs when the code run in the GPU kernels is relatively simple. In Ref 96, we showed how we can successfully combine the ALKR representation with GPGPU matching, but the key to that success was to limit the degree of parallelism of the kernels: each thread performed the match process of a single example from the training set (or the current window) against a single rule in the population, that is, iterating over the attributes expressed in that rule. Other methods90 implemented kernels where each thread was only matching a single attribute of a rule. This strategy would not have been possible to implement in combination with ALKR because each rule may have a different set of expressed attributes, and this would create a very sparse grid structure where many threads (those associated to attributes not present in the rule) would not be used, hence reducing the speedup of the system.

PARALLELIZATION MODELS

The following section contains a description of parallel implementations of GBML methods. It will be split into two parts: coarse-grained and fine-grained methods.

Coarse-Grained Parallel Models

By coarse-grained methods, we understand parallel implementations presenting moderately low amounts of communication between nodes, generally correlated with a quite high computational load per node. Included in this category are methods with varying degrees of communication. The most extreme case is that of methods that present no communication at all between nodes. Most GBML (and in general EC) methods can exploit this option given that they are stochastic methods and their evaluation requires running them multiple times with different random seeds. This can be treated as a, rather obvious, example of coarse-grained parallelism. However, there are methods that explicitly exploit the act of running the algorithm multiple times, for instance, for ensemble learning.99 In Ref 100 it was shown how, by simply running the GAssist Pittsburgh GBML method multiple times, producing a rule set for each run, and then aggregating these rule sets using a majority vote, the ensemble managed to obtain higher test accuracy than the average of the single runs for a set of 25 real-world classification datasets. This same ensemble mechanism has later been applied to a broad variety of large-scale datasets58,59,86,101–104 with equally successful results. More traditional types of ensemble learning, such as Bagging or Boosting, have also been used in GBML, specifically using Genetic Programming.105,106
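The consensus step itself is simple; a minimal sketch (assuming each independent run returns a model object exposing a predict method) is:

```python
from collections import Counter

# Minimal majority-vote aggregation over the models produced by independent
# GBML runs. The predict() interface is an assumption made for illustration.
def ensemble_predict(models, x):
    votes = Counter(model.predict(x) for model in models)
    return votes.most_common(1)[0][0]
```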

Moreover, ensembles created using GBML and other bio-inspired methods have also been applied to nonstandard types of classification, such as ordinal classification,100 where the classes of the problem have an intrinsic order between them and a multiclass problem can be decomposed into a series of two-class problems by joining together the instances of classes that are adjacent in this intrinsic order, and then generating predictions for the full problem by chaining several of these subpredictors. A different class of classification problems, which can nevertheless be solved with a quite similar ensemble architecture, is hierarchical classification,107 where different predictor models are generated for different layers of the class hierarchy, and the final prediction is done, similarly to the above example, by chaining several of these predictors from the top layer to the bottom layer.

If we increase the level of communication, we find several examples of GBML methods that apply some of the classic paradigms of parallel GAs,37 such as the master–slave model8,86 or the island model.108,109

The performance of the master–slave model, in which the GA cycle is managed by a single (master) node and the slave nodes only perform fitness computations for the individuals held at the master, relies on two assumptions: (1) most of the computation time is concentrated in fitness computations. This has been shown to be the case for most GBML systems; in large-scale datasets (hundreds of variables or a few hundred thousand records), fitness computations easily exceed 99% of the CPU time.85 (2) The communication costs of sending the population to the slaves and receiving back the fitness values are low. This may not always be the case in GBML, as these approaches (especially those following the Michigan110 or the IRL97 approaches) tend to use rather large populations, forcing us to send rules to the evaluation slaves and collect the resulting fitness. This scheme also increases the sequential parts that cannot be parallelized, posing a threat to the overall speedup of the parallel implementation as a result of Amdahl's law.111

The NAX system86 presents an interesting variant of the master–slave model to minimize such communication costs. NAX is a parallel GBML system applying the IRL approach in which each processor runs an identical copy of the algorithm, all seeded in the same manner and hence performing the same genetic operations. They only differ in the portion of the population being evaluated. Thus, the population is treated as a collection of chunks where each processor evaluates its own assigned chunk, sharing the fitness of the individuals in its chunk with the rest of the processors. In this manner, fitness values can be encapsulated and broadcasted, maximizing the occupation of the underlying packing frames used by the network infrastructure. Moreover, this approach also removes the need for sending the actual rules back and forth between processors (as a master–slave approach would require), thus reducing the communication to the bare minimum, namely the fitness values. Figure 6 presents a conceptual scheme of the parallel architecture of NAX.

FIGURE 6 | Parallel architecture of the NAX system. © Natural Computing by Kluwer Academic Publishers. Reproduced with permission of Kluwer Academic Publishers in the format Journal via Copyright Clearance Center.
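A hedged sketch of this evaluation scheme is shown below, using an MPI-style allgather (mpi4py is used purely for illustration; it is not necessarily the communication layer of the original system):

```python
from mpi4py import MPI

# Sketch of a NAX-style evaluation step: every process runs an identical,
# identically seeded copy of the algorithm, evaluates only its own chunk of
# the population, and exchanges nothing but fitness values.
def evaluate_population(population, evaluate, comm=MPI.COMM_WORLD):
    rank, size = comm.Get_rank(), comm.Get_size()
    chunk = population[rank::size]                  # this process's share
    my_fitness = [evaluate(individual) for individual in chunk]
    all_chunks = comm.allgather(my_fitness)         # only fitness is broadcast
    # Re-interleave so every process ends up with the full fitness vector.
    fitness = [None] * len(population)
    for r, values in enumerate(all_chunks):
        fitness[r::size] = values
    return fitness
```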

Fine-Grained Parallel Models

There are specific examples of GBML systems that contain fine-grained parallelism at the core of their design, such as GALE (Genetic and Artificial Life Environment),50 which integrates elements of the Pittsburgh approach and cellular automata models. GALE uses a two-dimensional grid board formed by m × n cells for spreading the evolving population spatially. Each cell of the grid contains either one or zero individuals. Each individual is a complete solution to the classification problem, following the Pittsburgh approach of GBML. Several knowledge representations (even mixed on the same board, but with restrictions) have been implemented within GALE: rule sets, decision trees, and sets of synthetic prototypes. Genetic operators are restricted to the immediate neighborhood of the cell in the grid. The size of the neighborhood is r. Given a cell and r = 1, the neighborhood of that cell is defined by the eight cells adjacent to it. Thus, r is the number of hops that defines the neighborhood. GALE's speedup model shows that when r equals 1, the speedup grows linearly with the number of processors used.112

The evolutionary model proposed by GALE is based on the local interactions among cells as described above.




Each cycle in GALE has three main operations: merge, split, and survive. Individuals can merge (with a certain probability) with one of their present neighbors, essentially performing a crossover operation that generates a single offspring which replaces the individual undergoing the merge. Afterward, the split operator clones and mutates the individual in the cell. The new mutated individual is placed in an empty cell from the neighborhood of the original individual. The survival stage decides whether the individual in an occupied cell will die (vacating the cell) or remain, depending on the number of existing neighbors and the individual's fitness. For further details about the GALE model, please see Refs 50, 112, 113.
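A schematic skeleton of one such cycle is sketched below under simplifying assumptions: the board is a dict from cell coordinates to individuals (or None), crossover, mutate, and fitness are placeholders, and the survival rule is a crude stand-in for GALE's neighbor- and fitness-dependent policy described in Refs 50, 112, 113.

```python
import random

def gale_cycle(board, neighbors, crossover, mutate, fitness,
               p_merge=0.4, p_split=0.1):
    """One schematic merge/split/survive cycle over a grid of cells."""
    for cell, ind in list(board.items()):
        if ind is None:
            continue
        occupied = [n for n in neighbors(cell) if board.get(n) is not None]
        empty = [n for n in neighbors(cell) if board.get(n) is None]
        # Merge: crossover with a neighbor; the offspring replaces this individual.
        if occupied and random.random() < p_merge:
            board[cell] = crossover(ind, board[random.choice(occupied)])
        # Split: a mutated clone colonizes an empty neighboring cell.
        if empty and random.random() < p_split:
            board[random.choice(empty)] = mutate(board[cell])
        # Survive (simplified stand-in): isolated, low-fitness individuals vacate.
        if len(occupied) <= 1 and fitness(board[cell]) < 0.5:
            board[cell] = None
```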

There are other examples of GBML models that, while never implemented as fine-grained parallelism, have the potential to be easily transformed into this type of parallel architecture. One such example is Ref 114, which proposes a distributed LCS architecture designed to be used in very simple agents with limited memory. Each agent only stores a small part of a rule set, and multiple agents are chained to provide a complete solution to the classification problem. In such a model, good speedups would be obtained when all agents are continuously active, similarly to the pipeline architecture of modern CPUs.

DATA-INTENSIVE COMPUTING

When solving large-scale optimization or machine learning problems using EC techniques, researchers have realized that the population requirements may become infeasible if approached with traditional high-performance computing techniques.115 Most models require maintaining the entire population, or at least a sample of a large population, during the evolutionary process. However, the data abundance provided by such large populations has enabled data-intensive computing techniques to become a viable alternative parallelization scheme for EC techniques.116–119 Moreover, such approaches also provide three key advantages when compared to their traditional high-performance computing counterparts:

(1) They do not require detailed knowledge of the underlying hardware architecture and its complex programming techniques, which are hard to debug.

(2) They do not require intensive checkpointing to tolerate the failures that are quite common in large jobs that may run for days.

(3) They scale well on commodity clusters; the efficiency of traditional parallel programming frameworks (e.g., the Message Passing Interface, MPI) usually relies on expensive high-quality interconnection networks.

The following section will focus on a specific case: the Hadoop44 data-intensive framework, widely used for large-scale data processing, and its application to parallelize eCGA's model-building process.120

Hadoop and the MapReduce Model

Hadoop44 is Yahoo!'s open source MapReduce framework. Modeled after Google's MapReduce paper,121 Hadoop builds on the map and reduce primitives present in functional languages. Hadoop relies on these two abstractions to enable the easy development of large-scale distributed applications, as long as the application can be modeled around these two phases.

In this model, the computation takes a set of input key/value pairs and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: Map and Reduce. Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce framework then groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function. The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values. The intermediate values are supplied to the user's Reduce function via an iterator. This allows the model to handle lists of values that are too large to fit in main memory.

Conceptually, the Map and Reduce functions supplied by the user have the following types:

map(k1, v1) → list(k2, v2), (2)

reduce(k2, list(v2)) → list(v3), (3)

that is, the input keys and values are drawn from a different domain than the output keys and values. Furthermore, the intermediate keys and values are from the same domain as the output keys and values.
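The two signatures can be mimicked with a toy, purely in-memory driver; this is only an illustration of Equations (2) and (3), not the Hadoop API, and the label-counting example data are invented.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy driver for Equations (2) and (3): group by intermediate key, then reduce."""
    intermediate = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):        # map(k1, v1) -> list(k2, v2)
            intermediate[k2].append(v2)
    return {k2: reduce_fn(k2, values)        # reduce(k2, list(v2)) -> list(v3)
            for k2, values in intermediate.items()}

# Example: counting class labels of a toy dataset.
records = [(0, "contact"), (1, "noncontact"), (2, "contact")]
print(run_mapreduce(records,
                    map_fn=lambda k, label: [(label, 1)],
                    reduce_fn=lambda label, ones: [sum(ones)]))
# {'contact': [2], 'noncontact': [1]}
```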

The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines.




Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function, which is hash(key)%R in the default Hadoop configuration; the number of partitions (R) and the partitioning function can also be specified by the user. The overall execution is thus orchestrated in two steps: first all Mappers are executed in parallel, then the Reducers process the key/value pairs generated by the Mappers. A detailed explanation of this framework is beyond the scope of this paper and can be found elsewhere.122

MapReducing eCGA

Ref 120 presented a Hadoop-based implementation of eCGA's69 model-building process using data-intensive computing. As mentioned above, eCGA is an EDA that models a problem's structure as a set of nonoverlapping BBs. The quality of a model is assessed by a fitness function based on the minimum description length (MDL) principle.122 The model-building process starts with one BB per problem variable; BBs are then iteratively merged (in a greedy fashion, merging two BBs at a time) while the merged BBs improve eCGA's MDL metric. Model building is an important step in eCGA and can become the bottleneck if implemented sequentially. However, it is also difficult to parallelize because of the interdependence between consecutive merge steps. The solution is to split the population among different mappers. Each mapper computes the BB scores for the individuals it processes, as well as the score of every possible two-way merge of BBs. Then a single reducer aggregates the scores to compute the global MDL metric of all candidate BB merges, picks the best merge, and sends the merged partition back to the mappers for the next iteration of model building, until the MDL metric can no longer be improved.
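The sketch below outlines one model-building iteration in that spirit; it is a simplified rendering, not the implementation of Ref 120. In particular, partial_mdl is a hypothetical helper, and the assumption that per-chunk scores can simply be summed in the reducer is a simplification of the real MDL computation.

```python
from itertools import combinations

def map_merge_scores(population_chunk, building_blocks, partial_mdl):
    """Mapper: partial score of every candidate two-way BB merge on one chunk."""
    scores = {}
    for i, j in combinations(range(len(building_blocks)), 2):
        merged = building_blocks[i] | building_blocks[j]   # BBs as sets of variables
        scores[(i, j)] = partial_mdl(population_chunk, merged)
    return scores

def reduce_pick_best(chunk_scores_list, current_mdl):
    """Reducer: aggregate partial scores and return the best merge, if it improves."""
    totals = {}
    for chunk_scores in chunk_scores_list:
        for merge, score in chunk_scores.items():
            totals[merge] = totals.get(merge, 0.0) + score
    best_merge, best_score = min(totals.items(), key=lambda kv: kv[1])
    return best_merge if best_score < current_mdl else None
```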

Other Data-Intensive Computing Frameworks

Other data-intensive computing frameworks have also been applied to parallelize EC methods.118–120

MongoDB (http://www.mongodb.org/) is a scalable, high-performance, open source, schema-free, document-oriented database. Among other features, MongoDB provides MapReduce tasks as a primitive of the query interface. When documents are stored in sharded collections (collections of documents broken into shards distributed across different servers), MongoDB is able to run MapReduce tasks in parallel, making it an appealing alternative to Hadoop.

Moreover, Meandre123 is the National Center for Supercomputing Applications' data-intensive computing infrastructure for science, engineering, and the humanities. Meandre provides a more flexible programming model that allows the creation of complex data flows, which could be regarded as complex and possibly iterating MapReduce stages. Dataflow execution engines provide a scheduler that determines the firing (execution) sequence of components. Meandre uses a decentralized scheduling policy designed to maximize the use of multicore architectures. Groups of components can also be placed across different machines and hence scale by distributed execution. Meandre also supports processes that require directed cyclic graphs, thus extending beyond the traditional MapReduce directed acyclic graphs.

VISUALIZATION OF GBML RESULTS

Knowledge extraction is a crucial part of the data mining process, because in this context it is not enough to use GBML systems (and, in general, any kind of machine learning) to generate just predictions. The end users of these methods expect to discover new knowledge in the data. Hence, understanding and visualizing the solutions produced by GBML methods is a very important topic. GBML is in a very good position in this context, as a very large majority of its systems are rule based and generate solutions that are directly human-readable as a set of logic predicates. However, in most cases it is not enough to just show rule sets. To generate a complete picture of the GBML solutions some kind of visualization is required.

A recent work using Michigan LCS systems124 uses a technique inspired by the concept of heatmaps. A heatmap is a graphical visualization technique, frequently used to represent biological data, in which colors represent the values in a matrix whose rows and columns are the samples (instances) and genes/proteins (attributes), respectively. Rows and columns are sorted using a hierarchical clustering technique. The effect of this sorting is that patterns of colors can be easily identified in the heatmap, which may indicate the most important variables associated with a specific group of samples. In the application of heatmaps to represent rule sets, the rows and columns are the instances and attributes of the domain, and the color of a particular cell indicates whether the attribute of that cell is important for the prediction of that sample based on the rules generated by the Michigan LCS. By sorting the instances in the dataset properly, this technique becomes useful for identifying salient interactions between attributes that are important to predict a large number of samples in the dataset.
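The plotting step can be approximated with standard scientific Python tools; the sketch below assumes a precomputed binary matrix, importance (instances × attributes), indicating which attributes the rules rely on for each instance. That input format is an assumption for illustration, not the exact procedure of Ref 124.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, leaves_list

def rule_heatmap(importance):
    """Instances-by-attributes heatmap with rows and columns reordered by
    hierarchical clustering so that shared attribute-usage patterns stand out."""
    row_order = leaves_list(linkage(importance, method="average"))
    col_order = leaves_list(linkage(importance.T, method="average"))
    ordered = importance[np.ix_(row_order, col_order)]
    plt.imshow(ordered, aspect="auto", cmap="viridis")
    plt.xlabel("attributes")
    plt.ylabel("instances")
    plt.show()
```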




FIGURE 7 | Visualization of a large number of rule sets as a network of interactions.104 Nodes = attributes. Edges = attributes appearing together in a rule. © The Plant Cell by American Society of Plant Physiologists, Copyright 2013. Reproduced with permission of American Society of Plant Biologists in the format Republish in a journal via Copyright Clearance Center.

The previous method was used to visualize a single rule set, the final population of a Michigan LCS. Other methods integrate multiple rule sets in a single visualization. In Ref 104, the BioHEL GBML system was used to generate rule sets from a microarray dataset about plant seed germination. This dataset had 119 samples and almost 14,000 attributes, and each time BioHEL was run on this data (using different random seeds) it produced a totally different rule set, although some attributes appeared in the rules far more frequently than others. This property was exploited by running BioHEL 20,000 times (on this dataset a BioHEL run takes just a few seconds) to generate 20,000 different rule sets. Rankings of important attributes and their interactions were generated by simply counting how many times (1) an attribute was used in a rule and (2) a pair of attributes appeared together in a rule. Finally, an interaction network of the dataset's attributes was generated by connecting with an edge the attributes (nodes) that appeared together in the rules. This network is shown in Figure 7. The color of the nodes in the network is determined by preexisting domain knowledge that had not been used in the training process. The regular distribution of the colors shows that the network (and hence the rule sets) has been able to reconstruct this domain knowledge from the data.
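The counting step lends itself to a very small script; the sketch below assumes each rule is simply an iterable of attribute names (an assumption about the data layout, not BioHEL's internal format) and uses networkx to hold the resulting graph.

```python
from collections import Counter
from itertools import combinations
import networkx as nx

def cooccurrence_network(rule_sets, min_count=2):
    """Attribute-interaction network: edges connect attributes that co-occur in rules."""
    attr_counts, pair_counts = Counter(), Counter()
    for rules in rule_sets:
        for rule in rules:
            attrs = sorted(set(rule))
            attr_counts.update(attrs)
            pair_counts.update(combinations(attrs, 2))
    graph = nx.Graph()
    for (a, b), n in pair_counts.items():
        if n >= min_count:                      # keep only recurring interactions
            graph.add_edge(a, b, weight=n)
    nx.set_node_attributes(graph,
                           {a: attr_counts[a] for a in graph.nodes}, "frequency")
    return graph
```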

The previous two examples used visualization to represent the content of the rule sets. Other visualizations have been proposed to represent the structure of the rule sets. In Parallelization Models, we mentioned an example of a (potentially) fine-grained LCS architecture where the predictions are performed by a distributed set of simple agents chained together.114 The authors of that work generated a series of visualizations of the patterns of connections of their rule set agents for a variety of problems, which was very useful to understand the behavior of their distributed architecture.

Finally, there are also examples of visualizations of the solutions of GBML systems that do not evolve rule sets. In Ref 125, the authors proposed a method to evolve the topology of a hidden Markov model (HMM) as the core of a Fisher kernel used for protein sequence classification tasks. The HMMs had a hierarchical structure divided into blocks. A GA was used to evolve the connections from block to block (or to itself), while a heuristic algorithm was used within each block to generate the low-level HMM subsolutions. In this specific case the visualization of the evolved solutions is simply the representation of the solution's phenotype: the diagram of connected blocks. Different block shapes were used to indicate different types of block.

THEORETICAL MODELS OF GBML

The previous sections have shown the enormous variety of GBML mechanisms that we can combine to improve the efficiency of GBML methods, but a few questions remain: how do we choose the appropriate method? Moreover, what is the best way of using it? Recent years have seen the development of many theoretical models of GBML systems, either of the whole system or of one or more of its components. Their motivation is not just modeling the system for its own sake, but providing tools to adjust the modeled components/system in principled ways.

We have already mentioned above the success model of the ILAS windowing scheme,55 which can be used, for the specific case of a Pittsburgh LCS, to estimate the maximum number of strata to use in ILAS without compromising the learning abilities of the system. Furthermore, a simple model of the GAssist56 system's initialization stage was proposed14 to automatically adjust the main parameter controlling its initialization (p).

Michigan LCSs are the systems that have been modelled in most detail.




Butz15 proposed several models for the XCS126 system covering all of its functioning: initialization, crossover, mutation, and learning time. Moreover, these models were used to estimate bounds on the minimal population size required for the system to function correctly. Later, Orriols16 created versions of these models specifically adapted to deal with a very important challenge in large-scale data mining: class imbalance. These models were able to inform how to modify the parameters of XCS to make sure that accurate rules were learnt for all classes of the problem, including the minority ones.

Finally, Franco et al.17 have recently proposed models for the initialization stage of the BioHEL3 GBML system. These models are inspired by those developed for XCS but with three major changes: (1) using the ALKR and GABIL representation instead of the ternary representation of XCS, (2) dealing with χ-ary variables instead of the binary variables of the original models, and (3) explicitly modeling the concept of rule overlap, which is a known source of difficulty for GBML systems.18 These models were able to inform how to adjust BioHEL's initialization parameters to ensure the creation of a good initial population.

REAL WORLD EXAMPLES

The literature is rich in applications of GBML methods (e.g., see Refs 127–130). In the specific case of large-scale datasets there are several examples of GBML methods applied to the well-known KDDCUP-99 large-scale benchmark,60,132–134 as well as applications to specific problems, mostly in the bioinformatics/biomedicine domains.58,86,91,101–104,107,124

In the following section, we describe in detail a specific example as a case study: the application of the BioHEL3 rule-based GBML system to protein contact map (CM) prediction.

Case Study: Contact Map Prediction Using GBML

PSP134,135 is one of the most challenging problems in the field of computational biology. Proteins are crucial molecules for the functioning of all processes in living organisms. A protein is constructed as a chain of amino acids (AAs). As it is being constructed, this chain folds to create very complex three-dimensional shapes. The function of a protein is a consequence of its structure. Thus, knowing the structure of a protein is a crucial step towards understanding its function. However, it is extremely difficult and costly to experimentally determine the structure of a protein. Given that proteins fold on their own as they are being constructed, the general consensus of the community is that the sequence of a protein should contain enough information to predict its structure. This assumption gave rise to the PSP field. The impact of having better PSP methods would be enormous, not just for understanding life better, but also for designing new/modified proteins for a variety of applications such as the production of better crops or more efficient drugs. This case study focuses on a specific subproblem of PSP: CM prediction. A CM is a binary matrix with as many rows and columns as AAs in the protein, where each cell indicates whether a certain pair of AAs in the protein is in contact (their Euclidean distance in the structure is less than a certain threshold). Reconstructing a CM is generally treated as a machine learning problem, and a very challenging one for three reasons: very large training sets, high-dimensionality representations, and huge class imbalance.

For a protein of N AAs, we can have roughly N² possible contacts. A typical protein dataset tends to have around 2000 sequences with an average size of around 200 AAs, which roughly means 80 million contacts (2000 × 200² = 80 million). Moreover, a CM is a very sparse matrix, and the number of real contacts is generally around 2% of the total number of possible pairs of AAs. Hence, this is a prediction problem presenting an extremely high class imbalance. Finally, the state-of-the-art representations for this problem tend to have hundreds of attributes (if not thousands). Most methods do not tend to use datasets as difficult as the one described above. Preprocessing techniques including thinning out the set of employed proteins,136 rebalancing the positive and negative examples,137 or reducing the size of the alphabet101 are sometimes employed to alleviate these challenges.

The system employed for this problem, BioHEL,3 is a GBML system applying the IRL97 paradigm. In this paradigm of rule learning, the rule set presented as the final solution of the learning process is learnt one rule at a time in an iterative way. Once a rule is learnt, the examples covered by it are discarded from the training set and the process is started again to learn the next rule. This iterative process ends when all training examples have been covered. The fitness function of such systems has two goals: to generate rules that are not only accurate but also general (covering as many examples as possible). It is not easy to find the correct balance between these two goals.




In BioHEL this balance is achieved by redefining the concept of coverage as a piecewise function that rewards rules covering at least a certain, user-specified, percentage of examples. In this way, the system avoids learning accurate but overspecific rules, although its performance completely depends on setting this parameter correctly (Franco et al., 2012). Moreover, BioHEL was designed explicitly to cope with large-scale datasets, incorporating several of the efficiency enhancement mechanisms that we have described throughout this paper: the ILAS windowing scheme, the ALKR knowledge representation, a GPGPU-based fast fitness computation, and ensembles for consensus prediction. For a complete description of BioHEL, please see Ref 3.
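One possible shape of such a piecewise coverage term is sketched below; the break point and the 0.9/0.1 split are invented illustrative values, not BioHEL's actual constants (the real formula is given in Ref 3).

```python
def coverage_term(covered, total, cov_break=0.05):
    """Piecewise coverage reward: steep gain up to the user-specified break point,
    then only a mild gain, discouraging accurate-but-overspecific rules."""
    cov = covered / total
    if cov < cov_break:
        return 0.9 * cov / cov_break
    return 0.9 + 0.1 * (cov - cov_break) / (1.0 - cov_break)
```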

For CM prediction, a training set of 2413 proteins representing a broad range of different structures was generated. In this dataset (after thinning out), there were over 32 million potential contacts, each of them represented using 631 attributes. The training set occupies 56.7 GB of disk space. In order to alleviate both the training set size and the class imbalance, 50 samples were selected out of this huge dataset. Each sample had 660,000 examples and a ratio of two noncontacts for each contact, hence rebalancing the training sets. Next, BioHEL was trained on each of the samples 25 times, each time using a different random seed. This created a total of 1250 different rule sets, one resulting from each of the 50 × 25 runs. These rule sets were integrated into an ensemble performing a simple majority vote, as represented in Figure 8. The whole training process took around 25,000 CPU hours using Intel Xeon E5472 processors at 3.00 GHz. For a complete description of this CM prediction architecture, please see Ref 59. This prediction method participated in the CM category of the eighth and ninth editions of CASP (Critical Assessment of Techniques for Protein Structure Prediction), a biennial experiment held to assess the state of the art in PSP methods. The method participated in CASP8138 as ID72 and in CASP9139 as ID51. The results of the assessment showed that this method ranked very high according to most performance metrics. In particular, in CASP9 this method had the highest rank among all sequence-based predictors in five out of the six evaluated metrics.59
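The consensus step of Figure 8 amounts to a plain majority vote; a minimal sketch is given below, assuming each trained rule set exposes a hypothetical predict(instance) method returning 1 (contact) or 0 (noncontact).

```python
from collections import Counter

def ensemble_predict(rule_sets, instance):
    """Majority vote over the predictions of many independently trained rule sets."""
    votes = Counter(rule_set.predict(instance) for rule_set in rule_sets)
    return votes.most_common(1)[0][0]
```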

CLASSIFICATION AND ANALYSIS OF LARGE-SCALE GBML MECHANISMS

In this paper, we have described a very broad range of mechanisms that have been proposed to tackle the challenges of large-scale data mining, both from the point of view of efficiency, as is natural, but also from the point of view of the quality of the generated solutions.

FIGURE 8 | Representation of the training and prediction process for contact map prediction using the BioHEL genetics-based machine learning system.

That is, we consider whether the mechanisms modify the learning capacity of the GBML systems. To summarize the myriad of mechanisms described in this paper, we propose a taxonomy based on answering these five questions:

Pos Does the method have a positive impact on the quality of the generated solutions?

Neg May the method suffer from a certain (moderate) degree of solution quality loss?

Enh Does the method improve efficiency by changing the algorithm?

HW Does the method improve efficiency by using special hardware?

Par Does the method improve efficiency by parallelizing the algorithm?

Table 1 shows the classification of all the methods described in this paper based on this taxonomy.




TABLE 1 Classification of GBML Methods for Large-Scale Data Mining

Method                                          Pos  Neg  Enh  HW  Par
Windowing mechanisms53–55,60                     y    y    y    n   n
Rule match precomputing63,64                     n    n    y    n   n
Reordering attributes by specificity65           n    n    y    n   n
Attribute list knowledge representation3,132     y    n    y    n   n
Hybrid EDA–GBML methods9,71,73                   y    n    y    n   n
Hybrid memetic-GBML methods8,11,77–79            y    n    y    n   n
Fitness surrogates81,82                          n    y    y    n   n
Vectorial matching85,86                          n    n    y    y   n
GPGPU matching90–91,96                           n    n    n    y   n
Ensemble mechanisms100,105,107                   y    n    n    n   y
Master–slave parallel models86                   n    n    n    n   y
Island parallel models108,109                    y    y    n    n   y
Fine-grained parallel models50,114               y    y    n    n   y
Data-intensive computing116–120,123              n    n    n    n   y

As we can see in Table 1, there are methods that produce an impact in more than one of the categories of the taxonomy, while others only affect a single category. Particularly interesting are the three cases (windowing mechanisms, island models, and fine-grained parallel models) that can affect the learning process of the GBML systems both positively and negatively. In these cases, adjusting the parameters of the mechanisms (e.g., the number of strata in ILAS, or the number of agents in the island/fine-grained models) is crucial to avoid constraining the learning process.

Moreover, many of these mechanisms can be successfully combined with each other to accumulate their individual benefits. We have shown in the case study on CM prediction how the BioHEL system combines the ILAS windowing scheme, the ALKR representation, GPGPU matching, and ensemble mechanisms. Some of these mechanisms can be easily combined with any other, while some cannot. For instance, the ensemble methods can be used on top of any other mechanism because they combine the models extracted from several independent runs of the system, so there is no explicit coupling between mechanisms.

The windowing mechanisms are a very delicate case. Given that they modify how the system learns (because they essentially create a dynamic optimization environment by changing the instances used to compute the fitness calculations at every GA iteration), the methods that use the content of the instances to perform their work may not be suitable for combination with them.

For instance, it would be almost impossible to combine windowing with the match precomputing algorithms or fitness surrogates, as the benefit of these techniques would mostly be lost if the precomputation needs to happen at each iteration instead of once for the whole run. The specific case of MPLCS showed how windowing can be combined successfully with memetic operators, but only when the number of strata in ILAS is correctly adjusted. When the strata stop being good representatives of the whole training set, the memetic operators may start to overfit the rules to the current stratum.

As we have discussed above, GPGPUs can be combined with other efficiency enhancement techniques such as the ALKR knowledge representation, but only when the right degree of parallelism is used. Roughly speaking, that degree of parallelism is the one that achieves optimal memory access patterns and a near-maximal utilization of the GPU computational resources. Moreover, some algorithms, like the memetic operators, are essentially serial, so it is very difficult to reimplement them with a parallel approach. Furthermore, the binary/SSE encoding technique for vectorial matching would probably not be efficient in combination with the ALKR representation because of the sparseness of that representation. The vectorial matching requires rules to be tightly packed and to have a uniform structure in order to quickly perform the match of multiple attributes in a single logical operation. When the rules stop having a uniform structure or the packing needs to be edited too often, the efficiency of these mechanisms quickly drops.

Finally, we have to keep in mind the suitability of these mechanisms to the type of problems being tackled.




For instance, data-intensive computing and, to a lesser extent, GPGPU matching can only be applied if the volume of data in the training set is large enough. The trade-off between the cost of performing an individual's evaluation and the communication required for that evaluation is important when deciding between traditional master–slave models and adapted versions such as NAX. Most hybrid methods are only able to significantly outperform traditional genetic search on problems of high enough dimensionality.

CONCLUSION

This paper has reviewed the broad range of methods that have been applied to improve/adapt GBML techniques to cope with large-scale datasets. These methods build upon the natural parallel capacity of EC methods, but are not limited to the traditional efficiency enhancement techniques of EC (e.g., fitness surrogates, parallel GAs, hybrid exploration, and so on). GBML methods have specific scaling-up mechanisms that span representations, learning paradigms, and inference mechanisms, but also parallel/special hardware implementations. Moreover, theoretical models have been proposed in recent years about a variety of aspects of a GBML system that can help adjust them in a principled way, which is crucial in the context of large-scale data mining because in most cases it is not possible to afford a full parameter sweep. The paper has shown how these techniques are prepared to deal with extremely difficult real-world problems of high impact for society, and that they have quickly adapted to embrace the latest advances in high-performance computing hardware such as GPGPUs.

The appearance in the last one or two years of buzzwords such as 'big data' or 'data science', especially in industry, indicates that the road ahead can provide many benefits for GBML methods and their community, if some of their limitations are solved.

We have seen how in many cases the good performance of GBML scaling-up mechanisms depends on the careful choice of parameters. This is a drawback for the wide dissemination of these methods, as practitioners who are not GBML experts may opt for other techniques that are simpler to set up. To solve this problem, it is crucial to develop automatic parameter-setting heuristics, and theoretical models of all aspects of a GBML system are very important for designing principled adjustment mechanisms for these parameters. We have shown in this paper that many theoretical models already exist, but many of them are limited to simple cases (e.g., binary representations), so there is still a lot of work to do in this direction. Another aspect that still has a large potential for improvement is the data-intensive computing implementations. The example that we have illustrated in this paper parallelizes the core eCGA optimization algorithm, and hence is not exclusively designed for machine learning tasks. GBML methods cover a range of problems relevant enough that it is worth, and necessary, to propose specific (for example, Hadoop) implementations of GBML mechanisms. Moreover, there are very sophisticated knowledge representations in GBML (e.g., hyperellipsoid conditions140) that are not yet ready to cope with high-dimensionality domains, so it may be possible to extend them using some of the strategies that we have discussed throughout this manuscript. Finally, a very crucial aspect for the future is dissemination, making sure that GBML methods are known to a very broad audience of potential practitioners. A possible strategy to achieve this aim may be to integrate implementations of GBML methods into popular machine learning packages such as WEKA,141 KEEL142 (a machine learning platform for GBML methods), or Mahout143 (the machine learning extension of Hadoop).

REFERENCES

1. Kovacs T. Genetics-based machine learning. In:Rozenberg G, Back T, Kok J, eds. Handbook of Nat-ural Computing: Theory, Experiments, and Applica-tions. Berlin: Springer Verlag; 2011.

2. Fernandez A, Garcia S, Luengo J, Bernado-MansillaE, Herrera F. Genetics-based machine learning forrule induction: state of the art, taxonomy, and com-parative study. IEEE Trans Evolut Comput 2010,14:913–941.

3. Bacardit J, Burke EK, Krasnogor N. Improvingthe scalability of rule-based evolutionary learning.Memetic Comput 2009, 1:55–67.

4. Lanzi PL, Rocca S, Solari S. An approach to analyzethe evolution of symbolic conditions in learning clas-sifier systems. In: GECCO ’07: Proceedings of the9th Annual Conference on Genetic and EvolutionaryComputation. ACM Press, New York; 2007, 2795–2800.

5. Browne WN, Ioannides C. Investigating scaling of anabstracted LCS utilising ternary and s-expression al-phabets. In: GECCO ’07: Proceedings of the 2007GECCO Conference Companion on Genetic andEvolutionary Computation. ACM Press, New York;2007, 2759–2764.




6. Llora X, Priya A, Bhargava R. Observer-invarianthistopathology using genetics-based machine learn-ing. Nat Comput 2008, 8:101–120.

7. Butz M, Lanzi P, Wilson S. Function approximationwith XCS: hyperellipsoidal conditions, recursive leastsquares, and compaction. IEEE Trans Evolut Com-put 2008, 12:355–376.

8. Bacardit J, Krasnogor N. Performance and efficiencyof memetic pittsburgh learning classifier systems. Evo-lut Comput J 2009, 17:307–342.

9. Butz MV, Pelikan M, Llora X, Goldberg DE. Auto-mated global structure extraction for effective localbuilding block processing in XCS. Evolut Comput2006, 14:345–380.

10. Llora X, Sastry K, Lima CF, Lobo FG, Goldberg DE.Linkage learning, rule representation, and the x-aryextended compact classifier system. In: Bacardit J,Bernado-Mansilla E, Butz MV, Kovacs T, Llora X,Takadama K, eds. Learning Classifier Systems, Lec-ture Notes In Artificial Intelligence, Vol. 4998. Berlin,Heidelberg: Springer-Verlag; 2008, 189–205.

11. Wyatt D, Bull L. A memetic learning classifier sys-tem for describing continuous-valued problem spaces.In: Recent Advances in Memetic Algorithms. Berlin:Springer; 2004, 355–396.

12. Loiacono D, Marelli A, Lanzi PL. Support vector re-gression for classifier prediction. In: GECCO ’07:Proceedings of the 9th Conference on Genetic andEvolutionary Computation. ACM Press, New York;2007, 1806–1813.

13. Tamee K, Bull L, Pinngern O. Towards clusteringwith XCS. In: GECCO ’07: Proceedings of the 9thAnnual Conference on Genetic and EvolutionaryComputation. ACM Press, New York; 2007, 1854–1860.

14. Bacardit J. Analysis of the initialization stage of aPittsburgh approach learning classifier system. In:GECCO ’05: Proceedings of the 7th Annual Con-ference on Genetic and Evolutionary Computation.ACM Press, New York; 2005, 1843–1850.

15. Butz MV. Rule-Based Evolutionary online learningsystems: a principled approach to lcs analysis anddesign. In: Studies in Fuzziness and Soft Computing.Berlin: Springer; 2006.

16. Orriols-Puig A. New Challenges in Learning Clas-sifier Systems: Mining Rarities and Evolving FuzzyModels. PhD Thesis. Barcelona, Spain: Ramon LlullUniversity; 2008.

17. Franco MA, Krasnogor N, Bacardit J. Modellingthe initialisation stage of the ALKR representa-tion for discrete domains and GABIL encoding. In:GECCO ’11: Proceedings of the 13th Annual Con-ference on Genetic and Evolutionary Computation.ACM, New York: 2011.

18. Ioannides C, Barrett G, Eder K. XCS cannot learn allboolean functions. In: GECCO ’11: Proceedings of

the 13th Annual Conference on Genetic and Evolu-tionary Computation. ACM, New York: 2011, 1283–1290.

19. Bernado-Mansilla E, Llora X, Garrell JM. XCS andGALE: a comparative study of two learning classifiersystems with six other learning algorithms on classi-fication tasks. In: IWLCS 2001, LNAI 2321, Berlin:Springer Verlag; 2002, 115–132.

20. Bacardit J, Butz MV. Data mining in learning classi-fier systems: comparing XCS with GAssist. In: KovacsT, Llora X, Takadama K, Lanzi PL, Stolzmann W,Wilson SW, eds. Advances at the Frontier of Learn-ing Classifier Systems. Berlin: Springer-Verlag; 2007,282–290.

21. Orriols-Puig A, Casillas J, Bernado-Mansilla E.Genetic-based machine learning systems are com-petitive for pattern recognition. Evolut Intell 2008,1:209–232.

22. Genbank release notes. Available at: ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt. (Accessed November 21,2012).

23. Physicists brace themselves for LHC ‘data avalanche’.Available at: http://www.nature.com/news/2008/080722/full/news.2008.967.html. (Accessed Novem-ber 21, 2012).

24. Pop M, Salzberg SL. Bioinformatics challenges of newsequencing technology. Trends Genet 2008, 24:142–149.

25. The netflix prize. Available at: http://www.netflixprize.com. (Accessed November 21, 2012).

26. The netflix prize leaderboard. Available at: http://www.netflixprize.com/leaderboard. (AccessedNovember 21, 2012).

27. Allison DB, Page GP, Beasley TM, Edwards JW,eds. DNA Microarrays and Related Genomics Tech-niques: Design, Analysis, and Interpretation of Ex-periments. Boca Raton, FL: Chapman & Hall; 2005.

28. Glaab E, Garibaldi J, Krasnogor N. Arraymining:a modular web-application for microarray anal-ysis combining ensemble and consensus methodswith cross-study normalization. BMC Bioinformatics2009, 10:358.

29. The ICOS PSP benchmarks repository. Available at:http://icos.cs.nott.ac.uk/datasets/psp benchmark.html.(Accessed November 21, 2012).

30. Reuters-21578 text categorization collection.Available at: http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html. (Accessed Novem-ber 21, 2012).

31. Otero F, Freitas A, Johnson C. A hierarchical multi-label classification ant colony algorithm for proteinfunction prediction. Memetic Comput 2010, 2:165–181.

32. Japkowicz N, Stephen S. The class imbalance prob-lem: a systematic study. Intelli Data Anal 2002,6:429–450.




33. Ho TK, Basu M. Complexity measures of supervisedclassification problems. IEEE Trans Pattern AnalMach Intell 2002, 24:289–300.

34. Bernado-Mansilla E, Ho TK. Domain of compe-tence of XCS classifier system in complexity measure-ment space. IEEE Trans Evolut Comput 2005, 9:82–104.

35. Luengo J, Fernandez A, Garcıa S, Herrera F. Address-ing data complexity for imbalanced data sets: analysisof SMOTE-based oversampling and evolutionary un-dersampling. Soft Comput 2010, 15:1909–1936.

36. Dumbill E. What is big data?, 2012. Avail-able at: http://radar.oreilly.com/2012/01/what-is-big-data.html.

37. Cantu-Paz E. Efficient and Accurate Parallel GeneticAlgorithms. Norwell, MA: Kluwer Academic; 2000.

38. Alba E. Parallel Metaheuristics: A New Class of Al-gorithms. Hoboken, New Jersey: Wiley Interscience;2005.

39. The HDF5 data storage system. Available at:http://www.hdfgroup.org/HDF5/. (Accessed Novem-ber 21, 2012).

40. The hadoop distributed file system. Availableat: http://hadoop.apache.org/docs/hdfs/current/hdfsdesign.html. (Accessed November 21, 2012).

41. The mongodb document-oriented database. Availableat: http://www.mongodb.org/. (Accessed November21, 2012).

42. Sastry K. Principled efficiency enhancement tech-niques. In: GECCO ’05: Proceedings of the 7thannual conference companion on Genetic andevolutionary computation; 2005. Available at:http://www.illigal.uiuc.edu/web/kumara/2005/11/24/principled-efficiency-enhancement-techniques/. (Acc-essed November 21, 2012).

43. Dean J, Ghemawat S. MapReduce: simplified dataprocessing on large clusters. In: Proceedings of the6th Symposium on Operating Systems Design andImplementation (OSDI ’04). USENIX, Berkeley, CA;2004, 137–150.

44. The hadoop distributed computing framework. Avail-able at: http://hadoop.apache.org/. (Accessed Novem-ber 21, 2012).

45. Furnkranz J. Integrative windowing. J Artif Intell Res1998, 8:129–164.

46. Quinlan JR. C4.5: Programs for Machine Learning.San Mateo, CA: Morgan Kaufmann; 1993.

47. John GH, Langley P. Static versus dynamic sam-pling for data mining. In: Simoudis E, Han J, FayyadUM, eds. Proceedings of the Second InternationalConference on Knowledge Discovery and Data Min-ing. AAAI Press, Menlo Park, CA; 1996, 367–370.

48. Maloof MA, Michalski RS. Selecting examples forpartial memory learning. Mach Learn 2000, 41:27–52.

49. Freitas AA. Data Mining and Knowledge Discov-ery with Evolutionary Algorithms. Berlin: Springer-Verlag; 2002.

50. Llora X. Genetics-Based Machine Learning usingFine-grained Parallelism for Data Mining. PhD The-sis. Barcelona, Spain: Enginyeria i Arquitectura LaSalle, Ramon Llull University; 2002.

51. Derrac J, Garcıa S, Herrera F. Stratified prototypeselection based on a steady-state memetic algorithm:a study of scalability. Memetic Comput 2010, 2:183–199.

52. Garcıa S, Derrac J, Cano J, Herrera F. Prototype se-lection for nearest neighbor classification: taxonomyand empirical study. IEEE Trans Pattern Anal MachIntell 2012, 34:417–435.

53. Sharpe PK, Glover RP. Efficient GA based techniquesfor classification. Appl Intell 1999, 11:277–284.

54. Gathercole C, Ross P. Dynamic training subset se-lection for supervised learning in genetic program-ming. In: Davidor Y, Schwefel H-P, Manner R. eds.Parallel Problem Solving from Nature III. Jerusalem:Springer-Verlag; 1994, 312–321.

55. Bacardit J, Goldberg D, Butz M, Llora X, GarrellJM. Speeding-up pittsburgh learning classifier sys-tems: modeling time and accuracy. In: Yao X, BurkeEK, Lozano JA, Smith J, Merelo-Guervos JJ, Bulli-naria JA, Rowe JE, Tino P, Kaban A, Schwefel H-P,eds. Parallel Problem Solving from Nature—PPSN2004. LNCS. Berlin: Springer-Verlag; 2004, 1021–1031.

56. Bacardit J. Pittsburgh Genetics-Based MachineLearning in the Data Mining era: Representa-tions, Generalization, and Run-Time. PhD Thesis.Barcelona, Spain: Ramon Llull University; 2004.

57. Bacardit J, Stout M, Krasnogor N, Hirst JD,Blazewicz J. Coordination number prediction us-ing learning classifier systems: performance and in-terpretability. In: GECCO ’06: Proceedings of the8th Annual Conference on Genetic and EvolutionaryComputation. ACM Press, New York; 2006, 247–254.

58. Stout M, Bacardit J, Hirst JD, Krasnogor N. Pre-diction of recursive convex hull class assignmentsfor protein residues. Bioinformatics 2008, 24:916–923.

59. Bacardit J, Widera P, Marquez-Chamorro A, DivinaF, Aguilar-Ruiz JS, Krasnogor N. Contact map pre-diction using a large-scaleensemble of rule sets andthe fusion of multiple predicted structural features.Bioinformatics. 2012, 28:2441–2448.

60. Song D, Heywood MI, Zincir-Heywood AN. Train-ing genetic programming on half a million patterns:an example from anomaly detection. IEEE TransEvolut Comput 2005, 9:225–239.

61. Frank A, Asuncion A. UCI Machine Learning Repos-itory. Irvine, CA: University of California, School ofInformation and Computer Science; 2010. Available




at: http://archive.ics.uci.edu/ml. (Accessed November21, 2012).

62. Cano J-R, Garcıa S, Herrera F. Subgroup discovery in large size data sets preprocessed using stratified instance selection for increasing the presence of minority classes. Pattern Recognit Lett 2008, 29:2156–2164.

63. Giraldez R, Aguilar-Ruiz JS, Santos JCR. Knowledge-based fast evaluation for evolutionary learning. IEEETrans Syst Man Cybern C 2005, 35:254–261.

64. Mellor D., Nicklin SP. A population-based approachto finding the matchset of a learning classifier sys-tem efficiently. In: GECCO ’09: Proceedings of the11th Annual Conference on Genetic and Evolution-ary Computation. ACM, New York; 2009, 1267–1274.

65. Butz MV, Lanzi PL, Llora X, Loiacono D. An analysisof matching in learning classifier systems. In: GECCO’08: Proceedings of the 10th Annual Conference onGenetic and Evolutionary Computation. ACM, NewYork; 2008, 1349–1356.

66. Langdon WB. Fitness causes bloat in variable size rep-resentations. Technical Report CSRP-97-14. Birming-ham, UK: School of Computer Science, Universityof Birmingham; 1997. (Position paper at the Work-shop on Evolutionary Computation with VariableSize Representation at ICGA-97.)

67. Larranaga P, Lozano J, eds. Estimation of Distri-bution Algorithms, A New Tool for EvolutionnaryComputation. Genetic Algorithms and EvolutionnaryComputation. Norwell, MA: Kluwer Academic Pub-lishers; 2002.

68. Krasnogor N, Smith J. A tutorial for competentmemetic algorithms: model, taxonomy and design is-sues. IEEE Trans Evolut Comput 2005, 9:474–488.

69. Harik G. Linkage learning via probabilistic model-ing in the ECGA. Technical Report 99010. Urbanaand Champaign, IL: Illinois Genetic Algorithms Lab,University of Illinois at Urbana-Champaign; 1999.

70. Pelikan M, Goldberg DE, Cantu-Paz E. BOA: theBayesian optimization algorithm. In: Proceedings ofthe Genetic and Evolutionary Computation Confer-ence GECCO-99, Morgan Kaufmann, San Mateo,CA; 1999, 525–532.

71. Llora X, Sastry K, Goldberg DE. The compact clas-sifier system: Scalability analysis and first results. In:Proceedings of the Congress on Evolutionary Com-putation 2005. New York: IEEE Press, 2005, 1:596–603.

72. Harik G, Lobo F, Goldberg D. The compact geneticalgorithm. IEEE Trans Evolut Comput 1999, 3:287–297.

73. Llora X, Sastry K, Lima C, Lobo F, GoldbergDE. Linkage learning, rule representation, andthe X-ary extended compact classifier system. In:Bacardit J, Bernado-Mansilla E, Butz M, Kovacs

T, Llora X, Takadama K, eds. Learning ClassifierSystems. Lecture Notes in Computer Science.Berlin/Heidelberg: Springer; 2008, 189–205.

74. Harik GR. Finding multimodal solutions using re-stricted tournament selection. In: Eshelman L, ed.Proceedings of the Sixth International Conferenceon Genetic Algorithms. Morgan Kaufmann, SanFrancisco, CA; 1995, 24–31.

75. Sierra B, Lazkano E, Inza I, Merino M, LarranagaP, Quiroga J. Prototype selection and feature sub-set selection by estimation of distribution algorithms.A case study in the survival of cirrhotic patientstreated with TIPS. Lecture Notes in Computer Sci-ence. Berlin/Heidelberg: Springer; 2001, 20–29.

76. Abegaz T, Dozier G, Bryant K, Adams J, Shelton J, Ri-canek K, Woodard D. SSGA amp; EDA based featureselection and weighting for face recognition. In: NewYork: IEEE Congress on Evolutionary Computation(CEC) 2011. IEEE; 2011, 1375 –1381.

77. Grefenstette JJ. Lamarckian learning in multi-agentenvironments. In: Belew R, Booker L, eds. Proceed-ings of the Fourth International Conference on Ge-netic Algorithms. Morgan Kaufman, San Francisco,CA; 1991, 303–310.

78. Casillas J, Martınez P, Benıtez AD. Learning consis-tent, complete and compact sets of fuzzy rules in con-junctive normal form for regression problems. SoftComput 2008, 13:451–465.

79. Butz MV, Goldberg DE, Lanzi PL. Gradient descentmethods in learning classifier systems: Improving XCSperformance in multistep problems. IEEE Trans Evo-lut Comput 2005, 9:452–473.

80. Miller BL, Goldberg DE. Genetic algorithms, selec-tion schemes, and the varying effects of noise. EvolutComput 1996, 4:113–131.

81. Llora X, Sastry K, Yu T-L, Goldberg DE. Do notmatch, inherit: fitness surrogates for genetics-basedmachine learning techniques. In: GECCO ’07: Pro-ceedings of the 9th Annual Conference on Geneticand Evolutionary Computation. ACM, New York;2007, 1798–1805.

82. Orriols-Puig A, Bernado-Mansilla E, Sastry K, Goldberg DE. Substructural surrogates for learning decomposable classification problems: implementation and first results. In: GECCO '07: Proceedings of the 2007 GECCO Conference Companion on Genetic and Evolutionary Computation. ACM, New York; 2007, 2875–2882.

83. Sastry K, Lima CF, Goldberg DE. Evaluation relax-ation using substructural information and linear esti-mation. In: GECCO ’06: Proceedings of the 8th An-nual Conference on Genetic and Evolutionary Com-putation. ACM, New York; 2006, 419–426.

84. Yu T-L, Goldberg DE, Yassine A, Chen Y-P. Geneticalgorithm design inspired by organizational theory:pilot study of a dependency structure matrix driven




genetic algorithm. In: GECCO ’03: Proceedings ofthe 5th Annual Conference on Genetic and Evolu-tionary Computation. ACM, New York; 2003, 1620–1621.

85. Llora X, Sastry K. Fast rule matching for learningclassifier systems via vector instructions. In: GECCO’06: Proceedings of the 8th Annual Conference onGenetic and Evolutionary Computation. ACM Press,New York; 2006, 1513–1520.

86. Llora X, Priya A, Bhargava R. Observer-invarianthistopathology using genetics-based machine learn-ing. Natural Comput 2009, 8:101–120.

87. NVIDIA. NVIDIA CUDA programming guide 2.0,2008.

88. Maitre O, Baumes LA, Lachiche N, Corma A,Collet P. Coarse grain parallelization of evolution-ary algorithms on GPGPU cards with EASEA. In:Proceedings of the 11th Annual Conference on Ge-netic and Evolutionary Computation. ACM, Mon-treal; 2009, 1403–1410.

89. Langdon WB, Harrison AP. GP on SPMD parallelgraphics hardware for mega bioinformatics data min-ing. Soft Comput 2008, 12:1169–1183.

90. Lanzi PL, Loiacono D. Speeding up matching in learn-ing classifier systems using CUDA. In: Bacardit J,Browne W, Drugowitsch J, Bernado-Mansilla E, ButzM, eds. Learning Classifier Systems. Lecture Notes inComputer Science. Berlin/Heidelberg: Springer; 2010,1–20.

91. Langdon WB. Large scale bioinformatics data miningwith parallel genetic programming on graphics pro-cessing units. In: de Vega F, Cantu-Paz E, eds. Paral-lel and Distributed Computational Intelligence. Stud-ies in Computational Intelligence. Berlin/Heidelberg:Springer; 2010, 113–141.

92. Prabhu R. SOMGPU: an unsupervised pattern classi-fier on graphical processing unit. In: IEEE Congresson Evolutionary Computation. New York: IEEE;2008, 1011–1018.

93. Sharp T. Implementing decision trees and forests ona GPU. In: Computer Vision—ECCV 2008. Berlin:Springer; 2008, 595–608.

94. Steinkraus D, Buck I, Simard P. Using GPUs for ma-chine learning algorithms. In: Document Analysis andRecognition, 2005. Proceedings. Eighth InternationalConference on IEEE, New York; 2005. 2:1115–1120.

95. Catanzaro B, Sundaram N, Keutzer K. Fast supportvector machine training and classification on graph-ics processors. In: Proceedings of the 25th Inter-national Conference on Machine Learning (ICML2008). 2008, 111.

96. Franco MA, Krasnogor N, Bacardit J. Speedingup the evaluation of evolutionary learning systemsusing GPGPUs. In: GECCO ’10: Proceedings of the12th Annual Conference on Genetic and Evolution-

ary Computation. ACM, New York; 2010, 1039–1046.

97. Venturini G. SIA: a supervised inductive algorithmwith genetic search for learning attributes based con-cepts. In: Brazdil PB, ed. Machine Learning: ECML-93—Proceedings of the European Conference on Ma-chine Learning. Springer-Verlag: Berlin, 1993, 280–296.

98. NVIDIA. NVIDIA CUDA SDK—Data-Parallel algo-rithms. Available at: http://www.nvidia.com/content/cudazone/cuda sdk/Data-Parallel Algorithms.html#reduction. (Accessed November 21, 2012).

99. various authors. Special issue on integrating multiplelearned models. Mach Learn 1999, 36:5–139.

100. Bacardit J, Krasnogor N. Empirical evaluation of en-semble techniques for a pittsburgh learning classifiersystem. In: Learning Classifier Systems, Revised Se-lected Papers of IWLCS 2006-2007. LNAI. Berlin:Springer-Verlag; 2008, 255–268.

101. Bacardit J, Stout M, Hirst JD, Valencia A, Smith RE,Krasnogor N. Automated alphabet reduction for pro-tein datasets. BMC Bioinformatics, 2009, 10:6.

102. Lee MC, Boroczky L, Sungur-Stasik K, Cann AD,Borczuk AC, Kawut SM, Powell CA. Computer-aideddiagnosis of pulmonary nodules using a two-step ap-proach for feature selection and classifier ensembleconstruction. Artif Intell Med 2010, 50:43–53.

103. Armananzas R, Inza I, Santana R, Saeys Y, FloresJ, Lozano J, Peer Y, Blanco R, Robles V, Bielza C,Larranaga. P. A review of estimation of distributionalgorithms in bioinformatics. BioData Min 2008, 1:6.

104. Bassel GW, Glaab E, Marquez J, Holdsworth MJ,Bacardit J. Functional network construction in ara-bidopsis using rule-based machine learning on large-scale data sets. Plant Cell 2011, 23:3101–3116.

105. Zhang Y, Bhattacharyya S. Genetic programming inclassifying large-scale data: an ensemble method. InfSci 2004, 163:85–101.

106. Folino G, Pizzuti C, Spezzano G. GP ensembles forlarge-scale data classification. IEEE Trans EvolutComput 2006, 10:604–616.

107. Holden N, Freitas A. Hierarchical classification ofprotein function with ensembles of rules and parti-cle swarm optimisation. Soft Comput 2009, 13:259–272.

108. Dam HH, Abbass HA, Lokan C. DXCS: an XCS sys-tem for distributed data mining. In GECCO ’05: Pro-ceedings of the 7th Annual Conference on Geneticand Evolutionary Computation. ACM, New York;2005, 1883–1890.

109. Bull L, Studley M, Bagnall A, Whittley I. Learningclassifier system ensembles with rule-sharing. IEEETrans Evolut Comput 2007, 11:496–502.

110. Holland JH, Reitman JS. Cognitive systems based onadaptive algorithms. In: Hayes-Roth D, Waterman F,




eds. Pattern-directed Inference Systems. New York:Academic Press; 1978, 313–329.

111. Amdahl G. Validity of the single processor approachto achieving large-scale computing capabilities. In:AFIPS Conference Proceedings. ACM, New York;1967, 483–485.

112. Llora X, Garrell J. Knowledge-independent datamining with fine-grained parallel evolutionary algo-rithms. In: GECCO ’01: Proceedings of the 3th An-nual Conference on Genetic and Evolutionary Com-putation. Morgan Kaufmann San Mateo, CA; 2001,461–468.

113. Llora X, and Garrell J. Evolving partially-defined in-stances with evolutionary algorithms. In Proceedingsof the 18th International Conference on MachineLearning (ICML’2001). Morgan Kaufmann Publish-ers, San Mateo, CA; 2001, 337–344.

114. Scheidler A., Middendorf M. Learning classifier sys-tems to evolve classification rules for systems ofmemory constrained components. Evolut Intell 2011,4:127–143.

115. Sastry K, Goldberg DE, Llora X. Towards billion-bit optimization via a parallel estimation of distri-bution algorithm. In: GECCO ’07: Proceedings ofthe 9th Annual Conference on Genetic and Evolu-tionary Computation. ACM, New York; 2007, 577–584.

116. Jin C, Vecchiola C, Buyya R. MRPGA: An extensionof mapreduce for parellelizing genetic algorithms. In:Press I, ed. IEEE Fouth International Conference oneScience 2008. IEEE: New York; 2008, 214–221.

117. Verma A, Llora X, Goldberg DE, Campbell RH. Scal-ing genetic algorithms using mapreduce. In: ISDA ’09:Proceedings of the 2009 Ninth International Confer-ence on Intelligent Systems Design and Applications.Washington, D.C.: IEEE Computer Society; 2009,13–18.

118. Llora X. Data-intensive computing for competent ge-netic algorithms: a pilot study using meandre. In:GECCO ’09: Proceedings of the 11th Annual Con-ference on Genetic and Evolutionary Computation,ACM, New York; 2009, 1387–1394.

119. Llora X, Verma A, Campbell RH, Goldberg DE.When huge is routine: scaling genetic algorithmsand estimation of distribution algorithms via data-intensive computing. In: Fernandez de Vega F,Cantu-Paz E, eds. Parallel and Distributed Com-putational Intelligence. Berlin Heidelberg: Springer-Verlag; 2010, 1141.

120. Verma A, Llora X, Venkataraman S, Goldberg DE,Campbell RH. Scaling eCGA model building via data-intensive computing. In: IEEE Congress on Evolu-tionary Computation’10. New York: IEEE; 2010, 1–8.

121. Dean J, Ghemawat S. MapReduce: simplified dataprocessing on large clusters. In: OSDI’04: Sixth Sym-

posium on Operating System Design and Implemen-tation. 2004.

122. Rissanen J. Modeling by shortest data description.Automatica 1978, 14:465–471.

123. Llora X, Acs B, Auvil L, Capitanu B, Welge M,Goldberg DE. Meandre: semantic-driven data-intensive flows in the clouds. In: Proceedings ofthe 4th IEEE International Conference on e-Science.IEEE, New York; 2008, 238–245.

124. Urbanowicz RJ, Granizo-MacKenzie A, Moore JH.Instance-linked attribute tracking and feedback formichigan-style supervised learning classifier systems.In: GECCO ’12: Proceedings of the 14th AnnualConference on Genetic and Evolutionary Computa-tion. ACM Press, New York; 2012, 927–934.

125. Won KJ, Saunders C, Prugel-Bennett A. Evolvingfisher kernels for biological sequence classifi-cation. Evolut Comput. Available at: http://www.mitpressjournals.org/doi/abs/10.1162/EVCOa 00065?journalCode=evco. (Accessed December19, 2011).

126. Wilson S.W. Classifier fitness based on accuracy. Evo-lut Comput 1995, 3:149–175.

127. Bull L, ed. Applications of Learning Classifier Sys-tems. Berlin: Springer-Verlag; 2004.

128. Ghosh A, Jain LC, eds. Evolutionary Computation inData Mining. Berlin: Springer-Verlag; 2005.

129. Bull L, Bernado-Mansilla E, Holmes J, eds. LearningClassifier Systems in Data Mining. Berlin: Springer-Verlag; 2008.

130. del Jesus MJ, Gamez JA, Puerta JM. Special issue onevolutionary and metaheuristics based data mining(EMBDM). Soft Comput 2009, 13:209–318.

131. Shafi K, Abbass HA. An adaptive genetic-based sig-nature learning system for intrusion detection. ExpertSyst Appl 2009, 36:12036–12043.

132. Bacardit J, Krasnogor N. A mixed discrete-continuousattribute list representation for large scale classifica-tion domains. In: GECCO ’11: Proceedings of the11th Annual Conference on Genetic and Evolution-ary Computation. ACM, New York; 2009, 1155–1162.

133. Marın-Blazquez JG, Martınez Perez G. Intrusion de-tection using a linguistic hedged fuzzy-XCS classifiersystem. Soft Comput 2009, 13:273–290.

134. Lesk AM. Introduction to Protein Structure. Oxford:Oxford University Press; 2001.

135. Moult J, Fidelis K, Kryshtafovych A, Rost B,Tramontano A. Critical assessment of methods ofprotein structure prediction. Round VIII. Proteins2009, 77:1–4.

136. Baldi P, Pollastri G. The principled design oflarge-scale recursive neural network architecturesDAG-RNNS and the protein structure predictionproblem. J Mach Learn Res 2003, 4:575–602.




137. Punta M, Rost B. PROFcon: novel prediction of long-range contacts. Bioinformatics 2005, 21:2960–2968.

138. Ezkurdia I, Grana O, Izarzugaza JMG, Tress ML.Assessment of domain boundary predictions and theprediction of intramolecular contacts in CASP8. Pro-teins 2009, 77:196–209.

139. Monastyrskyy B, Fidelis K, Tramontano A,Kryshtafovych A. Evaluation of residue-residue con-tact predictions in CASP9. Proteins 2011, 79:119–125.

140. Butz MV, Lanzi PL, Wilson SW. Hyper-ellipsoidalconditions in XCS: rotation, linear approximation,and solution structure. In: GECCO ’06: Proceedingsof the 8th Annual Conference on Genetic and Evolu-

tionary Computation. ACM, New York; 2006, 1457–1464.

141. Witten IH, Frank E. Data Mining: Practical MachineLearning Tools and Techniques. San Mateo, CA:Morgan Kaufmann; 2005.

142. Alcala Fdez J, Sanchez L, Garcıa S, del Jesus M,Ventura S, Garrell J, Otero J, Romero C, Bac-ardit J, Rivas V, Fernandez J, Herrera. F. Keel: asoftware tool to assess evolutionary algorithms fordata mining problems. Soft Comput 2009, 13:307–318.

143. Mahout: Scalable machine learning and data mining.Available at: http://mahout.apache.org/. (AccessedNovember 21, 2012).
