Evolving optimum populations with XCS classifier systems
XCS with code fragmented action
Muhammad Iqbal • Will N. Browne •
Mengjie Zhang
Published online: 30 September 2012
© Springer-Verlag 2012
Abstract The main goal of this research direction is to
extract building blocks of knowledge from a problem
domain. Once extracted successfully, these building blocks
are to be used in learning more complex problems of the
domain, in an effort to produce a scalable learning classi-
fier system (LCS). However, whilst current LCS (and other
evolutionary computation techniques) discover good rules,
they also create sub-optimum rules. Therefore, it is difficult
to separate good building blocks of information from oth-
ers without extensive post-processing. In order to provide
richness in the LCS alphabet, code fragments similar to
tree expressions in genetic programming are adopted. The
accuracy-based XCS concept is used as it aims to produce
maximally general and accurate classifiers, albeit the rule
base requires condensation (compaction) to remove spuri-
ous classifiers. Serendipitously, this work on scalability of
LCS produces compact rule sets that can be easily con-
verted to the optimum population. The main contribution
of this work is the ability to clearly separate the optimum
rules from others without the need for expensive post-
processing for the first time in LCS. This paper identifies
that consistency of action in rich alphabets guides LCS to
optimum rule sets.
Keywords Learning classifier systems · XCS · Optimal populations · Scalability · Code fragments · Action consistency
1 Introduction
Human beings have the ability to apply the domain
knowledge learned from a smaller problem to more com-
plex problems of the same or a related domain, but cur-
rently machine learning techniques lack this ability. This
lack of ability to apply the already learned knowledge of a
domain results in consuming more resources and time to
solve more complex problems of the domain. As the
problem scales, it becomes difficult, and sometimes even
impractical (if not impossible), to solve due to the required
resources and time.
Therefore, a system is needed that has the ability to
reuse the learned knowledge of a problem domain to scale
in the domain. If the system has modularity and coopera-
tion among the modules, it can help to reuse the knowledge
effectively using an evolutionary mechanism. A learning
classifier system (LCS) is an evolutionary adaptive system
that learns a problem using a set of rules cooperating
among each other (Holland 1986), whereas most of the
traditional evolutionary techniques have competition only
among the individuals.
Typically, an LCS represents a rule-based agent that
incorporates evolutionary computing and machine learning
to solve a given task. The rules are of the form ‘‘if state
then action’’. An LCS learns by interacting with the
environment. Typically, it starts learning by covering the
individual data patterns given as input from the environ-
ment in rules, such as ‘100110 : 1’, i.e. if input is ‘100110’
then the action will be ‘1’. It generalizes the population of
M. Iqbal (&) � W. N. Browne � M. Zhang
School of Engineering and Computer Science,
Victoria University of Wellington,
Wellington 6140, New Zealand
e-mail: [email protected]
W. N. Browne
e-mail: [email protected]
M. Zhang
e-mail: [email protected]
Soft Comput (2013) 17:503–518
DOI 10.1007/s00500-012-0922-5
classifiers by attempting to remove irrelevant information.
Usually, the generality is achieved using the common ter-
nary alphabet {0, 1, #} where ‘#’ is the ‘don’t care’ symbol
that can be either 0 or 1, e.g. ‘10##1#:1’.
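As an illustrative sketch (not part of the paper), matching a ternary-alphabet rule against a binary input reduces to a positionwise comparison; the helper name `matches` is our own:

```python
def matches(condition: str, state: str) -> bool:
    """A ternary condition matches a binary state if every position
    is either the 'don't care' symbol '#' or equal to the state bit."""
    return all(c == '#' or c == s for c, s in zip(condition, state))

# The generalized rule '10##1#:1' fires on any input agreeing on its fixed bits:
assert matches('10##1#', '100110')
assert matches('10##1#', '101111')
assert not matches('10##1#', '000110')  # first bit disagrees
```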
The LCS technique can scale in problem domains, but
has to relearn from the start each time. Further, increased
dimensionality of the problem, resulting in increased
search space, demands large memory space and leads to
much longer training times; and eventually restricts the
LCS to a limit in problem size. By explicitly feeding
domain knowledge to an LCS, scalability can be achieved
but it adds bias and restricts use in multiple domains
(Ioannides and Browne 2007). On the other hand, human
beings learn the underlying method to construct a solution,
and so can scale the problem arbitrarily.
The main goal of this research direction is to develop an
LCS capable of autonomously scalable learning, from
small problems to more complex problems of the same or a
related domain, in a similar behavior to human beings
(Thrun 1996). A modular approach to learning in LCS is to
be adopted so that building blocks of knowledge may be
formed and utilized. It is anticipated that the modular
approach will make the system more scalable. The rela-
tively fitter modules from a learning system trained against
a small problem will be put in a higher level problem in the
same or a related domain to reduce the learning time of the
system.
In order to extract building blocks of knowledge in the
form of reusable modules or functions, a richer encoding
scheme will be used in the work presented here. In this
encoding representation, the action will be replaced by a
code fragment while using the typical ternary alphabet for
the condition of a classifier. A code fragment is a tree
expression, similar to a tree generated in genetic pro-
gramming (see Sect. 2.2).
Wilson’s accuracy-based XCS (Wilson 1995), the most
popular learning classifier system, is used to implement and
test the proposed system. It is a well studied and tested
‘‘non-supervised’’ reinforcement learning clas-
sifier system. In XCS the genetic algorithm is applied to an
action set instead of the whole population to conserve
similar building blocks of information. These features of
XCS make it possible to form a complete and accurate
mapping from inputs and actions to payoff predictions. The
ability of XCS to produce a complete and accurate solution,
for a given problem, motivated its suitability for this
research work. If a learning system is unable to produce a
complete and accurate solution, then the extracted building
blocks lack important knowledge. These building blocks,
missing vital knowledge, are not suitable candidates to be
used for scalability of the system.
Serendipitously, this work on scalability of LCS pro-
duced compact rule sets that were easily converted to the
optimum population. Where alternative good rules were
discovered, the most condensed were selected. The aim of
this paper is to detail and investigate this novel approach in
LCS to determine the mechanism by which optimum
solutions for the given setup are produced. Once the
method for successful learning has been determined, this
technique will be applied to a broad range of problems, but
this work is beyond the scope of this paper.
The rest of the paper is organised as follows. Section 2
describes genetic algorithms, building blocks, genetic pro-
gramming, learning classifier systems and accuracy-based
learning classifier systems. In Sect. 3, the novel imple-
mentation of XCS using code fragmented actions is
detailed. Section 4 introduces the multiplexer problem
domain and experimental setup used in the experimenta-
tion. In Sect. 5, experimental results are presented and
compared with the standard implementation of XCS using
static binary actions. Section 6 is a discussion, elaborating on
action consistency and the production of the optimum solution.
In the last section, this work is concluded and the future
work is outlined.
2 Background
Evolutionary computational techniques (Eiben and Smith
2003; Jong 2006) are based on the ideas of Darwin’s theory
of survival of the fittest. An evolutionary process starts with
a population of, usually randomly generated, individuals
where each individual represents a potential problem
solution or a part of the solution. Each individual is eval-
uated to determine its utility or fitness for the given
problem. Relatively fit members of the population are
selected to create new offspring and the worst members
may be deleted from the population. This process of evo-
lution is repeated for a fixed number of times or until an
ending criterion is met.
In the following subsections, two of the most common
evolutionary techniques namely genetic algorithms and
genetic programming are briefly described, as they are
directly related to the work presented here. This is followed
by an introduction to learning classifier systems.
2.1 Genetic algorithms and building blocks
The discovery component of an LCS is commonly imple-
mented using a genetic algorithm (GA). An LCS seeks to
evolve a population of co-operative rules, where each
individual rule is optimized using the GA.
GAs are population-based search algorithms (Holland
1975) where each individual member of the population
is usually represented by a bitstring of finite length. A
population of randomly generated individuals, each
representing a potential problem solution, is used at the
start. After that, each individual is evaluated to determine
its utility or fitness for the given problem. Then, new
populations are generated and evaluated, using genetically
inspired operations of reproduction, crossover and muta-
tion. The individuals are probabilistically selected to par-
ticipate in the genetic operations based on their fitness.
Reproduction is a process in which the selected individual
strings are simply copied to the new generation, i.e. sur-
vival of the fittest individuals. In the crossover operation,
two new individuals are generated by swapping elements
of two selected individuals hypothesized to be among the best.
Mutation operates at the bit level by randomly flipping bits
within the current individual. This process of evolution is
repeated for a fixed number of times or until an ending
criterion is met. The fittest individual in the final popula-
tion is taken as the solution to the problem. GAs are purely
competitive with the whole population aiming to converge
to the optimal solution, whereas LCSs are competitive only
in niches, i.e. separable subsolutions of the problem
domain that when combined cover the complete space of
the problem.
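The generational cycle described above can be sketched as a short program. This is a minimal illustration under assumed details (a OneMax-style toy fitness, roulette-wheel selection, one-point crossover), not the specific GA used inside an LCS:

```python
import random

def one_max(individual):
    """Toy fitness: number of 1-bits (higher is fitter)."""
    return sum(individual)

def next_generation(pop, pc=0.9, pm=0.01, rng=None):
    """One GA generation: fitness-proportionate (roulette-wheel) selection,
    one-point crossover with probability pc, per-bit mutation rate pm."""
    rng = rng or random.Random()
    fitnesses = [one_max(ind) for ind in pop]
    total = sum(fitnesses) or len(pop)

    def select():
        r = rng.uniform(0, total)
        acc = 0.0
        for ind, f in zip(pop, fitnesses):
            acc += f
            if acc >= r:
                return ind
        return pop[-1]

    new_pop = []
    while len(new_pop) < len(pop):
        a, b = select()[:], select()[:]          # reproduction (copies)
        if rng.random() < pc:                    # one-point crossover
            point = rng.randrange(1, len(a))
            a, b = a[:point] + b[point:], b[:point] + a[point:]
        for child in (a, b):
            for i in range(len(child)):          # bit-flip mutation
                if rng.random() < pm:
                    child[i] ^= 1
            new_pop.append(child)
    return new_pop[:len(pop)]

rng = random.Random(42)
population = [[rng.randint(0, 1) for _ in range(16)] for _ in range(20)]
for _ in range(5):
    population = next_generation(population, rng=rng)
assert len(population) == 20 and all(len(ind) == 16 for ind in population)
```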
To better understand the behavior and performance of
genetic algorithms in any evolutionary system, Goldberg
(Goldberg 1989) has studied and described a GA using the
concept of schema. A schema is a similarity template for
describing a set of finite-length strings defined over a finite
alphabet. For example, if the alphabet is {0, 1, *} then the
schema ‘‘1**0’’ is describing all strings of length four that
start with symbol 1 and end with symbol 0 such as 1000,
1010, 1100, and 1110. It is to be noted that ‘*’ is treated as a
‘don’t care’ symbol here, meaning it can be either 0 or 1.
The distance between first and last specific string positions
in a schema H is called its defining length, denoted by
d(H), and the number of specific positions in it is called its
order, denoted by o(H). For example the defining length of
the schema ‘‘1**0’’ is 3 and the order is 2, whereas the
defining length of the schema ‘‘101*’’ is 2 and the order is 3.
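The two schema measures can be computed directly. A small sketch (the function names are our own) that reproduces the worked values above:

```python
def order(schema: str) -> int:
    """o(H): the number of specific (non-'*') positions."""
    return sum(1 for c in schema if c != '*')

def defining_length(schema: str) -> int:
    """d(H): distance between the first and last specific positions."""
    fixed = [i for i, c in enumerate(schema) if c != '*']
    return fixed[-1] - fixed[0] if fixed else 0

assert (defining_length('1**0'), order('1**0')) == (3, 2)
assert (defining_length('101*'), order('101*')) == (2, 3)
```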
Goldberg hypothesized that higher performance indi-
viduals are actually generated as a result of the combina-
tion of short-length, low-order and high-performance
schemata (Goldberg 1989). These schemata are called the
building blocks of the system. These building blocks are
likely to be selected and combined via crossover to produce
longer and fit individuals in a GA. These building blocks
are also relatively less affected by mutation. Goldberg's
assumption that this is how a GA works is termed the
building block hypothesis.
However, for a population of individuals represented by
fixed length strings, the genetic operators sometimes can-
not process the building blocks effectively as a random
crossover point may lie within the building block. To avoid
this disruption of partial solutions by the genetic operators,
a probability distribution-based approach, known as Esti-
mation of Distribution Algorithm (EDA), was developed
(Mühlenbein and Paaß 1996). In the various forms of EDAs,
the crossover and mutation operators are replaced by
generating new offspring according to the probability dis-
tribution of the selected individuals (Pelikan et al. 2002).
A sample from a schema such as ‘‘1**0’’ is implicitly
sampling from many other schemata too, like ‘‘10*0’’,
‘‘11*0’’, ‘‘1*00’’, and ‘‘1*10’’. It has been estimated that in
any generation of a population of n individuals, the number
of schemata being processed by the GA is proportional to
n³. This inherently parallel processing of a large quantity of
schemata is known as implicit parallelism.
The schema theory has been criticized due to its weak
theoretical foundations (Altenberg 1995; Beyer 1997;
Burjorjee 2008), but still remains a popular tool to explain
the power of GAs (Poli and Langdon 1998; Poli 2000;
Drugowitsch 2008).
2.2 Genetic programming
Commonly, genetic programming (GP) uses a much richer
alphabet than GA to encode the solution, i.e. more
expressive symbols that can express functions as well as
numbers. A GP like alphabet to describe the problem is
used in the LCS developed here, so the GP technique is
described to aid understanding.
GP is an evolutionary approach to generating computer
programs for solving a given problem automatically (Koza
1992; Banzhaf et al. 1998; Poli et al. 2008). GP is an
extension of the GA in which the structures in the popu-
lation are not fixed-length character strings that encode
candidate solutions to a problem, but programs that, when
executed, generate the candidate solutions to the problem.
The task to be solved is represented by a primitive set of
operations, known as the function set, and a set of oper-
ands, known as the terminal set. The generated computer
programs are commonly represented by a tree. The internal
nodes of the tree are functions and leaves are the terminals.
To generate a computer program for regression and
classification using GP, a set of (input, output) pairs is
needed for training the candidate solutions along with sets
of functions and terminals. GP attempts to construct a
computer program that maps each of the (input, output)
pairs correctly. For example, if the (input, output) pairs set
is {(0, 1), (1, 3), (2, 7), (3, 13), (4, 21), ...} and {+, -, *, /}
and {x, 1} are the function set and terminal set, respectively,
then the optimal corresponding GP generated program is as
shown in Fig. 1.
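A minimal sketch of how such a program can be represented and evaluated; the tuple-based tree encoding and the protected division are our own assumptions, not details from the paper:

```python
import operator

# Protected division (returning 1 on division by zero) is a common GP convention.
OPS = {'+': operator.add, '-': operator.sub,
       '*': operator.mul, '/': lambda a, b: a / b if b else 1}

def evaluate(tree, x):
    """Recursively evaluate a tuple-based GP tree for input value x."""
    if tree == 'x':
        return x
    if isinstance(tree, (int, float)):
        return tree
    op, left, right = tree
    return OPS[op](evaluate(left, x), evaluate(right, x))

# The program of Fig. 1, (x * x) + (x + 1), maps all the training pairs:
program = ('+', ('*', 'x', 'x'), ('+', 'x', 1))
assert [evaluate(program, x) for x in range(5)] == [1, 3, 7, 13, 21]
```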
GP typically starts with a population of randomly gen-
erated individuals, i.e. computer programs, composed of
the given functions and terminals. The initial individuals
are usually generated subject to a pre-established
maximum size (Koza and Poli 2005). GP iteratively
transforms a population of computer programs into a new
generation of the population by applying the genetic
operations of reproduction, crossover and mutation to
selected individuals from the population. The individuals
are probabilistically selected to participate in the genetic
operations based on their fitness. Reproduction involves
simply copying certain individuals into the new population.
Given copies of two parent trees, typically, crossover
involves randomly selecting a crossover point in each
parent tree and swapping the sub-trees rooted at the
crossover points. Traditional mutation consists of randomly
selecting a mutation point in a tree and substituting the sub-
tree rooted there with a randomly generated sub-tree. This
process of evolution is repeated for a fixed number of times
or until an ending criterion is met. The fittest individual
program in the final population is used to compute the
solution to the problem.
GP is a technique that can produce a computer pro-
gram automatically to maximally map an (input, output)
pairs set. However, to generate this program it needs
many CPU cycles and much memory space (Robilliard
et al. 2009). The GP-generated computer program is
normally represented as a tree, which may contain
unnecessary terms (bloat) and non-optimum expressions
(a phenotypic behavior may not be represented by the
most compact genotype). These problems are usually
addressed by limiting maximal allowed depth for an
individual tree and/or using a fitness measure that pun-
ishes excess sized individuals (Luke and Panait 2006).
The other ways to control bloat in genetic programing
include simplifying individual programs using algebraic
and numerical simplification methods (Kinzett et al.
2009), or using specific bloat control operators (Alfaro-
Cid et al. 2010).
A GP system produces a tree as a ‘single’ solution,
rather than a co-operative set of rules as in an LCS. It
generally requires supervised learning (Russell and Norvig
2011) with the whole training set, rather than on-line,
reinforcement learning (Sutton and Barto 1998) as in LCS.
2.3 Learning classifier systems
Traditionally, an LCS represents a rule-based agent that
incorporates evolutionary computing and machine learning
to solve a given task, enacting in an unknown environment
via a set of sensors for input and a set of effectors for
actions. The rules are of the form ‘‘if state then action’’.
After observing the current state of the environment, the
agent performs an action, and the environment provides a
reward, as depicted in Fig. 2.
Although LCS predates the field of memetic algorithms
(Tenne and Armfield 2009; Acampora et al. 2011), the
approaches show similarities. Each classifier rule can be
considered a heuristic; an LCS, operating on a set of such
heuristics, is thus a metaheuristic. LCS is a hybridized technique using rein-
forcement learning to evaluate and propagate rule utility
coupled with the GA to evolve global optimal classifiers.
An LCS is an adaptive system that learns to perform the
best action receiving maximum reward from the environ-
ment for a given input. An LCS is adaptive in the sense that
its ability to choose the best action improves with experi-
ence. Reward received for a given action is used by the
LCS to alter the likelihood of taking that action, in those
circumstances, in the future. To understand how this works,
see Sect. 2.3.1.
There are two important families of LCSs: the Pitts-
burgh (Smith 1980) and Michigan (Booker et al. 1989)
approaches, where Michigan is considered in this work. In
Fig. 1 A GP generated tree program to map the set of (input, output)
pairs {(0, 1), (1, 3), (2, 7), (3, 13), (4, 21), ...}. This GP tree is
equivalent to the output expression (x * x) + (x + 1), where x is the input
Fig. 2 Schematic depiction of a learning classifier system
a Michigan-style LCS, the population consists of a single set of
co-operative rules, i.e. each individual represents a unique,
distinct rule. The goal here is to find the best set of clas-
sifier rules that, when applied, gain an optimum result for
the problem to be solved. Michigan-style LCSs have two
main types of fitness definitions: strength-based, e.g. ZCS
(Wilson 1994) and accuracy-based, e.g. XCS (Wilson
1995), where the latter is adopted here as it provides a
complete mapping of states to reward.
LCS can be applied to a wide range of problems (Lanzi
et al. 2000) including reinforcement learning problems,
classification problems and function approximation (Butz
2007). LCS have also been adapted to supervised learning
where the environment also returns the ‘correct’ optimal
action through the UCS (sUpervised Classifier System)
framework (Orriols-Puig and Bernadó-Mansilla 2006).
UCS is an accuracy-based LCS, specifically designed for
supervised learning problems. In XCS, fitness is computed
using a reinforcement learning scheme so it provides a
complete mapping of states to reward, whereas in UCS
fitness is calculated from a supervised learning perspective
so it evolves a best action map (Bernadó-Mansilla and
Garrell-Guiu 2003). UCS can only be applied to single-step
classification tasks, where supervision is available. How-
ever, XCS is more general and can be applied to multi-step
problems and online environments, i.e. interacting with the
environment and obtaining reward according to the per-
formed action.
2.3.1 Accuracy-based learning classifier system
XCS is a formulation of LCS that uses accuracy-based
fitness to learn the problem by forming a complete map-
ping of states and actions to rewards.1 In XCS, the learning
agent evolves a population [P] of classifiers, where each
classifier consists of a rule and a set of associated param-
eters estimating the quality of the rule. Each rule is of
the form ‘if condition then action’, having two parts: a
condition and the corresponding action. Commonly, the
condition is represented by a fixed length bitstring defined
over the ternary alphabet {0, 1, #}, and the action is
represented by a numeric constant.
Each classifier has three main parameters: (1) prediction
p, an estimate of the payoff that the classifier will receive if
its action is selected, (2) prediction error ε, which estimates
the error between the classifier’s prediction and the
received payoff, and (3) fitness F, computed as an inverse
function of the prediction error. In addition, each classifier
keeps an experience parameter exp, which is a count of the
number of times it has been updated, and a numerosity
parameter n, which is a count of the number of copies of
each unique classifier.
The agent has two modes of operation, explore (train-
ing) and exploit (application). In the following, XCS
operation is concisely described. For a complete descrip-
tion, the interested reader is referred to the original XCS
papers by Wilson (Wilson 1995, 1998), and to the algo-
rithmic details by Butz and Wilson (2002).
In the explore mode, the agent attempts to obtain
information about the environment and describe it by
creating the decision rules:
1. observes the current state of the environment, s ∈ S,
where S is the set of all possible environmental states.
The current state s is usually represented by a fixed
length bitstring defined over the binary alphabet {0, 1}.
2. selects classifiers from the classifier population [P] that
have conditions matching the state s, to form the
match set [M].
3. performs covering: for every action ai ∈ A in the set of
all possible actions, if ai is not represented in [M] then a
random classifier is generated with a given generaliza-
tion probability such that it matches s and advocates
ai, and added to the population (termed covering).2 The
prediction, prediction error, and fitness of the generated
classifier are set to very small initial values.
4. forms a system prediction array, P(ai) for every ai ∈ A
that represents the system’s best estimate of the payoff
should the action ai be performed in the current state s.
Commonly, P(ai) is a fitness weighted average of the
payoff predictions of all classifiers advocating ai.
5. selects an action a to explore (probabilistically or
randomly) and selects all the classifiers in [M] that
advocated a to form the action set [A].
6. performs the action a, records the reward r from the
environment, and uses r to update the associated
parameters of all classifiers in [A]. On receiving the
environmental reward r, the parameters of each clas-
sifier j in the action set [A] are updated as follows:3 First
of all, the experience expj is increased by one. Then, the
prediction error εj is updated: εj ← εj + β(|r − pj| − εj)
for expj > 1/β, otherwise εj is set to the average of the
values |r − pj| seen so far, where β (0 ≤ β ≤ 1) is the
learning rate and pj is the prediction
of the classifier j. Next, the prediction pj is adjusted:
pj ← pj + β(r − pj) for expj > 1/β, otherwise pj is set to
the average of the rewards r received so far. After that,
the classifier’s accuracy is computed as
1 For a detailed review of different types and approaches in LCS refer
to (Urbanowicz and Moore 2009).
2 If the classifier population size grows larger than the specified limit,
then one of the classifier rules has to be deleted so that the new rule
can be inserted.
3 Currently only single step problems are under investigation, so the
parameter updates described here are for single step problems.
For multi-step problems, parameter updating occurs on the previous
action set [A]−1, as described in (Wilson 1995, 1998).
an inverse function of the classifier’s error: κj = α(εj/ε0)^−ν
for εj ≥ ε0, otherwise 1. The parameter
ε0 (ε0 > 0) determines the threshold error under which a
classifier is considered to be accurate, providing
robustness to noise. The parameters α (0 < α < 1)
and ν (ν > 0) control the degree of decline in accuracy
if the classifier is inaccurate (Butz et al. 2001). The
parameter ν separates rules of similar fitness to increase
the probability of selecting better rules. Then,
the relative accuracy κ′j is computed by dividing the
accuracy κj by the sum of the accuracies in the
action set. Finally, the fitness Fj is updated according to
the classifier’s relative accuracy: Fj ← Fj + β(κ′j − Fj).
Note that basing fitness on the relative accuracies
provides fitness sharing among the classifiers belonging
to the same action set. Fitness sharing allocates
resources to niches evenly, i.e. unbalanced classes or
complex classes do not get ignored.
7. when appropriate, implements rule discovery by
applying an evolutionary mechanism (commonly a
GA) in the action set [A], to introduce new classifiers
to the population. First of all, two parent classifiers are
selected from [A] based on fitness and the offspring are
created out of them. Next, the conditions of the
offspring are crossed with probability χ and then each
bit in the conditions is mutated with probability μ such
that both offspring match the currently observed state
s. After that, the actions of the produced offspring are
mutated with probability μ.
The experience and numerosity of the offspring are set
to 0 and 1, respectively. If the offspring are crossed and/
or mutated then their prediction is set to the average of
the parents’ values. The prediction error and fitness of
the crossed and/or mutated offspring are set to the
average of the parents’ values reduced by constants pre-
dictionErrorReduction and fitnessReduction, respec-
tively (Butz 2000). It is to be noted that in XCS, only
two children are produced by evolutionary operation, as
opposed to typical GA and GP evolution where the
whole population is replaced by the newly generated
individuals. In XCS, the genetic operations are applied
in sequence on two selected parent classifiers to
produce two offspring, whereas in the GA and GP the
genetic operations are applied in parallel on the whole
population of individuals to produce the new genera-
tion of individuals that replace all the current gener-
ation. The XCS rule discovery operation is illustrated
graphically with an example in Sect. 3.
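The single-step parameter updates of step 6 can be sketched in code. Parameter names mirror the text (β, ε0, α, ν); the running-average behaviour for young classifiers (exp ≤ 1/β) and the initial field values are our assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class Classifier:
    p: float = 10.0      # payoff prediction
    eps: float = 0.0     # prediction error
    F: float = 0.01      # fitness
    exp: int = 0         # experience (number of updates)

def update_action_set(action_set, r, beta=0.2, eps0=10.0, alpha=0.1, nu=5.0):
    """Update every classifier in the action set [A] on reward r."""
    for cl in action_set:
        cl.exp += 1
        if cl.exp <= 1 / beta:                   # running average for young rules
            cl.eps += (abs(r - cl.p) - cl.eps) / cl.exp
            cl.p += (r - cl.p) / cl.exp
        else:                                    # Widrow-Hoff delta rule
            cl.eps += beta * (abs(r - cl.p) - cl.eps)
            cl.p += beta * (r - cl.p)
    # accuracy: 1 if below the error threshold, else a steep power-law decline
    kappa = [1.0 if cl.eps < eps0 else alpha * (cl.eps / eps0) ** -nu
             for cl in action_set]
    total = sum(kappa)
    # fitness moves toward the relative accuracy (fitness sharing in the niche)
    for cl, k in zip(action_set, kappa):
        cl.F += beta * (k / total - cl.F)
```

Note that the prediction error is updated before the prediction, as in the text, so it measures the error against the classifier's previous estimate.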
Additionally, the explore mode may perform subsump-
tion, to merge specific classifiers into any more general and
accurate classifiers. If an offspring generated by the GA has
the same action as that of the parents, then its parents are
examined to see if either of them: (1) has an experience
value greater than a threshold, (2) is accurate, and (3) the
environmental inputs it matches are a superset of the inputs
matched by the offspring. If this test is satisfied, the off-
spring is discarded and the numerosity of the parent is
incremented by one. A similar check for subsumption can
be done in action sets to subsume any less general classi-
fiers in an action set [A] by the most general subsumer
classifier in the set [A]. Subsumption deletion is a way of
biasing the genetic search towards more general, but still
accurate, classifiers (Wilson 1998). It also effectively
reduces the number of classifier rules in the final popula-
tion (Kovacs 1996).
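For the ternary alphabet, the generality test used in subsumption (condition (3) above) reduces to a positionwise check; a sketch with our own function name:

```python
def could_subsume(general: str, specific: str) -> bool:
    """True if every input matched by `specific` is also matched by
    `general`: each position of `general` is '#' or equals `specific`'s."""
    return all(g == '#' or g == s for g, s in zip(general, specific))

assert could_subsume('1###', '10#1')       # '1###' covers a superset of inputs
assert not could_subsume('10#1', '1###')   # the more specific rule cannot subsume
```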
In contrast to the explore mode, in the exploit mode the
agent does not attempt to discover new information and
simply performs the action with the best predicted payoff.
The exploit mode is also used to test learning performance
of the agent in the application.
The generalization property in LCS allows a single rule to
cover more than one state provided that the action-reward
mapping is similar. Traditionally, generalization in LCS
classifier conditions is achieved by the use of a special ‘don’t
care’ symbol (#) in the ternary representation, which matches
any value of a specified attribute in the vector describing the
state s. The next section presents various other representa-
tions that have been successfully used in XCS.
2.3.2 XCS’s variations
Various richer encoding schemes have been investigated in
the LCS research community in an attempt to improve the
generalization, to obtain compact classifier rules, to reach the
optimal performance faster, and to investigate scalability of
the learning system. Most of these schemes have been
implemented on Wilson’s XCS, which is the most tested and
the best performing model of learning classifier systems.
In 1999, Lanzi experimented with two different ways to
represent classifier conditions: firstly the fixed-length bit-
strings coding of classifier conditions was replaced with a
variable-length messy coding (Lanzi 1999) in which
environmental inputs were translated into the bitstrings that
have no positional linking between the bits in classifier
condition and any feature in the environmental input. Then,
he extended a step further from messy coding to a more
complex representation in which S-expressions were used
to represent general classifier conditions (Lanzi and
Perrucci 1999).4
In 2002, Wilson introduced the idea of computed pre-
diction, as a function of classifier condition and a weight
4 S-expressions are list-based data structures that are suitable for
representing arbitrary complex data. An S-expression is defined
recursively as either a byte-string or a list of simpler S-expression. For
a more detailed description of S-expressions refer to (Rivest 1997).
vector, to learn approximations to functions (Wilson 2002).
The classifier condition was changed from ternary alphabet
string to a concatenation of interval-based numeric values.
The implemented system is known as XCSF in the LCS
research community. Lanzi et al. (2005) used XCSF for the
learning of Boolean functions. They have shown that
XCSF can produce more compact classifier rules as com-
pared to XCS, since the use of computed prediction allows
more general solutions (Lanzi et al. 2007).
In 2006, Butz et al. incorporated the EDA mechanism in
XCS to identify and process building blocks for solving
hierarchical decomposable binary classification problems
(Butz et al. 2006). They have used extended compact GA
(ECGA) and the Bayesian optimization algorithm (BOA) to
estimate the probability of distribution. In domains con-
taining building blocks, this approach has shown the benefits
of not using the potentially destructive crossover operation.
In 2007, Charalambos and Browne investigated scaling
of an abstract LCS using pre-constructed functions for a
specific problem domain (Ioannides and Browne 2007).
They implemented classifier conditions as a combination of
ternary and S-expression alphabets. They have shown that
using domain-relevant functions the scalability of XCS can
be improved. However, without domain knowledge the
appropriate functions for a problem need to be automati-
cally discovered.
In July 2007, Lanzi and Loiacono (2007) introduced a
version of XCS with computed actions, named XCSCA, to
be used for problem domains involving a large number of
actions. The classifier action was computed using a
parametrized function in a supervised fashion. They have
shown that XCSCA can evolve accurate and compact
representations of binary functions which would be diffi-
cult to solve using a typical XCS model. Then in Sep-
tember 2007, they extended XCSCA using support vector
machines to compute the classifier action (Loiacono et al.
2007). This extension resulted in reaching optimal perfor-
mance faster than XCSCA.
A GP-based rich encoding has been used by Ahluwalia and Bull (1999) within a simplified strength-based learning classifier system (Wilson 1994). They used a binary string to represent the condition and an S-expression to represent the action of a classifier rule. This
GP-based LCS generates filters for feature extraction,
rather than performing classification directly. The extracted
features are used by the K-Nearest Neighbour algorithm to
perform classification.
3 XCS with code fragmented action
Motivated from the research findings of computed predic-
tion/action and building blocks processing, the idea of code
fragmented action is investigated to produce reusable
building blocks of information in an attempt to achieve
autonomous scalability in LCS. In the work presented here,
XCS is enhanced by having GP-tree like action, named
code fragmented action, with a generalizing condition. The
proposed XCS with code fragmented action, called
XCSCFA, performs classification directly as opposed to
generating feature extractors in the GP-based LCS devel-
oped by Ahluwalia and Bull (1999).
In XCSCFA, the static binary action is replaced by a
code fragment while using the ternary alphabet in the
condition of the classifier rules. Each code fragment is a
binary tree of depth up to d. The value of d depends on the
length of the condition in a classifier. The function set for the action tree is {AND, OR, NAND, NOR, NOT} and the terminal set is {D0, D1, D2, ..., Dn−1}, where n is the length of the condition in a classifier. A population of classifiers having code fragmented actions is illustrated in Fig. 3. The symbols ‘&’, ‘|’, ‘d’, ‘r’, and ‘*’ denote the AND, OR, NAND, NOR, and NOT operators, respectively. The code fragmented actions are shown in postfix form.
The action value of a classifier is determined by eval-
uating the action code tree. The action code tree is evalu-
ated by replacing the terminal symbols with corresponding
binary bits from the associated condition in the classifier
rule. A ‘don’t care’ symbol (denoted by ‘#’) in the condition is randomly treated as 0 or 1 (see Footnote 5). For example, consider
the classifier shown in Fig. 4. In this classifier the condition
bit D2 is ‘#’. If D2 is 0 then the action value of this clas-
sifier will be 0 and if D2 is 1 then the action value of this
classifier will be 1.
When the rule discovery mechanism is applied in the action set [A] to produce two offspring, the conditions of the offspring are created by applying the GA and the action trees are generated by applying GP-based genetic operations.

Fig. 3 Classifier population using code fragmented actions. Here ‘&’, ‘|’, ‘d’, ‘*’, and ‘r’ denote AND, OR, NAND, NOT, and NOR operators, respectively. The code fragmented actions are shown in postfix form.

(Footnote 5: Four methods to implement code fragmented actions in XCS were tested, but as the results of the other three methods were not illuminating for scaled learning, they are not presented here.)

First of all, two
parent classifiers are selected from [A] based on fitness and
the offspring are created out of them. Next, the conditions
and action trees of the offspring are crossed with proba-
bility v by applying GA and GP crossover operations,
respectively. After that, the conditions of the resulted
children by crossover are mutated with probability l, such
that both children match the currently observed state
s. Then, the action trees of the children are mutated with
probability pm, to replace a subtree of the action with a
randomly generated subtree of depth up to 1.
For example, consider the rule discovery operation
graphically summarized in Fig. 5. Figure 5a shows the
action set containing three classifiers with action value 1,
formed from the classifier population [P] shown in Fig. 3,
against the environmental input s = 001010. First of all,
two parent classifiers are selected, Fig. 5b, from the action
set based on fitness F, and two children are created out of
them, Fig. 5c. Then, in Fig. 5d, the conditions of the
reproduced children are crossed over by applying two-point
GA-crossover operation at the two marked points, and the
action trees of the children are crossed over by applying
GP-crossover operation. The fitness value of crossed over
children is set to the average of parents’ fitness values
reduced by 0.1, as suggested by Butz and Wilson (2002). After that (Fig. 5e), the conditions of the
crossed over children are mutated by applying GA-mutation
such that both children match the currently observed state
s = 001010, and the action trees of the children are mutated
by applying GP-mutation. The final children generated in the
rule discovery operation are shown in Fig. 5f.
It is to be noted that the advantages of subsumption deletion are lost when genotypic differences prevent subsumption from occurring despite phenotypically similar behaviour. Subsumption deletion is made possible, albeit still problematic, by matching the action code on a character-by-character basis. To avoid volatility in the performance of the system due to inconsistency of a classifier's action value (to be discussed in Sect. 6.2), the subsumer must have a consistent action value, in addition to being accurate and experienced.
If a newly created classifier in the rule discovery oper-
ation is not subsumed by the parents and there is no clas-
sifier equal to it in the population, then it will be added to
the population. Two classifiers are considered to be equal,
if and only if both have the same condition and the same
code fragmented action tree.
Fig. 4 A classifier rule with code fragmented action. Here ‘|’ and ‘&’
denote logical OR and logical AND operators, respectively
Fig. 5 Rule discovery by applying GA and GP operations in the
action set [A] formed from the classifier population [P] shown in
Fig. 3, against the environmental input s = 001010. a action set
containing three classifiers with action value 1, b two parent
classifiers selected from the action set based on fitness F, c two
reproduced children, d crossed over children produced by applying
two-point GA-crossover on the conditions and GP-crossover on the
action trees of the reproduced offspring classifiers, with fitness values
equal to average of parents’ fitness values reduced by 0.1, e mutated
children produced by applying GA-mutation on the conditions and
GP-mutation on the action trees of the crossed over offspring
classifiers, and f final children generated in the rule discovery
operation
4 The problem domain and experimental setup
The problem domain used in the experimentation is the
multiplexer problem, which is commonly used by the LCS
research community. Multiplexer problems are considered
to be interesting because they are highly non-linear and,
therefore, relatively difficult to learn. They also allow
generalizations and are suitable for examining the scala-
bility of the algorithm.
4.1 Multiplexer problem domain
A multiplexer is an electronic circuit that accepts n input signals and gives one output signal. The n inputs are divided into two groups: k address bits and the remaining n − k data bits. Here n is of the form n = k + 2^k, hence there are n − k = 2^k data bits. For n input signals, there can be 2^n different input combinations. The values of the address bits
are used to select the data bit to be given as output. In the 6-bits multiplexer problem, there are two address bits and four data bits. If we denote the address bits by A0 and A1 and the data bits by D0, D1, D2, and D3, then the four cases for the 6-bits multiplexer to decide its output signal are
• if A1 = 0 and A0 = 0 then the output is the value of data
bit D0,
• if A1 = 0 and A0 = 1 then the output is the value of data
bit D1,
• if A1 = 1 and A0 = 0 then the output is the value of data
bit D2, and
• if A1 = 1 and A0 = 1 then the output is the value of data
bit D3.
For example, if the input is 011101 then the output will be 1, and if the input is 101101 then the output will be 0 (see Footnote 6).
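The selection rule just described can be written directly in code. The following minimal Python sketch (the function name and the loop that infers k are ours, not from the paper) works for any n of the form k + 2^k, with the address bits A(k−1)…A0 followed by the data bits:

```python
def multiplexer(bits):
    """n-bit multiplexer; n must be of the form k + 2**k (6, 11, 20, 37, ...)."""
    k = 1
    while k + 2 ** k < len(bits):      # infer k from the input length
        k += 1
    assert k + 2 ** k == len(bits), "invalid multiplexer length"
    address = int(bits[:k], 2)         # address bits select one data bit
    return int(bits[k + address])

print(multiplexer('011101'))  # -> 1 (A1A0 = 01 selects D1)
print(multiplexer('101101'))  # -> 0 (A1A0 = 10 selects D2)
```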
Because in the multiplexer problem domain output
signal depends upon the address bits, the set of input
combinations can be generalized. For example in the case
of 6-bits multiplexer problem, the output for inputs
100101, 101100, and 101001 is 0 because in all these three
input combinations, the output is the value of the data bit
D2 and we do not care about the values of the other data bits. If we use ‘#’ as the ‘don’t care’ symbol (it can be either 0 or 1), then these three input instances can be generalized as 10##0# (see Footnote 7). The complete generalized set of inputs, along with the accurate outputs, for the 6-bits multiplexer is shown in Table 1.
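The coverage of a generalized ternary input can be checked by simple enumeration; a minimal illustrative sketch (names ours):

```python
from itertools import product

def expand(pattern):
    """Enumerate all binary inputs covered by a ternary pattern."""
    slots = [('0', '1') if c == '#' else (c,) for c in pattern]
    return [''.join(bits) for bits in product(*slots)]

covered = expand('10##0#')
print(len(covered))         # -> 8, the eight combinations listed in footnote 7
print('101001' in covered)  # -> True
```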
In the experimentation, both address and data bits are
denoted by D (instead of denoting address bits by A and
data bits by D), as the LCS has no domain knowledge
on the purpose of the bits and the ordering of bits is
irrelevant to the learning system provided it is consistent.
As is common practice, the environment uses the left-most bits for the address and the remaining bits for the data; e.g., in the case of the 6-bits multiplexer, D0 and D1 represent the address bits and the remaining D2 to D5 represent the data bits.
4.2 Experimental setup
The system uses the following, commonly used in the
literature, parameter values as suggested by Butz and
Wilson (2002) and Butz (2000): fitness fall-off rate α = 0.1; prediction error threshold ε0 = 10; fitness exponent ν = 5; learning rate β = 0.2; threshold for GA application in the action set θGA = 25; experience threshold for classifier deletion θdel = 20; fraction of mean fitness for deletion δ = 0.1; classifier experience threshold for subsumption θsub = 20; crossover probability χ = 0.8; condition mutation probability μ = 0.04; action mutation probability pm = 0.1; initial prediction pI = 10.0; initial prediction error εI = 0.0; initial fitness FI = 0.01; reduction of the prediction error predictionErrorReduction = 0.25; reduction of the fitness fitnessReduction = 0.1; maximum allowed tree depth d set to 2 for the 6-bits and 11-bits multiplexer problems and to 3 for the 20-bits and 37-bits multiplexer problems; and tournament selection with a tournament size ratio of 0.4. The probability of ‘#’ in covering is set to 0.5
for 6-, 11-, and 20-bits multiplexers whereas to 0.75 for
37-bits multiplexer. The number of micro classifiers used
is 500, 1,000, 2,000, and 8,000 for 6-, 11-, 20-, and
37-bits multiplexers, respectively. Explore and exploit
problems are alternated with probability 0.5. All the
experiments have been repeated 30 times with a known
different seed in each run. One run is stopped after one
million explore problems.
Table 1 Optimum ternary encoded rule set for the 6-bits multiplexer
problem
Input (A1 A0 D0 D1 D2 D3)  Output
0 0 0 # # # 0
0 0 1 # # # 1
0 1 # 0 # # 0
0 1 # 1 # # 1
1 0 # # 0 # 0
1 0 # # 1 # 1
1 1 # # # 0 0
1 1 # # # 1 1
(Footnote 6: It is assumed that the order of bits is A1A0D0D1D2D3.)
(Footnote 7: The generalized input 10##0# actually covers eight input combinations: 100000, 100001, 100100, 100101, 101000, 101001, 101100, and 101101.)
5 Results
The proposed method XCSCFA successfully learned the
6-, 11-, 20-, and 37-bits multiplexer problems, shown in
Fig. 6. The performance curves are plotted using a
100-point running average from exploit problems. The
learning performance of XCSCFA is similar to that of XCS
for 6-bits and 11-bits multiplexer problems as shown with
coincident curves. The code fragmented XCS takes slightly
more instances to learn the 20-bits and 37-bits multiplexer
problems as compared with the standard implementation of
XCS having a static binary action. The reason for this is the inconsistency of the action value of a classifier using a code fragmented action, explained in Sect. 6.2. This inconsistency makes learning relatively more difficult for the system compared with standard XCS using a static binary action.
Due to this difficulty in learning, the system produces unique and interesting results, having no ‘don’t care’ symbol (‘#’) in the address bits of most of the classifiers, as shown in Table 2 (see Footnote 8). The standard XCS implementation
produces a final population of classifier rules having almost
as many classifiers with ‘#’ in the address bits as with
specific address bits, for example rules 1, 3, 5, 8, 12, 13, 15,
16, and 17 shown in Table 3, have ‘#’ symbol in the
address bits. The reason for these classifiers having ‘#’ in
the address bits is that the standard XCS tends to produce
maximally general and accurate classifier rules (Butz et al.
2004), but cannot differentiate between two maximally
general and accurate classifiers that are semantically dif-
ferent, e.g. rules 1, 2, and 4 in Table 3.
5.1 Condensation
As the code fragmented action implementation of XCS
produces classifiers mostly with no ‘#’ in the address bits, a
specialized condensation mechanism has been imple-
mented in an effort to obtain maximally general, compact,
and accurate classifiers for the multiplexer problem
domain.
It is to be noted that condensation is not the main point of the algorithm; it is included for completeness. This is an a posteriori, non-iterative compaction mechanism rather than a standard iterative condensation
Fig. 6 Results of the multiplexer problem domain using standard
XCS with static binary action and the proposed approach of XCS with
code fragmented action: a for 6-, 11-, and 20-bits multiplexer
problems, and b for 37-bits multiplexer problem. XCS with code
fragmented action can learn the multiplexer problems successfully,
although it needs slightly more training examples with comparison to
standard XCS with static binary action, for 20-bits and 37-bits
multiplexer problems
Table 2 A sample of classifier rules, obtained in a typical run for
6-bits multiplexer, using XCS with code fragmented actions
Sr. no.  Condition  Action  Prediction
1 1 1 # # # 1 D1D2rD1D2r& 0
2 1 0 # # 1 # D5D0|D5D0|| 1,000
3 1 1 # # # 1 D1D2rD1D1r& 0
4 1 1 # # # 1 D1D1rD1D2r& 0
5 1 0 # # 0 # D1D2dD1D2dr 1,000
6 1 0 # # 0 # D1* 0
7 0 0 0 # # # D1D1| 1,000
8 0 1 # 1 # # D3D5rD1D0&d 1,000
9 1 1 # # # 0 D1D1| 0
10 1 1 # # # 1 D3D0|D0D0|| 1,000
11 1 0 # # 0 # D1*D1*| 0
12 0 0 1 # # # D0** 0
13 1 1 # # # 1 D0D0|D3D0|| 1,000
14 0 1 # 1 # # D1D0&D3D5rd 1,000
15 1 0 # # 1 # D5D0| 1,000
16 1 0 # # 1 # D1D4*| 0
17 0 0 1 # # # D2** 1,000
18 1 0 # # 1 # D4* 0
19 1 1 # # # 1 D3D0|D3D0|| 1,000
20 0 0 0 # # # D1D1D0dd 0
It is worth noting that there are no ‘#’ symbols in the address bits
(Footnote 8: A few newly created classifiers may contain ‘#’ in the address bits due to mutation.)
technique. In a typical condensation method (Wilson 1995; Kovacs 1996), evolutionary search is suspended by stopping the GA from creating new classifiers, and learning continues for a certain number of iterations. In the condensation mechanism introduced here, training is stopped and the rule set is compacted instantly.
The specialized condensation algorithm for this imple-
mentation is given below:
1. From the final rule set, delete all the classifiers that are either inaccurate (i.e., prediction error ε > ε0), or insufficiently experienced (i.e., experience exp ≤ 1/β), or have inconsistent action values.
2. In the remaining population, if two classifiers have the
same condition, the same action value, and the same
prediction value, then treat them as a single classifier.
Delete one of them and increase the numerosity of the
other by the numerosity value of the one being deleted.
The classifier being kept retains the higher experience
and fitness values from these two classifiers.
3. In the resulting population of step 2, if two classifiers
have the same condition, but opposite action (i.e. 0/1)
and likewise the opposite prediction values (i.e.
0/1,000), then invert the action and prediction values
of the classifier having prediction value equal to 0 and
condense them as a single classifier. Delete one of
Fig. 7 The numerosity and fitness of classifiers in final population for a typical run of 6-, 11-, 20-, and 37-bits multiplexer problems using XCS
with code fragmented action. There are two groups of classifiers according to numerosity values
Table 3 A sample of classifier rules, obtained in a typical run for
6-bits multiplexer, using standard XCS
Sr. no.  Condition  Action  Prediction
1 # 1 # 1 # 1 0 0
2 1 1 # # # 1 0 0
3 0 # 1 1 # # 1 1,000
4 0 1 # 1 # # 0 0
5 0 # 1 1 # # 0 0
6 0 1 # 1 # # 1 1,000
7 1 0 # # 0 # 1 0
8 # 1 # 0 # 0 0 1,000
9 1 1 # # # 0 1 0
10 1 1 # # # 0 0 1,000
11 0 0 1 # # # 1 1,000
12 # 0 1 # 1 # 1 1,000
13 # 1 # 1 # 1 1 1,000
14 1 0 # # 1 # 0 0
15 1 # # # 1 1 0 0
16 1 # # # 0 0 0 1,000
17 # 0 1 # 1 # 0 0
18 0 0 1 # # # 0 0
19 0 1 # 0 # # 1 0
20 1 0 # # 0 # 0 1,000
It can be seen that there are ‘#’ symbols in the address bits
them and increase the numerosity of the other by the numerosity value of the one being deleted. The classifier being kept retains the higher experience and fitness values from these two classifiers (see Footnote 9).
4. Sort the resulting population of step 3 according to
numerosity in descending order.
5. Find the two consecutive classifiers that have the
maximum numerosity difference between each other:
say they are C1 and C2 such that numerosity of
classifier C1 is greater than that of C2.
6. Delete all the classifiers having numerosity equal to or
less than the numerosity of C2.
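Steps 4–6 above can be sketched as follows. This is our own illustrative fragment, operating on (rule, numerosity) pairs rather than full classifier records, and the sample population is hypothetical:

```python
def keep_high_numerosity_group(classifiers):
    """Sort by numerosity (step 4), cut at the largest gap between
    consecutive classifiers C1 and C2 (step 5), and keep only the
    high-numerosity group (step 6)."""
    ranked = sorted(classifiers, key=lambda c: c[1], reverse=True)
    if len(ranked) < 2:
        return ranked
    gaps = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    cut = gaps.index(max(gaps))        # C1 = ranked[cut], C2 = ranked[cut + 1]
    return ranked[:cut + 1]

# hypothetical population: three strong rules, two spurious ones
pop = [('r1', 60), ('r2', 55), ('r3', 58), ('r4', 4), ('r5', 2)]
print([r for r, _ in keep_high_numerosity_group(pop)])  # -> ['r1', 'r3', 'r2']
```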
When steps 1–4 of the condensation mechanism are applied, the population of classifiers automatically separates into two groups according to numerosity values. Figure 7
shows the separation of classifiers, for a typical run of
6-, 11-, 20-, and 37-bits multiplexer problems. It can be seen
that the group of classifiers having higher numerosity values
also have higher fitness values as would be expected. So the
classifiers with low numerosity were deleted, applying steps
5–6 of the condensation algorithm, to obtain a final popu-
lation of maximally general, compact, and accurate classi-
fiers. The resulting population has 8, 16, 32, and 64
classifiers for 6-, 11-, 20-, and 37-bits multiplexer problems,
respectively. The final populations for 6-bits and 11-bits
multiplexers are shown in Tables 4 and 5, respectively.
Figure 8 shows the final population of accurate and
experienced classifiers for 37-bits multiplexer problem,
obtained using standard XCS with static binary actions. In
standard XCS, some form of grouping is observed, but there is no distinct separation of optimal and sub-optimal classifiers: XCS failed in 22 out of 30 runs of the 37-bits multiplexer to produce the optimum compact rule set. To obtain the desired optimum rule set in binary action-based XCS, extensive post-processing is needed (Kovacs 1996), e.g. a condensation or compaction algorithm.
It is to be noted that the fitness of classifiers in code
fragmented XCS is smaller as compared to that in XCS
with binary action. The reason for this is that in XCS, the
fitness is shared among the accurate classifiers in a niche,
and the most general classifier in a niche has subsumed
other less general classifiers. In standard XCS with binary
action, subsumption deletion is fully enabled so the fitness
of the general classifier in a niche gets higher value as it
subsumes the less general classifiers in the niche. In code
fragmented XCS, the multiple genotypes to a single phe-
notype issue disables the subsumption deletion function, so
Table 4 Final population of maximally general, compact, and
accurate classifiers obtained in a typical run of 6-bits multiplexer,
using XCS with code fragmented actions
Condition Code fragmented
action
Calculated action
value
Prediction
11###1 D3D0|D0D0|| 1 1,000
01#1## D3D5rD1D0&d 1 1,000
10##1# D5D0|D5D0|| 1 1,000
001### D2** 1 1,000
000### D1D1| 0 1,000
11###0 D5D5& D4*& 0 1,000
01#0## D0D3& 0 1,000
10##0# D1D2dD1D2dr 0 1,000
Table 5 Final population of maximally general, compact, and
accurate classifiers obtained in a typical run of 11-bits multiplexer,
using XCS with code fragmented actions
Condition Code fragmented
action
Calculated action
value
Prediction
101#####0## D2* 0 1,000
111#######1 D10*D2D1rr 1 1,000
110######0# D1*D9D10&& 0 1,000
100####0### D7D7& 0 1,000
0000####### D3D3& 0 1,000
011###1#### D1*D6*d 1 1,000
011###0#### D8D5rD1*& 0 1,000
111#######0 D0D3rD0D3r& 0 1,000
001#1###### D1D4dD4* | 1 1,000
010##0##### D1*D1*& 0 1,000
001#0###### D1 0 1,000
010##1##### D1*D1*r 1 1,000
0001####### D0D7d 1 1,000
110######1# D9D9|D9| 1 1,000
101#####1## D8D0rD8D0rr 1 1,000
100####1### D1D2&D1d 1 1,000
Fig. 8 The numerosity and fitness of classifiers in final population for
a typical run of 37-bits multiplexer using standard XCS with static
binary actions
(Footnote 9: This assumes binary classification, with the complete mapping payoff of XCS being no longer explicitly required.)
fitness in a niche is distributed among multiple equally
general classifiers, all having a relatively small fitness
value as compared to the binary action-based XCS.
6 Discussions
It was not the primary goal to produce a maximally gen-
eral, accurate and compact population. The main aim of
this experimentation was to investigate the scalability of
building blocks within LCS, but it resulted in the seren-
dipity of producing the optimum solution. The following
sections elaborate why the optimum solution was
produced.
6.1 Specialization of address bits
If there are no ‘#’ bits in the address bits then the system
requires just one specific data bit, at the correct position, to
generate an accurate rule. If there is a ‘#’ symbol in the
address bits then it needs at least two specific data bits, at
the correct positions and having the same value, to produce
an accurate rule (which will actually cover two simple
rules). For example, in Table 6 there are no ‘#’s in the address bits of the first two rules, so just one specific data bit is enough to make them accurate classifiers, whereas the third rule has a ‘#’ in the address bits so it needs two specific data bits to be accurate. Similarly, the fourth rule, having two
‘don’t care’ symbols in the address bits, needs four specific
data bits to be an accurate rule. It is to be noted that the correctness of the rules depends on the value of the action; e.g., in the case of these four rules, if the action value is ‘1’ then they will be correct, otherwise they are incorrect classifiers.
Each ‘don’t care’ symbol in the address bits makes it
difficult for the system to produce an accurate classifier
rule, although if it is produced then it will cover more than
one classifier so the final population of classifiers would
have relatively fewer classifiers than enumerated specific
classifiers. If there are n ‘don’t care’ symbols in the address bits of a classifier then the system needs at least 2^n specific data bits, at the correct positions with the same value, in the classifier to make it an accurate classifier rule. If such a rule is produced, it will be equivalent to 2^n simpler classifiers.
If the action is binary (as in standard XCS implemen-
tation) then it is relatively easy for the system to produce
such classifier rules. However, in our case, the action is a
code fragment and the action value is determined by taking
the associated condition as its input where ‘#’ in the con-
dition is treated as 0 or 1 randomly. So it heavily depends
on the bits in the associated condition to generate an
accurate action that will result in the accurate classifier
rule. A single ‘#’ symbol in the address bits makes it harder
for this system to produce an accurate corresponding ‘code
fragmented action’ because of inconsistency of action
value. The consistency of a classifier’s action value with
different condition patterns is discussed next.
6.2 Consistency of a classifier’s action value
In the case of standard XCS implementation, all classifier
rules are 100 % consistent in terms of the action value. If a
classifier’s action is 0 then it will be permanently 0 and if it
is 1 then it will be permanently 1 throughout the system
evolution. However in code fragmented action implemen-
tation of XCS, where a classifier’s action value is deter-
mined using the classifier’s condition as input to the code
fragment tree, this is not the case.
In this implementation, the 100 % consistency of the
whole population of classifiers (in terms of action value) is
not guaranteed. The reason for this decreased consistency is that the ‘#’ symbol in the condition of a classifier is randomly treated as 0 or 1 during the computation of a classifier's action value (see Footnote 10).
A classifier having no ‘#’ symbol in the condition is
100 % consistent in terms of its action value, but if a clas-
sifier has one or more ‘#’ symbols in the condition then its
consistency depends upon the code fragment tree. If the
value of a code fragment tree is dependent upon a condition
bit that is ‘#’ then it cannot be 100 % consistent. For example, in the case of the classifier rule "1#01#1 : D0D4|D3D1&&", depicted in Fig. 9, there are two ‘#’ symbols in the condition
(D1 and D4) and both occur in the code fragment tree. The
value of this tree is dependent on the bit D1: if the value of bit
D1 is 0 then the tree’s output value will be 0 and if the value
of bit D1 is 1 then the tree’s output value will be 1 (Note: The
value of this code fragment tree is not dependent on the value
of bit D4.).
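A classifier's consistency can also be established exhaustively rather than by random sampling, by enumerating every 0/1 assignment of its ‘#’ bits. The sketch below is our own code (the token list encodes the postfix fragment of Fig. 9) and checks the example just discussed:

```python
from itertools import product

# binary operators; '*' (NOT) is handled separately as the only unary one
OPS = {'&': lambda a, b: a & b, '|': lambda a, b: a | b,
       'd': lambda a, b: 1 - (a & b), 'r': lambda a, b: 1 - (a | b)}

def eval_postfix(tokens, bits):
    stack = []
    for t in tokens:
        if t[0] == 'D':
            stack.append(bits[int(t[1:])])
        elif t == '*':
            stack.append(1 - stack.pop())
        else:
            b, a = stack.pop(), stack.pop()
            stack.append(OPS[t](a, b))
    return stack.pop()

def is_consistent(condition, tokens):
    """True iff the action value is the same for every 0/1 assignment
    of the '#' ('don't care') bits in the condition."""
    holes = [i for i, c in enumerate(condition) if c == '#']
    values = set()
    for assign in product((0, 1), repeat=len(holes)):
        bits = [0 if c == '#' else int(c) for c in condition]
        for i, v in zip(holes, assign):
            bits[i] = v
        values.add(eval_postfix(tokens, bits))
    return len(values) == 1

# Fig. 9: 1#01#1 : D0 D4 | D3 D1 & & -- the output follows the '#' bit
# D1, so the classifier is inconsistent.
print(is_consistent('1#01#1', ['D0', 'D4', '|', 'D3', 'D1', '&', '&']))  # -> False
```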
Suppose there are m code fragments in the code fragment population; then there will be m classifier rules that have the same condition but a different code fragment
Table 6 Four sample classifier rules for 6-bits multiplexer, demon-
strating the specialization of address bits
Sr. no.  Condition  Action
1 1 1 # # # 1 Action
2 0 1 # 1 # # Action
3 # 1 # 1 # 1 Action
4 # # 1 1 1 1 Action
(Footnote 10: If the code fragmented action tree is evaluated using the current environmental instance as input to the tree (instead of randomly assigning 0 or 1 to a ‘#’ symbol in the classifier's condition), then the final population of classifiers is similar to the population obtained using standard XCS.)
as the action. Some of these classifiers will be 100 %
consistent in terms of their action values and others not.
If a classifier's action value is consistent (it can be correct or incorrect; see Footnote 11), then a correct action will lead to a
stable predicted reward of maximum value (1,000 in this
implementation) and similarly the incorrect action will lead
to a stable predicted reward of minimum value (0 in this
implementation). If the action value of a classifier is not
consistent then the predicted reward of the classifier will
never reach the maximum payoff value nor the minimum
payoff value. The predicted reward of a classifier having
inconsistent action value will increase and decrease
depending upon the correctness of its action value for a
given environmental instance of the problem. The incon-
sistency of a classifier in terms of action value results in
inconsistency of the classifier in terms of predicted reward.
A classifier’s accuracy relates to the consistency of its
reward prediction, so a classifier having consistent action
value will be more accurate than a classifier having
inconsistent action value.
6.3 Optimal classifiers
In the case of the multiplexer problem domain, the action value depends on the address bits, so classifiers with specific address bits have a high probability of surviving in the system. If the address bits' values are spe-
cific, the code fragment will be more consistent in terms of
its value so the classifier will be relatively more accurate
than a classifier having ‘#’ in the address bits. If there is a
‘#’ in the address bits, then the code fragment will be
relatively inconsistent in its value so its accuracy will be
degraded.
To analyze the consistency of action values, the consistency of all address bit patterns in conditions for the 6-bits multiplexer problem was calculated. There are two address
bits in the 6-bits multiplexer and each of these two bits can take a value from the ternary alphabet {0, 1, #}, so there are nine (3^2) different address patterns for the 6-bits multiplexer. Similarly, the four data bits can take values from the ternary alphabet {0, 1, #}, so there are 81 (3^4) different data patterns. Combining these address and data bit patterns, there are 729 (9 × 81) different conditions. There are six terminals and five operators of arity {1, 2, 2, 2, 2}, so there are 97,506 distinct code fragment trees of depth up to two. Using these 729 different conditions and 97,506 different code fragment trees as actions results in 71,081,874 different classifier rules. Each of the nine address patterns has 7,897,986
classifiers. The consistency of each of the address patterns
is shown in Table 7. The consistency of each of the four
patterns having both specific address bits is 79.65 %,
whereas the consistency of the last pattern that has ‘#’ in
both of the address bits is just 49.21 %. The patterns with
one specific address bit and one ‘#’ address bit are 64.41 %
consistent in their action values.
Because the classifiers having conditions from the schema "AAxxxx" (see Footnote 12) are more consistent in terms of action values than other classifiers, the former classifiers are more
consistent in terms of reward predictions than the latter
ones. This consistency of reward prediction makes the
classifiers having condition parts from the ‘‘AAxxxx’’
schema more accurate than the other classifiers having
condition parts from the schema ‘‘#Axxxx’’, ‘‘A#xxxx’’,
and ‘‘##xxxx’’. Now in XCS, as the fitness of a classifier
depends on its accuracy of reward prediction, the classifiers
having specific address bits in conditions have higher fit-
ness values than the classifiers having one or more ‘#’
symbols in the address bits.
In XCS, the reproduction is niche based, i.e. a genetic
algorithm (GA) is applied to the classifiers participating in
Table 7 Consistency of classifiers in terms of action values when
using different patterns as the condition
Sr. no.  Pattern (condition)  Total classifiers  Consistent classifiers  Consistency percentage
1 00xxxx 7,897,986 6,290,622 79.65
2 01xxxx 7,897,986 6,290,622 79.65
3 10xxxx 7,897,986 6,290,622 79.65
4 11xxxx 7,897,986 6,290,622 79.65
5 0#xxxx 7,897,986 5,087,259 64.41
6 1#xxxx 7,897,986 5,087,259 64.41
7 #0xxxx 7,897,986 5,087,259 64.41
8 #1xxxx 7,897,986 5,087,259 64.41
9 ##xxxx 7,897,986 3,886,488 49.21
Here ‘‘x’’ means 0, 1, or ‘#’
Fig. 9 A classifier rule with code fragmented action where the action
is dependent on a ‘#’ bit in the condition. Here ‘|’ and ‘&’ denote
logical OR and logical AND operators, respectively
(Footnote 11: In the XCS system, the accuracy of a prediction is more important than the correctness of the prediction itself.)
(Footnote 12: In this schema, ‘A’, the address bits, can be either 0 or 1, and ‘x’, the data bits, can be 0, 1, or ‘#’.)
the action set instead of applying it to the whole classifier population, and according to Wilson (1995):
… within a given action set, the more accurate
classifiers will have higher fitnesses than the less
accurate ones. They will consequently have more
offspring. But by becoming relatively more numer-
ous, those classifiers will gain a larger fraction of the
total relative accuracy (which always equals 1) and so
will have yet more offspring compared to their less
accurate brethren. Eventually, the most accurate
classifiers in the action set will drive out the others, in
principle leaving the X × A ⇒ P map with the best
classifier (assuming the GA has discovered it) for
each situation–action combination.
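Wilson's argument can be illustrated with a minimal sketch of XCS's accuracy-based fitness update within an action set (the parameters α, ε0, ν, β follow the standard algorithmic description of Butz and Wilson (2002); the classifier fields used here are simplified assumptions, not the authors' implementation):

```python
# Minimal sketch of XCS's accuracy-based fitness update in an action set.
ALPHA, EPS0, NU, BETA = 0.1, 10.0, 5.0, 0.2  # commonly used defaults

def update_fitness(action_set):
    # Each classifier is a dict with 'error', 'numerosity' and 'fitness'.
    # Absolute accuracy kappa is 1 if the prediction error is below eps0,
    # otherwise a steeply decaying power of the error.
    kappas = [1.0 if cl['error'] < EPS0
              else ALPHA * (cl['error'] / EPS0) ** -NU
              for cl in action_set]
    total = sum(k * cl['numerosity'] for k, cl in zip(kappas, action_set))
    for k, cl in zip(kappas, action_set):
        relative = k * cl['numerosity'] / total  # relative accuracies sum to 1
        cl['fitness'] += BETA * (relative - cl['fitness'])

accurate   = {'error': 0.0,   'numerosity': 1, 'fitness': 0.1}
inaccurate = {'error': 100.0, 'numerosity': 1, 'fitness': 0.1}
update_fitness([accurate, inaccurate])
print(accurate['fitness'] > inaccurate['fitness'])  # True
```

Because fitness tracks relative accuracy within the action set, the accurate classifier's fitness climbs while the inaccurate one's decays, which is the pressure that eventually drives the less accurate classifiers out.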
Classifiers having specific address bits are more consistent in terms of their action values and are also semantically simpler, so they are relatively more accurate than other classifiers. Therefore, these simple and consistent classifiers are preferred by the system. When condensed using the specifically designed condensation algorithm described in Sect. 5.1, these classifiers yield the maximally general, compact and accurate classifiers in the final population for the multiplexer problem domain, i.e. the optimum population.
7 Conclusions

The learning classifier system implemented with ternary alphabet-based conditions and code-fragmented actions successfully learns the tested multiplexer domain problems, which is to be expected as LCSs perform well in this domain. It was unexpected that code-fragmented actions produce an optimal solution, given the GP-like encoding. Investigations showed that constructing a richer alphabet made the search space more difficult, and it was this difficulty that produced the optimal solutions.
Consistency of action value is guaranteed in the standard XCS implementation using static binary actions, but with a code-fragmented action that treats a 'don't care' symbol randomly, this consistency cannot be guaranteed for every classifier. This resulted in a compact solution, as classifiers having specific address bits were more consistent than other potentially correct classifiers (i.e. correct under the sampling assumptions implicit in common alphabets).
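The loss of consistency can be illustrated with a toy evaluation of a code-fragment action over the condition bits, where each '#' bit is resolved to a random 0 or 1 at evaluation time (the fragment (b0 | b1) & b2 is a hypothetical example in the spirit of Fig. 9, not the authors' encoding):

```python
import random

def evaluate_action(condition, fragment, rng):
    # Bind condition positions: specific bits keep their value,
    # '#' bits are resolved to a random 0/1 at each evaluation.
    bound = [rng.randint(0, 1) if c == '#' else int(c) for c in condition]
    return fragment(bound)

# Hypothetical code fragment over the first three condition bits: (b0 | b1) & b2
fragment = lambda b: (b[0] | b[1]) & b[2]

rng = random.Random(42)
specific = {evaluate_action("001###", fragment, rng) for _ in range(100)}
hashed   = {evaluate_action("0#1###", fragment, rng) for _ in range(100)}
print(specific)  # {0}: the fragment reads only specific bits, so the action is consistent
print(hashed)    # {0, 1}: the fragment reads a random '#' bit, so consistency is lost
```

A classifier whose fragment reads only specific condition bits always advocates the same action, whereas one whose fragment reads a '#' bit advocates different actions on different evaluations, and so accrues lower accuracy.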
This investigation of code fragments in XCS shows that the multiple-genotypes-to-a-single-phenotype issue in feature-rich encodings disables the subsumption deletion function. The additional methods and increased search space mean that more training examples are required to reach similar levels of performance. This is compensated for by the autonomous separation of optimal and sub-optimal classifiers in the final population, eventually resulting in the optimum rule set of maximally general, compact and accurate classifiers.
The next stage is to introduce a mechanism that, during the learning process, treats two classifiers with the same condition and consistent actions outputting the same action values as a single classifier, in order to fully enable the subsumption deletion function. Hopefully, this will reduce the number of environmental inputs required to learn the problem domain.
Ultimately, the identified fit building block units from a
simple problem in a domain (e.g. 6-bit multiplexer) will be
used to seed the population in a more complex problem in
the same problem domain (e.g. 11-bit multiplexer) and so
forth. By utilizing this ‘stepping-stone’ approach it is
hoped that eventually a problem will be solved in the
domain (e.g. 1,034-bit multiplexer), which had not previ-
ously been solved using the base techniques.
It is anticipated that multiple populations of building
block units from different, associated problem domains
will need to be leveraged to assist in general problem
solving.
References
Acampora G, Cadenas JM, Loia V, Ballester EM (2011) A multi-agent
memetic system for human-based knowledge selection. IEEE
Trans Syst Man Cybern A Systems Humans 41(5):946–960
Ahluwalia M, Bull L (1999) A genetic programming based classifier
system. In: Proceedings of the genetic and evolutionary compu-
tation conference, pp 11–18
Alfaro-Cid E, Merelo JJ, de Vega FF, Esparcia-Alcazar AI, Sharman
K (2010) Bloat control operators and diversity in genetic
programming: a comparative study. Evol Comput 18(2):305–332
Altenberg L (1995) The schema theorem and Price’s theorem. In:
Foundations of genetic algorithms, pp 23–49
Banzhaf W, Nordin P, Keller RE, Francone FD (1998) Genetic
programming—an introduction: on the automatic evolution of
computer programs and its applications. Morgan Kaufmann,
Burlington
Bernadó-Mansilla E, Garrell-Guiu JM (2003) Accuracy-based learning
classifier systems: models, analysis and applications to classifi-
cation tasks. Evol Comput 11(3):209–238
Beyer HG (1997) An alternative explanation for the manner in which
genetic algorithms operate. BioSystems 41:1–15
Booker LB, Goldberg DE, Holland JH (1989) Classifier systems and
genetic algorithms. Artif Intell 40(1-3):235–282
Burjorjee KM (2008) The fundamental problem with the building
block hypothesis
Butz MV (2000) XCSJava 1.0: an implementation of the XCS
classifier system in Java. Technical Report 2000027, Illinois
Genetic Algorithms Laboratory
Butz MV (2007) Combining gradient-based with evolutionary online
learning: an introduction to learning classifier systems. In:
Proceedings of the seventh international conference on hybrid
intelligent systems, pp 12–17
Butz MV, Kovacs T, Lanzi PL, Wilson SW (2001) How XCS evolves
accurate classifiers. Technical Report 2001008, Illinois Genetic
Algorithms Laboratory
Butz MV, Kovacs T, Lanzi PL, Wilson SW (2004) Toward a theory
of generalization and learning in XCS. IEEE Trans Evol Comput
8(1):28–46
Butz MV, Pelikan M, Llora X, Goldberg DE (2006) Automated
global structure extraction for effective local building block
processing in XCS. Evol Comput 14(3):345–380
Butz MV, Wilson SW (2002) An algorithmic description of XCS.
Soft Comput A Fusion Found Methodol Appl 6(3-4):144–153
Drugowitsch J (2008) Design and analysis of learning classifier
systems: a probabilistic approach. Springer, Berlin
Eiben AE, Smith JE (2003) Introduction to evolutionary computing,
1st edn. Natural Computing Series. Springer, Berlin
Goldberg DE (1989) Genetic algorithms in search, optimization and
machine learning. Addison Wesley, Boston
Holland JH (1975) Adaptation in natural and artificial systems.
University of Michigan Press, Ann Arbor
Holland JH (1986) Escaping brittleness: the possibilities of general-
purpose learning algorithms applied to parallel rule-based
systems. In: Machine learning: an artificial intelligence
approach, vol II. Morgan Kaufmann, Burlington, pp 593–623
Ioannides C, Browne WN (2007) Investigating scaling of an
abstracted LCS utilising ternary and S-expression alphabets.
In: Proceedings of the genetic and evolutionary computation
conference, pp 2759–2764
Jong KAD (2006) Evolutionary computation: a unified approach. MIT
Press, Cambridge
Kinzett D, Johnston M, Zhang M (2009) Numerical simplification for
bloat control and analysis of building blocks in genetic
programming. Evol Intell 2(4):151–168
Kovacs T (1996) Evolving optimal populations with XCS classifier
systems. Technical Report CSR-96-17 and CSRP-9617, Univer-
sity of Birmingham, UK
Koza JR (1992) Genetic programming: on the programming of
computers by means of natural selection. MIT Press, Cambridge
Koza JR, Poli R (2005) Genetic programming. In: Search method-
ologies: introductory tutorials in optimization and decision
support techniques, chap. 5. Springer, Berlin, pp 127–164
Lanzi PL (1999) Extending the representation of classifier conditions
Part I: from binary to messy coding. In: Proceedings of the
genetic and evolutionary computation conference, pp 337–344
Lanzi PL, Loiacono D (2007) Classifier systems that compute action
mappings. In: Proceedings of the genetic and evolutionary
computation conference, pp 1822–1829
Lanzi PL, Loiacono D, Wilson SW, Goldberg DE (2005) XCS with
computed prediction for the learning of Boolean functions.
Technical Report 2005007, Illinois Genetic Algorithms Laboratory
Lanzi PL, Loiacono D, Wilson SW, Goldberg DE (2007) General-
ization in the XCSF classifier system: analysis, improvement,
and extension. Evol Comput 15(2):133–168
Lanzi PL, Perrucci A (1999) Extending the representation of classifier
conditions Part II: from messy coding to S-expressions. In:
Proceedings of the genetic and evolutionary computation
conference, pp 345–352
Lanzi PL, Stolzmann W, Wilson SW (2000) Learning classifier
systems: from foundations to applications. Springer, Berlin
Loiacono D, Marelli A, Lanzi P (2007) Support vector machines for
computing action mappings in learning classifier systems. In:
Proceedings of the congress on evolutionary computation,
pp 2141–2148
Luke S, Panait L (2006) A comparison of bloat control methods for
genetic programming. Evol Comput 14(3):309–344
Mühlenbein H, Paaß G (1996) From recombination of genes to the
estimation of distributions I. Binary parameters. In: Parallel
Problem Solving from Nature, pp 178–187
Orriols-Puig A, Bernadó-Mansilla E (2006) A further look at UCS
classifier system. In: Proceedings of the ninth international
workshop on learning classifier systems. Springer, Berlin
Pelikan M, Goldberg DE, Lobo FG (2002) A survey of optimization
by building and using probabilistic models. Comput Optim Appl
21(1):5–20
Poli R (2000) Why the schema theorem is correct also in the presence
of stochastic effects. In: Proceedings of the congress on
evolutionary computation, pp 487–492
Poli R, Langdon WB (1998) Schema theory for genetic programming
with one-point crossover and point mutation. Evol Comput
6:231–252
Poli R, Langdon WB, McPhee NF (2008) A field guide to genetic
programming. Lulu Enterprises, UK Ltd
Rivest RL (1997) S-expressions, Internet Engineering Task Force—
Internet Draft. http://people.csail.mit.edu/rivest/Sexp.txt
Robilliard D, Marion-Poty V, Fonlupt C (2009) Genetic programming
on graphics processing units. Genet Program Evol Mach
10(4):447–471
Russell SJ, Norvig P (2011) Artificial intelligence: a modern
approach, 3rd edn. Pearson Education, Boston
Smith SF (1980) A learning system based on genetic adaptive
algorithms. PhD thesis
Sutton RS, Barto AG (1998) Reinforcement learning: an introduction.
MIT Press, Cambridge
Tenne Y, Armfield S (2009) A framework for memetic optimization
using variable global and local surrogate models. Soft Comput A
Fusion Found Methodol Appl 13(8):781–793
Thrun S (1996) Is learning the n-th thing any easier than learning the
first? In: Advances in neural information processing systems.
MIT Press, Cambridge, pp 640–646
Urbanowicz RJ, Moore JH (2009) Learning classifier systems: a
complete introduction, review, and roadmap. J Artif Evol Appl
2009(1):1–25
Wilson SW (1994) ZCS: a zeroth level classifier system. Evol
Comput 2(1):1–18
Wilson SW (1995) Classifier fitness based on accuracy. Evol Comput
3(2):149–175
Wilson SW (1998) Generalization in the XCS classifier system. In:
Proceedings of the third annual genetic programming conference,
pp 665–674
Wilson SW (2002) Classifiers that approximate functions. Nat Comput 1:211–233