Do not Match, Inherit: Fitness Surrogates for Genetics-Based Machine Learning Techniques

23
GECCO 2007 HUMIES 1 Do not Match, Inherit: Fitness Surrogates for Genetics-Based Machine Learning Techniques Xavier Llorà 1,2 , Kumara Sastry 2 , Tian-Li Yu 3 , David E. Goldberg 2 1 National Center for Supercomputing Applications. University of Illinois at Urbana-Champaign 2 Illinois Genetic Algorithms Laboratory, University of Illinois at Urbana-Champaign 3 Department of Electrical Engineering, National Taiwan University Supported by AFOSR FA9550-06-1-0370, NSF at ISS-02-09199

description

A byproduct benefit of using probabilistic model-building genetic algorithms is the creation of cheap and accurate surrogate models. Learning classifier systems---and genetics-based machine learning in general---can greatly benefit from such surrogates which may replace the costly matching procedure of a rule against large data sets. In this paper we investigate the accuracy of such surrogate fitness functions when coupled with the probabilistic models evolved by the x-ary extended compact classifier system (xeCCS). To achieve such a goal, we show the need that the probabilistic models should be able to represent all the accurate basis functions required for creating an accurate surrogate. We also introduce a procedure to transform populations of rules based into dependency structure matrices (DSMs) which allows building accurate models of overlapping building blocks---a necessary condition to accurately estimate the fitness of the evolved rules.

Transcript of Do not Match, Inherit: Fitness Surrogates for Genetics-Based Machine Learning Techniques

Page 1: Do not Match, Inherit: Fitness Surrogates for Genetics-Based Machine Learning Techniques

GECCO 2007 HUMIES 1

Do not Match, Inherit: Fitness Surrogates forGenetics-Based Machine Learning Techniques

Xavier Llorà1,2, Kumara Sastry2, Tian-Li Yu3, David E. Goldberg2

1 National Center for Supercomputing Applications. University of Illinois at Urbana-Champaign2 Illinois Genetic Algorithms Laboratory, University of Illinois at Urbana-Champaign

3 Department of Electrical Engineering, National Taiwan University

Supported by AFOSR FA9550-06-1-0370, NSF at ISS-02-09199

Page 2: Do not Match, Inherit: Fitness Surrogates for Genetics-Based Machine Learning Techniques

GECCO 2007 Llorà, Sastry, Yu & Goldberg 2

Motivation

• Competent GBML– Use competent GAs to approach GBML problems

– Take advantage of competent GA scalability

– Provide insight about problem structure

– χeCCS by Llorà, Sastry, Goldberg & de la Ossa (2006)

• Rule matching may thread practical applications– Even for small dimensional problems (MUX 20), rule matching

may take more than 85% of the execution time in XCS

– As dimensionality or cardinality of training sets increase, rulematching rules the overall execution time

– Efficient implementation (Llora & Sastry, 2006) still requirematching rules

Page 3: Do not Match, Inherit: Fitness Surrogates for Genetics-Based Machine Learning Techniques

GECCO 2007 Llorà, Sastry, Yu & Goldberg 3

Motivation

• Competent GAs– Byproduct: Models and problem structure insight

– Revision of the fitness relaxation for expensive fitnessevaluations

– Idea: Build a cheap surrogate fitness accurate enough

– Successfully applied to GA (Sastry, Lima & Goldberg, 2006)

– Help cut down the number of fitness evaluations

• GBML– Can we transfer the same ideas to GBML approaches?

– What are the requirements needed for competent GBML tobenefit from fitness relaxation?

Page 4: Do not Match, Inherit: Fitness Surrogates for Genetics-Based Machine Learning Techniques

GECCO 2007 Llorà, Sastry, Yu & Goldberg 4

Outline

• Overview χeCCS

• Evaluation relaxation

• Fitness inheritance using least squares fitting

• Fitness inheritance and χeCCS

• Results

• Conclusions

Page 5: Do not Match, Inherit: Fitness Surrogates for Genetics-Based Machine Learning Techniques

GECCO 2007 Llorà, Sastry, Yu & Goldberg 5

χ-ari Extended Compact Classifier System

• No reinforcement learning is used

• A competent GA is in charge of the learning

• The idea:

– A population of single rules

– For each rule we compute its fitness

– The χ-ari extended compact genetic algorithm

– Niching to maintain different accurate rules (restrictedtournament replacement)

Page 6: Do not Match, Inherit: Fitness Surrogates for Genetics-Based Machine Learning Techniques

Maximally Accurate and General Rules

• Accuracy and generality can be compute as

!

"(r) =nt+(r) + n

t#(r)

nt

!

"(r) =nt+(r)

nm

• Fitness should combine accuracy and generality

!

f (r) ="(r) # $(r)%

• Such measure can be either applied to rules or rule sets

GECCO 2007 Llorà, Sastry, Yu & Goldberg 6

Page 7: Do not Match, Inherit: Fitness Surrogates for Genetics-Based Machine Learning Techniques

Maximally Accurate and General Rules

GECCO 2007 Llorà, Sastry, Yu & Goldberg 7

Page 8: Do not Match, Inherit: Fitness Surrogates for Genetics-Based Machine Learning Techniques

Extended Compact Genetic Algorithm

• A Probabilistic model building GA (Harik, 1999)

– Builds models of good solutions as linkage groups

• Key idea:

– Good probability distribution → Linkage learning

• Key components:

– Representation: Marginal product model (MPM)• Marginal distribution of a gene partition

– Quality: Minimum description length (MDL)• Occam’s razor principle

• All things being equal, simpler models are better

– Search Method: Greedy heuristic search

GECCO 2007 Llorà, Sastry, Yu & Goldberg 8

Page 9: Do not Match, Inherit: Fitness Surrogates for Genetics-Based Machine Learning Techniques

Marginal Product Model (MPM)

• Partition variables into disjoint sets

• Product of marginal distributions on a partition of

genes

• Gene partition maps to linkage groups

x1 x2 x3 x4 x5 x6 xl-2 xl-1 xl

{p000, p001, p00#, p010, p011, p01#, p100, p101,p10#, p110, p111, p11# … }(27 probabilities)

. . .

MPM: [1, 2, 3], [4, 5, 6], … [l-2, l -1, l]

GECCO 2007 Llorà, Sastry, Yu & Goldberg 9

Page 10: Do not Match, Inherit: Fitness Surrogates for Genetics-Based Machine Learning Techniques

Minimum Description Length Metric

• Hypothesis: For an optimal model

– Model size and error is minimum

• Model complexity, Cm

– # of bits required to store all marginal probabilities

• Compressed population complexity, Cp

– Entropy of the marginal distribution over all partitions

• MDL metric, Cc = Cm + Cp

GECCO 2007 Llorà, Sastry, Yu & Goldberg 10

Page 11: Do not Match, Inherit: Fitness Surrogates for Genetics-Based Machine Learning Techniques

Building an Optimal MPM

• Assume independent genes ([1],[2],…,[l])

• Compute MDL metric, Cc

• All combinations of two subset merges

• Eg., {([1,2],[3],…,[l]), ([1,3],[2],…,[l]), ([1],[2],…,[l-1,l])}

• Compute MDL metric for all model candidates

• Select the set with minimum MDL,

• If , accept the model and go to step 2.

• Else, the current model is optimal

GECCO 2007 Llorà, Sastry, Yu & Goldberg 11

Page 12: Do not Match, Inherit: Fitness Surrogates for Genetics-Based Machine Learning Techniques

χeCCS Models for Different MultiplexersBu

ildin

g Bl

ock

Size

Incr

ease

s

Page 13: Do not Match, Inherit: Fitness Surrogates for Genetics-Based Machine Learning Techniques

GECCO 2007 Llorà, Sastry, Yu & Goldberg 13

Fitness Inheritance using Least Squares

• Proposed by Sastry, Lima & Goldberg (2006)

• Surrogate is a regression using basis identified by BBs

• A simple example: [1,3] [2] [4]

• The schemas represented are

– {0*0*, 0*1*, 0*#*, 1*0*, 1*1*, 1*#*, #*0*, #*1*, #*#*,*0**, *1**, *#**, ***0 , ***1 , ***#}

• Recode the individuals by

Page 14: Do not Match, Inherit: Fitness Surrogates for Genetics-Based Machine Learning Techniques

GECCO 2007 Llorà, Sastry, Yu & Goldberg 14

Fitness Inheritance using Least Squares

• Recoding defines matrix A

• Normalize the fitness

Page 15: Do not Match, Inherit: Fitness Surrogates for Genetics-Based Machine Learning Techniques

GECCO 2007 Llorà, Sastry, Yu & Goldberg 15

Fitness Inheritance using Least Squares

• Solve using least squares

• Once solved, the fitness surrogate take the following form

Page 16: Do not Match, Inherit: Fitness Surrogates for Genetics-Based Machine Learning Techniques

GECCO 2007 Llorà, Sastry, Yu & Goldberg 16

Fitness Inheritance and χeCCS

• Two different problems

Hidden XOR 6-input multiplexer

Page 17: Do not Match, Inherit: Fitness Surrogates for Genetics-Based Machine Learning Techniques

GECCO 2007 Llorà, Sastry, Yu & Goldberg 17

Hidden XOR

• Evolved rules and model

• Surrogate accuracy

Page 18: Do not Match, Inherit: Fitness Surrogates for Genetics-Based Machine Learning Techniques

GECCO 2007 Llorà, Sastry, Yu & Goldberg 18

6-input Multiplexer

• The evolved solution and model

• The surrogate is totally off

Page 19: Do not Match, Inherit: Fitness Surrogates for Genetics-Based Machine Learning Techniques

GECCO 2007 Llorà, Sastry, Yu & Goldberg 19

6-input Multiplexer

• The key = missing basis

• χeCCS is able to solve the problem quickly, reliably,

and accurately

• However, the model basis are not accurate enough

to build a proper surrogate

Page 20: Do not Match, Inherit: Fitness Surrogates for Genetics-Based Machine Learning Techniques

GECCO 2007 Llorà, Sastry, Yu & Goldberg 20

Overlapping BBs using DSMGA

• Proposed by Yu, Yassine, Goldberg and Chen (2003)

• Based on organizational theory

• Main property = DSMGA model builder (DSMcluster)

deals with overlapping building blocks

• The main issue = translate a populations or rules

into a dependency structure matrix (DSM)

• The intuition = specific bits are the ones responsible

for the kind of linkage we seek

Page 21: Do not Match, Inherit: Fitness Surrogates for Genetics-Based Machine Learning Techniques

GECCO 2007 Llorà, Sastry, Yu & Goldberg 21

Jumping to the Results

• DSMcluster model for the hidden XOR

– [i0 i1 i2] [i3] [i4] [i5]

• DSMcluster model for the 6-input multiplexer

– [i0 i1] <i2 i3 i4 i5>

– It identifies a BB [i0 i1] of variables interacting with abus <i2 i3 i4 i5>

– Translated into χeCCS language:

[i0 i1 i2] [i0 i1 i3] [i0 i1 i4] [i0 i1 i5]

– The right model which provides the right set of basis

Page 22: Do not Match, Inherit: Fitness Surrogates for Genetics-Based Machine Learning Techniques

GECCO 2007 Llorà, Sastry, Yu & Goldberg 22

Conclusions

• The matching process is crucial and expensive

• Efficient implementations can take us far to a point

• Relaxation can get rid of the need of matching

• For some types of problems overlapping BBs are

required

• DSMGA provides the proper machinery to identify

the proper basis for such a surrogate

Page 23: Do not Match, Inherit: Fitness Surrogates for Genetics-Based Machine Learning Techniques

GECCO 2007 HUMIES 23

Do not Match, Inherit: Fitness Surrogates forGenetics-Based Machine Learning Techniques

Xavier Llorà1,2, Kumara Sastry2, Tian-Li Yu3, David E. Goldberg2

1 National Center for Supercomputing Applications. University of Illinois at Urbana-Champaign2 Illinois Genetic Algorithms Laboratory, University of Illinois at Urbana-Champaign

3 Department of Electrical Engineering, National Taiwan University

Supported by AFOSR FA9550-06-1-0370, NSF at ISS-02-09199