Genetic-Algorithm-Based Instance and Feature Selection
Transcript of Genetic-Algorithm-Based Instance and Feature Selection
Instance Selection and Construction for Data Mining Ch. 6
H. Ishibuchi, T. Nakashima, and M. Nii
Abstract
A GA-based approach for selecting a small number of instances (and features) from a given data set in a pattern classification problem.
The goal is to improve the classification ability of a nearest neighbor classifier by searching for an appropriate reference set.
Genetic Algorithm
Coding: binary string of length (n + m)
S = a_1 a_2 ... a_n s_1 s_2 ... s_m
a_i : inclusion (1) or exclusion (0) of the i-th feature
s_p : inclusion (1) or exclusion (0) of the p-th instance
Fitness function: minimize |F|, minimize |P|, and maximize g(S)
|F| : number of selected features
|P| : number of selected instances
g(S) : classification performance
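The coding scheme above can be sketched in a few lines. This is an illustrative helper, not code from the chapter; the function names (`random_chromosome`, `decode`) are my own.

```python
import random

def random_chromosome(n, m, seed=None):
    """A binary string S = a_1...a_n s_1...s_m as a list of 0/1 genes."""
    rng = random.Random(seed)
    return [rng.randint(0, 1) for _ in range(n + m)]

def decode(chromosome, n):
    """Split S into selected feature indices F and instance indices P."""
    a, s = chromosome[:n], chromosome[n:]
    F = [i for i, ai in enumerate(a) if ai == 1]
    P = [p for p, sp in enumerate(s) if sp == 1]
    return F, P
```

For example, with n = 3 features, the string `[1, 0, 1, 0, 1]` decodes to F = {0, 2} and P = {1}.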
Genetic Algorithm
Performance measure (first one): g_A(S) = the number of correctly classified training instances. Minimize |P| subject to g_A(S) = m (i.e., all m instances are correctly classified).
Performance measure (second one): g_B(S), where an instance x_q included in the reference set is not selected as its own nearest neighbor.
The nearest neighbor x_q* of x_q is defined using the distance d_F measured only on the selected features F:

d_F(x_q, x_q*) = min{ d_F(x_q, x_p) | x_p ∈ P },           if x_q ∉ P
d_F(x_q, x_q*) = min{ d_F(x_q, x_p) | x_p ∈ P − {x_q} },   if x_q ∈ P

fitness(S) = W_g · g(S) − W_F · |F| − W_P · |P|
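The g_B measure and the weighted fitness can be sketched as follows. This is a minimal illustration under my own naming (`dF`, `g_B`, `fitness`); the handling of a reference set that leaves an instance with no candidate neighbor (counted as misclassified here) is an assumption, not specified in the slides.

```python
import math

def dF(x, y, F):
    """Euclidean distance measured only on the selected features F."""
    return math.sqrt(sum((x[i] - y[i]) ** 2 for i in F))

def nearest_neighbor(q, X, P, F):
    """Index in P of the nearest reference instance to X[q];
    X[q] is never allowed to pick itself (the g_B rule)."""
    candidates = [p for p in P if p != q]
    if not candidates:
        return None  # assumption: no neighbor available -> misclassified
    return min(candidates, key=lambda p: dF(X[q], X[p], F))

def g_B(X, labels, P, F):
    """Number of correctly classified training instances under g_B."""
    correct = 0
    for q in range(len(X)):
        p = nearest_neighbor(q, X, P, F)
        if p is not None and labels[p] == labels[q]:
            correct += 1
    return correct

def fitness(g, F, P, Wg=5, WF=1, WP=1):
    """fitness(S) = Wg*g(S) - WF*|F| - WP*|P| (weights from the slides)."""
    return Wg * g - WF * len(F) - WP * len(P)
```

On a toy 1-D data set X = [(0,), (1,), (10,), (11,)] with labels [0, 0, 1, 1], the full reference set P = {0, 1, 2, 3} and F = {0} give g_B = 4 and fitness = 5·4 − 1·1 − 1·4 = 15.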
Genetic Algorithm
1. Initialization
2. Genetic operations: iterate the following procedure Npop/2 times to generate Npop new strings
1. Randomly select a pair of strings
2. Apply uniform crossover
3. Apply a mutation operator
3. Generation update: select the Npop best strings from the 2Npop current and generated strings
4. Termination test
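The loop above can be sketched as a plain elitist GA. This is a hedged illustration, not the chapter's code: the selection of random pairs, the 0.5 gene-swap rate in uniform crossover, and passing `evaluate` as a callable are my own simplifications.

```python
import random

def uniform_crossover(p1, p2, rng):
    """Each gene position is swapped between parents with probability 0.5."""
    c1, c2 = [], []
    for g1, g2 in zip(p1, p2):
        if rng.random() < 0.5:
            c1.append(g1); c2.append(g2)
        else:
            c1.append(g2); c2.append(g1)
    return c1, c2

def mutate(chrom, pm, rng):
    """Flip each gene independently with probability pm."""
    return [1 - g if rng.random() < pm else g for g in chrom]

def evolve(population, evaluate, pm=0.01, generations=500, seed=0):
    """One GA run: Npop/2 crossovers -> Npop offspring, then keep the
    Npop best of the 2*Npop current and generated strings (elitist)."""
    rng = random.Random(seed)
    npop = len(population)
    for _ in range(generations):
        offspring = []
        for _ in range(npop // 2):
            p1, p2 = rng.sample(population, 2)  # random pair selection
            c1, c2 = uniform_crossover(p1, p2, rng)
            offspring += [mutate(c1, pm, rng), mutate(c2, pm, rng)]
        population = sorted(population + offspring,
                            key=evaluate, reverse=True)[:npop]
    return population
```

Because the update keeps the best of the combined pool, the best fitness in the population never decreases across generations.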
Numerical Example
Biased Mutation
An effective way to decrease the number of selected instances is to bias the mutation probability: a much larger probability is assigned to the mutation from s_p = 1 to s_p = 0 than to the mutation from s_p = 0 to s_p = 1.
Data Sets
2 artificial + 4 real data sets:
- Normal distribution with small overlap
- Normal distribution with large overlap
- Iris data
- Appendicitis data
- Cancer data
- Wine data
Parameter Specifications
Population size: 50
Crossover probability: 1.0
Mutation probability:
- p_m = 0.01 for feature selection
- p_m(1 → 0) = 0.1 for instance selection
- p_m(0 → 1) = 0.01 for instance selection
Stopping condition: 500 generations
Weight values: W_g = 5, W_F = 1, W_P = 1
Performance measure: g_A(S) or g_B(S)
30 trials for each data set
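Collected as one configuration object for reference (the dict keys are my own naming; the values are the settings reported in the slides):

```python
# GA settings as reported in the slides; key names are illustrative.
GA_PARAMS = {
    "population_size": 50,
    "crossover_probability": 1.0,
    "mutation_probability_feature": 0.01,  # p_m for feature genes
    "mutation_probability_1_to_0": 0.1,    # biased: instance gene 1 -> 0
    "mutation_probability_0_to_1": 0.01,   # instance gene 0 -> 1
    "generations": 500,
    "weights": {"Wg": 5, "WF": 1, "WP": 1},
    "trials_per_dataset": 30,
}
```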
Performance on Training Data
Performance on Test Data
Leave-one-out procedure (iris & appendicitis); 10-fold cross-validation (cancer & wine)
Effect of Feature Selection
Effect on NN
Some Variants