Genetic-Algorithm-Based Instance and Feature Selection
Transcript of Genetic-Algorithm-Based Instance and Feature Selection
Instance Selection and Construction for Data Mining Ch. 6
H. Ishibuchi, T. Nakashima, and M. Nii
Abstract
A GA-based approach for selecting a small number of instances (and features) from a given data set in a pattern classification problem.
The goal is to improve the classification ability of a nearest neighbor classifier by searching for an appropriate reference set.
Genetic Algorithm
Coding: binary string of length (n + m)
S = a_1 a_2 ... a_n s_1 s_2 ... s_m
a_i : inclusion (1) or exclusion (0) of the i-th feature
s_p : inclusion (1) or exclusion (0) of the p-th instance
Fitness function: minimize |F|, minimize |P|, and maximize g(S)
|F| : number of selected features
|P| : number of selected instances
g(S) : classification performance
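The coding scheme above can be sketched in a few lines. This is an illustrative helper, not code from the chapter; the function names (`random_chromosome`, `decode`) are my own.

```python
import random

def random_chromosome(n, m, seed=None):
    """A binary string S = a_1...a_n s_1...s_m as a list of 0/1 genes."""
    rng = random.Random(seed)
    return [rng.randint(0, 1) for _ in range(n + m)]

def decode(chromosome, n):
    """Split S into selected feature indices F and instance indices P."""
    a, s = chromosome[:n], chromosome[n:]
    F = [i for i, ai in enumerate(a) if ai == 1]
    P = [p for p, sp in enumerate(s) if sp == 1]
    return F, P
```

For example, with n = 3 features, the string `[1, 0, 1, 0, 1]` decodes to F = {0, 2} and P = {1}.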
Genetic Algorithm
Performance measure (first one): g_A(S) = the number of correctly classified training instances. Minimize |P| subject to g_A(S) = m (i.e., all m instances are correctly classified).
Performance measure (second one): g_B(S), where an instance x_q included in the reference set is not selected as its own nearest neighbor.
The nearest neighbor x_q* of x_q is defined using the distance d_F measured only on the selected features F:

d_F(x_q, x_q*) = min{ d_F(x_q, x_p) | x_p ∈ P },           if x_q ∉ P
d_F(x_q, x_q*) = min{ d_F(x_q, x_p) | x_p ∈ P − {x_q} },   if x_q ∈ P

fitness(S) = W_g · g(S) − W_F · |F| − W_P · |P|
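The g_B measure and the weighted fitness can be sketched as follows. This is a minimal illustration under my own naming (`dF`, `g_B`, `fitness`); the handling of a reference set that leaves an instance with no candidate neighbor (counted as misclassified here) is an assumption, not specified in the slides.

```python
import math

def dF(x, y, F):
    """Euclidean distance measured only on the selected features F."""
    return math.sqrt(sum((x[i] - y[i]) ** 2 for i in F))

def nearest_neighbor(q, X, P, F):
    """Index in P of the nearest reference instance to X[q];
    X[q] is never allowed to pick itself (the g_B rule)."""
    candidates = [p for p in P if p != q]
    if not candidates:
        return None  # assumption: no neighbor available -> misclassified
    return min(candidates, key=lambda p: dF(X[q], X[p], F))

def g_B(X, labels, P, F):
    """Number of correctly classified training instances under g_B."""
    correct = 0
    for q in range(len(X)):
        p = nearest_neighbor(q, X, P, F)
        if p is not None and labels[p] == labels[q]:
            correct += 1
    return correct

def fitness(g, F, P, Wg=5, WF=1, WP=1):
    """fitness(S) = Wg*g(S) - WF*|F| - WP*|P| (weights from the slides)."""
    return Wg * g - WF * len(F) - WP * len(P)
```

On a toy 1-D data set X = [(0,), (1,), (10,), (11,)] with labels [0, 0, 1, 1], the full reference set P = {0, 1, 2, 3} and F = {0} give g_B = 4 and fitness = 5·4 − 1·1 − 1·4 = 15.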
Genetic Algorithm
1. Initialization
2. Genetic operations: iterate the following procedure Npop/2 times to generate Npop new strings
1. Randomly select a pair of strings
2. Apply uniform crossover
3. Apply a mutation operator
3. Generation update: select the Npop best strings from the 2Npop current and generated strings
4. Termination test
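The loop above can be sketched as a plain elitist GA. This is a hedged illustration, not the chapter's code: the selection of random pairs, the 0.5 gene-swap rate in uniform crossover, and passing `evaluate` as a callable are my own simplifications.

```python
import random

def uniform_crossover(p1, p2, rng):
    """Each gene position is swapped between parents with probability 0.5."""
    c1, c2 = [], []
    for g1, g2 in zip(p1, p2):
        if rng.random() < 0.5:
            c1.append(g1); c2.append(g2)
        else:
            c1.append(g2); c2.append(g1)
    return c1, c2

def mutate(chrom, pm, rng):
    """Flip each gene independently with probability pm."""
    return [1 - g if rng.random() < pm else g for g in chrom]

def evolve(population, evaluate, pm=0.01, generations=500, seed=0):
    """One GA run: Npop/2 crossovers -> Npop offspring, then keep the
    Npop best of the 2*Npop current and generated strings (elitist)."""
    rng = random.Random(seed)
    npop = len(population)
    for _ in range(generations):
        offspring = []
        for _ in range(npop // 2):
            p1, p2 = rng.sample(population, 2)  # random pair selection
            c1, c2 = uniform_crossover(p1, p2, rng)
            offspring += [mutate(c1, pm, rng), mutate(c2, pm, rng)]
        population = sorted(population + offspring,
                            key=evaluate, reverse=True)[:npop]
    return population
```

Because the update keeps the best of the combined pool, the best fitness in the population never decreases across generations.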
Numerical Example
Biased Mutation
An effective way to decrease the number of selected instances is to bias the mutation probability: a much larger probability is assigned to the mutation from s_p = 1 to s_p = 0 than to the mutation from s_p = 0 to s_p = 1.
Data Sets
2 artificial + 4 real data sets:
- Normal distribution with small overlap
- Normal distribution with large overlap
- Iris data
- Appendicitis data
- Cancer data
- Wine data
Parameter Specifications
Population size: 50
Crossover probability: 1.0
Mutation probability:
- p_m = 0.01 for feature selection
- p_m(1 → 0) = 0.1 for instance selection
- p_m(0 → 1) = 0.01 for instance selection
Stopping condition: 500 generations
Weight values: W_g = 5, W_F = 1, W_P = 1
Performance measure: g_A(S) or g_B(S)
30 trials for each data set
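Collected as one configuration object for reference (the dict keys are my own naming; the values are the settings reported in the slides):

```python
# GA settings as reported in the slides; key names are illustrative.
GA_PARAMS = {
    "population_size": 50,
    "crossover_probability": 1.0,
    "mutation_probability_feature": 0.01,  # p_m for feature genes
    "mutation_probability_1_to_0": 0.1,    # biased: instance gene 1 -> 0
    "mutation_probability_0_to_1": 0.01,   # instance gene 0 -> 1
    "generations": 500,
    "weights": {"Wg": 5, "WF": 1, "WP": 1},
    "trials_per_dataset": 30,
}
```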
Performance on Training Data
Performance on Test Data
Leave-one-out procedure (iris & appendicitis); 10-fold cross-validation (cancer & wine)
Effect of Feature Selection
Effect on NN
Some Variants