GA-Based Feature Selection and Parameter Optimization for Support Vector Machine
Cheng-Lung Huang, Chieh-Jen Wang
Expert Systems with Applications, Volume 31, Issue 2, pp.231-240, 2006.
Outline
Introduction
Background Knowledge
GA-Based Feature Selection and Parameter Optimization
Numerical Illustrations
Conclusion
Introduction
Support vector machines (SVM) were first suggested by Vapnik (1995) for classification.
SVM classifies data with different class labels by determining a set of support vectors that outline a hyperplane in the feature space.
The kernel function transforms the training data vectors into the feature space.
SVM has been used in a range of problems including pattern recognition (Pontil & Verri, 1998), bioinformatics (Brown et al., 1999), and text categorization (Joachims, 1998).
Problems
When using SVM, we confront two problems: how to set the best parameters for SVM, and how to choose the input attributes (features) for SVM.
Feature Selection
Feature selection is used to identify a strongly predictive subset of fields within the database and to reduce the number of fields presented to the mining process.
It affects several aspects of pattern classification:
1. The accuracy of the learned classification algorithm
2. The time needed to learn a classification function
3. The number of examples needed for learning
4. The cost associated with the features
SVM Parameters Setting
Proper parameter settings can improve the classification accuracy of SVM.
The parameters that should be optimized include the penalty parameter C and the parameters of the chosen kernel function (e.g., γ for the RBF kernel).
The grid algorithm is an alternative way to find the best C and γ; however, it is time-consuming and does not always perform well.
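For context, a grid search over C and γ can be sketched with scikit-learn as follows; the dataset and the grid values are illustrative, not the ones used in the paper:

```python
# Illustrative grid search over C and gamma for an RBF-kernel SVM.
# The exponent ranges below are an assumption for demonstration.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {
    "C": [2 ** k for k in range(-5, 16, 4)],       # 2^-5 ... 2^15
    "gamma": [2 ** k for k in range(-15, 4, 4)],   # 2^-15 ... 2^1
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```

Every (C, γ) pair in the grid triggers a full cross-validated training run, which is why the method becomes expensive as the grid resolution grows.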
Research Purposes
The objective of this research is to optimize the parameters and the feature subset simultaneously, without degrading the classification accuracy of SVM.
Genetic algorithms (GA) have the potential to generate both the feature subset and the SVM parameters at the same time.
Chromosomes Design
The chromosome comprises three parts of binary genes:
g^C_1 ~ g^C_{n_C}: bits representing the value of parameter C
g^γ_1 ~ g^γ_{n_γ}: bits representing the value of parameter γ
g^f_1 ~ g^f_{n_f}: bits representing the selected features (one bit per feature)
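A minimal sketch of this chromosome layout, assuming illustrative bit lengths (12 bits for C, 12 for γ, 8 features):

```python
import random

# Assumed bit lengths for demonstration; the paper's lengths may differ.
N_C, N_G, N_F = 12, 12, 8  # bits for C, bits for gamma, one bit per feature

def random_chromosome():
    """One chromosome: [C bits | gamma bits | feature bits]."""
    return [random.randint(0, 1) for _ in range(N_C + N_G + N_F)]

def split_chromosome(chrom):
    """Slice the chromosome into its three gene segments."""
    c_bits = chrom[:N_C]
    g_bits = chrom[N_C:N_C + N_G]
    f_bits = chrom[N_C + N_G:]
    return c_bits, g_bits, f_bits

chrom = random_chromosome()
c_bits, g_bits, f_bits = split_chromosome(chrom)
```

Here a feature bit of 1 means the corresponding feature is kept, and 0 means it is dropped.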
Genotype to Phenotype
The bit strings for parameters C and γ are genotypes that must be transformed into phenotypes:

Value = Pmin + ((Pmax − Pmin) / (2^L − 1)) × D

Value: the phenotype value
Pmin: the minimum value of the parameter (user-defined)
Pmax: the maximum value of the parameter (user-defined)
D: the decimal value of the bit string
L: the length of the bit string
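The genotype-to-phenotype decoding, Value = Pmin + (Pmax − Pmin) × D / (2^L − 1), can be sketched as a small helper (the bit length and parameter range below are illustrative):

```python
def decode(bits, p_min, p_max):
    """Map a bit string (genotype) to a real value in [p_min, p_max] (phenotype)."""
    d = int("".join(map(str, bits)), 2)  # D: decimal value of the bit string
    L = len(bits)                        # L: length of the bit string
    return p_min + (p_max - p_min) * d / (2 ** L - 1)

# All-zero bits decode to p_min, all-one bits to p_max:
print(decode([0] * 8, 0.0, 100.0))  # 0.0
print(decode([1] * 8, 0.0, 100.0))  # 100.0
```

Longer bit strings give a finer resolution over the same [Pmin, Pmax] range.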
Fitness Function Design

fitness = WA × SVM_accuracy + WF × ( Σ_{i=1}^{n_f} Ci × Fi )⁻¹

WA: SVM classification accuracy weight
SVM_accuracy: SVM classification accuracy
WF: weight of the features
Ci: cost of feature i
Fi: "1" if feature i is selected; "0" if feature i is not selected
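A sketch of this fitness computation, assuming fitness = WA·accuracy + WF·(Σ Ci·Fi)⁻¹ with illustrative weights and unit feature costs:

```python
def fitness(svm_accuracy, feature_mask, costs, w_a=0.8, w_f=0.2):
    """fitness = w_a * accuracy + w_f / (total cost of selected features).

    Assumes at least one feature is selected; weights w_a, w_f and the
    unit costs below are illustrative, not the paper's values.
    """
    selected_cost = sum(c for c, f in zip(costs, feature_mask) if f == 1)
    return w_a * svm_accuracy + w_f * (1.0 / selected_cost)

costs = [1.0] * 5  # unit cost per feature (assumption)
# At equal accuracy, selecting fewer features yields higher fitness:
print(fitness(0.9, [1, 0, 1, 0, 0], costs))
print(fitness(0.9, [1, 1, 1, 1, 1], costs))
```

The inverse-cost term rewards small feature subsets, so the GA trades accuracy against the number (and cost) of selected features.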
System Flows for GA-based SVM
(1) Data preprocessing: scaling
(2) Converting genotype to phenotype
(3) Feature subset selection
(4) Fitness evaluation
(5) Termination criteria check
(6) Genetic operations
Figure of System Flows
[Flow chart: the dataset is scaled and split into training and testing sets; the parameter genes and feature genes of each chromosome in the population are converted from genotype to phenotype; the SVM classifier is trained on the training set restricted to the selected feature subset and its fitness is evaluated; if the termination criteria are not satisfied, genetic operations produce the next population; otherwise the optimized (C, γ) and feature subset are used to compute the classification accuracy on the testing set.]
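The six steps can be sketched end-to-end on a toy dataset. This is a deliberately scaled-down sketch: the population size, generation count, and bit lengths are much smaller than in the slides, and truncation selection stands in for roulette-wheel selection:

```python
# Scaled-down sketch of the GA-based SVM flow (illustrative settings throughout).
import random
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

random.seed(0)
N_C, N_G, N_F = 8, 8, 10  # bits for C, bits for gamma, number of features

X, y = make_classification(n_samples=200, n_features=N_F, random_state=0)
X = MinMaxScaler().fit_transform(X)                          # (1) scaling
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def decode(bits, lo, hi):                                    # (2) genotype -> phenotype
    d = int("".join(map(str, bits)), 2)
    return lo + (hi - lo) * d / (2 ** len(bits) - 1)

def evaluate(chrom):                                         # (3) subset + (4) fitness
    C = decode(chrom[:N_C], 0.01, 100.0)
    gamma = decode(chrom[N_C:N_C + N_G], 0.0001, 10.0)
    mask = [i for i, b in enumerate(chrom[N_C + N_G:]) if b]
    if not mask:
        return 0.0  # empty feature subset gets worst fitness
    clf = SVC(C=C, gamma=gamma).fit(X_tr[:, mask], y_tr)
    return clf.score(X_te[:, mask], y_te)

pop = [[random.randint(0, 1) for _ in range(N_C + N_G + N_F)] for _ in range(20)]
for gen in range(10):                                        # (5) fixed-generation stop
    scored = sorted(pop, key=evaluate, reverse=True)
    pop = scored[:2]                                         # elitism
    while len(pop) < 20:                                     # (6) crossover + mutation
        p1, p2 = random.sample(scored[:10], 2)
        cut = random.randrange(1, len(p1))                   # one-point crossover
        child = p1[:cut] + p2[cut:]
        pop.append([b ^ (random.random() < 0.02) for b in child])  # bit-flip mutation

best = max(pop, key=evaluate)
print("best accuracy:", evaluate(best))
```

Here fitness is plain test-set accuracy for brevity; the slides' fitness additionally rewards small feature subsets via the feature-cost term.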
Experimental Datasets

No.  Name                           Classes  Instances  Nominal features  Numeric features  Total features
1    German (credit card)           2        1000       0                 24                24
2    Australian (credit card)       2        690        6                 8                 14
3    Pima-Indian diabetes           2        760        0                 8                 8
4    Heart disease (Statlog)        2        270        7                 6                 13
5    Breast cancer (Wisconsin)      2        699        0                 10                10
6    Contraceptive Method Choice    3        1473       7                 2                 9
7    Ionosphere                     2        351        0                 34                34
8    Iris                           3        150        0                 4                 4
9    Sonar                          2        208        0                 60                60
10   Statlog project: vehicle       4        940        0                 18                18
11   Vowel                          11       990        3                 10                13
Experiments Description
To ensure that the results are valid and can be generalized for making predictions on new data, this study used k-fold cross-validation with k = 10: the data are divided into ten parts, each of which takes a turn as the testing set.
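The 10-fold procedure can be sketched with scikit-learn; the dataset is illustrative:

```python
# 10-fold cross-validation of an RBF-kernel SVM on an illustrative dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
# cv=10: the data are split into ten folds; each fold is the test set once.
scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=10)
print(len(scores), scores.mean())
```

Averaging the ten fold scores gives a less optimistic accuracy estimate than a single train/test split.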
Accuracy Calculation
For the binary-target datasets, accuracy is reported as the positive hit rate (sensitivity), the negative hit rate (specificity), and the overall hit rate.
For the multi-class datasets, accuracy is reported only as the average hit rate.
Accuracy Calculation
Sensitivity is the proportion of positive cases that are classified as positive: P(T+|D+) = TP / (TP + FN).
Specificity is the proportion of negative cases that are classified as negative: P(T−|D−) = TN / (TN + FP).
The overall hit rate is the overall accuracy, calculated as (TP + TN) / (TP + FP + FN + TN).
                          Target (or Disease)
                          +                     −
Predicted (or Test)  +    True Positive (TP)    False Positive (FP)
                     −    False Negative (FN)   True Negative (TN)
Accuracy Calculation
The SVM_accuracy term in the fitness function is measured by Sensitivity × Specificity for datasets with two classes (positive or negative), and by the overall hit rate for datasets with multiple classes.
GA Parameter Setting
Chromosome: represented using binary code
Population size: 500
Crossover rate: 0.7, one-point crossover
Mutation rate: 0.02
Selection: roulette wheel
Replacement: elitism
WA and WF
The weights WA and WF influence the experimental results through the fitness function:
The higher WA is, the higher the classification accuracy is.
The higher WF is, the smaller the number of selected features is.
Folder #4 of German Dataset Curve Diagram
[Two plots against the accuracy weight WA, from 1.0 down to 0.1: the first shows accuracy and fitness (y-axis 0 to 1); the second shows the number of selected features (y-axis 0 to 20).]
Experimental Results for German Dataset
Results summary (GA-based approach vs. grid search)

Name            Original  Selected    GA Pos.   GA Neg.   GA Overall    Grid Pos.  Grid Neg.  Grid Overall  Wilcoxon
                features  features    hit rate  hit rate  hit rate (%)  hit rate   hit rate   hit rate (%)  p-value
German          24        13±1.83     0.89608   0.76621   85.6±1.96     0.888271   0.462476   76±4.06       0.005*
Australian      14        3±2.45      0.8472    0.92182   88.1±2.25     0.885714   0.823529   84.7±4.74     0.028*
diabetes        8         3.7±0.95    0.78346   0.87035   81.5±7.13     0.592593   0.88       77.3±3.03     0.139
Heart disease   13        5.4±1.85    0.94467   0.95108   94.8±3.32     0.75       0.909091   83.7±6.34     0.005*
breast cancer   10        1±0         0.9878    0.8996    96.19±1.24    0.98       0.944444   95.3±2.28     0.435
Contraceptive   9         5.4±0.53    N/A       N/A       71.22±4.15    N/A        N/A        53.53±2.43    0.005*
ionosphere      34        6±0         0.9963    0.9876    98.56±2.03    0.94       0.9        89.44±3.58    0.005*
iris            4         1±0         N/A       N/A       100±0         N/A        N/A        97.37±3.46    0.046*
sonar           60        15±1.1      0.9863    0.9842    98±3.5        0.65555    0.9        87±4.22       0.004*
vehicle         18        9.2±1.4     N/A       N/A       84.06±3.54    N/A        N/A        83.33±2.74    0.944
Vowel           13        7.8±1       N/A       N/A       99.3±0.82     N/A        N/A        95.95±2.91    0.02*
ROC curve for fold #4 of German Credit Dataset
[ROC plot: sensitivity vs. 1 − specificity, with curves for the grid search and the GA-based approach and a diagonal reference guide.]
Average AUC for Datasets
Dataset          GA-based approach   Grid algorithm
German 0.8424 0.7886
Australian 0.9019 0.8729
diabetes 0.8298 0.7647
Heart disease 0.9458 0.8331
breast cancer 0.9423 0.9078
Contraceptive 0.7701 0.6078
ionosphere 0.9661 0.8709
iris 0.9756 0.9572
sonar 0.9522 0.8898
vehicle 0.8587 0.8311
Vowel 0.9513 0.9205
Conclusion
We proposed a GA-based strategy to select the feature subset and to set the parameters for SVM classification simultaneously.
We conducted two experiments to evaluate the classification accuracy of the proposed GA-based approach with the RBF kernel against the grid search method on 11 real-world datasets from the UCI repository.
Overall, compared with the grid search approach, the proposed GA-based approach achieves good accuracy with fewer features.
Thank You. Q & A