GA-Based Feature Selection and Parameter Optimization for Support Vector Machine
Cheng-Lung Huang, Chieh-Jen Wang
Expert Systems with Applications, Volume 31, Issue 2, pp.231-240, 2006.
Outline
Introduction
Background Knowledge
GA-Based Feature Selection and Parameter Optimization
Numerical Illustrations
Conclusion
Introduction
Support vector machines (SVM) were first suggested by Vapnik (1995) for classification.
SVM classifies data with different class labels by determining a set of support vectors that outline a hyperplane in the feature space.
The kernel function transforms the training data vectors into the feature space.
SVM has been used in a range of problems including pattern recognition (Pontil & Verri, 1998), bioinformatics (Brown et al., 1999), and text categorization (Joachims, 1998).
Problems
When using SVM, we confront two problems: how to set the best parameters for SVM, and how to choose the input attributes (features) for SVM.
Feature Selection
Feature selection is used to identify a strongly predictive subset of fields within the database and to reduce the number of fields presented to the mining process.
It affects several aspects of pattern classification:
1. The accuracy of the learned classification algorithm
2. The time needed to learn a classification function
3. The number of examples needed for learning
4. The cost associated with the features
SVM Parameters Setting
Proper parameter settings can improve the classification accuracy of SVM.
The parameters that should be optimized include the penalty parameter C and the parameters of the chosen kernel function (e.g., γ for the RBF kernel).
The grid algorithm is an alternative way to find the best C and γ; however, it is time-consuming and does not always perform well.
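For context, a grid search over C and γ can be sketched with scikit-learn as follows; the dataset and the grid values are illustrative, not the ones used in the paper:

```python
# Illustrative grid search over C and gamma for an RBF-kernel SVM.
# The exponent ranges below are an assumption for demonstration.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {
    "C": [2 ** k for k in range(-5, 16, 4)],       # 2^-5 ... 2^15
    "gamma": [2 ** k for k in range(-15, 4, 4)],   # 2^-15 ... 2^1
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```

Every (C, γ) pair in the grid triggers a full cross-validated training run, which is why the method becomes expensive as the grid resolution grows.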
Research Purposes
The objective of this research is to optimize the parameters and the feature subset simultaneously, without degrading the classification accuracy of SVM.
Genetic algorithms (GA) have the potential to generate both the feature subset and the SVM parameters at the same time.
Chromosomes Design
The chromosome comprises three parts of binary genes:
g^C_1 ~ g^C_{n_C}: bits representing the value of parameter C
g^γ_1 ~ g^γ_{n_γ}: bits representing the value of parameter γ
g^f_1 ~ g^f_{n_f}: bits representing the selected features (one bit per feature)
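A minimal sketch of this chromosome layout, assuming illustrative bit lengths (12 bits for C, 12 for γ, 8 features):

```python
import random

# Assumed bit lengths for demonstration; the paper's lengths may differ.
N_C, N_G, N_F = 12, 12, 8  # bits for C, bits for gamma, one bit per feature

def random_chromosome():
    """One chromosome: [C bits | gamma bits | feature bits]."""
    return [random.randint(0, 1) for _ in range(N_C + N_G + N_F)]

def split_chromosome(chrom):
    """Slice the chromosome into its three gene segments."""
    c_bits = chrom[:N_C]
    g_bits = chrom[N_C:N_C + N_G]
    f_bits = chrom[N_C + N_G:]
    return c_bits, g_bits, f_bits

chrom = random_chromosome()
c_bits, g_bits, f_bits = split_chromosome(chrom)
```

Here a feature bit of 1 means the corresponding feature is kept, and 0 means it is dropped.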
Genotype to Phenotype
The bit strings for parameters C and γ are genotypes that must be transformed into phenotypes:

Value = Pmin + ((Pmax − Pmin) / (2^L − 1)) × D

Value: the phenotype value
Pmin: the minimum value of the parameter (user-defined)
Pmax: the maximum value of the parameter (user-defined)
D: the decimal value of the bit string
L: the length of the bit string
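The genotype-to-phenotype decoding, Value = Pmin + (Pmax − Pmin) × D / (2^L − 1), can be sketched as a small helper (the bit length and parameter range below are illustrative):

```python
def decode(bits, p_min, p_max):
    """Map a bit string (genotype) to a real value in [p_min, p_max] (phenotype)."""
    d = int("".join(map(str, bits)), 2)  # D: decimal value of the bit string
    L = len(bits)                        # L: length of the bit string
    return p_min + (p_max - p_min) * d / (2 ** L - 1)

# All-zero bits decode to p_min, all-one bits to p_max:
print(decode([0] * 8, 0.0, 100.0))  # 0.0
print(decode([1] * 8, 0.0, 100.0))  # 100.0
```

Longer bit strings give a finer resolution over the same [Pmin, Pmax] range.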
Fitness Function Design

fitness = WA × SVM_accuracy + WF × ( Σ_{i=1}^{n_f} Ci × Fi )⁻¹

WA: SVM classification accuracy weight
SVM_accuracy: SVM classification accuracy
WF: weight of the features
Ci: cost of feature i
Fi: "1" if feature i is selected; "0" if feature i is not selected
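A sketch of this fitness computation, assuming fitness = WA·accuracy + WF·(Σ Ci·Fi)⁻¹ with illustrative weights and unit feature costs:

```python
def fitness(svm_accuracy, feature_mask, costs, w_a=0.8, w_f=0.2):
    """fitness = w_a * accuracy + w_f / (total cost of selected features).

    Assumes at least one feature is selected; weights w_a, w_f and the
    unit costs below are illustrative, not the paper's values.
    """
    selected_cost = sum(c for c, f in zip(costs, feature_mask) if f == 1)
    return w_a * svm_accuracy + w_f * (1.0 / selected_cost)

costs = [1.0] * 5  # unit cost per feature (assumption)
# At equal accuracy, selecting fewer features yields higher fitness:
print(fitness(0.9, [1, 0, 1, 0, 0], costs))
print(fitness(0.9, [1, 1, 1, 1, 1], costs))
```

The inverse-cost term rewards small feature subsets, so the GA trades accuracy against the number (and cost) of selected features.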
System Flows for GA-based SVM
(1) Data preprocessing: scaling
(2) Converting genotype to phenotype
(3) Feature subset selection
(4) Fitness evaluation
(5) Termination criteria check
(6) Genetic operations
Figure of System Flows
[Flow chart: the dataset is scaled and split into training and testing sets; the parameter genes and feature genes of each chromosome in the population are converted from genotype to phenotype; the SVM classifier is trained on the training set restricted to the selected feature subset and its fitness is evaluated; if the termination criteria are not satisfied, genetic operations produce the next population; otherwise the optimized (C, γ) and feature subset are used to compute the classification accuracy on the testing set.]
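The six steps can be sketched end-to-end on a toy dataset. This is a deliberately scaled-down sketch: the population size, generation count, and bit lengths are much smaller than in the slides, and truncation selection stands in for roulette-wheel selection:

```python
# Scaled-down sketch of the GA-based SVM flow (illustrative settings throughout).
import random
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

random.seed(0)
N_C, N_G, N_F = 8, 8, 10  # bits for C, bits for gamma, number of features

X, y = make_classification(n_samples=200, n_features=N_F, random_state=0)
X = MinMaxScaler().fit_transform(X)                          # (1) scaling
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def decode(bits, lo, hi):                                    # (2) genotype -> phenotype
    d = int("".join(map(str, bits)), 2)
    return lo + (hi - lo) * d / (2 ** len(bits) - 1)

def evaluate(chrom):                                         # (3) subset + (4) fitness
    C = decode(chrom[:N_C], 0.01, 100.0)
    gamma = decode(chrom[N_C:N_C + N_G], 0.0001, 10.0)
    mask = [i for i, b in enumerate(chrom[N_C + N_G:]) if b]
    if not mask:
        return 0.0  # empty feature subset gets worst fitness
    clf = SVC(C=C, gamma=gamma).fit(X_tr[:, mask], y_tr)
    return clf.score(X_te[:, mask], y_te)

pop = [[random.randint(0, 1) for _ in range(N_C + N_G + N_F)] for _ in range(20)]
for gen in range(10):                                        # (5) fixed-generation stop
    scored = sorted(pop, key=evaluate, reverse=True)
    pop = scored[:2]                                         # elitism
    while len(pop) < 20:                                     # (6) crossover + mutation
        p1, p2 = random.sample(scored[:10], 2)
        cut = random.randrange(1, len(p1))                   # one-point crossover
        child = p1[:cut] + p2[cut:]
        pop.append([b ^ (random.random() < 0.02) for b in child])  # bit-flip mutation

best = max(pop, key=evaluate)
print("best accuracy:", evaluate(best))
```

Here fitness is plain test-set accuracy for brevity; the slides' fitness additionally rewards small feature subsets via the feature-cost term.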
Experimental Datasets

No.  Name                           Classes  Instances  Nominal features  Numeric features  Total features
1    German (credit card)           2        1000       0                 24                24
2    Australian (credit card)       2        690        6                 8                 14
3    Pima-Indian diabetes           2        760        0                 8                 8
4    Heart disease (Statlog)        2        270        7                 6                 13
5    Breast cancer (Wisconsin)      2        699        0                 10                10
6    Contraceptive Method Choice    3        1473       7                 2                 9
7    Ionosphere                     2        351        0                 34                34
8    Iris                           3        150        0                 4                 4
9    Sonar                          2        208        0                 60                60
10   Statlog project: vehicle       4        940        0                 18                18
11   Vowel                          11       990        3                 10                13
Experiments Description
To ensure that the results are valid and can be generalized for making predictions on new data, this study used k-fold cross-validation with k = 10: the data are divided into ten parts, each of which takes a turn as the testing set.
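The 10-fold procedure can be sketched with scikit-learn; the dataset is illustrative:

```python
# 10-fold cross-validation of an RBF-kernel SVM on an illustrative dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
# cv=10: the data are split into ten folds; each fold is the test set once.
scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=10)
print(len(scores), scores.mean())
```

Averaging the ten fold scores gives a less optimistic accuracy estimate than a single train/test split.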
Accuracy Calculation
For the binary-target datasets, accuracy is reported as the positive hit rate (sensitivity), the negative hit rate (specificity), and the overall hit rate.
For the multi-class datasets, accuracy is reported only as the average hit rate.
Accuracy Calculation
Sensitivity is the proportion of positive cases that are classified as positive: P(T+|D+) = TP / (TP + FN).
Specificity is the proportion of negative cases that are classified as negative: P(T−|D−) = TN / (TN + FP).
The overall hit rate is the overall accuracy, calculated as (TP + TN) / (TP + FP + FN + TN).
                          Target (or Disease)
                          +                     −
Predicted (or Test)  +    True Positive (TP)    False Positive (FP)
                     −    False Negative (FN)   True Negative (TN)
Accuracy Calculation
The SVM_accuracy term in the fitness function is measured by Sensitivity × Specificity for datasets with two classes (positive or negative), and by the overall hit rate for datasets with multiple classes.
GA Parameter Setting
Chromosome: represented using binary code
Population size: 500
Crossover rate: 0.7, one-point crossover
Mutation rate: 0.02
Selection: roulette wheel
Replacement: elitism
WA and WF
The weights WA and WF influence the experimental results through the fitness function:
The higher WA is, the higher the classification accuracy is.
The higher WF is, the smaller the number of selected features is.
Folder #4 of German Dataset Curve Diagram
[Two plots against the accuracy weight WA, from 1.0 down to 0.1: the first shows accuracy and fitness (y-axis 0 to 1); the second shows the number of selected features (y-axis 0 to 20).]
Experimental Results for German Dataset
Results summary (GA-based approach vs. grid search)

Name            Original  Selected    GA Pos.   GA Neg.   GA Overall    Grid Pos.  Grid Neg.  Grid Overall  Wilcoxon
                features  features    hit rate  hit rate  hit rate (%)  hit rate   hit rate   hit rate (%)  p-value
German          24        13±1.83     0.89608   0.76621   85.6±1.96     0.888271   0.462476   76±4.06       0.005*
Australian      14        3±2.45      0.8472    0.92182   88.1±2.25     0.885714   0.823529   84.7±4.74     0.028*
diabetes        8         3.7±0.95    0.78346   0.87035   81.5±7.13     0.592593   0.88       77.3±3.03     0.139
Heart disease   13        5.4±1.85    0.94467   0.95108   94.8±3.32     0.75       0.909091   83.7±6.34     0.005*
breast cancer   10        1±0         0.9878    0.8996    96.19±1.24    0.98       0.944444   95.3±2.28     0.435
Contraceptive   9         5.4±0.53    N/A       N/A       71.22±4.15    N/A        N/A        53.53±2.43    0.005*
ionosphere      34        6±0         0.9963    0.9876    98.56±2.03    0.94       0.9        89.44±3.58    0.005*
iris            4         1±0         N/A       N/A       100±0         N/A        N/A        97.37±3.46    0.046*
sonar           60        15±1.1      0.9863    0.9842    98±3.5        0.65555    0.9        87±4.22       0.004*
vehicle         18        9.2±1.4     N/A       N/A       84.06±3.54    N/A        N/A        83.33±2.74    0.944
Vowel           13        7.8±1       N/A       N/A       99.3±0.82     N/A        N/A        95.95±2.91    0.02*
ROC curve for fold #4 of German Credit Dataset
[ROC plot: sensitivity vs. 1 − specificity, with curves for the grid search and the GA-based approach and a diagonal reference guide.]
Average AUC for Datasets
Dataset          GA-based approach   Grid algorithm
German 0.8424 0.7886
Australian 0.9019 0.8729
diabetes 0.8298 0.7647
Heart disease 0.9458 0.8331
breast cancer 0.9423 0.9078
Contraceptive 0.7701 0.6078
ionosphere 0.9661 0.8709
iris 0.9756 0.9572
sonar 0.9522 0.8898
vehicle 0.8587 0.8311
Vowel 0.9513 0.9205
Conclusion
We proposed a GA-based strategy to select the feature subset and to set the parameters for SVM classification simultaneously.
We conducted two experiments to evaluate the classification accuracy of the proposed GA-based approach with the RBF kernel against the grid search method on 11 real-world datasets from the UCI repository.
Overall, compared with the grid search approach, the proposed GA-based approach achieves good accuracy with fewer features.
Thank You. Q & A