
21st International Conference Radioelektronika (RADIOELEKTRONIKA 2011), Brno, Czech Republic

Model Parameters Selection for SVM Classification using Particle Swarm Optimization

Martin HRIC, Michal CHMULÍK, Roman JARINA

Dept. of Telecommunications and Multimedia, University of Žilina, Univerzitná 1, 010 26 Žilina, Slovak Republic

[email protected], [email protected], [email protected]

Abstract. Support Vector Machine (SVM) classification requires setting one or more parameters, and these parameters have a significant influence on classification precision and generalization ability. Searching for suitable model parameters imposes a great computational load, which grows with the size of the dataset and with the number of parameters being optimized. In this paper we present and compare various SVM parameter selection techniques, namely grid search, Particle Swarm Optimization (PSO) and Genetic Algorithm (GA). Experiments conducted on two datasets show promising results with the PSO and GA optimization techniques.

Keywords: SVM, model selection, classification, PSO, GA

1. Introduction

Nowadays, the Support Vector Machine (SVM) is one of the most frequently used techniques for classification and regression in many applications. SVM is a learning procedure based on Vapnik's statistical learning theory [1] proposed in 1979, but only in the past decade has it been applied and used in real applications.

SVM was initially developed for the classification problem with separable data. Later, it was improved to handle nonseparable data and also adapted to solve the regression problem [2]. In cases where a linear decision boundary is inappropriate, SVM maps an input vector into a higher-dimensional feature space by a nonlinear mapping, so that a separating hyperplane can be constructed. The nonlinear mapping into feature space is performed by the kernel function [6].

The SVM algorithm was originally designed for binary classification and was later extended to multiclass classification problems. This extension is achieved by two approaches: 1) solving one large optimization problem at once; 2) decomposing the original problem into smaller binary sub-tasks and then combining their partial solutions. Although both approaches give similar performance during tuning of the hyper-parameters, decomposition is computationally more attractive and thus more often used [3]. There are two main types of decomposition schemes: one-against-all (OAA) and one-against-one (OAO) [4]. They have been widely used due to their simplicity, efficiency and good classification performance [3].

This paper presents various model parameter selection techniques for SVM classification and the influence of the selected parameters on the classification. Emphasis is placed on the PSO algorithm [7, 3], the Genetic Algorithm, and the standard grid search approach [7, 9] for model parameter selection of multiclass SVMs with the RBF (Radial Basis Function) kernel.

The paper is organized as follows. Section 2 describes the C-SVM (cost-based SVM) classification algorithm, and three different model parameter selection techniques, namely grid search, PSO and GA, are discussed. In Section 3, we present our experimental results of the parameter search using these techniques. Finally, in Section 4, we discuss the performance of the model selection techniques and their impact on classification precision.

2. SVM Classification

SVMs belong to the supervised learning methods that analyze data and recognize patterns. The basic SVM is a non-probabilistic binary linear classifier, and it belongs to the group of model-based classifiers. The training algorithm constructs a model that represents patterns as points in a vector space, mapped so that the patterns of the separate classes are divided by a gap that is as wide as possible.

Development of a classification system includes separating the data into training and testing sets. Each instance in the training set contains the features of the observed data and a class label. In the next paragraph, the C-SVM formulation [5] is introduced.

The training set consists of the instance-label pairs $(x_i, y_i)$, $i = 1, 2, \dots, l$, where $x_i \in \mathbb{R}^n$ and $y \in \{1, -1\}^l$. The SVM requires the solution of the optimization problem [5]:

$$\min_{w,\,b,\,\xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \qquad (1)$$

subject to:


$$y_i \left( w^T \phi(x_i) + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0 \qquad (2)$$

where $\phi(x_i)$ maps $x_i$ into a higher-dimensional space and $C > 0$ is the regularization parameter. Due to the possibly high dimensionality of the vector variable $w$, we usually solve the following dual problem:

$$\min_{\alpha} \; \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \qquad (3)$$

subject to:

$$y^T \alpha = 0, \qquad 0 \le \alpha_i \le C \qquad (4)$$

where $i = 1, 2, \dots, l$, $e = [1, 1, \dots, 1]^T$ is the vector of all ones of length $l$, $Q$ is an $l \times l$ positive semidefinite matrix with $Q_{ij} \equiv y_i y_j K(x_i, x_j)$, and $K(x_i, x_j) \equiv \phi(x_i)^T \phi(x_j)$ is the kernel function [5].

After problem (3) has been solved, the optimal $w$ satisfies (by the primal-dual relationship):

$$w = \sum_{i=1}^{l} y_i \alpha_i \phi(x_i) \qquad (5)$$

and the decision function is:

$$\operatorname{sgn}\left( w^T \phi(x) + b \right) = \operatorname{sgn}\left( \sum_{i=1}^{l} y_i \alpha_i K(x_i, x) + b \right) \qquad (6)$$

After this step, $y_i \alpha_i$ for all $i$, $b$, the label names, the support vectors, and other information such as the kernel parameters are stored in the model [5].

There are four basic kernel functions: linear, polynomial, radial basis function (RBF) and sigmoid. Each of the kernels has one or more parameters to be set, depending on the particular type. The most frequently used kernel function, the RBF, is defined as [5]:

$$K(x_i, x_j) = \exp\left( -\gamma \left\| x_i - x_j \right\|^2 \right) \qquad (7)$$

where $\gamma > 0$.
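To make the preceding formulas concrete, the following minimal NumPy sketch (our own illustration, not LIBSVM code; all function names are ours) computes the RBF kernel of (7), the matrix Q of the dual problem (3), and the decision rule of (6):

```python
import numpy as np

def rbf_kernel(xi, xj, gamma):
    # Equation (7): K(xi, xj) = exp(-gamma * ||xi - xj||^2)
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

def q_matrix(X, y, gamma):
    # Matrix Q of the dual problem (3): Q[i, j] = y_i * y_j * K(x_i, x_j)
    sq = np.sum(X ** 2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))
    return np.outer(y, y) * K

def decide(x, support_vectors, y_alpha, b, gamma):
    # Decision function (6): sgn(sum_i y_i * alpha_i * K(x_i, x) + b)
    s = sum(ya * rbf_kernel(sv, x, gamma)
            for sv, ya in zip(support_vectors, y_alpha))
    return np.sign(s + b)
```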

2.1 SVM Model Selection

Each SVM formulation requires the user to set two or more parameters, depending on the selection of the kernel function. These parameters affect the overall performance and generalization ability of the SVM [7].

In our experiment, the C-SVM formulation with the RBF kernel function is used. This formulation of SVM requires setting two parameters: the cost parameter $C$, whose value typically lies between $2^{-5}$ and $2^{20}$, and the parameter $\gamma$, which typically lies between $2^{-20}$ and $2^{5}$ [7].

The naive method for choosing near-optimal parameters is the grid search. This method exhaustively calculates the n-fold Cross-Validation (CV) accuracy for every combination from a defined region of the parameters $C$ and $\gamma$. For instance, when performing a coarse search of the region between $2^{-5}$ and $2^{20}$ for the parameter $C$, we could try every cost parameter $2^m$ for $m = -5, -4, -3, \dots, 20$. For each of these $C$ values, we evaluate every $\gamma$ at the value $2^m$ for $m = -20, -19, \dots, 5$. This search requires running the SVM training for $26 \times 26 = 676$ different parameter combinations. Hence, this technique is very time consuming even when searching for only two model parameters.
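For illustration, a compact sketch of this coarse grid search, written with scikit-learn's SVC class (a wrapper around LIBSVM [5]); X and y are placeholder feature and label arrays:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def grid_search(X, y, c_exps=range(-5, 21), g_exps=range(-20, 6), folds=5):
    # Exhaustive coarse search: 26 x 26 = 676 (C, gamma) combinations
    best = (None, None, -np.inf)
    for m in c_exps:
        for n in g_exps:
            C, gamma = 2.0 ** m, 2.0 ** n
            acc = cross_val_score(SVC(C=C, gamma=gamma, kernel="rbf"),
                                  X, y, cv=folds).mean()
            if acc > best[2]:
                best = (C, gamma, acc)
    return best  # (C, gamma, CV accuracy)
```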

Besides the deterministic method for setting the SVM parameters described in the previous paragraph, another possibility is to exploit evolutionary optimization techniques, which are usually based on stochastic processes [8]. These methods are able to find solutions that can be very close to the optimal ones, even on multimodal functions with many local extrema. Typical examples are genetic algorithms, differential evolution, simulated annealing, evolution strategies, and algorithms belonging to the group of swarm intelligence.

2.2 Model Selection Using PSO

One of the novel promising algorithms is Particle Swarm Optimization (PSO). This approach is inspired by the social behavior of animal communities, particularly birds. PSO uses only primitive mathematical operations; it is therefore computationally inexpensive, very easy to implement, and achieves fast convergence.

PSO consists of a swarm of interacting particles moving in the n-dimensional search space of the problem's solutions. Each particle i is represented by four elements (vectors): its current position xi, velocity vi, best previous position pi, and the best global position g ever found in the swarm. PSO iterates for several generations, updating the particles' velocities and positions. Information about good solutions spreads throughout the swarm, and the particles can explore promising regions [7].
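The standard velocity and position update can be sketched as follows; this is our own generic implementation, not the authors' code, with the swarm size, iteration count, inertia factor and learning coefficients taken from the setup reported in Section 3:

```python
import numpy as np

def pso(fitness, lo, hi, n_particles=21, iters=20,
        w=0.729844, c1=1.49618, c2=1.49618, v_max=4.0):
    # Minimal standard PSO maximizing `fitness` over the box [lo, hi]^d
    rng = np.random.default_rng(0)
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    d = lo.size
    x = rng.uniform(lo, hi, (n_particles, d))         # current positions
    v = rng.uniform(-v_max, v_max, (n_particles, d))  # current velocities
    p = x.copy()                                      # personal best positions
    p_fit = np.array([fitness(xi) for xi in x])       # personal best fitness
    g = p[p_fit.argmax()].copy()                      # global best position
    for _ in range(iters):
        r1 = rng.random((n_particles, d))
        r2 = rng.random((n_particles, d))
        v = w * v + c1 * r1 * (p - x) + c2 * r2 * (g - x)
        v = np.clip(v, -v_max, v_max)                 # cap velocity at v_max
        x = np.clip(x + v, lo, hi)                    # keep particles in range
        fit = np.array([fitness(xi) for xi in x])
        better = fit > p_fit
        p[better], p_fit[better] = x[better], fit[better]
        g = p[p_fit.argmax()].copy()
    return g, p_fit.max()
```

In our setting, a particle's position would hold the pair (log2 C, log2 γ) and fitness() would return the 5-fold CV accuracy of the corresponding SVM.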

2.3 GA-Based Model Selection

Genetic Algorithms (GA) are among the most frequently used optimization techniques in many areas. This group belongs to the stochastic optimization algorithms, and their origin reaches back to the 1970s. The original approach, inspired by the laws of evolution, works in binary space and represents individuals as bit-strings. Later, a real-valued version was described, where an individual is represented as a real number or a group of numbers. Here we focus on the real-valued modification of GA, since it has been implemented in our experiments. This version of GA uses the same operations for offspring creation as the binary version: crossover and mutation. Crossover is mostly performed by averaging, where the offspring is created as the mean of the parents' values; another possibility is crossover by selection, where one of the two parents is picked at random and the offspring takes its value. Mutation is a process in which an individual is randomly changed. Usually, the crossover probability takes a value in the range 0.5 to 1 and the mutation probability is lower than 0.1. Offspring creation runs in generations consisting of the individuals, the same as in the original binary approach.
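A minimal sketch of these real-valued operators (our own illustration; the probabilities 0.85 and 0.05 are those reported in Section 3):

```python
import numpy as np

rng = np.random.default_rng(0)

def crossover(pa, pb, p_cross=0.85):
    # Averaging crossover: the offspring is the mean of the two parents;
    # with probability 1 - p_cross the first parent is copied unchanged
    if rng.random() < p_cross:
        return (pa + pb) / 2.0
    return pa.copy()

def crossover_by_selection(pa, pb):
    # Alternative: pick one of the two parents at random as the offspring
    return (pa if rng.random() < 0.5 else pb).copy()

def mutate(child, lo, hi, p_mut=0.05):
    # Random mutation: each gene is replaced, with probability p_mut,
    # by a uniform draw from its allowed range
    child = child.copy()
    mask = rng.random(child.shape) < p_mut
    child[mask] = rng.uniform(lo, hi, child.shape)[mask]
    return child
```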

3. Experiments

In this paper, we compare three different techniques for model parameter selection: grid search, PSO and GA. For this purpose, each dataset was partitioned according to a 5-fold CV scheme, producing 5 different separate training/testing parts. SVM models have been constructed (using the LIBSVM library [5]) on the training data, and their performance has been evaluated on the test data.
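The protocol can be sketched as follows (our own illustration using scikit-learn, whose SVC class wraps LIBSVM; X and y are placeholders):

```python
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def cv_accuracies(X, y, C, gamma, folds=5):
    # Train on 4 parts, test on the held-out 5th, for each of the 5 splits
    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=0)
    accs = []
    for train_idx, test_idx in skf.split(X, y):
        model = SVC(C=C, gamma=gamma, kernel="rbf").fit(X[train_idx], y[train_idx])
        accs.append(model.score(X[test_idx], y[test_idx]))
    return accs
```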

In the evaluation process, two datasets have been used: 1) the Letter recognition dataset [10], consisting of a database of character images; 2) a speech dataset for the speaker recognition task, where the speaker utterances have been picked from the MobilDat-SK database [11, 13]. The dataset descriptions are summarized in Table 1.

Dataset | Features no. | Classes no. | Training set size | Testing set size
Letter  | 16           | 26          | 10500             | 5000
Speaker | 22           | 20          | 14091             | 7240

Tab. 1. Dataset description.

The feature computation procedure is as follows. From the speaker utterances, 22 Mel Frequency Cepstral Coefficients (MFCC) have been generated as the speech features. Silent frames in each speaker utterance were dropped by using short-time energy thresholding and Gaussian mixture modeling (GMM). After the extraction process, every attribute of the data was scaled to the range [-1, 1]. We used the same method to scale both the training and test data. The main advantage of scaling is to avoid attributes in greater numeric ranges dominating those in smaller numeric ranges. Another advantage is to avoid numerical difficulties during the calculation [12].
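A minimal sketch of this scaling step, assuming the per-attribute ranges are estimated on the training set and then reused unchanged for the test set:

```python
import numpy as np

def fit_scaler(X_train):
    # Per-attribute minima and maxima, computed on the training data only
    return X_train.min(axis=0), X_train.max(axis=0)

def scale(X, x_min, x_max):
    # Linear map of each attribute to [-1, 1]; the same (x_min, x_max)
    # learned from the training set must be applied to the test set [12]
    return 2.0 * (X - x_min) / (x_max - x_min) - 1.0
```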

The grid search model selection has been performed on the Letter dataset over the parameter ranges $C \in \{2^{-5}, 2^{-4.9}, \dots, 2^{19.9}, 2^{20}\}$ and $\gamma \in \{2^{-20}, 2^{-19.9}, \dots, 2^{4.9}, 2^{5}\}$ (Fig. 1). The best model parameters $C = 2^{4.8}$ and $\gamma = 2^{1}$ have been found. These parameters achieve the highest 5-fold CV accuracy of 96.619 % and satisfy the condition of the maximum value of $C$ and $\gamma$. Other parameters with equivalent 5-fold CV accuracy are listed in Table 2.

Fig. 1. 5-fold CV accuracy contour over the parameters for the Letter dataset.

log2 C | 2.9 | 3 | 3.1 | 4.1 | 4.2 | 4.3 | 4.4 | 4.5 | 4.7 | 4.8
log2 γ | 1   | 1 | 0.9 | 1   | 1   | 1   | 1   | 1   | 1   | 1

Tab. 2. Parameters with the highest 5-fold CV accuracy.

Model selection using the standard version of PSO has also been employed. Each value of a particle was rounded to one decimal place, and the parameter range was the same as in the grid search model selection. The PSO parameters used in the algorithm were as follows: 21 individuals, inertia factor w = 0.729844, learning coefficients c1 = c2 = 1.49618, maximum velocity v = 4. PSO searched for the best model parameters of the SVM in 20 iterations.
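Hypothetical glue code connecting the pso() sketch from Section 2.2 to the SVM fitness; X, y are placeholder arrays, and the rounding behavior is our reading of the description above:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Positions hold (log2 C, log2 gamma), rounded to one decimal place;
# fitness is the mean 5-fold CV accuracy of the corresponding SVM
fitness = lambda z: cross_val_score(
    SVC(C=2.0 ** round(z[0], 1), gamma=2.0 ** round(z[1], 1), kernel="rbf"),
    X, y, cv=5).mean()

best, acc = pso(fitness, lo=[-5.0, -20.0], hi=[20.0, 5.0],
                n_particles=21, iters=20)
```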

The GA parameters used in the algorithm were as follows: 21 individuals, crossover probability equal to 0.85, probability of mutation equal to 0.05. GA searched for the best model parameters of the SVM in 20 iterations.

Method           | Grid search | PSO  | GA
log2 C           | 4.8         | 3.8  | 4.2
log2 γ           | 1           | 1    | 1
5-f. CV Acc. [%] | 96.619      | 96.6 | 96.619
time [hour]      | 422         | 3.6  | 5.2

Tab. 3. Model selection results – dataset Letter.

Table 3 presents the results of the model parameter selection for the Letter dataset: the best parameters, the 5-fold CV accuracy, and the time needed for the parameter search. The classification performance results, the F1 measure (median/average value) and accuracy (median/average value), are shown in Table 4.

On the Speaker dataset, both the PSO- and GA-based searches have been applied. Grid search model selection is a time-consuming technique and therefore was not performed. The results of the PSO and GA model parameter selection are shown in Table 5.

The classification results are interpreted on both the frame and segment levels. Frames of 30 ms length have been extracted from the speaker utterances. Segment classification was conducted on 300 ms segments consisting of 10 frames. The classification results are shown in Table 6.
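The paper does not state how the 10 frame decisions are combined into one segment decision; a plausible sketch, assuming a simple majority vote over the frame labels, is:

```python
from collections import Counter

def segment_labels(frame_labels, frames_per_segment=10):
    # Majority vote over each consecutive block of 10 frame decisions
    # (10 x 30 ms = 300 ms per segment)
    n_segments = len(frame_labels) // frames_per_segment
    return [Counter(frame_labels[i * frames_per_segment:
                                 (i + 1) * frames_per_segment]).most_common(1)[0][0]
            for i in range(n_segments)]
```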

Method             | Grid search | PSO    | GA
Accuracy – median  | 0.9818      | 0.9818 | 0.9818
Accuracy – average | 0.9722      | 0.9716 | 0.9718
F1 – median        | 0.9765      | 0.9740 | 0.9740
F1 – average       | 0.9721      | 0.9714 | 0.9716

Tab. 4. Classification results – dataset Letter.

Method           | PSO     | GA
log2 C           | 15.2    | 9.8
log2 γ           | 0.8     | -4.3
5-f. CV Acc. [%] | 87.6233 | 78.8659
time [hour]      | 5.7     | 29.2

Tab. 5. Model selection frame-level results – dataset Speaker.

Method             | PSO frame | PSO segment | GA frame | GA segment
Accuracy – median  | 0.4820    | 0.8819      | 0.4740   | 0.8888
Accuracy – average | 0.4577    | 0.8046      | 0.4742   | 0.8218
F1 – median        | 0.4273    | 0.8167      | 0.4617   | 0.8211
F1 – average       | 0.4561    | 0.7951      | 0.4729   | 0.8077

Tab. 6. Classification results – dataset Speaker.

4. Conclusion

In this paper, we described three different model parameter selection techniques for SVM with the RBF kernel. The basic method for this task is grid search, which is naive and time consuming. From this point of view, it is important to use an optimization method that reduces the number of CV steps required to set the SVM parameters. The model selection techniques using either GA or PSO prove that they can be used for parameter selection and solve the problem much faster than grid search. Both approaches give comparable results (Tab. 3, 4). Another advantage of the PSO and GA algorithms is the ability to optimize more than two parameters.

It emerges from the results for the Letter dataset that the GA- and PSO-based approaches achieve classification performance as good as the grid search approach. The classification experiment on the Speaker dataset (Tab. 5, 6) shows that PSO is able to find "good" parameters and achieve comparable classification precision five times faster than GA. Note that the classification accuracy for the Speaker dataset is worse due to the more difficult recognition task, since the speaker utterances have been recorded in a noisy environment over a public telecommunication network. In addition, the acoustic properties may be too ambiguous to distinguish speakers in some cases.

Acknowledgements

This work has been partly supported by the Slovak Scientific Grant Agency, Project No. 1/0655/10.

References

[1] VAPNIK, V. Statistical Learning Theory. New York: Wiley, 1998.

[2] CHERKASSKY, V., MULIER, F. M. Learning from Data: Concepts, Theory, and Methods. 2nd ed. New Jersey: Wiley, 2007.

[3] DE SOUZA, B. F., DE CARVALHO, A. C. P. L. F., CALVO, R., ISHII, R. P. Multiclass SVM model selection using particle swarm optimization. In Sixth International Conference on Hybrid Intelligent Systems. Rio de Janeiro (Brazil), 2006, p. 31.

[4] ABE, S. Support Vector Machines for Pattern Classification, Advances in Pattern Recognition. London: Springer, 2005.

[5] CHANG, C. C., LIN, C. J. LIBSVM: A Library for Support Vector Machines. 34 pages. [Online] Cited 2011-02-7. Available at: http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf

[6] JUNLI, C., LICHENG, J. Classification mechanism of support vector machines. In 5th International Conference on Signal Processing Proceedings. Beijing (China), 2000, vol. 3, p. 1556-1559.

[7] BLONDIN, J., SAAD, A. Metaheuristic techniques for Support Vector Machines model selection. In 2010 10th International Conference on Hybrid Intelligent Systems. Atlanta, 2010, p. 197-200.

[8] ZELINKA, I., OPLATKOVÁ, Z., ŠEDA, M., OŠMERA, P., VČELAŘ, F. Evoluční výpočetní techniky: principy a aplikace (Evolutionary Computation Techniques: Principles and Applications). Prague: BEN, 2009.

[9] STAELIN, C. Parameter selection for Support Vector Machines (technical report). 4 pages. [Online] Cited 2011-02-7. Available at: http://www.hpl.hp.com/techreports/2002/HPL-2002-354R1.pdf

[10] SLATE, D. J. Odesta Corporation, Letter Image recognition data set. [Online] Cited 2011-02-7. Available at: http://archive.ics.uci.edu/ml/machine-learning-databases/letter-recognition/

[11] RUSKO, M., TRNKA, M., DARJAA, S. MobilDat-SK: a mobile telephone extension to the SpeechDat-E SK telephone speech database in Slovak. In Proceedings of the 11th International Conference Speech and Computer (SPECOM'2006). St. Petersburg, 2006, p. 449-454.

[12] HSU, C.-W., CHANG, C.-C., LIN, C.-J. A practical guide to support vector classification. Technical report. [Online] Cited 2011-02-7. Available at: http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf

[13] JUHÁR, J., ONDÁŠ, S., ČIŽMÁR, A., JARINA, R., RUSKO, M., ROZINAJ, G. Development of Slovak GALAXY / VoiceXML based spoken language dialogue system to retrieve information from the internet. In Proceedings of Interspeech 2006 - ICSLP. Pittsburgh (USA), 2006, p. 485-488.