
A Max Modular Support Vector Machine and Its Variations for Pattern Classification

Hui-Cheng Lian
School of Computer Engineering and Science, Shanghai University
P.O. Box 147, No. 149 Yanchang Road, Shanghai, 200072, China
Email: [email protected]

Abstract— In this paper, we propose a max modular support vector machine (M2-SVM) and its two variations for pattern classification. The basic idea behind these methods is to decompose the training samples of one class into several parts and learn each part with one modular classifier independently. To implement these methods, a 'part-against-others' training strategy and a max modular combination principle are proposed. A down-sampling technique is also employed to improve training without losing any information from the positive samples. Finally, a variation called the feature-constructed M2-SVM (FM2-SVM) is proposed, which selects features from each subset for classification in one module. Experimental results show that FM2-SVM can not only improve classification precision but also reduce the dimension of the input features. The proposed methods are shown to outperform traditional SVMs as well as M3-SVM and KNN on artificial data, the UCI Forest Covertype data, the CAS-PEAL face database and the AR face database.

I. INTRODUCTION

A modular classifier can be defined as a classifier whose computation can be decomposed into more than one module, each operating on its inputs without communicating with the others [1][2]. From this viewpoint, the disadvantages of current SVM methods can be summarized as follows:

• When trained on very large data sets, SVMs tend to be hugely time-consuming. Many optimization methods have been proposed to improve the training process, such as Chunking, SMO and SVMlight [3][4][5], but these methods are said to be limited [1][2]. Moreover, they cannot run on parallel machines or be implemented on massively parallel hardware, whereas modular approaches [1][2] have this advantage.

• Problems such as multi-view face detection [6] and multi-view gender classification [7][8] have attracted much attention. The challenge in this kind of problem is that several sub-patterns may exist in one class. For example, in a gender classification problem, the male or female samples may contain both 0 and 30 degree faces. These patterns (0 and 30 degree faces) inside one class (male or female) are called sub-patterns for gender classification. Such sub-patterns may differ largely within one class, so it is better to train different sub-patterns with different modules [8] than to train all sub-patterns with one whole module.

• Learning classifiers from imbalanced data is another important and practical problem [9]. The natural solutions are 'down-sampling' instances from the majority class or 'up-sampling' instances from the minority class. Obviously, these either lose information or introduce noise [9]. However, a modular classifier can balance the problem by decomposing a small-scale class into fewer modules and a huge-scale class into more modules.

In this paper, we propose a max modular support vector machine (called M2-SVM) to deal with these kinds of problems. The basic idea is to decompose the samples of one class into several subsets and then train on them separately; the outputs of the modules are combined to obtain the final decision. Under this framework, the sub-patterns can be learned independently, so large-scale data sets can be processed in a parallel way. Another important point is that the framework allows us to split a larger class into more parts and a smaller class into fewer parts, so imbalanced classification problems can be handled well. Variations of M2-SVM are also introduced. For convenience, we first describe the max modular network (M2-network) before introducing M2-SVM.

II. MAX MODULAR NETWORK

A. Notations

Let T be the training set for a K-class classification problem, where the labels of the K classes are represented by 1, 2, ..., K, respectively:

T = \{(X_l, y_l)\}_{l=1}^{L}    (1)

where X_l \in \mathbb{R}^d is the input vector, y_l \in Y is the label of X_l, Y = \{1, 2, \ldots, K\} is the label set of all samples, and L is the number of training samples.

Suppose the K training input sets X_1, X_2, \ldots, X_K are expressed as

X_i = \{X_l^{(i)}\}_{l=1}^{L_i}, \quad i = 1, \ldots, K    (2)

where L_i is the number of training inputs in class C_i, X_l^{(i)} is the l-th sample belonging to class C_i, all X_l^{(i)} \in X_i share the same class label, and \sum_{i=1}^{K} L_i = L.

Let \mathcal{X} be the union of X_1, X_2, \ldots, X_K and \overline{X}_i be the complement of X_i. Then they can be written as


\mathcal{X} = \bigcup_{i=1}^{K} X_i \quad \text{and} \quad \overline{X}_i = \bigcup_{j=1, j \neq i}^{K} X_j    (3)

We suggest that the X_i defined by (2) can be further decomposed into as many subsets as needed, according to requirements. Assume that the input set X_i defined by (2) is further partitioned into N_i (1 \leq N_i \leq L_i) subsets of the form

X_{ij} = \{X_l^{(ij)}\}_{l=1}^{L_i^{(j)}}, \quad i = 1, \ldots, K, \; j = 1, \ldots, N_i    (4)

where L_i^{(j)} is the number of training inputs included in X_{ij} and \bigcup_{j=1}^{N_i} X_{ij} = X_i. A max modular network can then be constructed from these subsets as follows.

B. Max Modular Network

A traditional one-against-others solution to the K-class classification problem [10] is to construct K modules M_i (i = 1, \ldots, K), where each module is a component classifier trained on X_i and \overline{X}_i. We propose instead to further decompose each binary classifier into a series of smaller modules. Following the notation of the previous section, the training of the modules is expressed as

\{\{X_{ij}, +1\}, \{\overline{X}_i, -1\}\} \stackrel{\text{train}}{\Longrightarrow} M_{ij}, \quad i = 1, \ldots, K, \; j = 1, \ldots, N_i    (5)

where M_{ij} denotes a module trained on X_{ij} and \overline{X}_i. The max modular combination principle of the M2-network is 'Winner-Take-All', and the outputs of the network are mapped to their associated class labels. This can be formally expressed as

y = C\left[\arg\max_i h_i(x)\right]    (6)

where C is a function that maps a module's label to the associated class label, and h_i(x) is the discriminant function of the i-th module. We call this training strategy 'part-against-others'.

The main differences between the M2-network and the M3-network proposed in [2] are as follows: (1) the M2-network uses only a MAX unit, while the M3-network uses MIN, MAX, and/or INV units; consequently, the number of modules in the M2-network is K \cdot N, while that of the M3-network is K(K-1) \cdot N^2 / 2, assuming each class has N modules. (2) The M2-network uses a 'part-against-others' training strategy, while the M3-network uses a 'part-against-part' strategy. (3) The M3-network is 'biased to its negative class' (see page 23 of [11]), while the M2-network does not have this problem. (4) The M2-network is suitable for selecting features for each sub-pattern, while the M3-network does not have this ability.

III. MAX MODULAR SUPPORT VECTOR MACHINE AND ITS VARIATIONS

As important descendants of the M2-network, the M2-SVM and its variations are proposed in this section. A down-sampling technique is introduced to balance the scales of the positive and negative samples of M2-SVM, and the feature-constructed M2-SVM is proposed to learn and construct sub-patterns from the subsets of training samples.

A. Max Modular Support Vector Machine

Consider the positive training sample sets X_{ij} and negative training sample sets \overline{X}_i described in (4) and (3), where i = 1, \ldots, K and j = 1, \ldots, N_i, and let L_{ij} = L_i^{(j)} + \sum_{k=1, k \neq i}^{K} L_k. The j-th SVM of the i-th class is trained with all examples in X_{ij} labeled positive and all examples in \overline{X}_i labeled negative, and independently solves the following problem:

\min_{w^{i,j}, b^{i,j}, \xi^{i,j}} \; \frac{1}{2} (w^{i,j})^T w^{i,j} + C \sum_{l=1}^{L_{ij}} \xi_l^{i,j}

\text{s.t.} \quad (w^{i,j})^T \phi(x_l) + b^{i,j} \geq 1 - \xi_l^{i,j}, \; \text{if } x_l \in X_{ij}

(w^{i,j})^T \phi(x_l) + b^{i,j} \leq -1 + \xi_l^{i,j}, \; \text{if } x_l \in \overline{X}_i

\xi_l^{i,j} \geq 0, \quad l = 1, \ldots, L_{ij}    (7)

where the training data x_l are mapped to a higher-dimensional space by the function \phi, and C is the penalty parameter.

After solving (7), there are \sum_{i=1}^{K} N_i decision functions, and we say x is in the class decided by

\text{class of } x \equiv C\left[\arg\max_{(i,j)} \left((w^{i,j})^T \phi(x) + b^{i,j}\right)\right].    (8)

B. Down-Sampling for Max Modular SVM

From (7) we can see that the samples in X_{ij} and \overline{X}_i are largely imbalanced when training one SVM. Formally, for a K-class problem, the sample number of \overline{X}_i will be (K-1) \cdot N times that of X_{ij}. This imbalance between X_{ij} and \overline{X}_i causes inefficient or even failed learning. A simple but efficient remedy is to down-sample \overline{X}_i, with the sampling number tied to the scale of X_{ij}. In this paper, when the negative samples are too numerous to be learned, a negative set five times the size of the positive set X_{ij} is randomly sampled from \overline{X}_i. The advantage of the down-sampling M2-SVM (DM2-SVM) is that it can largely reduce training time while providing a balanced condition for learning. This is quite important for large-scale real-world problems, where classifiers usually need to be learned efficiently.

Although the down-sampling technique discards a large part of the samples in \overline{X}_i, this does not mean that the discarded samples are never learned. They are in fact treated as positive samples in other modules, whose corresponding modular classifiers learn them. Consequently, the down-sampling technique can balance the training samples for M2-SVM without losing any positive information. This is the essential difference between DM2-SVM and the other down-sampling methods mentioned in Section I.
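A minimal sketch of this negative down-sampling step, assuming NumPy arrays and treating the 5:1 ratio described above as a parameter; the function name is illustrative:

```python
import numpy as np

def downsample_negatives(X_pos, X_neg, ratio=5, seed=None):
    """DM2-SVM step: when the negatives are too numerous, keep a random
    negative subset about `ratio` times the size of the positive set."""
    rng = np.random.default_rng(seed)
    n_keep = min(len(X_neg), ratio * len(X_pos))
    idx = rng.choice(len(X_neg), size=n_keep, replace=False)
    return X_neg[idx]
```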

C. Feature-Constructed Max Modular SVM

Fig. 1. The feature-constructed M2-SVM.

The features we can remember most clearly are usually the outstanding features of an object. For example, a scar on someone's face may be the first feature of that person that comes to mind. Based on this idea, we select different features for different modules and thereby propose a feature-constructed M2-SVM (called FM2-SVM) for classification. According to the 'part-against-others' principle, any binary-class feature selection method can be utilized to select a set of features \mathcal{F}_{i,j} for module M_{ij} of FM2-SVM. In this paper, we use the feature selection method proposed in [12]. We define F_{ij} to be an operator that selects the feature set \mathcal{F}_{i,j} from \mathcal{X}, as follows:

\mathcal{X}' = F_{ij}(\mathcal{X})    (9)

In the training phase, module M_{ij} of FM2-SVM is trained on F_{ij}(X_{ij}) and F_{ij}(\overline{X}_i). In the test phase, a sample x is transformed by F_{ij} before it is input into module M_{ij} of FM2-SVM.

Fig. 1 shows the structure of FM2-SVM. Its advantage is that the important features of each sub-pattern can be selected while the input dimension is reduced. This is quite important for improving the precision of real-world classification problems, especially problems containing largely different sub-patterns within one class, such as gender classification across different poses or age estimation across different genders.
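Since [12] is the SVM-RFE method, one plausible realization of the operator F_{ij} is scikit-learn's RFE wrapper around a linear SVM, sketched below under that assumption; the helper name is our own:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

def build_feature_operator(X_ij, X_bar_i, n_features=4000):
    """Fit a binary SVM-RFE selector on {X_ij, +1} vs {X_bar_i, -1};
    the fitted selector plays the role of the operator F_ij in eq. (9)."""
    X = np.vstack([X_ij, X_bar_i])
    y = np.hstack([np.ones(len(X_ij)), -np.ones(len(X_bar_i))])
    selector = RFE(LinearSVC(), n_features_to_select=n_features, step=0.1)
    selector.fit(X, y)
    return selector   # use selector.transform(x) before feeding module M_ij
```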

IV. EXPERIMENTS AND DISCUSSIONS

Three different experiments are conducted to evaluate the performance of our methods. LIBSVM [13] is utilized for all M2-SVMs, and all experiments run on a 2.8 GHz P4 PC with MATLAB 7.0.

A. Artificial Imbalanced Data Sets

The artificial imbalanced sets are generated from six 2D Gaussian distributions with the same \sigma = 0.1. Class I contains three Gaussian parts with means \mu_{11} = (0.4, 0.9), \mu_{12} = (0.1, 0.5) and \mu_{13} = (0.5, 0.0), each with 40 samples, while Class II contains three Gaussian parts with \mu_{21} = (0.7, 0.9), \mu_{22} = (0.4, 0.5) and \mu_{23} = (0.9, 0.0), each with 2,000 samples. For M2-SVM, Class I is decomposed into 3 subsets and Class II is decomposed into 150 subsets, in order, for learning. The discriminant borders learned by SVM and M2-SVM are shown in Fig. 2.
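The following sketch reproduces this data-generation recipe with NumPy; the random seed is arbitrary, and the class labels 0/1 stand for Class I and Class II:

```python
import numpy as np

rng = np.random.default_rng(0)                    # arbitrary seed
sigma = 0.1
means_1 = [(0.4, 0.9), (0.1, 0.5), (0.5, 0.0)]    # Class I: 40 samples each
means_2 = [(0.7, 0.9), (0.4, 0.5), (0.9, 0.0)]    # Class II: 2000 samples each

X1 = np.vstack([rng.normal(m, sigma, size=(40, 2)) for m in means_1])
X2 = np.vstack([rng.normal(m, sigma, size=(2000, 2)) for m in means_2])
X = np.vstack([X1, X2])                           # 120 + 6000 points
y = np.hstack([np.zeros(len(X1)), np.ones(len(X2))])
```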

From (a)-(d) of Fig. 2, we can see that the SVMs can hardly classify Class I correctly: the discriminant borders do not lie between Class I and Class II. The reason is that Class I and Class II are largely imbalanced, making it difficult for SVMs to converge to borders near Class II. This shows that SVMs cannot perform well on largely imbalanced data. However, as can be seen from (e)-(h) of Fig. 2, M2-SVM deals with this imbalanced data problem much better than the SVMs: the discriminant borders are generated between Class I and Class II. The reason M2-SVM can classify Class I and Class II correctly is that the large class (i.e. Class II) is split into many small subsets, so that the subsets of the two classes can be learned consistently to obtain a good result.

B. UCI Forest Covertype Data

The UCI Forest Covertype dataset contains 581,012 samples with seven classes and 54 attributes. However, we use only the last four classes, with 50,117 samples, in the experiment (see Table I). The first 1,000, 4,000, 8,000 and 10,000 samples of each class are used as training data, and the remaining data are used as test data. Note that we scale all training data to lie in [-1, 1]; the test data are then adjusted to [-1, 1] accordingly. The Equal Clustering method proposed in [13] is used to decompose the training data of the four classes into 1, 2, 4 and 5 subsets, respectively.

We estimate the generalized accuracy using the RBF kernel with different kernel parameters \gamma and cost parameters C: \gamma \in \{2^{-10}, 2^{-9}, \ldots, 2^4\} and C \in \{2^{-2}, 2^{-1}, \ldots, 2^{12}\}; we therefore try 225 combinations. We report the optimal parameters (\gamma, C) and the corresponding accuracy rates. Table II compares the four methods. Parallel execution of the modular methods is simulated by recording the maximal training time, maximal test time and maximal number of SVs over all independent modules. In this experiment, M2-SVM obtained the highest rate, and DM2-SVM obtained a comparable result even though it has fewer support vectors than M2-SVM. Note that if the data sets were processed in a truly parallel way, the CPU times of the modular methods could be reduced significantly.
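A sketch of this grid search with scikit-learn is shown below. Note that the paper selects the optimal (\gamma, C) by the reported accuracy, whereas this sketch uses cross-validation on the training set, which is an assumption on our part:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"gamma": [2.0 ** e for e in range(-10, 5)],  # 2^-10 .. 2^4
              "C":     [2.0 ** e for e in range(-2, 13)]}  # 2^-2  .. 2^12
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3, n_jobs=-1)
# search.fit(X_train, y_train)
# search.best_params_ then holds the selected (gamma, C) pair
```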

C. CAS-PEAL and AR Gender Databases with Various Poses and Expressions

A set of 4,792 images from the CAS-PEAL database [14], including poses at 0, 15 and 30 degrees, is used for gender classification with various poses.



Fig. 2. Discriminant borders learned by SVM and M2-SVM on the artificial imbalanced data sets. The kernel is RBF; (a), (b), (c) and (d) are the results of SVM with C = 1, 10, 100, 1000, and (e), (f), (g) and (h) are the results of M2-SVM with C = 1, 10, 100, 1000.

TABLE I
SAMPLE NUMBERS OF TRAINING AND TEST DATA FROM UCI FOREST COVERTYPE DATA.

         class 1   class 2   class 3   class 4
Train     1,000     4,000     8,000    10,000
Test      1,747     5,493     9,367    10,510
Total     2,747     9,493    17,367    20,510

TABLE II
NUMBER OF MODULES, TRAINING TIME, TEST TIME, NUMBER OF SUPPORT VECTORS AND ACCURACY.

                          SVM         M3-SVM      M2-SVM      DM2-SVM
Module number             -           49          12          12
Training time   Max       -           2.38        18.22       10.00
(sec.)          Total     21.28       17.66       75.36       36.94
Test time       Max       -           1.16e-4     2.59e-4     1.79e-4
(sec./sample)   Total     6.98e-4     9.93e-4     9.57e-4     7.16e-4
SV number       Max       -           342         582         504
                Total     1,079       2,323       2,591       2,088
Accuracy (%)              97.25       97.27       97.38       97.37
(\gamma, C)               (2^2, 2^3)  (2^0, 2^6)  (2^2, 2^3)  (2^2, 2^3)

A set of 735 images from the AR database [15], including neutral, smile and scream expressions, is used for gender classification with various expressions. All images are preprocessed into 75×65 gray-scale faces by the method described in [7]. M2-SVM and FM2-SVM are investigated on these two problems in comparison with SVMs, M3-SVM [13] and KNN. The pose information is used to decompose the training data of the two classes, i.e. female and male, into three subsets each for the PEAL database, and the expression information is used to decompose the training data of the two classes into three subsets each for the AR database. Note that during testing, the prior information of the test samples is unknown. Table III shows the numbers of training and test data for the PEAL and AR databases, where 200 samples from each subset of the PEAL database and 40 samples from each subset of the AR database are randomly taken out for training, and the remaining samples are kept for testing. We run ten random cross-validation rounds to estimate the generalized accuracies of the five methods with three different kernels and cost parameter C = 1.
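A minimal sketch of one such random round, assuming each class is given as a list of per-pose (or per-expression) NumPy arrays; the function name and argument layout are illustrative:

```python
import numpy as np

def random_round_split(subsets, n_train, seed=None):
    """One random cross-validation round: draw n_train samples from each
    pose/expression subset for training; the rest go to the test set."""
    rng = np.random.default_rng(seed)
    train, test = [], []
    for X_sub in subsets:                 # e.g. the three subsets of a class
        idx = rng.permutation(len(X_sub))
        train.append(X_sub[idx[:n_train]])
        test.append(X_sub[idx[n_train:]])
    return np.vstack(train), np.vstack(test)
```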

For FM2-SVM, 4,000 features are selected for the male and female classes at 0, 15 and 30 degrees (PEAL gender classification) and for the male and female classes with neutral, smile and scream expressions (AR gender classification). The selected features are shown in Figs. 3 and 4, where each row shows both the 3,000 and the 4,000 selected features plotted as black pixels. For clearer illustration, faces overlaid with the selected features are also shown. The effectiveness of FM2-SVM is reflected in the fact that the original poses or expressions can be identified easily from the plotted black pixels.



TABLE III
NUMBERS OF TRAINING AND TEST DATA OF PEAL AND AR DATABASES.

               PM00    PM15    PM30    Total   Training   Test
PEAL  Female    445     844     844    2,133    200×3     1,533
      Male      595   1,032   1,032    2,659    200×3     2,059

              Neutral  Smile  Scream   Total   Training   Test
AR    Female    140     140     140      420     40×3       300
      Male      105     105     105      315     40×3       195

Fig. 3. The 3000 and 4000 selected features (black points) for PEAL gender classification: (a), (c), (e) for female at 0, 15 and 30 degrees and (b), (d), (f) for male at 0, 15 and 30 degrees, respectively.

Fig. 4. The 3000 and 4000 selected features for AR gender classification: (a), (c), (e) for female with neutral, smile and scream expressions and (b), (d), (f) for male with neutral, smile and scream expressions, respectively.


Table IV compares M2-SVM and FM2-SVM with SVM, M3-SVM and KNN on PEAL and AR gender classification over ten random cross-validation rounds. The accuracies of M2-SVM and FM2-SVM are clearly higher than those of the SVMs as well as M3-SVM and KNN. The reason M2-SVM and FM2-SVM outperform SVMs in our experiments is as follows: the largely different subsets within one class, such as different poses and different expressions, are separated from each other before being input into the modular learning of M2-SVM and FM2-SVM. This means that each sub-pattern can be learned independently and sufficiently by a single classifier. Finally, the max modular combination principle of the network guarantees that the highest similarity score over all learned modules is obtained.



TABLE IV
COMPARISONS OF M2-SVM AND FM2-SVM WITH SVM, M3-SVM AND KNN ON PEAL AND AR GENDER CLASSIFICATION WITH TEN RANDOM CROSS-VALIDATION ROUNDS. BRACKETS DENOTE STANDARD DEVIATIONS.

Method            SVM           M3-SVM        M2-SVM        FM2-SVM       KNN
Dimension         4875          4875          4875          4000          4875
PEAL  linear      91.91 (0.26)  92.65 (0.78)  93.64 (0.30)  93.24 (0.52)  77.49 (0.60)
      polynomial  78.94 (1.70)  80.54 (0.65)  87.33 (0.73)  87.32 (1.15)
      RBF         90.39 (0.56)  89.05 (0.14)  91.30 (0.38)  91.30 (0.57)
AR    linear      89.76 (1.11)  88.73 (0.76)  91.85 (1.71)  91.85 (1.60)  83.45 (2.67)
      polynomial  79.66 (2.48)  83.37 (0.84)  86.46 (1.57)  87.07 (2.17)
      RBF         87.14 (1.41)  84.65 (2.05)  86.26 (1.39)  87.73 (1.72)

The table also shows that FM2-SVM outperforms, or performs almost as well as, M2-SVM with only 4,000 features, which indicates the efficiency of FM2-SVM.

M3-SVM does not appear to outperform M2-SVM or FM2-SVM in these experiments, apart from its massively parallel learning ability. It can also be seen that M3-SVM is better than the SVMs only in some experiments and worse in others. This may be explained by the drawback of the M3-network described in [11], namely that it is 'biased to its negative class'. Another important reason is that M3-SVM usually performs well on large-scale data sets but not so well on comparatively small data sets, for which plain SVMs are usually sufficient.

V. CONCLUSIONS

A fundamental way to handle massive multi-class classification problems is the divide-and-conquer technique, and the M2-SVM presented in this paper implements this paradigm. The down-sampling technique has been proposed to improve the training process without losing any positive information, and the feature-constructed M2-SVM has been proposed to select features for each largely different subset. FM2-SVM can not only improve classification performance but also reduce the feature dimension. The experimental results show the effectiveness and efficiency of M2-SVM, DM2-SVM and FM2-SVM over traditional SVMs, as well as M3-SVM and KNN, on artificial data, the UCI Forest Covertype data, the CAS-PEAL database and the AR database.

ACKNOWLEDGMENT

This work is supported by the Shanghai Leading Academic Discipline Project (Project Number J50103) and by the Excellent Young Teacher Foundation of Shanghai under grant shu08068.

REFERENCES

[1] M. Sarkar, "Modular pattern classifiers: a brief survey", IEEE International Conference on Systems, Man, and Cybernetics, vol. 4, pp. 2878-2883, 2000.

[2] B. L. Lu and M. Ito, "Task Decomposition and Module Combination Based on Class Relations: a Modular Neural Network for Pattern Classification", IEEE Transactions on Neural Networks, vol. 10, pp. 1244-1256, 1999.

[3] V. Vapnik, "Statistical Learning Theory", John Wiley and Sons, New York, 1998.

[4] J. C. Platt, "Fast Training of Support Vector Machines Using Sequential Minimal Optimization", Advances in Kernel Methods: Support Vector Machines, B. Scholkopf, C. J. C. Burges, and A. Smola, eds., pp. 185-208, Cambridge, Mass.: MIT Press, 1998.

[5] T. Joachims, "Making Large-Scale Support Vector Machine Learning Practical", Advances in Kernel Methods: Support Vector Machines, B. Scholkopf, C. J. C. Burges, and A. Smola, eds., pp. 169-184, Cambridge, Mass.: MIT Press, 1998.

[6] C. Huang, H. Z. Ai, B. Wu and S. H. Lao, "Boosting Nested Cascade Detector for Multi-View Face Detection", Proceedings of the 17th International Conference on Pattern Recognition (ICPR'04), vol. 2, pp. 415-418, 2004.

[7] H. C. Lian and B. L. Lu, "Multi-view Gender Classification Using Local Binary Patterns and Support Vector Machines", The Third International Symposium on Neural Networks, LNCS 3972, Chengdu, pp. 202-209, 2006.

[8] H. C. Lian and B. L. Lu, "Gender Recognition Using a Min-Max Modular Support Vector Machine", ICNC'05-FSKD'05, pp. 438-441, 2005.

[9] A. Rehan, K. Stephen and J. Nathalie, "Applying Support Vector Machines to Imbalanced Datasets", ECML 2004, LNAI 3201, pp. 39-50, 2004.

[10] C. W. Hsu and C. J. Lin, "A Comparison of Methods for Multiclass Support Vector Machines", IEEE Transactions on Neural Networks, vol. 13, no. 2, pp. 415-425, 2002.

[11] H. Zhao, "A Study on Min-Max Modular Classifier", Ph.D. thesis, Shanghai Jiao Tong University, 2005.

[12] I. Guyon, J. Weston, S. Barnhill and V. Vapnik, "Gene selection for cancer classification using support vector machines", Machine Learning, vol. 46, pp. 389-422, 2002.

[13] Y. M. Wen, B. L. Lu and H. Zhao, "Equal Clustering Makes Min-Max Modular Support Vector Machines More Efficient", ICONIP 2005, Taipei, Taiwan, 2005.

[14] W. Gao, B. Cao, S. G. Shan, X. H. Zhang and D. L. Zhou, "The CAS-PEAL Large-Scale Chinese Face Database and Baseline Evaluations", technical report of JDL, available at http://www.jdl.ac.cn/~peal/peal_tr.pdf, 2004.

[15] A. M. Martinez and R. Benavente, "The AR Face Database", CVC Technical Report #24, Computer Vision Center (CVC), Barcelona, Spain, 1998.
