Research Article
Hindawi Publishing Corporation, Journal of Applied Mathematics, Volume 2014, Article ID 942520, 16 pages, http://dx.doi.org/10.1155/2014/942520

An Interior Point Method for $L_{1/2}$-SVM and Application to Feature Selection in Classification

Lan Yao,1 Xiongji Zhang,2 Dong-Hui Li,2 Feng Zeng,3 and Haowen Chen1

1 College of Mathematics and Econometrics, Hunan University, Changsha 410082, China
2 School of Mathematical Sciences, South China Normal University, Guangzhou 510631, China
3 School of Software, Central South University, Changsha 410083, China

Correspondence should be addressed to Dong-Hui Li; [email protected]

Received 4 November 2013; Revised 12 February 2014; Accepted 18 February 2014; Published 10 April 2014

Academic Editor: Frank Werner

Copyright © 2014 Lan Yao et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This paper studies feature selection for the support vector machine (SVM). By the use of the $L_{1/2}$ regularization technique, we propose a new model, the $L_{1/2}$-SVM. To solve this nonconvex and non-Lipschitz optimization problem, we first transform it into an equivalent constrained optimization model with a linear objective function and quadratic constraints and then develop an interior point algorithm. We establish the convergence of the proposed algorithm. Our experiments with artificial data and real data demonstrate that the $L_{1/2}$-SVM model works well and that the proposed algorithm is more effective than some popular methods in selecting relevant features and improving classification performance.

1. Introduction

Feature selection plays an important role in solving classification problems with high-dimensional features, such as text categorization [1, 2], gene expression array analysis [3–5], and combinatorial chemistry [6, 7]. The advantages of feature selection include the following: (i) ignoring noisy or irrelevant features prevents overfitting and improves generalization performance; (ii) a sparse classifier reduces the computational cost; (iii) a small set of important features is desirable for interpretability.

We address embedded feature selection methods in the context of linear support vector machines (SVMs). Existing feature selection methods embedded in SVMs fall into three approaches [8]. In the first approach, greedy search strategies are applied to iteratively add or remove features from the data. Guyon et al. [3] developed a recursive feature elimination (RFE) algorithm, which has shown good performance on gene selection for microarray data. Beginning with the full feature subset, SVM-RFE trains an SVM at each iteration and then eliminates the feature that decreases the margin the least. Rakotomamonjy et al. [9] extended this method by using other ranking criteria, including the radius-margin bound and the span estimate. The second approach is to optimize a scaling parameter vector $\sigma \in [0,1]^n$ that indicates the importance of each feature. Weston et al. [10] proposed an iterative method to optimize the scaling parameters by minimizing bounds on the leave-one-out error. Peleg and Meir [11] learned the scaling factors based on the global minimization of a data-dependent generalization error bound. The third category of approaches is to minimize the number of features by adding a sparsity term to the SVM formulation. Though the standard SVM based on $\|w\|_2$ can be solved easily by convex quadratic programming, its solution may not be a desirable sparse solution.

A popular way to deal with this problem is the $L_p$ regularization technique, which results in an $L_p$-SVM. It minimizes $\|w\|_p$ subject to some linear constraints, where $p \in [0, 1]$. When $p \in (0, 1]$,
$$\|w\|_p = \Bigl(\sum_{j=1}^{m} |w_j|^p\Bigr)^{1/p}. \tag{1}$$
When $p = 0$, $\|w\|_0 = \sum_{j=1}^{m} I(w_j \neq 0)$. The $L_0$-SVM can find the sparsest classifier by minimizing $\|w\|_0$, the number of nonzero elements in $w$. However, it is discrete and NP-hard. From a computational point of view, it is very difficult




to develop efficient numerical methods to solve the problem. A widely used technique for dealing with the $L_0$-SVM is to use a smoothing technique so that the discrete model is approached by a smooth problem [4, 12, 13]. However, as the function $\|w\|_0$ is not even continuous, it is not to be expected that a smoothing-technique-based method would work well. Chan et al. [14] explored a convex relaxation of the cardinality constraint and obtained a relaxed convex problem that is close to, but different from, the previous $L_0$-SVM. An alternative method is to minimize the convex envelope of $\|w\|_0$, such as the $L_1$-SVM. The $L_1$-SVM is a convex problem and can yield sparse solutions. It is equivalent to a linear program and hence can be solved efficiently. Indeed, the $L_1$ regularization has become quite popular in SVM [12, 15, 16] and is well known as the LASSO [17] in the statistics literature. However, the $L_1$ regularization problem often leads to suboptimal sparsity in practice [18]. In many cases, the solutions yielded by the $L_1$-SVM are less sparse than those of the $L_0$-SVM. The $L_p$ problem with $0 < p < 1$ can find sparser solutions than the $L_1$ problem, which has been evidenced in extensive computational studies [19–21]. It has become a popular strategy in sparse SVM [22–27].

In this paper, we focus on the $L_{1/2}$ regularization and propose a novel $L_{1/2}$-SVM. Recently, Xu et al. [28] justified that the sparsity-promoting ability of the $L_{1/2}$ problem is strongest among the $L_p$ minimization problems with $p \in [1/2, 1)$ and similar for $p \in (0, 1/2]$. So the $L_{1/2}$ problem can be taken as a representative of the $L_p$ ($0 < p < 1$) problems. However, as proved by Ge et al. [29], finding the global minimal value of the $L_{1/2}$ problem is still strongly NP-hard, but computing a local minimizer of the problem can be done in polynomial time. The contributions of this paper are twofold. One is to derive a smooth constrained optimization reformulation of the $L_{1/2}$-SVM. The objective function of the problem is linear, and the constraints are quadratic and linear. We establish the equivalence between the constrained problem and the $L_{1/2}$-SVM and show the existence of KKT points of the constrained problem. Our second contribution is to develop an interior point method to solve the constrained optimization reformulation and to establish its global convergence. We also test and verify the effectiveness of the proposed method using artificial data and real data.

The rest of this paper is organized as follows. In Section 2, we first briefly introduce the model of the standard SVM ($L_2$-SVM) and the sparse regularization SVMs. We then reformulate the $L_{1/2}$-SVM into a smooth constrained optimization problem. We propose an interior point method to solve the constrained optimization reformulation and establish its global convergence in Section 3. In Section 4, we report numerical experiments that test the proposed method. Section 5 gives the concluding remarks.

2. A Smooth Constrained Optimization Reformulation of the $L_{1/2}$-SVM

In this section, after briefly reviewing the model of the standard SVM ($L_2$-SVM) and the sparse regularization SVMs, we derive an equivalent smooth optimization problem for the $L_{1/2}$-SVM model. The smooth optimization problem minimizes a linear function subject to simple quadratic constraints and linear constraints.

2.1. Standard SVM. In a two-class classification problem, we are given a training data set $D = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in R^m$ is the feature vector and $y_i \in \{+1, -1\}$ is the class label. The linear classifier constructs the following decision function:
$$f(x) = w^T x + b, \tag{2}$$
where $w = (w_1, w_2, \ldots, w_m)$ is the weight vector and $b$ is the bias. The prediction label is $+1$ if $f(x) > 0$ and $-1$ otherwise. The standard SVM ($L_2$-SVM) [30] aims to find the separating hyperplane $f(x) = 0$ between the two classes with maximal margin $2/\|w\|_2^2$ and minimal training errors, which leads to the following convex optimization problem:
$$\begin{aligned} \min \quad & \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{n}\xi_i \\ \text{s.t.} \quad & y_i\bigl(w^T x_i + b\bigr) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n, \end{aligned} \tag{3}$$
where $\|w\|_2 = (\sum_{j=1}^{m} w_j^2)^{1/2}$ is the $L_2$ norm of $w$, $\xi_i$ is the slack variable that allows training errors for data that may not be linearly separable, and $C$ is a user-specified parameter balancing the margin and the losses. As the problem is a convex quadratic program, it can be solved efficiently by existing methods such as interior point methods and active set methods.
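Because (3) is a standard convex QP, any off-the-shelf solver handles it. As a point of reference for the baseline used in the experiments later, the following minimal sketch (our own illustration, not the authors' code; it assumes the cvxpy package and data arrays `X`, `y` are available) solves the primal $L_2$-SVM directly.

```python
import numpy as np
import cvxpy as cp

def l2_svm_primal(X, y, C=1.0):
    """Solve the primal L2-SVM (3) as a convex QP.

    X : (n, m) data matrix, y : (n,) labels in {+1, -1}, C : margin/loss trade-off.
    Returns the weight vector w, bias b, and slack variables xi.
    """
    n, m = X.shape
    w = cp.Variable(m)
    b = cp.Variable()
    xi = cp.Variable(n)
    # (1/2)||w||_2^2 + C * sum(xi) subject to the margin and nonnegativity constraints
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
    constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value, xi.value
```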

2.2. Sparse SVM. The $L_2$-SVM is a nonsparse regularizer in the sense that the learned decision hyperplane often utilizes all the features. In practice, people prefer a sparse SVM, so that only a few features are used to make a decision. For this purpose, the following $L_p$-SVM has become very popular:
$$\begin{aligned} \min \quad & \|w\|_p^p + C\sum_{i=1}^{n}\xi_i \\ \text{s.t.} \quad & y_i\bigl(w^T x_i + b\bigr) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n, \end{aligned} \tag{4}$$
where $p \in [0, 1]$, $\|w\|_0$ stands for the number of nonzero elements of $w$, and, for $p \in (0, 1]$, $\|w\|_p$ is defined by (1). Problem (4) is obtained by replacing the $L_2$ penalty ($\|w\|_2^2$) with the $L_p$ penalty ($\|w\|_p^p$) in (3). The standard SVM (3) corresponds to model (4) with $p = 2$.

Figure 1 plots the $L_p$ penalty in one dimension. We can see from the figure that the smaller $p$ is, the larger the penalties imposed on the small coefficients ($|w| < 1$). Therefore, the $L_p$ penalties with $p < 1$ may achieve sparser solutions than the $L_1$ penalty. In addition, the $L_1$ penalty imposes large penalties on large coefficients, which may lead to biased estimation of large coefficients. Consequently, the $L_p$ ($0 < p < 1$) penalties become attractive due to their good properties of sparsity, unbiasedness [31], and the oracle property [32]. We are particularly interested in the $L_{1/2}$ penalty. Recently, Xu et al. [21] revealed the representative role of the $L_{1/2}$ penalty in the $L_p$ regularization with $p \in (0, 1)$. We will apply the $L_{1/2}$ penalty to the SVM to perform feature selection and classification jointly.


Figure 1: $L_p$ penalty in one dimension (curves for $p = 2$, $p = 1$, and $p = 0.5$).

2.3. $L_{1/2}$-SVM Model. We pay particular attention to the $L_{1/2}$-SVM, namely, problem (4) with $p = 1/2$. We will derive a smooth constrained optimization reformulation of the $L_{1/2}$-SVM so that it is relatively easy to design numerical methods. We first specify the $L_{1/2}$-SVM:
$$\begin{aligned} \min \quad & \sum_{j=1}^{m}|w_j|^{1/2} + C\sum_{i=1}^{n}\xi_i \triangleq \phi(w, \xi, b) \\ \text{s.t.} \quad & y_i\bigl(w^T x_i + b\bigr) \ge 1 - \xi_i, \quad i = 1, \ldots, n, \\ & \xi_i \ge 0, \quad i = 1, \ldots, n. \end{aligned} \tag{5}$$

Denote $u = (w, \xi, b)$ and let $\mathcal{D}$ be the feasible region of the problem; that is,
$$\mathcal{D} = \bigl\{(w, \xi, b) \mid y_i\bigl(w^T x_i + b\bigr) \ge 1 - \xi_i, \ \xi_i \ge 0, \ i = 1, \ldots, n\bigr\}. \tag{6}$$
Then the $L_{1/2}$-SVM can be written in the compact form
$$\min \phi(u), \quad u \in \mathcal{D}. \tag{7}$$

It is a nonconvex and non-Lipschitz problem. Due to the term $|w_i|^{1/2}$, the objective function is not even directionally differentiable at a point with some $w_i = 0$, which makes the problem very difficult to solve. Existing numerical methods that are very efficient for smooth problems cannot be used directly. One possible way to develop numerical methods for solving (7) is to smooth the term $|w_j|^{1/2}$ using a smoothing function such as $\phi_\epsilon(w_j) = (w_j^2 + \epsilon^2)^{1/4}$ with some $\epsilon > 0$. However, it is easy to see that the derivative of $\phi_\epsilon(w_j)$ becomes unbounded as $w_j \to 0$ and $\epsilon \to 0$. Consequently, it is not to be expected that smoothing-function-based numerical methods would work well.

Recently, Tian and Yang [33] proposed an interior point $L_{1/2}$-penalty function method to solve general nonlinear programming problems by using a quadratic relaxation scheme for their $L_{1/2}$-lower order penalty problems. We follow the idea of [33] to develop an interior point method for solving the $L_{1/2}$-SVM. To this end, in the next subsection we reformulate problem (7) as a smooth constrained optimization problem.

2.4. A Reformulation of the $L_{1/2}$-SVM Model. Consider the following constrained optimization problem:
$$\begin{aligned} \min \quad & \sum_{j=1}^{m}t_j + C\sum_{i=1}^{n}\xi_i \triangleq f(w, \xi, t, b) \\ \text{s.t.} \quad & t_j^2 - w_j \ge 0, \quad j = 1, \ldots, m, \\ & t_j^2 + w_j \ge 0, \quad j = 1, \ldots, m, \\ & t_j \ge 0, \quad j = 1, \ldots, m, \\ & y_i\bigl(w^T x_i + b\bigr) \ge 1 - \xi_i, \quad i = 1, \ldots, n, \\ & \xi_i \ge 0, \quad i = 1, \ldots, n. \end{aligned} \tag{8}$$
It is obtained by letting $t_j = |w_j|^{1/2}$ in the objective function and adding the constraints $t_j^2 - w_j \ge 0$ and $t_j^2 + w_j \ge 0$, $j = 1, \ldots, m$, to (7). Denote by $\mathcal{F}$ the feasible region of the problem; that is,
$$\begin{aligned} \mathcal{F} = {} & \bigl\{(w, \xi, t, b) \mid t_j^2 - w_j \ge 0, \ t_j^2 + w_j \ge 0, \ t_j \ge 0, \ j = 1, \ldots, m\bigr\} \\ & \cap \bigl\{(w, \xi, t, b) \mid y_i\bigl(w^T x_i + b\bigr) \ge 1 - \xi_i, \ \xi_i \ge 0, \ i = 1, \ldots, n\bigr\}. \end{aligned} \tag{9}$$
Let $z = (w, \xi, t, b)$. Then the above problem can be written as
$$\min f(z), \quad z \in \mathcal{F}. \tag{10}$$
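The lifting $t_j = |w_j|^{1/2}$ is the whole trick behind (8): on the feasible set, the linear objective in $t$ reproduces the nonsmooth $L_{1/2}$ penalty. A small sketch (our own illustration, not the authors' code) makes this concrete by evaluating both objectives at a candidate point.

```python
import numpy as np

def phi(w, xi, C):
    """Objective of the L_{1/2}-SVM (5): sum_j |w_j|^{1/2} + C * sum_i xi_i."""
    return np.sum(np.sqrt(np.abs(w))) + C * np.sum(xi)

def f_lifted(w, xi, t, C):
    """Objective of the reformulation (8): sum_j t_j + C * sum_i xi_i."""
    return np.sum(t) + C * np.sum(xi)

def lift(w):
    """The natural lifting t_j = |w_j|^{1/2}, which is feasible for (8)."""
    return np.sqrt(np.abs(w))

# With t = |w|^{1/2} the two objectives coincide, which is the key step in Theorem 1 below.
w = np.array([0.5, -2.0, 0.0])
xi = np.array([0.1, 0.0])
t = lift(w)
assert np.isclose(phi(w, xi, C=1.0), f_lifted(w, xi, t, C=1.0))
assert np.all(t**2 - w >= 0) and np.all(t**2 + w >= 0) and np.all(t >= 0)
```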

The following theorem establishes the equivalence between the $L_{1/2}$-SVM and (10).

Theorem 1. If $u^* = (w^*, \xi^*, b^*) \in R^{m+n+1}$ is a solution of the $L_{1/2}$-SVM (7), then $z^* = (w^*, \xi^*, |w^*|^{1/2}, b^*) \in R^{m+n+m+1}$ is a solution of the optimization problem (10). Conversely, if $(w^*, \xi^*, t^*, b^*)$ is a solution of the optimization problem (10), then $(w^*, \xi^*, b^*)$ is a solution of the $L_{1/2}$-SVM (7).

Proof. Let $u^* = (w^*, \xi^*, b^*)$ be a solution of the $L_{1/2}$-SVM (7) and let $\bar{z} = (\bar{w}, \bar{\xi}, \bar{t}, \bar{b})$ be a solution of the constrained optimization problem (10). It is clear that $z^* = (w^*, \xi^*, |w^*|^{1/2}, b^*) \in \mathcal{F}$. Moreover, we have $\bar{t}_j^{\,2} \ge |\bar{w}_j|$ for all $j = 1, 2, \ldots, m$, and hence
$$f(\bar{z}) = \sum_{j=1}^{m}\bar{t}_j + C\sum_{i=1}^{n}\bar{\xi}_i \ge \sum_{j=1}^{m}|\bar{w}_j|^{1/2} + C\sum_{i=1}^{n}\bar{\xi}_i \ge \sum_{j=1}^{m}|w_j^*|^{1/2} + C\sum_{i=1}^{n}\xi_i^* = \phi(u^*). \tag{11}$$
Since $z^* \in \mathcal{F}$, we have
$$\phi(u^*) = f(z^*) \ge f(\bar{z}). \tag{12}$$
This, together with (11), implies that $f(\bar{z}) = \phi(u^*)$. The proof is complete.

It is clear that the constraint functions of (10) are convex. Consequently, at any feasible point, the set of all feasible directions coincides with the set of all linearized feasible directions. As a result, KKT points exist. The KKT system of problem (10) can be written as the following system of nonlinear equations:
$$R(w, \xi, t, b, \lambda) = \begin{pmatrix} \lambda^{(1)} - \lambda^{(2)} - X^T Y^T\lambda^{(4)} \\ C e_n - \lambda^{(4)} - \lambda^{(5)} \\ e_m - 2T\lambda^{(1)} - 2T\lambda^{(2)} - \lambda^{(3)} \\ y^T\lambda^{(4)} \\ \bigl\{\min\{t_j^2 - w_j,\ \lambda_j^{(1)}\}\bigr\}_{j=1,\ldots,m} \\ \bigl\{\min\{t_j^2 + w_j,\ \lambda_j^{(2)}\}\bigr\}_{j=1,\ldots,m} \\ \bigl\{\min\{t_j,\ \lambda_j^{(3)}\}\bigr\}_{j=1,\ldots,m} \\ \bigl\{\min\{p_i,\ \lambda_i^{(4)}\}\bigr\}_{i=1,\ldots,n} \\ \bigl\{\min\{\xi_i,\ \lambda_i^{(5)}\}\bigr\}_{i=1,\ldots,n} \end{pmatrix} = 0, \tag{13}$$
where $\lambda = (\lambda^{(1)}, \lambda^{(2)}, \lambda^{(3)}, \lambda^{(4)}, \lambda^{(5)})$ are the Lagrangian multipliers, $X = (x_1, x_2, \ldots, x_n)^T$, $Y = \mathrm{diag}(y)$ is a diagonal matrix, and $p_i = y_i(w^T x_i + b) - (1 - \xi_i)$, $i = 1, 2, \ldots, n$.

For the sake of brevity, the properties of the reformulation of the $L_{1/2}$-SVM are given in Appendix A.

3. An Interior Point Method

In this section, we develop an interior point method to solve the equivalent constrained problem (10) of the $L_{1/2}$-SVM (7).

3.1. Auxiliary Function. Following the idea of interior point methods, the constrained problem (10) can be solved by minimizing a sequence of logarithmic barrier functions as follows:
$$\begin{aligned} \min \quad \Phi_\mu(w, \xi, t, b) = {} & \sum_{j=1}^{m}t_j + C\sum_{i=1}^{n}\xi_i \\ & - \mu\sum_{j=1}^{m}\bigl[\log(t_j^2 - w_j) + \log(t_j^2 + w_j) + \log t_j\bigr] - \mu\sum_{i=1}^{n}\bigl(\log p_i + \log\xi_i\bigr) \\ \text{s.t.} \quad & t_j^2 - w_j > 0, \quad t_j^2 + w_j > 0, \quad t_j > 0, \quad j = 1, 2, \ldots, m, \\ & p_i = y_i\bigl(w^T x_i + b\bigr) + \xi_i - 1 > 0, \quad \xi_i > 0, \quad i = 1, 2, \ldots, n, \end{aligned} \tag{14}$$
where $\mu$ is the barrier parameter, which converges to zero from above.
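For reference, here is one way to evaluate the barrier function $\Phi_\mu$ of (14) at a candidate point (a sketch of ours, not the authors' implementation); returning $+\infty$ whenever an argument of a logarithm is nonpositive is convenient inside a line search.

```python
import numpy as np

def barrier_value(w, xi, t, b, X, y, C, mu):
    """Evaluate Phi_mu in (14); +inf outside the strictly feasible region."""
    p = y * (X @ w + b) + xi - 1.0           # p_i = y_i(w^T x_i + b) + xi_i - 1
    terms = [t**2 - w, t**2 + w, t, p, xi]    # all must be strictly positive
    if any(np.any(v <= 0) for v in terms):
        return np.inf
    value = np.sum(t) + C * np.sum(xi)
    value -= mu * (np.sum(np.log(t**2 - w)) + np.sum(np.log(t**2 + w)) + np.sum(np.log(t)))
    value -= mu * (np.sum(np.log(p)) + np.sum(np.log(xi)))
    return value
```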

The KKT system of problem (14) is the following system of nonlinear equations:
$$\begin{aligned} \lambda^{(1)} - \lambda^{(2)} - X^T Y^T\lambda^{(4)} &= 0, \\ C e_n - \lambda^{(4)} - \lambda^{(5)} &= 0, \\ e_m - 2T\lambda^{(1)} - 2T\lambda^{(2)} - \lambda^{(3)} &= 0, \\ y^T\lambda^{(4)} &= 0, \\ (T^2 - W)\lambda^{(1)} - \mu e_m &= 0, \\ (T^2 + W)\lambda^{(2)} - \mu e_m &= 0, \\ T\lambda^{(3)} - \mu e_m &= 0, \\ P\lambda^{(4)} - \mu e_n &= 0, \\ \Xi\lambda^{(5)} - \mu e_n &= 0, \end{aligned} \tag{15}$$
where $\lambda = (\lambda^{(1)}, \lambda^{(2)}, \lambda^{(3)}, \lambda^{(4)}, \lambda^{(5)})$ are the Lagrangian multipliers, $T = \mathrm{diag}(t)$, $W = \mathrm{diag}(w)$, $\Xi = \mathrm{diag}(\xi)$, and $P = \mathrm{diag}(Y(Xw + b e_n) + \xi - e_n)$ are diagonal matrices, and $e_m \in R^m$ and $e_n \in R^n$ stand for the vectors whose elements are all ones.

3.2. Newton's Method. We apply Newton's method to solve the nonlinear system (15) in the variables $w$, $\xi$, $t$, $b$, and $\lambda$. The subproblem of the method is the following system of linear equations:
$$M\begin{pmatrix} \Delta w \\ \Delta\xi \\ \Delta t \\ \Delta b \\ \Delta\lambda^{(1)} \\ \Delta\lambda^{(2)} \\ \Delta\lambda^{(3)} \\ \Delta\lambda^{(4)} \\ \Delta\lambda^{(5)} \end{pmatrix} = -\begin{pmatrix} \lambda^{(1)} - \lambda^{(2)} - X^T Y^T\lambda^{(4)} \\ C e_n - \lambda^{(4)} - \lambda^{(5)} \\ e_m - 2T\lambda^{(1)} - 2T\lambda^{(2)} - \lambda^{(3)} \\ e_n^T Y\lambda^{(4)} \\ (T^2 - W)\lambda^{(1)} - \mu e_m \\ (T^2 + W)\lambda^{(2)} - \mu e_m \\ T\lambda^{(3)} - \mu e_m \\ P\lambda^{(4)} - \mu e_n \\ \Xi\lambda^{(5)} - \mu e_n \end{pmatrix}, \tag{16}$$
where $M$ is the Jacobian of the left-hand side of (15) and takes the form


$$M = \begin{pmatrix} 0 & 0 & 0 & 0 & I & -I & 0 & -X^T Y^T & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & -I & -I \\ 0 & 0 & -2(D_1 + D_2) & 0 & -2T & -2T & -I & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & y^T & 0 \\ -D_1 & 0 & 2TD_1 & 0 & T^2 - W & 0 & 0 & 0 & 0 \\ D_2 & 0 & 2TD_2 & 0 & 0 & T^2 + W & 0 & 0 & 0 \\ 0 & 0 & D_3 & 0 & 0 & 0 & T & 0 & 0 \\ D_4 YX & D_4 & 0 & D_4 y & 0 & 0 & 0 & P & 0 \\ 0 & D_5 & 0 & 0 & 0 & 0 & 0 & 0 & \Xi \end{pmatrix}, \tag{17}$$
where $D_1 = \mathrm{diag}(\lambda^{(1)})$, $D_2 = \mathrm{diag}(\lambda^{(2)})$, $D_3 = \mathrm{diag}(\lambda^{(3)})$, $D_4 = \mathrm{diag}(\lambda^{(4)})$, and $D_5 = \mathrm{diag}(\lambda^{(5)})$.

We can rewrite (16) as
$$\begin{aligned} \bigl(\lambda^{(1)} + \Delta\lambda^{(1)}\bigr) - \bigl(\lambda^{(2)} + \Delta\lambda^{(2)}\bigr) - X^T Y^T\bigl(\lambda^{(4)} + \Delta\lambda^{(4)}\bigr) &= 0, \\ \bigl(\lambda^{(4)} + \Delta\lambda^{(4)}\bigr) + \bigl(\lambda^{(5)} + \Delta\lambda^{(5)}\bigr) &= C e_n, \\ 2(D_1 + D_2)\Delta t + 2T\bigl(\lambda^{(1)} + \Delta\lambda^{(1)}\bigr) + 2T\bigl(\lambda^{(2)} + \Delta\lambda^{(2)}\bigr) + \bigl(\lambda^{(3)} + \Delta\lambda^{(3)}\bigr) &= e_m, \\ y^T\bigl(\lambda^{(4)} + \Delta\lambda^{(4)}\bigr) &= 0, \\ -D_1\Delta w + 2TD_1\Delta t + (T^2 - W)\bigl(\lambda^{(1)} + \Delta\lambda^{(1)}\bigr) &= \mu e_m, \\ D_2\Delta w + 2TD_2\Delta t + (T^2 + W)\bigl(\lambda^{(2)} + \Delta\lambda^{(2)}\bigr) &= \mu e_m, \\ D_3\Delta t + T\bigl(\lambda^{(3)} + \Delta\lambda^{(3)}\bigr) &= \mu e_m, \\ D_4 YX\Delta w + D_4\Delta\xi + D_4 y\,\Delta b + P\bigl(\lambda^{(4)} + \Delta\lambda^{(4)}\bigr) &= \mu e_n, \\ D_5\Delta\xi + \Xi\bigl(\lambda^{(5)} + \Delta\lambda^{(5)}\bigr) &= \mu e_n. \end{aligned} \tag{18}$$

It follows from the last five equations that the vector $\hat{\lambda} \triangleq \lambda + \Delta\lambda$ can be expressed as
$$\begin{aligned} \hat{\lambda}^{(1)} &= (T^2 - W)^{-1}\bigl(\mu e_m + D_1\Delta w - 2TD_1\Delta t\bigr), \\ \hat{\lambda}^{(2)} &= (T^2 + W)^{-1}\bigl(\mu e_m - D_2\Delta w - 2TD_2\Delta t\bigr), \\ \hat{\lambda}^{(3)} &= T^{-1}\bigl(\mu e_m - D_3\Delta t\bigr), \\ \hat{\lambda}^{(4)} &= P^{-1}\bigl(\mu e_n - D_4 YX\Delta w - D_4\Delta\xi - D_4 y\,\Delta b\bigr), \\ \hat{\lambda}^{(5)} &= \Xi^{-1}\bigl(\mu e_n - D_5\Delta\xi\bigr). \end{aligned} \tag{19}$$

Substituting (19) into the first four equations of (18), we obtain
$$S\begin{pmatrix} \Delta w \\ \Delta\xi \\ \Delta t \\ \Delta b \end{pmatrix} = \begin{pmatrix} -\mu(T^2 - W)^{-1}e_m + \mu(T^2 + W)^{-1}e_m + \mu X^T Y^T P^{-1}e_n \\ -C e_n + \mu P^{-1}e_n + \mu\Xi^{-1}e_n \\ -e_m + 2\mu T(T^2 - W)^{-1}e_m + 2\mu T(T^2 + W)^{-1}e_m + \mu T^{-1}e_m \\ \mu y^T P^{-1}e_n \end{pmatrix}, \tag{20}$$

where the matrix $S$ takes the form
$$S = \begin{pmatrix} S_{11} & S_{12} & S_{13} & S_{14} \\ S_{21} & S_{22} & S_{23} & S_{24} \\ S_{31} & S_{32} & S_{33} & S_{34} \\ S_{41} & S_{42} & S_{43} & S_{44} \end{pmatrix}, \tag{21}$$

with blocks
$$\begin{aligned} S_{11} &= U + V + X^T Y^T P^{-1}D_4 YX, & S_{12} &= X^T Y^T P^{-1}D_4, \\ S_{13} &= -2(U - V)T, & S_{14} &= X^T Y^T P^{-1}D_4 y, \\ S_{21} &= P^{-1}D_4 YX, & S_{22} &= P^{-1}D_4 + \Xi^{-1}D_5, \\ S_{23} &= 0, & S_{24} &= P^{-1}D_4 y, \\ S_{31} &= -2(U - V)T, & S_{32} &= 0, \\ S_{33} &= 4T(U + V)T + T^{-1}D_3 - 2(D_1 + D_2), & S_{34} &= 0, \\ S_{41} &= y^T P^{-1}D_4 YX, & S_{42} &= y^T P^{-1}D_4, \\ S_{43} &= 0, & S_{44} &= y^T P^{-1}D_4 y, \end{aligned} \tag{22}$$
where $U = (T^2 - W)^{-1}D_1$ and $V = (T^2 + W)^{-1}D_2$.
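To make the condensed Newton step concrete, the sketch below (ours, not the authors' implementation) assembles the blocks (22), forms the right-hand side of (20), solves the $(2m + n + 1)$-dimensional linear system for $(\Delta w, \Delta\xi, \Delta t, \Delta b)$, and recovers the multipliers from (19).

```python
import numpy as np

def newton_direction(w, xi, t, b, lam, X, y, C, mu):
    """Assemble S from (21)-(22), solve the condensed system (20), recover lam_hat via (19)."""
    m, n = len(w), len(xi)
    T = np.diag(t)
    p = y * (X @ w + b) + xi - 1.0
    U = np.diag(lam[1] / (t**2 - w))            # U = (T^2 - W)^{-1} D_1
    V = np.diag(lam[2] / (t**2 + w))            # V = (T^2 + W)^{-1} D_2
    YX = y[:, None] * X                          # the product Y X
    d4p = lam[4] / p                             # diagonal of P^{-1} D_4
    S11 = U + V + YX.T @ (d4p[:, None] * YX)
    S12 = YX.T @ np.diag(d4p)
    S13 = -2.0 * (U - V) @ T
    S14 = (YX.T @ (d4p * y))[:, None]
    S21 = np.diag(d4p) @ YX
    S22 = np.diag(d4p + lam[5] / xi)
    S24 = (d4p * y)[:, None]
    S33 = 4.0 * T @ (U + V) @ T + np.diag(lam[3] / t) - 2.0 * np.diag(lam[1] + lam[2])
    S41 = (d4p * y) @ YX
    S42 = d4p * y
    S44 = np.array([[np.sum(d4p)]])
    S = np.block([
        [S11, S12, S13, S14],
        [S21, S22, np.zeros((n, m)), S24],
        [S13, np.zeros((m, n)), S33, np.zeros((m, 1))],
        [S41[None, :], S42[None, :], np.zeros((1, m)), S44],
    ])
    em, en = np.ones(m), np.ones(n)
    rhs = np.concatenate([
        -mu / (t**2 - w) + mu / (t**2 + w) + mu * (YX.T @ (1.0 / p)),
        -C * en + mu / p + mu / xi,
        -em + 2.0 * mu * t / (t**2 - w) + 2.0 * mu * t / (t**2 + w) + mu / t,
        [mu * np.sum(y / p)],
    ])
    d = np.linalg.solve(S, rhs)
    dw, dxi, dt, db = d[:m], d[m:m + n], d[m + n:2 * m + n], d[-1]
    # Recover the updated multipliers lam_hat = lam + delta_lam from (19)
    lam_hat = {
        1: (mu + lam[1] * dw - 2 * t * lam[1] * dt) / (t**2 - w),
        2: (mu - lam[2] * dw - 2 * t * lam[2] * dt) / (t**2 + w),
        3: (mu - lam[3] * dt) / t,
        4: (mu - lam[4] * (YX @ dw) - lam[4] * dxi - lam[4] * y * db) / p,
        5: (mu - lam[5] * dxi) / xi,
    }
    return dw, dxi, dt, db, lam_hat
```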

3.3. The Interior Point Algorithm. Let $z = (w, \xi, t, b)$ and $\lambda = (\lambda^{(1)}, \lambda^{(2)}, \lambda^{(3)}, \lambda^{(4)}, \lambda^{(5)})$. We first present the interior point algorithm for solving the barrier problem (14) and then discuss the details of the algorithm.

Algorithm 2 (the interior point algorithm (IPA)).

Step 0. Given a tolerance $\epsilon_\mu$, set $\tau_1 \in (0, 1/2)$, $l > 0$, $\gamma_1 > 1$, $\beta \in (0, 1)$. Let $k = 0$.

Step 1. Stop if the KKT condition (15) holds.

Step 2. Compute $\Delta z^k$ from (20) and $\hat{\lambda}^{k+1}$ from (19). Compute $z^{k+1} = z^k + \alpha_k\Delta z^k$. Update the Lagrangian multipliers to obtain $\lambda^{k+1}$.

Step 3. Let $k = k + 1$. Go to Step 1.

In Step 2, a step length $\alpha_k$ is used to calculate $z^{k+1}$. We estimate $\alpha_k$ by an Armijo line search [34], in which $\alpha_k = \max\{\beta^j \mid j = 0, 1, 2, \ldots\}$ for some $\beta \in (0, 1)$ such that the following inequalities are satisfied:
$$\begin{aligned} & \bigl(t^{k+1}\bigr)^2 - w^{k+1} > 0, \qquad \bigl(t^{k+1}\bigr)^2 + w^{k+1} > 0, \qquad t^{k+1} > 0, \\ & \Phi_\mu\bigl(z^{k+1}\bigr) - \Phi_\mu\bigl(z^k\bigr) \le \tau_1\alpha_k\Bigl(\nabla_w\Phi_\mu\bigl(z^k\bigr)^T\Delta w^k + \nabla_t\Phi_\mu\bigl(z^k\bigr)^T\Delta t^k + \nabla_\xi\Phi_\mu\bigl(z^k\bigr)^T\Delta\xi^k + \nabla_b\Phi_\mu\bigl(z^k\bigr)^T\Delta b^k\Bigr), \end{aligned} \tag{23}$$
where $\tau_1 \in (0, 1/2)$.
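In code, Step 2 of Algorithm 2 reduces to a plain backtracking loop: shrink $\alpha$ by $\beta$ until the new point is strictly feasible and the decrease condition in (23) holds. A sketch of ours follows; it reuses the `barrier_value` helper sketched after (14), whose $+\infty$ return value encodes the strict feasibility requirements.

```python
import numpy as np

def armijo_step(z, dz, grad_phi, phi_val, X, y, C, mu, beta=0.5, tau1=0.25):
    """Backtracking line search enforcing strict feasibility and the decrease test (23).

    z, dz, grad_phi are dicts with keys 'w', 'xi', 't', 'b'; phi_val = Phi_mu(z).
    """
    directional = (grad_phi['w'] @ dz['w'] + grad_phi['xi'] @ dz['xi']
                   + grad_phi['t'] @ dz['t'] + grad_phi['b'] * dz['b'])
    alpha = 1.0
    for _ in range(60):                        # alpha = beta^j, j = 0, 1, 2, ...
        trial = {k: z[k] + alpha * dz[k] for k in z}
        new_val = barrier_value(trial['w'], trial['xi'], trial['t'], trial['b'], X, y, C, mu)
        # barrier_value is +inf if any of t^2 - w, t^2 + w, t, p, xi fails to be positive,
        # so a finite value certifies the strict feasibility part of (23).
        if new_val - phi_val <= tau1 * alpha * directional:
            return alpha, trial, new_val
        alpha *= beta
    raise RuntimeError("line search failed to find an acceptable step")
```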

To avoid ill-conditioned growth of $\hat{\lambda}^k$ and to guarantee strict dual feasibility, the Lagrangian multipliers $\lambda^k$ should be sufficiently positive and bounded from above. Following an idea similar to that of [33], we first update the dual multipliers by
$$\tilde{\lambda}^{(1)(k+1)}_i = \begin{cases} \min\Bigl\{l,\ \dfrac{\mu}{(t^k_i)^2 - w^k_i}\Bigr\}, & \text{if } \hat{\lambda}^{(1)(k+1)}_i < \min\Bigl\{l,\ \dfrac{\mu}{(t^k_i)^2 - w^k_i}\Bigr\}, \\[1.5ex] \hat{\lambda}^{(1)(k+1)}_i, & \text{if } \min\Bigl\{l,\ \dfrac{\mu}{(t^k_i)^2 - w^k_i}\Bigr\} \le \hat{\lambda}^{(1)(k+1)}_i \le \dfrac{\mu\gamma_1}{(t^k_i)^2 - w^k_i}, \\[1.5ex] \dfrac{\mu\gamma_1}{(t^k_i)^2 - w^k_i}, & \text{if } \hat{\lambda}^{(1)(k+1)}_i > \dfrac{\mu\gamma_1}{(t^k_i)^2 - w^k_i}, \end{cases} \tag{24}$$
for $i = 1, 2, \ldots, m$; the blocks $\tilde{\lambda}^{(2)(k+1)}$, $\tilde{\lambda}^{(3)(k+1)}$, and $\tilde{\lambda}^{(4)(k+1)}$ are obtained in exactly the same way, with the slack $(t^k_i)^2 - w^k_i$ replaced by $(t^k_i)^2 + w^k_i$ ($i = 1, \ldots, m$), $t^k_i$ ($i = 1, \ldots, m$), and $p^k_i$ ($i = 1, \ldots, n$), respectively,


where $p_i = y_i(w^T x_i + b) + \xi_i - 1$. Likewise,
$$\tilde{\lambda}^{(5)(k+1)}_i = \begin{cases} \min\Bigl\{l,\ \dfrac{\mu}{\xi^k_i}\Bigr\}, & \text{if } \hat{\lambda}^{(5)(k+1)}_i < \min\Bigl\{l,\ \dfrac{\mu}{\xi^k_i}\Bigr\}, \\[1.5ex] \hat{\lambda}^{(5)(k+1)}_i, & \text{if } \min\Bigl\{l,\ \dfrac{\mu}{\xi^k_i}\Bigr\} \le \hat{\lambda}^{(5)(k+1)}_i \le \dfrac{\mu\gamma_1}{\xi^k_i}, \\[1.5ex] \dfrac{\mu\gamma_1}{\xi^k_i}, & \text{if } \hat{\lambda}^{(5)(k+1)}_i > \dfrac{\mu\gamma_1}{\xi^k_i}, \end{cases} \quad i = 1, 2, \ldots, n, \tag{25}$$
where the parameters $l$ and $\gamma_1$ satisfy $l > 0$ and $\gamma_1 > 1$.
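In code, the safeguard (24)-(25) is just a componentwise clipping of $\hat{\lambda}$ into the box $[\min\{l, \mu/s\},\ \mu\gamma_1/s]$, where $s$ is the corresponding slack ($t^2 - w$, $t^2 + w$, $t$, $p$, or $\xi$). A small sketch of ours:

```python
import numpy as np

def safeguard_multipliers(lam_hat, slacks, mu, l=1e-6, gamma1=1e5):
    """Clip each multiplier block into [min(l, mu/s), mu*gamma1/s] as in (24)-(25).

    lam_hat and slacks are dicts with keys 1..5; slacks[k] holds t^2 - w, t^2 + w, t, p, or xi.
    """
    lam = {}
    for k in (1, 2, 3, 4, 5):
        lower = np.minimum(l, mu / slacks[k])
        upper = mu * gamma1 / slacks[k]
        lam[k] = np.clip(lam_hat[k], lower, upper)
    return lam
```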

Since positive definiteness of the matrix $S$ is required by this method, the Lagrangian multipliers $\lambda^{k+1}$ should satisfy the following condition:
$$\lambda^{(3)} - 2\bigl(\lambda^{(1)} + \lambda^{(2)}\bigr)t \ge 0. \tag{26}$$

For the sake of brevity, the proof is given in Appendix B. Therefore, if $\tilde{\lambda}^{k+1}$ satisfies (26), we let $\lambda^{k+1} = \tilde{\lambda}^{k+1}$. Otherwise, we further update it by the following setting:
$$\lambda^{(1)(k+1)} = \gamma_2\tilde{\lambda}^{(1)(k+1)}, \qquad \lambda^{(2)(k+1)} = \gamma_2\tilde{\lambda}^{(2)(k+1)}, \qquad \lambda^{(3)(k+1)} = \gamma_3\tilde{\lambda}^{(3)(k+1)}, \tag{27}$$
where the constants $\gamma_2 \in (0, 1)$ and $\gamma_3 \ge 1$ satisfy
$$\frac{\gamma_3}{\gamma_2} = \max_{i \in E}\ \frac{2 t^{k+1}_i\bigl(\tilde{\lambda}^{(1)(k+1)}_i + \tilde{\lambda}^{(2)(k+1)}_i\bigr)}{\tilde{\lambda}^{(3)(k+1)}_i}, \tag{28}$$

with $E = \{1, 2, \ldots, m\}$. It is not difficult to see that the vector $(\lambda^{(1)(k+1)}, \lambda^{(2)(k+1)}, \lambda^{(3)(k+1)})$ determined by (27) satisfies (26).

In practice, the KKT conditions (15) are allowed to be satisfied within a tolerance $\epsilon_\mu$; that is, the iterative process stops when the following inequalities are met:
$$\mathrm{Res}(z, \lambda, \mu) = \left\|\begin{pmatrix} \lambda^{(1)} - \lambda^{(2)} - X^T Y^T\lambda^{(4)} \\ C e_n - \lambda^{(4)} - \lambda^{(5)} \\ e_m - 2T\lambda^{(1)} - 2T\lambda^{(2)} - \lambda^{(3)} \\ e_n^T Y\lambda^{(4)} \\ (T^2 - W)\lambda^{(1)} - \mu e_m \\ (T^2 + W)\lambda^{(2)} - \mu e_m \\ T\lambda^{(3)} - \mu e_m \\ P\lambda^{(4)} - \mu e_n \\ \Xi\lambda^{(5)} - \mu e_n \end{pmatrix}\right\| < \epsilon_\mu, \tag{29}$$
$$\lambda \ge -\epsilon_\mu e, \tag{30}$$
where $\epsilon_\mu$ is related to the current barrier parameter $\mu$ and satisfies $\epsilon_\mu \downarrow 0$ as $\mu \to 0$.

Since the function $\Phi_\mu$ in (14) is convex, we have the following lemma, which shows that Algorithm 2 is well defined (the proof is given in Appendix B).

Lemma 3. Let $z^k$ be strictly feasible for problem (10). If $\Delta z^k = 0$, then (23) is satisfied for all $\alpha_k \ge 0$. If $\Delta z^k \neq 0$, then there exists an $\bar{\alpha}_k \in (0, 1]$ such that (23) holds for all $\alpha_k \in (0, \bar{\alpha}_k]$.

The proposed interior point method successively solves the barrier subproblem (14) with a decreasing sequence $\{\mu_j\}$. We simply reduce both $\epsilon_\mu$ and $\mu$ by a constant factor $\beta \in (0, 1)$. Finally, we test optimality for problem (10) by means of the residual norm $\mathrm{Res}(z, \lambda, 0)$. Here we present the whole algorithm for solving the $L_{1/2}$-SVM problem (10).

Algorithm 4 (algorithm for solving the $L_{1/2}$-SVM problem).

Step 0. Set $w^0 \in R^m$, $b^0 \in R$, $\xi^0 \in R^n$, $\lambda^0 = \hat{\lambda}^0 > 0$, and $t^0_i \ge \sqrt{|w^0_i|} + 1/2$, $i \in \{1, 2, \ldots, m\}$. Given constants $\mu_0 > 0$, $\epsilon_{\mu_0} > 0$, $\beta \in (0, 1)$, and $\epsilon > 0$, let $j = 0$.

Step 1. Stop if $\mathrm{Res}(z^j, \hat{\lambda}^j, 0) \le \epsilon$ and $\hat{\lambda}^j \ge 0$.

Step 2. Starting from $(z^j, \lambda^j)$, apply Algorithm 2 to solve (14) with barrier parameter $\mu_j$ and stopping tolerance $\epsilon_{\mu_j}$. Set $z^{j+1} = z^{j,k}$, $\hat{\lambda}^{j+1} = \hat{\lambda}^{j,k}$, and $\lambda^{j+1} = \lambda^{j,k}$.

Step 3. Set $\mu_{j+1} = \beta\mu_j$ and $\epsilon_{\mu_{j+1}} = \beta\epsilon_{\mu_j}$. Let $j = j + 1$ and go to Step 1.
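Putting the pieces together, the outer loop of Algorithm 4 is short: call the inner solver for the current $\mu_j$, then shrink $\mu$ and the tolerance by $\beta$. The sketch below (ours; `solve_barrier_subproblem` is a hypothetical stand-in for Algorithm 2, and `kkt_residual` is the helper sketched after (15)) illustrates the control flow only, not the authors' implementation.

```python
import numpy as np

def l12_svm_interior_point(X, y, C, mu0=1.0, eps_mu0=1e-1, beta=0.6, eps=1e-6, max_outer=100):
    """Outer loop of Algorithm 4: a sequence of barrier subproblems with mu, eps_mu -> 0."""
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    xi = np.full(n, 2.0)                        # gives p = y(Xw + b) + xi - 1 = 1 > 0 at the start
    t = np.sqrt(np.abs(w)) + 0.5                # ensures t_i^2 > |w_i| strictly
    lam = {k: np.ones(m if k <= 3 else n) for k in (1, 2, 3, 4, 5)}
    mu, eps_mu = mu0, eps_mu0
    for _ in range(max_outer):
        res = np.linalg.norm(kkt_residual(w, xi, t, b, lam, X, y, C, mu=0.0))
        if res <= eps and all(np.all(lam[k] >= 0) for k in lam):
            break                               # optimality test for problem (10)
        # Algorithm 2: approximately solve the barrier subproblem (14) for the current mu
        w, xi, t, b, lam = solve_barrier_subproblem(w, xi, t, b, lam, X, y, C, mu, eps_mu)
        mu *= beta
        eps_mu *= beta
    return w, b
```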

In Algorithm 4, the index $j$ denotes an outer iteration, while $k$ denotes the last inner iteration of Algorithm 2.

The convergence of the proposed interior point method can be proved. We state the theorems here and give the proofs in Appendix C.

Theorem 5. Let $\{(z^k, \lambda^k)\}$ be generated by Algorithm 2. Then any limit point of the sequence $\{(z^k, \hat{\lambda}^k)\}$ generated by Algorithm 2 satisfies the first-order optimality conditions (15).

Theorem 6. Let $\{(z^j, \hat{\lambda}^j)\}$ be generated by Algorithm 4 with its termination condition ignored. Then the following statements are true:

(i) Any limit point of $\{(z^j, \hat{\lambda}^j)\}$ satisfies the first-order optimality conditions (13).

(ii) The limit point $z^*$ of a convergent subsequence $\{z^j\}_{\mathcal{J}} \subseteq \{z^j\}$ with unbounded multipliers $\{\hat{\lambda}^j\}_{\mathcal{J}}$ is a Fritz John point [35] of problem (10).

4. Experiments

In this section, we tested the constrained optimization reformulation of the $L_{1/2}$-SVM and the proposed interior point method. We compared the performance of the $L_{1/2}$-SVM with the $L_2$-SVM [30], $L_1$-SVM [12], and $L_0$-SVM [12] on artificial data and ten UCI data sets (http://archive.ics.uci.edu/ml). These four problems were solved in the primal, referencing the machine learning toolbox Spider (http://people.kyb.tuebingen.mpg.de/spider). The $L_2$-SVM and $L_1$-SVM were solved directly by quadratic programming and linear programming, respectively.


For the NP-hard $L_0$-SVM, a commonly cited approximation method, Feature Selection Concave (FSV) [12], was applied, and the FSV problem was then solved by a Successive Linear Approximation (SLA) algorithm. All the experiments were run on a personal computer (1.6 GHz CPU, 4 GB RAM) with MATLAB R2010b on 64-bit Windows 7.

In the proposed interior point method (Algorithms 2 and 4), we set the parameters as $\beta = 0.6$, $\tau_1 = 0.5$, $l = 0.000001$, $\gamma_1 = 100000$, and $\beta = 0.5$. The balance parameter $C$ was selected by 5-fold cross-validation on the training set over the range $\{2^{-10}, 2^{-9}, \ldots, 2^{10}\}$. After training, the weights that did not satisfy the criterion $|w_j|/\max_i(|w_i|) \ge 10^{-4}$ [14] were set to zero. The cardinality of the hyperplane was then computed as the number of nonzero weights.

4.1. Artificial Data. First, we took an artificial binary linear classification problem as an example. The problem is similar to that in [13]. The probability of $y = 1$ or $-1$ is equal. The first 6 features are relevant but redundant. In 70% of the samples, the first three features $x_1, x_2, x_3$ were drawn as $x_i = y\,N(i, 1)$ and the second three features $x_4, x_5, x_6$ as $x_i = N(0, 1)$; otherwise, the first three were drawn as $x_i = N(0, 1)$ and the second three as $x_i = y\,N(i - 3, 1)$. The remaining features are noise, $x_i = N(0, 20)$, $i = 7, \ldots, m$. Here $m$ is the dimension of the input features. The inputs were scaled to zero mean and unit standard deviation. In each trial, 500 points were generated for testing, and the average results were estimated over 30 trials.

In the first experiment, we consider the cases with the fixed feature size $m = 30$ and different training sample sizes $n = 10, 20, \ldots, 100$. The average results over the 30 trials are shown in Table 1 and Figure 2. Figure 2 (left) plots the average cardinality of each classifier. Since the artificial data sets have 2 relevant and nonredundant features, the ideal average cardinality is 2. Figure 2 (left) shows that the three sparse SVMs ($L_1$-SVM, $L_{1/2}$-SVM, and $L_0$-SVM) can achieve sparse solutions, while the $L_2$-SVM uses almost all features in each data set. Furthermore, the solutions of the $L_{1/2}$-SVM and $L_0$-SVM are much sparser than those of the $L_1$-SVM. As shown in Table 1, the $L_1$-SVM selects more than 6 features in all cases, which implies that some redundant or irrelevant features are selected. The average cardinalities of the $L_{1/2}$-SVM and $L_0$-SVM are similar and close to 2. However, when $n = 10$ and 20, the $L_0$-SVM has average cardinalities of 1.42 and 1.87, respectively. This means that the $L_0$-SVM sometimes selects only one feature on low-sample data sets and may ignore some truly relevant features. Consequently, with cardinalities between 2.05 and 2.90, the $L_{1/2}$-SVM has a more reliable solution than the $L_0$-SVM. In short, as far as the number of selected features is concerned, the $L_{1/2}$-SVM behaves better than the other three methods.

Figure 2 (right) plots the trend of the prediction accuracy versus the size of the training sample. The classification performance of all methods generally improves as the training sample size $n$ increases. The $L_1$-SVM has the best prediction performance in all cases, slightly better than the $L_{1/2}$-SVM. The $L_{1/2}$-SVM is more accurate in classification than the $L_2$-SVM and $L_0$-SVM, especially for $n = 10, \ldots, 50$. As shown in Table 1, when there are only 10 training samples, the average accuracy of the $L_{1/2}$-SVM is 88.05%, while the results of the $L_2$-SVM and $L_0$-SVM are 84.65% and 77.65%, respectively. Compared with the $L_2$-SVM and $L_0$-SVM, the $L_{1/2}$-SVM has its average accuracy increased by 3.4% and 10.4%, respectively, which can be explained as follows. For the $L_2$-SVM, all features are selected without discrimination, and the prediction is misled by the irrelevant features. For the $L_0$-SVM, few features are selected and some relevant features are not included, which has a negative impact on the prediction result. As a tradeoff between the $L_2$-SVM and the $L_0$-SVM, the $L_{1/2}$-SVM performs better than both.

The average results over the ten artificial data sets in the first experiment are shown at the bottom of Table 1. On average, the accuracy of the $L_{1/2}$-SVM is 0.87% lower than that of the $L_1$-SVM, while the number of features selected by the $L_{1/2}$-SVM is 74.14% smaller than that of the $L_1$-SVM. This indicates that the $L_{1/2}$-SVM can achieve a much sparser solution than the $L_1$-SVM at little cost in accuracy. Moreover, the average accuracy of the $L_{1/2}$-SVM over the 10 data sets is 2.22% higher than that of the $L_0$-SVM, with similar cardinality. To sum up, the $L_{1/2}$-SVM provides the best balance between accuracy and sparsity among the three sparse SVMs.

To further evaluate the feature selection performance of the $L_{1/2}$-SVM, we investigate whether the features are correctly selected. Since the $L_2$-SVM is not designed for feature selection, it is not included in this comparison. Since our artificial data sets have 2 best features ($x_3$, $x_6$), the best result should have these two features ranked on top according to the absolute values of their weights $|w_j|$. In the experiment, we select the top 2 features with the maximal $|w_j|$ for each method and count the frequency with which the top 2 features are $x_3$ and $x_6$ over the 30 runs. The results are listed in Table 2. When the training sample size is too small, it is difficult for all sparse SVMs to discriminate the two most important features. For example, when $n = 10$, the selected frequencies of the $L_1$-SVM, $L_0$-SVM, and $L_{1/2}$-SVM are 7, 3, and 9, respectively. As $n$ increases, all methods tend to make more correct selections. Moreover, Table 2 shows that the $L_{1/2}$-SVM outperforms the other two methods in all cases. For example, when $n = 100$, the selected frequencies of the $L_1$-SVM and $L_0$-SVM are 22 and 25, respectively, while the result of the $L_{1/2}$-SVM is 27. The $L_1$-SVM selects too many redundant or irrelevant features, which may influence the ranking to some extent. Therefore, the $L_1$-SVM is not as good as the $L_{1/2}$-SVM at distinguishing the critical features. The $L_0$-SVM has a lower hit frequency than the $L_{1/2}$-SVM, which is probably due to the excessively small feature subset it obtains. Above all, Tables 1 and 2 and Figure 2 clearly show that the $L_{1/2}$-SVM is a promising sparsity-driven classification method.

In the second simulation, we consider the cases with various dimensions of the feature space, $m = 20, 40, \ldots, 180, 200$, and the fixed training sample size $n = 100$. The average results over 30 trials are shown in Figure 3 and Table 3. Since there are only 6 relevant features, a larger $m$ means more noisy features. Figure 3 (left) shows that, as the dimension increases from 20 to 200, the number of features selected by the $L_1$-SVM increases from 8.26 to 23.1. However, the cardinalities of the $L_{1/2}$-SVM and $L_0$-SVM remain stable (from 2.2 to 2.95). This indicates that the $L_{1/2}$-SVM and $L_0$-SVM are more suitable for feature selection than the $L_1$-SVM. Figure 3 (right) shows that, with the increase of the noise features, the accuracy of the $L_2$-SVM drops significantly (from 98.68% to 87.47%). On the contrary, for the other three sparse SVMs there is little change in accuracy. This reveals that SVMs can benefit from feature reduction.


Figure 2: Results comparison on 10 artificial data sets with various training sample sizes: average cardinality (left) and classification accuracy (right) versus the number of training samples, for the $L_2$-SVM, $L_1$-SVM, $L_0$-SVM, and $L_{1/2}$-SVM.

Table 1: Results comparison on 10 artificial data sets with various training sample sizes.

n | L2-SVM Card | L2-SVM Acc | L1-SVM Card | L1-SVM Acc | L0-SVM Card | L0-SVM Acc | L1/2-SVM Card | L1/2-SVM Acc
10 | 29.97 | 84.65 | 6.93 | 88.55 | 1.43 | 77.65 | 2.25 | 88.05
20 | 30.00 | 92.21 | 8.83 | 96.79 | 1.87 | 93.17 | 2.35 | 95.50
30 | 30.00 | 94.27 | 8.47 | 98.23 | 2.03 | 94.67 | 2.05 | 97.41
40 | 29.97 | 95.99 | 9.00 | 98.75 | 2.07 | 95.39 | 2.10 | 97.75
50 | 30.00 | 96.61 | 9.50 | 98.91 | 2.13 | 96.57 | 2.45 | 97.68
60 | 29.97 | 97.25 | 10.00 | 98.94 | 2.20 | 97.61 | 2.45 | 97.98
70 | 30.00 | 97.41 | 11.00 | 98.98 | 2.20 | 97.56 | 2.40 | 98.23
80 | 30.00 | 97.68 | 10.03 | 99.09 | 2.23 | 97.61 | 2.60 | 98.50
90 | 29.97 | 97.89 | 9.20 | 99.21 | 2.30 | 97.41 | 2.90 | 98.30
100 | 30.00 | 98.13 | 10.27 | 99.09 | 2.50 | 97.93 | 2.55 | 98.36
On average | 29.99 | 95.21 | 9.32 | 97.65 | 2.10 | 94.56 | 2.41 | 96.78

"n" is the number of training samples, "Card" represents the number of selected features, and "Acc" is the classification accuracy (%).

Table 2: Frequency of the most important features ($x_3$, $x_6$) ranked in the top 2 over 30 runs.

n | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100
L1-SVM | 7 | 14 | 16 | 18 | 22 | 23 | 20 | 22 | 21 | 22
L0-SVM | 3 | 11 | 12 | 15 | 19 | 22 | 22 | 23 | 22 | 25
L1/2-SVM | 9 | 16 | 19 | 24 | 24 | 23 | 27 | 26 | 26 | 27

Table 3: Average results over 10 artificial data sets with various dimensions.

 | L2-SVM | L1-SVM | L0-SVM | L1/2-SVM
Cardinality | 109.95 | 15.85 | 2.38 | 2.41
Accuracy (%) | 92.93 | 99.07 | 97.76 | 98.26


Table 4: Feature selection performance comparison on UCI data sets (cardinality), reported as Mean (Std).

No. | UCI data set | n | m | L2-SVM | L1-SVM | L0-SVM | L1/2-SVM
1 | Pima diabetes | 768 | 8 | 8.00 (0.00) | 7.40 (0.89) | 6.20 (1.10) | 7.60 (1.52)
2 | Breast Cancer | 683 | 9 | 9.00 (0.00) | 8.60 (1.22) | 6.40 (2.07) | 8.40 (1.67)
3 | Wine (3) | 178 | 13 | 13.00 (0.00) | 9.73 (0.69) | 3.73 (0.28) | 4.07 (0.49)
4 | Image (7) | 2100 | 19 | 19.00 (0.00) | 8.20 (0.53) | 3.66 (0.97) | 5.11 (0.62)
5 | SPECT | 267 | 22 | 22.00 (0.00) | 17.40 (5.27) | 9.80 (4.27) | 9.00 (1.17)
6 | WDBC | 569 | 30 | 30.00 (0.00) | 9.80 (1.52) | 9.00 (3.03) | 5.60 (2.05)
7 | Ionosphere | 351 | 34 | 33.20 (0.45) | 25.00 (3.35) | 24.60 (10.21) | 21.80 (8.44)
8 | SPECTF | 267 | 44 | 44.00 (0.00) | 38.80 (10.98) | 24.00 (11.32) | 25.60 (7.75)
9 | Sonar | 208 | 60 | 60.00 (0.00) | 31.40 (13.05) | 27.00 (7.18) | 13.60 (10.06)
10 | Musk1 | 476 | 166 | 166.00 (0.00) | 85.80 (18.85) | 48.60 (4.28) | 41.00 (14.16)
Average cardinality | | | 40.50 | 40.42 | 24.21 | 16.30 | 14.18

Table 5: Classification performance comparison on UCI data sets (accuracy, %), reported as Mean (Std).

No. | UCI data set | L2-SVM | L1-SVM | L0-SVM | L1/2-SVM
1 | Pima diabetes | 76.99 (3.47) | 76.73 (0.85) | 76.73 (2.94) | 77.12 (1.52)
2 | Breast cancer | 96.18 (2.23) | 97.06 (0.96) | 96.32 (1.38) | 96.47 (1.59)
3 | Wine (3) | 97.71 (3.73) | 98.29 (1.28) | 93.14 (2.39) | 97.14 (2.56)
4 | Image (7) | 88.94 (1.49) | 91.27 (1.77) | 86.67 (4.58) | 91.36 (1.94)
5 | SPECT | 83.40 (2.15) | 78.49 (2.31) | 78.11 (3.38) | 82.34 (5.10)
6 | WDBC | 96.11 (2.11) | 96.46 (1.31) | 95.58 (1.15) | 96.46 (1.31)
7 | Ionosphere | 83.43 (5.44) | 88.00 (3.70) | 84.57 (5.38) | 87.43 (5.94)
8 | SPECTF | 78.49 (3.91) | 74.34 (4.50) | 73.87 (3.25) | 78.49 (1.89)
9 | Sonar | 73.66 (5.29) | 75.61 (5.82) | 74.15 (7.98) | 76.10 (4.35)
10 | Musk1 | 84.84 (1.73) | 82.11 (2.42) | 76.00 (5.64) | 82.32 (3.38)
Average accuracy | | 85.97 | 85.84 | 83.51 | 86.52

Table 3 shows the average results over all the data sets in the second experiment. On average, the solution of the $L_{1/2}$-SVM is much sparser than that of the $L_1$-SVM, and its accuracy is slightly better than that of the $L_0$-SVM.

4.2. UCI Data Sets. We further tested the reformulation and the proposed interior point method for the $L_{1/2}$-SVM on 10 UCI data sets [36]. There are 8 binary classification problems and 2 multiclass problems (wine, image). Each feature of the input data was normalized to zero mean and unit variance, and the instances with missing values were deleted. The data were then randomly split into a training set (80%) and a testing set (20%). For the two multiclass problems, a one-against-rest method was applied to construct a binary classifier for each class. We repeated the training and testing procedure 10 times, and the average results are shown in Tables 4 and 5 and Figure 4.

Tables 4 and 5 summarize the feature selection and classification performance of the numerical experiments on the UCI data sets, respectively. Here $n$ is the number of samples and $m$ is the number of input features. For the two multiclass data sets, the numbers of classes are marked behind their names. Sparsity is defined as card/$m$, and a small value of sparsity is preferred. The data sets are arranged in ascending order of dimension. The lowest cardinality and the best accuracy rate for each problem are bolded.

As shown in Tables 4 and 5, the three sparse SVMs encourage sparsity in all data sets while maintaining roughly the same accuracy as the $L_2$-SVM. Among the three sparse methods, the $L_{1/2}$-SVM has the lowest cardinality (14.18) and the highest classification accuracy (86.52%) on average, while the $L_1$-SVM has the worst feature selection performance, with the highest average cardinality (24.21), and the $L_0$-SVM has the lowest average classification accuracy (83.51%).

Figure 4 plots the sparsity (left) and classification accuracy (right) of each classifier on the UCI data sets. In three data sets (6, 9, 10), the $L_{1/2}$-SVM has the best performance both in feature selection and in classification among the three sparse SVMs. Compared with the $L_1$-SVM, the $L_{1/2}$-SVM achieves sparser solutions in nine data sets.


Figure 3: Results comparison on the artificial data sets with various dimensions: average cardinality (left) and classification accuracy (right) versus the dimension, for the $L_2$-SVM, $L_1$-SVM, $L_0$-SVM, and $L_{1/2}$-SVM.

For example, in data set 8 (SPECTF), the number of features selected by the $L_{1/2}$-SVM is 34% smaller than that of the $L_1$-SVM; at the same time, the accuracy of the $L_{1/2}$-SVM is 4.1% higher than that of the $L_1$-SVM. In most data sets (4, 5, 6, 9, 10), the cardinality of the $L_{1/2}$-SVM drops significantly (by at least 37%) with an equal or slightly better accuracy. Only in data set 3 (Wine) is the accuracy of the $L_{1/2}$-SVM decreased, by 1.1%, but the sparsity provided by the $L_{1/2}$-SVM leads to a 58.2% improvement over the $L_1$-SVM. In the remaining three data sets (1, 2, 7), the two methods have similar results in feature selection and classification. As seen above, the $L_{1/2}$-SVM can provide a lower-dimensional representation than the $L_1$-SVM with competitive prediction performance.

Figure 4 (right) shows that, compared with the $L_0$-SVM, the $L_{1/2}$-SVM has improved classification accuracy in all data sets. For instance, in four data sets (3, 4, 8, 10), the classification accuracy of the $L_{1/2}$-SVM is at least 4.0% higher than that of the $L_0$-SVM. Especially in data set 10 (Musk1), the $L_{1/2}$-SVM gives a 6.3% rise in accuracy over the $L_0$-SVM. Meanwhile, it can be observed from Figure 4 (left) that the $L_{1/2}$-SVM selects fewer features than the $L_0$-SVM in five data sets (5, 6, 7, 9, 10). For example, in data sets 6 and 9, the cardinalities of the $L_{1/2}$-SVM are 37.8% and 49.6% smaller than those of the $L_0$-SVM, respectively. In summary, the $L_{1/2}$-SVM presents better classification performance than the $L_0$-SVM while remaining effective in choosing relevant features.

5. Conclusions

In this paper, we proposed an $L_{1/2}$ regularization technique for simultaneous feature selection and classification in the SVM. We reformulated the $L_{1/2}$-SVM into an equivalent smooth constrained optimization problem. The problem possesses a very simple structure, and it is relatively easy to develop numerical methods for it. By the use of this interesting reformulation, we proposed an interior point method and established its convergence. Our numerical results supported the reformulation and the proposed method. The $L_{1/2}$-SVM obtains sparser solutions than the $L_1$-SVM with comparable classification accuracy. Furthermore, the $L_{1/2}$-SVM achieves more accurate classification results than the $L_0$-SVM (FSV).

Inspired by the good performance of the smooth optimization reformulation of the $L_{1/2}$-SVM, there are some interesting topics deserving further research. For example, developing more efficient algorithms for solving the reformulation, studying the nonlinear $L_{1/2}$-SVM, and exploring various applications of the $L_{1/2}$-SVM to further validate its effectiveness are interesting research topics. Some of them are under our current investigation.

Appendix

A. Properties of the Reformulation of the $L_{1/2}$-SVM

Let $D$ be the set of all feasible points of problem (10). For any $z \in D$, we let $FD(z, D)$ and $LFD(z, D)$ be the sets of all feasible directions and linearized feasible directions of $D$ at $z$. Since the constraint functions of (10) are all convex, we immediately have the following lemma.

Lemma A.1. For any feasible point $z$ of (10), one has
$$FD(z, D) = LFD(z, D). \tag{A.1}$$


Figure 4: Results comparison on the UCI data sets: sparsity (left) and classification accuracy (right) for each data set, for the $L_2$-SVM, $L_1$-SVM, $L_0$-SVM, and $L_{1/2}$-SVM.

Based on the above lemma, we can easily derive a first-order necessary condition for (10). The Lagrangian function of (10) is
$$\begin{aligned} L(w, t, \xi, b, \lambda, \mu, \nu, p, q) = {} & C\sum_{i=1}^{n}\xi_i + \sum_{j=1}^{m}t_j - \sum_{j=1}^{m}\lambda_j\bigl(t_j^2 - w_j\bigr) - \sum_{j=1}^{m}\mu_j\bigl(t_j^2 + w_j\bigr) \\ & - \sum_{j=1}^{m}\nu_j t_j - \sum_{i=1}^{n}p_i\bigl(y_i w^T x_i + y_i b + \xi_i - 1\bigr) - \sum_{i=1}^{n}q_i\xi_i, \end{aligned} \tag{A.2}$$
where $\lambda, \mu, \nu, p, q$ are the Lagrangian multipliers. By the use of Lemma A.1, we immediately have the following theorem about the first-order necessary condition.

Theorem A.2. Let $(w, t, \xi, b)$ be a local solution of (10). Then there are Lagrangian multipliers $(\lambda, \mu, \nu, p, q) \in R^{m+m+m+n+n}$ such that the following KKT conditions hold:
$$\begin{aligned} \frac{\partial L}{\partial w_j} &= \lambda_j - \mu_j - \sum_{i=1}^{n}p_i y_i x_{ij} = 0, \quad j = 1, 2, \ldots, m, \\ \frac{\partial L}{\partial t_j} &= 1 - 2\lambda_j t_j - 2\mu_j t_j - \nu_j = 0, \quad j = 1, 2, \ldots, m, \\ \frac{\partial L}{\partial \xi_i} &= C - p_i - q_i = 0, \quad i = 1, 2, \ldots, n, \\ \frac{\partial L}{\partial b} &= \sum_{i=1}^{n}p_i y_i = 0, \\ \lambda_j &\ge 0, \quad t_j^2 - w_j \ge 0, \quad \lambda_j\bigl(t_j^2 - w_j\bigr) = 0, \quad j = 1, 2, \ldots, m, \\ \mu_j &\ge 0, \quad t_j^2 + w_j \ge 0, \quad \mu_j\bigl(t_j^2 + w_j\bigr) = 0, \quad j = 1, 2, \ldots, m, \\ \nu_j &\ge 0, \quad t_j \ge 0, \quad \nu_j t_j = 0, \quad j = 1, 2, \ldots, m, \\ p_i &\ge 0, \quad y_i\bigl(w^T x_i + b\bigr) + \xi_i - 1 \ge 0, \quad p_i\bigl(y_i\bigl(w^T x_i + b\bigr) + \xi_i - 1\bigr) = 0, \quad i = 1, 2, \ldots, n, \\ q_i &\ge 0, \quad \xi_i \ge 0, \quad q_i\xi_i = 0, \quad i = 1, 2, \ldots, n. \end{aligned} \tag{A.3}$$

The following theorem shows that the level set will bebounded

Theorem A3 For any given constant 119888 gt 0 the level set

119882119888= 119911 = (119908 119905 120585) | 119891 (119911) le 119888 cap 119863 (A4)

is bounded

Proof For any (119908 119905 120585) isin Ω119888 we have

119898

sum

119894

119905119894+ 119862

119899

sum

119895

120585119895le 119888 (A5)

Journal of Applied Mathematics 13

Combining with 119905119894ge 0 119894 = 1 119898 and 120585

119895ge 0 119895 = 1 119899

we have

0 le 119905119894le 119888 119894 = 1 119898

0 le 120585119895le

119888

119862 119895 = 1 119899

(A6)

Moreover for any (119908 119905 120585) isin 119863

10038161003816100381610038161199081198941003816100381610038161003816 le 1199052

119894 119894 = 1 119899 (A7)

Consequently 119908 119905 120585 are bounded What is more from thecondition

119910119894(119908119879119909119894+ 119887) ge 1 minus 120585

119894 119894 = 1 119898 (A8)

we have

119908119879119909119894+ 119887 ge 1 minus 120585

119894 if 119910

119894= 1 119894 = 1 119898

119908119879119909119894+ 119887 le minus1 + 120585

119894 if 119910

119894= minus1 119894 = 1 119898

(A9)

Thus if the feasible region119863 is not empty then we have

max119910119894=1

(1 minus 120585119894minus 119908119879119909119894) le 119887 le min

119910119894=minus1

(minus1 + 120585119894minus 119908119879119909119894) (A10)

Hence 119887 is also bounded The proof is complete

B Proof of Lemma 3

Lemma 3 (see Section 3) shows that Algorithm 2 is welldefinedWe first introduce the following proposition to showthat the matrix 119878 defined by (21) is always positive definitewhich ensures Algorithm 2 to be well defined

Proposition B1 Let the inequality (26) hold then the matrix119878 defined by (21) is positive definite

Proof By an elementary deduction we have for any V(1) isin

119877119898 V(2) isin 119877

119899 V(3) isin 119877119898 V(4) isin 119877 with (V(1) V(2) V(3) V(4)) = 0

(V(1)119879 V(2)119879 V(3)119879 V(4)119879) 119878(

V(1)

V(2)

V(3)

V(4))

= V(1)11987911987811V(1) + V(1)119879119878

12V(2) + V(1)119879119878

13V(3)

+ V(1)11987911987814V(4) + V(2)119879119878

21V(1) + V(2)119879119878

22V(2)

+ V(2)11987911987823V(3) + V(2)119879119878

24V(4)

+ V(3)11987911987831V(1) + V(3)119879119878

32V(2)

+ V(3)11987911987833V(3) + V(3)119879119878

34V(4)

+ V(4)11987911987841V(1) + V(4)119879119878

42V(2)

+ V(4)11987911987843V(3) + V(4)119879119878

44V(4)

= V(1)119879 (119880 + 119881 + 119883119879119884119879119875minus11198634119884119883) V(1)

+ V(1)119879 (119883119879119884119879119875minus11198634) V(2)

+ V(1)119879 (minus2 (119880 minus 119881)119879) V(3)

+ V(1)119879 (119883119879119884119879119875minus11198634119910) V(4)

+ V(2)119879119875minus11198634119884119883V(1) + V(2)119879 (119875minus1119863

4+ Ξminus11198635) V(2)

+ V(2)119879119875minus11198634119910V(4)

minus 2V(3)119879 (119880 minus 119881)119879V(1)

+ V(3)119879 (4119879 (119880 + 119881)119879 + 119879minus11198633minus 2 (119863

1+ 1198632)) V(3)

+ V(4)119879119910119879119875minus11198634119884119883V(1)

+ V(4)119879119910119879119875minus11198634V(2) + V(4)119879119910119879119875minus1119863

4119910V(4)

= V(1)119879119880V(1) minus 4V(3)119879119880119879V(1) + 4V(3)119879119879119880119879V(3)

+ V(1)119879119881V(1) + 4V(3)119879119881119879V(1) + 4V(3)119879119879119881119879V(3)

+ V(2)119879Ξminus11198635V(2) + V(3)119879 (119879minus1119863

3minus 21198631minus 21198632) V(3)

+ (119884119883V(1) + V(2) + V(4)119910)119879

119875minus11198634

times (119884119883V(1) + V(2) + V(4)119910)

= 119890119879

119898119880(119863V

1minus 2119879119863V

3)2

119890119898+ 119890119879

119898119881(119863V

1+ 2119879119863V

3)2

119890119898

+ V(2)119879Ξminus11198635V(2)

+ V(3)119879 (119879minus11198633minus 21198631minus 21198632) V(3)

+ (119884119883V(1) + V(2) + V(4)119910)119879

119875minus11198634(119884119883V(1) + V(2) + V(4)119910)

gt 0

(B1)

where 119863V1= diag(V(1)) 119863V

2= diag(V(2)) 119863V

3= diag(V(3))

and119863V4= diag(V(4))

14 Journal of Applied Mathematics

B1 Proof of Lemma 3

Proof It is easy to see that if Δ119911119896= 0 which implies that (23)

holds for all 120572119896ge 0

Suppose Δ119911119896

= 0 We have

nabla119908Φ120583(119911119896)119879

Δ119908119896+ nabla119905Φ120583(119911119896)119879

Δ119905119896

+ nabla120585Φ120583(119911119896)119879

Δ120585119896+ nabla119887Φ120583(119911119896)119879

Δ119887119896

= minus (Δ119908119896119879

Δ120585119896119879

Δ119905119896119879

Δ119887119896) 119878(

Δ119908119896

Δ120585119896

Δ119905119896

Δ119887119896

)

(B2)

Since matrix 119878 is positive definite and Δ119911119896

= 0 the lastequation implies

nabla119908Φ120583(119911119896)119879

Δ119908119896+ nabla119905Φ120583(119911119896)119879

Δ119905119896

+ nabla120585Φ120583(119911119896)119879

Δ120585119896+ nabla119887Φ120583(119911119896)119879

Δ119887119896lt 0

(B3)

Consequently there exists a 119896isin (0

119896] such that the fourth

inequality in (23) is satisfied for all 120572119896isin (0

119896]

On the other hand since (119909119896 119905119896) is strictly feasible the

point (119909119896 119905119896) + 120572(Δ119909 119905119896 Δ119905119896) will be feasible for all 120572 gt 0

sufficiently small The proof is complete

C Convergence Analysis

This appendix is devoted to the global convergence of theinterior point method We first show the convergence ofAlgorithm 2 when a fixed 120583 is applied to the barrier subprob-lem (14)

Lemma C1 Let (119911119896 120582119896) be generated by Algorithm 2 Then119911119896 are strictly feasible for problem (10) and the Lagrangian

multipliers 120582119896 are bounded from above

Proof For the sake of convenience we use 119892119894(119911) 119894 =

1 2 3119898 + 2119899 to denote the constraint functions of theconstrained problem (10)

We first show that 119911119896 are strictly feasible Suppose onthe contrary that there exists an infinite index subset K andan index 119894 isin 1 2 3119898 + 2119899 such that 119892

119894(119911119896)K darr 0

By the definition of Φ120583(119911) and 119891(119911) = sum

119898

119895=1119905119894+ 119862sum

119899

119894=1120585119894

being bounded from below in the feasible set it must holdthat Φ

120583(119911119896)K rarr infin

However the line search rule implies that the sequenceΦ120583(119911119896) is decreasing So we get a contradiction Conse-

quently for any 119894 isin 1 2 3119898+2119899 119892119894(119909119896 119905119896) is bounded

away from zero The boundedness of 120582119896 then follows from(24)ndash(27)

Lemma C2 Let (119911119896 120582119896) and (Δ119911119896 119896+1

) be generated byAlgorithm 2 If 119911119896K is a convergent subsequence of 119911119896 thenthe sequence (Δ119911

119896 119896+1

)K is bounded

Proof Again we use 119892119894(119911) 119894 = 1 2 3119898+2119899 to denote the

constraint functions of the constrained problem (10)We suppose on the contrary that there exists an infinite

subsetK1015840 sube K such that the subsequence (Δ119911119896 119896+1

)K1015840

tends to infinity Let 119911119896K1015840 rarr 119911

lowast It follows fromLemma C1 that there is an infinite subset K sube K1015840 such that120582119896K rarr 120582

lowast and 119892119894(119911lowast) gt 0 forall119894 isin 1 2 3119898 + 2119899 Thus

we have

119878119896997888rarr 119878lowast (C1)

as 119896 rarr infin with 119896 isin K By Proposition B1 119878lowast is positivedefinite Since the right hand size of (18) is bounded andcontinuous the unboundedness of (Δ119911

119896 119896+1

)K impliesthat the limit of the coefficient matrices of (18) is singularwhich yields a contradiction The proof is complete

Lemma C3 Let (119911119896 120582119896) and (Δ119911119896 119896+1

) be generated byAlgorithm 2 If 119911119896K is a convergent subsequence of 119911119896 thenone has Δ119911

119896K rarr 0

Proof It follows from Lemma C1 that the sequence Δ119911119896K

is bounded Suppose on the contrary that there exists aninfinite subsetK1015840 sube K such that Δ119911

119896K1015840 rarr Δ119911

lowast= 0 Since

subsequences 119911119896K1015840 120582

119896K1015840 and

119896K1015840 are all bounded

there are points 119911lowast 120582lowast and

lowast as well as an infinite indexset K sube K1015840 such that 119911119896K rarr 119911

lowast 120582119896K rarr 120582lowast and

119896K rarr

lowast By Lemma C1 we have 119892119894(119911lowast) gt 0 forall119894 isin

1 2 3119898 + 2119899 Similar to the proof of Lemma 3 it is notdifficult to get

nabla119908Φ120583(119911lowast)119879

Δ119908lowast+ nabla119905Φ120583(119911lowast)119879

Δ119905lowast

+ nabla120585Φ120583(119911lowast)119879

Δ120585lowast+ nabla119887Φ120583(119911lowast)119879

Δ119887lowast

= minus (Δ119908lowast119879

Δ120585lowast119879

Δ119905lowast119879

Δ119887lowast) 119878(

Δ119908lowast

Δ120585lowast

Δ119905lowast

Δ119887lowast

) lt 0

(C2)

Since 119892119894(119911lowast) gt 0 forall119894 isin 1 2 3119898 + 2119899 there exists a

isin (0 1] such that for all 120572 isin (0 ]

119892119894(119911lowast+ 120572Δ119911

lowast) gt 0 forall119894 isin 1 2 3119899 (C3)

Taking into account 1205911isin (0 12) we claim that there exists

a isin (0 ] such that the following inequality holds for all120572 isin (0 ]

Φ120583(119911lowast+ 120572Δ119911

lowast) minus Φ120583(119911lowast) le 11120591

1120572nabla119911Φ120583(119911lowast)119879

Δ119911lowast (C4)

Let 119898lowast

= min119895 | 120573119895isin (0 ] 119895 = 0 1 and 120572

lowast= 120573119898lowast

It follows from (C3) that the following inequality is satisfiedfor all 119896 isin K sufficient large and any 119894 isin 1 2 3119898 + 2119899

119892119894(119911119896+ 120572lowastΔ119911119896) gt 0 (C5)

Journal of Applied Mathematics 15

Moreover we have

Φ120583(119911119896+ 120572lowastΔ119911119896) minus Φ120583(119911119896)

= Φ120583(119911lowast+ 120572lowastΔ119911lowast) minus Φ120583(119911lowast)

+ Φ120583(119911119896+ 120572lowastΔ119911119896) minus Φ120583(119911lowast+ 120572lowastΔ119911lowast)

+ Φ120583(119911lowast) minus Φ120583(119911119896)

le 1051205911120572lowastnabla119911Φ120583(119911lowast)119879

Δ119911lowast

le 1205911120572lowastnabla119911Φ120583(119911119896)119879

Δ119911119896

(C6)

By the backtracking line search rule the last inequalitytogether with (C5) yields the inequality120572

119896ge 120572lowastfor all 119896 isin K

large enough Consequently when 119896 isin K is sufficiently largewe have from (23) and (C3) that

Φ120583(119911119896+1

) le Φ120583(119911119896) + 1205911120572119896nabla119911Φ120583(119911119896)119879

Δ119911119896

le Φ120583(119911119896) + 1205911120572lowastnabla119911Φ120583(119911119896)119879

Δ119911119896

le Φ120583(119911119896) +

1

21205911120572lowastnabla119911Φ120583(119911lowast)119879

Δ119911lowast

(C7)

This shows that Φ120583(119911119896)K rarr minusinfin contradicting with the

fact that Φ120583(119911119896)K is bounded from below The proof is

complete

Then we will establish the convergence of Algorithms 2and 4 that is Theorems 5 and 6 in Section 3

C1 Proof of Theorem 5

Proof We let (119911lowast lowast) be a limit point of (119911119896 119896) and thesubsequence (119911119896 119896)K converges to (119911

lowast lowast)

Recall that (Δ119911119896 119896) are the solution of (18) with (119911 120582) =

(119911119896 120582119896) Taking limits in both sides of (18) with (119911 120582) =

(119911119896 120582119896) as 119896 rarr infinwith 119896 isin K by the use of Lemma C1 we

obtain

lowast(1)

minus lowast(2)

minus 119883119879119884119879lowast(4)

= 0

119862 lowast 119890119899minus lowast(4)

minus lowast(5)

= 0

119890119898minus 2119879lowastlowast(1)

minus 2119879lowastlowast(2)

minus lowast(3)

= 0

119910119879lowast(4)

= 0

(119879lowast2

minus 119882lowast) lowast(1)

minus 120583119890119898

= 0

(119879lowast2

+ 119882lowast) lowast(2)

minus 120583119890119898

= 0

119879lowastlowast(3)

minus 120583119890119898

= 0

119875lowastlowast(4)

minus 120583119890119899= 0

Ξlowastlowast(5)

minus 120583119890119899= 0

(C8)

This shows that (119911lowast lowast) satisfies the first-order optimalityconditions (15)

C2 Proof of Theorem 6

Proof (i) Without loss of generality we suppose that thebounded subsequence (119911

119895 119895)J converges to some point

(119911lowast lowast) It is clear that 119911lowast is a feasible point of (10) Since

120583119895 120598120583119895 rarr 0 it follows from (29) and (30) that

Res (119911119895 119895 120583119895)

J997888rarr Res (119911lowast lowast 0) = 0

119895J

997888rarr lowastge 0

(C9)

Consequently the (119911lowast lowast) satisfies the KKT conditions (13)(ii) Let 120585

119895= max119895

infin 1 and

119895= 120585minus1

119895119895 Obviously

119895 is bounded Hence there exists an infinite subsetJ1015840 sube J

such that 1198951015840J rarr = 0 and 119895infin

= 1 for large 119895 isin JFrom (30) we know that ge 0 Dividing both sides of thefirst inequality of (18) with (119911 120582) = (119911

119896 120582119896) by 120585119895 then taking

limits as 119895 rarr infin with 119895 isin J we get

(1)

minus (2)

minus 119883119879119884119879(4)

= 0

(4)

+ (5)

= 0

2119879lowast(1)

+ 2119879lowast(2)

+ (3)

= 0

119910119879(4)

= 0

(119879lowast2

minus 119882lowast) (1)

= 0

(119879lowast2

+ 119882lowast) (2)

= 0

119879lowast(3)

= 0

119875lowast(4)

= 0

Ξlowast(5)

= 0

(C10)

Since 119911lowast is feasible the above equations has shown that 119911lowast is

a Fritz-John point of problem (10)

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

The authors would like to acknowledge support for thisproject from the National Science Foundation (NSF Grants11371154 11071087 11271069 and 61103202) of China

References

[1] Y M Yang and J O Pedersen ldquoA comparative study on featureselection in text categorizationrdquo in Proceedings of the 14th

16 Journal of Applied Mathematics

International Conference on Machine Learning (ICML rsquo97) vol97 pp 412ndash420 1997

[2] G Forman ldquoAn extensive empirical study of feature selectionmetrics for text classificationrdquoThe Journal of Machine LearningResearch vol 3 pp 1289ndash1305 2003

[3] I Guyon J Weston S Barnhill and V Vapnik ldquoGene selec-tion for cancer classification using support vector machinesrdquoMachine Learning vol 46 no 1ndash3 pp 389ndash422 2002

[4] H H Zhang J Ahn X Lin and C Park ldquoGene selection usingsupport vector machines with non-convex penaltyrdquo Bioinfor-matics vol 22 no 1 pp 88ndash95 2006

[5] Y Saeys I Inza and P Larranaga ldquoA review of feature selectiontechniques in bioinformaticsrdquo Bioinformatics vol 23 no 19 pp2507ndash2517 2007

[6] J Weston F Perez-Cruz O Bousquet O Chapelle A Elisseeffand B Scholkopf ldquoFeature selection and transduction forprediction of molecular bioactivity for drug designrdquo Bioinfor-matics vol 19 no 6 pp 764ndash771 2003

[7] Y Liu ldquoA comparative study on feature selection methods fordrug discoveryrdquo Journal of Chemical Information and ComputerSciences vol 44 no 5 pp 1823ndash1828 2004

[8] I Guyon J Weston S Barnhill and V Vapnik Eds FeatureExtraction Foundations and Applications vol 207 Springer2006

[9] A Rakotomamonjy ldquoVariable selection using SVM-based cri-teriardquo Journal of Machine Learning Research vol 3 no 7-8 pp1357ndash1370 2003

[10] J Weston S Mukherjee O Chapelle M Pontil T Poggio andV Vapnik ldquoFeature selection for SVMsrdquo in Proceedings of theConference onNeural Information Processing Systems vol 13 pp668ndash674 Denver Colo USA 2000

[11] D Peleg and R Meir ldquoA feature selection algorithm based onthe global minimization of a generalization error boundrdquo inProceedings of the Conference on Neural Information ProcessingSystems Vancouver Canada 2004

[12] P S Bradley and O L Mangasarian ldquoFeature selection via con-cave minimization and support vector machinesrdquo in Proceed-ings of the International Conference on Machine Learning pp82ndash90 San Francisco Calif USA 1998

[13] JWeston A Elisseeff B Scholkopf andM Tipping ldquoUse of thezero-norm with linear models and kernel methodsrdquo Journal ofMachine Learning Research vol 3 pp 1439ndash1461 2003

[14] A B Chan N Vasconcelos and G R G Lanckriet ldquoDirectconvex relaxations of sparse SVMrdquo in Proceedings of the 24thInternational Conference on Machine Learning (ICML rsquo07) pp145ndash153 Corvallis Ore USA June 2007

[15] GM Fung andO L Mangasarian ldquoA feature selection Newtonmethod for support vector machine classificationrdquo Computa-tional Optimization and Applications vol 28 no 2 pp 185ndash2022004

[16] J Zhu S Rosset T Hastie and R Tibshirani ldquo1-norm supportvector machinesrdquo Advances in Neural Information ProcessingSystems vol 16 no 1 pp 49ndash56 2004

[17] R Tibshirani ldquoRegression shrinkage and selection via the lassordquoJournal of the Royal Statistical Society B Methodological vol 58no 1 pp 267ndash288 1996

[18] T Zhang ldquoAnalysis of multi-stage convex relaxation for sparseregularizationrdquo Journal of Machine Learning Research vol 11pp 1081ndash1107 2010

[19] R Chartrand ldquoExact reconstruction of sparse signals via non-convexminimizationrdquo IEEE Signal Processing Letters vol 14 no10 pp 707ndash710 2007

[20] R Chartrand ldquoNonconvex regularizaron for shape preserva-tionrdquo inProceedings of the 14th IEEE International Conference onImage Processing (ICIP rsquo07) vol 1 pp I293ndashI296 San AntonioTex USA September 2007

[21] Z Xu H Zhang Y Wang X Chang and Y Liang ldquo11987112

regularizationrdquo Science China Information Sciences vol 53 no6 pp 1159ndash1169 2010

[22] W J Chen and Y J Tian ldquoLp-norm proximal support vectormachine and its applicationsrdquo Procedia Computer Science vol1 no 1 pp 2417ndash2423 2010

[23] J Liu J Li W Xu and Y Shi ldquoA weighted L119902adaptive least

squares support vector machine classifiers-Robust and sparseapproximationrdquo Expert Systems with Applications vol 38 no 3pp 2253ndash2259 2011

[24] Y Liu H H Zhang C Park and J Ahn ldquoSupport vector mach-ines with adaptive 119871

119902penaltyrdquo Computational Statistics amp Data

Analysis vol 51 no 12 pp 6380ndash6394 2007[25] Z Liu S Lin andM Tan ldquoSparse support vectormachines with

Lp penalty for biomarker identificationrdquo IEEEACM Transac-tions on Computational Biology and Bioinformatics vol 7 no1 pp 100ndash107 2010

[26] A Rakotomamonjy R Flamary G Gasso and S Canu ldquol119901-l119902

penalty for sparse linear and sparse multiple kernel multitasklearningrdquo IEEE Transactions on Neural Networks vol 22 no 8pp 1307ndash1320 2011

[27] J Y Tan Z Zhang L Zhen C H Zhang and N Y DengldquoAdaptive feature selection via a new version of support vectormachinerdquo Neural Computing and Applications vol 23 no 3-4pp 937ndash945 2013

[28] Z B Xu X Y Chang F M Xu and H Zhang ldquo11987112

regu-larization an iterative half thresholding algorithmrdquo IEEETrans-actions on Neural Networks and Learning Systems vol 23 no 7pp 1013ndash1027 2012

[29] D Ge X Jiang and Y Ye ldquoA note on the complexity of 119871119901

minimizationrdquo Mathematical Programming vol 129 no 2 pp285ndash299 2011

[30] VN VapnikTheNature of Statistical LearningTheory SpringerNew York NY USA 1995

[31] J Fan and H Peng ldquoNonconcave penalized likelihood with adiverging number of parametersrdquo The Annals of Statistics vol32 no 3 pp 928ndash961 2004

[32] K Knight and W Fu ldquoAsymptotics for lasso-type estimatorsrdquoThe Annals of Statistics vol 28 no 5 pp 1356ndash1378 2000

[33] B S Tian and X Q Yang ldquoAn interior-point l 11987112-penalty-

method for nonlinear programmingrdquo Technical ReportDepartment of Applied Mathematics Hong Kong PolytechnicUniversity 2013

[34] L Armijo ldquoMinimization of functions having Lipschitz con-tinuous first partial derivativesrdquo Pacific Journal of Mathematicsvol 16 pp 1ndash3 1966

[35] F John ldquoExtremum problems with inequalities as side-con-ditionsrdquo in Studies and Essays Courant Anniversary Volume pp187ndash1204 1948

[36] D J Newman S Hettich C L Blake and C J Merz ldquoUCIrepository of machine learning databasesrdquo Technical Report9702 Department of Information and Computer Science Uni-verisity of California Irvine Calif USA 1998 httparchiveicsucieduml

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 2: Research Article An Interior Point Method for -SVM …downloads.hindawi.com/journals/jam/2014/942520.pdfResearch Article An Interior Point Method for 1/2-SVM and Application to Feature

2 Journal of Applied Mathematics

to develop efficient numerical methods to solve the problemA widely used technique in dealing with the 119871

0-SVM is to

use a smoothing technique so that the discrete model isapproached by a smooth problem [4 12 13] However asthe function 119908

0is not even continuous it is not desirable

that a smoothing technique based method would work wellChan et al [14] explored a convex relaxation to the cardinalityconstraint andobtained a relaxed convex problem that is closeto but different from the previous 119871

0-SVM An alternative

method is to minimize the convex envelope of the 1199080

such as 1198711-SVM The 119871

1-SVM is a convex problem and

can yield sparse solution It can be equivalent to a linearprogramming and hence can be solved efficiently Indeedthe 119871

1regularization has become quite welcome in SVM

[12 15 16] and is well known as the LASSO [17] in thestatistics literature However the 119871

1regularization problem

often leads to suboptimal sparsity in reality [18] In manycases the solutions yielded from 119871

1-SVM are less sparse than

those of 1198710-SVM The 119871

119901problem with 0 lt 119901 lt 1 can find

sparser solutions than the 1198711problem which was evidenced

in extensive computational [19ndash21] It has become a welcomestrategy in sparse SVM [22ndash27]

In this paper we focus on the 11987112

regularization andpropose a novel 119871

12-SVM Recently Xu et al [28] justified

that the sparsity-promotion ability of the 11987112

problem wasstrongest among the 119871

119901minimization problems with all 119901 isin

[12 1) and similar in 119901 isin (0 12] So the 11987112

problem canbe taken as a representative of 119871

119901(0 lt 119901 lt 1) problems

However as proved by Ge et al [29] finding the globalminimal value of the 119871

12problemwas still strongly NP-hard

But computing a local minimizer of the problem could bedone in polynomial time Our contributions of this paper aretwofold One is to derive a smooth constrained optimizationreformulation to the 119871

12-SVMThe objective function of the

problem is a linear function and the constraints are quadraticand linear We will establish the equivalence between theconstrained problem and the 119871

12-SVM We will also show

the existence of the KKT condition of the constrainedproblem Our second contribution is to develop an interiorpoint method to solve the constrained optimization reformu-lation and establish its global convergence We will also testand verify the effectiveness of the proposed method usingartificial data and real data

The rest of this paper is organized as follows In Section 2we first briefly introduce the model of the standard SVM(1198712-SVM) and the sparse regularization SVMs We then

reformulate the 11987112

-SVM into a smooth constrained opti-mization problem We propose an interior point methodto solve the constrained optimization reformulation andestablish its global convergence in Section 3 In Section 4we do numerical experiments to test the proposed methodSection 5 gives the conclusive remarks

2 A Smooth Constrained OptimizationReformulation to the 119871

12-SVM

In this section after simply reviewing the model of the stan-dard SVM (119871

2-SVM) and the sparse regularization SVMs

we derive an equivalent smooth optimization problem to the

11987112

-SVM model The smooth optimization problem is tominimize a linear function subject to some simple quadraticconstraints and linear constraints

21 Standard SVM In a two-class classification problem weare given a training data set 119863 = (119909

119894 119910119894)119899

119894=1 where 119909

119894isin 119877119898 is

the feature vector and 119910119894isin 0 1 is the class label The linear

classifier is to construct the following decision function

119891 (119909) = 119908119879119909 + 119887 (2)

where 119908 = (1199081 1199082 119908

119898) is the weight vector and 119887 is the

bias The prediction label is +1 if 119891(119909) gt 0 and minus1 otherwiseThe standard SVM (119871

2-SVM) [30] aims to find the separating

hyperplane 119891(119909) = 0 between two classes with maximalmargin 2w

2

2and minimal training errors which leads to

the following convex optimization problem

min 1

21199082

2+ 119862

119899

sum

119894=1

120585119894

st 119910119894(119908119879119909119894+ 119887) ge 1 minus 120585

119894119894 = 1 119899 120585

119894ge 0

(3)

where 1199082

= (sum119898

119895=11199082

119895)12 is the 119871

2norm of 119908 120585

119894is the

loss function to allow training errors for data that may notbe linearly separable and 119862 is a user-specified parameter tobalance themargin and the losses As the problem is a convexquadratic program it can be solved by existingmethods suchas the interior point method and active set method efficiently

22 Sparse SVM The 1198712-SVM is a nonsparse regularizer in

the sense that the learned decision hyperplane often utilizesall the features In practice peoples prefer to sparse SVM sothat only a few features are used to make a decision For thispurposes the following 119871

119901-SVM becomes very welcome

min 119908119901

119901+ 119862

119899

sum

119894=1

120585119894

st 119910119894(119908119879119909119894+ 119887) ge 1 minus 120585

119894119894 = 1 119899 120585

119894ge 0

(4)

where 119901 isin [0 1] 1199080stands for the number of nonzero

elements of 119908 and for 119901 isin (0 1] 119908119901is defined by (1)

Problem (4) is obtained by replacing 1198712penalty (119908

2

2) by 119871

119901

penalty (119908119901

119901) in (3) The standard SVM (3) corresponds to

the model (4) with 119901 = 2Figure 1 plots the 119871

119901penalty in one dimension We can

see from the figure that the smaller 119901 is the larger penaltiesare imposed on the small coefficients (|119908| lt 1)Therefore the119871119901penalties with 119901 lt 1 may achieve sparser solution than

the 1198711penalty In addition the 119871

1imposes large penalties

on large coefficients which may lead to biased estimation forlarge coefficients Consequently the 119871

119901(0 lt 119901 lt 1) penalties

become attractive due to their good properties in sparsityunbiasedness [31] and oracle [32] We are particularly inter-ested in the 119871

12penalty Recently Xu et al [21] revealed the

representative role of the 11987112

penalty in the 119871119901regularization

with 119901 isin (0 1) We will apply 11987112

penalty to SVM to performfeature selection and classification jointly

Journal of Applied Mathematics 3

0 1 20

1

2

3

4

W

Lp

pena

lty

p = 2

p = 1

p = 05

minus2 minus15 minus1 minus05

35

25

15

15

05

05

Figure 1 119871119901penalty in one dimension

23 11987112

-SVM Model We pay particular attention to the11987112

-SVM namely problem (4) with 119901 = 12 We will derivea smooth constrained optimization reformulation to the119871

12-

SVM so that it is relatively easy to design numerical methodsWe first specify the 119871

12-SVM

min119898

sum

119895=1

10038161003816100381610038161003816119908119895

10038161003816100381610038161003816

12

+ 119862

119899

sum

119894=1

120585119894≜ 120601 (119908 120585 119887)

st 119910119894(119908119879119909119894+ 119887) ge 1 minus 120585

119894 119894 = 1 119899

120585119894ge 0 119894 = 1 119899

(5)

Denote by 119906 = (119908 120585 119887) andD the feasible region of the pro-blem that is

D = (119908 120585 119887) | 119910119894(119908119879119909119894+ 119887) ge 1 minus 120585

119894

120585119894ge 0 119894 = 1 119899

(6)

Then the 11987112

-SVM can be written as an impact form

min120601 (119906) 119906 isin D (7)

It is a nonconvex and non-Lipschitz problem Due to theexistence of the term |119908

119894|12 the objective function is not even

directionally differentiable at a point with some119908119894= 0 which

makes the problem very difficult to solve Existing numericalmethods that are very efficient for solving smooth problemcould not be used directly One possible way to developnumerical methods for solving (7) is to smoothing the term|119908119895|12 using some smoothing function such as 120601

120598(119908119895) =

(1199082

119895+ 1205982)14 with some 120598 gt 0 However it is easy to see that

the derivative of 120601120598(119908119895) will be unbounded as 119908

119895rarr 0 and

120598 rarr 0 Consequently it is not desirable that the smoothingfunction based numerical methods could work well

Recently Tian and Yang [33] proposed an interior point11987112

-penalty functionmethod to solve general nonlinear pro-gramming problems by using a quadratic relaxation schemefor their 119871

12-lower order penalty problems We will follow

the idea of [33] to develop an interior point method forsolving the 119871

12-SVM To this end in the next subsection we

reformulate problem (7) to a smooth constrained optimiza-tion problem

24 A Reformulation to the 11987112

-SVM Model Consider thefollowing constrained optimization problem

min119898

sum

119895=1

119905119895+ 119862

119899

sum

119894=1

120585119894≜ 119891 (119908 120585 119905 119887)

st 1199052

119895minus 119908119895ge 0 119895 = 1 119898

1199052

119895+ 119908119895ge 0 119895 = 1 119898

119905119895ge 0 119895 = 1 119898

119910119894(119908119879119909119894+ 119887) ge 1 minus 120585

119894 119894 = 1 119899

120585119894ge 0 119894 = 1 119899

(8)

It is obtained by letting 119905119895= |119908119895|12 in the objective function

and adding constraints 1199052

119895minus 119908119895

ge 0 and 1199052

119895+ 119908119895

ge 0119895 = 1 119898 in (7) Denote by F the feasible region of theproblem that is

F = (119908 120585 119905 119887) | 1199052

119895minus 119908119895ge 0 119905

2

119895+ 119908119895ge 0

119905119895ge 0 119895 = 1 119898

cap (119908 120585 119905 119887) | 119910119894(119908119879119909119894+ 119887) ge 1 minus 120585

119894

120585119894ge 0 119894 = 1 119899

(9)

Let 119911 = (119908 120585 119905 119887) Then the above problem can be written as

min119891 (119911) 119911 isin F (10)

The following theorem establishes the equivalencebetween the 119871

12-SVM and (10)

Theorem 1 If 119906lowast = (119908lowast 120585lowast 119887lowast) isin 119877119898+119899+1 is a solution of the

11987112

minus 119878119881119872 (7) then 119911lowast

= (119908lowast 120585lowast |119908lowast|12

119887lowast) isin 119877

119898+119899+119898+1

is a solution of the optimization problem (10) Conversely if(119908lowast 120585lowast 119905lowast 119887lowast) is a solution of the optimization problem (10)

then (119908lowast 120585lowast 119887lowast) is a solution of the 119871

12-119878119881119872 (7)

Proof Let 119906lowast = (119908lowast 120585lowast 119887lowast) be a solution of the 119871

12-SVM

(7) and let = ( 120585 119905 ) be a solution of the constrained

4 Journal of Applied Mathematics

optimization problem (10) It is clear that 119911lowast

= (119908lowast 120585lowast

|119908lowast|12

119887lowast) isin F Moreover we have 119905

2

119895ge |

119895| forall119895 =

1 2 119898 and hence

119891 () =

119898

sum

119895=1

119905119895+ 119862

119899

sum

119894=1

120585119894ge

119898

sum

119895=1

10038161003816100381610038161003816119895

10038161003816100381610038161003816

12

+ 119862

119899

sum

119894=1

120585119894

ge

119898

sum

119895=1

10038161003816100381610038161003816119908lowast

119895

10038161003816100381610038161003816

12

+ 119862

119899

sum

119894=1

120585lowast

119894= 120601 (119906

lowast)

(11)

Since 119911lowast isin F we have

120601 (119906lowast) = 119891 (119911

lowast) ge 119891 () (12)

This together with (11) implies that 119891() = 120601(119906lowast) The proof

is complete

It is clear that the constraint functions of (10) are convexConsequently at any feasible point the set of all feasibledirections is the same as the set of all linearized feasibledirections

As a result the KKT point exists The KKT system ofthe problem (10) can be written as the following system ofnonlinear equations

119877 (119908 120585 119905 119887 120582) =

(((((((((((

(

120582(1)

minus 120582(2)

minus 119883119879119884119879120582(4)

119862 lowast 119890119899minus 120582(4)

minus 120582(5)

119890119898minus 2119879120582

(1)minus 2119879120582

(2)minus 120582(3)

119910119879120582(4)

min 1199052

119895minus 119908119895 120582(1)

119895119895=12119898

min 1199052

119895+ 119908119895 120582(2)

119895119895=12119898

min 119905119895 120582(3)

119895119895=12119898

min 119901119894 120582(4)

119894119894=12119899

min 120585119894 120582(5)

119894119894=12119899

)))))))))))

)

= 0

(13)

where 120582 = (120582(1)

120582(2)

120582(3)

120582(4)

120582(5)

) are the Lagrangianmultipliers 119883 = (119909

1 1199092 119909

119899)119879 119884 = diag(119910) is diagonal

matrix and 119901119894= 119910119894(119908119879119909119894+ 119887) minus (1 minus 120585

119894) 119894 = 1 2 119899

For the sake of simplicity the properties of the reformu-lation to 119871

12-SVM are shown in Appendix A

3 An Interior Point Method

In this section we develop an interior point method to solvethe equivalent constrained problem (10) of the 119871

12-SVM (7)

31 Auxiliary Function Following the idea of the interiorpoint method the constrained problem (10) can be solvedby minimizing a sequence of logarithmic barrier functions asfollows

min Φ120583(119908 120585 119905 119887)

119898

sum

119895=1

119905119895+ 119862

119899

sum

119894=1

120585119894

minus 120583

119898

sum

119895=1

[log (1199052119895minus 119908119895) + log (1199052

119895+ 119908119895) + log 119905

119895]

minus 120583

119899

sum

119894=1

(log 119901119894+ log 120585

119894)

st 1199052

119895minus 119908119895gt 0 119895 = 1 2 119898

1199052

119895+ 119908119895gt 0 119895 = 1 2 119898

119905119895gt 0 119895 = 1 2 119898

119901119894= 119910119894(119908119879119909119894+ 119887) + 120585

119894minus 1 gt 0 119894 = 1 2 119899

120585119894gt 0 119894 = 1 2 sdot sdot sdot 119899

(14)

where 120583 is the barrier parameter converging to zero fromabove

The KKT system of problem (14) is the following systemof linear equations

120582(1)

minus 120582(2)

minus 119883119879119884119879120582(4)

= 0

119862 lowast 119890119899minus 120582(4)

minus 120582(5)

= 0

119890119898minus 2119879120582

(1)minus 2119879120582

(2)minus 120582(3)

= 0

119910119879120582(4)

= 0

(1198792minus 119882)120582

(1)minus 120583119890119898

= 0

(1198792+ 119882)120582

(2)minus 120583119890119898

= 0

119879120582(3)

minus 120583119890119898

= 0

119875120582(4)

minus 120583119890119899= 0

Ξ120582(5)

minus 120583119890119899= 0

(15)

where 120582 = (120582(1)

120582(2)

120582(3)

120582(4)

120582(5)

) are the Lagrangianmulti-pliers 119879 = diag(119905) 119882 = diag(119908) Ξ = diag(120585) and 119875 =

diag(119884(119883119908+119887lowast119890119899)+120585minus119890

119899) are diagonalmatrices and 119890

119898isin 119877119898

and 119890119899isin 119877119899 stand for the vector whose elements are all ones

32 Newtonrsquos Method We apply Newtonrsquos method to solvethe nonlinear system (15) in variables 119908 120585 119905 119887 and 120582 Thesubproblem of the method is the following system of linearequations

119872 =

((((((((

(

Δ119908

Δ120585

Δ119905

Δ119887

Δ120582(1)

Δ120582(2)

Δ120582(3)

Δ120582(4)

Δ120582(5)

))))))))

)

= minus

(((((((((

(

120582(1)

minus 120582(2)

minus 119883119879119884119879120582(4)

119862 lowast 119890119899minus 120582(4)

minus 120582(5)

119890119898minus 2119879120582

(1)minus 2119879120582

(2)minus 120582(3)

119890119879

119899119884120582(4)

(1198792minus 119882)120582

(1)minus 120583119890119898

(1198792+ 119882)120582

(2)minus 120583119890119898

119879120582(3)

minus 120583119890119898

119875120582(4)

minus 120583119890119899

Ξ120582(5)

minus 120583119890119899

)))))))))

)

(16)

where 119872 is the Jacobian of the left function in (15) and takesthe form

Journal of Applied Mathematics 5

119872 =

(((((((

(

0 0 0 0 119868 minus119868 0 minus119883119879119884119879

0

0 0 0 0 0 0 0 minus119868 minus119868

0 0 minus2 (1198631+ 1198632) 0 minus2119879 minus2119879 minus119868 0 0

0 0 0 0 0 0 0 119910119879

0

minus1198631

0 21198791198631

0 1198792minus 119882 0 0 0 0

1198632

0 21198791198632

0 0 1198792+ 119882 0 0 0

0 0 1198633

0 0 0 119879 0 0

1198634119884119883 119863

40 119863

4119910 0 0 0 119875 0

0 1198635

0 0 0 0 0 0 Ξ

)))))))

)

(17)

where 1198631

= diag(120582(1)) 1198632

= diag(120582(2)) 1198633

= diag(120582(3))1198634= diag(120582(4)) and119863

5= diag(120582(5))

We can rewrite (16) as

(120582(1)

+ Δ120582(1)

) minus (120582(2)

+ Δ120582(2)

) minus 119883119879119884119879(120582(4)

+ Δ120582(4)

) = 0

(120582(4)

+ Δ120582(4)

) + (120582(5)

+ Δ120582(5)

) = 119862 lowast 119890119899

2 (1198631+ 1198632) Δ119905 + 2119879 (120582

(1)+ Δ120582(1)

)

+ 2119879 (120582(2)

+ Δ120582(2)

) + (120582(3)

+ Δ120582(3)

) = 119890119898

119910119879(120582(4)

+ Δ120582(4)

) = 0

minus1198631Δ119908 + 2119879119863

1Δ119905 + (119879

2minus 119882) (120582

(1)+ Δ120582(1)

) = 120583119890119898

1198632Δ119908 + 2119879119863

2Δ119905 + (119879

2+ 119882) (120582

(2)+ Δ120582(2)

) = 120583119890119898

1198633Δ119905 + 119879 (120582

(3)+ Δ120582(3)

) = 120583119890119898

1198634119884119883Δ119908 + 119863

4Δ120585 + 119863

4119910 lowast Δ119887 + 119875 (120582

(4)+ Δ120582(4)

) = 120583119890119899

1198635Δ120585 + Ξ (120582

(5)+ Δ120582(5)

) = 120583119890119899

(18)

It follows from the last five equations that vector ≜ 120582 + Δ120582

can be expressed as

(1)

= (1198792minus 119882)minus1

(120583119890119898+ 1198631Δ119908 minus 2119879119863

1Δ119905)

(2)

= (1198792+ 119882)minus1

(120583119890119898minus 1198632Δ119908 minus 2119879119863

2Δ119905)

(3)

= 119879minus1

(120583119890119898minus 1198633Δ119905)

(4)

= 119875minus1

(120583119890119899minus 1198634119884119883Δ119908 minus 119863

4Δ120585 minus 119863

4119910 lowast Δ119887)

(5)

= Ξminus1

(120583119890119899minus 1198635Δ120585)

(19)

Substituting (19) into the first four equations of (18) weobtain

119878(

Δ119908

Δ120585

Δ119905

Δ119887

)

=

(((

(

minus120583(1198792minus 119882)minus1

119890119898+ 120583(119879

2+ 119882)minus1

119890119898

+119883119879119884119879119875minus1119890119899

minus119862 lowast 119890119899+ 120583119875minus1119890119899+ 120583Ξminus1119890119899

minus119890119898+ 2120583119879(119879

2minus 119882)minus1

119890119898

+2120583119879(1198792+ 119882)minus1

119890119898+ 120583119879minus1119890119898

120583119910119879119875minus1119890119899

)))

)

(20)

where matrix 119878 takes the form

119878 = (

11987811

11987812

11987813

11987814

11987821

11987822

11987823

11987824

11987831

11987832

11987833

11987834

11987841

11987842

11987843

11987844

) (21)

with blocks

11987811

= 119880 + 119881 + 119883119879119884119879119875minus11198634119884119883

11987812

= 119883119879119884119879119875minus11198634

11987813

= minus2 (119880 minus 119881)119879 11987814

= 119883119879119884119879119875minus11198634119910

11987821

= 119875minus11198634119884119883

11987822

= 119875minus11198634+ Ξminus11198635

11987823

= 0 11987824

= 119875minus11198634119910

6 Journal of Applied Mathematics

11987831

= minus2 (119880 minus 119881)119879 11987832

= 0

11987833

= 4119879 (119880 + 119881)119879 + 119879minus11198633minus 2 (119863

1+ 1198632) 119878

34= 0

11987841

= 119910119879119875minus11198634119884119883 119878

42= 119910119879119875minus11198634

11987843

= 0 11987844

= 119910119879119875minus11198634119910

(22)

and 119880 = (1198792minus 119882)minus11198631and 119881 = (119879

2+ 119882)minus11198632

33The Interior PointerAlgorithm Let 119911 = (119908 120585 119905 119887) and120582 =

(120582(1)

120582(2)

120582(3)

120582(4)

120582(5)

) we first present the interior pointeralgorithm to solve the barrier problem (14) and then discussthe details of the algorithm

Algorithm 2 The interior pointer algorithm (IPA) is asfollows

Step 0 Given tolerance 120598120583 set 120591

1isin (0 (12)) 119897 gt 0

1205741gt 1 120573 isin (0 1) Let 119896 = 0

Step 1 Stop if KKT condition (15) holds

Step 2 Compute Δ119911119896 from (20) and

119896+1 from (19)Compute 119911

119896+1= 119911119896+ 120572119896Δ119911119896 Update the Lagrangian

multipliers to obtain 120582119896+1

Step 3 Let 119896 = 119896 + 1 Go to Step 1

In Step 2 a step length 120572119896 is used to calculate 119911

119896+1We estimate 120572

119896 by Armijo line search [34] in which 120572119896

=

max 120573119895| 119895 = 0 1 2 for some 120573 isin (0 1) and satisfies the

following inequalities

(119905119896+1

)2

minus 119908119896+1

gt 0

(119905119896+1

)2

+ 119908119896+1

gt 0

119905119896+1

gt 0

Φ120583(119911119896+1

) minus Φ120583(119911119896)

le 1205911120572119896(nabla119908Φ

120583(119911119896)119879

Δ119908119896

+ nabla119905Φ120583(119911119896)119879

Δ119905119896

+nabla120585Φ120583(119911119896)119879

Δ120585119896+ nabla119887Φ120583(119911119896)119879

Δ119887119896)

(23)

where 1205911isin (0 12)

To avoid ill-conditioned growth of 119896 and guarantee thestrict dual feasibility the Lagrangian multipliers 120582

119896 should

be sufficiently positive and bounded from above Followinga similar idea of [33] we first update the dual multipliers by

(1)(119896+1)

119894

=

min119897120583

(119905119896

119894)2

minus 119908119896

119894

if (1)(119896+1)119894

lt min119897120583

(119905119896

119894)2

minus 119908119896

119894

(1)(119896+1)

119894 if min119897

120583

(119905119896

119894)2

minus x119896119894

le (1)(119896+1)

119894le

1205831205741

(119905119896

119894)2

minus 119908119896

119894

1205831205741

(119905119896

119894)2

minus 119908119896

119894

if (1)(119896+1)119894

gt1205831205741

(119905119896

119894)2

minus 119908119896

119894

(2)(119896+1)

119894

=

min119897120583

(119905119896

119894)2

+ 119908119896

119894

if (2)(119896+1)119894

lt min119897120583

(119905119896

119894)2

+ 119908119896

119894

(2)(119896+1)

119894 if min119897

120583

(119905119896

119894)2

+ 119908119896

119894

le (2)(119896+1)

119894le

1205831205741

(119905119896

119894)2

+ 119908119896

119894

1205831205741

(119905119896

119894)2

+ 119908119896

119894

if (2)(119896+1)119894

gt1205831205741

(119905119896

119894)2

+ 119908119896

119894

(3)(119896+1)

119894=

min119897120583

119905119896

119894

if (3)(119896+1)119894

lt min119897120583

119905119896

119894

(3)(119896+1)

119894 if min119897

120583

119905119896

119894

le (3)(119896+1)

119894le

1205831205741

119905119896

119894

1205831205741

119905119896

119894

if (3)(119896+1)119894

gt1205831205741

119905119896

119894

(4)(119896+1)

119894=

min119897120583

119901119896

119894

if (4)(119896+1)119894

lt min119897120583

119901119896

119894

(4)(119896+1)

119894 if min119897

120583

119901119896

119894

le (4)(119896+1)

119894le

1205831205741

119901119896

119894

1205831205741

119901119896

119894

if (4)(119896+1)119894

gt1205831205741

119901119896

119894

(24)

Journal of Applied Mathematics 7

where 119901119894= 119910119894(119908119879119909119894+ 119887) + 120585

119894minus 1

(5)(119896+1)

119894=

min119897120583

120585119896

119894

if (5)(119896+1)119894

lt min119897120583

120585119896

119894

(5)(119896+1)

119894 if min119897

120583

120585119896

119894

le (5)(119896+1)

119894le

1205831205741

120585119896

119894

1205831205741

120585119896

119894

if (5)(119896+1)119894

gt1205831205741

120585119896

119894

(25)

where the parameters 119897 and 1205741satisfy 0 lt 119897 120574

1gt 1

Since positive definiteness of the matrix 119878 is demanded inthis method the Lagrangian multipliers

119896+1 should satisfythe following condition

120582(3)

minus 2 (120582(1)

+ 120582(2)

) 119905 ge 0 (26)

For the sake of simplicity the proof is given in Appendix BTherefore if 119896+1 satisfies (26) we let 120582119896+1 =

119896+1 Other-wise we would further update it by the following setting

120582(1)(119896+1)

= 1205742(1)(119896+1)

120582(2)(119896+1)

= 1205742(2)(119896+1)

120582(3)(119896+1)

= 1205743(3)(119896+1)

(27)

where constants 1205742isin (0 1) and 120574

3ge 1 satisfy

1205743

1205742

= max119894isin119864

2119905119896+1

119894((1)(119896+1)

119894+ (2)(119896+1)

119894)

(3)(119896+1)

119894

(28)

with 119864 = 1 2 119899 It is not difficult to see that the vector((1)(119896+1)

(2)(119896+1)

(3)(119896+1)

) determined by (27) satisfies (26)In practice the KKT conditions (15) are allowed to be

satisfied within a tolerance 120598120583 It turns to be that the iterative

process stops while the following inequalities meet

Res (119911 120582 120583) =

10038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817

120582(1)

minus 120582(2)

minus 119883119879119884119879120582(4)

119862 lowast 119890119899minus 120582(4)

minus 120582(5)

119890119898minus 2119879120582

(1)minus 2119879120582

(2)minus 120582(3)

119890119879

119899119884120582(4)

(1198792minus 119882)120582

(1)minus 120583119890119898

(1198792+ 119882)120582

(2)minus 120583119890119898

119879120582(3)

minus 120583119890119898

119875120582(4)

minus 120583119890119899

Ξ120582(5)

minus 120583119890119899

10038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817

lt 120598120583 (29)

120582 ge minus120598120583119890 (30)

where 120598120583is related to the current barrier parameter 120583 and

satisfies 120598120583darr 0 as 120583 rarr 0

Since functionΦ120583in (14) is convex we have the following

lemma which shows that Algorithm 2 is well defined (theproof is given in Appendix B)

Lemma 3 Let 119911119896 be strictly feasible for problem (10) If Δ119911119896=

0 then (23) is satisfied for all 120572119896

ge 0 If Δ119911119896

= 0 then thereexists a

119896isin (0 1] such that (23) holds for all 120572

119896isin (0

119896]

The proposed interior point method successively solvesthe barrier subproblem (14) with a decreasing sequence 120583

119896

We simply reduce both 120598120583and 120583 by a constant factor 120573 isin

(0 1) Finally we test optimality for problem (10) by meansof the residual norm Res(119911 0)

Here we present the whole algorithm to solve the 11987112

-SVM problem(10)

Algorithm 4 Algorithm for solving 11987112

-SVM problem is asfollows

Step 0 Set 1199080 isin 119877119898 1198870isin 119877 120585

0isin 119877119899 1205820 =

0gt 0

1199050

119894ge radic|119908

0

119894| + (12) 119894 isin 1 2 119898 Given constants

1205830gt 0 1205981205830

gt 0 120573 isin (0 1) and 120598 gt 0 let 119895 = 0

Step 1 Stop if Res(119911119895 119895 0) le 120598 and 119895ge 0

Step 2 Starting from (119911119895 120582119895) apply Algorithm 2 to

solve (14) with barrier parameter 120583119895and stopping

tolerance 120598120583119895 Set 119911119895+1 = 119911

119895119896 119895+1 = 119895119896 and 120582

119895+1=

120582119895119896

Step 3 Set 120583119895+1

= 120573120583119895and 120598120583119895+1

= 120573120598120583119895 Let 119895 = 119895 + 1

and go to Step 1

In Algorithm 4 the index 119895 denotes an outer iterationwhile 119896 denotes the last inner iteration of Algorithm 2

The convergence of the proposed interior point methodcan be proved We list the theorem here and give the proof inAppendix C

Theorem 5 Let (119911119896 120582119896) be generated by Algorithm 2 Thenany limit point of the sequence (119911

119896 119896) generated by

Algorithm 2 satisfies the first-order optimality conditions (15)

Theorem 6 Let (119911119895 119895) be generated by Algorithm 4 by

ignoring its termination condition Then the following state-ments are true

(i) The limit point of (119911119895 119895) satisfies the first order opti-mality condition (13)

(ii) The limit point 119911lowast of the convergent subsequence

119911119895J sube 119911

119895 with unbounded multipliers

119895J is a

Fritz-John point [35] of problem (10)

4 Experiments

In this section we tested the constrained optimization refor-mulation to the 119871

12-SVM and the proposed interior point

method We compared the performance of the 11987112

-SVMwith 119871

2-SVM [30] 119871

1-SVM [12] and 119871

0-SVM [12] on

artificial data and ten UCI data sets (httparchiveicsucieduml) These four problems were solved in primal refer-encing the machine learning toolbox Spider (httppeoplekybtuebingenmpgdespider) The 119871

2-SVM and 119871

1-SVM

were solved directly by quadratic programming and linearprogramming respectively To the NP-hard problem 119871

0-

SVM a commonly cited approximation method FeatureSelection Concave (FSV) [12] was applied and then the FSV

8 Journal of Applied Mathematics

problem was solved by a Successive Linear Approximation(SLA) algorithm All the experiments were run in thepersonal computer (16 GHz of CPU 4GB of RAM) withMATLAB R2010b on 64 bit Windows 7

In the proposed interior point method (Algorithms 2 and4) we set the parameters as 120573 = 06 120591

1= 05 119897 = 000001

1205741

= 100000 and 120573 = 05 The balance parameter 119862 wasselected by 5-fold cross-validation on training set over therange 2

minus10 2minus9 210 After training the weights that didnot satisfy the criteria |119908

119895|max

119894(|119908119894|) ge 10

4 [14] were set tozeroThen the cardinality of the hyperplane was computed asthe number of the nonzero weights

41 Artificial Data First we took an artificial binary linearclassification problem as an example The problem is similarto that in [13] The probability of 119910 = 1 or minus1 is equal Thefirst 6 features are relevant but redundant In 70 samplesthe first three features 119909

1 1199092 1199093were drawn as 119909

119894= 119910119873(119894 1)

and the second three features 1199094 1199095 1199096 as 119909

119894= 119873(0 1)

Otherwise the first three were drawn as 119909119894= 119873(0 1) and the

second three as 119909119894= 119910119873(119894 minus 3 1) The rest features are noise

119909119894= 119873(0 20) 119894 = 7 119898 Here119898 is the dimension of input

features The inputs were scaled to mean zero and standarddeviation In each trial 500 points were generated for testingand the average results were estimated over 30 trials

In the first experiment we consider the cases with thefixed feature size 119898 = 30 and different training sample sizes119899 = 10 20 100 The average results over the 30 trialsare shown in Table 1 and Figure 2 Figure 2 (left) plots theaverage cardinality of each classifier Since the artificial datasets have 2 relevant and nonredundant features the idealaverage cardinality is 2 Figure 2 (left) shows that the threesparse SVMs 119871

1-SVM 119871

12-SVM and 119871

0-SVM can achieve

sparse solution while the 1198712-SVM almost uses full features

in each data set Furthermore the solutions of 11987112

-SVMand 119871

0-SVM are much sparser than 119871

1-SVM As shown in

Table 1 the 1198711-SVM selects more than 6 features in all cases

which implies that some redundant or irrelevant featuresare selected The average cardinalities of 119871

12-SVM and 119871

0-

SVM are similar and close to 2 However when 119899 = 10

and 20 the 1198710-SVM has the average cardinalities of 142

and 187 respectively It means that the 1198710-SVM sometimes

selects only one feature in low sample data set and maybeignores some really relevant feature Consequently with thecardinalities between 205 and 29 119871

12-SVM has the more

reliable solution than 1198710-SVM In short as far as the number

of selected features is concerned the 11987112

-SVM behavesbetter than the other three methods

Figure 2 (right) plots the trend of the prediction accuracyversus the size of the training sample The classificationperformance of all methods is generally improved with theincreasing of the training sample size 119899 119871

1-SVM has the best

prediction performance in all cases and a slightly better than11987112

-SVM 11987112

-SVM shows more accuracy in classificationthan 119871

2-SVM and 119871

0-SVM especially in the case of 119899 =

10 50 As shown in Table 1 when there are only 10 train-ing samples the average accuracy of 119871

12-SVM is 8805

while the results of 1198712-SVM and 119871

0-SVM are 8465 and

7765 respectively Compared with 1198712-SVM and 119871

0-SVM

11987112

-SVM has the average accuracy increased by 34 and104 respectively as can be explained in what follows Tothe 1198712-SVM all features are selected without discrimination

and the prediction would bemisled by the irrelevant featuresTo the 119871

0-SVM few features are selected and some relevant

features are not included which would put negative impacton the prediction result As the tradeoff between119871

2-SVMand

1198710-SVM 119871

12-SVM has better performance than the two

The average results over the ten artificial data sets of the first experiment are shown at the bottom of Table 1. On average, the accuracy of the L1/2-SVM is 0.87% lower than that of the L1-SVM, while the L1/2-SVM selects 74.14% fewer features than the L1-SVM. This indicates that the L1/2-SVM achieves a much sparser solution than the L1-SVM at little cost in accuracy. Moreover, the average accuracy of the L1/2-SVM over the 10 data sets is 2.22% higher than that of the L0-SVM with a similar cardinality. To sum up, the L1/2-SVM provides the best balance between accuracy and sparsity among the three sparse SVMs.

To further evaluate the feature selection performance of the L1/2-SVM, we investigate whether the features are correctly selected. Since the L2-SVM is not designed for feature selection, it is not included in this comparison. Since our artificial data sets have 2 best features (x_3, x_6), the best result should have these two features ranked on top according to the absolute values of their weights |w_j|. In the experiment, we select the top 2 features with the maximal |w_j| for each method and count how often the top 2 features are x_3 and x_6 over the 30 runs. The results are listed in Table 2. When the training sample size is too small, it is difficult for all sparse SVMs to discriminate the two most important features. For example, when n = 10, the selected frequencies of the L1-SVM, L0-SVM, and L1/2-SVM are 7, 3, and 9, respectively. As n increases, all methods tend to make more correct selections. Moreover, Table 2 shows that the L1/2-SVM outperforms the other two methods in all cases. For example, when n = 100, the selected frequencies of the L1-SVM and L0-SVM are 22 and 25, respectively, while that of the L1/2-SVM is 27. The L1-SVM selects too many redundant or irrelevant features, which may influence the ranking to some extent; therefore, the L1-SVM is not as good as the L1/2-SVM at distinguishing the critical features. The L0-SVM has a lower hit frequency than the L1/2-SVM, which is probably due to the excessively small feature subset it obtains. Above all, Tables 1 and 2 and Figure 2 clearly show that the L1/2-SVM is a promising sparsity-driven classification method.
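As an illustration of how the counts in Table 2 can be obtained, the sketch below ranks features by |w_j| and counts the runs in which x_3 and x_6 occupy the top two positions; the helper name and the 0-based indices are our own, and any of the four SVM variants can supply the weight vectors.

```python
import numpy as np

def top2_hit_count(weight_vectors, relevant=(2, 5)):
    """Count runs whose two largest |w_j| belong exactly to the relevant features
    (x_3 and x_6, i.e., 0-based indices 2 and 5)."""
    hits = 0
    for w in weight_vectors:
        top2 = set(np.argsort(np.abs(w))[-2:])   # indices of the two largest |w_j|
        if top2 == set(relevant):
            hits += 1
    return hits

# usage: weight_vectors = [train_some_svm(...) for _ in range(30)]
#        frequency = top2_hit_count(weight_vectors)
```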

In the second simulation, we consider cases with various dimensions of the feature space, m = 20, 40, ..., 180, 200, and a fixed training sample size n = 100. The average results over 30 trials are shown in Figure 3 and Table 3. Since there are only 6 relevant features, a larger m means more noisy features. Figure 3 (left) shows that, as the dimension increases from 20 to 200, the number of features selected by the L1-SVM increases from 8.26 to 23.1, whereas the cardinalities of the L1/2-SVM and L0-SVM remain stable (from 2.2 to 2.95). This indicates that the L1/2-SVM and L0-SVM are more suitable for feature selection than the L1-SVM.

Figure 3 (right) shows that, as the number of noise features increases, the accuracy of the L2-SVM drops significantly (from 98.68% to 87.47%).


Figure 2: Results comparison on 10 artificial data sets with various training sample sizes. Left: average cardinality versus the number of training samples; right: classification accuracy (%) versus the number of training samples, for the L2-SVM, L1-SVM, L0-SVM, and L1/2-SVM.

Table 1: Results comparison on 10 artificial data sets with various training sample sizes.

  n            L2-SVM          L1-SVM          L0-SVM          L1/2-SVM
               Card    Acc     Card    Acc     Card    Acc     Card    Acc
  10           29.97   84.65   6.93    88.55   1.43    77.65   2.25    88.05
  20           30.00   92.21   8.83    96.79   1.87    93.17   2.35    95.50
  30           30.00   94.27   8.47    98.23   2.03    94.67   2.05    97.41
  40           29.97   95.99   9.00    98.75   2.07    95.39   2.10    97.75
  50           30.00   96.61   9.50    98.91   2.13    96.57   2.45    97.68
  60           29.97   97.25   10.00   98.94   2.20    97.61   2.45    97.98
  70           30.00   97.41   11.00   98.98   2.20    97.56   2.40    98.23
  80           30.00   97.68   10.03   99.09   2.23    97.61   2.60    98.50
  90           29.97   97.89   9.20    99.21   2.30    97.41   2.90    98.30
  100          30.00   98.13   10.27   99.09   2.50    97.93   2.55    98.36
  On average   29.99   95.21   9.32    97.65   2.10    94.56   2.41    96.78

"n" is the number of training samples, "Card" is the number of selected features, and "Acc" is the classification accuracy (%).

Table 2: Frequency with which the most important features (x_3, x_6) are ranked in the top 2 over 30 runs.

  n           10  20  30  40  50  60  70  80  90  100
  L1-SVM       7  14  16  18  22  23  20  22  21   22
  L0-SVM       3  11  12  15  19  22  22  23  22   25
  L1/2-SVM     9  16  19  24  24  23  27  26  26   27

Table 3: Average results over the 10 artificial data sets with varying dimensions.

                L2-SVM   L1-SVM   L0-SVM   L1/2-SVM
  Cardinality   109.95   15.85    2.38     2.41
  Accuracy      92.93    99.07    97.76    98.26


Table 4: Feature selection performance comparison on UCI data sets (cardinality).

  No.  UCI data set     n     m     L2-SVM          L1-SVM          L0-SVM          L1/2-SVM
                                    Mean    Std     Mean    Std     Mean    Std     Mean    Std
  1    Pima diabetes    768   8     8.00    0.00    7.40    0.89    6.20    1.10    7.60    1.52
  2    Breast cancer    683   9     9.00    0.00    8.60    1.22    6.40    2.07    8.40    1.67
  3    Wine (3)         178   13    13.00   0.00    9.73    0.69    3.73    0.28    4.07    0.49
  4    Image (7)        2100  19    19.00   0.00    8.20    0.53    3.66    0.97    5.11    0.62
  5    SPECT            267   22    22.00   0.00    17.40   5.27    9.80    4.27    9.00    1.17
  6    WDBC             569   30    30.00   0.00    9.80    1.52    9.00    3.03    5.60    2.05
  7    Ionosphere       351   34    33.20   0.45    25.00   3.35    24.60   10.21   21.80   8.44
  8    SPECTF           267   44    44.00   0.00    38.80   10.98   24.00   11.32   25.60   7.75
  9    Sonar            208   60    60.00   0.00    31.40   13.05   27.00   7.18    13.60   10.06
  10   Musk1            476   166   166.00  0.00    85.80   18.85   48.60   4.28    41.00   14.16

Average cardinality: m 40.50; L2-SVM 40.42; L1-SVM 24.21; L0-SVM 16.30; L1/2-SVM 14.18.

Table 5: Classification performance comparison on UCI data sets (accuracy, %).

  No.  UCI data set     L2-SVM          L1-SVM          L0-SVM          L1/2-SVM
                        Mean    Std     Mean    Std     Mean    Std     Mean    Std
  1    Pima diabetes    76.99   3.47    76.73   0.85    76.73   2.94    77.12   1.52
  2    Breast cancer    96.18   2.23    97.06   0.96    96.32   1.38    96.47   1.59
  3    Wine (3)         97.71   3.73    98.29   1.28    93.14   2.39    97.14   2.56
  4    Image (7)        88.94   1.49    91.27   1.77    86.67   4.58    91.36   1.94
  5    SPECT            83.40   2.15    78.49   2.31    78.11   3.38    82.34   5.10
  6    WDBC             96.11   2.11    96.46   1.31    95.58   1.15    96.46   1.31
  7    Ionosphere       83.43   5.44    88.00   3.70    84.57   5.38    87.43   5.94
  8    SPECTF           78.49   3.91    74.34   4.50    73.87   3.25    78.49   1.89
  9    Sonar            73.66   5.29    75.61   5.82    74.15   7.98    76.10   4.35
  10   Musk1            84.84   1.73    82.11   2.42    76.00   5.64    82.32   3.38

  Average accuracy      85.97           85.84           83.51           86.52

In contrast, the accuracy of the other three sparse SVMs changes little, which shows that SVMs can benefit from feature reduction. Table 3 shows the average results over all data sets in the second experiment. On average, the L1/2-SVM yields a much sparser solution than the L1-SVM and slightly better accuracy than the L0-SVM.

4.2. UCI Data Sets. We further tested the reformulation and the proposed interior point method for the L1/2-SVM on 10 UCI data sets [36]. There are 8 binary classification problems and 2 multiclass problems (Wine, Image). Each feature of the input data was normalized to zero mean and unit variance, and instances with missing values were deleted. The data were then randomly split into a training set (80%) and a testing set (20%). For the two multiclass problems, a one-against-rest method was applied to construct a binary classifier for each class. We repeated the training and testing procedure 10 times, and the average results are shown in Tables 4 and 5 and Figure 4.
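A minimal sketch of this preprocessing (z-score normalization, random 80%/20% split, and one-against-rest label construction) is given below; it uses NumPy only, and the function names are ours rather than part of the Spider toolbox used in the paper.

```python
import numpy as np

def preprocess_and_split(X, y, rng, train_frac=0.8):
    """Normalize each feature to zero mean / unit variance, then split 80% / 20%."""
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    idx = rng.permutation(len(y))
    n_train = int(train_frac * len(y))
    train, test = idx[:n_train], idx[n_train:]
    return X[train], y[train], X[test], y[test]

def one_vs_rest_labels(y, cls):
    """One-against-rest: +1 for the chosen class, -1 for all other classes."""
    return np.where(y == cls, 1.0, -1.0)
```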

Tables 4 and 5 summarize the feature selection and classification performance, respectively, of the numerical experiments on the UCI data sets. Here n is the number of samples and m is the number of input features. For the two multiclass data sets, the number of classes is given in parentheses after the name. Sparsity is defined as card/m, and a small value of sparsity is preferred. The data sets are arranged in ascending order of dimension. The lowest cardinality and the best accuracy rate for each problem are shown in bold.
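For clarity, cardinality and sparsity can be computed from a trained weight vector as follows; the tolerance for treating a weight as numerically zero is our own choice.

```python
import numpy as np

def cardinality_and_sparsity(w, tol=1e-6):
    card = int(np.sum(np.abs(w) > tol))   # number of selected features
    return card, card / w.size            # sparsity = card / m (smaller is better)
```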

As shown in Tables 4 and 5, the three sparse SVMs encourage sparsity on all data sets while maintaining accuracy roughly comparable to that of the L2-SVM. Among the three sparse methods, the L1/2-SVM has the lowest average cardinality (14.18) and the highest average classification accuracy (86.52%), while the L1-SVM has the worst feature selection performance, with the highest average cardinality (24.21), and the L0-SVM has the lowest average classification accuracy (83.51%).

Figure 4 plots the sparsity (left) and the classification accuracy (right) of each classifier on the UCI data sets. On three data sets (6, 9, 10), the L1/2-SVM has the best performance in both feature selection and classification among the three sparse SVMs. Compared with the L1-SVM, the L1/2-SVM achieves a sparser solution on nine data sets. For example, on data set 8 (SPECTF), the L1/2-SVM selects 34%


Figure 3: Results comparison on artificial data sets with various dimensions. Left: cardinality versus dimension; right: classification accuracy (%) versus dimension, for the L2-SVM, L1-SVM, L0-SVM, and L1/2-SVM.

fewer features than the L1-SVM, while at the same time its accuracy is 4.1% higher than that of the L1-SVM. On most data sets (4, 5, 6, 9, 10), the cardinality of the L1/2-SVM drops significantly (by at least 37%) with equal or slightly better accuracy. Only on data set 3 (Wine) is the accuracy of the L1/2-SVM decreased, by 1.1%, but the sparsity provided by the L1/2-SVM is a 58.2% improvement over the L1-SVM. On the remaining three data sets (1, 2, 7), the two methods give similar results in feature selection and classification. As seen above, the L1/2-SVM provides a lower-dimensional representation than the L1-SVM with competitive prediction performance.

Figure 4 (right) shows that, compared with the L0-SVM, the L1/2-SVM improves the classification accuracy on all data sets. For instance, on four data sets (3, 4, 8, 10), the classification accuracy of the L1/2-SVM is at least 4.0% higher than that of the L0-SVM. In particular, on data set 10 (Musk1), the L1/2-SVM gives a 6.3% rise in accuracy over the L0-SVM. Meanwhile, it can be observed from Figure 4 (left) that the L1/2-SVM selects fewer features than the L0-SVM on five data sets (5, 6, 7, 9, 10). For example, on data sets 6 and 9, the cardinalities of the L1/2-SVM are 37.8% and 49.6% lower than those of the L0-SVM, respectively. In summary, the L1/2-SVM delivers better classification performance than the L0-SVM while remaining effective in choosing relevant features.

5. Conclusions

In this paper, we proposed an L1/2 regularization technique for simultaneous feature selection and classification in the SVM. We reformulated the L1/2-SVM as an equivalent smooth constrained optimization problem. The problem possesses a very simple structure, which makes it relatively easy to develop numerical methods. By the use of this reformulation, we proposed an interior point method and established its convergence. Our numerical results support the reformulation and the proposed method: the L1/2-SVM obtains sparser solutions than the L1-SVM with comparable classification accuracy, and it achieves more accurate classification results than the L0-SVM (FSV).

Inspired by the good performance of the smooth optimization reformulation of the L1/2-SVM, several topics deserve further research, for example, developing more efficient algorithms for solving the reformulation, studying the nonlinear L1/2-SVM, and exploring various applications of the L1/2-SVM to further validate its effectiveness. Some of these are under our current investigation.

Appendix

A. Properties of the Reformulation of the L1/2-SVM

Let D be the set of all feasible points of problem (10). For any z ∈ D, let FD(z, D) and LFD(z, D) denote the sets of feasible directions and linearized feasible directions of D at z, respectively. Since the constraint functions of (10) are all convex, we immediately have the following lemma.

Lemma A.1. For any feasible point z of (10), one has
\[ FD(z, D) = LFD(z, D). \tag{A.1} \]


Figure 4: Results comparison on UCI data sets. Left: sparsity (%) versus data set number; right: classification accuracy (%) versus data set number, for the L2-SVM, L1-SVM, L0-SVM, and L1/2-SVM.

Based on the above lemma, we can easily derive a first-order necessary condition for (10). The Lagrangian function of (10) is
\[
\begin{aligned}
L(w, t, \xi, b, \lambda, \mu, \nu, p, q) = {}& C\sum_{i=1}^{n}\xi_i + \sum_{j=1}^{m}t_j - \sum_{j=1}^{m}\lambda_j\left(t_j^2 - w_j\right) - \sum_{j=1}^{m}\mu_j\left(t_j^2 + w_j\right) \\
& - \sum_{j=1}^{m}\nu_j t_j - \sum_{i=1}^{n}p_i\left(y_i w^T x_i + y_i b + \xi_i - 1\right) - \sum_{i=1}^{n}q_i\xi_i,
\end{aligned}
\tag{A.2}
\]
where λ, μ, ν, p, q are the Lagrangian multipliers. By the use of Lemma A.1, we immediately have the following theorem about the first-order necessary condition.

Theorem A.2. Let (w, t, ξ, b) be a local solution of (10). Then there exist Lagrangian multipliers (λ, μ, ν, p, q) ∈ R^{m+m+m+n+n} such that the following KKT conditions hold:
\[
\begin{aligned}
& \frac{\partial L}{\partial w} = \lambda - \mu - \sum_{i=1}^{n} p_i y_i x_i = 0, \\
& \frac{\partial L}{\partial t_j} = 1 - 2\lambda_j t_j - 2\mu_j t_j - \nu_j = 0, \quad j = 1, 2, \ldots, m, \\
& \frac{\partial L}{\partial \xi_i} = C - p_i - q_i = 0, \quad i = 1, 2, \ldots, n, \\
& \frac{\partial L}{\partial b} = \sum_{i=1}^{n} p_i y_i = 0, \\
& \lambda_j \ge 0, \quad t_j^2 - w_j \ge 0, \quad \lambda_j\left(t_j^2 - w_j\right) = 0, \quad j = 1, 2, \ldots, m, \\
& \mu_j \ge 0, \quad t_j^2 + w_j \ge 0, \quad \mu_j\left(t_j^2 + w_j\right) = 0, \quad j = 1, 2, \ldots, m, \\
& \nu_j \ge 0, \quad t_j \ge 0, \quad \nu_j t_j = 0, \quad j = 1, 2, \ldots, m, \\
& p_i \ge 0, \quad y_i\left(w^T x_i + b\right) + \xi_i - 1 \ge 0, \quad p_i\left(y_i\left(w^T x_i + b\right) + \xi_i - 1\right) = 0, \quad i = 1, 2, \ldots, n, \\
& q_i \ge 0, \quad \xi_i \ge 0, \quad q_i \xi_i = 0, \quad i = 1, 2, \ldots, n.
\end{aligned}
\tag{A.3}
\]
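The system (A.3) is the residual R(w, t, ξ, b, λ) = 0 of (13). As a rough numerical check of a candidate solution, the following NumPy sketch assembles the stationarity parts and the min-form complementarity parts of that residual; the variable names mirror the notation above and are otherwise our own.

```python
import numpy as np

def kkt_residual(w, t, xi, b, lam, X, y, C):
    """Stationarity and min-form complementarity pieces of (A.3)/(13)."""
    l1, l2, l3, l4, l5 = lam                     # multipliers for the five constraint groups
    p = y * (X @ w + b) + xi - 1.0               # margin slacks p_i
    stat_w = l1 - l2 - X.T @ (y * l4)            # stationarity in w
    stat_xi = C - l4 - l5                        # stationarity in xi
    stat_t = 1.0 - 2.0 * t * (l1 + l2) - l3      # stationarity in t
    stat_b = y @ l4                              # stationarity in b
    comp = np.concatenate([np.minimum(t**2 - w, l1), np.minimum(t**2 + w, l2),
                           np.minimum(t, l3), np.minimum(p, l4), np.minimum(xi, l5)])
    return np.concatenate([stat_w, stat_xi, stat_t, [stat_b], comp])
```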

The following theorem shows that the level sets of f over the feasible region are bounded.

Theorem A.3. For any given constant c > 0, the level set
\[ W_c = \{ z = (w, t, \xi, b) \mid f(z) \le c \} \cap D \tag{A.4} \]
is bounded.

Proof. For any (w, t, ξ, b) ∈ W_c, we have
\[ \sum_{i=1}^{m} t_i + C \sum_{j=1}^{n} \xi_j \le c. \tag{A.5} \]
Combining this with t_i ≥ 0, i = 1, ..., m, and ξ_j ≥ 0, j = 1, ..., n, we obtain
\[ 0 \le t_i \le c, \quad i = 1, \ldots, m, \qquad 0 \le \xi_j \le \frac{c}{C}, \quad j = 1, \ldots, n. \tag{A.6} \]
Moreover, for any (w, t, ξ, b) ∈ D,
\[ \left| w_i \right| \le t_i^2, \quad i = 1, \ldots, m. \tag{A.7} \]
Consequently, w, t, ξ are bounded. What is more, from the condition
\[ y_i \left( w^T x_i + b \right) \ge 1 - \xi_i, \quad i = 1, \ldots, n, \tag{A.8} \]
we have
\[ w^T x_i + b \ge 1 - \xi_i \ \text{ if } y_i = 1, \qquad w^T x_i + b \le -1 + \xi_i \ \text{ if } y_i = -1. \tag{A.9} \]
Thus, if the feasible region D is not empty, then
\[ \max_{y_i = 1} \left( 1 - \xi_i - w^T x_i \right) \le b \le \min_{y_i = -1} \left( -1 + \xi_i - w^T x_i \right). \tag{A.10} \]
Hence b is also bounded. The proof is complete.

B. Proof of Lemma 3

Lemma 3 (see Section 3) shows that Algorithm 2 is well defined. We first introduce the following proposition, which shows that the matrix S defined by (21) is always positive definite; this ensures that Algorithm 2 is well defined.

Proposition B.1. Let inequality (26) hold. Then the matrix S defined by (21) is positive definite.

Proof. By an elementary deduction, we have, for any v^{(1)} ∈ R^m, v^{(2)} ∈ R^n, v^{(3)} ∈ R^m, v^{(4)} ∈ R with (v^{(1)}, v^{(2)}, v^{(3)}, v^{(4)}) ≠ 0,
\[
\begin{aligned}
& \left(v^{(1)T}, v^{(2)T}, v^{(3)T}, v^{(4)T}\right) S \begin{pmatrix} v^{(1)} \\ v^{(2)} \\ v^{(3)} \\ v^{(4)} \end{pmatrix}
= \sum_{k=1}^{4} \sum_{l=1}^{4} v^{(k)T} S_{kl} v^{(l)} \\
&\quad = v^{(1)T}\left(U + V + X^T Y^T P^{-1} D_4 Y X\right) v^{(1)}
+ v^{(1)T} X^T Y^T P^{-1} D_4 v^{(2)}
- 2 v^{(1)T} (U - V) T v^{(3)}
+ v^{(1)T} X^T Y^T P^{-1} D_4 y\, v^{(4)} \\
&\qquad + v^{(2)T} P^{-1} D_4 Y X v^{(1)}
+ v^{(2)T}\left(P^{-1} D_4 + \Xi^{-1} D_5\right) v^{(2)}
+ v^{(2)T} P^{-1} D_4 y\, v^{(4)}
- 2 v^{(3)T} (U - V) T v^{(1)} \\
&\qquad + v^{(3)T}\left(4 T (U + V) T + T^{-1} D_3 - 2\left(D_1 + D_2\right)\right) v^{(3)}
+ v^{(4)T} y^T P^{-1} D_4 Y X v^{(1)}
+ v^{(4)T} y^T P^{-1} D_4 v^{(2)}
+ v^{(4)T} y^T P^{-1} D_4 y\, v^{(4)} \\
&\quad = v^{(1)T} U v^{(1)} - 4 v^{(3)T} U T v^{(1)} + 4 v^{(3)T} T U T v^{(3)}
+ v^{(1)T} V v^{(1)} + 4 v^{(3)T} V T v^{(1)} + 4 v^{(3)T} T V T v^{(3)} \\
&\qquad + v^{(2)T} \Xi^{-1} D_5 v^{(2)}
+ v^{(3)T}\left(T^{-1} D_3 - 2 D_1 - 2 D_2\right) v^{(3)}
+ \left(Y X v^{(1)} + v^{(2)} + v^{(4)} y\right)^T P^{-1} D_4 \left(Y X v^{(1)} + v^{(2)} + v^{(4)} y\right) \\
&\quad = e_m^T U \left(D_{v1} - 2 T D_{v3}\right)^2 e_m
+ e_m^T V \left(D_{v1} + 2 T D_{v3}\right)^2 e_m
+ v^{(2)T} \Xi^{-1} D_5 v^{(2)} \\
&\qquad + v^{(3)T}\left(T^{-1} D_3 - 2 D_1 - 2 D_2\right) v^{(3)}
+ \left(Y X v^{(1)} + v^{(2)} + v^{(4)} y\right)^T P^{-1} D_4 \left(Y X v^{(1)} + v^{(2)} + v^{(4)} y\right)
> 0,
\end{aligned}
\tag{B.1}
\]
where D_{v1} = diag(v^{(1)}), D_{v2} = diag(v^{(2)}), D_{v3} = diag(v^{(3)}), and D_{v4} = diag(v^{(4)}).
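Condition (26), which this proposition relies on, is a simple componentwise inequality; a quick numerical check (with our own function name) is:

```python
import numpy as np

def multipliers_satisfy_26(lam1, lam2, lam3, t):
    """Check the componentwise condition lambda^(3) - 2*(lambda^(1) + lambda^(2))*t >= 0 of (26)."""
    return bool(np.all(lam3 - 2.0 * (lam1 + lam2) * t >= 0.0))
```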


B.1. Proof of Lemma 3

Proof. It is easy to see that if Δz^k = 0, then (23) holds for all α_k ≥ 0.

Suppose Δz^k ≠ 0. We have
\[
\nabla_w \Phi_\mu(z^k)^T \Delta w^k + \nabla_t \Phi_\mu(z^k)^T \Delta t^k + \nabla_\xi \Phi_\mu(z^k)^T \Delta \xi^k + \nabla_b \Phi_\mu(z^k)^T \Delta b^k
= -\left(\Delta w^{kT}, \Delta \xi^{kT}, \Delta t^{kT}, \Delta b^k\right) S \begin{pmatrix} \Delta w^k \\ \Delta \xi^k \\ \Delta t^k \\ \Delta b^k \end{pmatrix}.
\tag{B.2}
\]
Since the matrix S is positive definite and Δz^k ≠ 0, the last equation implies
\[
\nabla_w \Phi_\mu(z^k)^T \Delta w^k + \nabla_t \Phi_\mu(z^k)^T \Delta t^k + \nabla_\xi \Phi_\mu(z^k)^T \Delta \xi^k + \nabla_b \Phi_\mu(z^k)^T \Delta b^k < 0.
\tag{B.3}
\]
Consequently, there exists an ᾱ_k ∈ (0, 1] such that the fourth inequality in (23) is satisfied for all α_k ∈ (0, ᾱ_k]. On the other hand, since z^k is strictly feasible, the point z^k + α Δz^k remains feasible for all sufficiently small α > 0. The proof is complete.
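The backtracking rule invoked in this proof (and in Step 2 of Algorithm 2) can be sketched as follows; here phi plays the role of the barrier function Φ_μ, feasible checks the strict inequalities in (23), grad_dot_dz is ∇_zΦ_μ(z)ᵀΔz, and the default values of τ₁ and β are illustrative rather than taken from the paper.

```python
def armijo_backtracking(phi, grad_dot_dz, feasible, z, dz, tau1=0.25, beta=0.5, max_iter=50):
    """Find alpha = beta^j so that z + alpha*dz stays strictly feasible and gives sufficient decrease."""
    alpha, phi_z = 1.0, phi(z)
    for _ in range(max_iter):
        z_new = [zi + alpha * di for zi, di in zip(z, dz)]
        if feasible(z_new) and phi(z_new) - phi_z <= tau1 * alpha * grad_dot_dz:
            return alpha, z_new
        alpha *= beta                 # backtrack: try the next power of beta
    return None, z                    # no acceptable step found within max_iter trials
```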

C. Convergence Analysis

This appendix is devoted to the global convergence of the interior point method. We first show the convergence of Algorithm 2 when a fixed μ is applied to the barrier subproblem (14).

Lemma C.1. Let {(z^k, λ^k)} be generated by Algorithm 2. Then {z^k} are strictly feasible for problem (10), and the Lagrangian multipliers {λ^k} are bounded from above.

Proof. For the sake of convenience, we use g_i(z), i = 1, 2, ..., 3m + 2n, to denote the constraint functions of the constrained problem (10).

We first show that {z^k} are strictly feasible. Suppose on the contrary that there exist an infinite index subset K and an index i ∈ {1, 2, ..., 3m + 2n} such that {g_i(z^k)}_K ↓ 0. By the definition of Φ_μ(z) and the fact that f(z) = Σ_{j=1}^m t_j + C Σ_{i=1}^n ξ_i is bounded from below on the feasible set, it must hold that {Φ_μ(z^k)}_K → ∞. However, the line search rule implies that the sequence {Φ_μ(z^k)} is decreasing, so we get a contradiction. Consequently, for every i ∈ {1, 2, ..., 3m + 2n}, {g_i(z^k)} is bounded away from zero. The boundedness of {λ^k} then follows from (24)–(27).

Lemma C.2. Let {(z^k, λ^k)} and {(Δz^k, λ̂^{k+1})} be generated by Algorithm 2. If {z^k}_K is a convergent subsequence of {z^k}, then the sequence {(Δz^k, λ̂^{k+1})}_K is bounded.

Proof. Again we use g_i(z), i = 1, 2, ..., 3m + 2n, to denote the constraint functions of the constrained problem (10). We suppose on the contrary that there exists an infinite subset K′ ⊆ K such that the subsequence {(Δz^k, λ̂^{k+1})}_{K′} tends to infinity. Let {z^k}_{K′} → z*. It follows from Lemma C.1 that there is an infinite subset K̄ ⊆ K′ such that {λ^k}_{K̄} → λ* and g_i(z*) > 0 for all i ∈ {1, 2, ..., 3m + 2n}. Thus we have
\[ S_k \to S^* \tag{C.1} \]
as k → ∞ with k ∈ K̄. By Proposition B.1, S* is positive definite. Since the right-hand side of (18) is bounded and continuous, the unboundedness of {(Δz^k, λ̂^{k+1})}_{K̄} would imply that the limit of the coefficient matrices of (18) is singular, which yields a contradiction. The proof is complete.

Lemma C.3. Let {(z^k, λ^k)} and {(Δz^k, λ̂^{k+1})} be generated by Algorithm 2. If {z^k}_K is a convergent subsequence of {z^k}, then {Δz^k}_K → 0.

Proof. It follows from Lemma C.2 that the sequence {Δz^k}_K is bounded. Suppose on the contrary that there exists an infinite subset K′ ⊆ K such that {Δz^k}_{K′} → Δz* ≠ 0. Since the subsequences {z^k}_{K′}, {λ^k}_{K′}, and {λ̂^k}_{K′} are all bounded, there are points z*, λ*, and λ̂*, as well as an infinite index set K̄ ⊆ K′, such that {z^k}_{K̄} → z*, {λ^k}_{K̄} → λ*, and {λ̂^k}_{K̄} → λ̂*. By Lemma C.1, we have g_i(z*) > 0 for all i ∈ {1, 2, ..., 3m + 2n}. Similar to the proof of Lemma 3, it is not difficult to get
\[
\nabla_w \Phi_\mu(z^*)^T \Delta w^* + \nabla_t \Phi_\mu(z^*)^T \Delta t^* + \nabla_\xi \Phi_\mu(z^*)^T \Delta \xi^* + \nabla_b \Phi_\mu(z^*)^T \Delta b^*
= -\left(\Delta w^{*T}, \Delta \xi^{*T}, \Delta t^{*T}, \Delta b^*\right) S \begin{pmatrix} \Delta w^* \\ \Delta \xi^* \\ \Delta t^* \\ \Delta b^* \end{pmatrix} < 0.
\tag{C.2}
\]
Since g_i(z*) > 0 for all i ∈ {1, 2, ..., 3m + 2n}, there exists an ᾱ ∈ (0, 1] such that, for all α ∈ (0, ᾱ],
\[ g_i\left(z^* + \alpha \Delta z^*\right) > 0, \quad \forall i \in \{1, 2, \ldots, 3m + 2n\}. \tag{C.3} \]
Taking into account τ₁ ∈ (0, 1/2), we claim that there exists an α̃ ∈ (0, ᾱ] such that the following inequality holds for all α ∈ (0, α̃]:
\[ \Phi_\mu\left(z^* + \alpha \Delta z^*\right) - \Phi_\mu\left(z^*\right) \le 1.1\, \tau_1 \alpha \nabla_z \Phi_\mu\left(z^*\right)^T \Delta z^*. \tag{C.4} \]
Let m* = min{j | β^j ∈ (0, α̃], j = 0, 1, ...} and α* = β^{m*}. It follows from (C.3) that the following inequality is satisfied for all k ∈ K̄ sufficiently large and any i ∈ {1, 2, ..., 3m + 2n}:
\[ g_i\left(z^k + \alpha^* \Delta z^k\right) > 0. \tag{C.5} \]


Moreover, we have
\[
\begin{aligned}
\Phi_\mu\left(z^k + \alpha^* \Delta z^k\right) - \Phi_\mu\left(z^k\right)
&= \Phi_\mu\left(z^* + \alpha^* \Delta z^*\right) - \Phi_\mu\left(z^*\right)
+ \Phi_\mu\left(z^k + \alpha^* \Delta z^k\right) - \Phi_\mu\left(z^* + \alpha^* \Delta z^*\right)
+ \Phi_\mu\left(z^*\right) - \Phi_\mu\left(z^k\right) \\
&\le 1.05\, \tau_1 \alpha^* \nabla_z \Phi_\mu\left(z^*\right)^T \Delta z^*
\le \tau_1 \alpha^* \nabla_z \Phi_\mu\left(z^k\right)^T \Delta z^k
\end{aligned}
\tag{C.6}
\]
for all sufficiently large k ∈ K̄. By the backtracking line search rule, the last inequality together with (C.5) yields α_k ≥ α* for all k ∈ K̄ large enough. Consequently, when k ∈ K̄ is sufficiently large, we have from (23) and (C.2) that
\[
\Phi_\mu\left(z^{k+1}\right) \le \Phi_\mu\left(z^k\right) + \tau_1 \alpha_k \nabla_z \Phi_\mu\left(z^k\right)^T \Delta z^k
\le \Phi_\mu\left(z^k\right) + \tau_1 \alpha^* \nabla_z \Phi_\mu\left(z^k\right)^T \Delta z^k
\le \Phi_\mu\left(z^k\right) + \frac{1}{2} \tau_1 \alpha^* \nabla_z \Phi_\mu\left(z^*\right)^T \Delta z^*.
\tag{C.7}
\]
This shows that {Φ_μ(z^k)}_K̄ → −∞, contradicting the fact that {Φ_μ(z^k)}_K̄ is bounded from below. The proof is complete.

We now establish the convergence of Algorithms 2 and 4, that is, Theorems 5 and 6 in Section 3.

C.1. Proof of Theorem 5

Proof. Let (z*, λ̂*) be a limit point of {(z^k, λ̂^k)} and let the subsequence {(z^k, λ̂^k)}_K converge to (z*, λ̂*). Recall that (Δz^k, λ̂^{k+1}) is the solution of (18) with (z, λ) = (z^k, λ^k). Taking limits on both sides of (18) with (z, λ) = (z^k, λ^k) as k → ∞ with k ∈ K, and using Lemmas C.1 and C.3, we obtain
\[
\begin{aligned}
& \hat{\lambda}^{*(1)} - \hat{\lambda}^{*(2)} - X^T Y^T \hat{\lambda}^{*(4)} = 0, \\
& C e_n - \hat{\lambda}^{*(4)} - \hat{\lambda}^{*(5)} = 0, \\
& e_m - 2 T^* \hat{\lambda}^{*(1)} - 2 T^* \hat{\lambda}^{*(2)} - \hat{\lambda}^{*(3)} = 0, \\
& y^T \hat{\lambda}^{*(4)} = 0, \\
& \left(T^{*2} - W^*\right) \hat{\lambda}^{*(1)} - \mu e_m = 0, \\
& \left(T^{*2} + W^*\right) \hat{\lambda}^{*(2)} - \mu e_m = 0, \\
& T^* \hat{\lambda}^{*(3)} - \mu e_m = 0, \\
& P^* \hat{\lambda}^{*(4)} - \mu e_n = 0, \\
& \Xi^* \hat{\lambda}^{*(5)} - \mu e_n = 0.
\end{aligned}
\tag{C.8}
\]
This shows that (z*, λ̂*) satisfies the first-order optimality conditions (15).

C.2. Proof of Theorem 6

Proof. (i) Without loss of generality, we suppose that the bounded subsequence {(z^j, λ̂^j)}_J converges to some point (z*, λ̂*). It is clear that z* is a feasible point of (10). Since μ_j, ε_{μ_j} → 0, it follows from (29) and (30) that
\[
\mathrm{Res}\left(z^j, \hat{\lambda}^j, \mu_j\right) \xrightarrow{J} \mathrm{Res}\left(z^*, \hat{\lambda}^*, 0\right) = 0, \qquad \{\hat{\lambda}^j\}_J \to \hat{\lambda}^* \ge 0.
\tag{C.9}
\]
Consequently, (z*, λ̂*) satisfies the KKT conditions (13).

(ii) Let ξ_j = max{‖λ̂^j‖_∞, 1} and λ̄^j = ξ_j^{-1} λ̂^j. Obviously, {λ̄^j} is bounded. Hence there exists an infinite subset J′ ⊆ J such that {λ̄^j}_{J′} → λ̄ ≠ 0 and ‖λ̄^j‖_∞ = 1 for large j ∈ J′. From (30) we know that λ̄ ≥ 0. Dividing both sides of the equations (18) with (z, λ) = (z^j, λ^j) by ξ_j and then taking limits as j → ∞ with j ∈ J′, we get
\[
\begin{aligned}
& \bar{\lambda}^{(1)} - \bar{\lambda}^{(2)} - X^T Y^T \bar{\lambda}^{(4)} = 0, \\
& \bar{\lambda}^{(4)} + \bar{\lambda}^{(5)} = 0, \\
& 2 T^* \bar{\lambda}^{(1)} + 2 T^* \bar{\lambda}^{(2)} + \bar{\lambda}^{(3)} = 0, \\
& y^T \bar{\lambda}^{(4)} = 0, \\
& \left(T^{*2} - W^*\right) \bar{\lambda}^{(1)} = 0, \\
& \left(T^{*2} + W^*\right) \bar{\lambda}^{(2)} = 0, \\
& T^* \bar{\lambda}^{(3)} = 0, \\
& P^* \bar{\lambda}^{(4)} = 0, \\
& \Xi^* \bar{\lambda}^{(5)} = 0.
\end{aligned}
\tag{C.10}
\]
Since z* is feasible, the above equations show that z* is a Fritz-John point of problem (10).

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

The authors would like to acknowledge support for this project from the National Science Foundation of China (NSF Grants 11371154, 11071087, 11271069, and 61103202).

References

[1] Y. M. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization," in Proceedings of the 14th International Conference on Machine Learning (ICML '97), vol. 97, pp. 412–420, 1997.
[2] G. Forman, "An extensive empirical study of feature selection metrics for text classification," The Journal of Machine Learning Research, vol. 3, pp. 1289–1305, 2003.
[3] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1–3, pp. 389–422, 2002.
[4] H. H. Zhang, J. Ahn, X. Lin, and C. Park, "Gene selection using support vector machines with non-convex penalty," Bioinformatics, vol. 22, no. 1, pp. 88–95, 2006.
[5] Y. Saeys, I. Inza, and P. Larranaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507–2517, 2007.
[6] J. Weston, F. Perez-Cruz, O. Bousquet, O. Chapelle, A. Elisseeff, and B. Scholkopf, "Feature selection and transduction for prediction of molecular bioactivity for drug design," Bioinformatics, vol. 19, no. 6, pp. 764–771, 2003.
[7] Y. Liu, "A comparative study on feature selection methods for drug discovery," Journal of Chemical Information and Computer Sciences, vol. 44, no. 5, pp. 1823–1828, 2004.
[8] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, Eds., Feature Extraction: Foundations and Applications, vol. 207, Springer, 2006.
[9] A. Rakotomamonjy, "Variable selection using SVM-based criteria," Journal of Machine Learning Research, vol. 3, no. 7-8, pp. 1357–1370, 2003.
[10] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik, "Feature selection for SVMs," in Proceedings of the Conference on Neural Information Processing Systems, vol. 13, pp. 668–674, Denver, Colo, USA, 2000.
[11] D. Peleg and R. Meir, "A feature selection algorithm based on the global minimization of a generalization error bound," in Proceedings of the Conference on Neural Information Processing Systems, Vancouver, Canada, 2004.
[12] P. S. Bradley and O. L. Mangasarian, "Feature selection via concave minimization and support vector machines," in Proceedings of the International Conference on Machine Learning, pp. 82–90, San Francisco, Calif, USA, 1998.
[13] J. Weston, A. Elisseeff, B. Scholkopf, and M. Tipping, "Use of the zero-norm with linear models and kernel methods," Journal of Machine Learning Research, vol. 3, pp. 1439–1461, 2003.
[14] A. B. Chan, N. Vasconcelos, and G. R. G. Lanckriet, "Direct convex relaxations of sparse SVM," in Proceedings of the 24th International Conference on Machine Learning (ICML '07), pp. 145–153, Corvallis, Ore, USA, June 2007.
[15] G. M. Fung and O. L. Mangasarian, "A feature selection Newton method for support vector machine classification," Computational Optimization and Applications, vol. 28, no. 2, pp. 185–202, 2004.
[16] J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani, "1-norm support vector machines," Advances in Neural Information Processing Systems, vol. 16, no. 1, pp. 49–56, 2004.
[17] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society B: Methodological, vol. 58, no. 1, pp. 267–288, 1996.
[18] T. Zhang, "Analysis of multi-stage convex relaxation for sparse regularization," Journal of Machine Learning Research, vol. 11, pp. 1081–1107, 2010.
[19] R. Chartrand, "Exact reconstruction of sparse signals via nonconvex minimization," IEEE Signal Processing Letters, vol. 14, no. 10, pp. 707–710, 2007.
[20] R. Chartrand, "Nonconvex regularization for shape preservation," in Proceedings of the 14th IEEE International Conference on Image Processing (ICIP '07), vol. 1, pp. I293–I296, San Antonio, Tex, USA, September 2007.
[21] Z. Xu, H. Zhang, Y. Wang, X. Chang, and Y. Liang, "L1/2 regularization," Science China Information Sciences, vol. 53, no. 6, pp. 1159–1169, 2010.
[22] W. J. Chen and Y. J. Tian, "Lp-norm proximal support vector machine and its applications," Procedia Computer Science, vol. 1, no. 1, pp. 2417–2423, 2010.
[23] J. Liu, J. Li, W. Xu, and Y. Shi, "A weighted Lq adaptive least squares support vector machine classifiers—robust and sparse approximation," Expert Systems with Applications, vol. 38, no. 3, pp. 2253–2259, 2011.
[24] Y. Liu, H. H. Zhang, C. Park, and J. Ahn, "Support vector machines with adaptive Lq penalty," Computational Statistics & Data Analysis, vol. 51, no. 12, pp. 6380–6394, 2007.
[25] Z. Liu, S. Lin, and M. Tan, "Sparse support vector machines with Lp penalty for biomarker identification," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 1, pp. 100–107, 2010.
[26] A. Rakotomamonjy, R. Flamary, G. Gasso, and S. Canu, "lp-lq penalty for sparse linear and sparse multiple kernel multitask learning," IEEE Transactions on Neural Networks, vol. 22, no. 8, pp. 1307–1320, 2011.
[27] J. Y. Tan, Z. Zhang, L. Zhen, C. H. Zhang, and N. Y. Deng, "Adaptive feature selection via a new version of support vector machine," Neural Computing and Applications, vol. 23, no. 3-4, pp. 937–945, 2013.
[28] Z. B. Xu, X. Y. Chang, F. M. Xu, and H. Zhang, "L1/2 regularization: an iterative half thresholding algorithm," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 7, pp. 1013–1027, 2012.
[29] D. Ge, X. Jiang, and Y. Ye, "A note on the complexity of Lp minimization," Mathematical Programming, vol. 129, no. 2, pp. 285–299, 2011.
[30] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 1995.
[31] J. Fan and H. Peng, "Nonconcave penalized likelihood with a diverging number of parameters," The Annals of Statistics, vol. 32, no. 3, pp. 928–961, 2004.
[32] K. Knight and W. Fu, "Asymptotics for lasso-type estimators," The Annals of Statistics, vol. 28, no. 5, pp. 1356–1378, 2000.
[33] B. S. Tian and X. Q. Yang, "An interior-point l1/2-penalty method for nonlinear programming," Technical Report, Department of Applied Mathematics, Hong Kong Polytechnic University, 2013.
[34] L. Armijo, "Minimization of functions having Lipschitz continuous first partial derivatives," Pacific Journal of Mathematics, vol. 16, pp. 1–3, 1966.
[35] F. John, "Extremum problems with inequalities as side-conditions," in Studies and Essays, Courant Anniversary Volume, pp. 187–204, 1948.
[36] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz, "UCI repository of machine learning databases," Technical Report 9702, Department of Information and Computer Science, University of California, Irvine, Calif, USA, 1998, http://archive.ics.uci.edu/ml.


Page 3: Research Article An Interior Point Method for -SVM …downloads.hindawi.com/journals/jam/2014/942520.pdfResearch Article An Interior Point Method for 1/2-SVM and Application to Feature

Journal of Applied Mathematics 3

0 1 20

1

2

3

4

W

Lp

pena

lty

p = 2

p = 1

p = 05

minus2 minus15 minus1 minus05

35

25

15

15

05

05

Figure 1 119871119901penalty in one dimension

23 11987112

-SVM Model We pay particular attention to the11987112

-SVM namely problem (4) with 119901 = 12 We will derivea smooth constrained optimization reformulation to the119871

12-

SVM so that it is relatively easy to design numerical methodsWe first specify the 119871

12-SVM

min119898

sum

119895=1

10038161003816100381610038161003816119908119895

10038161003816100381610038161003816

12

+ 119862

119899

sum

119894=1

120585119894≜ 120601 (119908 120585 119887)

st 119910119894(119908119879119909119894+ 119887) ge 1 minus 120585

119894 119894 = 1 119899

120585119894ge 0 119894 = 1 119899

(5)

Denote by 119906 = (119908 120585 119887) andD the feasible region of the pro-blem that is

D = (119908 120585 119887) | 119910119894(119908119879119909119894+ 119887) ge 1 minus 120585

119894

120585119894ge 0 119894 = 1 119899

(6)

Then the 11987112

-SVM can be written as an impact form

min120601 (119906) 119906 isin D (7)

It is a nonconvex and non-Lipschitz problem Due to theexistence of the term |119908

119894|12 the objective function is not even

directionally differentiable at a point with some119908119894= 0 which

makes the problem very difficult to solve Existing numericalmethods that are very efficient for solving smooth problemcould not be used directly One possible way to developnumerical methods for solving (7) is to smoothing the term|119908119895|12 using some smoothing function such as 120601

120598(119908119895) =

(1199082

119895+ 1205982)14 with some 120598 gt 0 However it is easy to see that

the derivative of 120601120598(119908119895) will be unbounded as 119908

119895rarr 0 and

120598 rarr 0 Consequently it is not desirable that the smoothingfunction based numerical methods could work well

Recently Tian and Yang [33] proposed an interior point11987112

-penalty functionmethod to solve general nonlinear pro-gramming problems by using a quadratic relaxation schemefor their 119871

12-lower order penalty problems We will follow

the idea of [33] to develop an interior point method forsolving the 119871

12-SVM To this end in the next subsection we

reformulate problem (7) to a smooth constrained optimiza-tion problem

24 A Reformulation to the 11987112

-SVM Model Consider thefollowing constrained optimization problem

min119898

sum

119895=1

119905119895+ 119862

119899

sum

119894=1

120585119894≜ 119891 (119908 120585 119905 119887)

st 1199052

119895minus 119908119895ge 0 119895 = 1 119898

1199052

119895+ 119908119895ge 0 119895 = 1 119898

119905119895ge 0 119895 = 1 119898

119910119894(119908119879119909119894+ 119887) ge 1 minus 120585

119894 119894 = 1 119899

120585119894ge 0 119894 = 1 119899

(8)

It is obtained by letting 119905119895= |119908119895|12 in the objective function

and adding constraints 1199052

119895minus 119908119895

ge 0 and 1199052

119895+ 119908119895

ge 0119895 = 1 119898 in (7) Denote by F the feasible region of theproblem that is

F = (119908 120585 119905 119887) | 1199052

119895minus 119908119895ge 0 119905

2

119895+ 119908119895ge 0

119905119895ge 0 119895 = 1 119898

cap (119908 120585 119905 119887) | 119910119894(119908119879119909119894+ 119887) ge 1 minus 120585

119894

120585119894ge 0 119894 = 1 119899

(9)

Let 119911 = (119908 120585 119905 119887) Then the above problem can be written as

min119891 (119911) 119911 isin F (10)

The following theorem establishes the equivalencebetween the 119871

12-SVM and (10)

Theorem 1 If 119906lowast = (119908lowast 120585lowast 119887lowast) isin 119877119898+119899+1 is a solution of the

11987112

minus 119878119881119872 (7) then 119911lowast

= (119908lowast 120585lowast |119908lowast|12

119887lowast) isin 119877

119898+119899+119898+1

is a solution of the optimization problem (10) Conversely if(119908lowast 120585lowast 119905lowast 119887lowast) is a solution of the optimization problem (10)

then (119908lowast 120585lowast 119887lowast) is a solution of the 119871

12-119878119881119872 (7)

Proof Let 119906lowast = (119908lowast 120585lowast 119887lowast) be a solution of the 119871

12-SVM

(7) and let = ( 120585 119905 ) be a solution of the constrained

4 Journal of Applied Mathematics

optimization problem (10) It is clear that 119911lowast

= (119908lowast 120585lowast

|119908lowast|12

119887lowast) isin F Moreover we have 119905

2

119895ge |

119895| forall119895 =

1 2 119898 and hence

119891 () =

119898

sum

119895=1

119905119895+ 119862

119899

sum

119894=1

120585119894ge

119898

sum

119895=1

10038161003816100381610038161003816119895

10038161003816100381610038161003816

12

+ 119862

119899

sum

119894=1

120585119894

ge

119898

sum

119895=1

10038161003816100381610038161003816119908lowast

119895

10038161003816100381610038161003816

12

+ 119862

119899

sum

119894=1

120585lowast

119894= 120601 (119906

lowast)

(11)

Since 119911lowast isin F we have

120601 (119906lowast) = 119891 (119911

lowast) ge 119891 () (12)

This together with (11) implies that 119891() = 120601(119906lowast) The proof

is complete

It is clear that the constraint functions of (10) are convexConsequently at any feasible point the set of all feasibledirections is the same as the set of all linearized feasibledirections

As a result the KKT point exists The KKT system ofthe problem (10) can be written as the following system ofnonlinear equations

119877 (119908 120585 119905 119887 120582) =

(((((((((((

(

120582(1)

minus 120582(2)

minus 119883119879119884119879120582(4)

119862 lowast 119890119899minus 120582(4)

minus 120582(5)

119890119898minus 2119879120582

(1)minus 2119879120582

(2)minus 120582(3)

119910119879120582(4)

min 1199052

119895minus 119908119895 120582(1)

119895119895=12119898

min 1199052

119895+ 119908119895 120582(2)

119895119895=12119898

min 119905119895 120582(3)

119895119895=12119898

min 119901119894 120582(4)

119894119894=12119899

min 120585119894 120582(5)

119894119894=12119899

)))))))))))

)

= 0

(13)

where 120582 = (120582(1)

120582(2)

120582(3)

120582(4)

120582(5)

) are the Lagrangianmultipliers 119883 = (119909

1 1199092 119909

119899)119879 119884 = diag(119910) is diagonal

matrix and 119901119894= 119910119894(119908119879119909119894+ 119887) minus (1 minus 120585

119894) 119894 = 1 2 119899

For the sake of simplicity the properties of the reformu-lation to 119871

12-SVM are shown in Appendix A

3 An Interior Point Method

In this section we develop an interior point method to solvethe equivalent constrained problem (10) of the 119871

12-SVM (7)

31 Auxiliary Function Following the idea of the interiorpoint method the constrained problem (10) can be solvedby minimizing a sequence of logarithmic barrier functions asfollows

min Φ120583(119908 120585 119905 119887)

119898

sum

119895=1

119905119895+ 119862

119899

sum

119894=1

120585119894

minus 120583

119898

sum

119895=1

[log (1199052119895minus 119908119895) + log (1199052

119895+ 119908119895) + log 119905

119895]

minus 120583

119899

sum

119894=1

(log 119901119894+ log 120585

119894)

st 1199052

119895minus 119908119895gt 0 119895 = 1 2 119898

1199052

119895+ 119908119895gt 0 119895 = 1 2 119898

119905119895gt 0 119895 = 1 2 119898

119901119894= 119910119894(119908119879119909119894+ 119887) + 120585

119894minus 1 gt 0 119894 = 1 2 119899

120585119894gt 0 119894 = 1 2 sdot sdot sdot 119899

(14)

where 120583 is the barrier parameter converging to zero fromabove

The KKT system of problem (14) is the following systemof linear equations

120582(1)

minus 120582(2)

minus 119883119879119884119879120582(4)

= 0

119862 lowast 119890119899minus 120582(4)

minus 120582(5)

= 0

119890119898minus 2119879120582

(1)minus 2119879120582

(2)minus 120582(3)

= 0

119910119879120582(4)

= 0

(1198792minus 119882)120582

(1)minus 120583119890119898

= 0

(1198792+ 119882)120582

(2)minus 120583119890119898

= 0

119879120582(3)

minus 120583119890119898

= 0

119875120582(4)

minus 120583119890119899= 0

Ξ120582(5)

minus 120583119890119899= 0

(15)

where 120582 = (120582(1)

120582(2)

120582(3)

120582(4)

120582(5)

) are the Lagrangianmulti-pliers 119879 = diag(119905) 119882 = diag(119908) Ξ = diag(120585) and 119875 =

diag(119884(119883119908+119887lowast119890119899)+120585minus119890

119899) are diagonalmatrices and 119890

119898isin 119877119898

and 119890119899isin 119877119899 stand for the vector whose elements are all ones

32 Newtonrsquos Method We apply Newtonrsquos method to solvethe nonlinear system (15) in variables 119908 120585 119905 119887 and 120582 Thesubproblem of the method is the following system of linearequations

119872 =

((((((((

(

Δ119908

Δ120585

Δ119905

Δ119887

Δ120582(1)

Δ120582(2)

Δ120582(3)

Δ120582(4)

Δ120582(5)

))))))))

)

= minus

(((((((((

(

120582(1)

minus 120582(2)

minus 119883119879119884119879120582(4)

119862 lowast 119890119899minus 120582(4)

minus 120582(5)

119890119898minus 2119879120582

(1)minus 2119879120582

(2)minus 120582(3)

119890119879

119899119884120582(4)

(1198792minus 119882)120582

(1)minus 120583119890119898

(1198792+ 119882)120582

(2)minus 120583119890119898

119879120582(3)

minus 120583119890119898

119875120582(4)

minus 120583119890119899

Ξ120582(5)

minus 120583119890119899

)))))))))

)

(16)

where 119872 is the Jacobian of the left function in (15) and takesthe form

Journal of Applied Mathematics 5

119872 =

(((((((

(

0 0 0 0 119868 minus119868 0 minus119883119879119884119879

0

0 0 0 0 0 0 0 minus119868 minus119868

0 0 minus2 (1198631+ 1198632) 0 minus2119879 minus2119879 minus119868 0 0

0 0 0 0 0 0 0 119910119879

0

minus1198631

0 21198791198631

0 1198792minus 119882 0 0 0 0

1198632

0 21198791198632

0 0 1198792+ 119882 0 0 0

0 0 1198633

0 0 0 119879 0 0

1198634119884119883 119863

40 119863

4119910 0 0 0 119875 0

0 1198635

0 0 0 0 0 0 Ξ

)))))))

)

(17)

where 1198631

= diag(120582(1)) 1198632

= diag(120582(2)) 1198633

= diag(120582(3))1198634= diag(120582(4)) and119863

5= diag(120582(5))

We can rewrite (16) as

(120582(1)

+ Δ120582(1)

) minus (120582(2)

+ Δ120582(2)

) minus 119883119879119884119879(120582(4)

+ Δ120582(4)

) = 0

(120582(4)

+ Δ120582(4)

) + (120582(5)

+ Δ120582(5)

) = 119862 lowast 119890119899

2 (1198631+ 1198632) Δ119905 + 2119879 (120582

(1)+ Δ120582(1)

)

+ 2119879 (120582(2)

+ Δ120582(2)

) + (120582(3)

+ Δ120582(3)

) = 119890119898

119910119879(120582(4)

+ Δ120582(4)

) = 0

minus1198631Δ119908 + 2119879119863

1Δ119905 + (119879

2minus 119882) (120582

(1)+ Δ120582(1)

) = 120583119890119898

1198632Δ119908 + 2119879119863

2Δ119905 + (119879

2+ 119882) (120582

(2)+ Δ120582(2)

) = 120583119890119898

1198633Δ119905 + 119879 (120582

(3)+ Δ120582(3)

) = 120583119890119898

1198634119884119883Δ119908 + 119863

4Δ120585 + 119863

4119910 lowast Δ119887 + 119875 (120582

(4)+ Δ120582(4)

) = 120583119890119899

1198635Δ120585 + Ξ (120582

(5)+ Δ120582(5)

) = 120583119890119899

(18)

It follows from the last five equations that vector ≜ 120582 + Δ120582

can be expressed as

(1)

= (1198792minus 119882)minus1

(120583119890119898+ 1198631Δ119908 minus 2119879119863

1Δ119905)

(2)

= (1198792+ 119882)minus1

(120583119890119898minus 1198632Δ119908 minus 2119879119863

2Δ119905)

(3)

= 119879minus1

(120583119890119898minus 1198633Δ119905)

(4)

= 119875minus1

(120583119890119899minus 1198634119884119883Δ119908 minus 119863

4Δ120585 minus 119863

4119910 lowast Δ119887)

(5)

= Ξminus1

(120583119890119899minus 1198635Δ120585)

(19)

Substituting (19) into the first four equations of (18) weobtain

119878(

Δ119908

Δ120585

Δ119905

Δ119887

)

=

(((

(

minus120583(1198792minus 119882)minus1

119890119898+ 120583(119879

2+ 119882)minus1

119890119898

+119883119879119884119879119875minus1119890119899

minus119862 lowast 119890119899+ 120583119875minus1119890119899+ 120583Ξminus1119890119899

minus119890119898+ 2120583119879(119879

2minus 119882)minus1

119890119898

+2120583119879(1198792+ 119882)minus1

119890119898+ 120583119879minus1119890119898

120583119910119879119875minus1119890119899

)))

)

(20)

where matrix 119878 takes the form

119878 = (

11987811

11987812

11987813

11987814

11987821

11987822

11987823

11987824

11987831

11987832

11987833

11987834

11987841

11987842

11987843

11987844

) (21)

with blocks

11987811

= 119880 + 119881 + 119883119879119884119879119875minus11198634119884119883

11987812

= 119883119879119884119879119875minus11198634

11987813

= minus2 (119880 minus 119881)119879 11987814

= 119883119879119884119879119875minus11198634119910

11987821

= 119875minus11198634119884119883

11987822

= 119875minus11198634+ Ξminus11198635

11987823

= 0 11987824

= 119875minus11198634119910

6 Journal of Applied Mathematics

11987831

= minus2 (119880 minus 119881)119879 11987832

= 0

11987833

= 4119879 (119880 + 119881)119879 + 119879minus11198633minus 2 (119863

1+ 1198632) 119878

34= 0

11987841

= 119910119879119875minus11198634119884119883 119878

42= 119910119879119875minus11198634

11987843

= 0 11987844

= 119910119879119875minus11198634119910

(22)

and 119880 = (1198792minus 119882)minus11198631and 119881 = (119879

2+ 119882)minus11198632

33The Interior PointerAlgorithm Let 119911 = (119908 120585 119905 119887) and120582 =

(120582(1)

120582(2)

120582(3)

120582(4)

120582(5)

) we first present the interior pointeralgorithm to solve the barrier problem (14) and then discussthe details of the algorithm

Algorithm 2 The interior pointer algorithm (IPA) is asfollows

Step 0 Given tolerance 120598120583 set 120591

1isin (0 (12)) 119897 gt 0

1205741gt 1 120573 isin (0 1) Let 119896 = 0

Step 1 Stop if KKT condition (15) holds

Step 2 Compute Δ119911119896 from (20) and

119896+1 from (19)Compute 119911

119896+1= 119911119896+ 120572119896Δ119911119896 Update the Lagrangian

multipliers to obtain 120582119896+1

Step 3 Let 119896 = 119896 + 1 Go to Step 1

In Step 2 a step length 120572119896 is used to calculate 119911

119896+1We estimate 120572

119896 by Armijo line search [34] in which 120572119896

=

max 120573119895| 119895 = 0 1 2 for some 120573 isin (0 1) and satisfies the

following inequalities

(119905119896+1

)2

minus 119908119896+1

gt 0

(119905119896+1

)2

+ 119908119896+1

gt 0

119905119896+1

gt 0

Φ120583(119911119896+1

) minus Φ120583(119911119896)

le 1205911120572119896(nabla119908Φ

120583(119911119896)119879

Δ119908119896

+ nabla119905Φ120583(119911119896)119879

Δ119905119896

+nabla120585Φ120583(119911119896)119879

Δ120585119896+ nabla119887Φ120583(119911119896)119879

Δ119887119896)

(23)

where 1205911isin (0 12)

To avoid ill-conditioned growth of 119896 and guarantee thestrict dual feasibility the Lagrangian multipliers 120582

119896 should

be sufficiently positive and bounded from above Followinga similar idea of [33] we first update the dual multipliers by

(1)(119896+1)

119894

=

min119897120583

(119905119896

119894)2

minus 119908119896

119894

if (1)(119896+1)119894

lt min119897120583

(119905119896

119894)2

minus 119908119896

119894

(1)(119896+1)

119894 if min119897

120583

(119905119896

119894)2

minus x119896119894

le (1)(119896+1)

119894le

1205831205741

(119905119896

119894)2

minus 119908119896

119894

1205831205741

(119905119896

119894)2

minus 119908119896

119894

if (1)(119896+1)119894

gt1205831205741

(119905119896

119894)2

minus 119908119896

119894

(2)(119896+1)

119894

=

min119897120583

(119905119896

119894)2

+ 119908119896

119894

if (2)(119896+1)119894

lt min119897120583

(119905119896

119894)2

+ 119908119896

119894

(2)(119896+1)

119894 if min119897

120583

(119905119896

119894)2

+ 119908119896

119894

le (2)(119896+1)

119894le

1205831205741

(119905119896

119894)2

+ 119908119896

119894

1205831205741

(119905119896

119894)2

+ 119908119896

119894

if (2)(119896+1)119894

gt1205831205741

(119905119896

119894)2

+ 119908119896

119894

(3)(119896+1)

119894=

min119897120583

119905119896

119894

if (3)(119896+1)119894

lt min119897120583

119905119896

119894

(3)(119896+1)

119894 if min119897

120583

119905119896

119894

le (3)(119896+1)

119894le

1205831205741

119905119896

119894

1205831205741

119905119896

119894

if (3)(119896+1)119894

gt1205831205741

119905119896

119894

(4)(119896+1)

119894=

min119897120583

119901119896

119894

if (4)(119896+1)119894

lt min119897120583

119901119896

119894

(4)(119896+1)

119894 if min119897

120583

119901119896

119894

le (4)(119896+1)

119894le

1205831205741

119901119896

119894

1205831205741

119901119896

119894

if (4)(119896+1)119894

gt1205831205741

119901119896

119894

(24)

Journal of Applied Mathematics 7

where 119901119894= 119910119894(119908119879119909119894+ 119887) + 120585

119894minus 1

(5)(119896+1)

119894=

min119897120583

120585119896

119894

if (5)(119896+1)119894

lt min119897120583

120585119896

119894

(5)(119896+1)

119894 if min119897

120583

120585119896

119894

le (5)(119896+1)

119894le

1205831205741

120585119896

119894

1205831205741

120585119896

119894

if (5)(119896+1)119894

gt1205831205741

120585119896

119894

(25)

where the parameters 119897 and 1205741satisfy 0 lt 119897 120574

1gt 1

Since positive definiteness of the matrix 119878 is demanded inthis method the Lagrangian multipliers

119896+1 should satisfythe following condition

120582(3)

minus 2 (120582(1)

+ 120582(2)

) 119905 ge 0 (26)

For the sake of simplicity the proof is given in Appendix BTherefore if 119896+1 satisfies (26) we let 120582119896+1 =

119896+1 Other-wise we would further update it by the following setting

120582(1)(119896+1)

= 1205742(1)(119896+1)

120582(2)(119896+1)

= 1205742(2)(119896+1)

120582(3)(119896+1)

= 1205743(3)(119896+1)

(27)

where constants 1205742isin (0 1) and 120574

3ge 1 satisfy

1205743

1205742

= max119894isin119864

2119905119896+1

119894((1)(119896+1)

119894+ (2)(119896+1)

119894)

(3)(119896+1)

119894

(28)

with 119864 = 1 2 119899 It is not difficult to see that the vector((1)(119896+1)

(2)(119896+1)

(3)(119896+1)

) determined by (27) satisfies (26)In practice the KKT conditions (15) are allowed to be

satisfied within a tolerance 120598120583 It turns to be that the iterative

process stops while the following inequalities meet

\[
\operatorname{Res}(z, \lambda, \mu) =
\left\|
\begin{pmatrix}
\lambda^{(1)} - \lambda^{(2)} - X^T Y^T \lambda^{(4)}\\
C e_n - \lambda^{(4)} - \lambda^{(5)}\\
e_m - 2T\lambda^{(1)} - 2T\lambda^{(2)} - \lambda^{(3)}\\
e_n^T Y \lambda^{(4)}\\
(T^2 - W)\lambda^{(1)} - \mu e_m\\
(T^2 + W)\lambda^{(2)} - \mu e_m\\
T\lambda^{(3)} - \mu e_m\\
P\lambda^{(4)} - \mu e_n\\
\Xi\lambda^{(5)} - \mu e_n
\end{pmatrix}
\right\|
< \epsilon_\mu, \tag{29}
\]
\[
\lambda \ge -\epsilon_\mu e, \tag{30}
\]

where $\epsilon_\mu$ is related to the current barrier parameter $\mu$ and satisfies $\epsilon_\mu \downarrow 0$ as $\mu \to 0$.
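For illustration, the stopping test (29)-(30) could be evaluated by a routine of the following form; the way the nine residual blocks are stacked and all variable names are our own choices, and $X$ is assumed to be the $n \times m$ data matrix with labels $y \in \{-1, 1\}^n$.

```python
import numpy as np

def residual_norm(w, xi, t, b, lam, mu, X, y, C):
    """Euclidean norm of the perturbed KKT residual of (29).
    lam = (lam1, lam2, lam3, lam4, lam5) with block sizes (m, m, m, n, n)."""
    lam1, lam2, lam3, lam4, lam5 = lam
    m, n = w.size, xi.size
    p = y * (X @ w + b) + xi - 1.0                        # p_i = y_i(w^T x_i + b) + xi_i - 1
    blocks = [
        lam1 - lam2 - X.T @ (y * lam4),                   # stationarity in w
        C * np.ones(n) - lam4 - lam5,                     # stationarity in xi
        np.ones(m) - 2 * t * lam1 - 2 * t * lam2 - lam3,  # stationarity in t
        np.array([y @ lam4]),                             # stationarity in b
        (t ** 2 - w) * lam1 - mu,                         # perturbed complementarity
        (t ** 2 + w) * lam2 - mu,
        t * lam3 - mu,
        p * lam4 - mu,
        xi * lam5 - mu,
    ]
    return np.linalg.norm(np.concatenate(blocks))

# the inner iteration stops once residual_norm(...) < eps_mu
# and every multiplier is >= -eps_mu, as required by (29)-(30)
```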

Since the function $\Phi_\mu$ in (14) is convex, we have the following lemma, which shows that Algorithm 2 is well defined (the proof is given in Appendix B).

Lemma 3. Let $z^k$ be strictly feasible for problem (10). If $\Delta z^k = 0$, then (23) is satisfied for all $\alpha_k \ge 0$. If $\Delta z^k \ne 0$, then there exists an $\bar{\alpha}_k \in (0, 1]$ such that (23) holds for all $\alpha_k \in (0, \bar{\alpha}_k]$.

The proposed interior point method successively solves the barrier subproblem (14) with a decreasing sequence $\{\mu_j\}$. We simply reduce both $\epsilon_\mu$ and $\mu$ by a constant factor $\beta \in (0, 1)$. Finally, we test optimality for problem (10) by means of the residual norm $\operatorname{Res}(z, \lambda, 0)$. Here we present the whole algorithm to solve the $L_{1/2}$-SVM problem (10).

Algorithm 4. Algorithm for solving the $L_{1/2}$-SVM problem (10) is as follows.

Step 0. Choose $w^0 \in R^m$, $b^0 \in R$, $\xi^0 \in R^n$, $\lambda^0 = \tilde{\lambda}^0 > 0$, and $t^0_i \ge \sqrt{|w^0_i|} + 1/2$, $i \in \{1, 2, \ldots, m\}$. Given constants $\mu_0 > 0$, $\epsilon_{\mu_0} > 0$, $\beta \in (0, 1)$, and $\epsilon > 0$, let $j = 0$.

Step 1. Stop if $\operatorname{Res}(z^j, \lambda^j, 0) \le \epsilon$ and $\lambda^j \ge 0$.

Step 2. Starting from $(z^j, \lambda^j)$, apply Algorithm 2 to solve (14) with barrier parameter $\mu_j$ and stopping tolerance $\epsilon_{\mu_j}$. Set $z^{j+1} = z^{j,k}$, $\tilde{\lambda}^{j+1} = \tilde{\lambda}^{j,k}$, and $\lambda^{j+1} = \lambda^{j,k}$.

Step 3. Set $\mu_{j+1} = \beta\mu_j$ and $\epsilon_{\mu_{j+1}} = \beta\epsilon_{\mu_j}$. Let $j = j + 1$ and go to Step 1.

In Algorithm 4 the index $j$ denotes an outer iteration, while $k$ denotes the last inner iteration of Algorithm 2.
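A compact sketch of the outer loop of Algorithm 4 is given below. Here `solve_barrier_subproblem` stands for Algorithm 2 (the damped Newton iteration on (14)) and `res0` for the residual norm Res(z, lambda, 0) of (29); both are assumed rather than implemented, and packing z = (w, xi, t, b) and all multiplier blocks into single objects (lam as one flat NumPy vector) is our own convention, not the paper's notation.

```python
def l12_svm_outer_loop(z0, lam0, mu0, eps_mu0, beta=0.6, eps=1e-6,
                       solve_barrier_subproblem=None, res0=None):
    """Outer loop of Algorithm 4: solve a sequence of barrier subproblems (14),
    reducing the barrier parameter mu and the inner tolerance eps_mu by beta."""
    z, lam, mu, eps_mu = z0, lam0, mu0, eps_mu0
    while True:
        # Step 1: optimality test for the original problem (10)
        if res0(z, lam) <= eps and (lam >= 0).all():
            return z, lam
        # Step 2: approximately solve the barrier subproblem (14) by Algorithm 2
        z, lam = solve_barrier_subproblem(z, lam, mu, eps_mu)
        # Step 3: shrink the barrier parameter and the inner tolerance
        mu, eps_mu = beta * mu, beta * eps_mu
```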

The convergence of the proposed interior point method can be proved. We state the theorems here and give the proofs in Appendix C.

Theorem 5. Let $(z^k, \lambda^k)$ be generated by Algorithm 2. Then any limit point of the sequence $(z^k, \hat{\lambda}^k)$ generated by Algorithm 2 satisfies the first-order optimality conditions (15).

Theorem 6. Let $(z^j, \lambda^j)$ be generated by Algorithm 4, ignoring its termination condition. Then the following statements are true.

(i) Any limit point of $(z^j, \lambda^j)$ satisfies the first-order optimality conditions (13).

(ii) The limit point $z^*$ of a convergent subsequence $\{z^j\}_{\mathcal{J}} \subseteq \{z^j\}$ with unbounded multipliers $\{\lambda^j\}_{\mathcal{J}}$ is a Fritz-John point [35] of problem (10).

4. Experiments

In this section we tested the constrained optimization reformulation of the $L_{1/2}$-SVM and the proposed interior point method. We compared the performance of the $L_{1/2}$-SVM with the $L_2$-SVM [30], $L_1$-SVM [12], and $L_0$-SVM [12] on artificial data and ten UCI data sets (http://archive.ics.uci.edu/ml). These four problems were solved in the primal, referencing the machine learning toolbox Spider (http://people.kyb.tuebingen.mpg.de/spider). The $L_2$-SVM and $L_1$-SVM were solved directly by quadratic programming and linear programming, respectively. For the NP-hard problem $L_0$-SVM, a commonly cited approximation, Feature Selection Concave (FSV) [12], was applied, and the FSV problem was then solved by a Successive Linear Approximation (SLA) algorithm. All the experiments were run on a personal computer (1.6 GHz CPU, 4 GB RAM) with MATLAB R2010b on 64-bit Windows 7.

In the proposed interior point method (Algorithms 2 and 4), we set the parameters as $\beta = 0.6$, $\tau_1 = 0.5$, $l = 10^{-5}$, $\gamma_1 = 10^{5}$, and $\beta = 0.5$ (the value of $\beta$ appears twice because Algorithms 2 and 4 each use a constant $\beta \in (0,1)$). The balance parameter $C$ was selected by 5-fold cross-validation on the training set over the range $\{2^{-10}, 2^{-9}, \ldots, 2^{10}\}$. After training, the weights that did not satisfy the criterion $|w_j|/\max_i(|w_i|) \ge 10^{-4}$ [14] were set to zero. Then the cardinality of the hyperplane was computed as the number of nonzero weights.

4.1. Artificial Data. First we took an artificial binary linear classification problem as an example. The problem is similar to that in [13]. The labels $y = 1$ and $y = -1$ occur with equal probability. The first 6 features are relevant but redundant. In 70% of the samples, the first three features $x_1, x_2, x_3$ were drawn as $x_i = y N(i, 1)$ and the second three features $x_4, x_5, x_6$ as $x_i = N(0, 1)$. Otherwise, the first three were drawn as $x_i = N(0, 1)$ and the second three as $x_i = y N(i - 3, 1)$. The remaining features are noise, $x_i = N(0, 20)$, $i = 7, \ldots, m$. Here $m$ is the dimension of the input features. The inputs were scaled to zero mean and unit standard deviation. In each trial, 500 points were generated for testing, and the average results were estimated over 30 trials.
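For reproducibility, one possible NumPy implementation of this generator is sketched below. The text does not state whether the second argument of N(., .) for the noise features denotes the variance or the standard deviation; treating it as the standard deviation is our assumption.

```python
import numpy as np

def make_artificial_data(n, m, seed=None):
    """Generate the artificial binary data of Section 4.1 (similar to [13])."""
    rng = np.random.default_rng(seed)
    y = rng.choice([-1.0, 1.0], size=n)            # P(y = 1) = P(y = -1) = 1/2
    X = rng.normal(0.0, 20.0, size=(n, m))         # noise features x_7, ..., x_m
    for k in range(n):
        if rng.random() < 0.7:                     # with probability 0.7
            X[k, 0:3] = y[k] * rng.normal([1.0, 2.0, 3.0], 1.0)  # x_i = y N(i, 1)
            X[k, 3:6] = rng.normal(0.0, 1.0, size=3)             # x_i = N(0, 1)
        else:
            X[k, 0:3] = rng.normal(0.0, 1.0, size=3)             # x_i = N(0, 1)
            X[k, 3:6] = y[k] * rng.normal([1.0, 2.0, 3.0], 1.0)  # x_i = y N(i - 3, 1)
    X = (X - X.mean(axis=0)) / X.std(axis=0)       # scale to zero mean, unit deviation
    return X, y
```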

In the first experiment we consider the cases with a fixed feature size $m = 30$ and different training sample sizes $n = 10, 20, \ldots, 100$. The average results over the 30 trials are shown in Table 1 and Figure 2. Figure 2 (left) plots the average cardinality of each classifier. Since the artificial data sets have 2 relevant and nonredundant features, the ideal average cardinality is 2. Figure 2 (left) shows that the three sparse SVMs, $L_1$-SVM, $L_{1/2}$-SVM, and $L_0$-SVM, can achieve sparse solutions, while the $L_2$-SVM uses almost all features on each data set. Furthermore, the solutions of the $L_{1/2}$-SVM and $L_0$-SVM are much sparser than those of the $L_1$-SVM. As shown in Table 1, the $L_1$-SVM selects more than 6 features in all cases, which implies that some redundant or irrelevant features are selected. The average cardinalities of the $L_{1/2}$-SVM and $L_0$-SVM are similar and close to 2. However, when $n = 10$ and $20$, the $L_0$-SVM has average cardinalities of 1.42 and 1.87, respectively. This means that the $L_0$-SVM sometimes selects only one feature on small samples and may ignore some truly relevant features. Consequently, with cardinalities between 2.05 and 2.9, the $L_{1/2}$-SVM gives a more reliable solution than the $L_0$-SVM. In short, as far as the number of selected features is concerned, the $L_{1/2}$-SVM behaves better than the other three methods.

Figure 2 (right) plots the prediction accuracy versus the size of the training sample. The classification performance of all methods generally improves as the training sample size $n$ increases. The $L_1$-SVM has the best prediction performance in all cases, slightly better than the $L_{1/2}$-SVM. The $L_{1/2}$-SVM is more accurate than the $L_2$-SVM and $L_0$-SVM, especially for $n = 10, \ldots, 50$. As shown in Table 1, when there are only 10 training samples, the average accuracy of the $L_{1/2}$-SVM is 88.05%, while the results of the $L_2$-SVM and $L_0$-SVM are 84.65% and 77.65%, respectively. Compared with the $L_2$-SVM and $L_0$-SVM, the $L_{1/2}$-SVM thus increases the average accuracy by 3.4% and 10.4%, respectively, which can be explained as follows. For the $L_2$-SVM, all features are selected without discrimination, and the prediction is misled by the irrelevant features. For the $L_0$-SVM, few features are selected and some relevant features are not included, which has a negative impact on the prediction result. As the tradeoff between the $L_2$-SVM and the $L_0$-SVM, the $L_{1/2}$-SVM performs better than both.

The average results over the ten artificial data sets in the first experiment are shown at the bottom of Table 1. On average, the accuracy of the $L_{1/2}$-SVM is 0.87% lower than that of the $L_1$-SVM, while the number of features selected by the $L_{1/2}$-SVM is 74.14% smaller. This indicates that the $L_{1/2}$-SVM can achieve a much sparser solution than the $L_1$-SVM at little cost in accuracy. Moreover, the average accuracy of the $L_{1/2}$-SVM over the 10 data sets is 2.22% higher than that of the $L_0$-SVM with similar cardinality. To sum up, the $L_{1/2}$-SVM provides the best balance between accuracy and sparsity among the three sparse SVMs.

To further evaluate the feature selection performance of the $L_{1/2}$-SVM, we investigate whether the features are correctly selected. Since the $L_2$-SVM is not designed for feature selection, it is not included in this comparison. Since our artificial data sets have 2 best features ($x_3$, $x_6$), the best result should rank these two features on top according to the absolute values of their weights $|w_j|$. In the experiment we select the top 2 features with the maximal $|w_j|$ for each method and count how often the top 2 features are $x_3$ and $x_6$ in 30 runs. The results are listed in Table 2. When the training sample size is too small, it is difficult for all sparse SVMs to discriminate the two most important features. For example, when $n = 10$, the selected frequencies of the $L_1$-SVM, $L_0$-SVM, and $L_{1/2}$-SVM are 7, 3, and 9, respectively. As $n$ increases, all methods tend to make more correct selections. Moreover, Table 2 shows that the $L_{1/2}$-SVM outperforms the other two methods in all cases. For example, when $n = 100$, the selected frequencies of the $L_1$-SVM and $L_0$-SVM are 22 and 25, respectively, while the result of the $L_{1/2}$-SVM is 27. The $L_1$-SVM selects too many redundant or irrelevant features, which may influence the ranking to some extent. Therefore, the $L_1$-SVM is not as good as the $L_{1/2}$-SVM at distinguishing the critical features. The $L_0$-SVM has a lower hit frequency than the $L_{1/2}$-SVM, which is probably due to the excessively small feature subset it obtains. Above all, Tables 1 and 2 and Figure 2 clearly show that the $L_{1/2}$-SVM is a promising sparsity-driven classification method.

In the second simulation we consider the cases with

various dimensions of the feature space, $m = 20, 40, \ldots, 180, 200$, and a fixed training sample size $n = 100$. The average results over 30 trials are shown in Figure 3 and Table 3. Since there are only 6 relevant features, a larger $m$ means more noisy features. Figure 3 (left) shows that, as the dimension increases from 20 to 200, the number of features selected by the $L_1$-SVM increases from 8.26 to 23.1. However, the cardinalities of the $L_{1/2}$-SVM and $L_0$-SVM remain stable (from 2.2 to 2.95). This indicates that the $L_{1/2}$-SVM and $L_0$-SVM are more suitable for feature selection than the $L_1$-SVM.

Figure 3 (right) shows that, as the number of noise features increases, the accuracy of the $L_2$-SVM drops significantly (from 98.68% to 87.47%).

Figure 2: Results comparison on 10 artificial data sets with various training samples. (Left) average cardinality versus the number of training samples; (right) classification accuracy versus the number of training samples; curves are shown for $L_2$-SVM, $L_1$-SVM, $L_0$-SVM, and $L_{1/2}$-SVM.

Table 1: Results comparison on 10 artificial data sets with various training sample sizes.

n   | L2-SVM (Card, Acc) | L1-SVM (Card, Acc) | L0-SVM (Card, Acc) | L1/2-SVM (Card, Acc)
10  | 29.97, 84.65 | 6.93, 88.55  | 1.43, 77.65 | 2.25, 88.05
20  | 30.00, 92.21 | 8.83, 96.79  | 1.87, 93.17 | 2.35, 95.50
30  | 30.00, 94.27 | 8.47, 98.23  | 2.03, 94.67 | 2.05, 97.41
40  | 29.97, 95.99 | 9.00, 98.75  | 2.07, 95.39 | 2.10, 97.75
50  | 30.00, 96.61 | 9.50, 98.91  | 2.13, 96.57 | 2.45, 97.68
60  | 29.97, 97.25 | 10.00, 98.94 | 2.20, 97.61 | 2.45, 97.98
70  | 30.00, 97.41 | 11.00, 98.98 | 2.20, 97.56 | 2.40, 98.23
80  | 30.00, 97.68 | 10.03, 99.09 | 2.23, 97.61 | 2.60, 98.50
90  | 29.97, 97.89 | 9.20, 99.21  | 2.30, 97.41 | 2.90, 98.30
100 | 30.00, 98.13 | 10.27, 99.09 | 2.50, 97.93 | 2.55, 98.36
On average | 29.99, 95.21 | 9.32, 97.65 | 2.10, 94.56 | 2.41, 96.78

"n" is the number of training samples, "Card" is the number of selected features, and "Acc" is the classification accuracy (%).

Table 2: Frequency of the most important features ($x_3$, $x_6$) ranked in the top 2 in 30 runs.

n        | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100
L1-SVM   | 7  | 14 | 16 | 18 | 22 | 23 | 20 | 22 | 21 | 22
L0-SVM   | 3  | 11 | 12 | 15 | 19 | 22 | 22 | 23 | 22 | 25
L1/2-SVM | 9  | 16 | 19 | 24 | 24 | 23 | 27 | 26 | 26 | 27

Table 3: Average results over 10 artificial data sets with various dimensions.

            | L2-SVM | L1-SVM | L0-SVM | L1/2-SVM
Cardinality | 109.95 | 15.85  | 2.38   | 2.41
Accuracy    | 92.93  | 99.07  | 97.76  | 98.26

Table 4: Feature selection performance comparison on UCI data sets (cardinality).

No. | UCI data set   | n    | m   | L2-SVM Mean (Std) | L1-SVM Mean (Std) | L0-SVM Mean (Std) | L1/2-SVM Mean (Std)
1   | Pima diabetes  | 768  | 8   | 8.00 (0.00)   | 7.40 (0.89)   | 6.20 (1.10)   | 7.60 (1.52)
2   | Breast Cancer  | 683  | 9   | 9.00 (0.00)   | 8.60 (1.22)   | 6.40 (2.07)   | 8.40 (1.67)
3   | Wine (3)       | 178  | 13  | 13.00 (0.00)  | 9.73 (0.69)   | 3.73 (0.28)   | 4.07 (0.49)
4   | Image (7)      | 2100 | 19  | 19.00 (0.00)  | 8.20 (0.53)   | 3.66 (0.97)   | 5.11 (0.62)
5   | SPECT          | 267  | 22  | 22.00 (0.00)  | 17.40 (5.27)  | 9.80 (4.27)   | 9.00 (1.17)
6   | WDBC           | 569  | 30  | 30.00 (0.00)  | 9.80 (1.52)   | 9.00 (3.03)   | 5.60 (2.05)
7   | Ionosphere     | 351  | 34  | 33.20 (0.45)  | 25.00 (3.35)  | 24.60 (10.21) | 21.80 (8.44)
8   | SPECTF         | 267  | 44  | 44.00 (0.00)  | 38.80 (10.98) | 24.00 (11.32) | 25.60 (7.75)
9   | Sonar          | 208  | 60  | 60.00 (0.00)  | 31.40 (13.05) | 27.00 (7.18)  | 13.60 (10.06)
10  | Musk1          | 476  | 166 | 166.00 (0.00) | 85.80 (18.85) | 48.60 (4.28)  | 41.00 (14.16)
Average cardinality |  |  | 40.50 | 40.42 | 24.21 | 16.30 | 14.18

Table 5: Classification performance comparison on UCI data sets (accuracy, %).

No. | UCI data set  | L2-SVM Mean (Std) | L1-SVM Mean (Std) | L0-SVM Mean (Std) | L1/2-SVM Mean (Std)
1   | Pima diabetes | 76.99 (3.47) | 76.73 (0.85) | 76.73 (2.94) | 77.12 (1.52)
2   | Breast cancer | 96.18 (2.23) | 97.06 (0.96) | 96.32 (1.38) | 96.47 (1.59)
3   | Wine (3)      | 97.71 (3.73) | 98.29 (1.28) | 93.14 (2.39) | 97.14 (2.56)
4   | Image (7)     | 88.94 (1.49) | 91.27 (1.77) | 86.67 (4.58) | 91.36 (1.94)
5   | SPECT         | 83.40 (2.15) | 78.49 (2.31) | 78.11 (3.38) | 82.34 (5.10)
6   | WDBC          | 96.11 (2.11) | 96.46 (1.31) | 95.58 (1.15) | 96.46 (1.31)
7   | Ionosphere    | 83.43 (5.44) | 88.00 (3.70) | 84.57 (5.38) | 87.43 (5.94)
8   | SPECTF        | 78.49 (3.91) | 74.34 (4.50) | 73.87 (3.25) | 78.49 (1.89)
9   | Sonar         | 73.66 (5.29) | 75.61 (5.82) | 74.15 (7.98) | 76.10 (4.35)
10  | Musk1         | 84.84 (1.73) | 82.11 (2.42) | 76.00 (5.64) | 82.32 (3.38)
Average accuracy   | 85.97 | 85.84 | 83.51 | 86.52

On the contrary, for the other three sparse SVMs there is little change in accuracy. This reveals that SVMs can benefit from feature reduction.

Table 3 shows the average results over all data sets in the second experiment. On average, the solution of the $L_{1/2}$-SVM is much sparser than that of the $L_1$-SVM, with slightly better accuracy than the $L_0$-SVM.

4.2. UCI Data Sets. We further tested the reformulation and the proposed interior point method for the $L_{1/2}$-SVM on 10 UCI data sets [36]. There are 8 binary classification problems and 2 multiclass problems (wine, image). Each feature of the input data was normalized to zero mean and unit variance, and the instances with missing values were deleted. Then the data was randomly split into a training set (80%) and a testing set (20%). For the two multiclass problems, a one-against-rest method was applied to construct a binary classifier for each class. We repeated the training and testing procedure 10 times, and the average results are shown in Tables 4 and 5 and Figure 4.

Tables 4 and 5 summarize the feature selection and classification performance of the numerical experiments on the UCI data sets, respectively. Here $n$ is the number of samples and $m$ is the number of input features. For the two multiclass data sets, the numbers of classes are marked behind their names. Sparsity is defined as card/$m$, and a small value of sparsity is preferred. The data sets are arranged in ascending order of dimension. The lowest cardinality and the best accuracy rate for each problem are marked in bold in the original tables.

As shown in Tables 4 and 5, the three sparse SVMs encourage sparsity on all data sets while keeping roughly the same accuracy as the $L_2$-SVM. Among the three sparse methods, the $L_{1/2}$-SVM has the lowest average cardinality (14.18) and the highest average classification accuracy (86.52%), while the $L_1$-SVM has the worst feature selection performance, with the highest average cardinality (24.21), and the $L_0$-SVM has the lowest average classification accuracy (83.51%).

Figure 4 plots the sparsity (left) and the classification accuracy (right) of each classifier on the UCI data sets. In three data sets (6, 9, 10) the $L_{1/2}$-SVM has the best performance in both feature selection and classification among the three sparse SVMs. Compared with the $L_1$-SVM, the $L_{1/2}$-SVM achieves a sparser solution in nine data sets.


Figure 3: Results comparison on artificial data sets with various dimensions. (Left) average cardinality versus the input dimension; (right) classification accuracy versus the input dimension; curves are shown for $L_2$-SVM, $L_1$-SVM, $L_0$-SVM, and $L_{1/2}$-SVM.

For example, in the data set "8 SPECTF", the number of features selected by the $L_{1/2}$-SVM is 34% smaller than that of the $L_1$-SVM; at the same time, the accuracy of the $L_{1/2}$-SVM is 4.1% higher. In most data sets (4, 5, 6, 9, 10), the cardinality of the $L_{1/2}$-SVM drops significantly (by at least 37%) with an equal or slightly better accuracy. Only in the data set "3 Wine" is the accuracy of the $L_{1/2}$-SVM decreased, by 1.1%, but the sparsity provided by the $L_{1/2}$-SVM leads to a 58.2% improvement over the $L_1$-SVM. In the remaining three data sets (1, 2, 7), the two methods have similar results in feature selection and classification. As seen above, the $L_{1/2}$-SVM can provide a lower-dimensional representation than the $L_1$-SVM with competitive prediction performance.

Figure 4 (right) shows that, compared with the $L_0$-SVM, the $L_{1/2}$-SVM improves the classification accuracy on all data sets. For instance, in four data sets (3, 4, 8, 10) the classification accuracy of the $L_{1/2}$-SVM is at least 4.0% higher than that of the $L_0$-SVM. Especially in the data set "10 Musk1", the $L_{1/2}$-SVM gives a 6.3% rise in accuracy over the $L_0$-SVM. Meanwhile, it can be observed from Figure 4 (left) that the $L_{1/2}$-SVM selects fewer features than the $L_0$-SVM in five data sets (5, 6, 7, 9, 10). For example, in data sets 6 and 9 the cardinalities of the $L_{1/2}$-SVM are 37.8% and 49.6% smaller than those of the $L_0$-SVM, respectively. In summary, the $L_{1/2}$-SVM presents better classification performance than the $L_0$-SVM while remaining effective in choosing relevant features.

5. Conclusions

In this paper we proposed an $L_{1/2}$ regularization technique for simultaneous feature selection and classification in the SVM. We reformulated the $L_{1/2}$-SVM into an equivalent smooth constrained optimization problem. The problem possesses a very simple structure for which numerical methods are relatively easy to develop. By the use of this reformulation, we proposed an interior point method and established its convergence. Our numerical results support the reformulation and the proposed method. The $L_{1/2}$-SVM obtains sparser solutions than the $L_1$-SVM with comparable classification accuracy. Furthermore, the $L_{1/2}$-SVM achieves more accurate classification results than the $L_0$-SVM (FSV).

Inspired by the good performance of the smooth optimization reformulation of the $L_{1/2}$-SVM, there are some interesting topics deserving further research. For example, developing more efficient algorithms for solving the reformulation, studying the nonlinear $L_{1/2}$-SVM, and exploring various applications of the $L_{1/2}$-SVM to further validate its effectiveness are interesting research topics. Some of them are under our current investigation.

Appendix

A. Properties of the Reformulation of the L1/2-SVM

Let $D$ be the set of all feasible points of problem (10). For any $z \in D$, we let $FD(z, D)$ and $LFD(z, D)$ be the sets of all feasible directions and linearized feasible directions of $D$ at $z$. Since the constraint functions of (10) are all convex, we immediately have the following lemma.

Lemma A.1. For any feasible point $z$ of (10), one has
\[
FD(z, D) = LFD(z, D). \tag{A.1}
\]


Figure 4: Results comparison on UCI data sets. (Left) sparsity versus data set number; (right) classification accuracy versus data set number; curves are shown for $L_2$-SVM, $L_1$-SVM, $L_0$-SVM, and $L_{1/2}$-SVM.

Based on the above lemma, we can easily derive a first-order necessary condition for (10). The Lagrangian function of (10) is
\[
\begin{aligned}
L(w, t, \xi, b, \lambda, \mu, \nu, p, q)
= {}& C\sum_{i=1}^{n}\xi_i + \sum_{j=1}^{m}t_j
- \sum_{j=1}^{m}\lambda_j\left(t_j^2 - w_j\right)
- \sum_{j=1}^{m}\mu_j\left(t_j^2 + w_j\right) \\
& - \sum_{j=1}^{m}\nu_j t_j
- \sum_{i=1}^{n}p_i\left(y_i w^T x_i + y_i b + \xi_i - 1\right)
- \sum_{i=1}^{n}q_i\xi_i,
\end{aligned}
\tag{A.2}
\]
where $\lambda, \mu, \nu, p, q$ are the Lagrangian multipliers. By the use of Lemma A.1, we immediately have the following theorem about the first-order necessary condition.

Theorem A.2. Let $(w, t, \xi, b)$ be a local solution of (10). Then there are Lagrangian multipliers $(\lambda, \mu, \nu, p, q) \in R^{m+m+m+n+n}$ such that the following KKT conditions hold:
\[
\begin{gathered}
\frac{\partial L}{\partial w} = \lambda - \mu - \sum_{i=1}^{n}p_i y_i x_i = 0, \\
\frac{\partial L}{\partial t_j} = 1 - 2\lambda_j t_j - 2\mu_j t_j - \nu_j = 0, \quad j = 1, 2, \ldots, m, \\
\frac{\partial L}{\partial \xi_i} = C - p_i - q_i = 0, \quad i = 1, 2, \ldots, n, \\
\frac{\partial L}{\partial b} = \sum_{i=1}^{n}p_i y_i = 0, \\
\lambda_j \ge 0, \quad t_j^2 - w_j \ge 0, \quad \lambda_j\left(t_j^2 - w_j\right) = 0, \quad j = 1, 2, \ldots, m, \\
\mu_j \ge 0, \quad t_j^2 + w_j \ge 0, \quad \mu_j\left(t_j^2 + w_j\right) = 0, \quad j = 1, 2, \ldots, m, \\
\nu_j \ge 0, \quad t_j \ge 0, \quad \nu_j t_j = 0, \quad j = 1, 2, \ldots, m, \\
p_i \ge 0, \quad y_i\left(w^T x_i + b\right) + \xi_i - 1 \ge 0, \quad p_i\left(y_i\left(w^T x_i + b\right) + \xi_i - 1\right) = 0, \quad i = 1, 2, \ldots, n, \\
q_i \ge 0, \quad \xi_i \ge 0, \quad q_i\xi_i = 0, \quad i = 1, 2, \ldots, n.
\end{gathered}
\tag{A.3}
\]

The following theorem shows that the level set is bounded.

Theorem A.3. For any given constant $c > 0$, the level set
\[
W_c = \{z = (w, t, \xi) \mid f(z) \le c\} \cap D \tag{A.4}
\]
is bounded.

Proof. For any $(w, t, \xi) \in W_c$, we have
\[
\sum_{i=1}^{m} t_i + C\sum_{j=1}^{n}\xi_j \le c. \tag{A.5}
\]
Combining this with $t_i \ge 0$, $i = 1, \ldots, m$, and $\xi_j \ge 0$, $j = 1, \ldots, n$, we have
\[
0 \le t_i \le c, \quad i = 1, \ldots, m, \qquad
0 \le \xi_j \le \frac{c}{C}, \quad j = 1, \ldots, n. \tag{A.6}
\]
Moreover, for any $(w, t, \xi) \in D$,
\[
|w_i| \le t_i^2, \quad i = 1, \ldots, m. \tag{A.7}
\]
Consequently, $w$, $t$, and $\xi$ are bounded. What is more, from the condition
\[
y_i\left(w^T x_i + b\right) \ge 1 - \xi_i, \quad i = 1, \ldots, n, \tag{A.8}
\]
we have
\[
\begin{gathered}
w^T x_i + b \ge 1 - \xi_i, \quad \text{if } y_i = 1, \\
w^T x_i + b \le -1 + \xi_i, \quad \text{if } y_i = -1.
\end{gathered}
\tag{A.9}
\]
Thus, if the feasible region $D$ is not empty, then we have
\[
\max_{y_i = 1}\left(1 - \xi_i - w^T x_i\right) \le b \le \min_{y_i = -1}\left(-1 + \xi_i - w^T x_i\right). \tag{A.10}
\]
Hence $b$ is also bounded. The proof is complete.

B. Proof of Lemma 3

Lemma 3 (see Section 3) shows that Algorithm 2 is well defined. We first introduce the following proposition, which shows that the matrix $S$ defined by (21) is always positive definite and thus ensures that Algorithm 2 is well defined.

Proposition B.1. Let the inequality (26) hold. Then the matrix $S$ defined by (21) is positive definite.

Proof. By an elementary deduction we have, for any $v^{(1)} \in R^m$, $v^{(2)} \in R^n$, $v^{(3)} \in R^m$, $v^{(4)} \in R$ with $(v^{(1)}, v^{(2)}, v^{(3)}, v^{(4)}) \ne 0$,
\[
\begin{aligned}
&\left(v^{(1)T}, v^{(2)T}, v^{(3)T}, v^{(4)T}\right) S
\begin{pmatrix} v^{(1)} \\ v^{(2)} \\ v^{(3)} \\ v^{(4)} \end{pmatrix}
= \sum_{r=1}^{4}\sum_{s=1}^{4} v^{(r)T} S_{rs}\, v^{(s)} \\
&\quad = v^{(1)T}\left(U + V + X^T Y^T P^{-1}D_4 Y X\right)v^{(1)}
+ v^{(1)T}\left(X^T Y^T P^{-1}D_4\right)v^{(2)}
- 2 v^{(1)T}(U - V)T\, v^{(3)}
+ v^{(1)T}\left(X^T Y^T P^{-1}D_4\, y\right)v^{(4)} \\
&\qquad + v^{(2)T}P^{-1}D_4 Y X\, v^{(1)}
+ v^{(2)T}\left(P^{-1}D_4 + \Xi^{-1}D_5\right)v^{(2)}
+ v^{(2)T}P^{-1}D_4\, y\, v^{(4)}
- 2 v^{(3)T}(U - V)T\, v^{(1)} \\
&\qquad + v^{(3)T}\left(4T(U + V)T + T^{-1}D_3 - 2(D_1 + D_2)\right)v^{(3)}
+ v^{(4)T}y^T P^{-1}D_4 Y X\, v^{(1)}
+ v^{(4)T}y^T P^{-1}D_4\, v^{(2)}
+ v^{(4)T}y^T P^{-1}D_4\, y\, v^{(4)} \\
&\quad = e_m^T U\left(D_{v_1} - 2TD_{v_3}\right)^2 e_m
+ e_m^T V\left(D_{v_1} + 2TD_{v_3}\right)^2 e_m
+ v^{(2)T}\Xi^{-1}D_5\, v^{(2)}
+ v^{(3)T}\left(T^{-1}D_3 - 2D_1 - 2D_2\right)v^{(3)} \\
&\qquad + \left(YXv^{(1)} + v^{(2)} + v^{(4)}y\right)^T P^{-1}D_4\left(YXv^{(1)} + v^{(2)} + v^{(4)}y\right) \\
&\quad > 0,
\end{aligned}
\tag{B.1}
\]
where $D_{v_1} = \operatorname{diag}(v^{(1)})$, $D_{v_2} = \operatorname{diag}(v^{(2)})$, $D_{v_3} = \operatorname{diag}(v^{(3)})$, and $D_{v_4} = \operatorname{diag}(v^{(4)})$.


B.1. Proof of Lemma 3

Proof. It is easy to see that if $\Delta z^k = 0$, then (23) holds for all $\alpha_k \ge 0$.

Suppose $\Delta z^k \ne 0$. We have
\[
\nabla_w\Phi_\mu(z^k)^T\Delta w^k + \nabla_t\Phi_\mu(z^k)^T\Delta t^k
+ \nabla_\xi\Phi_\mu(z^k)^T\Delta\xi^k + \nabla_b\Phi_\mu(z^k)^T\Delta b^k
= -\left(\Delta w^{kT}, \Delta\xi^{kT}, \Delta t^{kT}, \Delta b^k\right) S
\begin{pmatrix} \Delta w^k \\ \Delta\xi^k \\ \Delta t^k \\ \Delta b^k \end{pmatrix}.
\tag{B.2}
\]
Since the matrix $S$ is positive definite and $\Delta z^k \ne 0$, the last equation implies
\[
\nabla_w\Phi_\mu(z^k)^T\Delta w^k + \nabla_t\Phi_\mu(z^k)^T\Delta t^k
+ \nabla_\xi\Phi_\mu(z^k)^T\Delta\xi^k + \nabla_b\Phi_\mu(z^k)^T\Delta b^k < 0.
\tag{B.3}
\]
Consequently, there exists an $\bar{\alpha}_k \in (0, 1]$ such that the fourth inequality in (23) is satisfied for all $\alpha_k \in (0, \bar{\alpha}_k]$.

On the other hand, since $z^k$ is strictly feasible, the point $z^k + \alpha\Delta z^k$ remains feasible for all $\alpha > 0$ sufficiently small. The proof is complete.

C. Convergence Analysis

This appendix is devoted to the global convergence of the interior point method. We first show the convergence of Algorithm 2 when a fixed $\mu$ is applied to the barrier subproblem (14).

Lemma C.1. Let $(z^k, \lambda^k)$ be generated by Algorithm 2. Then $\{z^k\}$ are strictly feasible for problem (10), and the Lagrangian multipliers $\{\lambda^k\}$ are bounded from above.

Proof. For the sake of convenience, we use $g_i(z)$, $i = 1, 2, \ldots, 3m + 2n$, to denote the constraint functions of the constrained problem (10).

We first show that $\{z^k\}$ are strictly feasible. Suppose on the contrary that there exist an infinite index subset $\mathcal{K}$ and an index $i \in \{1, 2, \ldots, 3m + 2n\}$ such that $\{g_i(z^k)\}_{\mathcal{K}} \downarrow 0$. By the definition of $\Phi_\mu(z)$ and the fact that $f(z) = \sum_{j=1}^{m}t_j + C\sum_{i=1}^{n}\xi_i$ is bounded from below on the feasible set, it must hold that $\{\Phi_\mu(z^k)\}_{\mathcal{K}} \to \infty$. However, the line search rule implies that the sequence $\{\Phi_\mu(z^k)\}$ is decreasing, so we get a contradiction. Consequently, for any $i \in \{1, 2, \ldots, 3m + 2n\}$, $g_i(z^k)$ is bounded away from zero. The boundedness of $\{\lambda^k\}$ then follows from (24)-(27).

Lemma C.2. Let $(z^k, \lambda^k)$ and $(\Delta z^k, \hat{\lambda}^{k+1})$ be generated by Algorithm 2. If $\{z^k\}_{\mathcal{K}}$ is a convergent subsequence of $\{z^k\}$, then the sequence $\{(\Delta z^k, \hat{\lambda}^{k+1})\}_{\mathcal{K}}$ is bounded.

Proof. Again we use $g_i(z)$, $i = 1, 2, \ldots, 3m + 2n$, to denote the constraint functions of the constrained problem (10). We suppose on the contrary that there exists an infinite subset $\mathcal{K}' \subseteq \mathcal{K}$ such that the subsequence $\{(\Delta z^k, \hat{\lambda}^{k+1})\}_{\mathcal{K}'}$ tends to infinity. Let $\{z^k\}_{\mathcal{K}'} \to z^*$. It follows from Lemma C.1 that there is an infinite subset $\bar{\mathcal{K}} \subseteq \mathcal{K}'$ such that $\{\lambda^k\}_{\bar{\mathcal{K}}} \to \lambda^*$ and $g_i(z^*) > 0$ for all $i \in \{1, 2, \ldots, 3m + 2n\}$. Thus we have
\[
S_k \to S^* \tag{C.1}
\]
as $k \to \infty$ with $k \in \bar{\mathcal{K}}$. By Proposition B.1, $S^*$ is positive definite. Since the right hand side of (18) is bounded and continuous, the unboundedness of $\{(\Delta z^k, \hat{\lambda}^{k+1})\}_{\bar{\mathcal{K}}}$ implies that the limit of the coefficient matrices of (18) is singular, which yields a contradiction. The proof is complete.

Lemma C.3. Let $(z^k, \lambda^k)$ and $(\Delta z^k, \hat{\lambda}^{k+1})$ be generated by Algorithm 2. If $\{z^k\}_{\mathcal{K}}$ is a convergent subsequence of $\{z^k\}$, then one has $\{\Delta z^k\}_{\mathcal{K}} \to 0$.

Proof. It follows from Lemma C.2 that the sequence $\{\Delta z^k\}_{\mathcal{K}}$ is bounded. Suppose on the contrary that there exists an infinite subset $\mathcal{K}' \subseteq \mathcal{K}$ such that $\{\Delta z^k\}_{\mathcal{K}'} \to \Delta z^* \ne 0$. Since the subsequences $\{z^k\}_{\mathcal{K}'}$, $\{\lambda^k\}_{\mathcal{K}'}$, and $\{\hat{\lambda}^k\}_{\mathcal{K}'}$ are all bounded, there are points $z^*$, $\lambda^*$, and $\hat{\lambda}^*$ as well as an infinite index set $\bar{\mathcal{K}} \subseteq \mathcal{K}'$ such that $\{z^k\}_{\bar{\mathcal{K}}} \to z^*$, $\{\lambda^k\}_{\bar{\mathcal{K}}} \to \lambda^*$, and $\{\hat{\lambda}^k\}_{\bar{\mathcal{K}}} \to \hat{\lambda}^*$. By Lemma C.1 we have $g_i(z^*) > 0$ for all $i \in \{1, 2, \ldots, 3m + 2n\}$. Similar to the proof of Lemma 3, it is not difficult to get
\[
\nabla_w\Phi_\mu(z^*)^T\Delta w^* + \nabla_t\Phi_\mu(z^*)^T\Delta t^*
+ \nabla_\xi\Phi_\mu(z^*)^T\Delta\xi^* + \nabla_b\Phi_\mu(z^*)^T\Delta b^*
= -\left(\Delta w^{*T}, \Delta\xi^{*T}, \Delta t^{*T}, \Delta b^*\right) S
\begin{pmatrix} \Delta w^* \\ \Delta\xi^* \\ \Delta t^* \\ \Delta b^* \end{pmatrix} < 0.
\tag{C.2}
\]
Since $g_i(z^*) > 0$ for all $i \in \{1, 2, \ldots, 3m + 2n\}$, there exists an $\bar{\alpha} \in (0, 1]$ such that, for all $\alpha \in (0, \bar{\alpha}]$,
\[
g_i(z^* + \alpha\Delta z^*) > 0, \quad \forall i \in \{1, 2, \ldots, 3m + 2n\}. \tag{C.3}
\]
Taking into account $\tau_1 \in (0, 1/2)$, we claim that there exists an $\hat{\alpha} \in (0, \bar{\alpha}]$ such that the following inequality holds for all $\alpha \in (0, \hat{\alpha}]$:
\[
\Phi_\mu(z^* + \alpha\Delta z^*) - \Phi_\mu(z^*) \le 1.1\,\tau_1\,\alpha\,\nabla_z\Phi_\mu(z^*)^T\Delta z^*. \tag{C.4}
\]
Let $m^* = \min\{j \mid \beta^j \in (0, \hat{\alpha}],\ j = 0, 1, \ldots\}$ and $\alpha^* = \beta^{m^*}$. It follows from (C.3) that the following inequality is satisfied for all $k \in \bar{\mathcal{K}}$ sufficiently large and any $i \in \{1, 2, \ldots, 3m + 2n\}$:
\[
g_i(z^k + \alpha^*\Delta z^k) > 0. \tag{C.5}
\]


Moreover, we have
\[
\begin{aligned}
\Phi_\mu(z^k + \alpha^*\Delta z^k) - \Phi_\mu(z^k)
&= \Phi_\mu(z^* + \alpha^*\Delta z^*) - \Phi_\mu(z^*)
+ \Phi_\mu(z^k + \alpha^*\Delta z^k) - \Phi_\mu(z^* + \alpha^*\Delta z^*)
+ \Phi_\mu(z^*) - \Phi_\mu(z^k) \\
&\le 1.05\,\tau_1\,\alpha^*\,\nabla_z\Phi_\mu(z^*)^T\Delta z^* \\
&\le \tau_1\,\alpha^*\,\nabla_z\Phi_\mu(z^k)^T\Delta z^k.
\end{aligned}
\tag{C.6}
\]
By the backtracking line search rule, the last inequality together with (C.5) yields $\alpha_k \ge \alpha^*$ for all $k \in \bar{\mathcal{K}}$ large enough. Consequently, when $k \in \bar{\mathcal{K}}$ is sufficiently large, we have from (23) and (C.3) that
\[
\begin{aligned}
\Phi_\mu(z^{k+1})
&\le \Phi_\mu(z^k) + \tau_1\,\alpha_k\,\nabla_z\Phi_\mu(z^k)^T\Delta z^k \\
&\le \Phi_\mu(z^k) + \tau_1\,\alpha^*\,\nabla_z\Phi_\mu(z^k)^T\Delta z^k \\
&\le \Phi_\mu(z^k) + \frac{1}{2}\,\tau_1\,\alpha^*\,\nabla_z\Phi_\mu(z^*)^T\Delta z^*.
\end{aligned}
\tag{C.7}
\]
This shows that $\{\Phi_\mu(z^k)\}_{\mathcal{K}} \to -\infty$, contradicting the fact that $\{\Phi_\mu(z^k)\}_{\mathcal{K}}$ is bounded from below. The proof is complete.

We can now establish the convergence of Algorithms 2 and 4, that is, Theorems 5 and 6 in Section 3.

C.1. Proof of Theorem 5

Proof. Let $(z^*, \hat{\lambda}^*)$ be a limit point of $(z^k, \hat{\lambda}^k)$ and let the subsequence $\{(z^k, \hat{\lambda}^k)\}_{\mathcal{K}}$ converge to $(z^*, \hat{\lambda}^*)$.

Recall that $(\Delta z^k, \hat{\lambda}^{k+1})$ is the solution of (18) with $(z, \lambda) = (z^k, \lambda^k)$. Taking limits on both sides of (18) with $(z, \lambda) = (z^k, \lambda^k)$ as $k \to \infty$ with $k \in \mathcal{K}$, by the use of Lemmas C.1-C.3 we obtain
\[
\begin{gathered}
\hat{\lambda}^{*(1)} - \hat{\lambda}^{*(2)} - X^T Y^T\hat{\lambda}^{*(4)} = 0, \\
C e_n - \hat{\lambda}^{*(4)} - \hat{\lambda}^{*(5)} = 0, \\
e_m - 2T^*\hat{\lambda}^{*(1)} - 2T^*\hat{\lambda}^{*(2)} - \hat{\lambda}^{*(3)} = 0, \\
y^T\hat{\lambda}^{*(4)} = 0, \\
\left(T^{*2} - W^*\right)\hat{\lambda}^{*(1)} - \mu e_m = 0, \\
\left(T^{*2} + W^*\right)\hat{\lambda}^{*(2)} - \mu e_m = 0, \\
T^*\hat{\lambda}^{*(3)} - \mu e_m = 0, \\
P^*\hat{\lambda}^{*(4)} - \mu e_n = 0, \\
\Xi^*\hat{\lambda}^{*(5)} - \mu e_n = 0.
\end{gathered}
\tag{C.8}
\]
This shows that $(z^*, \hat{\lambda}^*)$ satisfies the first-order optimality conditions (15).

C.2. Proof of Theorem 6

Proof. (i) Without loss of generality, we suppose that the bounded subsequence $\{(z^j, \lambda^j)\}_{\mathcal{J}}$ converges to some point $(z^*, \lambda^*)$. It is clear that $z^*$ is a feasible point of (10). Since $\mu_j, \epsilon_{\mu_j} \to 0$, it follows from (29) and (30) that
\[
\operatorname{Res}(z^j, \lambda^j, \mu_j) \xrightarrow{\ \mathcal{J}\ } \operatorname{Res}(z^*, \lambda^*, 0) = 0, \qquad
\{\lambda^j\}_{\mathcal{J}} \to \lambda^* \ge 0. \tag{C.9}
\]
Consequently, $(z^*, \lambda^*)$ satisfies the KKT conditions (13).

(ii) Let $\xi_j = \max\{\|\lambda^j\|_\infty, 1\}$ and $\bar{\lambda}^j = \xi_j^{-1}\lambda^j$. Obviously, $\{\bar{\lambda}^j\}$ is bounded. Hence there exists an infinite subset $\mathcal{J}' \subseteq \mathcal{J}$ such that $\{\bar{\lambda}^j\}_{\mathcal{J}'} \to \bar{\lambda} \ne 0$ and $\|\bar{\lambda}^j\|_\infty = 1$ for large $j \in \mathcal{J}'$. From (30) we know that $\bar{\lambda} \ge 0$. Dividing both sides of (18) with $(z, \lambda) = (z^j, \lambda^j)$ by $\xi_j$ and then taking limits as $j \to \infty$ with $j \in \mathcal{J}'$, we get
\[
\begin{gathered}
\bar{\lambda}^{(1)} - \bar{\lambda}^{(2)} - X^T Y^T\bar{\lambda}^{(4)} = 0, \\
\bar{\lambda}^{(4)} + \bar{\lambda}^{(5)} = 0, \\
2T^*\bar{\lambda}^{(1)} + 2T^*\bar{\lambda}^{(2)} + \bar{\lambda}^{(3)} = 0, \\
y^T\bar{\lambda}^{(4)} = 0, \\
\left(T^{*2} - W^*\right)\bar{\lambda}^{(1)} = 0, \\
\left(T^{*2} + W^*\right)\bar{\lambda}^{(2)} = 0, \\
T^*\bar{\lambda}^{(3)} = 0, \\
P^*\bar{\lambda}^{(4)} = 0, \\
\Xi^*\bar{\lambda}^{(5)} = 0.
\end{gathered}
\tag{C.10}
\]
Since $z^*$ is feasible, the above equations show that $z^*$ is a Fritz-John point of problem (10).

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

The authors would like to acknowledge support for this project from the National Science Foundation of China (Grants 11371154, 11071087, 11271069, and 61103202).

References

[1] Y. M. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization," in Proceedings of the 14th International Conference on Machine Learning (ICML '97), vol. 97, pp. 412-420, 1997.
[2] G. Forman, "An extensive empirical study of feature selection metrics for text classification," The Journal of Machine Learning Research, vol. 3, pp. 1289-1305, 2003.
[3] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1-3, pp. 389-422, 2002.
[4] H. H. Zhang, J. Ahn, X. Lin, and C. Park, "Gene selection using support vector machines with non-convex penalty," Bioinformatics, vol. 22, no. 1, pp. 88-95, 2006.
[5] Y. Saeys, I. Inza, and P. Larranaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507-2517, 2007.
[6] J. Weston, F. Perez-Cruz, O. Bousquet, O. Chapelle, A. Elisseeff, and B. Scholkopf, "Feature selection and transduction for prediction of molecular bioactivity for drug design," Bioinformatics, vol. 19, no. 6, pp. 764-771, 2003.
[7] Y. Liu, "A comparative study on feature selection methods for drug discovery," Journal of Chemical Information and Computer Sciences, vol. 44, no. 5, pp. 1823-1828, 2004.
[8] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, Eds., Feature Extraction: Foundations and Applications, vol. 207, Springer, 2006.
[9] A. Rakotomamonjy, "Variable selection using SVM-based criteria," Journal of Machine Learning Research, vol. 3, no. 7-8, pp. 1357-1370, 2003.
[10] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik, "Feature selection for SVMs," in Proceedings of the Conference on Neural Information Processing Systems, vol. 13, pp. 668-674, Denver, Colo, USA, 2000.
[11] D. Peleg and R. Meir, "A feature selection algorithm based on the global minimization of a generalization error bound," in Proceedings of the Conference on Neural Information Processing Systems, Vancouver, Canada, 2004.
[12] P. S. Bradley and O. L. Mangasarian, "Feature selection via concave minimization and support vector machines," in Proceedings of the International Conference on Machine Learning, pp. 82-90, San Francisco, Calif, USA, 1998.
[13] J. Weston, A. Elisseeff, B. Scholkopf, and M. Tipping, "Use of the zero-norm with linear models and kernel methods," Journal of Machine Learning Research, vol. 3, pp. 1439-1461, 2003.
[14] A. B. Chan, N. Vasconcelos, and G. R. G. Lanckriet, "Direct convex relaxations of sparse SVM," in Proceedings of the 24th International Conference on Machine Learning (ICML '07), pp. 145-153, Corvallis, Ore, USA, June 2007.
[15] G. M. Fung and O. L. Mangasarian, "A feature selection Newton method for support vector machine classification," Computational Optimization and Applications, vol. 28, no. 2, pp. 185-202, 2004.
[16] J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani, "1-norm support vector machines," Advances in Neural Information Processing Systems, vol. 16, no. 1, pp. 49-56, 2004.
[17] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society B: Methodological, vol. 58, no. 1, pp. 267-288, 1996.
[18] T. Zhang, "Analysis of multi-stage convex relaxation for sparse regularization," Journal of Machine Learning Research, vol. 11, pp. 1081-1107, 2010.
[19] R. Chartrand, "Exact reconstruction of sparse signals via nonconvex minimization," IEEE Signal Processing Letters, vol. 14, no. 10, pp. 707-710, 2007.
[20] R. Chartrand, "Nonconvex regularization for shape preservation," in Proceedings of the 14th IEEE International Conference on Image Processing (ICIP '07), vol. 1, pp. I293-I296, San Antonio, Tex, USA, September 2007.
[21] Z. Xu, H. Zhang, Y. Wang, X. Chang, and Y. Liang, "L1/2 regularization," Science China Information Sciences, vol. 53, no. 6, pp. 1159-1169, 2010.
[22] W. J. Chen and Y. J. Tian, "Lp-norm proximal support vector machine and its applications," Procedia Computer Science, vol. 1, no. 1, pp. 2417-2423, 2010.
[23] J. Liu, J. Li, W. Xu, and Y. Shi, "A weighted Lq adaptive least squares support vector machine classifiers: robust and sparse approximation," Expert Systems with Applications, vol. 38, no. 3, pp. 2253-2259, 2011.
[24] Y. Liu, H. H. Zhang, C. Park, and J. Ahn, "Support vector machines with adaptive Lq penalty," Computational Statistics & Data Analysis, vol. 51, no. 12, pp. 6380-6394, 2007.
[25] Z. Liu, S. Lin, and M. Tan, "Sparse support vector machines with Lp penalty for biomarker identification," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 1, pp. 100-107, 2010.
[26] A. Rakotomamonjy, R. Flamary, G. Gasso, and S. Canu, "lp-lq penalty for sparse linear and sparse multiple kernel multitask learning," IEEE Transactions on Neural Networks, vol. 22, no. 8, pp. 1307-1320, 2011.
[27] J. Y. Tan, Z. Zhang, L. Zhen, C. H. Zhang, and N. Y. Deng, "Adaptive feature selection via a new version of support vector machine," Neural Computing and Applications, vol. 23, no. 3-4, pp. 937-945, 2013.
[28] Z. B. Xu, X. Y. Chang, F. M. Xu, and H. Zhang, "L1/2 regularization: an iterative half thresholding algorithm," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 7, pp. 1013-1027, 2012.
[29] D. Ge, X. Jiang, and Y. Ye, "A note on the complexity of Lp minimization," Mathematical Programming, vol. 129, no. 2, pp. 285-299, 2011.
[30] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 1995.
[31] J. Fan and H. Peng, "Nonconcave penalized likelihood with a diverging number of parameters," The Annals of Statistics, vol. 32, no. 3, pp. 928-961, 2004.
[32] K. Knight and W. Fu, "Asymptotics for lasso-type estimators," The Annals of Statistics, vol. 28, no. 5, pp. 1356-1378, 2000.
[33] B. S. Tian and X. Q. Yang, "An interior-point l1/2-penalty method for nonlinear programming," Technical Report, Department of Applied Mathematics, Hong Kong Polytechnic University, 2013.
[34] L. Armijo, "Minimization of functions having Lipschitz continuous first partial derivatives," Pacific Journal of Mathematics, vol. 16, pp. 1-3, 1966.
[35] F. John, "Extremum problems with inequalities as side-conditions," in Studies and Essays, Courant Anniversary Volume, pp. 187-204, 1948.
[36] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz, "UCI repository of machine learning databases," Technical Report 9702, Department of Information and Computer Science, University of California, Irvine, Calif, USA, 1998, http://archive.ics.uci.edu/ml.


Page 4: Research Article An Interior Point Method for -SVM …downloads.hindawi.com/journals/jam/2014/942520.pdfResearch Article An Interior Point Method for 1/2-SVM and Application to Feature

4 Journal of Applied Mathematics

optimization problem (10) It is clear that 119911lowast

= (119908lowast 120585lowast

|119908lowast|12

119887lowast) isin F Moreover we have 119905

2

119895ge |

119895| forall119895 =

1 2 119898 and hence

119891 () =

119898

sum

119895=1

119905119895+ 119862

119899

sum

119894=1

120585119894ge

119898

sum

119895=1

10038161003816100381610038161003816119895

10038161003816100381610038161003816

12

+ 119862

119899

sum

119894=1

120585119894

ge

119898

sum

119895=1

10038161003816100381610038161003816119908lowast

119895

10038161003816100381610038161003816

12

+ 119862

119899

sum

119894=1

120585lowast

119894= 120601 (119906

lowast)

(11)

Since 119911lowast isin F we have

120601 (119906lowast) = 119891 (119911

lowast) ge 119891 () (12)

This together with (11) implies that 119891() = 120601(119906lowast) The proof

is complete

It is clear that the constraint functions of (10) are convexConsequently at any feasible point the set of all feasibledirections is the same as the set of all linearized feasibledirections

As a result the KKT point exists The KKT system ofthe problem (10) can be written as the following system ofnonlinear equations

119877 (119908 120585 119905 119887 120582) =

(((((((((((

(

120582(1)

minus 120582(2)

minus 119883119879119884119879120582(4)

119862 lowast 119890119899minus 120582(4)

minus 120582(5)

119890119898minus 2119879120582

(1)minus 2119879120582

(2)minus 120582(3)

119910119879120582(4)

min 1199052

119895minus 119908119895 120582(1)

119895119895=12119898

min 1199052

119895+ 119908119895 120582(2)

119895119895=12119898

min 119905119895 120582(3)

119895119895=12119898

min 119901119894 120582(4)

119894119894=12119899

min 120585119894 120582(5)

119894119894=12119899

)))))))))))

)

= 0

(13)

where 120582 = (120582(1)

120582(2)

120582(3)

120582(4)

120582(5)

) are the Lagrangianmultipliers 119883 = (119909

1 1199092 119909

119899)119879 119884 = diag(119910) is diagonal

matrix and 119901119894= 119910119894(119908119879119909119894+ 119887) minus (1 minus 120585

119894) 119894 = 1 2 119899

For the sake of simplicity the properties of the reformu-lation to 119871

12-SVM are shown in Appendix A

3 An Interior Point Method

In this section we develop an interior point method to solvethe equivalent constrained problem (10) of the 119871

12-SVM (7)

31 Auxiliary Function Following the idea of the interiorpoint method the constrained problem (10) can be solvedby minimizing a sequence of logarithmic barrier functions asfollows

min Φ120583(119908 120585 119905 119887)

119898

sum

119895=1

119905119895+ 119862

119899

sum

119894=1

120585119894

minus 120583

119898

sum

119895=1

[log (1199052119895minus 119908119895) + log (1199052

119895+ 119908119895) + log 119905

119895]

minus 120583

119899

sum

119894=1

(log 119901119894+ log 120585

119894)

st 1199052

119895minus 119908119895gt 0 119895 = 1 2 119898

1199052

119895+ 119908119895gt 0 119895 = 1 2 119898

119905119895gt 0 119895 = 1 2 119898

119901119894= 119910119894(119908119879119909119894+ 119887) + 120585

119894minus 1 gt 0 119894 = 1 2 119899

120585119894gt 0 119894 = 1 2 sdot sdot sdot 119899

(14)

where 120583 is the barrier parameter converging to zero fromabove

The KKT system of problem (14) is the following systemof linear equations

120582(1)

minus 120582(2)

minus 119883119879119884119879120582(4)

= 0

119862 lowast 119890119899minus 120582(4)

minus 120582(5)

= 0

119890119898minus 2119879120582

(1)minus 2119879120582

(2)minus 120582(3)

= 0

119910119879120582(4)

= 0

(1198792minus 119882)120582

(1)minus 120583119890119898

= 0

(1198792+ 119882)120582

(2)minus 120583119890119898

= 0

119879120582(3)

minus 120583119890119898

= 0

119875120582(4)

minus 120583119890119899= 0

Ξ120582(5)

minus 120583119890119899= 0

(15)

where 120582 = (120582(1)

120582(2)

120582(3)

120582(4)

120582(5)

) are the Lagrangianmulti-pliers 119879 = diag(119905) 119882 = diag(119908) Ξ = diag(120585) and 119875 =

diag(119884(119883119908+119887lowast119890119899)+120585minus119890

119899) are diagonalmatrices and 119890

119898isin 119877119898

and 119890119899isin 119877119899 stand for the vector whose elements are all ones

32 Newtonrsquos Method We apply Newtonrsquos method to solvethe nonlinear system (15) in variables 119908 120585 119905 119887 and 120582 Thesubproblem of the method is the following system of linearequations

119872 =

((((((((

(

Δ119908

Δ120585

Δ119905

Δ119887

Δ120582(1)

Δ120582(2)

Δ120582(3)

Δ120582(4)

Δ120582(5)

))))))))

)

= minus

(((((((((

(

120582(1)

minus 120582(2)

minus 119883119879119884119879120582(4)

119862 lowast 119890119899minus 120582(4)

minus 120582(5)

119890119898minus 2119879120582

(1)minus 2119879120582

(2)minus 120582(3)

119890119879

119899119884120582(4)

(1198792minus 119882)120582

(1)minus 120583119890119898

(1198792+ 119882)120582

(2)minus 120583119890119898

119879120582(3)

minus 120583119890119898

119875120582(4)

minus 120583119890119899

Ξ120582(5)

minus 120583119890119899

)))))))))

)

(16)

where 119872 is the Jacobian of the left function in (15) and takesthe form

Journal of Applied Mathematics 5

119872 =

(((((((

(

0 0 0 0 119868 minus119868 0 minus119883119879119884119879

0

0 0 0 0 0 0 0 minus119868 minus119868

0 0 minus2 (1198631+ 1198632) 0 minus2119879 minus2119879 minus119868 0 0

0 0 0 0 0 0 0 119910119879

0

minus1198631

0 21198791198631

0 1198792minus 119882 0 0 0 0

1198632

0 21198791198632

0 0 1198792+ 119882 0 0 0

0 0 1198633

0 0 0 119879 0 0

1198634119884119883 119863

40 119863

4119910 0 0 0 119875 0

0 1198635

0 0 0 0 0 0 Ξ

)))))))

)

(17)

where 1198631

= diag(120582(1)) 1198632

= diag(120582(2)) 1198633

= diag(120582(3))1198634= diag(120582(4)) and119863

5= diag(120582(5))

We can rewrite (16) as

(120582(1)

+ Δ120582(1)

) minus (120582(2)

+ Δ120582(2)

) minus 119883119879119884119879(120582(4)

+ Δ120582(4)

) = 0

(120582(4)

+ Δ120582(4)

) + (120582(5)

+ Δ120582(5)

) = 119862 lowast 119890119899

2 (1198631+ 1198632) Δ119905 + 2119879 (120582

(1)+ Δ120582(1)

)

+ 2119879 (120582(2)

+ Δ120582(2)

) + (120582(3)

+ Δ120582(3)

) = 119890119898

119910119879(120582(4)

+ Δ120582(4)

) = 0

minus1198631Δ119908 + 2119879119863

1Δ119905 + (119879

2minus 119882) (120582

(1)+ Δ120582(1)

) = 120583119890119898

1198632Δ119908 + 2119879119863

2Δ119905 + (119879

2+ 119882) (120582

(2)+ Δ120582(2)

) = 120583119890119898

1198633Δ119905 + 119879 (120582

(3)+ Δ120582(3)

) = 120583119890119898

1198634119884119883Δ119908 + 119863

4Δ120585 + 119863

4119910 lowast Δ119887 + 119875 (120582

(4)+ Δ120582(4)

) = 120583119890119899

1198635Δ120585 + Ξ (120582

(5)+ Δ120582(5)

) = 120583119890119899

(18)

It follows from the last five equations that vector ≜ 120582 + Δ120582

can be expressed as

(1)

= (1198792minus 119882)minus1

(120583119890119898+ 1198631Δ119908 minus 2119879119863

1Δ119905)

(2)

= (1198792+ 119882)minus1

(120583119890119898minus 1198632Δ119908 minus 2119879119863

2Δ119905)

(3)

= 119879minus1

(120583119890119898minus 1198633Δ119905)

(4)

= 119875minus1

(120583119890119899minus 1198634119884119883Δ119908 minus 119863

4Δ120585 minus 119863

4119910 lowast Δ119887)

(5)

= Ξminus1

(120583119890119899minus 1198635Δ120585)

(19)

Substituting (19) into the first four equations of (18) weobtain

119878(

Δ119908

Δ120585

Δ119905

Δ119887

)

=

(((

(

minus120583(1198792minus 119882)minus1

119890119898+ 120583(119879

2+ 119882)minus1

119890119898

+119883119879119884119879119875minus1119890119899

minus119862 lowast 119890119899+ 120583119875minus1119890119899+ 120583Ξminus1119890119899

minus119890119898+ 2120583119879(119879

2minus 119882)minus1

119890119898

+2120583119879(1198792+ 119882)minus1

119890119898+ 120583119879minus1119890119898

120583119910119879119875minus1119890119899

)))

)

(20)

where matrix 119878 takes the form

119878 = (

11987811

11987812

11987813

11987814

11987821

11987822

11987823

11987824

11987831

11987832

11987833

11987834

11987841

11987842

11987843

11987844

) (21)

with blocks

11987811

= 119880 + 119881 + 119883119879119884119879119875minus11198634119884119883

11987812

= 119883119879119884119879119875minus11198634

11987813

= minus2 (119880 minus 119881)119879 11987814

= 119883119879119884119879119875minus11198634119910

11987821

= 119875minus11198634119884119883

11987822

= 119875minus11198634+ Ξminus11198635

11987823

= 0 11987824

= 119875minus11198634119910

6 Journal of Applied Mathematics

11987831

= minus2 (119880 minus 119881)119879 11987832

= 0

11987833

= 4119879 (119880 + 119881)119879 + 119879minus11198633minus 2 (119863

1+ 1198632) 119878

34= 0

11987841

= 119910119879119875minus11198634119884119883 119878

42= 119910119879119875minus11198634

11987843

= 0 11987844

= 119910119879119875minus11198634119910

(22)

and 119880 = (1198792minus 119882)minus11198631and 119881 = (119879

2+ 119882)minus11198632

33The Interior PointerAlgorithm Let 119911 = (119908 120585 119905 119887) and120582 =

(120582(1)

120582(2)

120582(3)

120582(4)

120582(5)

) we first present the interior pointeralgorithm to solve the barrier problem (14) and then discussthe details of the algorithm

Algorithm 2 The interior pointer algorithm (IPA) is asfollows

Step 0 Given tolerance 120598120583 set 120591

1isin (0 (12)) 119897 gt 0

1205741gt 1 120573 isin (0 1) Let 119896 = 0

Step 1 Stop if KKT condition (15) holds

Step 2 Compute Δ119911119896 from (20) and

119896+1 from (19)Compute 119911

119896+1= 119911119896+ 120572119896Δ119911119896 Update the Lagrangian

multipliers to obtain 120582119896+1

Step 3 Let 119896 = 119896 + 1 Go to Step 1

In Step 2 a step length 120572119896 is used to calculate 119911

119896+1We estimate 120572

119896 by Armijo line search [34] in which 120572119896

=

max 120573119895| 119895 = 0 1 2 for some 120573 isin (0 1) and satisfies the

following inequalities

(119905119896+1

)2

minus 119908119896+1

gt 0

(119905119896+1

)2

+ 119908119896+1

gt 0

119905119896+1

gt 0

Φ120583(119911119896+1

) minus Φ120583(119911119896)

le 1205911120572119896(nabla119908Φ

120583(119911119896)119879

Δ119908119896

+ nabla119905Φ120583(119911119896)119879

Δ119905119896

+nabla120585Φ120583(119911119896)119879

Δ120585119896+ nabla119887Φ120583(119911119896)119879

Δ119887119896)

(23)

where 1205911isin (0 12)

To avoid ill-conditioned growth of 119896 and guarantee thestrict dual feasibility the Lagrangian multipliers 120582

119896 should

be sufficiently positive and bounded from above Followinga similar idea of [33] we first update the dual multipliers by

λ̃_i^{(1)(k+1)} =
  min{l, μ/((t_i^k)² − w_i^k)},        if λ̂_i^{(1)(k+1)} < min{l, μ/((t_i^k)² − w_i^k)},
  λ̂_i^{(1)(k+1)},                      if min{l, μ/((t_i^k)² − w_i^k)} ≤ λ̂_i^{(1)(k+1)} ≤ μγ_1/((t_i^k)² − w_i^k),
  μγ_1/((t_i^k)² − w_i^k),             if λ̂_i^{(1)(k+1)} > μγ_1/((t_i^k)² − w_i^k),

and λ̃_i^{(2)(k+1)}, λ̃_i^{(3)(k+1)}, and λ̃_i^{(4)(k+1)} are defined in the same way with (t_i^k)² − w_i^k replaced by (t_i^k)² + w_i^k, t_i^k, and p_i^k, respectively,   (24)


where p_i = y_i(w^T x_i + b) + ξ_i − 1. Analogously,

λ̃_i^{(5)(k+1)} =
  min{l, μ/ξ_i^k},        if λ̂_i^{(5)(k+1)} < min{l, μ/ξ_i^k},
  λ̂_i^{(5)(k+1)},          if min{l, μ/ξ_i^k} ≤ λ̂_i^{(5)(k+1)} ≤ μγ_1/ξ_i^k,
  μγ_1/ξ_i^k,              if λ̂_i^{(5)(k+1)} > μγ_1/ξ_i^k,   (25)

where the parameters l and γ_1 satisfy l > 0 and γ_1 > 1. In other words, each component λ̂_i from (19) is projected onto the interval [min{l, μ/g_i^k}, μγ_1/g_i^k], where g_i^k is the value of the corresponding constraint function at the current iterate.
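In implementation terms, (24)–(25) simply clip each Newton estimate into an interval whose endpoints depend on the corresponding constraint value. A minimal sketch (illustrative Python; the function name and arguments are ours) is:

```python
import numpy as np

def safeguard_multipliers(lam_hat, g, mu, l=1e-5, gamma1=1e5):
    """Clip multiplier estimates into [min(l, mu/g_i), mu*gamma1/g_i] (a sketch).

    lam_hat : array with one multiplier block, e.g. lambda-hat^(1)
    g       : array of the matching constraint values, e.g. (t_i^k)^2 - w_i^k (> 0)
    """
    lower = np.minimum(l, mu / g)      # keep the multipliers sufficiently positive
    upper = mu * gamma1 / g            # and bounded from above
    return np.clip(lam_hat, lower, upper)
```

Calling it once per block, with g set to (t^k)² − w^k, (t^k)² + w^k, t^k, p^k, and ξ^k in turn, reproduces the five updates above.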

Since positive definiteness of the matrix S is required in this method, the Lagrangian multipliers λ̃^{k+1} should satisfy the following condition:

λ_i^{(3)} − 2(λ_i^{(1)} + λ_i^{(2)}) t_i ≥ 0,   i = 1, 2, ..., m.   (26)
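The role of (26) can be read off the diagonal block S33 in (22); the short calculation below restates the argument detailed in Appendix B, under the standing assumption that the iterates are strictly feasible and the multipliers are positive. Since S33 = 4T(U + V)T + T^{−1}D_3 − 2(D_1 + D_2) is diagonal, its i-th diagonal entry satisfies

4 t_i² (u_i + v_i) + λ_i^{(3)}/t_i − 2(λ_i^{(1)} + λ_i^{(2)}) ≥ 4 t_i² (u_i + v_i) > 0

whenever t_i > 0 and (26) holds, where u_i and v_i denote the (positive) diagonal entries of U and V.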

For the sake of simplicity, the proof is given in Appendix B. Therefore, if λ̃^{k+1} satisfies (26), we let λ^{k+1} = λ̃^{k+1}. Otherwise, we further update it by the following setting:

λ^{(1)(k+1)} = γ_2 λ̃^{(1)(k+1)},    λ^{(2)(k+1)} = γ_2 λ̃^{(2)(k+1)},    λ^{(3)(k+1)} = γ_3 λ̃^{(3)(k+1)},   (27)

where the constants γ_2 ∈ (0, 1) and γ_3 ≥ 1 satisfy

γ_3/γ_2 = max_{i∈E} [ 2 t_i^{k+1} (λ̃_i^{(1)(k+1)} + λ̃_i^{(2)(k+1)}) / λ̃_i^{(3)(k+1)} ],   (28)

with E = {1, 2, ..., m}. It is not difficult to see that the vector (λ^{(1)(k+1)}, λ^{(2)(k+1)}, λ^{(3)(k+1)}) determined by (27) satisfies (26).

In practice, the KKT conditions (15) only need to be satisfied within a tolerance ε_μ; that is, the iterative process stops when the following inequalities are met:

Res(z, λ, μ) = ‖( λ^{(1)} − λ^{(2)} − X^T Y^T λ^{(4)};
                  C e_n − λ^{(4)} − λ^{(5)};
                  e_m − 2Tλ^{(1)} − 2Tλ^{(2)} − λ^{(3)};
                  e_n^T Y λ^{(4)};
                  (T² − W)λ^{(1)} − μe_m;
                  (T² + W)λ^{(2)} − μe_m;
                  Tλ^{(3)} − μe_m;
                  Pλ^{(4)} − μe_n;
                  Ξλ^{(5)} − μe_n )‖ < ε_μ,   (29)

λ ≥ −ε_μ e,   (30)

where ε_μ is related to the current barrier parameter μ and satisfies ε_μ ↓ 0 as μ → 0.

Since the function Φ_μ in (14) is convex, we have the following lemma, which shows that Algorithm 2 is well defined (the proof is given in Appendix B).

Lemma 3. Let z^k be strictly feasible for problem (10). If Δz^k = 0, then (23) is satisfied for all α_k ≥ 0. If Δz^k ≠ 0, then there exists an ᾱ_k ∈ (0, 1] such that (23) holds for all α_k ∈ (0, ᾱ_k].

The proposed interior point method successively solves the barrier subproblem (14) with a decreasing sequence {μ_k}. We simply reduce both ε_μ and μ by a constant factor β ∈ (0, 1). Finally, we test optimality for problem (10) by means of the residual norm Res(z, λ, 0). Here we present the whole algorithm for solving the L1/2-SVM problem (10).

Algorithm 4. The algorithm for solving the L1/2-SVM problem is as follows.

Step 0. Set w^0 ∈ R^m, b^0 ∈ R, ξ^0 ∈ R^n, λ^0 = λ̃^0 > 0, and t_i^0 ≥ √(|w_i^0|) + 1/2, i ∈ {1, 2, ..., m}. Given constants μ_0 > 0, ε_{μ_0} > 0, β ∈ (0, 1), and ε > 0, let j = 0.

Step 1. Stop if Res(z^j, λ^j, 0) ≤ ε and λ^j ≥ 0.

Step 2. Starting from (z^j, λ^j), apply Algorithm 2 to solve (14) with barrier parameter μ_j and stopping tolerance ε_{μ_j}. Set z^{j+1} = z^{j,k}, λ̃^{j+1} = λ̃^{j,k}, and λ^{j+1} = λ^{j,k}.

Step 3. Set μ_{j+1} = βμ_j and ε_{μ_{j+1}} = βε_{μ_j}. Let j = j + 1 and go to Step 1.

In Algorithm 4, the index j denotes an outer iteration, while k denotes the last inner iteration of Algorithm 2.
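The outer loop of Algorithm 4 has a simple path-following structure, sketched below in illustrative Python. Here inner_ipa and residual are assumed helpers standing in for Algorithm 2 and the residual Res in (29), and lam is assumed to be a NumPy array; none of these names come from the paper.

```python
def solve_l12_svm(z0, lam0, mu0, eps_mu0, beta=0.6, eps=1e-6, max_outer=100):
    """Outer (path-following) loop of Algorithm 4, as a sketch.

    inner_ipa(z, lam, mu, eps_mu): assumed helper running Algorithm 2 on the barrier
    subproblem (14) until its residual drops below eps_mu, returning the last iterate.
    residual(z, lam, mu): assumed helper evaluating Res from (29).
    """
    z, lam, mu, eps_mu = z0, lam0, mu0, eps_mu0
    for _ in range(max_outer):
        if residual(z, lam, 0.0) <= eps and (lam >= 0).all():
            break                                  # optimality test for problem (10)
        z, lam = inner_ipa(z, lam, mu, eps_mu)     # solve the barrier subproblem (14)
        mu, eps_mu = beta * mu, beta * eps_mu      # shrink barrier parameter and tolerance
    return z, lam
```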

The convergence of the proposed interior point method can be proved. We list the theorems here and give the proofs in Appendix C.

Theorem 5. Let (z^k, λ^k) be generated by Algorithm 2. Then any limit point of the sequence (z^k, λ̃^k) generated by Algorithm 2 satisfies the first-order optimality conditions (15).

Theorem 6. Let (z^j, λ^j) be generated by Algorithm 4 by ignoring its termination condition. Then the following statements are true.

(i) Any limit point of (z^j, λ^j) satisfies the first-order optimality conditions (13).

(ii) The limit point z* of a convergent subsequence {z^j}_J ⊆ {z^j} with unbounded multipliers {λ^j}_J is a Fritz-John point [35] of problem (10).

4. Experiments

In this section, we tested the constrained optimization reformulation of the L1/2-SVM and the proposed interior point method. We compared the performance of the L1/2-SVM with L2-SVM [30], L1-SVM [12], and L0-SVM [12] on artificial data and ten UCI data sets (http://archive.ics.uci.edu/ml/). All four problems were solved in the primal, referencing the machine learning toolbox Spider (http://people.kyb.tuebingen.mpg.de/spider/). The L2-SVM and L1-SVM were solved directly by quadratic programming and linear programming, respectively. For the NP-hard L0-SVM problem, a commonly cited approximation, Feature Selection Concave (FSV) [12], was applied, and the FSV problem was then solved by a Successive Linear Approximation (SLA) algorithm. All experiments were run on a personal computer (1.6 GHz CPU, 4 GB RAM) with MATLAB R2010b on 64-bit Windows 7.

In the proposed interior point method (Algorithms 2 and 4), we set the parameters as β = 0.6, τ_1 = 0.5, l = 0.00001, γ_1 = 100000, and β = 0.5. The balance parameter C was selected by 5-fold cross-validation on the training set over the range {2^{−10}, 2^{−9}, ..., 2^{10}}. After training, the weights that did not satisfy the criterion |w_j|/max_i(|w_i|) ≥ 10^{−4} [14] were set to zero. Then the cardinality of the hyperplane was computed as the number of nonzero weights.
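As a concrete illustration of this pruning rule, the sketch below (illustrative Python; the function name and the rel_tol argument are ours) zeroes the small weights and counts the survivors:

```python
import numpy as np

def cardinality(w, rel_tol=1e-4):
    """Zero out weights below the relative threshold and count the nonzeros (a sketch)."""
    w = np.asarray(w, dtype=float)
    scale = np.max(np.abs(w))
    if scale == 0.0:
        return w, 0
    keep = np.abs(w) / scale >= rel_tol      # keep only weights passing |w_j|/max_i|w_i| >= rel_tol
    w_pruned = np.where(keep, w, 0.0)
    return w_pruned, int(keep.sum())
```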

4.1. Artificial Data. First, we took an artificial binary linear classification problem as an example. The problem is similar to that in [13]. The probability of y = 1 or −1 is equal. The first 6 features are relevant but redundant. In 70% of the samples, the first three features {x_1, x_2, x_3} were drawn as x_i = y N(i, 1) and the second three features {x_4, x_5, x_6} as x_i = N(0, 1); otherwise, the first three were drawn as x_i = N(0, 1) and the second three as x_i = y N(i − 3, 1). The remaining features are noise, x_i = N(0, 20), i = 7, ..., m. Here m is the dimension of the input features. The inputs were scaled to zero mean and unit standard deviation. In each trial, 500 points were generated for testing, and the average results were estimated over 30 trials.
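For reproducibility, the following sketch (illustrative Python, reflecting our reading of the description above; in particular, the second argument of N(·, ·) is taken here as the standard deviation) generates one such data set:

```python
import numpy as np

def make_artificial(n, m, rng=None):
    """Generate one artificial data set as described above (a sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    y = rng.choice([-1.0, 1.0], size=n)            # y = +1 or -1 with equal probability
    X = np.empty((n, m))
    means = np.array([1.0, 2.0, 3.0])
    for k in range(n):
        if rng.random() < 0.7:                     # 70% of the samples
            X[k, :3] = y[k] * rng.normal(means, 1.0)
            X[k, 3:6] = rng.normal(0.0, 1.0, size=3)
        else:
            X[k, :3] = rng.normal(0.0, 1.0, size=3)
            X[k, 3:6] = y[k] * rng.normal(means, 1.0)
        X[k, 6:] = rng.normal(0.0, 20.0, size=m - 6)   # noise features
    X = (X - X.mean(axis=0)) / X.std(axis=0)       # scale to zero mean, unit standard deviation
    return X, y
```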

In the first experiment, we consider the cases with the fixed feature size m = 30 and different training sample sizes n = 10, 20, ..., 100. The average results over the 30 trials are shown in Table 1 and Figure 2. Figure 2 (left) plots the average cardinality of each classifier. Since the artificial data sets have 2 relevant and nonredundant features, the ideal average cardinality is 2. Figure 2 (left) shows that the three sparse SVMs (L1-SVM, L1/2-SVM, and L0-SVM) achieve sparse solutions, while the L2-SVM uses almost the full feature set in each data set. Furthermore, the solutions of L1/2-SVM and L0-SVM are much sparser than those of L1-SVM. As shown in Table 1, the L1-SVM selects more than 6 features in all cases, which implies that some redundant or irrelevant features are selected. The average cardinalities of L1/2-SVM and L0-SVM are similar and close to 2. However, when n = 10 and 20, the L0-SVM has average cardinalities of 1.42 and 1.87, respectively. This means that the L0-SVM sometimes selects only one feature on low-sample data sets and may ignore some truly relevant features. Consequently, with cardinalities between 2.05 and 2.9, the L1/2-SVM gives a more reliable solution than the L0-SVM. In short, as far as the number of selected features is concerned, the L1/2-SVM behaves better than the other three methods.

Figure 2 (right) plots the trend of the prediction accuracy versus the size of the training sample. The classification performance of all methods generally improves with the increase of the training sample size n. L1-SVM has the best prediction performance in all cases, slightly better than L1/2-SVM. L1/2-SVM is more accurate in classification than L2-SVM and L0-SVM, especially for n = 10, ..., 50. As shown in Table 1, when there are only 10 training samples, the average accuracy of L1/2-SVM is 88.05%, while the results of L2-SVM and L0-SVM are 84.65% and 77.65%, respectively. Compared with L2-SVM and L0-SVM, L1/2-SVM has the average accuracy increased by 3.4% and 10.4%, respectively, which can be explained as follows. For the L2-SVM, all features are selected without discrimination, and the prediction is misled by the irrelevant features. For the L0-SVM, few features are selected and some relevant features are not included, which has a negative impact on the prediction result. As a tradeoff between L2-SVM and L0-SVM, L1/2-SVM has better performance than the two.

The average results over the ten artificial data sets in the first experiment are shown at the bottom of Table 1. On average, the accuracy of L1/2-SVM is 0.87% lower than that of L1-SVM, while the number of features selected by L1/2-SVM is 74.14% smaller than that of L1-SVM. This indicates that the L1/2-SVM can achieve a much sparser solution than L1-SVM at little cost in accuracy. Moreover, the average accuracy of L1/2-SVM over the 10 data sets is 2.22% higher than that of L0-SVM with a similar cardinality. To sum up, the L1/2-SVM provides the best balance between accuracy and sparsity among the three sparse SVMs.

To further evaluate the feature selection performance of L1/2-SVM, we investigate whether the features are correctly selected. Since the L2-SVM is not designed for feature selection, it is not included in this comparison. Since our artificial data sets have 2 best features (x_3, x_6), the best result should have these two features ranked on top according to the absolute values of their weights |w_j|. In the experiment, we select the top 2 features with the maximal |w_j| for each method and calculate the frequency with which the top 2 features are x_3 and x_6 in 30 runs. The results are listed in Table 2. When the training sample size is too small, it is difficult for all sparse SVMs to discriminate the two most important features. For example, when n = 10, the selected frequencies of L1-SVM, L0-SVM, and L1/2-SVM are 7, 3, and 9, respectively. When n increases, all methods tend to make more correct selections. Moreover, Table 2 shows that the L1/2-SVM outperforms the other two methods in all cases. For example, when n = 100, the selected frequencies of L1-SVM and L0-SVM are 22 and 25, respectively, while the result of L1/2-SVM is 27. The L1-SVM selects too many redundant or irrelevant features, which may influence the ranking to some extent. Therefore, L1-SVM is not as good as L1/2-SVM at distinguishing the critical features. The L0-SVM has a lower hit frequency than L1/2-SVM, which is probably due to the excessively small feature subset it obtains. Above all, Tables 1 and 2 and Figure 2 clearly show that the L1/2-SVM is a promising sparsity-driven classification method.

In the second simulation, we consider cases with various dimensions of the feature space, m = 20, 40, ..., 180, 200, and a fixed training sample size n = 100. The average results over 30 trials are shown in Figure 3 and Table 3. Since there are only 6 relevant features, a larger m means more noisy features. Figure 3 (left) shows that, as the dimension increases from 20 to 200, the number of features selected by L1-SVM increases from 8.26 to 23.1. However, the cardinalities of L1/2-SVM and L0-SVM stay stable (from 2.2 to 2.95). This indicates that the L1/2-SVM and L0-SVM are more suitable for feature selection than L1-SVM.

Figure 3 (right) shows that, with the increase of the noise features, the accuracy of L2-SVM drops significantly (from 98.68% to 87.47%). On the contrary, for the other three sparse SVMs there is little change in accuracy, which reveals that SVMs can benefit from the feature reduction.


Figure 2: Results comparison on 10 artificial data sets with various training samples (left: average cardinality versus the number of training samples; right: accuracy versus the number of training samples; curves: L2-SVM, L1-SVM, L0-SVM, and L1/2-SVM).

Table 1: Results comparison on 10 artificial data sets with various training sample sizes.

n   | L2-SVM Card | L2-SVM Acc | L1-SVM Card | L1-SVM Acc | L0-SVM Card | L0-SVM Acc | L1/2-SVM Card | L1/2-SVM Acc
10  | 29.97 | 84.65 | 6.93  | 88.55 | 1.43 | 77.65 | 2.25 | 88.05
20  | 30.00 | 92.21 | 8.83  | 96.79 | 1.87 | 93.17 | 2.35 | 95.50
30  | 30.00 | 94.27 | 8.47  | 98.23 | 2.03 | 94.67 | 2.05 | 97.41
40  | 29.97 | 95.99 | 9.00  | 98.75 | 2.07 | 95.39 | 2.10 | 97.75
50  | 30.00 | 96.61 | 9.50  | 98.91 | 2.13 | 96.57 | 2.45 | 97.68
60  | 29.97 | 97.25 | 10.00 | 98.94 | 2.20 | 97.61 | 2.45 | 97.98
70  | 30.00 | 97.41 | 11.00 | 98.98 | 2.20 | 97.56 | 2.40 | 98.23
80  | 30.00 | 97.68 | 10.03 | 99.09 | 2.23 | 97.61 | 2.60 | 98.50
90  | 29.97 | 97.89 | 9.20  | 99.21 | 2.30 | 97.41 | 2.90 | 98.30
100 | 30.00 | 98.13 | 10.27 | 99.09 | 2.50 | 97.93 | 2.55 | 98.36
On average | 29.99 | 95.21 | 9.32 | 97.65 | 2.10 | 94.56 | 2.41 | 96.78

"n" is the number of training samples, "Card" is the number of selected features, and "Acc" is the classification accuracy (%).

Table 2: Frequency with which the most important features (x_3, x_6) are ranked in the top 2 over 30 runs.

n        | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100
L1-SVM   | 7  | 14 | 16 | 18 | 22 | 23 | 20 | 22 | 21 | 22
L0-SVM   | 3  | 11 | 12 | 15 | 19 | 22 | 22 | 23 | 22 | 25
L1/2-SVM | 9  | 16 | 19 | 24 | 24 | 23 | 27 | 26 | 26 | 27

Table 3: Average results over 10 artificial data sets with various dimensions.

            | L2-SVM | L1-SVM | L0-SVM | L1/2-SVM
Cardinality | 109.95 | 15.85  | 2.38   | 2.41
Accuracy    | 92.93  | 99.07  | 97.76  | 98.26


Table 4: Feature selection performance comparison on UCI data sets (cardinality); each entry is Mean (Std).

No. | UCI data set  | n    | m   | L2-SVM        | L1-SVM        | L0-SVM        | L1/2-SVM
1   | Pima diabetes | 768  | 8   | 8.00 (0.00)   | 7.40 (0.89)   | 6.20 (1.10)   | 7.60 (1.52)
2   | Breast cancer | 683  | 9   | 9.00 (0.00)   | 8.60 (1.22)   | 6.40 (2.07)   | 8.40 (1.67)
3   | Wine (3)      | 178  | 13  | 13.00 (0.00)  | 9.73 (0.69)   | 3.73 (0.28)   | 4.07 (0.49)
4   | Image (7)     | 2100 | 19  | 19.00 (0.00)  | 8.20 (0.53)   | 3.66 (0.97)   | 5.11 (0.62)
5   | SPECT         | 267  | 22  | 22.00 (0.00)  | 17.40 (5.27)  | 9.80 (4.27)   | 9.00 (1.17)
6   | WDBC          | 569  | 30  | 30.00 (0.00)  | 9.80 (1.52)   | 9.00 (3.03)   | 5.60 (2.05)
7   | Ionosphere    | 351  | 34  | 33.20 (0.45)  | 25.00 (3.35)  | 24.60 (10.21) | 21.80 (8.44)
8   | SPECTF        | 267  | 44  | 44.00 (0.00)  | 38.80 (10.98) | 24.00 (11.32) | 25.60 (7.75)
9   | Sonar         | 208  | 60  | 60.00 (0.00)  | 31.40 (13.05) | 27.00 (7.18)  | 13.60 (10.06)
10  | Musk1         | 476  | 166 | 166.00 (0.00) | 85.80 (18.85) | 48.60 (4.28)  | 41.00 (14.16)
    | Average cardinality |  | 40.50 | 40.42   | 24.21         | 16.30         | 14.18

Table 5: Classification performance comparison on UCI data sets (accuracy, %); each entry is Mean (Std).

No. | UCI data set  | L2-SVM       | L1-SVM       | L0-SVM       | L1/2-SVM
1   | Pima diabetes | 76.99 (3.47) | 76.73 (0.85) | 76.73 (2.94) | 77.12 (1.52)
2   | Breast cancer | 96.18 (2.23) | 97.06 (0.96) | 96.32 (1.38) | 96.47 (1.59)
3   | Wine (3)      | 97.71 (3.73) | 98.29 (1.28) | 93.14 (2.39) | 97.14 (2.56)
4   | Image (7)     | 88.94 (1.49) | 91.27 (1.77) | 86.67 (4.58) | 91.36 (1.94)
5   | SPECT         | 83.40 (2.15) | 78.49 (2.31) | 78.11 (3.38) | 82.34 (5.10)
6   | WDBC          | 96.11 (2.11) | 96.46 (1.31) | 95.58 (1.15) | 96.46 (1.31)
7   | Ionosphere    | 83.43 (5.44) | 88.00 (3.70) | 84.57 (5.38) | 87.43 (5.94)
8   | SPECTF        | 78.49 (3.91) | 74.34 (4.50) | 73.87 (3.25) | 78.49 (1.89)
9   | Sonar         | 73.66 (5.29) | 75.61 (5.82) | 74.15 (7.98) | 76.10 (4.35)
10  | Musk1         | 84.84 (1.73) | 82.11 (2.42) | 76.00 (5.64) | 82.32 (3.38)
    | Average accuracy | 85.97     | 85.84        | 83.51        | 86.52

Table 3 shows the average results over all data sets in the second experiment. On average, the solution of L1/2-SVM is much sparser than that of L1-SVM, with a slightly better accuracy than L0-SVM.

4.2. UCI Data Sets. We further tested the reformulation and the proposed interior point method for the L1/2-SVM on 10 UCI data sets [36]. There are 8 binary classification problems and 2 multiclass problems (wine, image). Each feature of the input data was normalized to zero mean and unit variance, and the instances with missing values were deleted. Then the data was randomly split into a training set (80%) and a testing set (20%). For the two multiclass problems, a one-against-rest method was applied to construct a binary classifier for each class. We repeated the training and testing procedure 10 times, and the average results are shown in Tables 4 and 5 and Figure 4.
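A sketch of this preprocessing (illustrative Python; the function names are ours) is:

```python
import numpy as np

def preprocess_and_split(X, y, train_frac=0.8, rng=None):
    """Drop rows with missing values, standardize features, and split 80/20 (a sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    keep = ~np.isnan(X).any(axis=1)               # delete instances with missing values
    X, y = X[keep], y[keep]
    X = (X - X.mean(axis=0)) / X.std(axis=0)      # zero mean, unit variance per feature
    idx = rng.permutation(len(y))
    n_train = int(train_frac * len(y))
    return X[idx[:n_train]], y[idx[:n_train]], X[idx[n_train:]], y[idx[n_train:]]

def one_vs_rest_labels(y, positive_class):
    """+-1 labels for a one-against-rest binary classifier of positive_class."""
    return np.where(y == positive_class, 1.0, -1.0)
```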

Tables 4 and 5 summarize the feature selection and classification performance of the numerical experiments on the UCI data sets, respectively. Here n is the number of samples and m is the number of input features. For the two multiclass data sets, the numbers of classes are marked after their names. Sparsity is defined as card/m, and a small value of sparsity is preferred. The data sets are arranged in ascending order of dimension. The lowest cardinality and the best accuracy rate for each problem are shown in bold.

As shown in Tables 4 and 5, the three sparse SVMs encourage sparsity in all data sets while maintaining roughly the same accuracy as the L2-SVM. Among the three sparse methods, the L1/2-SVM has the lowest cardinality (14.18) and the highest classification accuracy (86.52%) on average, while the L1-SVM has the worst feature selection performance, with the highest average cardinality (24.21), and the L0-SVM has the lowest average classification accuracy (83.51%).

Figure 4 plots the sparsity (left) and the classification accuracy (right) of each classifier on the UCI data sets. In three data sets (6, 9, 10), the L1/2-SVM has the best performance in both feature selection and classification among the three sparse SVMs. Compared with L1-SVM, L1/2-SVM achieves a sparser solution in nine data sets.


Figure 3: Results comparison on artificial data sets with various dimensions (left: cardinality versus dimension; right: accuracy versus dimension; curves: L2-SVM, L1-SVM, L0-SVM, and L1/2-SVM).

For example, in the data set "8 SPECTF", the number of features selected by L1/2-SVM is 34% smaller than that of L1-SVM; at the same time, the accuracy of L1/2-SVM is 4.1% higher than that of L1-SVM. In most data sets (4, 5, 6, 9, 10), the cardinality of L1/2-SVM drops significantly (by at least 37%) with an equal or slightly better accuracy. Only in the data set "3 Wine" is the accuracy of L1/2-SVM decreased, by 1.1%, but the sparsity provided by L1/2-SVM leads to a 58.2% improvement over L1-SVM. In the remaining three data sets (1, 2, 7), the two methods have similar results in feature selection and classification. As seen above, the L1/2-SVM can provide a lower-dimensional representation than L1-SVM with competitive prediction performance.

Figure 4 (right) shows that, compared with L0-SVM, L1/2-SVM improves the classification accuracy in all data sets. For instance, in four data sets (3, 4, 8, 10), the classification accuracy of L1/2-SVM is at least 4.0% higher than that of L0-SVM. Especially in the data set "10 Musk1", L1/2-SVM gives a 6.3% rise in accuracy over L0-SVM. Meanwhile, it can be observed from Figure 4 (left) that L1/2-SVM selects fewer features than L0-SVM in five data sets (5, 6, 7, 9, 10). For example, in data sets 6 and 9, the cardinalities of L1/2-SVM are 37.8% and 49.6% smaller than those of L0-SVM, respectively. In summary, L1/2-SVM presents better classification performance than L0-SVM while remaining effective in choosing relevant features.

5. Conclusions

In this paper, we proposed an L1/2 regularization technique for simultaneous feature selection and classification in the SVM. We reformulated the L1/2-SVM into an equivalent smooth constrained optimization problem. The problem possesses a very simple structure, for which it is relatively easy to develop numerical methods. By the use of this interesting reformulation, we proposed an interior point method and established its convergence. Our numerical results supported the reformulation and the proposed method. The L1/2-SVM obtains sparser solutions than L1-SVM with comparable classification accuracy. Furthermore, the L1/2-SVM achieves more accurate classification results than L0-SVM (FSV).

Inspired by the good performance of the smooth optimization reformulation of the L1/2-SVM, there are some interesting topics deserving further research. For example, developing more efficient algorithms for solving the reformulation, studying the nonlinear L1/2-SVM, and exploring various applications of the L1/2-SVM to further validate its effectiveness are interesting research topics. Some of them are under our current investigation.

Appendix

A. Properties of the Reformulation of the L1/2-SVM

Let D be the set of all feasible points of problem (10). For any z ∈ D, let FD(z, D) and LFD(z, D) be the sets of all feasible directions and linearized feasible directions of D at z, respectively. Since the constraint functions of (10) are all convex, we immediately have the following lemma.

Lemma A.1. For any feasible point z of (10), one has

FD(z, D) = LFD(z, D).   (A.1)


Figure 4: Results comparison on UCI data sets (left: sparsity per data set; right: accuracy per data set; curves: L2-SVM, L1-SVM, L0-SVM, and L1/2-SVM).

Based on the above lemma, we can easily derive a first-order necessary condition for (10). The Lagrangian function of (10) is

L(w, t, ξ, b, λ, μ, ν, p, q) = C Σ_{i=1}^{n} ξ_i + Σ_{j=1}^{m} t_j − Σ_{j=1}^{m} λ_j (t_j² − w_j) − Σ_{j=1}^{m} μ_j (t_j² + w_j) − Σ_{j=1}^{m} ν_j t_j − Σ_{i=1}^{n} p_i (y_i w^T x_i + y_i b + ξ_i − 1) − Σ_{i=1}^{n} q_i ξ_i,   (A.2)

where λ, μ, ν, p, q are the Lagrangian multipliers. By the use of Lemma A.1, we immediately have the following theorem about the first-order necessary condition.

Theorem A.2. Let (w, t, ξ, b) be a local solution of (10). Then there are Lagrangian multipliers (λ, μ, ν, p, q) ∈ R^{m+m+m+n+n} such that the following KKT conditions hold:

∂L/∂w_j = λ_j − μ_j − Σ_{i=1}^{n} p_i y_i x_{ij} = 0,   j = 1, 2, ..., m,
∂L/∂t_j = 1 − 2λ_j t_j − 2μ_j t_j − ν_j = 0,   j = 1, 2, ..., m,
∂L/∂ξ_i = C − p_i − q_i = 0,   i = 1, 2, ..., n,
∂L/∂b = Σ_{i=1}^{n} p_i y_i = 0,
λ_j ≥ 0,  t_j² − w_j ≥ 0,  λ_j (t_j² − w_j) = 0,   j = 1, 2, ..., m,
μ_j ≥ 0,  t_j² + w_j ≥ 0,  μ_j (t_j² + w_j) = 0,   j = 1, 2, ..., m,
ν_j ≥ 0,  t_j ≥ 0,  ν_j t_j = 0,   j = 1, 2, ..., m,
p_i ≥ 0,  y_i (w^T x_i + b) + ξ_i − 1 ≥ 0,  p_i (y_i (w^T x_i + b) + ξ_i − 1) = 0,   i = 1, 2, ..., n,
q_i ≥ 0,  ξ_i ≥ 0,  q_i ξ_i = 0,   i = 1, 2, ..., n.   (A.3)

The following theorem shows that the level set is bounded.

Theorem A.3. For any given constant c > 0, the level set

W_c = {z = (w, t, ξ, b) | f(z) ≤ c} ∩ D   (A.4)

is bounded.

Proof. For any (w, t, ξ, b) ∈ W_c, we have

Σ_{i=1}^{m} t_i + C Σ_{j=1}^{n} ξ_j ≤ c.   (A.5)


Combining this with t_i ≥ 0, i = 1, ..., m, and ξ_j ≥ 0, j = 1, ..., n, we have

0 ≤ t_i ≤ c,   i = 1, ..., m,
0 ≤ ξ_j ≤ c/C,   j = 1, ..., n.   (A.6)

Moreover, for any (w, t, ξ, b) ∈ D,

|w_i| ≤ t_i²,   i = 1, ..., m.   (A.7)

Consequently, w, t, ξ are bounded. What is more, from the condition

y_i (w^T x_i + b) ≥ 1 − ξ_i,   i = 1, ..., n,   (A.8)

we have

w^T x_i + b ≥ 1 − ξ_i   if y_i = 1,
w^T x_i + b ≤ −1 + ξ_i   if y_i = −1.   (A.9)

Thus, if the feasible region D is not empty, then we have

max_{y_i = 1} (1 − ξ_i − w^T x_i) ≤ b ≤ min_{y_i = −1} (−1 + ξ_i − w^T x_i).   (A.10)

Hence b is also bounded. The proof is complete.

B. Proof of Lemma 3

Lemma 3 (see Section 3) shows that Algorithm 2 is well defined. We first introduce the following proposition, which shows that the matrix S defined by (21) is always positive definite and thus ensures that Algorithm 2 is well defined.

Proposition B.1. Let the inequality (26) hold; then the matrix S defined by (21) is positive definite.

Proof. By an elementary deduction we have, for any v^{(1)} ∈ R^m, v^{(2)} ∈ R^n, v^{(3)} ∈ R^m, v^{(4)} ∈ R with (v^{(1)}, v^{(2)}, v^{(3)}, v^{(4)}) ≠ 0,

(v^{(1)T}, v^{(2)T}, v^{(3)T}, v^{(4)T}) S (v^{(1)}; v^{(2)}; v^{(3)}; v^{(4)})
= v^{(1)T} S11 v^{(1)} + v^{(1)T} S12 v^{(2)} + v^{(1)T} S13 v^{(3)} + v^{(1)T} S14 v^{(4)}
+ v^{(2)T} S21 v^{(1)} + v^{(2)T} S22 v^{(2)} + v^{(2)T} S23 v^{(3)} + v^{(2)T} S24 v^{(4)}
+ v^{(3)T} S31 v^{(1)} + v^{(3)T} S32 v^{(2)} + v^{(3)T} S33 v^{(3)} + v^{(3)T} S34 v^{(4)}
+ v^{(4)T} S41 v^{(1)} + v^{(4)T} S42 v^{(2)} + v^{(4)T} S43 v^{(3)} + v^{(4)T} S44 v^{(4)}

= v^{(1)T} (U + V + X^T Y^T P^{−1} D_4 Y X) v^{(1)} + v^{(1)T} (X^T Y^T P^{−1} D_4) v^{(2)} + v^{(1)T} (−2(U − V)T) v^{(3)} + v^{(1)T} (X^T Y^T P^{−1} D_4 y) v^{(4)}
+ v^{(2)T} P^{−1} D_4 Y X v^{(1)} + v^{(2)T} (P^{−1} D_4 + Ξ^{−1} D_5) v^{(2)} + v^{(2)T} P^{−1} D_4 y v^{(4)}
− 2 v^{(3)T} (U − V) T v^{(1)} + v^{(3)T} (4T(U + V)T + T^{−1} D_3 − 2(D_1 + D_2)) v^{(3)}
+ v^{(4)T} y^T P^{−1} D_4 Y X v^{(1)} + v^{(4)T} y^T P^{−1} D_4 v^{(2)} + v^{(4)T} y^T P^{−1} D_4 y v^{(4)}

= v^{(1)T} U v^{(1)} − 4 v^{(3)T} U T v^{(1)} + 4 v^{(3)T} T U T v^{(3)}
+ v^{(1)T} V v^{(1)} + 4 v^{(3)T} V T v^{(1)} + 4 v^{(3)T} T V T v^{(3)}
+ v^{(2)T} Ξ^{−1} D_5 v^{(2)} + v^{(3)T} (T^{−1} D_3 − 2D_1 − 2D_2) v^{(3)}
+ (Y X v^{(1)} + v^{(2)} + v^{(4)} y)^T P^{−1} D_4 (Y X v^{(1)} + v^{(2)} + v^{(4)} y)

= e_m^T U (D_{v1} − 2 T D_{v3})² e_m + e_m^T V (D_{v1} + 2 T D_{v3})² e_m
+ v^{(2)T} Ξ^{−1} D_5 v^{(2)} + v^{(3)T} (T^{−1} D_3 − 2D_1 − 2D_2) v^{(3)}
+ (Y X v^{(1)} + v^{(2)} + v^{(4)} y)^T P^{−1} D_4 (Y X v^{(1)} + v^{(2)} + v^{(4)} y)

> 0,   (B.1)

where D_{v1} = diag(v^{(1)}), D_{v2} = diag(v^{(2)}), D_{v3} = diag(v^{(3)}), and D_{v4} = diag(v^{(4)}).


B.1. Proof of Lemma 3

Proof. It is easy to see that, if Δz^k = 0, then (23) holds for all α_k ≥ 0.

Suppose Δz^k ≠ 0. We have

∇_w Φ_μ(z^k)^T Δw^k + ∇_t Φ_μ(z^k)^T Δt^k + ∇_ξ Φ_μ(z^k)^T Δξ^k + ∇_b Φ_μ(z^k)^T Δb^k
= −(Δw^{kT}, Δξ^{kT}, Δt^{kT}, Δb^k) S (Δw^k; Δξ^k; Δt^k; Δb^k).   (B.2)

Since the matrix S is positive definite and Δz^k ≠ 0, the last equation implies

∇_w Φ_μ(z^k)^T Δw^k + ∇_t Φ_μ(z^k)^T Δt^k + ∇_ξ Φ_μ(z^k)^T Δξ^k + ∇_b Φ_μ(z^k)^T Δb^k < 0.   (B.3)

Consequently, there exists an ᾱ_k ∈ (0, 1] such that the fourth inequality in (23) is satisfied for all α_k ∈ (0, ᾱ_k]. On the other hand, since z^k is strictly feasible, the point z^k + αΔz^k remains strictly feasible for all α > 0 sufficiently small, so the first three inequalities in (23) also hold. The proof is complete.

C. Convergence Analysis

This appendix is devoted to the global convergence of the interior point method. We first show the convergence of Algorithm 2 when a fixed μ is applied to the barrier subproblem (14).

Lemma C.1. Let (z^k, λ^k) be generated by Algorithm 2. Then the z^k are strictly feasible for problem (10) and the Lagrangian multipliers λ^k are bounded from above.

Proof. For the sake of convenience, we use g_i(z), i = 1, 2, ..., 3m + 2n, to denote the constraint functions of the constrained problem (10).

We first show that the z^k are strictly feasible. Suppose on the contrary that there exist an infinite index subset K and an index i ∈ {1, 2, ..., 3m + 2n} such that {g_i(z^k)}_K ↓ 0. By the definition of Φ_μ(z) and since f(z) = Σ_{j=1}^{m} t_j + C Σ_{i=1}^{n} ξ_i is bounded from below on the feasible set, it must hold that {Φ_μ(z^k)}_K → ∞. However, the line search rule implies that the sequence {Φ_μ(z^k)} is decreasing, so we get a contradiction. Consequently, for any i ∈ {1, 2, ..., 3m + 2n}, g_i(z^k) is bounded away from zero. The boundedness of {λ^k} then follows from (24)–(27).

Lemma C.2. Let (z^k, λ^k) and (Δz^k, λ̂^{k+1}) be generated by Algorithm 2. If {z^k}_K is a convergent subsequence of {z^k}, then the sequence {(Δz^k, λ̂^{k+1})}_K is bounded.

Proof. Again, we use g_i(z), i = 1, 2, ..., 3m + 2n, to denote the constraint functions of the constrained problem (10). We suppose on the contrary that there exists an infinite subset K′ ⊆ K such that the subsequence {(Δz^k, λ̂^{k+1})}_{K′} tends to infinity. Let {z^k}_{K′} → z*. It follows from Lemma C.1 that there is an infinite subset K̄ ⊆ K′ such that {λ^k}_{K̄} → λ* and g_i(z*) > 0 for all i ∈ {1, 2, ..., 3m + 2n}. Thus we have

S_k → S*   (C.1)

as k → ∞ with k ∈ K̄. By Proposition B.1, S* is positive definite. Since the right-hand side of (18) is bounded and continuous, the unboundedness of {(Δz^k, λ̂^{k+1})}_{K̄} implies that the limit of the coefficient matrices of (18) is singular, which yields a contradiction. The proof is complete.

Lemma C.3. Let (z^k, λ^k) and (Δz^k, λ̂^{k+1}) be generated by Algorithm 2. If {z^k}_K is a convergent subsequence of {z^k}, then one has {Δz^k}_K → 0.

Proof. It follows from Lemma C.2 that the sequence {Δz^k}_K is bounded. Suppose on the contrary that there exists an infinite subset K′ ⊆ K such that {Δz^k}_{K′} → Δz* ≠ 0. Since the subsequences {z^k}_{K′}, {λ^k}_{K′}, and {λ̂^k}_{K′} are all bounded, there are points z*, λ*, and λ̂* as well as an infinite index set K̄ ⊆ K′ such that {z^k}_{K̄} → z*, {λ^k}_{K̄} → λ*, and {λ̂^k}_{K̄} → λ̂*. By Lemma C.1 we have g_i(z*) > 0 for all i ∈ {1, 2, ..., 3m + 2n}. Similar to the proof of Lemma 3, it is not difficult to get

∇_w Φ_μ(z*)^T Δw* + ∇_t Φ_μ(z*)^T Δt* + ∇_ξ Φ_μ(z*)^T Δξ* + ∇_b Φ_μ(z*)^T Δb*
= −(Δw*^T, Δξ*^T, Δt*^T, Δb*) S (Δw*; Δξ*; Δt*; Δb*) < 0.   (C.2)

Since g_i(z*) > 0 for all i ∈ {1, 2, ..., 3m + 2n}, there exists an ᾱ ∈ (0, 1] such that, for all α ∈ (0, ᾱ],

g_i(z* + αΔz*) > 0,   ∀i ∈ {1, 2, ..., 3m + 2n}.   (C.3)

Taking into account τ_1 ∈ (0, 1/2), we claim that there exists an α̂ ∈ (0, ᾱ] such that the following inequality holds for all α ∈ (0, α̂]:

Φ_μ(z* + αΔz*) − Φ_μ(z*) ≤ 1.1 τ_1 α ∇_z Φ_μ(z*)^T Δz*.   (C.4)

Let m* = min{j | β^j ∈ (0, α̂], j = 0, 1, ...} and α* = β^{m*}. It follows from (C.3) that the following inequality is satisfied for all k ∈ K̄ sufficiently large and any i ∈ {1, 2, ..., 3m + 2n}:

g_i(z^k + α*Δz^k) > 0.   (C.5)


Moreover, we have

Φ_μ(z^k + α*Δz^k) − Φ_μ(z^k)
= Φ_μ(z* + α*Δz*) − Φ_μ(z*) + Φ_μ(z^k + α*Δz^k) − Φ_μ(z* + α*Δz*) + Φ_μ(z*) − Φ_μ(z^k)
≤ 1.05 τ_1 α* ∇_z Φ_μ(z*)^T Δz*
≤ τ_1 α* ∇_z Φ_μ(z^k)^T Δz^k.   (C.6)

By the backtracking line search rule, the last inequality together with (C.5) yields the inequality α_k ≥ α* for all k ∈ K̄ large enough. Consequently, when k ∈ K̄ is sufficiently large, we have from (23) and (C.3) that

Φ_μ(z^{k+1}) ≤ Φ_μ(z^k) + τ_1 α_k ∇_z Φ_μ(z^k)^T Δz^k
≤ Φ_μ(z^k) + τ_1 α* ∇_z Φ_μ(z^k)^T Δz^k
≤ Φ_μ(z^k) + (1/2) τ_1 α* ∇_z Φ_μ(z*)^T Δz*.   (C.7)

This shows that {Φ_μ(z^k)}_{K̄} → −∞, contradicting the fact that {Φ_μ(z^k)}_{K̄} is bounded from below. The proof is complete.

Then we establish the convergence of Algorithms 2 and 4, that is, Theorems 5 and 6 in Section 3.

C.1. Proof of Theorem 5

Proof. We let (z*, λ̃*) be a limit point of {(z^k, λ̃^k)} and let the subsequence {(z^k, λ̃^k)}_K converge to (z*, λ̃*). Recall that (Δz^k, λ̃^{k+1}) is obtained from (18) with (z, λ) = (z^k, λ^k). Taking limits on both sides of (18) with (z, λ) = (z^k, λ^k) as k → ∞ with k ∈ K, by the use of Lemmas C.1 and C.3, we obtain

λ̃*^{(1)} − λ̃*^{(2)} − X^T Y^T λ̃*^{(4)} = 0,
C e_n − λ̃*^{(4)} − λ̃*^{(5)} = 0,
e_m − 2T* λ̃*^{(1)} − 2T* λ̃*^{(2)} − λ̃*^{(3)} = 0,
y^T λ̃*^{(4)} = 0,
(T*² − W*) λ̃*^{(1)} − μ e_m = 0,
(T*² + W*) λ̃*^{(2)} − μ e_m = 0,
T* λ̃*^{(3)} − μ e_m = 0,
P* λ̃*^{(4)} − μ e_n = 0,
Ξ* λ̃*^{(5)} − μ e_n = 0.   (C.8)

This shows that (z*, λ̃*) satisfies the first-order optimality conditions (15).

C.2. Proof of Theorem 6

Proof. (i) Without loss of generality, we suppose that the bounded subsequence {(z^j, λ^j)}_J converges to some point (z*, λ*). It is clear that z* is a feasible point of (10). Since μ_j, ε_{μ_j} → 0, it follows from (29) and (30) that

Res(z^j, λ^j, μ_j) →_J Res(z*, λ*, 0) = 0,
{λ^j}_J → λ* ≥ 0.   (C.9)

Consequently, (z*, λ*) satisfies the KKT conditions (13).

(ii) Let ξ_j = max{‖λ^j‖_∞, 1} and λ̄^j = ξ_j^{−1} λ^j. Obviously, {λ̄^j} is bounded. Hence there exists an infinite subset J′ ⊆ J such that {λ̄^j}_{J′} → λ̄ ≠ 0 and ‖λ̄^j‖_∞ = 1 for large j ∈ J′. From (30) we know that λ̄ ≥ 0. Dividing both sides of the first inequality (29), evaluated at (z^j, λ^j, μ_j), by ξ_j and then taking limits as j → ∞ with j ∈ J′, we get

λ̄^{(1)} − λ̄^{(2)} − X^T Y^T λ̄^{(4)} = 0,
λ̄^{(4)} + λ̄^{(5)} = 0,
2T* λ̄^{(1)} + 2T* λ̄^{(2)} + λ̄^{(3)} = 0,
y^T λ̄^{(4)} = 0,
(T*² − W*) λ̄^{(1)} = 0,
(T*² + W*) λ̄^{(2)} = 0,
T* λ̄^{(3)} = 0,
P* λ̄^{(4)} = 0,
Ξ* λ̄^{(5)} = 0.   (C.10)

Since z* is feasible, the above equations show that z* is a Fritz-John point of problem (10).

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

The authors would like to acknowledge support for this project from the National Science Foundation (NSF Grants 11371154, 11071087, 11271069, and 61103202) of China.

References

[1] Y. M. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization," in Proceedings of the 14th International Conference on Machine Learning (ICML '97), vol. 97, pp. 412–420, 1997.
[2] G. Forman, "An extensive empirical study of feature selection metrics for text classification," The Journal of Machine Learning Research, vol. 3, pp. 1289–1305, 2003.
[3] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1–3, pp. 389–422, 2002.
[4] H. H. Zhang, J. Ahn, X. Lin, and C. Park, "Gene selection using support vector machines with non-convex penalty," Bioinformatics, vol. 22, no. 1, pp. 88–95, 2006.
[5] Y. Saeys, I. Inza, and P. Larranaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507–2517, 2007.
[6] J. Weston, F. Perez-Cruz, O. Bousquet, O. Chapelle, A. Elisseeff, and B. Scholkopf, "Feature selection and transduction for prediction of molecular bioactivity for drug design," Bioinformatics, vol. 19, no. 6, pp. 764–771, 2003.
[7] Y. Liu, "A comparative study on feature selection methods for drug discovery," Journal of Chemical Information and Computer Sciences, vol. 44, no. 5, pp. 1823–1828, 2004.
[8] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, Eds., Feature Extraction: Foundations and Applications, vol. 207, Springer, 2006.
[9] A. Rakotomamonjy, "Variable selection using SVM-based criteria," Journal of Machine Learning Research, vol. 3, no. 7-8, pp. 1357–1370, 2003.
[10] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik, "Feature selection for SVMs," in Proceedings of the Conference on Neural Information Processing Systems, vol. 13, pp. 668–674, Denver, Colo, USA, 2000.
[11] D. Peleg and R. Meir, "A feature selection algorithm based on the global minimization of a generalization error bound," in Proceedings of the Conference on Neural Information Processing Systems, Vancouver, Canada, 2004.
[12] P. S. Bradley and O. L. Mangasarian, "Feature selection via concave minimization and support vector machines," in Proceedings of the International Conference on Machine Learning, pp. 82–90, San Francisco, Calif, USA, 1998.
[13] J. Weston, A. Elisseeff, B. Scholkopf, and M. Tipping, "Use of the zero-norm with linear models and kernel methods," Journal of Machine Learning Research, vol. 3, pp. 1439–1461, 2003.
[14] A. B. Chan, N. Vasconcelos, and G. R. G. Lanckriet, "Direct convex relaxations of sparse SVM," in Proceedings of the 24th International Conference on Machine Learning (ICML '07), pp. 145–153, Corvallis, Ore, USA, June 2007.
[15] G. M. Fung and O. L. Mangasarian, "A feature selection Newton method for support vector machine classification," Computational Optimization and Applications, vol. 28, no. 2, pp. 185–202, 2004.
[16] J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani, "1-norm support vector machines," Advances in Neural Information Processing Systems, vol. 16, no. 1, pp. 49–56, 2004.
[17] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society B: Methodological, vol. 58, no. 1, pp. 267–288, 1996.
[18] T. Zhang, "Analysis of multi-stage convex relaxation for sparse regularization," Journal of Machine Learning Research, vol. 11, pp. 1081–1107, 2010.
[19] R. Chartrand, "Exact reconstruction of sparse signals via nonconvex minimization," IEEE Signal Processing Letters, vol. 14, no. 10, pp. 707–710, 2007.
[20] R. Chartrand, "Nonconvex regularization for shape preservation," in Proceedings of the 14th IEEE International Conference on Image Processing (ICIP '07), vol. 1, pp. I293–I296, San Antonio, Tex, USA, September 2007.
[21] Z. Xu, H. Zhang, Y. Wang, X. Chang, and Y. Liang, "L1/2 regularization," Science China Information Sciences, vol. 53, no. 6, pp. 1159–1169, 2010.
[22] W. J. Chen and Y. J. Tian, "Lp-norm proximal support vector machine and its applications," Procedia Computer Science, vol. 1, no. 1, pp. 2417–2423, 2010.
[23] J. Liu, J. Li, W. Xu, and Y. Shi, "A weighted Lq adaptive least squares support vector machine classifiers-Robust and sparse approximation," Expert Systems with Applications, vol. 38, no. 3, pp. 2253–2259, 2011.
[24] Y. Liu, H. H. Zhang, C. Park, and J. Ahn, "Support vector machines with adaptive Lq penalty," Computational Statistics & Data Analysis, vol. 51, no. 12, pp. 6380–6394, 2007.
[25] Z. Liu, S. Lin, and M. Tan, "Sparse support vector machines with Lp penalty for biomarker identification," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 1, pp. 100–107, 2010.
[26] A. Rakotomamonjy, R. Flamary, G. Gasso, and S. Canu, "lp-lq penalty for sparse linear and sparse multiple kernel multitask learning," IEEE Transactions on Neural Networks, vol. 22, no. 8, pp. 1307–1320, 2011.
[27] J. Y. Tan, Z. Zhang, L. Zhen, C. H. Zhang, and N. Y. Deng, "Adaptive feature selection via a new version of support vector machine," Neural Computing and Applications, vol. 23, no. 3-4, pp. 937–945, 2013.
[28] Z. B. Xu, X. Y. Chang, F. M. Xu, and H. Zhang, "L1/2 regularization: an iterative half thresholding algorithm," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 7, pp. 1013–1027, 2012.
[29] D. Ge, X. Jiang, and Y. Ye, "A note on the complexity of Lp minimization," Mathematical Programming, vol. 129, no. 2, pp. 285–299, 2011.
[30] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 1995.
[31] J. Fan and H. Peng, "Nonconcave penalized likelihood with a diverging number of parameters," The Annals of Statistics, vol. 32, no. 3, pp. 928–961, 2004.
[32] K. Knight and W. Fu, "Asymptotics for lasso-type estimators," The Annals of Statistics, vol. 28, no. 5, pp. 1356–1378, 2000.
[33] B. S. Tian and X. Q. Yang, "An interior-point l1/2-penalty method for nonlinear programming," Technical Report, Department of Applied Mathematics, Hong Kong Polytechnic University, 2013.
[34] L. Armijo, "Minimization of functions having Lipschitz continuous first partial derivatives," Pacific Journal of Mathematics, vol. 16, pp. 1–3, 1966.
[35] F. John, "Extremum problems with inequalities as side-conditions," in Studies and Essays, Courant Anniversary Volume, pp. 187–204, 1948.
[36] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz, "UCI repository of machine learning databases," Technical Report 9702, Department of Information and Computer Science, University of California, Irvine, Calif, USA, 1998, http://archive.ics.uci.edu/ml.


Page 5: Research Article An Interior Point Method for -SVM …downloads.hindawi.com/journals/jam/2014/942520.pdfResearch Article An Interior Point Method for 1/2-SVM and Application to Feature

Journal of Applied Mathematics 5

119872 =

(((((((

(

0 0 0 0 119868 minus119868 0 minus119883119879119884119879

0

0 0 0 0 0 0 0 minus119868 minus119868

0 0 minus2 (1198631+ 1198632) 0 minus2119879 minus2119879 minus119868 0 0

0 0 0 0 0 0 0 119910119879

0

minus1198631

0 21198791198631

0 1198792minus 119882 0 0 0 0

1198632

0 21198791198632

0 0 1198792+ 119882 0 0 0

0 0 1198633

0 0 0 119879 0 0

1198634119884119883 119863

40 119863

4119910 0 0 0 119875 0

0 1198635

0 0 0 0 0 0 Ξ

)))))))

)

(17)

where 1198631

= diag(120582(1)) 1198632

= diag(120582(2)) 1198633

= diag(120582(3))1198634= diag(120582(4)) and119863

5= diag(120582(5))

We can rewrite (16) as

(120582(1)

+ Δ120582(1)

) minus (120582(2)

+ Δ120582(2)

) minus 119883119879119884119879(120582(4)

+ Δ120582(4)

) = 0

(120582(4)

+ Δ120582(4)

) + (120582(5)

+ Δ120582(5)

) = 119862 lowast 119890119899

2 (1198631+ 1198632) Δ119905 + 2119879 (120582

(1)+ Δ120582(1)

)

+ 2119879 (120582(2)

+ Δ120582(2)

) + (120582(3)

+ Δ120582(3)

) = 119890119898

119910119879(120582(4)

+ Δ120582(4)

) = 0

minus1198631Δ119908 + 2119879119863

1Δ119905 + (119879

2minus 119882) (120582

(1)+ Δ120582(1)

) = 120583119890119898

1198632Δ119908 + 2119879119863

2Δ119905 + (119879

2+ 119882) (120582

(2)+ Δ120582(2)

) = 120583119890119898

1198633Δ119905 + 119879 (120582

(3)+ Δ120582(3)

) = 120583119890119898

1198634119884119883Δ119908 + 119863

4Δ120585 + 119863

4119910 lowast Δ119887 + 119875 (120582

(4)+ Δ120582(4)

) = 120583119890119899

1198635Δ120585 + Ξ (120582

(5)+ Δ120582(5)

) = 120583119890119899

(18)

It follows from the last five equations that vector ≜ 120582 + Δ120582

can be expressed as

(1)

= (1198792minus 119882)minus1

(120583119890119898+ 1198631Δ119908 minus 2119879119863

1Δ119905)

(2)

= (1198792+ 119882)minus1

(120583119890119898minus 1198632Δ119908 minus 2119879119863

2Δ119905)

(3)

= 119879minus1

(120583119890119898minus 1198633Δ119905)

(4)

= 119875minus1

(120583119890119899minus 1198634119884119883Δ119908 minus 119863

4Δ120585 minus 119863

4119910 lowast Δ119887)

(5)

= Ξminus1

(120583119890119899minus 1198635Δ120585)

(19)

Substituting (19) into the first four equations of (18) weobtain

119878(

Δ119908

Δ120585

Δ119905

Δ119887

)

=

(((

(

minus120583(1198792minus 119882)minus1

119890119898+ 120583(119879

2+ 119882)minus1

119890119898

+119883119879119884119879119875minus1119890119899

minus119862 lowast 119890119899+ 120583119875minus1119890119899+ 120583Ξminus1119890119899

minus119890119898+ 2120583119879(119879

2minus 119882)minus1

119890119898

+2120583119879(1198792+ 119882)minus1

119890119898+ 120583119879minus1119890119898

120583119910119879119875minus1119890119899

)))

)

(20)

where matrix 119878 takes the form

119878 = (

11987811

11987812

11987813

11987814

11987821

11987822

11987823

11987824

11987831

11987832

11987833

11987834

11987841

11987842

11987843

11987844

) (21)

with blocks

11987811

= 119880 + 119881 + 119883119879119884119879119875minus11198634119884119883

11987812

= 119883119879119884119879119875minus11198634

11987813

= minus2 (119880 minus 119881)119879 11987814

= 119883119879119884119879119875minus11198634119910

11987821

= 119875minus11198634119884119883

11987822

= 119875minus11198634+ Ξminus11198635

11987823

= 0 11987824

= 119875minus11198634119910

6 Journal of Applied Mathematics

11987831

= minus2 (119880 minus 119881)119879 11987832

= 0

11987833

= 4119879 (119880 + 119881)119879 + 119879minus11198633minus 2 (119863

1+ 1198632) 119878

34= 0

11987841

= 119910119879119875minus11198634119884119883 119878

42= 119910119879119875minus11198634

11987843

= 0 11987844

= 119910119879119875minus11198634119910

(22)

and 119880 = (1198792minus 119882)minus11198631and 119881 = (119879

2+ 119882)minus11198632

33The Interior PointerAlgorithm Let 119911 = (119908 120585 119905 119887) and120582 =

(120582(1)

120582(2)

120582(3)

120582(4)

120582(5)

) we first present the interior pointeralgorithm to solve the barrier problem (14) and then discussthe details of the algorithm

Algorithm 2 The interior pointer algorithm (IPA) is asfollows

Step 0 Given tolerance 120598120583 set 120591

1isin (0 (12)) 119897 gt 0

1205741gt 1 120573 isin (0 1) Let 119896 = 0

Step 1 Stop if KKT condition (15) holds

Step 2 Compute Δ119911119896 from (20) and

119896+1 from (19)Compute 119911

119896+1= 119911119896+ 120572119896Δ119911119896 Update the Lagrangian

multipliers to obtain 120582119896+1

Step 3 Let 119896 = 119896 + 1 Go to Step 1

In Step 2 a step length 120572119896 is used to calculate 119911

119896+1We estimate 120572

119896 by Armijo line search [34] in which 120572119896

=

max 120573119895| 119895 = 0 1 2 for some 120573 isin (0 1) and satisfies the

following inequalities

(119905119896+1

)2

minus 119908119896+1

gt 0

(119905119896+1

)2

+ 119908119896+1

gt 0

119905119896+1

gt 0

Φ120583(119911119896+1

) minus Φ120583(119911119896)

le 1205911120572119896(nabla119908Φ

120583(119911119896)119879

Δ119908119896

+ nabla119905Φ120583(119911119896)119879

Δ119905119896

+nabla120585Φ120583(119911119896)119879

Δ120585119896+ nabla119887Φ120583(119911119896)119879

Δ119887119896)

(23)

where 1205911isin (0 12)

To avoid ill-conditioned growth of 119896 and guarantee thestrict dual feasibility the Lagrangian multipliers 120582

119896 should

be sufficiently positive and bounded from above Followinga similar idea of [33] we first update the dual multipliers by

(1)(119896+1)

119894

=

min119897120583

(119905119896

119894)2

minus 119908119896

119894

if (1)(119896+1)119894

lt min119897120583

(119905119896

119894)2

minus 119908119896

119894

(1)(119896+1)

119894 if min119897

120583

(119905119896

119894)2

minus x119896119894

le (1)(119896+1)

119894le

1205831205741

(119905119896

119894)2

minus 119908119896

119894

1205831205741

(119905119896

119894)2

minus 119908119896

119894

if (1)(119896+1)119894

gt1205831205741

(119905119896

119894)2

minus 119908119896

119894

(2)(119896+1)

119894

=

min119897120583

(119905119896

119894)2

+ 119908119896

119894

if (2)(119896+1)119894

lt min119897120583

(119905119896

119894)2

+ 119908119896

119894

(2)(119896+1)

119894 if min119897

120583

(119905119896

119894)2

+ 119908119896

119894

le (2)(119896+1)

119894le

1205831205741

(119905119896

119894)2

+ 119908119896

119894

1205831205741

(119905119896

119894)2

+ 119908119896

119894

if (2)(119896+1)119894

gt1205831205741

(119905119896

119894)2

+ 119908119896

119894

(3)(119896+1)

119894=

min119897120583

119905119896

119894

if (3)(119896+1)119894

lt min119897120583

119905119896

119894

(3)(119896+1)

119894 if min119897

120583

119905119896

119894

le (3)(119896+1)

119894le

1205831205741

119905119896

119894

1205831205741

119905119896

119894

if (3)(119896+1)119894

gt1205831205741

119905119896

119894

(4)(119896+1)

119894=

min119897120583

119901119896

119894

if (4)(119896+1)119894

lt min119897120583

119901119896

119894

(4)(119896+1)

119894 if min119897

120583

119901119896

119894

le (4)(119896+1)

119894le

1205831205741

119901119896

119894

1205831205741

119901119896

119894

if (4)(119896+1)119894

gt1205831205741

119901119896

119894

(24)

Journal of Applied Mathematics 7

where 119901119894= 119910119894(119908119879119909119894+ 119887) + 120585

119894minus 1

(5)(119896+1)

119894=

min119897120583

120585119896

119894

if (5)(119896+1)119894

lt min119897120583

120585119896

119894

(5)(119896+1)

119894 if min119897

120583

120585119896

119894

le (5)(119896+1)

119894le

1205831205741

120585119896

119894

1205831205741

120585119896

119894

if (5)(119896+1)119894

gt1205831205741

120585119896

119894

(25)

where the parameters 119897 and 1205741satisfy 0 lt 119897 120574

1gt 1

Since positive definiteness of the matrix 119878 is demanded inthis method the Lagrangian multipliers

119896+1 should satisfythe following condition

120582(3)

minus 2 (120582(1)

+ 120582(2)

) 119905 ge 0 (26)

For the sake of simplicity the proof is given in Appendix BTherefore if 119896+1 satisfies (26) we let 120582119896+1 =

119896+1 Other-wise we would further update it by the following setting

120582(1)(119896+1)

= 1205742(1)(119896+1)

120582(2)(119896+1)

= 1205742(2)(119896+1)

120582(3)(119896+1)

= 1205743(3)(119896+1)

(27)

where the constants $\gamma_2 \in (0, 1)$ and $\gamma_3 \ge 1$ satisfy
\[
\frac{\gamma_3}{\gamma_2} = \max_{i \in E} \frac{2 t^{k+1}_i \bigl(\hat{\lambda}^{(1)(k+1)}_i + \hat{\lambda}^{(2)(k+1)}_i\bigr)}{\hat{\lambda}^{(3)(k+1)}_i}, \qquad (28)
\]

with $E = \{1, 2, \ldots, m\}$. It is not difficult to see that the vector $(\lambda^{(1)(k+1)}, \lambda^{(2)(k+1)}, \lambda^{(3)(k+1)})$ determined by (27) satisfies (26): indeed, (28) gives $\gamma_3 \hat{\lambda}^{(3)(k+1)}_i \ge 2\gamma_2 t^{k+1}_i \bigl(\hat{\lambda}^{(1)(k+1)}_i + \hat{\lambda}^{(2)(k+1)}_i\bigr)$ for every $i$.

In practice, the KKT conditions (15) are only required to hold within a tolerance $\epsilon_\mu$. That is, the inner iteration stops when the following inequalities are met:

\[
\mathrm{Res}(z, \lambda, \mu) = \left\| \begin{array}{c}
\lambda^{(1)} - \lambda^{(2)} - X^{T}Y^{T}\lambda^{(4)} \\
C e_{n} - \lambda^{(4)} - \lambda^{(5)} \\
e_{m} - 2T\lambda^{(1)} - 2T\lambda^{(2)} - \lambda^{(3)} \\
e_{n}^{T}Y\lambda^{(4)} \\
(T^{2} - W)\lambda^{(1)} - \mu e_{m} \\
(T^{2} + W)\lambda^{(2)} - \mu e_{m} \\
T\lambda^{(3)} - \mu e_{m} \\
P\lambda^{(4)} - \mu e_{n} \\
\Xi\lambda^{(5)} - \mu e_{n}
\end{array} \right\| < \epsilon_{\mu}, \qquad (29)
\]
\[
\lambda \ge -\epsilon_{\mu} e, \qquad (30)
\]

where $\epsilon_\mu$ is related to the current barrier parameter $\mu$ and satisfies $\epsilon_\mu \downarrow 0$ as $\mu \to 0$.
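As a concrete illustration, a direct evaluation of the residual norm in (29) might look as follows. This is a sketch under the assumption that the diagonal matrices $T$, $W$, $P$, and $\Xi$ act componentwise on vectors; the function and variable names are ours, not part of the paper.

```python
import numpy as np

def kkt_residual(X, y, w, t, xi, b, lams, mu, C):
    # lams = (lam1, lam2, lam3, lam4, lam5): the five multiplier blocks of (29)
    lam1, lam2, lam3, lam4, lam5 = lams
    p = y * (X @ w + b) + xi - 1.0                      # p_i = y_i(w^T x_i + b) + xi_i - 1
    blocks = [
        lam1 - lam2 - X.T @ (y * lam4),                 # stationarity in w
        C - lam4 - lam5,                                # stationarity in xi
        1.0 - 2.0 * t * (lam1 + lam2) - lam3,           # stationarity in t
        np.atleast_1d(y @ lam4),                        # stationarity in b
        (t**2 - w) * lam1 - mu,                         # perturbed complementarity
        (t**2 + w) * lam2 - mu,
        t * lam3 - mu,
        p * lam4 - mu,
        xi * lam5 - mu,
    ]
    return np.linalg.norm(np.concatenate(blocks))
```

The stopping test (29)-(30) then amounts to checking that this value is below $\epsilon_\mu$ and that every multiplier is at least $-\epsilon_\mu$.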

Since the function $\Phi_\mu$ in (14) is convex, we have the following lemma, which shows that Algorithm 2 is well defined (the proof is given in Appendix B).

Lemma 3. Let $z^k$ be strictly feasible for problem (10). If $\Delta z^k = 0$, then (23) is satisfied for all $\alpha_k \ge 0$. If $\Delta z^k \ne 0$, then there exists an $\bar{\alpha}_k \in (0, 1]$ such that (23) holds for all $\alpha_k \in (0, \bar{\alpha}_k]$.

The proposed interior point method successively solves the barrier subproblem (14) with a decreasing sequence of barrier parameters. We simply reduce both $\epsilon_\mu$ and $\mu$ by a constant factor $\beta \in (0, 1)$. Finally, we test optimality for problem (10) by means of the residual norm $\mathrm{Res}(z, \lambda, 0)$.

Here we present the whole algorithm to solve the $L_{1/2}$-SVM problem (10).

Algorithm 4. The algorithm for solving the $L_{1/2}$-SVM problem (10) is as follows.

Step 0. Set $w^0 \in R^m$, $b^0 \in R$, $\xi^0 \in R^n$, $\lambda^0 = \hat{\lambda}^0 > 0$, and $t^0_i \ge \sqrt{|w^0_i| + 1/2}$, $i \in \{1, 2, \ldots, m\}$. Given constants $\mu_0 > 0$, $\epsilon_{\mu_0} > 0$, $\beta \in (0, 1)$, and $\epsilon > 0$, let $j = 0$.

Step 1. Stop if $\mathrm{Res}(z^j, \lambda^j, 0) \le \epsilon$ and $\lambda^j \ge 0$.

Step 2. Starting from $(z^j, \lambda^j)$, apply Algorithm 2 to solve (14) with barrier parameter $\mu_j$ and stopping tolerance $\epsilon_{\mu_j}$. Set $z^{j+1} = z^{j,k}$, $\hat{\lambda}^{j+1} = \hat{\lambda}^{j,k}$, and $\lambda^{j+1} = \lambda^{j,k}$.

Step 3. Set $\mu_{j+1} = \beta \mu_j$ and $\epsilon_{\mu_{j+1}} = \beta \epsilon_{\mu_j}$. Let $j = j + 1$ and go to Step 1.

In Algorithm 4, the index $j$ denotes an outer iteration, while $k$ denotes the last inner iteration of Algorithm 2.
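The outer loop therefore only shrinks $\mu$ and $\epsilon_\mu$ and warm-starts the inner solver from the previous iterate. A minimal sketch, assuming a routine solve_barrier_subproblem that runs Algorithm 2 until $\mathrm{Res}(z, \lambda, \mu) < \epsilon_\mu$ and a routine residual that evaluates (29), is shown below; both callables stand in for the inner interior point iteration of this paper and are not library functions.

```python
def l_half_svm_outer_loop(z0, lam0, solve_barrier_subproblem, residual,
                          mu0=1.0, eps_mu0=0.1, beta=0.5, eps=1e-6, max_outer=100):
    z, lam, mu, eps_mu = z0, lam0, mu0, eps_mu0
    for _ in range(max_outer):
        # Step 1: test optimality of problem (10) with mu = 0
        if residual(z, lam, 0.0) <= eps and (lam >= 0).all():
            break
        # Step 2: approximately solve the barrier subproblem (14) by Algorithm 2
        z, lam = solve_barrier_subproblem(z, lam, mu, eps_mu)
        # Step 3: reduce the barrier parameter and its tolerance by the same factor
        mu, eps_mu = beta * mu, beta * eps_mu
    return z, lam
```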

The convergence of the proposed interior point method can be proved. We state the theorems here and give the proofs in Appendix C.

Theorem 5. Let $\{(z^k, \lambda^k)\}$ be generated by Algorithm 2. Then any limit point of the sequence $\{(z^k, \hat{\lambda}^k)\}$ generated by Algorithm 2 satisfies the first-order optimality conditions (15).

Theorem 6. Let $\{(z^j, \lambda^j)\}$ be generated by Algorithm 4 with its termination condition ignored. Then the following statements are true.

(i) Any limit point of $\{(z^j, \lambda^j)\}$ satisfies the first-order optimality conditions (13).

(ii) The limit point $z^*$ of a convergent subsequence $\{z^j\}_{\mathcal{J}} \subseteq \{z^j\}$ with unbounded multipliers $\{\lambda^j\}_{\mathcal{J}}$ is a Fritz-John point [35] of problem (10).

4. Experiments

In this section, we tested the constrained optimization reformulation of the $L_{1/2}$-SVM and the proposed interior point method. We compared the performance of the $L_{1/2}$-SVM with $L_2$-SVM [30], $L_1$-SVM [12], and $L_0$-SVM [12] on artificial data and ten UCI data sets (http://archive.ics.uci.edu/ml/). These four problems were solved in the primal, referencing the machine learning toolbox Spider (http://people.kyb.tuebingen.mpg.de/spider/). The $L_2$-SVM and $L_1$-SVM were solved directly by quadratic programming and linear programming, respectively. For the NP-hard problem $L_0$-SVM, a commonly cited approximation method, Feature Selection Concave (FSV) [12], was applied, and then the FSV


problem was solved by a Successive Linear Approximation (SLA) algorithm. All the experiments were run on a personal computer (1.6 GHz CPU, 4 GB of RAM) with MATLAB R2010b on 64-bit Windows 7.

In the proposed interior point method (Algorithms 2 and 4), we set the parameters as $\beta = 0.6$, $\tau_1 = 0.05$, $l = 10^{-5}$, and $\gamma_1 = 10^{5}$ in Algorithm 2 and $\beta = 0.5$ in Algorithm 4. The balance parameter $C$ was selected by 5-fold cross-validation on the training set over the range $\{2^{-10}, 2^{-9}, \ldots, 2^{10}\}$. After training, the weights that did not satisfy the criterion $|w_j| / \max_i(|w_i|) \ge 10^{-4}$ [14] were set to zero. Then the cardinality of the hyperplane was computed as the number of the nonzero weights.
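A small sketch of this post-processing step (our own illustration of the thresholding rule from [14], not the authors' script) is:

```python
import numpy as np

def hyperplane_cardinality(w, rel_tol=1e-4):
    # zero out weights with |w_j| / max_i |w_i| < rel_tol, then count the survivors
    w = np.asarray(w, dtype=float).copy()
    scale = np.max(np.abs(w))
    if scale > 0:
        w[np.abs(w) / scale < rel_tol] = 0.0
    return w, int(np.count_nonzero(w))
```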

4.1. Artificial Data. First we took an artificial binary linear classification problem as an example. The problem is similar to that in [13]. The probability of $y = 1$ or $-1$ is equal. The first 6 features are relevant but redundant. In 70% of the samples, the first three features $x_1, x_2, x_3$ were drawn as $x_i = y N(i, 1)$ and the second three features $x_4, x_5, x_6$ as $x_i = N(0, 1)$. Otherwise, the first three were drawn as $x_i = N(0, 1)$ and the second three as $x_i = y N(i - 3, 1)$. The remaining features are noise, $x_i = N(0, 20)$, $i = 7, \ldots, m$. Here $m$ is the dimension of the input features. The inputs were scaled to zero mean and unit standard deviation. In each trial, 500 points were generated for testing, and the average results were estimated over 30 trials.
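For reproducibility, the following sketch generates data of this type. It assumes that $N(a, s)$ denotes a normal distribution with mean $a$ and standard deviation $s$ and that the 70%/30% split is applied independently per sample; both are our reading of the description above.

```python
import numpy as np

def make_artificial_data(n, m=30, seed=None):
    rng = np.random.default_rng(seed)
    y = rng.choice([-1.0, 1.0], size=n)           # P(y = 1) = P(y = -1) = 1/2
    X = rng.normal(0.0, 20.0, size=(n, m))        # noise features x_7, ..., x_m
    relevant_first = rng.random(n) < 0.7          # 70% of the samples
    for i in range(n):
        if relevant_first[i]:
            X[i, 0:3] = y[i] * rng.normal([1.0, 2.0, 3.0], 1.0)   # x_j = y N(j, 1)
            X[i, 3:6] = rng.normal(0.0, 1.0, size=3)              # x_j = N(0, 1)
        else:
            X[i, 0:3] = rng.normal(0.0, 1.0, size=3)
            X[i, 3:6] = y[i] * rng.normal([1.0, 2.0, 3.0], 1.0)   # x_j = y N(j-3, 1)
    X = (X - X.mean(axis=0)) / X.std(axis=0)      # scale to zero mean, unit std
    return X, y
```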

In the first experiment, we consider the cases with the fixed feature size $m = 30$ and different training sample sizes $n = 10, 20, \ldots, 100$. The average results over the 30 trials are shown in Table 1 and Figure 2. Figure 2 (left) plots the average cardinality of each classifier. Since the artificial data sets have 2 relevant and nonredundant features, the ideal average cardinality is 2. Figure 2 (left) shows that the three sparse SVMs, $L_1$-SVM, $L_{1/2}$-SVM, and $L_0$-SVM, can achieve sparse solutions, while the $L_2$-SVM uses almost all features in each data set. Furthermore, the solutions of $L_{1/2}$-SVM and $L_0$-SVM are much sparser than those of $L_1$-SVM. As shown in Table 1, the $L_1$-SVM selects more than 6 features in all cases, which implies that some redundant or irrelevant features are selected. The average cardinalities of $L_{1/2}$-SVM and $L_0$-SVM are similar and close to 2. However, when $n = 10$ and 20, the $L_0$-SVM has average cardinalities of 1.42 and 1.87, respectively. It means that the $L_0$-SVM sometimes selects only one feature in a low-sample data set and may ignore some really relevant features. Consequently, with cardinalities between 2.05 and 2.90, the $L_{1/2}$-SVM gives a more reliable solution than $L_0$-SVM. In short, as far as the number of selected features is concerned, the $L_{1/2}$-SVM behaves better than the other three methods.

Figure 2 (right) plots the trend of the prediction accuracy versus the size of the training sample. The classification performance of all methods generally improves as the training sample size $n$ increases. $L_1$-SVM has the best prediction performance in all cases, slightly better than $L_{1/2}$-SVM. $L_{1/2}$-SVM is more accurate in classification than $L_2$-SVM and $L_0$-SVM, especially in the cases of $n = 10, \ldots, 50$. As shown in Table 1, when there are only 10 training samples, the average accuracy of $L_{1/2}$-SVM is 88.05%, while the results of $L_2$-SVM and $L_0$-SVM are 84.65% and 77.65%, respectively. Compared with $L_2$-SVM and $L_0$-SVM, $L_{1/2}$-SVM has the average accuracy increased by 3.4% and 10.4%, respectively, which can be explained as follows. For the $L_2$-SVM, all features are selected without discrimination, and the prediction would be misled by the irrelevant features. For the $L_0$-SVM, few features are selected and some relevant features are not included, which puts a negative impact on the prediction result. As the tradeoff between $L_2$-SVM and $L_0$-SVM, $L_{1/2}$-SVM performs better than the two.

The average results over the ten artificial data sets in the first experiment are shown at the bottom of Table 1. On average, the accuracy of $L_{1/2}$-SVM is 0.87% lower than that of $L_1$-SVM, while the features selected by $L_{1/2}$-SVM are 74.14% fewer than those of $L_1$-SVM. It indicates that the $L_{1/2}$-SVM can achieve a much sparser solution than $L_1$-SVM with little cost in accuracy. Moreover, the average accuracy of $L_{1/2}$-SVM over the 10 data sets is 2.22% higher than that of $L_0$-SVM with a similar cardinality. To sum up, the $L_{1/2}$-SVM provides the best balance between accuracy and sparsity among the three sparse SVMs.

To further evaluate the feature selection performance of $L_{1/2}$-SVM, we investigate whether the features are correctly selected. Since the $L_2$-SVM is not designed for feature selection, it is not included in this comparison. Since our artificial data sets have 2 best features ($x_3$, $x_6$), the best result should have these two features ranked on top according to the absolute values of their weights $|w_j|$. In the experiment, we select the top 2 features with the maximal $|w_j|$ for each method and calculate the frequency with which the top 2 features are $x_3$ and $x_6$ in 30 runs. The results are listed in Table 2. When the training sample size is too small, it is difficult for all sparse SVMs to discriminate the two most important features. For example, when $n = 10$, the selected frequencies of $L_1$-SVM, $L_0$-SVM, and $L_{1/2}$-SVM are 7, 3, and 9, respectively. When $n$ increases, all methods tend to make more correct selections. Moreover, Table 2 shows that the $L_{1/2}$-SVM outperforms the other two methods in all cases. For example, when $n = 100$, the selected frequencies of $L_1$-SVM and $L_0$-SVM are 22 and 25, respectively, while the result of $L_{1/2}$-SVM is 27. The $L_1$-SVM selects too many redundant or irrelevant features, which may influence the ranking to some extent. Therefore, $L_1$-SVM is not as good as $L_{1/2}$-SVM at distinguishing the critical features. The $L_0$-SVM has a lower hit frequency than $L_{1/2}$-SVM, which is probably due to the excessively small feature subset it obtains. Above all, Tables 1 and 2 and Figure 2 clearly show that the $L_{1/2}$-SVM is a promising sparsity-driven

classification method.

In the second simulation, we consider the cases with various dimensions of the feature space, $m = 20, 40, \ldots, 180, 200$, and the fixed training sample size $n = 100$. The average results over 30 trials are shown in Figure 3 and Table 3. Since there are only 6 relevant features, a larger $m$ means more noisy features. Figure 3 (left) shows that, as the dimension increases from 20 to 200, the number of features selected by $L_1$-SVM increases from 8.26 to 23.1. However, the cardinalities of $L_{1/2}$-SVM and $L_0$-SVM keep stable (from 2.2 to 2.95). It indicates that the $L_{1/2}$-SVM and $L_0$-SVM are more suitable for feature selection than $L_1$-SVM.

Figure 3 (right) shows that, with the increase of the noise features, the accuracy of $L_2$-SVM drops significantly (from 98.68% to 87.47%). On the contrary, for the other three sparse


[Figure 2: Results comparison on 10 artificial data sets with various training sample sizes. Left panel: cardinality versus the number of training samples; right panel: accuracy (%) versus the number of training samples. Curves: L2-SVM, L1-SVM, L0-SVM, L1/2-SVM.]

Table 1: Results comparison on 10 artificial data sets with various training sample sizes.

n     L2-SVM          L1-SVM          L0-SVM          L1/2-SVM
      Card    Acc     Card    Acc     Card    Acc     Card    Acc
10    29.97   84.65   6.93    88.55   1.43    77.65   2.25    88.05
20    30.00   92.21   8.83    96.79   1.87    93.17   2.35    95.50
30    30.00   94.27   8.47    98.23   2.03    94.67   2.05    97.41
40    29.97   95.99   9.00    98.75   2.07    95.39   2.10    97.75
50    30.00   96.61   9.50    98.91   2.13    96.57   2.45    97.68
60    29.97   97.25   10.00   98.94   2.20    97.61   2.45    97.98
70    30.00   97.41   11.00   98.98   2.20    97.56   2.40    98.23
80    30.00   97.68   10.03   99.09   2.23    97.61   2.60    98.50
90    29.97   97.89   9.20    99.21   2.30    97.41   2.90    98.30
100   30.00   98.13   10.27   99.09   2.50    97.93   2.55    98.36
Avg   29.99   95.21   9.32    97.65   2.10    94.56   2.41    96.78

"n" is the number of training samples, "Card" is the number of selected features, and "Acc" is the classification accuracy (%).

Table 2: Frequency with which the most important features (x3, x6) are ranked in the top 2 over 30 runs.

n          10   20   30   40   50   60   70   80   90   100
L1-SVM     7    14   16   18   22   23   20   22   21   22
L0-SVM     3    11   12   15   19   22   22   23   22   25
L1/2-SVM   9    16   19   24   24   23   27   26   26   27

Table 3: Average results over 10 artificial data sets with various dimensions.

              L2-SVM    L1-SVM   L0-SVM   L1/2-SVM
Cardinality   109.95    15.85    2.38     2.41
Accuracy (%)  92.93     99.07    97.76    98.26


Table 4: Feature selection performance comparison on UCI data sets (cardinality).

No.  UCI data set    n     m     L2-SVM          L1-SVM          L0-SVM          L1/2-SVM
                                 Mean    Std     Mean    Std     Mean    Std     Mean    Std
1    Pima diabetes   768   8     8.00    0.00    7.40    0.89    6.20    1.10    7.60    1.52
2    Breast Cancer   683   9     9.00    0.00    8.60    1.22    6.40    2.07    8.40    1.67
3    Wine (3)        178   13    13.00   0.00    9.73    0.69    3.73    0.28    4.07    0.49
4    Image (7)       2100  19    19.00   0.00    8.20    0.53    3.66    0.97    5.11    0.62
5    SPECT           267   22    22.00   0.00    17.40   5.27    9.80    4.27    9.00    1.17
6    WDBC            569   30    30.00   0.00    9.80    1.52    9.00    3.03    5.60    2.05
7    Ionosphere      351   34    33.20   0.45    25.00   3.35    24.60   10.21   21.80   8.44
8    SPECTF          267   44    44.00   0.00    38.80   10.98   24.00   11.32   25.60   7.75
9    Sonar           208   60    60.00   0.00    31.40   13.05   27.00   7.18    13.60   10.06
10   Musk1           476   166   166.00  0.00    85.80   18.85   48.60   4.28    41.00   14.16
Average cardinality        40.50 40.42           24.21           16.30           14.18

Table 5: Classification performance comparison on UCI data sets (accuracy, %).

No.  UCI data set    L2-SVM          L1-SVM          L0-SVM          L1/2-SVM
                     Mean    Std     Mean    Std     Mean    Std     Mean    Std
1    Pima diabetes   76.99   3.47    76.73   0.85    76.73   2.94    77.12   1.52
2    Breast cancer   96.18   2.23    97.06   0.96    96.32   1.38    96.47   1.59
3    Wine (3)        97.71   3.73    98.29   1.28    93.14   2.39    97.14   2.56
4    Image (7)       88.94   1.49    91.27   1.77    86.67   4.58    91.36   1.94
5    SPECT           83.40   2.15    78.49   2.31    78.11   3.38    82.34   5.10
6    WDBC            96.11   2.11    96.46   1.31    95.58   1.15    96.46   1.31
7    Ionosphere      83.43   5.44    88.00   3.70    84.57   5.38    87.43   5.94
8    SPECTF          78.49   3.91    74.34   4.50    73.87   3.25    78.49   1.89
9    Sonar           73.66   5.29    75.61   5.82    74.15   7.98    76.10   4.35
10   Musk1           84.84   1.73    82.11   2.42    76.00   5.64    82.32   3.38
Average accuracy     85.97           85.84           83.51           86.52

SVMs, there is little change in the accuracy. It reveals that SVMs can benefit from the feature reduction.

Table 3 shows the average results over all data sets in the second experiment. On average, the solution of $L_{1/2}$-SVM is much sparser than that of $L_1$-SVM, with a slightly better accuracy than $L_0$-SVM.

4.2. UCI Data Sets. We further tested the reformulation and the proposed interior point method for the $L_{1/2}$-SVM on 10 UCI data sets [36]. There are 8 binary classification problems and 2 multiclass problems (wine, image). Each feature of the input data was normalized to zero mean and unit variance, and the instances with missing values were deleted. Then the data was randomly split into a training set (80%) and a testing set (20%). For the two multiclass problems, a one-against-rest method was applied to construct a binary classifier for each class. We repeated the training and testing procedure 10 times, and the average results are shown in Tables 4 and 5 and Figure 4.
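A sketch of this preprocessing (dropping instances with missing values, standardizing each feature, making an 80/20 split, and building one-against-rest labels per class) is given below. It mirrors the description above rather than the authors' MATLAB code, and the function names are ours.

```python
import numpy as np

def preprocess_and_split(X, y, train_frac=0.8, seed=None):
    keep = ~np.isnan(X).any(axis=1)                    # delete instances with missing values
    X, y = X[keep], y[keep]
    X = (X - X.mean(axis=0)) / X.std(axis=0)           # zero mean, unit variance per feature
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_train = int(train_frac * len(y))
    return X[idx[:n_train]], y[idx[:n_train]], X[idx[n_train:]], y[idx[n_train:]]

def one_vs_rest_labels(y, positive_class):
    # binary labels for a single one-against-rest classifier
    return np.where(y == positive_class, 1.0, -1.0)
```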

Tables 4 and 5 summarize the feature selection and classification performance of the numerical experiments on the UCI data sets, respectively. Here $n$ is the number of samples and $m$ is the number of input features. For the two multiclass data sets, the numbers of classes are marked behind their names. Sparsity is defined as card/$m$, and a small value of sparsity is preferred. The data sets are arranged in ascending order of dimension. The lowest cardinality and the best accuracy rate for each problem are bolded.

As shown in Tables 4 and 5, the three sparse SVMs can encourage sparsity in all data sets while maintaining roughly identical accuracy with $L_2$-SVM. Among the three sparse methods, the $L_{1/2}$-SVM has the lowest cardinality (14.18) and the highest classification accuracy (86.52%) on average, while the $L_1$-SVM has the worst feature selection performance with the highest average cardinality (24.21) and the $L_0$-SVM has the lowest average classification accuracy (83.51%).

Figure 4 plots the sparsity (left) and classification accuracy (right) of each classifier on the UCI data sets. In three data sets (6, 9, 10), the $L_{1/2}$-SVM has the best performance both in feature selection and classification among the three sparse SVMs. Compared with $L_1$-SVM, $L_{1/2}$-SVM achieves a sparser solution in nine data sets. For example, in the data set "8 SPECTF", the features selected by $L_{1/2}$-SVM are 34%


[Figure 3: Results comparison on artificial data sets with various dimensions. Left panel: cardinality versus dimension; right panel: accuracy (%) versus dimension. Curves: L2-SVM, L1-SVM, L0-SVM, L1/2-SVM.]

less than $L_1$-SVM; at the same time, the accuracy of $L_{1/2}$-SVM is 4.1% higher than $L_1$-SVM. In most data sets (4, 5, 6, 9, 10), the cardinality of $L_{1/2}$-SVM drops significantly (by at least 37%) with an equal or slightly better result in accuracy. Only in the data set "3 Wine" is the accuracy of $L_{1/2}$-SVM decreased, by 1.1%, but the sparsity provided by $L_{1/2}$-SVM leads to a 58.2% improvement over $L_1$-SVM. In the remaining three data sets (1, 2, 7), the two methods have similar results in feature selection and classification. As seen above, the $L_{1/2}$-SVM can provide a lower dimensional representation than $L_1$-SVM with competitive prediction performance.

Figure 4 (right) shows that, compared with $L_0$-SVM, $L_{1/2}$-SVM has the classification accuracy improved in all data sets. For instance, in four data sets (3, 4, 8, 10), the classification accuracy of $L_{1/2}$-SVM is at least 4.0% higher than $L_0$-SVM. Especially in the data set "10 Musk1", $L_{1/2}$-SVM gives a 6.3% rise in accuracy over $L_0$-SVM. Meanwhile, it can be observed from Figure 4 (left) that $L_{1/2}$-SVM selects fewer features than $L_0$-SVM in five data sets (5, 6, 7, 9, 10). For example, in the data sets 6 and 9, the cardinalities of $L_{1/2}$-SVM are 37.8% and 49.6% less than those of $L_0$-SVM, respectively. In summary, $L_{1/2}$-SVM presents better classification performance than $L_0$-SVM, while it is effective in choosing relevant features.

5. Conclusions

In this paper, we proposed an $L_{1/2}$ regularization technique for simultaneous feature selection and classification in the SVM. We have reformulated the $L_{1/2}$-SVM into an equivalent smooth constrained optimization problem. The problem possesses a very simple structure and is relatively easy to develop numerical methods for. By the use of this interesting reformulation, we proposed an interior point method and established its convergence. Our numerical results supported the reformulation and the proposed method. The $L_{1/2}$-SVM can obtain sparser solutions than $L_1$-SVM with comparable classification accuracy. Furthermore, the $L_{1/2}$-SVM can achieve more accurate classification results than $L_0$-SVM (FSV).

Inspired by the good performance of the smooth optimization reformulation of the $L_{1/2}$-SVM, there are some interesting topics deserving further research. For example, developing more efficient algorithms for solving the reformulation, studying the nonlinear $L_{1/2}$-SVM, and exploring various applications of the $L_{1/2}$-SVM to further validate its effectiveness are interesting research topics. Some of them are under our current investigation.

Appendix

A. Properties of the Reformulation of the $L_{1/2}$-SVM

Let $D$ be the set of all feasible points of the problem (10). For any $z \in D$, we let $FD(z, D)$ and $LFD(z, D)$ be the sets of all feasible directions and linearized feasible directions of $D$ at $z$, respectively. Since the constraint functions of (10) are all convex, we immediately have the following lemma.

Lemma A.1. For any feasible point $z$ of (10), one has $FD(z, D) = LFD(z, D)$. (A.1)


[Figure 4: Results comparison on UCI data sets. Left panel: sparsity versus data set number; right panel: accuracy (%) versus data set number. Curves: L2-SVM, L1-SVM, L0-SVM, L1/2-SVM.]

Based on the above lemma, we can easily derive a first-order necessary condition for (10). The Lagrangian function of (10) is
\[
L(w, t, \xi, b, \lambda, \mu, \nu, p, q) = C\sum_{i=1}^{n}\xi_i + \sum_{j=1}^{m}t_j - \sum_{j=1}^{m}\lambda_j\bigl(t_j^2 - w_j\bigr) - \sum_{j=1}^{m}\mu_j\bigl(t_j^2 + w_j\bigr) - \sum_{j=1}^{m}\nu_j t_j - \sum_{i=1}^{n}p_i\bigl(y_i w^T x_i + y_i b + \xi_i - 1\bigr) - \sum_{i=1}^{n}q_i\xi_i, \qquad (A.2)
\]

where $\lambda$, $\mu$, $\nu$, $p$, $q$ are the Lagrangian multipliers. By the use of Lemma A.1, we immediately have the following theorem about the first-order necessary condition.

Theorem A.2. Let $(w, t, \xi, b)$ be a local solution of (10). Then there are Lagrangian multipliers $(\lambda, \mu, \nu, p, q) \in R^{m+m+m+n+n}$ such that the following KKT conditions hold:

\[
\begin{aligned}
&\frac{\partial L}{\partial w_j} = \lambda_j - \mu_j - \sum_{i=1}^{n} p_i y_i x_{ij} = 0, \quad j = 1, 2, \ldots, m,\\
&\frac{\partial L}{\partial t_j} = 1 - 2\lambda_j t_j - 2\mu_j t_j - \nu_j = 0, \quad j = 1, 2, \ldots, m,\\
&\frac{\partial L}{\partial \xi_i} = C - p_i - q_i = 0, \quad i = 1, 2, \ldots, n,\\
&\frac{\partial L}{\partial b} = \sum_{i=1}^{n} p_i y_i = 0,\\
&\lambda_j \ge 0, \quad t_j^2 - w_j \ge 0, \quad \lambda_j\bigl(t_j^2 - w_j\bigr) = 0, \quad j = 1, 2, \ldots, m,\\
&\mu_j \ge 0, \quad t_j^2 + w_j \ge 0, \quad \mu_j\bigl(t_j^2 + w_j\bigr) = 0, \quad j = 1, 2, \ldots, m,\\
&\nu_j \ge 0, \quad t_j \ge 0, \quad \nu_j t_j = 0, \quad j = 1, 2, \ldots, m,\\
&p_i \ge 0, \quad y_i\bigl(w^T x_i + b\bigr) + \xi_i - 1 \ge 0, \quad p_i\bigl(y_i(w^T x_i + b) + \xi_i - 1\bigr) = 0, \quad i = 1, 2, \ldots, n,\\
&q_i \ge 0, \quad \xi_i \ge 0, \quad q_i \xi_i = 0, \quad i = 1, 2, \ldots, n.
\end{aligned} \qquad (A.3)
\]

The following theorem shows that the level set is bounded.

Theorem A.3. For any given constant $c > 0$, the level set
\[
W_c = \{z = (w, t, \xi, b) \mid f(z) \le c\} \cap D \qquad (A.4)
\]
is bounded.

Proof. For any $(w, t, \xi, b) \in W_c$, we have
\[
\sum_{i=1}^{m} t_i + C \sum_{j=1}^{n} \xi_j \le c. \qquad (A.5)
\]


Combining this with $t_i \ge 0$, $i = 1, \ldots, m$, and $\xi_j \ge 0$, $j = 1, \ldots, n$, we have
\[
0 \le t_i \le c, \quad i = 1, \ldots, m, \qquad 0 \le \xi_j \le \frac{c}{C}, \quad j = 1, \ldots, n. \qquad (A.6)
\]
Moreover, for any $(w, t, \xi, b) \in D$,
\[
|w_i| \le t_i^2, \quad i = 1, \ldots, m. \qquad (A.7)
\]
Consequently, $w$, $t$, and $\xi$ are bounded. What is more, from the condition
\[
y_i\bigl(w^T x_i + b\bigr) \ge 1 - \xi_i, \quad i = 1, \ldots, n, \qquad (A.8)
\]
we have
\[
w^T x_i + b \ge 1 - \xi_i \ \text{ if } y_i = 1, \qquad w^T x_i + b \le -1 + \xi_i \ \text{ if } y_i = -1, \quad i = 1, \ldots, n. \qquad (A.9)
\]
Thus, if the feasible region $D$ is not empty, then we have
\[
\max_{y_i = 1}\bigl(1 - \xi_i - w^T x_i\bigr) \le b \le \min_{y_i = -1}\bigl(-1 + \xi_i - w^T x_i\bigr). \qquad (A.10)
\]
Hence $b$ is also bounded. The proof is complete.

B. Proof of Lemma 3

Lemma 3 (see Section 3) shows that Algorithm 2 is well defined. We first introduce the following proposition to show that the matrix $S$ defined by (21) is always positive definite, which ensures that Algorithm 2 is well defined.

Proposition B.1. Let the inequality (26) hold; then the matrix $S$ defined by (21) is positive definite.

Proof. By an elementary deduction, we have, for any $v^{(1)} \in R^m$, $v^{(2)} \in R^n$, $v^{(3)} \in R^m$, $v^{(4)} \in R$ with $(v^{(1)}, v^{(2)}, v^{(3)}, v^{(4)}) \ne 0$,
\[
\begin{aligned}
&\bigl(v^{(1)T}, v^{(2)T}, v^{(3)T}, v^{(4)T}\bigr)\, S \begin{pmatrix} v^{(1)}\\ v^{(2)}\\ v^{(3)}\\ v^{(4)} \end{pmatrix}
= \sum_{i=1}^{4}\sum_{j=1}^{4} v^{(i)T} S_{ij} v^{(j)}\\
&\quad= v^{(1)T}\bigl(U + V + X^TY^TP^{-1}D_4YX\bigr)v^{(1)} + v^{(1)T}X^TY^TP^{-1}D_4v^{(2)} - 2v^{(1)T}(U - V)Tv^{(3)} + v^{(1)T}X^TY^TP^{-1}D_4yv^{(4)}\\
&\qquad + v^{(2)T}P^{-1}D_4YXv^{(1)} + v^{(2)T}\bigl(P^{-1}D_4 + \Xi^{-1}D_5\bigr)v^{(2)} + v^{(2)T}P^{-1}D_4yv^{(4)}\\
&\qquad - 2v^{(3)T}(U - V)Tv^{(1)} + v^{(3)T}\bigl(4T(U + V)T + T^{-1}D_3 - 2(D_1 + D_2)\bigr)v^{(3)}\\
&\qquad + v^{(4)T}y^TP^{-1}D_4YXv^{(1)} + v^{(4)T}y^TP^{-1}D_4v^{(2)} + v^{(4)T}y^TP^{-1}D_4yv^{(4)}\\
&\quad= v^{(1)T}Uv^{(1)} - 4v^{(3)T}TUv^{(1)} + 4v^{(3)T}TUTv^{(3)} + v^{(1)T}Vv^{(1)} + 4v^{(3)T}TVv^{(1)} + 4v^{(3)T}TVTv^{(3)}\\
&\qquad + v^{(2)T}\Xi^{-1}D_5v^{(2)} + v^{(3)T}\bigl(T^{-1}D_3 - 2D_1 - 2D_2\bigr)v^{(3)}
 + \bigl(YXv^{(1)} + v^{(2)} + v^{(4)}y\bigr)^T P^{-1}D_4 \bigl(YXv^{(1)} + v^{(2)} + v^{(4)}y\bigr)\\
&\quad= e_m^T U\bigl(D_{v_1} - 2TD_{v_3}\bigr)^2 e_m + e_m^T V\bigl(D_{v_1} + 2TD_{v_3}\bigr)^2 e_m + v^{(2)T}\Xi^{-1}D_5v^{(2)} + v^{(3)T}\bigl(T^{-1}D_3 - 2D_1 - 2D_2\bigr)v^{(3)}\\
&\qquad + \bigl(YXv^{(1)} + v^{(2)} + v^{(4)}y\bigr)^T P^{-1}D_4 \bigl(YXv^{(1)} + v^{(2)} + v^{(4)}y\bigr) > 0,
\end{aligned} \qquad (B.1)
\]
where $D_{v_1} = \mathrm{diag}(v^{(1)})$, $D_{v_2} = \mathrm{diag}(v^{(2)})$, $D_{v_3} = \mathrm{diag}(v^{(3)})$, and $D_{v_4} = \mathrm{diag}(v^{(4)})$.


B.1. Proof of Lemma 3

Proof. It is easy to see that if $\Delta z^k = 0$, then (23) is satisfied for all $\alpha_k \ge 0$.

Suppose $\Delta z^k \ne 0$. We have
\[
\nabla_w \Phi_\mu(z^k)^T \Delta w^k + \nabla_t \Phi_\mu(z^k)^T \Delta t^k + \nabla_\xi \Phi_\mu(z^k)^T \Delta \xi^k + \nabla_b \Phi_\mu(z^k)^T \Delta b^k
= -\bigl(\Delta w^{kT}, \Delta \xi^{kT}, \Delta t^{kT}, \Delta b^k\bigr) S \begin{pmatrix} \Delta w^k \\ \Delta \xi^k \\ \Delta t^k \\ \Delta b^k \end{pmatrix}. \qquad (B.2)
\]
Since the matrix $S$ is positive definite and $\Delta z^k \ne 0$, the last equation implies
\[
\nabla_w \Phi_\mu(z^k)^T \Delta w^k + \nabla_t \Phi_\mu(z^k)^T \Delta t^k + \nabla_\xi \Phi_\mu(z^k)^T \Delta \xi^k + \nabla_b \Phi_\mu(z^k)^T \Delta b^k < 0. \qquad (B.3)
\]
Consequently, there exists an $\bar{\alpha}_k \in (0, 1]$ such that the fourth inequality in (23) is satisfied for all $\alpha_k \in (0, \bar{\alpha}_k]$.

On the other hand, since $z^k$ is strictly feasible, the point $z^k + \alpha \Delta z^k$ remains strictly feasible for all $\alpha > 0$ sufficiently small. The proof is complete.

C. Convergence Analysis

This appendix is devoted to the global convergence of the interior point method. We first show the convergence of Algorithm 2 when a fixed $\mu$ is applied to the barrier subproblem (14).

Lemma C.1. Let $\{(z^k, \lambda^k)\}$ be generated by Algorithm 2. Then $\{z^k\}$ are strictly feasible for problem (10), and the Lagrangian multipliers $\{\lambda^k\}$ are bounded from above.

Proof. For the sake of convenience, we use $g_i(z)$, $i = 1, \ldots, 3m + 2n$, to denote the constraint functions of the constrained problem (10).

We first show that $\{z^k\}$ are strictly feasible. Suppose on the contrary that there exist an infinite index subset $\mathcal{K}$ and an index $i \in \{1, \ldots, 3m + 2n\}$ such that $\{g_i(z^k)\}_{\mathcal{K}} \downarrow 0$. By the definition of $\Phi_\mu(z)$, and since $f(z) = \sum_{j=1}^{m} t_j + C\sum_{i=1}^{n}\xi_i$ is bounded from below on the feasible set, it must hold that $\{\Phi_\mu(z^k)\}_{\mathcal{K}} \to \infty$. However, the line search rule implies that the sequence $\{\Phi_\mu(z^k)\}$ is decreasing. So we get a contradiction. Consequently, for any $i \in \{1, \ldots, 3m + 2n\}$, $g_i(z^k)$ is bounded away from zero. The boundedness of $\{\lambda^k\}$ then follows from (24)–(27).

Lemma C.2. Let $\{(z^k, \lambda^k)\}$ and $\{(\Delta z^k, \hat{\lambda}^{k+1})\}$ be generated by Algorithm 2. If $\{z^k\}_{\mathcal{K}}$ is a convergent subsequence of $\{z^k\}$, then the sequence $\{(\Delta z^k, \hat{\lambda}^{k+1})\}_{\mathcal{K}}$ is bounded.

Proof. Again, we use $g_i(z)$, $i = 1, \ldots, 3m + 2n$, to denote the constraint functions of the constrained problem (10). We suppose on the contrary that there exists an infinite subset $\mathcal{K}' \subseteq \mathcal{K}$ such that the subsequence $\{(\Delta z^k, \hat{\lambda}^{k+1})\}_{\mathcal{K}'}$ tends to infinity. Let $\{z^k\}_{\mathcal{K}'} \to z^*$. It follows from Lemma C.1 that there is an infinite subset $\bar{\mathcal{K}} \subseteq \mathcal{K}'$ such that $\{\lambda^k\}_{\bar{\mathcal{K}}} \to \lambda^*$ and $g_i(z^*) > 0$ for all $i \in \{1, \ldots, 3m + 2n\}$. Thus we have
\[
S^k \longrightarrow S^* \qquad (C.1)
\]
as $k \to \infty$ with $k \in \bar{\mathcal{K}}$. By Proposition B.1, $S^*$ is positive definite. Since the right-hand side of (18) is bounded and continuous, the unboundedness of $\{(\Delta z^k, \hat{\lambda}^{k+1})\}_{\bar{\mathcal{K}}}$ implies that the limit of the coefficient matrices of (18) is singular, which yields a contradiction. The proof is complete.

Lemma C.3. Let $\{(z^k, \lambda^k)\}$ and $\{(\Delta z^k, \hat{\lambda}^{k+1})\}$ be generated by Algorithm 2. If $\{z^k\}_{\mathcal{K}}$ is a convergent subsequence of $\{z^k\}$, then one has $\{\Delta z^k\}_{\mathcal{K}} \to 0$.

Proof. It follows from Lemma C.2 that the sequence $\{\Delta z^k\}_{\mathcal{K}}$ is bounded. Suppose on the contrary that there exists an infinite subset $\mathcal{K}' \subseteq \mathcal{K}$ such that $\{\Delta z^k\}_{\mathcal{K}'} \to \Delta z^* \ne 0$. Since the subsequences $\{z^k\}_{\mathcal{K}'}$, $\{\lambda^k\}_{\mathcal{K}'}$, and $\{\hat{\lambda}^k\}_{\mathcal{K}'}$ are all bounded, there are points $z^*$, $\lambda^*$, and $\hat{\lambda}^*$ as well as an infinite index set $\bar{\mathcal{K}} \subseteq \mathcal{K}'$ such that $\{z^k\}_{\bar{\mathcal{K}}} \to z^*$, $\{\lambda^k\}_{\bar{\mathcal{K}}} \to \lambda^*$, and $\{\hat{\lambda}^k\}_{\bar{\mathcal{K}}} \to \hat{\lambda}^*$. By Lemma C.1, we have $g_i(z^*) > 0$ for all $i \in \{1, \ldots, 3m + 2n\}$. Similar to the proof of Lemma 3, it is not difficult to get
\[
\nabla_w \Phi_\mu(z^*)^T \Delta w^* + \nabla_t \Phi_\mu(z^*)^T \Delta t^* + \nabla_\xi \Phi_\mu(z^*)^T \Delta \xi^* + \nabla_b \Phi_\mu(z^*)^T \Delta b^*
= -\bigl(\Delta w^{*T}, \Delta \xi^{*T}, \Delta t^{*T}, \Delta b^*\bigr) S \begin{pmatrix} \Delta w^* \\ \Delta \xi^* \\ \Delta t^* \\ \Delta b^* \end{pmatrix} < 0. \qquad (C.2)
\]
Since $g_i(z^*) > 0$ for all $i \in \{1, \ldots, 3m + 2n\}$, there exists an $\bar{\alpha} \in (0, 1]$ such that, for all $\alpha \in (0, \bar{\alpha}]$,
\[
g_i\bigl(z^* + \alpha \Delta z^*\bigr) > 0, \quad \forall i \in \{1, \ldots, 3m + 2n\}. \qquad (C.3)
\]
Taking into account $\tau_1 \in (0, 1/2)$, we claim that there exists an $\hat{\alpha} \in (0, \bar{\alpha}]$ such that the following inequality holds for all $\alpha \in (0, \hat{\alpha}]$:
\[
\Phi_\mu\bigl(z^* + \alpha \Delta z^*\bigr) - \Phi_\mu(z^*) \le 1.1\, \tau_1 \alpha \nabla_z \Phi_\mu(z^*)^T \Delta z^*. \qquad (C.4)
\]
Let $m^* = \min\{j \mid \beta^j \in (0, \hat{\alpha}],\ j = 0, 1, \ldots\}$ and $\alpha^* = \beta^{m^*}$. It follows from (C.3) that the following inequality is satisfied for all $k \in \bar{\mathcal{K}}$ sufficiently large and any $i \in \{1, \ldots, 3m + 2n\}$:
\[
g_i\bigl(z^k + \alpha^* \Delta z^k\bigr) > 0. \qquad (C.5)
\]
Moreover, we have
\[
\begin{aligned}
\Phi_\mu\bigl(z^k + \alpha^* \Delta z^k\bigr) - \Phi_\mu(z^k)
&= \Phi_\mu\bigl(z^* + \alpha^* \Delta z^*\bigr) - \Phi_\mu(z^*) + \Phi_\mu\bigl(z^k + \alpha^* \Delta z^k\bigr) - \Phi_\mu\bigl(z^* + \alpha^* \Delta z^*\bigr) + \Phi_\mu(z^*) - \Phi_\mu(z^k)\\
&\le 1.05\, \tau_1 \alpha^* \nabla_z \Phi_\mu(z^*)^T \Delta z^*\\
&\le \tau_1 \alpha^* \nabla_z \Phi_\mu(z^k)^T \Delta z^k.
\end{aligned} \qquad (C.6)
\]
By the backtracking line search rule, the last inequality together with (C.5) yields $\alpha_k \ge \alpha^*$ for all $k \in \bar{\mathcal{K}}$ large enough. Consequently, when $k \in \bar{\mathcal{K}}$ is sufficiently large, we have from (23) and (C.3) that
\[
\Phi_\mu\bigl(z^{k+1}\bigr) \le \Phi_\mu(z^k) + \tau_1 \alpha_k \nabla_z \Phi_\mu(z^k)^T \Delta z^k
\le \Phi_\mu(z^k) + \tau_1 \alpha^* \nabla_z \Phi_\mu(z^k)^T \Delta z^k
\le \Phi_\mu(z^k) + \frac{1}{2}\tau_1 \alpha^* \nabla_z \Phi_\mu(z^*)^T \Delta z^*. \qquad (C.7)
\]
This shows that $\{\Phi_\mu(z^k)\}_{\bar{\mathcal{K}}} \to -\infty$, contradicting the fact that $\{\Phi_\mu(z^k)\}_{\bar{\mathcal{K}}}$ is bounded from below. The proof is complete.

We now establish the convergence of Algorithms 2 and 4, that is, Theorems 5 and 6 in Section 3.

C.1. Proof of Theorem 5

Proof. Let $(z^*, \hat{\lambda}^*)$ be a limit point of $\{(z^k, \hat{\lambda}^k)\}$ and let the subsequence $\{(z^k, \hat{\lambda}^k)\}_{\mathcal{K}}$ converge to $(z^*, \hat{\lambda}^*)$. Recall that $(\Delta z^k, \hat{\lambda}^{k+1})$ is the solution of (18) with $(z, \lambda) = (z^k, \lambda^k)$. Taking limits on both sides of (18) with $(z, \lambda) = (z^k, \lambda^k)$ as $k \to \infty$ with $k \in \mathcal{K}$, by the use of Lemma C.3, we obtain
\[
\begin{gathered}
\hat{\lambda}^{*(1)} - \hat{\lambda}^{*(2)} - X^T Y^T \hat{\lambda}^{*(4)} = 0, \qquad C e_n - \hat{\lambda}^{*(4)} - \hat{\lambda}^{*(5)} = 0,\\
e_m - 2T^*\hat{\lambda}^{*(1)} - 2T^*\hat{\lambda}^{*(2)} - \hat{\lambda}^{*(3)} = 0, \qquad y^T \hat{\lambda}^{*(4)} = 0,\\
\bigl(T^{*2} - W^*\bigr)\hat{\lambda}^{*(1)} - \mu e_m = 0, \qquad \bigl(T^{*2} + W^*\bigr)\hat{\lambda}^{*(2)} - \mu e_m = 0,\\
T^*\hat{\lambda}^{*(3)} - \mu e_m = 0, \qquad P^*\hat{\lambda}^{*(4)} - \mu e_n = 0, \qquad \Xi^*\hat{\lambda}^{*(5)} - \mu e_n = 0.
\end{gathered} \qquad (C.8)
\]
This shows that $(z^*, \hat{\lambda}^*)$ satisfies the first-order optimality conditions (15).

C.2. Proof of Theorem 6

Proof. (i) Without loss of generality, we suppose that the bounded subsequence $\{(z^j, \lambda^j)\}_{\mathcal{J}}$ converges to some point $(z^*, \lambda^*)$. It is clear that $z^*$ is a feasible point of (10). Since $\mu_j, \epsilon_{\mu_j} \to 0$, it follows from (29) and (30) that
\[
\mathrm{Res}\bigl(z^j, \lambda^j, \mu_j\bigr) \overset{\mathcal{J}}{\longrightarrow} \mathrm{Res}\bigl(z^*, \lambda^*, 0\bigr) = 0, \qquad \{\lambda^j\}_{\mathcal{J}} \longrightarrow \lambda^* \ge 0. \qquad (C.9)
\]
Consequently, $(z^*, \lambda^*)$ satisfies the KKT conditions (13).

(ii) Let $\xi_j = \max\{\|\lambda^j\|_\infty, 1\}$ and $\bar{\lambda}^j = \xi_j^{-1}\lambda^j$. Obviously, $\{\bar{\lambda}^j\}$ is bounded. Hence there exists an infinite subset $\mathcal{J}' \subseteq \mathcal{J}$ such that $\{\bar{\lambda}^j\}_{\mathcal{J}'} \to \bar{\lambda} \ne 0$ and $\|\bar{\lambda}^j\|_\infty = 1$ for large $j \in \mathcal{J}'$. From (30) we know that $\bar{\lambda} \ge 0$. Dividing both sides of (18) with $(z, \lambda) = (z^j, \lambda^j)$ by $\xi_j$ and then taking limits as $j \to \infty$ with $j \in \mathcal{J}'$, we get
\[
\begin{gathered}
\bar{\lambda}^{(1)} - \bar{\lambda}^{(2)} - X^T Y^T \bar{\lambda}^{(4)} = 0, \qquad \bar{\lambda}^{(4)} + \bar{\lambda}^{(5)} = 0,\\
2T^*\bar{\lambda}^{(1)} + 2T^*\bar{\lambda}^{(2)} + \bar{\lambda}^{(3)} = 0, \qquad y^T \bar{\lambda}^{(4)} = 0,\\
\bigl(T^{*2} - W^*\bigr)\bar{\lambda}^{(1)} = 0, \qquad \bigl(T^{*2} + W^*\bigr)\bar{\lambda}^{(2)} = 0,\\
T^*\bar{\lambda}^{(3)} = 0, \qquad P^*\bar{\lambda}^{(4)} = 0, \qquad \Xi^*\bar{\lambda}^{(5)} = 0.
\end{gathered} \qquad (C.10)
\]
Since $z^*$ is feasible, the above equations show that $z^*$ is a Fritz-John point of problem (10).

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

The authors would like to acknowledge support for this project from the National Science Foundation of China (NSF Grants 11371154, 11071087, 11271069, and 61103202).

References

[1] Y. M. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization," in Proceedings of the 14th International Conference on Machine Learning (ICML '97), vol. 97, pp. 412–420, 1997.
[2] G. Forman, "An extensive empirical study of feature selection metrics for text classification," The Journal of Machine Learning Research, vol. 3, pp. 1289–1305, 2003.
[3] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1–3, pp. 389–422, 2002.
[4] H. H. Zhang, J. Ahn, X. Lin, and C. Park, "Gene selection using support vector machines with non-convex penalty," Bioinformatics, vol. 22, no. 1, pp. 88–95, 2006.
[5] Y. Saeys, I. Inza, and P. Larranaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507–2517, 2007.
[6] J. Weston, F. Perez-Cruz, O. Bousquet, O. Chapelle, A. Elisseeff, and B. Scholkopf, "Feature selection and transduction for prediction of molecular bioactivity for drug design," Bioinformatics, vol. 19, no. 6, pp. 764–771, 2003.
[7] Y. Liu, "A comparative study on feature selection methods for drug discovery," Journal of Chemical Information and Computer Sciences, vol. 44, no. 5, pp. 1823–1828, 2004.
[8] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, Eds., Feature Extraction: Foundations and Applications, vol. 207, Springer, 2006.
[9] A. Rakotomamonjy, "Variable selection using SVM-based criteria," Journal of Machine Learning Research, vol. 3, no. 7-8, pp. 1357–1370, 2003.
[10] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik, "Feature selection for SVMs," in Proceedings of the Conference on Neural Information Processing Systems, vol. 13, pp. 668–674, Denver, Colo, USA, 2000.
[11] D. Peleg and R. Meir, "A feature selection algorithm based on the global minimization of a generalization error bound," in Proceedings of the Conference on Neural Information Processing Systems, Vancouver, Canada, 2004.
[12] P. S. Bradley and O. L. Mangasarian, "Feature selection via concave minimization and support vector machines," in Proceedings of the International Conference on Machine Learning, pp. 82–90, San Francisco, Calif, USA, 1998.
[13] J. Weston, A. Elisseeff, B. Scholkopf, and M. Tipping, "Use of the zero-norm with linear models and kernel methods," Journal of Machine Learning Research, vol. 3, pp. 1439–1461, 2003.
[14] A. B. Chan, N. Vasconcelos, and G. R. G. Lanckriet, "Direct convex relaxations of sparse SVM," in Proceedings of the 24th International Conference on Machine Learning (ICML '07), pp. 145–153, Corvallis, Ore, USA, June 2007.
[15] G. M. Fung and O. L. Mangasarian, "A feature selection Newton method for support vector machine classification," Computational Optimization and Applications, vol. 28, no. 2, pp. 185–202, 2004.
[16] J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani, "1-norm support vector machines," Advances in Neural Information Processing Systems, vol. 16, no. 1, pp. 49–56, 2004.
[17] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society B: Methodological, vol. 58, no. 1, pp. 267–288, 1996.
[18] T. Zhang, "Analysis of multi-stage convex relaxation for sparse regularization," Journal of Machine Learning Research, vol. 11, pp. 1081–1107, 2010.
[19] R. Chartrand, "Exact reconstruction of sparse signals via nonconvex minimization," IEEE Signal Processing Letters, vol. 14, no. 10, pp. 707–710, 2007.
[20] R. Chartrand, "Nonconvex regularization for shape preservation," in Proceedings of the 14th IEEE International Conference on Image Processing (ICIP '07), vol. 1, pp. I293–I296, San Antonio, Tex, USA, September 2007.
[21] Z. Xu, H. Zhang, Y. Wang, X. Chang, and Y. Liang, "L1/2 regularization," Science China Information Sciences, vol. 53, no. 6, pp. 1159–1169, 2010.
[22] W. J. Chen and Y. J. Tian, "Lp-norm proximal support vector machine and its applications," Procedia Computer Science, vol. 1, no. 1, pp. 2417–2423, 2010.
[23] J. Liu, J. Li, W. Xu, and Y. Shi, "A weighted Lq adaptive least squares support vector machine classifiers—robust and sparse approximation," Expert Systems with Applications, vol. 38, no. 3, pp. 2253–2259, 2011.
[24] Y. Liu, H. H. Zhang, C. Park, and J. Ahn, "Support vector machines with adaptive Lq penalty," Computational Statistics & Data Analysis, vol. 51, no. 12, pp. 6380–6394, 2007.
[25] Z. Liu, S. Lin, and M. Tan, "Sparse support vector machines with Lp penalty for biomarker identification," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 1, pp. 100–107, 2010.
[26] A. Rakotomamonjy, R. Flamary, G. Gasso, and S. Canu, "lp-lq penalty for sparse linear and sparse multiple kernel multitask learning," IEEE Transactions on Neural Networks, vol. 22, no. 8, pp. 1307–1320, 2011.
[27] J. Y. Tan, Z. Zhang, L. Zhen, C. H. Zhang, and N. Y. Deng, "Adaptive feature selection via a new version of support vector machine," Neural Computing and Applications, vol. 23, no. 3-4, pp. 937–945, 2013.
[28] Z. B. Xu, X. Y. Chang, F. M. Xu, and H. Zhang, "L1/2 regularization: an iterative half thresholding algorithm," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 7, pp. 1013–1027, 2012.
[29] D. Ge, X. Jiang, and Y. Ye, "A note on the complexity of Lp minimization," Mathematical Programming, vol. 129, no. 2, pp. 285–299, 2011.
[30] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 1995.
[31] J. Fan and H. Peng, "Nonconcave penalized likelihood with a diverging number of parameters," The Annals of Statistics, vol. 32, no. 3, pp. 928–961, 2004.
[32] K. Knight and W. Fu, "Asymptotics for lasso-type estimators," The Annals of Statistics, vol. 28, no. 5, pp. 1356–1378, 2000.
[33] B. S. Tian and X. Q. Yang, "An interior-point l1/2-penalty method for nonlinear programming," Technical Report, Department of Applied Mathematics, Hong Kong Polytechnic University, 2013.
[34] L. Armijo, "Minimization of functions having Lipschitz continuous first partial derivatives," Pacific Journal of Mathematics, vol. 16, pp. 1–3, 1966.
[35] F. John, "Extremum problems with inequalities as side-conditions," in Studies and Essays: Courant Anniversary Volume, pp. 187–204, 1948.
[36] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz, "UCI repository of machine learning databases," Technical Report 9702, Department of Information and Computer Science, University of California, Irvine, Calif, USA, 1998, http://archive.ics.uci.edu/ml/.

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 6: Research Article An Interior Point Method for -SVM …downloads.hindawi.com/journals/jam/2014/942520.pdfResearch Article An Interior Point Method for 1/2-SVM and Application to Feature

6 Journal of Applied Mathematics

11987831

= minus2 (119880 minus 119881)119879 11987832

= 0

11987833

= 4119879 (119880 + 119881)119879 + 119879minus11198633minus 2 (119863

1+ 1198632) 119878

34= 0

11987841

= 119910119879119875minus11198634119884119883 119878

42= 119910119879119875minus11198634

11987843

= 0 11987844

= 119910119879119875minus11198634119910

(22)

and 119880 = (1198792minus 119882)minus11198631and 119881 = (119879

2+ 119882)minus11198632

33The Interior PointerAlgorithm Let 119911 = (119908 120585 119905 119887) and120582 =

(120582(1)

120582(2)

120582(3)

120582(4)

120582(5)

) we first present the interior pointeralgorithm to solve the barrier problem (14) and then discussthe details of the algorithm

Algorithm 2 The interior pointer algorithm (IPA) is asfollows

Step 0 Given tolerance 120598120583 set 120591

1isin (0 (12)) 119897 gt 0

1205741gt 1 120573 isin (0 1) Let 119896 = 0

Step 1 Stop if KKT condition (15) holds

Step 2 Compute Δ119911119896 from (20) and

119896+1 from (19)Compute 119911

119896+1= 119911119896+ 120572119896Δ119911119896 Update the Lagrangian

multipliers to obtain 120582119896+1

Step 3 Let 119896 = 119896 + 1 Go to Step 1

In Step 2 a step length 120572119896 is used to calculate 119911

119896+1We estimate 120572

119896 by Armijo line search [34] in which 120572119896

=

max 120573119895| 119895 = 0 1 2 for some 120573 isin (0 1) and satisfies the

following inequalities

(119905119896+1

)2

minus 119908119896+1

gt 0

(119905119896+1

)2

+ 119908119896+1

gt 0

119905119896+1

gt 0

Φ120583(119911119896+1

) minus Φ120583(119911119896)

le 1205911120572119896(nabla119908Φ

120583(119911119896)119879

Δ119908119896

+ nabla119905Φ120583(119911119896)119879

Δ119905119896

+nabla120585Φ120583(119911119896)119879

Δ120585119896+ nabla119887Φ120583(119911119896)119879

Δ119887119896)

(23)

where 1205911isin (0 12)

To avoid ill-conditioned growth of 119896 and guarantee thestrict dual feasibility the Lagrangian multipliers 120582

119896 should

be sufficiently positive and bounded from above Followinga similar idea of [33] we first update the dual multipliers by

(1)(119896+1)

119894

=

min119897120583

(119905119896

119894)2

minus 119908119896

119894

if (1)(119896+1)119894

lt min119897120583

(119905119896

119894)2

minus 119908119896

119894

(1)(119896+1)

119894 if min119897

120583

(119905119896

119894)2

minus x119896119894

le (1)(119896+1)

119894le

1205831205741

(119905119896

119894)2

minus 119908119896

119894

1205831205741

(119905119896

119894)2

minus 119908119896

119894

if (1)(119896+1)119894

gt1205831205741

(119905119896

119894)2

minus 119908119896

119894

(2)(119896+1)

119894

=

min119897120583

(119905119896

119894)2

+ 119908119896

119894

if (2)(119896+1)119894

lt min119897120583

(119905119896

119894)2

+ 119908119896

119894

(2)(119896+1)

119894 if min119897

120583

(119905119896

119894)2

+ 119908119896

119894

le (2)(119896+1)

119894le

1205831205741

(119905119896

119894)2

+ 119908119896

119894

1205831205741

(119905119896

119894)2

+ 119908119896

119894

if (2)(119896+1)119894

gt1205831205741

(119905119896

119894)2

+ 119908119896

119894

(3)(119896+1)

119894=

min119897120583

119905119896

119894

if (3)(119896+1)119894

lt min119897120583

119905119896

119894

(3)(119896+1)

119894 if min119897

120583

119905119896

119894

le (3)(119896+1)

119894le

1205831205741

119905119896

119894

1205831205741

119905119896

119894

if (3)(119896+1)119894

gt1205831205741

119905119896

119894

(4)(119896+1)

119894=

min119897120583

119901119896

119894

if (4)(119896+1)119894

lt min119897120583

119901119896

119894

(4)(119896+1)

119894 if min119897

120583

119901119896

119894

le (4)(119896+1)

119894le

1205831205741

119901119896

119894

1205831205741

119901119896

119894

if (4)(119896+1)119894

gt1205831205741

119901119896

119894

(24)

Journal of Applied Mathematics 7

where 119901119894= 119910119894(119908119879119909119894+ 119887) + 120585

119894minus 1

(5)(119896+1)

119894=

min119897120583

120585119896

119894

if (5)(119896+1)119894

lt min119897120583

120585119896

119894

(5)(119896+1)

119894 if min119897

120583

120585119896

119894

le (5)(119896+1)

119894le

1205831205741

120585119896

119894

1205831205741

120585119896

119894

if (5)(119896+1)119894

gt1205831205741

120585119896

119894

(25)

where the parameters 119897 and 1205741satisfy 0 lt 119897 120574

1gt 1

Since positive definiteness of the matrix 119878 is demanded inthis method the Lagrangian multipliers

119896+1 should satisfythe following condition

120582(3)

minus 2 (120582(1)

+ 120582(2)

) 119905 ge 0 (26)

For the sake of simplicity the proof is given in Appendix BTherefore if 119896+1 satisfies (26) we let 120582119896+1 =

119896+1 Other-wise we would further update it by the following setting

120582(1)(119896+1)

= 1205742(1)(119896+1)

120582(2)(119896+1)

= 1205742(2)(119896+1)

120582(3)(119896+1)

= 1205743(3)(119896+1)

(27)

where constants 1205742isin (0 1) and 120574

3ge 1 satisfy

1205743

1205742

= max119894isin119864

2119905119896+1

119894((1)(119896+1)

119894+ (2)(119896+1)

119894)

(3)(119896+1)

119894

(28)

with 119864 = 1 2 119899 It is not difficult to see that the vector((1)(119896+1)

(2)(119896+1)

(3)(119896+1)

) determined by (27) satisfies (26)In practice the KKT conditions (15) are allowed to be

satisfied within a tolerance 120598120583 It turns to be that the iterative

process stops while the following inequalities meet

Res (119911 120582 120583) =

10038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817

120582(1)

minus 120582(2)

minus 119883119879119884119879120582(4)

119862 lowast 119890119899minus 120582(4)

minus 120582(5)

119890119898minus 2119879120582

(1)minus 2119879120582

(2)minus 120582(3)

119890119879

119899119884120582(4)

(1198792minus 119882)120582

(1)minus 120583119890119898

(1198792+ 119882)120582

(2)minus 120583119890119898

119879120582(3)

minus 120583119890119898

119875120582(4)

minus 120583119890119899

Ξ120582(5)

minus 120583119890119899

10038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817

lt 120598120583 (29)

120582 ge minus120598120583119890 (30)

where 120598120583is related to the current barrier parameter 120583 and

satisfies 120598120583darr 0 as 120583 rarr 0

Since functionΦ120583in (14) is convex we have the following

lemma which shows that Algorithm 2 is well defined (theproof is given in Appendix B)

Lemma 3 Let 119911119896 be strictly feasible for problem (10) If Δ119911119896=

0 then (23) is satisfied for all 120572119896

ge 0 If Δ119911119896

= 0 then thereexists a

119896isin (0 1] such that (23) holds for all 120572

119896isin (0

119896]

The proposed interior point method successively solvesthe barrier subproblem (14) with a decreasing sequence 120583

119896

We simply reduce both 120598120583and 120583 by a constant factor 120573 isin

(0 1) Finally we test optimality for problem (10) by meansof the residual norm Res(119911 0)

Here we present the whole algorithm to solve the 11987112

-SVM problem(10)

Algorithm 4 Algorithm for solving 11987112

-SVM problem is asfollows

Step 0 Set 1199080 isin 119877119898 1198870isin 119877 120585

0isin 119877119899 1205820 =

0gt 0

1199050

119894ge radic|119908

0

119894| + (12) 119894 isin 1 2 119898 Given constants

1205830gt 0 1205981205830

gt 0 120573 isin (0 1) and 120598 gt 0 let 119895 = 0

Step 1 Stop if Res(119911119895 119895 0) le 120598 and 119895ge 0

Step 2 Starting from (119911119895 120582119895) apply Algorithm 2 to

solve (14) with barrier parameter 120583119895and stopping

tolerance 120598120583119895 Set 119911119895+1 = 119911

119895119896 119895+1 = 119895119896 and 120582

119895+1=

120582119895119896

Step 3 Set 120583119895+1

= 120573120583119895and 120598120583119895+1

= 120573120598120583119895 Let 119895 = 119895 + 1

and go to Step 1

In Algorithm 4 the index 119895 denotes an outer iterationwhile 119896 denotes the last inner iteration of Algorithm 2

The convergence of the proposed interior point methodcan be proved We list the theorem here and give the proof inAppendix C

Theorem 5 Let (119911119896 120582119896) be generated by Algorithm 2 Thenany limit point of the sequence (119911

119896 119896) generated by

Algorithm 2 satisfies the first-order optimality conditions (15)

Theorem 6 Let (119911119895 119895) be generated by Algorithm 4 by

ignoring its termination condition Then the following state-ments are true

(i) The limit point of (119911119895 119895) satisfies the first order opti-mality condition (13)

(ii) The limit point 119911lowast of the convergent subsequence

119911119895J sube 119911

119895 with unbounded multipliers

119895J is a

Fritz-John point [35] of problem (10)

4 Experiments

In this section we tested the constrained optimization refor-mulation to the 119871

12-SVM and the proposed interior point

method We compared the performance of the 11987112

-SVMwith 119871

2-SVM [30] 119871

1-SVM [12] and 119871

0-SVM [12] on

artificial data and ten UCI data sets (httparchiveicsucieduml) These four problems were solved in primal refer-encing the machine learning toolbox Spider (httppeoplekybtuebingenmpgdespider) The 119871

2-SVM and 119871

1-SVM

were solved directly by quadratic programming and linearprogramming respectively To the NP-hard problem 119871

0-

SVM a commonly cited approximation method FeatureSelection Concave (FSV) [12] was applied and then the FSV

8 Journal of Applied Mathematics

problem was solved by a Successive Linear Approximation(SLA) algorithm All the experiments were run in thepersonal computer (16 GHz of CPU 4GB of RAM) withMATLAB R2010b on 64 bit Windows 7

In the proposed interior point method (Algorithms 2 and 4), we set the parameters of Algorithm 2 as β = 0.6, τ_1 = 0.5, l = 0.00001, and γ_1 = 100000, and the reduction factor in Algorithm 4 as β = 0.5. The balance parameter C was selected by 5-fold cross-validation on the training set over the range {2^-10, 2^-9, ..., 2^10}. After training, the weights that did not satisfy the criterion |w_j| / max_i(|w_i|) ≥ 10^-4 [14] were set to zero. The cardinality of the hyperplane was then computed as the number of nonzero weights.
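The model-selection and pruning step just described can be sketched as follows. This is a minimal illustration, assuming a generic train_svm(X, y, C) routine that returns a weight vector and bias; that routine and this exact code are not part of the paper.

```python
import numpy as np

def select_C_and_prune(X, y, train_svm, n_folds=5):
    """Pick C by 5-fold cross-validation over 2^-10..2^10, then zero out
    weights with |w_j| / max_i |w_i| < 1e-4 and report the cardinality."""
    C_grid = [2.0 ** k for k in range(-10, 11)]
    folds = np.array_split(np.random.permutation(len(y)), n_folds)

    def cv_accuracy(C):
        acc = []
        for f in range(n_folds):
            test_idx = folds[f]
            train_idx = np.hstack([folds[g] for g in range(n_folds) if g != f])
            w, b = train_svm(X[train_idx], y[train_idx], C)
            pred = np.sign(X[test_idx] @ w + b)
            acc.append(np.mean(pred == y[test_idx]))
        return np.mean(acc)

    C_best = max(C_grid, key=cv_accuracy)
    w, b = train_svm(X, y, C_best)
    w[np.abs(w) / np.max(np.abs(w)) < 1e-4] = 0.0   # pruning rule from [14]
    return C_best, w, b, int(np.count_nonzero(w))    # cardinality of the hyperplane
```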

4.1. Artificial Data. First, we took an artificial binary linear classification problem as an example. The problem is similar to that in [13]. The probabilities of y = 1 and y = −1 are equal. The first 6 features are relevant but redundant. In 70% of the samples, the first three features x_1, x_2, x_3 were drawn as x_i = y·N(i, 1) and the second three features x_4, x_5, x_6 as x_i = N(0, 1). Otherwise, the first three were drawn as x_i = N(0, 1) and the second three as x_i = y·N(i − 3, 1). The remaining features are noise, x_i = N(0, 20), i = 7, ..., m, where m is the dimension of the input features. The inputs were scaled to zero mean and unit standard deviation. In each trial, 500 points were generated for testing, and the average results were estimated over 30 trials.
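A sketch of this generator, as we read the description above, is given below. It is illustrative only: in particular, whether the second argument of N(·, ·) is a standard deviation or a variance is not stated, and the code below treats it as a standard deviation.

```python
import numpy as np

def make_artificial_data(n, m, rng=np.random.default_rng(0)):
    """Synthetic problem of Section 4.1: 6 relevant but redundant features,
    m - 6 noise features, labels y in {-1, +1}."""
    y = rng.choice([-1.0, 1.0], size=n)            # P(y = 1) = P(y = -1) = 1/2
    X = np.empty((n, m))
    mode1 = rng.random(n) < 0.7                    # 70% of the samples
    for s in range(n):
        if mode1[s]:
            X[s, 0:3] = y[s] * rng.normal(loc=[1, 2, 3], scale=1.0)   # x_i = y N(i, 1)
            X[s, 3:6] = rng.normal(0.0, 1.0, size=3)                  # x_i = N(0, 1)
        else:
            X[s, 0:3] = rng.normal(0.0, 1.0, size=3)
            X[s, 3:6] = y[s] * rng.normal(loc=[1, 2, 3], scale=1.0)   # x_i = y N(i-3, 1)
    X[:, 6:] = rng.normal(0.0, 20.0, size=(n, m - 6))                 # noise features
    X = (X - X.mean(axis=0)) / X.std(axis=0)       # zero mean, unit standard deviation
    return X, y
```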

In the first experiment, we consider the cases with the fixed feature size m = 30 and different training sample sizes n = 10, 20, ..., 100. The average results over the 30 trials are shown in Table 1 and Figure 2. Figure 2 (left) plots the average cardinality of each classifier. Since the artificial data sets have 2 relevant and nonredundant features, the ideal average cardinality is 2. Figure 2 (left) shows that the three sparse SVMs (L1-SVM, L1/2-SVM, and L0-SVM) can achieve sparse solutions, while the L2-SVM uses almost all features on each data set. Furthermore, the solutions of the L1/2-SVM and L0-SVM are much sparser than those of the L1-SVM. As shown in Table 1, the L1-SVM selects more than 6 features in all cases, which implies that some redundant or irrelevant features are selected. The average cardinalities of the L1/2-SVM and L0-SVM are similar and close to 2. However, when n = 10 and 20, the L0-SVM has average cardinalities of 1.42 and 1.87, respectively. This means that the L0-SVM sometimes selects only one feature on a low-sample data set and may ignore some genuinely relevant feature. Consequently, with cardinalities between 2.05 and 2.9, the L1/2-SVM gives a more reliable solution than the L0-SVM. In short, as far as the number of selected features is concerned, the L1/2-SVM behaves better than the other three methods.

Figure 2 (right) plots the trend of the prediction accuracy versus the size of the training sample. The classification performance of all methods generally improves as the training sample size n increases. The L1-SVM has the best prediction performance in all cases, slightly better than the L1/2-SVM. The L1/2-SVM is more accurate in classification than the L2-SVM and L0-SVM, especially for n = 10–50. As shown in Table 1, when there are only 10 training samples, the average accuracy of the L1/2-SVM is 88.05%, while the results of the L2-SVM and L0-SVM are 84.65% and 77.65%, respectively. Compared with the L2-SVM and L0-SVM, the L1/2-SVM increases the average accuracy by 3.4% and 10.4%, respectively, which can be explained as follows. For the L2-SVM, all features are selected without discrimination, and the prediction can be misled by the irrelevant features. For the L0-SVM, few features are selected and some relevant features are not included, which has a negative impact on the prediction result. As a tradeoff between the L2-SVM and the L0-SVM, the L1/2-SVM performs better than both.

The average results over the ten artificial data sets in the first experiment are shown at the bottom of Table 1. On average, the accuracy of the L1/2-SVM is 0.87% lower than that of the L1-SVM, while the L1/2-SVM selects 74.14% fewer features than the L1-SVM. This indicates that the L1/2-SVM can achieve a much sparser solution than the L1-SVM at little cost in accuracy. Moreover, the average accuracy of the L1/2-SVM over the 10 data sets is 2.22% higher than that of the L0-SVM, with a similar cardinality. To sum up, the L1/2-SVM provides the best balance between accuracy and sparsity among the three sparse SVMs.

To further evaluate the feature selection performance of the L1/2-SVM, we investigate whether the features are correctly selected. Since the L2-SVM is not designed for feature selection, it is not included in this comparison. Since our artificial data sets have 2 best features (x_3, x_6), the best result should rank these two features on top according to the absolute values of their weights |w_j|. In the experiment, we select the top 2 features with the maximal |w_j| for each method and calculate the frequency with which the top 2 features are x_3 and x_6 over 30 runs. The results are listed in Table 2. When the training sample size is too small, it is difficult for all sparse SVMs to discriminate the two most important features. For example, when n = 10, the selected frequencies of the L1-SVM, L0-SVM, and L1/2-SVM are 7, 3, and 9, respectively. As n increases, all methods tend to make more correct selections. Moreover, Table 2 shows that the L1/2-SVM outperforms the other two methods in all cases. For example, when n = 100, the selected frequencies of the L1-SVM and L0-SVM are 22 and 25, respectively, while the result of the L1/2-SVM is 27. The L1-SVM selects too many redundant or irrelevant features, which may influence the ranking to some extent. Therefore, the L1-SVM is not as good as the L1/2-SVM at distinguishing the critical features. The L0-SVM has a lower hit frequency than the L1/2-SVM, which is probably due to the excessively small feature subset it obtains. Above all, Tables 1 and 2 and Figure 2 clearly show that the L1/2-SVM is a promising sparsity-driven classification method.
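A minimal sketch of this hit-frequency evaluation is given below, assuming a list of trained weight vectors, one per run; feature indices are 0-based, so x_3 and x_6 correspond to columns 2 and 5. The exact bookkeeping used by the authors is not specified in the paper.

```python
import numpy as np

def top2_hit_frequency(weight_vectors, relevant=(2, 5)):
    """Count in how many runs the two largest-|w| features are exactly
    the two truly relevant ones (x_3 and x_6 of the artificial data)."""
    hits = 0
    for w in weight_vectors:
        top2 = set(np.argsort(np.abs(w))[-2:])
        hits += (top2 == set(relevant))
    return hits
```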

In the second simulation, we consider cases with various dimensions of the feature space, m = 20, 40, ..., 180, 200, and a fixed training sample size n = 100. The average results over 30 trials are shown in Figure 3 and Table 3. Since there are only 6 relevant features, a larger m means more noisy features. Figure 3 (left) shows that, as the dimension increases from 20 to 200, the number of features selected by the L1-SVM increases from 8.26 to 23.1. However, the cardinalities of the L1/2-SVM and L0-SVM remain stable (from 2.2 to 2.95). This indicates that the L1/2-SVM and L0-SVM are more suitable for feature selection than the L1-SVM.

Figure 3 (right) shows that, with the increase of the noise features, the accuracy of the L2-SVM drops significantly (from 98.68% to 87.47%).


[Figure 2: Results comparison on 10 artificial data sets with various training sample sizes. Left: cardinality versus the number of training samples; right: accuracy (%) versus the number of training samples. Curves: L2-SVM, L1-SVM, L0-SVM, L1/2-SVM.]

Table 1: Results comparison on 10 artificial data sets with various training sample sizes. "n" is the number of training samples, "Card" is the number of selected features, and "Acc" is the classification accuracy (%).

n          | L2-SVM         | L1-SVM         | L0-SVM         | L1/2-SVM
           | Card    Acc    | Card    Acc    | Card    Acc    | Card    Acc
10         | 29.97   84.65  | 6.93    88.55  | 1.43    77.65  | 2.25    88.05
20         | 30.00   92.21  | 8.83    96.79  | 1.87    93.17  | 2.35    95.50
30         | 30.00   94.27  | 8.47    98.23  | 2.03    94.67  | 2.05    97.41
40         | 29.97   95.99  | 9.00    98.75  | 2.07    95.39  | 2.10    97.75
50         | 30.00   96.61  | 9.50    98.91  | 2.13    96.57  | 2.45    97.68
60         | 29.97   97.25  | 10.00   98.94  | 2.20    97.61  | 2.45    97.98
70         | 30.00   97.41  | 11.00   98.98  | 2.20    97.56  | 2.40    98.23
80         | 30.00   97.68  | 10.03   99.09  | 2.23    97.61  | 2.60    98.50
90         | 29.97   97.89  | 9.20    99.21  | 2.30    97.41  | 2.90    98.30
100        | 30.00   98.13  | 10.27   99.09  | 2.50    97.93  | 2.55    98.36
On average | 29.99   95.21  | 9.32    97.65  | 2.10    94.56  | 2.41    96.78

Table 2: Frequency with which the most important features (x_3, x_6) are ranked in the top 2 over 30 runs.

n        | 10  20  30  40  50  60  70  80  90  100
L1-SVM   |  7  14  16  18  22  23  20  22  21   22
L0-SVM   |  3  11  12  15  19  22  22  23  22   25
L1/2-SVM |  9  16  19  24  24  23  27  26  26   27

Table 3: Average results over 10 artificial data sets with various dimensions.

            | L2-SVM | L1-SVM | L0-SVM | L1/2-SVM
Cardinality | 109.95 | 15.85  | 2.38   | 2.41
Accuracy    | 92.93  | 99.07  | 97.76  | 98.26


Table 4: Feature selection performance comparison on UCI data sets (cardinality).

No. | UCI data set  | n    | m   | L2-SVM         | L1-SVM         | L0-SVM         | L1/2-SVM
    |               |      |     | Mean     Std   | Mean    Std    | Mean    Std    | Mean    Std
1   | Pima diabetes | 768  | 8   | 8.00     0.00  | 7.40    0.89   | 6.20    1.10   | 7.60    1.52
2   | Breast cancer | 683  | 9   | 9.00     0.00  | 8.60    1.22   | 6.40    2.07   | 8.40    1.67
3   | Wine (3)      | 178  | 13  | 13.00    0.00  | 9.73    0.69   | 3.73    0.28   | 4.07    0.49
4   | Image (7)     | 2100 | 19  | 19.00    0.00  | 8.20    0.53   | 3.66    0.97   | 5.11    0.62
5   | SPECT         | 267  | 22  | 22.00    0.00  | 17.40   5.27   | 9.80    4.27   | 9.00    1.17
6   | WDBC          | 569  | 30  | 30.00    0.00  | 9.80    1.52   | 9.00    3.03   | 5.60    2.05
7   | Ionosphere    | 351  | 34  | 33.20    0.45  | 25.00   3.35   | 24.60   10.21  | 21.80   8.44
8   | SPECTF        | 267  | 44  | 44.00    0.00  | 38.80   10.98  | 24.00   11.32  | 25.60   7.75
9   | Sonar         | 208  | 60  | 60.00    0.00  | 31.40   13.05  | 27.00   7.18   | 13.60   10.06
10  | Musk1         | 476  | 166 | 166.00   0.00  | 85.80   18.85  | 48.60   4.28   | 41.00   14.16

Average cardinality: 40.50 (m), 40.42 (L2-SVM), 24.21 (L1-SVM), 16.30 (L0-SVM), 14.18 (L1/2-SVM).

Table 5: Classification performance comparison on UCI data sets (accuracy, %).

No. | UCI data set  | L2-SVM        | L1-SVM        | L0-SVM        | L1/2-SVM
    |               | Mean    Std   | Mean   Std    | Mean   Std    | Mean   Std
1   | Pima diabetes | 76.99   3.47  | 76.73  0.85   | 76.73  2.94   | 77.12  1.52
2   | Breast cancer | 96.18   2.23  | 97.06  0.96   | 96.32  1.38   | 96.47  1.59
3   | Wine (3)      | 97.71   3.73  | 98.29  1.28   | 93.14  2.39   | 97.14  2.56
4   | Image (7)     | 88.94   1.49  | 91.27  1.77   | 86.67  4.58   | 91.36  1.94
5   | SPECT         | 83.40   2.15  | 78.49  2.31   | 78.11  3.38   | 82.34  5.10
6   | WDBC          | 96.11   2.11  | 96.46  1.31   | 95.58  1.15   | 96.46  1.31
7   | Ionosphere    | 83.43   5.44  | 88.00  3.70   | 84.57  5.38   | 87.43  5.94
8   | SPECTF        | 78.49   3.91  | 74.34  4.50   | 73.87  3.25   | 78.49  1.89
9   | Sonar         | 73.66   5.29  | 75.61  5.82   | 74.15  7.98   | 76.10  4.35
10  | Musk1         | 84.84   1.73  | 82.11  2.42   | 76.00  5.64   | 82.32  3.38

Average accuracy: 85.97 (L2-SVM), 85.84 (L1-SVM), 83.51 (L0-SVM), 86.52 (L1/2-SVM).

In contrast, for the other three sparse SVMs there is little change in accuracy. This reveals that SVMs can benefit from the feature reduction.

Table 3 shows the average results over all data sets in the second experiment. On average, the solution of the L1/2-SVM is much sparser than that of the L1-SVM, with a slightly better accuracy than the L0-SVM.

4.2. UCI Data Sets. We further tested the reformulation and the proposed interior point method for the L1/2-SVM on 10 UCI data sets [36]. There are 8 binary classification problems and 2 multiclass problems (Wine, Image). Each feature of the input data was normalized to zero mean and unit variance, and the instances with missing values were deleted. Then the data was randomly split into a training set (80%) and a testing set (20%). For the two multiclass problems, a one-against-rest method was applied to construct a binary classifier for each class. We repeated the training and testing procedure 10 times, and the average results are shown in Tables 4 and 5 and Figure 4.
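The preprocessing and the one-against-rest construction can be sketched as follows; this is only an outline, and train_binary_svm stands for any of the four trained models and is assumed rather than defined in the paper.

```python
import numpy as np

def preprocess_and_split(X, y, test_fraction=0.2, rng=np.random.default_rng(0)):
    """Drop rows with missing values, standardize features, 80/20 split."""
    keep = ~np.isnan(X).any(axis=1)
    X, y = X[keep], y[keep]
    X = (X - X.mean(axis=0)) / X.std(axis=0)      # zero mean, unit variance
    idx = rng.permutation(len(y))
    n_test = int(test_fraction * len(y))
    test, train = idx[:n_test], idx[n_test:]
    return X[train], y[train], X[test], y[test]

def one_against_rest(X, y, classes, train_binary_svm, C):
    """Train one binary classifier per class; predict by the largest score."""
    models = [train_binary_svm(X, np.where(y == c, 1.0, -1.0), C) for c in classes]
    def predict(X_new):
        scores = np.column_stack([X_new @ w + b for (w, b) in models])
        return np.asarray(classes)[np.argmax(scores, axis=1)]
    return predict
```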

Tables 4 and 5 summarize the feature selection and classification performance of the numerical experiments on the UCI data sets, respectively. Here n is the number of samples and m is the number of input features. For the two multiclass data sets, the numbers of classes are marked after their names. Sparsity is defined as card/m, and a small value of sparsity is preferred. The data sets are arranged in ascending order of dimension. The lowest cardinality and the best accuracy for each problem are marked in bold in the original tables.

As shown in Tables 4 and 5, the three sparse SVMs encourage sparsity on all data sets while maintaining roughly the same accuracy as the L2-SVM. Among the three sparse methods, the L1/2-SVM has the lowest cardinality (14.18) and the highest classification accuracy (86.52%) on average, while the L1-SVM has the worst feature selection performance, with the highest average cardinality (24.21), and the L0-SVM has the lowest average classification accuracy (83.51%).

Figure 4 plots the sparsity (left) and classification accuracy (right) of each classifier on the UCI data sets. In three data sets (6, 9, 10), the L1/2-SVM has the best performance among the three sparse SVMs in both feature selection and classification. Compared with the L1-SVM, the L1/2-SVM achieves a sparser solution on nine data sets. For example, on the data set "8 SPECTF", the features selected by the L1/2-SVM are 34% fewer than those of the L1-SVM.


[Figure 3: Results comparison on the artificial data sets with various dimensions. Left: cardinality versus dimension; right: accuracy (%) versus dimension. Curves: L2-SVM, L1-SVM, L0-SVM, L1/2-SVM.]

At the same time, the accuracy of the L1/2-SVM is 4.1% higher than that of the L1-SVM. In most data sets (4, 5, 6, 9, 10), the cardinality of the L1/2-SVM drops significantly (by at least 37%) with an equal or slightly better accuracy. Only on the data set "3 Wine" is the accuracy of the L1/2-SVM decreased, by 1.1%, but the sparsity provided by the L1/2-SVM leads to a 58.2% improvement over the L1-SVM. On the remaining three data sets (1, 2, 7), the two methods have similar results in feature selection and classification. As seen above, the L1/2-SVM can provide a lower-dimensional representation than the L1-SVM with competitive prediction performance.

Figure 4 (right) shows that, compared with the L0-SVM, the L1/2-SVM improves the classification accuracy on all data sets. For instance, on four data sets (3, 4, 8, 10), the classification accuracy of the L1/2-SVM is at least 4.0% higher than that of the L0-SVM. Especially on the data set "10 Musk1", the L1/2-SVM gives a 6.3% rise in accuracy over the L0-SVM. Meanwhile, it can be observed from Figure 4 (left) that the L1/2-SVM selects fewer features than the L0-SVM on five data sets (5, 6, 7, 9, 10). For example, on data sets 6 and 9, the cardinalities of the L1/2-SVM are 37.8% and 49.6% less than those of the L0-SVM, respectively. In summary, the L1/2-SVM presents better classification performance than the L0-SVM while being effective in choosing relevant features.

5. Conclusions

In this paper, we proposed an L1/2 regularization technique for simultaneous feature selection and classification in the SVM. We reformulated the L1/2-SVM into an equivalent smooth constrained optimization problem. The problem possesses a very simple structure, for which it is relatively easy to develop numerical methods. By the use of this reformulation, we proposed an interior point method and established its convergence. Our numerical results support the reformulation and the proposed method. The L1/2-SVM obtains sparser solutions than the L1-SVM with comparable classification accuracy. Furthermore, the L1/2-SVM achieves more accurate classification results than the L0-SVM (FSV).

Inspired by the good performance of the smooth optimization reformulation of the L1/2-SVM, some interesting topics deserve further research. For example, developing more efficient algorithms for solving the reformulation, studying the nonlinear L1/2-SVM, and exploring various applications of the L1/2-SVM to further validate its effectiveness are interesting research topics. Some of them are under our current investigation.

Appendices

A. Properties of the Reformulation of the L1/2-SVM

Let D be the set of all feasible points of problem (10). For any z ∈ D, we let FD(z, D) and LFD(z, D) be the sets of all feasible directions and linearized feasible directions of D at z, respectively. Since the constraint functions of (10) are all convex, we immediately have the following lemma.

Lemma A.1. For any feasible point z of (10), one has

  FD(z, D) = LFD(z, D).   (A.1)


[Figure 4: Results comparison on the UCI data sets. Left: sparsity (%) versus data set number; right: accuracy (%) versus data set number. Curves: L2-SVM, L1-SVM, L0-SVM, L1/2-SVM.]

Based on the above lemma, we can easily derive a first-order necessary condition for (10). The Lagrangian function of (10) is

  L(w, t, ξ, b, λ, μ, ν, p, q) = C Σ_{i=1}^n ξ_i + Σ_{j=1}^m t_j − Σ_{j=1}^m λ_j (t_j^2 − w_j) − Σ_{j=1}^m μ_j (t_j^2 + w_j)
      − Σ_{j=1}^m ν_j t_j − Σ_{i=1}^n p_i (y_i w^T x_i + y_i b + ξ_i − 1) − Σ_{i=1}^n q_i ξ_i,   (A.2)

where λ, μ, ν, p, q are the Lagrangian multipliers. By the use of Lemma A.1, we immediately have the following theorem about the first-order necessary condition.

Theorem A.2. Let (w, t, ξ, b) be a local solution of (10). Then there are Lagrangian multipliers (λ, μ, ν, p, q) ∈ R^{m+m+m+n+n} such that the following KKT conditions hold:

  ∂L/∂w = λ − μ − Σ_{i=1}^n p_i y_i x_i = 0,
  ∂L/∂t_j = 1 − 2λ_j t_j − 2μ_j t_j − ν_j = 0,  j = 1, 2, ..., m,
  ∂L/∂ξ_i = C − p_i − q_i = 0,  i = 1, 2, ..., n,
  ∂L/∂b = Σ_{i=1}^n p_i y_i = 0,
  λ_j ≥ 0,  t_j^2 − w_j ≥ 0,  λ_j (t_j^2 − w_j) = 0,  j = 1, 2, ..., m,
  μ_j ≥ 0,  t_j^2 + w_j ≥ 0,  μ_j (t_j^2 + w_j) = 0,  j = 1, 2, ..., m,
  ν_j ≥ 0,  t_j ≥ 0,  ν_j t_j = 0,  j = 1, 2, ..., m,
  p_i ≥ 0,  y_i (w^T x_i + b) + ξ_i − 1 ≥ 0,  p_i (y_i (w^T x_i + b) + ξ_i − 1) = 0,  i = 1, 2, ..., n,
  q_i ≥ 0,  ξ_i ≥ 0,  q_i ξ_i = 0,  i = 1, 2, ..., n.   (A.3)
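As a numerical illustration of (A.3), the following sketch evaluates the maximal violation of these conditions at a candidate primal-dual point. It is a direct transcription of the conditions above and assumes nothing beyond them; it is not part of the paper's algorithm.

```python
import numpy as np

def kkt_violation(w, t, xi, b, lam, mu, nu, p, q, X, y, C):
    """Max violation of the KKT system (A.3) at a candidate point."""
    stat_w  = lam - mu - X.T @ (p * y)            # dL/dw
    stat_t  = 1.0 - 2.0 * (lam + mu) * t - nu     # dL/dt
    stat_xi = C - p - q                           # dL/dxi
    stat_b  = np.dot(p, y)                        # dL/db
    margin  = y * (X @ w + b) + xi - 1.0
    comp = np.concatenate([lam * (t**2 - w), mu * (t**2 + w),
                           nu * t, p * margin, q * xi])
    feas = np.concatenate([np.minimum(t**2 - w, 0), np.minimum(t**2 + w, 0),
                           np.minimum(t, 0), np.minimum(margin, 0), np.minimum(xi, 0),
                           np.minimum(lam, 0), np.minimum(mu, 0), np.minimum(nu, 0),
                           np.minimum(p, 0), np.minimum(q, 0)])
    return max(np.max(np.abs(stat_w)), np.max(np.abs(stat_t)),
               np.max(np.abs(stat_xi)), abs(stat_b),
               np.max(np.abs(comp)), np.max(np.abs(feas)))
```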

The following theorem shows that the level set is bounded.

Theorem A.3. For any given constant c > 0, the level set

  W_c = {z = (w, t, ξ, b) | f(z) ≤ c} ∩ D   (A.4)

is bounded.

Proof. For any (w, t, ξ, b) ∈ W_c, we have

  Σ_{i=1}^m t_i + C Σ_{j=1}^n ξ_j ≤ c.   (A.5)

Combining this with t_i ≥ 0, i = 1, ..., m, and ξ_j ≥ 0, j = 1, ..., n, we have

  0 ≤ t_i ≤ c,  i = 1, ..., m,
  0 ≤ ξ_j ≤ c/C,  j = 1, ..., n.   (A.6)

Moreover, for any (w, t, ξ) ∈ D,

  |w_i| ≤ t_i^2,  i = 1, ..., m.   (A.7)

Consequently, w, t, ξ are bounded. What is more, from the condition

  y_i (w^T x_i + b) ≥ 1 − ξ_i,  i = 1, ..., n,   (A.8)

we have

  w^T x_i + b ≥ 1 − ξ_i  if y_i = 1,
  w^T x_i + b ≤ −1 + ξ_i  if y_i = −1.   (A.9)

Thus, if the feasible region D is not empty, then

  max_{y_i = 1} (1 − ξ_i − w^T x_i) ≤ b ≤ min_{y_i = −1} (−1 + ξ_i − w^T x_i).   (A.10)

Hence b is also bounded. The proof is complete.

B. Proof of Lemma 3

Lemma 3 (see Section 3) shows that Algorithm 2 is well defined. We first introduce the following proposition, which shows that the matrix S defined by (21) is always positive definite; this ensures that Algorithm 2 is well defined.

Proposition B.1. Let inequality (26) hold. Then the matrix S defined by (21) is positive definite.

Proof. By an elementary deduction, we have, for any v^(1) ∈ R^m, v^(2) ∈ R^n, v^(3) ∈ R^m, v^(4) ∈ R with (v^(1), v^(2), v^(3), v^(4)) ≠ 0,

  [v^(1); v^(2); v^(3); v^(4)]^T S [v^(1); v^(2); v^(3); v^(4)]
    = Σ_{r=1}^{4} Σ_{s=1}^{4} (v^(r))^T S_{rs} v^(s)
    = (v^(1))^T (U + V + X^T Y^T P^{-1} D_4 Y X) v^(1) + (v^(1))^T (X^T Y^T P^{-1} D_4) v^(2)
      + (v^(1))^T (−2(U − V)T) v^(3) + (v^(1))^T (X^T Y^T P^{-1} D_4 y) v^(4)
      + (v^(2))^T P^{-1} D_4 Y X v^(1) + (v^(2))^T (P^{-1} D_4 + Ξ^{-1} D_5) v^(2) + (v^(2))^T P^{-1} D_4 y v^(4)
      − 2 (v^(3))^T (U − V) T v^(1) + (v^(3))^T (4T(U + V)T + T^{-1} D_3 − 2(D_1 + D_2)) v^(3)
      + (v^(4))^T y^T P^{-1} D_4 Y X v^(1) + (v^(4))^T y^T P^{-1} D_4 v^(2) + (v^(4))^T y^T P^{-1} D_4 y v^(4)
    = (v^(1))^T U v^(1) − 4 (v^(3))^T U T v^(1) + 4 (v^(3))^T T U T v^(3)
      + (v^(1))^T V v^(1) + 4 (v^(3))^T V T v^(1) + 4 (v^(3))^T T V T v^(3)
      + (v^(2))^T Ξ^{-1} D_5 v^(2) + (v^(3))^T (T^{-1} D_3 − 2D_1 − 2D_2) v^(3)
      + (Y X v^(1) + v^(2) + v^(4) y)^T P^{-1} D_4 (Y X v^(1) + v^(2) + v^(4) y)
    = e_m^T U (D_{v1} − 2T D_{v3})^2 e_m + e_m^T V (D_{v1} + 2T D_{v3})^2 e_m
      + (v^(2))^T Ξ^{-1} D_5 v^(2) + (v^(3))^T (T^{-1} D_3 − 2D_1 − 2D_2) v^(3)
      + (Y X v^(1) + v^(2) + v^(4) y)^T P^{-1} D_4 (Y X v^(1) + v^(2) + v^(4) y)
    > 0,   (B.1)

where D_{v1} = diag(v^(1)), D_{v2} = diag(v^(2)), D_{v3} = diag(v^(3)), and D_{v4} = diag(v^(4)).


B.1. Proof of Lemma 3

Proof. It is easy to see that if Δz^k = 0, then (23) holds for all α_k ≥ 0.

Suppose Δz^k ≠ 0. We have

  ∇_w Φ_μ(z^k)^T Δw^k + ∇_t Φ_μ(z^k)^T Δt^k + ∇_ξ Φ_μ(z^k)^T Δξ^k + ∇_b Φ_μ(z^k)^T Δb^k
    = −((Δw^k)^T, (Δξ^k)^T, (Δt^k)^T, Δb^k) S [Δw^k; Δξ^k; Δt^k; Δb^k].   (B.2)

Since the matrix S is positive definite and Δz^k ≠ 0, the last equation implies

  ∇_w Φ_μ(z^k)^T Δw^k + ∇_t Φ_μ(z^k)^T Δt^k + ∇_ξ Φ_μ(z^k)^T Δξ^k + ∇_b Φ_μ(z^k)^T Δb^k < 0.   (B.3)

Consequently, there exists an ᾱ_k ∈ (0, 1] such that the fourth inequality in (23) is satisfied for all α_k ∈ (0, ᾱ_k]. On the other hand, since (x^k, t^k) is strictly feasible, the point (x^k, t^k) + α(Δx^k, Δt^k) will be feasible for all α > 0 sufficiently small. The proof is complete.
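For intuition, the step-size rule that this proof relies on combines a strict-feasibility test with an Armijo-type decrease condition on the barrier function. The following generic sketch illustrates that pattern only; the precise inequalities (23) of Algorithm 2 appear earlier in the paper, and the helpers phi_mu, grad_phi_mu, and strictly_feasible are assumed here.

```python
def backtracking_step(z, dz, phi_mu, grad_phi_mu, strictly_feasible,
                      tau1=0.5, beta=0.6, max_backtracks=50):
    """Largest alpha = beta^j such that z + alpha*dz stays strictly feasible
    and the barrier function decreases by an Armijo-type amount."""
    slope = grad_phi_mu(z) @ dz      # directional derivative, negative by (B.3)
    alpha = 1.0
    for _ in range(max_backtracks):
        trial = z + alpha * dz
        if strictly_feasible(trial) and phi_mu(trial) <= phi_mu(z) + tau1 * alpha * slope:
            return alpha
        alpha *= beta
    return 0.0   # no acceptable step found (cannot happen under Lemma 3)
```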

C. Convergence Analysis

This appendix is devoted to the global convergence of the interior point method. We first show the convergence of Algorithm 2 when a fixed μ is applied to the barrier subproblem (14).

Lemma C.1. Let {(z^k, λ^k)} be generated by Algorithm 2. Then the {z^k} are strictly feasible for problem (10), and the Lagrangian multipliers {λ^k} are bounded from above.

Proof. For the sake of convenience, we use g_i(z), i = 1, 2, ..., 3m + 2n, to denote the constraint functions of the constrained problem (10).

We first show that the {z^k} are strictly feasible. Suppose, on the contrary, that there exist an infinite index subset K and an index i ∈ {1, 2, ..., 3m + 2n} such that {g_i(z^k)}_K ↓ 0. By the definition of Φ_μ(z) and the fact that f(z) = Σ_{j=1}^m t_j + C Σ_{i=1}^n ξ_i is bounded from below on the feasible set, it must hold that {Φ_μ(z^k)}_K → ∞. However, the line search rule implies that the sequence {Φ_μ(z^k)} is decreasing, so we get a contradiction. Consequently, for any i ∈ {1, 2, ..., 3m + 2n}, g_i(z^k) is bounded away from zero. The boundedness of {λ^k} then follows from (24)–(27).

Lemma C.2. Let {(z^k, λ^k)} and {(Δz^k, λ̃^{k+1})} be generated by Algorithm 2. If {z^k}_K is a convergent subsequence of {z^k}, then the sequence {(Δz^k, λ̃^{k+1})}_K is bounded.

Proof. Again we use g_i(z), i = 1, 2, ..., 3m + 2n, to denote the constraint functions of the constrained problem (10). We suppose, on the contrary, that there exists an infinite subset K′ ⊆ K such that the subsequence {(Δz^k, λ̃^{k+1})}_{K′} tends to infinity. Let {z^k}_{K′} → z*. It follows from Lemma C.1 that there is an infinite subset K̄ ⊆ K′ such that {λ^k}_{K̄} → λ* and g_i(z*) > 0 for all i ∈ {1, 2, ..., 3m + 2n}. Thus we have

  S_k → S*   (C.1)

as k → ∞ with k ∈ K̄. By Proposition B.1, S* is positive definite. Since the right-hand side of (18) is bounded and continuous, the unboundedness of {(Δz^k, λ̃^{k+1})}_{K̄} implies that the limit of the coefficient matrices of (18) is singular, which yields a contradiction. The proof is complete.

Lemma C.3. Let {(z^k, λ^k)} and {(Δz^k, λ̃^{k+1})} be generated by Algorithm 2. If {z^k}_K is a convergent subsequence of {z^k}, then one has {Δz^k}_K → 0.

Proof. It follows from Lemma C.2 that the sequence {Δz^k}_K is bounded. Suppose, on the contrary, that there exists an infinite subset K′ ⊆ K such that {Δz^k}_{K′} → Δz* ≠ 0. Since the subsequences {z^k}_{K′}, {λ^k}_{K′}, and {λ̃^k}_{K′} are all bounded, there are points z*, λ*, and λ̃*, as well as an infinite index set K̄ ⊆ K′, such that {z^k}_{K̄} → z*, {λ^k}_{K̄} → λ*, and {λ̃^k}_{K̄} → λ̃*. By Lemma C.1, we have g_i(z*) > 0 for all i ∈ {1, 2, ..., 3m + 2n}. Similar to the proof of Lemma 3, it is not difficult to get

  ∇_w Φ_μ(z*)^T Δw* + ∇_t Φ_μ(z*)^T Δt* + ∇_ξ Φ_μ(z*)^T Δξ* + ∇_b Φ_μ(z*)^T Δb*
    = −((Δw*)^T, (Δξ*)^T, (Δt*)^T, Δb*) S [Δw*; Δξ*; Δt*; Δb*] < 0.   (C.2)

Since g_i(z*) > 0 for all i ∈ {1, 2, ..., 3m + 2n}, there exists an ᾱ ∈ (0, 1] such that, for all α ∈ (0, ᾱ],

  g_i(z* + αΔz*) > 0,  for all i ∈ {1, 2, ..., 3m + 2n}.   (C.3)

Taking into account τ_1 ∈ (0, 1/2), we claim that there exists an α̂ ∈ (0, ᾱ] such that the following inequality holds for all α ∈ (0, α̂]:

  Φ_μ(z* + αΔz*) − Φ_μ(z*) ≤ 1.1 τ_1 α ∇_z Φ_μ(z*)^T Δz*.   (C.4)

Let m* = min{j | β^j ∈ (0, α̂], j = 0, 1, ...} and α* = β^{m*}. It follows from (C.3) that the following inequality is satisfied for all k ∈ K̄ sufficiently large and any i ∈ {1, 2, ..., 3m + 2n}:

  g_i(z^k + α*Δz^k) > 0.   (C.5)

Moreover, we have

  Φ_μ(z^k + α*Δz^k) − Φ_μ(z^k)
    = Φ_μ(z* + α*Δz*) − Φ_μ(z*) + Φ_μ(z^k + α*Δz^k) − Φ_μ(z* + α*Δz*) + Φ_μ(z*) − Φ_μ(z^k)
    ≤ 1.05 τ_1 α* ∇_z Φ_μ(z*)^T Δz*
    ≤ τ_1 α* ∇_z Φ_μ(z^k)^T Δz^k.   (C.6)

By the backtracking line search rule, the last inequality together with (C.5) yields α_k ≥ α* for all k ∈ K̄ large enough. Consequently, when k ∈ K̄ is sufficiently large, we have from (23) and (C.3) that

  Φ_μ(z^{k+1}) ≤ Φ_μ(z^k) + τ_1 α_k ∇_z Φ_μ(z^k)^T Δz^k
              ≤ Φ_μ(z^k) + τ_1 α* ∇_z Φ_μ(z^k)^T Δz^k
              ≤ Φ_μ(z^k) + (1/2) τ_1 α* ∇_z Φ_μ(z*)^T Δz*.   (C.7)

This shows that {Φ_μ(z^k)}_K̄ → −∞, contradicting the fact that {Φ_μ(z^k)}_K̄ is bounded from below. The proof is complete.

We now establish the convergence of Algorithms 2 and 4, that is, Theorems 5 and 6 in Section 3.

C.1. Proof of Theorem 5

Proof. Let (z*, λ̃*) be a limit point of {(z^k, λ̃^k)} and let the subsequence {(z^k, λ̃^k)}_K converge to (z*, λ̃*).

Recall that (Δz^k, λ̃^{k+1}) is the solution of (18) with (z, λ) = (z^k, λ^k). Taking limits on both sides of (18) with (z, λ) = (z^k, λ^k) as k → ∞ with k ∈ K, by the use of Lemmas C.1 and C.3, we obtain

  λ̃*^(1) − λ̃*^(2) − X^T Y^T λ̃*^(4) = 0,
  C e_n − λ̃*^(4) − λ̃*^(5) = 0,
  e_m − 2T* λ̃*^(1) − 2T* λ̃*^(2) − λ̃*^(3) = 0,
  y^T λ̃*^(4) = 0,
  (T*^2 − W*) λ̃*^(1) − μ e_m = 0,
  (T*^2 + W*) λ̃*^(2) − μ e_m = 0,
  T* λ̃*^(3) − μ e_m = 0,
  P* λ̃*^(4) − μ e_n = 0,
  Ξ* λ̃*^(5) − μ e_n = 0.   (C.8)

This shows that (z*, λ̃*) satisfies the first-order optimality conditions (15).

C.2. Proof of Theorem 6

Proof. (i) Without loss of generality, we suppose that the bounded subsequence {(z^j, λ^j)}_J converges to some point (z*, λ*). It is clear that z* is a feasible point of (10). Since μ_j, ε_{μ_j} → 0, it follows from (29) and (30) that

  {Res(z^j, λ^j, μ_j)}_J → Res(z*, λ*, 0) = 0,
  {λ^j}_J → λ* ≥ 0.   (C.9)

Consequently, (z*, λ*) satisfies the KKT conditions (13).

(ii) Let ξ_j = max{‖λ^j‖_∞, 1} and λ̄^j = ξ_j^{-1} λ^j. Obviously, {λ̄^j} is bounded. Hence there exists an infinite subset J′ ⊆ J such that {λ̄^j}_{J′} → λ̄ ≠ 0 and ‖λ̄^j‖_∞ = 1 for large j ∈ J′. From (30) we know that λ̄ ≥ 0. Dividing both sides of the first equation of (18) with (z, λ) = (z^j, λ^j) by ξ_j and then taking limits as j → ∞ with j ∈ J′, we get

  λ̄^(1) − λ̄^(2) − X^T Y^T λ̄^(4) = 0,
  λ̄^(4) + λ̄^(5) = 0,
  2T* λ̄^(1) + 2T* λ̄^(2) + λ̄^(3) = 0,
  y^T λ̄^(4) = 0,
  (T*^2 − W*) λ̄^(1) = 0,
  (T*^2 + W*) λ̄^(2) = 0,
  T* λ̄^(3) = 0,
  P* λ̄^(4) = 0,
  Ξ* λ̄^(5) = 0.   (C.10)

Since z* is feasible, the above equations show that z* is a Fritz-John point of problem (10).

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

The authors would like to acknowledge support for this project from the National Science Foundation (NSF Grants 11371154, 11071087, 11271069, and 61103202) of China.

References

[1] Y. M. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization," in Proceedings of the 14th International Conference on Machine Learning (ICML '97), vol. 97, pp. 412–420, 1997.
[2] G. Forman, "An extensive empirical study of feature selection metrics for text classification," The Journal of Machine Learning Research, vol. 3, pp. 1289–1305, 2003.
[3] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1–3, pp. 389–422, 2002.
[4] H. H. Zhang, J. Ahn, X. Lin, and C. Park, "Gene selection using support vector machines with non-convex penalty," Bioinformatics, vol. 22, no. 1, pp. 88–95, 2006.
[5] Y. Saeys, I. Inza, and P. Larranaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507–2517, 2007.
[6] J. Weston, F. Perez-Cruz, O. Bousquet, O. Chapelle, A. Elisseeff, and B. Scholkopf, "Feature selection and transduction for prediction of molecular bioactivity for drug design," Bioinformatics, vol. 19, no. 6, pp. 764–771, 2003.
[7] Y. Liu, "A comparative study on feature selection methods for drug discovery," Journal of Chemical Information and Computer Sciences, vol. 44, no. 5, pp. 1823–1828, 2004.
[8] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, Eds., Feature Extraction: Foundations and Applications, vol. 207, Springer, 2006.
[9] A. Rakotomamonjy, "Variable selection using SVM-based criteria," Journal of Machine Learning Research, vol. 3, no. 7-8, pp. 1357–1370, 2003.
[10] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik, "Feature selection for SVMs," in Proceedings of the Conference on Neural Information Processing Systems, vol. 13, pp. 668–674, Denver, Colo, USA, 2000.
[11] D. Peleg and R. Meir, "A feature selection algorithm based on the global minimization of a generalization error bound," in Proceedings of the Conference on Neural Information Processing Systems, Vancouver, Canada, 2004.
[12] P. S. Bradley and O. L. Mangasarian, "Feature selection via concave minimization and support vector machines," in Proceedings of the International Conference on Machine Learning, pp. 82–90, San Francisco, Calif, USA, 1998.
[13] J. Weston, A. Elisseeff, B. Scholkopf, and M. Tipping, "Use of the zero-norm with linear models and kernel methods," Journal of Machine Learning Research, vol. 3, pp. 1439–1461, 2003.
[14] A. B. Chan, N. Vasconcelos, and G. R. G. Lanckriet, "Direct convex relaxations of sparse SVM," in Proceedings of the 24th International Conference on Machine Learning (ICML '07), pp. 145–153, Corvallis, Ore, USA, June 2007.
[15] G. M. Fung and O. L. Mangasarian, "A feature selection Newton method for support vector machine classification," Computational Optimization and Applications, vol. 28, no. 2, pp. 185–202, 2004.
[16] J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani, "1-norm support vector machines," Advances in Neural Information Processing Systems, vol. 16, no. 1, pp. 49–56, 2004.
[17] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society B: Methodological, vol. 58, no. 1, pp. 267–288, 1996.
[18] T. Zhang, "Analysis of multi-stage convex relaxation for sparse regularization," Journal of Machine Learning Research, vol. 11, pp. 1081–1107, 2010.
[19] R. Chartrand, "Exact reconstruction of sparse signals via nonconvex minimization," IEEE Signal Processing Letters, vol. 14, no. 10, pp. 707–710, 2007.
[20] R. Chartrand, "Nonconvex regularization for shape preservation," in Proceedings of the 14th IEEE International Conference on Image Processing (ICIP '07), vol. 1, pp. I293–I296, San Antonio, Tex, USA, September 2007.
[21] Z. Xu, H. Zhang, Y. Wang, X. Chang, and Y. Liang, "L1/2 regularization," Science China Information Sciences, vol. 53, no. 6, pp. 1159–1169, 2010.
[22] W. J. Chen and Y. J. Tian, "Lp-norm proximal support vector machine and its applications," Procedia Computer Science, vol. 1, no. 1, pp. 2417–2423, 2010.
[23] J. Liu, J. Li, W. Xu, and Y. Shi, "A weighted Lq adaptive least squares support vector machine classifiers: robust and sparse approximation," Expert Systems with Applications, vol. 38, no. 3, pp. 2253–2259, 2011.
[24] Y. Liu, H. H. Zhang, C. Park, and J. Ahn, "Support vector machines with adaptive Lq penalty," Computational Statistics & Data Analysis, vol. 51, no. 12, pp. 6380–6394, 2007.
[25] Z. Liu, S. Lin, and M. Tan, "Sparse support vector machines with Lp penalty for biomarker identification," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 1, pp. 100–107, 2010.
[26] A. Rakotomamonjy, R. Flamary, G. Gasso, and S. Canu, "lp-lq penalty for sparse linear and sparse multiple kernel multitask learning," IEEE Transactions on Neural Networks, vol. 22, no. 8, pp. 1307–1320, 2011.
[27] J. Y. Tan, Z. Zhang, L. Zhen, C. H. Zhang, and N. Y. Deng, "Adaptive feature selection via a new version of support vector machine," Neural Computing and Applications, vol. 23, no. 3-4, pp. 937–945, 2013.
[28] Z. B. Xu, X. Y. Chang, F. M. Xu, and H. Zhang, "L1/2 regularization: an iterative half thresholding algorithm," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 7, pp. 1013–1027, 2012.
[29] D. Ge, X. Jiang, and Y. Ye, "A note on the complexity of Lp minimization," Mathematical Programming, vol. 129, no. 2, pp. 285–299, 2011.
[30] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 1995.
[31] J. Fan and H. Peng, "Nonconcave penalized likelihood with a diverging number of parameters," The Annals of Statistics, vol. 32, no. 3, pp. 928–961, 2004.
[32] K. Knight and W. Fu, "Asymptotics for lasso-type estimators," The Annals of Statistics, vol. 28, no. 5, pp. 1356–1378, 2000.
[33] B. S. Tian and X. Q. Yang, "An interior-point L1/2-penalty method for nonlinear programming," Technical Report, Department of Applied Mathematics, Hong Kong Polytechnic University, 2013.
[34] L. Armijo, "Minimization of functions having Lipschitz continuous first partial derivatives," Pacific Journal of Mathematics, vol. 16, pp. 1–3, 1966.
[35] F. John, "Extremum problems with inequalities as side-conditions," in Studies and Essays: Courant Anniversary Volume, pp. 187–204, 1948.
[36] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz, "UCI repository of machine learning databases," Technical Report 9702, Department of Information and Computer Science, University of California, Irvine, Calif, USA, 1998, http://archive.ics.uci.edu/ml.


Page 7: Research Article An Interior Point Method for -SVM …downloads.hindawi.com/journals/jam/2014/942520.pdfResearch Article An Interior Point Method for 1/2-SVM and Application to Feature

Journal of Applied Mathematics 7

where 119901119894= 119910119894(119908119879119909119894+ 119887) + 120585

119894minus 1

(5)(119896+1)

119894=

min119897120583

120585119896

119894

if (5)(119896+1)119894

lt min119897120583

120585119896

119894

(5)(119896+1)

119894 if min119897

120583

120585119896

119894

le (5)(119896+1)

119894le

1205831205741

120585119896

119894

1205831205741

120585119896

119894

if (5)(119896+1)119894

gt1205831205741

120585119896

119894

(25)

where the parameters 119897 and 1205741satisfy 0 lt 119897 120574

1gt 1

Since positive definiteness of the matrix 119878 is demanded inthis method the Lagrangian multipliers

119896+1 should satisfythe following condition

120582(3)

minus 2 (120582(1)

+ 120582(2)

) 119905 ge 0 (26)

For the sake of simplicity the proof is given in Appendix BTherefore if 119896+1 satisfies (26) we let 120582119896+1 =

119896+1 Other-wise we would further update it by the following setting

120582(1)(119896+1)

= 1205742(1)(119896+1)

120582(2)(119896+1)

= 1205742(2)(119896+1)

120582(3)(119896+1)

= 1205743(3)(119896+1)

(27)

where constants 1205742isin (0 1) and 120574

3ge 1 satisfy

1205743

1205742

= max119894isin119864

2119905119896+1

119894((1)(119896+1)

119894+ (2)(119896+1)

119894)

(3)(119896+1)

119894

(28)

with 119864 = 1 2 119899 It is not difficult to see that the vector((1)(119896+1)

(2)(119896+1)

(3)(119896+1)

) determined by (27) satisfies (26)In practice the KKT conditions (15) are allowed to be

satisfied within a tolerance 120598120583 It turns to be that the iterative

process stops while the following inequalities meet

Res (119911 120582 120583) =

10038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817

120582(1)

minus 120582(2)

minus 119883119879119884119879120582(4)

119862 lowast 119890119899minus 120582(4)

minus 120582(5)

119890119898minus 2119879120582

(1)minus 2119879120582

(2)minus 120582(3)

119890119879

119899119884120582(4)

(1198792minus 119882)120582

(1)minus 120583119890119898

(1198792+ 119882)120582

(2)minus 120583119890119898

119879120582(3)

minus 120583119890119898

119875120582(4)

minus 120583119890119899

Ξ120582(5)

minus 120583119890119899

10038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817100381710038171003817

lt 120598120583 (29)

120582 ge minus120598120583119890 (30)

where 120598120583is related to the current barrier parameter 120583 and

satisfies 120598120583darr 0 as 120583 rarr 0

Since functionΦ120583in (14) is convex we have the following

lemma which shows that Algorithm 2 is well defined (theproof is given in Appendix B)

Lemma 3 Let 119911119896 be strictly feasible for problem (10) If Δ119911119896=

0 then (23) is satisfied for all 120572119896

ge 0 If Δ119911119896

= 0 then thereexists a

119896isin (0 1] such that (23) holds for all 120572

119896isin (0

119896]

The proposed interior point method successively solvesthe barrier subproblem (14) with a decreasing sequence 120583

119896

We simply reduce both 120598120583and 120583 by a constant factor 120573 isin

(0 1) Finally we test optimality for problem (10) by meansof the residual norm Res(119911 0)

Here we present the whole algorithm to solve the 11987112

-SVM problem(10)

Algorithm 4 Algorithm for solving 11987112

-SVM problem is asfollows

Step 0 Set 1199080 isin 119877119898 1198870isin 119877 120585

0isin 119877119899 1205820 =

0gt 0

1199050

119894ge radic|119908

0

119894| + (12) 119894 isin 1 2 119898 Given constants

1205830gt 0 1205981205830

gt 0 120573 isin (0 1) and 120598 gt 0 let 119895 = 0

Step 1 Stop if Res(119911119895 119895 0) le 120598 and 119895ge 0

Step 2 Starting from (119911119895 120582119895) apply Algorithm 2 to

solve (14) with barrier parameter 120583119895and stopping

tolerance 120598120583119895 Set 119911119895+1 = 119911

119895119896 119895+1 = 119895119896 and 120582

119895+1=

120582119895119896

Step 3 Set 120583119895+1

= 120573120583119895and 120598120583119895+1

= 120573120598120583119895 Let 119895 = 119895 + 1

and go to Step 1

In Algorithm 4 the index 119895 denotes an outer iterationwhile 119896 denotes the last inner iteration of Algorithm 2

The convergence of the proposed interior point methodcan be proved We list the theorem here and give the proof inAppendix C

Theorem 5 Let (119911119896 120582119896) be generated by Algorithm 2 Thenany limit point of the sequence (119911

119896 119896) generated by

Algorithm 2 satisfies the first-order optimality conditions (15)

Theorem 6 Let (119911119895 119895) be generated by Algorithm 4 by

ignoring its termination condition Then the following state-ments are true

(i) The limit point of (119911119895 119895) satisfies the first order opti-mality condition (13)

(ii) The limit point 119911lowast of the convergent subsequence

119911119895J sube 119911

119895 with unbounded multipliers

119895J is a

Fritz-John point [35] of problem (10)

4 Experiments

In this section we tested the constrained optimization refor-mulation to the 119871

12-SVM and the proposed interior point

method We compared the performance of the 11987112

-SVMwith 119871

2-SVM [30] 119871

1-SVM [12] and 119871

0-SVM [12] on

artificial data and ten UCI data sets (httparchiveicsucieduml) These four problems were solved in primal refer-encing the machine learning toolbox Spider (httppeoplekybtuebingenmpgdespider) The 119871

2-SVM and 119871

1-SVM

were solved directly by quadratic programming and linearprogramming respectively To the NP-hard problem 119871

0-

SVM a commonly cited approximation method FeatureSelection Concave (FSV) [12] was applied and then the FSV

8 Journal of Applied Mathematics

problem was solved by a Successive Linear Approximation(SLA) algorithm All the experiments were run in thepersonal computer (16 GHz of CPU 4GB of RAM) withMATLAB R2010b on 64 bit Windows 7

In the proposed interior point method (Algorithms 2 and4) we set the parameters as 120573 = 06 120591

1= 05 119897 = 000001

1205741

= 100000 and 120573 = 05 The balance parameter 119862 wasselected by 5-fold cross-validation on training set over therange 2

minus10 2minus9 210 After training the weights that didnot satisfy the criteria |119908

119895|max

119894(|119908119894|) ge 10

4 [14] were set tozeroThen the cardinality of the hyperplane was computed asthe number of the nonzero weights

41 Artificial Data First we took an artificial binary linearclassification problem as an example The problem is similarto that in [13] The probability of 119910 = 1 or minus1 is equal Thefirst 6 features are relevant but redundant In 70 samplesthe first three features 119909

1 1199092 1199093were drawn as 119909

119894= 119910119873(119894 1)

and the second three features 1199094 1199095 1199096 as 119909

119894= 119873(0 1)

Otherwise the first three were drawn as 119909119894= 119873(0 1) and the

second three as 119909119894= 119910119873(119894 minus 3 1) The rest features are noise

119909119894= 119873(0 20) 119894 = 7 119898 Here119898 is the dimension of input

features The inputs were scaled to mean zero and standarddeviation In each trial 500 points were generated for testingand the average results were estimated over 30 trials

In the first experiment we consider the cases with thefixed feature size 119898 = 30 and different training sample sizes119899 = 10 20 100 The average results over the 30 trialsare shown in Table 1 and Figure 2 Figure 2 (left) plots theaverage cardinality of each classifier Since the artificial datasets have 2 relevant and nonredundant features the idealaverage cardinality is 2 Figure 2 (left) shows that the threesparse SVMs 119871

1-SVM 119871

12-SVM and 119871

0-SVM can achieve

sparse solution while the 1198712-SVM almost uses full features

in each data set Furthermore the solutions of 11987112

-SVMand 119871

0-SVM are much sparser than 119871

1-SVM As shown in

Table 1 the 1198711-SVM selects more than 6 features in all cases

which implies that some redundant or irrelevant featuresare selected The average cardinalities of 119871

12-SVM and 119871

0-

SVM are similar and close to 2 However when 119899 = 10

and 20 the 1198710-SVM has the average cardinalities of 142

and 187 respectively It means that the 1198710-SVM sometimes

selects only one feature in low sample data set and maybeignores some really relevant feature Consequently with thecardinalities between 205 and 29 119871

12-SVM has the more

reliable solution than 1198710-SVM In short as far as the number

of selected features is concerned the 11987112

-SVM behavesbetter than the other three methods

Figure 2 (right) plots the trend of the prediction accuracyversus the size of the training sample The classificationperformance of all methods is generally improved with theincreasing of the training sample size 119899 119871

1-SVM has the best

prediction performance in all cases and a slightly better than11987112

-SVM 11987112

-SVM shows more accuracy in classificationthan 119871

2-SVM and 119871

0-SVM especially in the case of 119899 =

10 50 As shown in Table 1 when there are only 10 train-ing samples the average accuracy of 119871

12-SVM is 8805

while the results of 1198712-SVM and 119871

0-SVM are 8465 and

7765 respectively Compared with 1198712-SVM and 119871

0-SVM

11987112

-SVM has the average accuracy increased by 34 and104 respectively as can be explained in what follows Tothe 1198712-SVM all features are selected without discrimination

and the prediction would bemisled by the irrelevant featuresTo the 119871

0-SVM few features are selected and some relevant

features are not included which would put negative impacton the prediction result As the tradeoff between119871

2-SVMand

1198710-SVM 119871

12-SVM has better performance than the two

The average results over ten artificial data sets in the firstexperiment are shown in the bottom of Table 1 On averagethe accuracy of 119871

12-SVM is 087 lower than the 119871

1-SVM

while the features selected by 11987112

-SVM are 7414 less than1198711-SVM It indicates that the 119871

12-SVM can achieve much

sparser solution than 1198711-SVM with little cost of accuracy

Moreover the average accuracy of 11987112

-SVMover 10 data setsis 222 higher than 119871

0-SVM with the similar cardinality

To sum up the 11987112

-SVM provides the best balance betweenaccuracy and sparsity among the three sparse SVMs

To further evaluate the feature selection performance of11987112

-SVM we investigate whether the features are correctlyselected For the119871

2-SVM is not designed for feature selection

it is not included in this comparison Since our artificial datasets have 2 best features (119909

3 1199096) the best result should have

the two features (1199093 1199096) ranking on the top according to their

absolute values of weights |119908119895| In the experiment we select

the top 2 features with the maximal |119908119895| for each method

and calculate the frequency that the top 2 features are 1199093

and 1199096in 30 runs The results are listed in Table 2 When the

training sample size is too small it is difficult to discriminatethe two most important features for all sparse SVMs Forexample when 119899 = 10 the selected frequencies of 119871

1-SVM

1198710-SVM and 119871

12-SVM are 7 3 and 9 respectively When 119899

increases all methods tend to make more correct selectionMoreover Table 2 shows that the 119871

12-SVM outperforms the

other two methods in all cases For example when 119899 = 100the selected frequencies of 119871

1-SVM and119871

0-SVM are 22 and

25 respectively and the result of 11987112

-SVM is 27 The 1198711-

SVMselects toomany redundant or irrelevant features whichmay influence the ranking in some extent Therefore 119871

1-

SVM is not so good as 11987112

-SVM at distinguishing the criticalfeatures The 119871

0-SVM has the lower hit frequency than 119871

12-

SVM which is probably due to the excessive small featuresubset it obtained Above all Tables 1 and 2 and Figure 2clearly show that the 119871

12-SVM is a promising sparsity driven

classification methodIn the second simulation we consider the cases with

various dimensions of feature space 119898 = 20 40 180 200

and the fixed training sample size 119899 = 100 The averageresults over 30 trials are shown in Figure 3 and Table 3 Sincethere are only 6 relevant features yet the larger 119898 meansthe more noisy features Figure 3 (left) shows that as thedimension increases from 20 to 200 the number of featuresselected by 119871

1-SVM increases from 826 to 231 However the

cardinalities of 11987112

-SVM and 1198710-SVM keep stable (from 22

to 295) It indicates that the 11987112

-SVM and 1198710-SVM aremore

suitable for feature selection than 1198711-SVM

Figure 3 (right) shows thatwith the increasing of the noisefeatures the accuracy of 119871

2-SVM drops significantly (from

9868 to 8747) On the contrary to the other three sparse

Journal of Applied Mathematics 9

0 20 40 60 80 1000

5

10

15

20

25

30

Number of training samples

Card

inal

ity

0 20 40 60 80 10075

80

85

90

95

100

Number of training samples

Accu

racy

L2-SVML1-SVM

L0-SVML12-SVM

L2-SVML1-SVM

L0-SVML12-SVM

Figure 2 Results comparison on 10 artificial data sets with various training samples

Table 1 Results comparison on 10 artificial data sets with various training sample sizes

1198991198712-SVM 119871

1-SVM 119871

0-SVM 119871

12-SVM

Card Acc Card Acc Card Acc Card Acc10 2997 8465 693 8855 143 7765 225 880520 3000 9221 883 9679 187 9317 235 955030 3000 9427 847 9823 203 9467 205 974140 2997 9599 900 9875 207 9539 210 977550 3000 9661 950 9891 213 9657 245 976860 2997 9725 1000 9894 220 9761 245 979870 3000 9741 1100 9898 220 9756 240 982380 3000 9768 1003 9909 223 9761 260 985090 2997 9789 920 9921 230 9741 290 9830100 3000 9813 1027 9909 250 9793 255 9836On average 2999 9521 932 9765 210 9456 241 9678ldquo119899rdquo is the number of training samples ldquoCardrdquo represents the number of selected features and ldquoAccrdquo is the classification accuracy

Table 2 Frequency of the most important features (1199093 1199096) ranked on top 2 in 30 runs

119899 10 20 30 40 50 60 70 80 90 1001198711-SVM 7 14 16 18 22 23 20 22 21 22

1198710-SVM 3 11 12 15 19 22 22 23 22 25

11987112-SVM 9 16 19 24 24 23 27 26 26 27

Table 3 Average results over 10 artificial data sets with varies dimension

1198712-SVM 119871

1-SVM 119871

0-SVM 119871

12-SVM

Cardinality 10995 1585 238 241Accuracy 9293 9907 9776 9826

10 Journal of Applied Mathematics

Table 4: Feature selection performance comparison on UCI data sets (cardinality).

                                     L2-SVM         L1-SVM         L0-SVM          L1/2-SVM
    No.  UCI data set    n     m     Mean    Std    Mean    Std    Mean     Std    Mean    Std
    1    Pima diabetes   768   8       8.00  0.00    7.40   0.89    6.20    1.10    7.60   1.52
    2    Breast cancer   683   9       9.00  0.00    8.60   1.22    6.40    2.07    8.40   1.67
    3    Wine (3)        178   13     13.00  0.00    9.73   0.69    3.73    0.28    4.07   0.49
    4    Image (7)       2100  19     19.00  0.00    8.20   0.53    3.66    0.97    5.11   0.62
    5    SPECT           267   22     22.00  0.00   17.40   5.27    9.80    4.27    9.00   1.17
    6    WDBC            569   30     30.00  0.00    9.80   1.52    9.00    3.03    5.60   2.05
    7    Ionosphere      351   34     33.20  0.45   25.00   3.35   24.60   10.21   21.80   8.44
    8    SPECTF          267   44     44.00  0.00   38.80  10.98   24.00   11.32   25.60   7.75
    9    Sonar           208   60     60.00  0.00   31.40  13.05   27.00    7.18   13.60  10.06
    10   Musk1           476   166   166.00  0.00   85.80  18.85   48.60    4.28   41.00  14.16

Average cardinality: 40.42 (L2-SVM), 24.21 (L1-SVM), 16.30 (L0-SVM), and 14.18 (L1/2-SVM); the average input dimension m is 40.50.

Table 5: Classification performance comparison on UCI data sets (accuracy, %).

                             L2-SVM         L1-SVM         L0-SVM         L1/2-SVM
    No.  UCI data set        Mean    Std    Mean    Std    Mean    Std    Mean    Std
    1    Pima diabetes       76.99   3.47   76.73   0.85   76.73   2.94   77.12   1.52
    2    Breast cancer       96.18   2.23   97.06   0.96   96.32   1.38   96.47   1.59
    3    Wine (3)            97.71   3.73   98.29   1.28   93.14   2.39   97.14   2.56
    4    Image (7)           88.94   1.49   91.27   1.77   86.67   4.58   91.36   1.94
    5    SPECT               83.40   2.15   78.49   2.31   78.11   3.38   82.34   5.10
    6    WDBC                96.11   2.11   96.46   1.31   95.58   1.15   96.46   1.31
    7    Ionosphere          83.43   5.44   88.00   3.70   84.57   5.38   87.43   5.94
    8    SPECTF              78.49   3.91   74.34   4.50   73.87   3.25   78.49   1.89
    9    Sonar               73.66   5.29   75.61   5.82   74.15   7.98   76.10   4.35
    10   Musk1               84.84   1.73   82.11   2.42   76.00   5.64   82.32   3.38
    Average accuracy         85.97          85.84          83.51          86.52

SVMs, there is little change in accuracy. This reveals that SVMs can benefit from feature reduction.

Table 3 shows the average results over all data sets in the second experiment. On average, the solution of L1/2-SVM is much sparser than that of L1-SVM, and its accuracy is slightly better than that of L0-SVM.

4.2. UCI Data Sets. We further tested the reformulation and the proposed interior point method for the L1/2-SVM on 10 UCI data sets [36]. There are 8 binary classification problems and 2 multiclass problems (Wine, Image). Each feature of the input data was normalized to zero mean and unit variance, and instances with missing values were deleted. The data were then randomly split into a training set (80%) and a testing set (20%). For the two multiclass problems, a one-against-rest method was applied to construct a binary classifier for each class. We repeated the training and testing procedure 10 times, and the average results are shown in Tables 4 and 5 and Figure 4.
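The following Python sketch illustrates this evaluation protocol with scikit-learn; it is not the authors' MATLAB code. LinearSVC is only a stand-in for the L1/2-SVM solver, and the data loading is left as an assumption, so the snippet shows just the normalization, the 80/20 split, and the one-against-rest construction.

    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import train_test_split
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import LinearSVC

    def run_trial(X, y, seed):
        # Normalize each input feature to zero mean and unit variance.
        X = StandardScaler().fit_transform(X)
        # Randomly split the data into a training set (80%) and a testing set (20%).
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
        # One-against-rest: one binary classifier per class (for a binary problem
        # this reduces to a single classifier).  LinearSVC is only a placeholder
        # for the L1/2-SVM solver.
        clf = OneVsRestClassifier(LinearSVC()).fit(X_tr, y_tr)
        return clf.score(X_te, y_te)

    # Repeat the training and testing procedure 10 times and average the accuracy:
    # X, y = ...  # load a UCI data set (instances with missing values removed)
    # scores = [run_trial(X, y, seed) for seed in range(10)]
    # print(sum(scores) / len(scores))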

Tables 4 and 5 summarize the feature selection and classification performance of the numerical experiments on the UCI data sets, respectively. Here n is the number of samples and m is the number of input features. For the two multiclass data sets, the number of classes is given in parentheses after the name. Sparsity is defined as Card/m, and a smaller value of sparsity is preferred. The data sets are arranged in ascending order of dimension. The lowest cardinality and the best accuracy rate for each problem are marked in bold in the original typeset tables.

As shown in Tables 4 and 5, the three sparse SVMs encourage sparsity on all data sets while maintaining accuracy roughly identical to that of L2-SVM. Among the three sparse methods, the L1/2-SVM has the lowest average cardinality (14.18) and the highest average classification accuracy (86.52%), while the L1-SVM has the worst feature selection performance, with the highest average cardinality (24.21), and the L0-SVM has the lowest average classification accuracy (83.51%).

Figure 4 plots the sparsity (left) and the classification accuracy (right) of each classifier on the UCI data sets. On three data sets (6, 9, 10), the L1/2-SVM has the best performance among the three sparse SVMs in both feature selection and classification. Compared with L1-SVM, L1/2-SVM achieves a sparser solution on nine data sets.


[Figure 3: Results comparison on artificial data sets with various dimensions. Left panel: cardinality versus dimension (20-200); right panel: accuracy (%) versus dimension; curves for L2-SVM, L1-SVM, L0-SVM, and L1/2-SVM.]

For example, on the data set "8 SPECTF", the features selected by L1/2-SVM are 34% fewer than those selected by L1-SVM, and at the same time the accuracy of L1/2-SVM is 4.1% higher than that of L1-SVM. On most data sets (4, 5, 6, 9, 10), the cardinality of L1/2-SVM drops significantly (by at least 37%) with equal or slightly better accuracy. Only on the data set "3 Wine" is the accuracy of L1/2-SVM decreased, by 1.1%, but the sparsity provided by L1/2-SVM is a 58.2% improvement over L1-SVM. On the remaining three data sets (1, 2, 7), the two methods have similar results in feature selection and classification. As seen above, the L1/2-SVM provides a lower-dimensional representation than L1-SVM with competitive prediction performance.

Figure 4 (right) shows that, compared with L0-SVM, L1/2-SVM improves the classification accuracy on all data sets. For instance, on four data sets (3, 4, 8, 10), the classification accuracy of L1/2-SVM is at least 4.0% higher than that of L0-SVM. In particular, on the data set "10 Musk1", L1/2-SVM gives a 6.3% rise in accuracy over L0-SVM. Meanwhile, it can be observed from Figure 4 (left) that L1/2-SVM selects fewer features than L0-SVM on five data sets (5, 6, 7, 9, 10); for example, on data sets 6 and 9, the cardinalities of L1/2-SVM are 37.8% and 49.6% lower than those of L0-SVM, respectively. In summary, L1/2-SVM presents better classification performance than L0-SVM while remaining effective in choosing relevant features.

5. Conclusions

In this paper, we proposed an L1/2 regularization technique for simultaneous feature selection and classification in the SVM. We reformulated the L1/2-SVM as an equivalent smooth constrained optimization problem. The problem possesses a very simple structure, which makes it relatively easy to develop numerical methods. Using this reformulation, we proposed an interior point method and established its convergence. Our numerical results support the reformulation and the proposed method: the L1/2-SVM obtains sparser solutions than L1-SVM with comparable classification accuracy, and it achieves more accurate classification results than L0-SVM (FSV).

Inspired by the good performance of the smooth optimization reformulation of the L1/2-SVM, several interesting topics deserve further research, for example, developing more efficient algorithms for solving the reformulation, studying a nonlinear L1/2-SVM, and exploring various applications of the L1/2-SVM to further validate its effectiveness. Some of these are under our current investigation.

Appendix

A. Properties of the Reformulation of the L1/2-SVM
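For ease of reference, the smooth reformulation (10), as it can be read off from its constraint functions and the Lagrangian (A.2) below, takes the following form in the variables z = (w, t, ξ, b):

\[
\begin{aligned}
\min_{w,\,t,\,\xi,\,b}\quad & \sum_{j=1}^{m} t_j + C \sum_{i=1}^{n} \xi_i \\
\text{s.t.}\quad & t_j^2 - w_j \ge 0,\quad t_j^2 + w_j \ge 0,\quad t_j \ge 0,\qquad j = 1,\dots,m,\\
& y_i\bigl(w^T x_i + b\bigr) + \xi_i - 1 \ge 0,\quad \xi_i \ge 0,\qquad i = 1,\dots,n.
\end{aligned}
\]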

Let D be the set of all feasible points of problem (10). For any z ∈ D, we let FD(z, D) and LFD(z, D) denote the sets of feasible directions and linearized feasible directions of D at z, respectively. Since the constraint functions of (10) are all convex, we immediately have the following lemma.

Lemma A.1. For any feasible point z of (10), one has
\[
FD(z, D) = LFD(z, D).
\tag{A.1}
\]


[Figure 4: Results comparison on the UCI data sets. Left panel: sparsity (%) versus data set number (1-10); right panel: accuracy (%) versus data set number; results for L2-SVM, L1-SVM, L0-SVM, and L1/2-SVM.]

Based on the above lemma, we can easily derive a first-order necessary condition for (10). The Lagrangian function of (10) is

\[
\begin{aligned}
L(w, t, \xi, b, \lambda, \mu, \nu, p, q)
={}& C\sum_{i=1}^{n}\xi_i + \sum_{j=1}^{m} t_j
 - \sum_{j=1}^{m}\lambda_j\bigl(t_j^2 - w_j\bigr)
 - \sum_{j=1}^{m}\mu_j\bigl(t_j^2 + w_j\bigr) \\
&- \sum_{j=1}^{m}\nu_j t_j
 - \sum_{i=1}^{n} p_i\bigl(y_i w^T x_i + y_i b + \xi_i - 1\bigr)
 - \sum_{i=1}^{n} q_i \xi_i,
\end{aligned}
\tag{A.2}
\]

where λ, μ, ν, p, and q are the Lagrangian multipliers. By the use of Lemma A.1, we immediately have the following theorem about the first-order necessary condition.

Theorem A.2. Let (w, t, ξ, b) be a local solution of (10). Then there are Lagrangian multipliers (λ, μ, ν, p, q) ∈ R^{m+m+m+n+n} such that the following KKT conditions hold:

\[
\begin{gathered}
\frac{\partial L}{\partial w_j} = \lambda_j - \mu_j - \sum_{i=1}^{n} p_i y_i x_{ij} = 0, \quad j = 1, \dots, m, \\
\frac{\partial L}{\partial t_j} = 1 - 2\lambda_j t_j - 2\mu_j t_j - \nu_j = 0, \quad j = 1, \dots, m, \\
\frac{\partial L}{\partial \xi_i} = C - p_i - q_i = 0, \quad i = 1, \dots, n, \\
\frac{\partial L}{\partial b} = \sum_{i=1}^{n} p_i y_i = 0, \\
\lambda_j \ge 0, \quad t_j^2 - w_j \ge 0, \quad \lambda_j\bigl(t_j^2 - w_j\bigr) = 0, \quad j = 1, \dots, m, \\
\mu_j \ge 0, \quad t_j^2 + w_j \ge 0, \quad \mu_j\bigl(t_j^2 + w_j\bigr) = 0, \quad j = 1, \dots, m, \\
\nu_j \ge 0, \quad t_j \ge 0, \quad \nu_j t_j = 0, \quad j = 1, \dots, m, \\
p_i \ge 0, \quad y_i\bigl(w^T x_i + b\bigr) + \xi_i - 1 \ge 0, \quad p_i\bigl(y_i(w^T x_i + b) + \xi_i - 1\bigr) = 0, \quad i = 1, \dots, n, \\
q_i \ge 0, \quad \xi_i \ge 0, \quad q_i \xi_i = 0, \quad i = 1, \dots, n.
\end{gathered}
\tag{A.3}
\]

The following theorem shows that the level sets of the objective are bounded.

Theorem A.3. For any given constant c > 0, the level set
\[
W_c = \bigl\{\, z = (w, t, \xi, b) \mid f(z) \le c \,\bigr\} \cap D
\tag{A.4}
\]
is bounded.

Proof. For any (w, t, ξ, b) ∈ W_c, we have
\[
\sum_{j=1}^{m} t_j + C \sum_{i=1}^{n} \xi_i \le c.
\tag{A.5}
\]
Combining this with t_j ≥ 0, j = 1, ..., m, and ξ_i ≥ 0, i = 1, ..., n, we have
\[
0 \le t_j \le c, \quad j = 1,\dots,m, \qquad 0 \le \xi_i \le \frac{c}{C}, \quad i = 1,\dots,n.
\tag{A.6}
\]
Moreover, for any (w, t, ξ, b) ∈ D,
\[
|w_j| \le t_j^2, \quad j = 1,\dots,m.
\tag{A.7}
\]
Consequently, w, t, and ξ are bounded. What is more, from the condition
\[
y_i\bigl(w^T x_i + b\bigr) \ge 1 - \xi_i, \quad i = 1,\dots,n,
\tag{A.8}
\]
we have
\[
w^T x_i + b \ge 1 - \xi_i \ \text{ if } y_i = 1, \qquad
w^T x_i + b \le -1 + \xi_i \ \text{ if } y_i = -1.
\tag{A.9}
\]
Thus, if the feasible region D is not empty, then
\[
\max_{y_i = 1}\bigl(1 - \xi_i - w^T x_i\bigr) \le b \le \min_{y_i = -1}\bigl(-1 + \xi_i - w^T x_i\bigr).
\tag{A.10}
\]
Hence b is also bounded. The proof is complete.

B. Proof of Lemma 3

Lemma 3 (see Section 3) shows that Algorithm 2 is well defined. We first introduce the following proposition, which shows that the matrix S defined by (21) is always positive definite; this ensures that Algorithm 2 is well defined.

Proposition B.1. Let inequality (26) hold. Then the matrix S defined by (21) is positive definite.

Proof. By an elementary deduction, we have, for any v^(1) ∈ R^m, v^(2) ∈ R^n, v^(3) ∈ R^m, and v^(4) ∈ R with (v^(1), v^(2), v^(3), v^(4)) ≠ 0,

\[
\begin{aligned}
&\bigl(v^{(1)T}, v^{(2)T}, v^{(3)T}, v^{(4)T}\bigr)\, S
\begin{pmatrix} v^{(1)} \\ v^{(2)} \\ v^{(3)} \\ v^{(4)} \end{pmatrix}
= \sum_{a=1}^{4}\sum_{b=1}^{4} v^{(a)T} S_{ab}\, v^{(b)} \\
&\quad = v^{(1)T}\bigl(U + V + X^T Y^T P^{-1} D_4 Y X\bigr) v^{(1)}
 + v^{(1)T}\bigl(X^T Y^T P^{-1} D_4\bigr) v^{(2)}
 + v^{(1)T}\bigl(-2 (U - V) T\bigr) v^{(3)}
 + v^{(1)T}\bigl(X^T Y^T P^{-1} D_4\, y\bigr) v^{(4)} \\
&\qquad + v^{(2)T} P^{-1} D_4 Y X\, v^{(1)}
 + v^{(2)T}\bigl(P^{-1} D_4 + \Xi^{-1} D_5\bigr) v^{(2)}
 + v^{(2)T} P^{-1} D_4\, y\, v^{(4)}
 - 2\, v^{(3)T} (U - V) T\, v^{(1)} \\
&\qquad + v^{(3)T}\bigl(4 T (U + V) T + T^{-1} D_3 - 2 (D_1 + D_2)\bigr) v^{(3)}
 + v^{(4)T} y^T P^{-1} D_4 Y X\, v^{(1)}
 + v^{(4)T} y^T P^{-1} D_4\, v^{(2)}
 + v^{(4)T} y^T P^{-1} D_4\, y\, v^{(4)} \\
&\quad = v^{(1)T} U v^{(1)} - 4\, v^{(3)T} U T\, v^{(1)} + 4\, v^{(3)T} T U T\, v^{(3)}
 + v^{(1)T} V v^{(1)} + 4\, v^{(3)T} V T\, v^{(1)} + 4\, v^{(3)T} T V T\, v^{(3)} \\
&\qquad + v^{(2)T} \Xi^{-1} D_5 v^{(2)}
 + v^{(3)T}\bigl(T^{-1} D_3 - 2 D_1 - 2 D_2\bigr) v^{(3)}
 + \bigl(Y X v^{(1)} + v^{(2)} + v^{(4)} y\bigr)^T P^{-1} D_4 \bigl(Y X v^{(1)} + v^{(2)} + v^{(4)} y\bigr) \\
&\quad = e_m^T\, U \bigl(D_{v^{(1)}} - 2 T D_{v^{(3)}}\bigr)^2 e_m
 + e_m^T\, V \bigl(D_{v^{(1)}} + 2 T D_{v^{(3)}}\bigr)^2 e_m
 + v^{(2)T} \Xi^{-1} D_5 v^{(2)} \\
&\qquad + v^{(3)T}\bigl(T^{-1} D_3 - 2 D_1 - 2 D_2\bigr) v^{(3)}
 + \bigl(Y X v^{(1)} + v^{(2)} + v^{(4)} y\bigr)^T P^{-1} D_4 \bigl(Y X v^{(1)} + v^{(2)} + v^{(4)} y\bigr)
 > 0,
\end{aligned}
\tag{B.1}
\]

where D_{v^(1)} = diag(v^(1)), D_{v^(2)} = diag(v^(2)), D_{v^(3)} = diag(v^(3)), and D_{v^(4)} = diag(v^(4)).


B.1. Proof of Lemma 3

Proof. It is easy to see that if Δz^k = 0, then (23) holds for all α_k ≥ 0.

Suppose Δz^k ≠ 0. We have
\[
\nabla_w \Phi_\mu(z^k)^T \Delta w^k + \nabla_t \Phi_\mu(z^k)^T \Delta t^k
+ \nabla_\xi \Phi_\mu(z^k)^T \Delta \xi^k + \nabla_b \Phi_\mu(z^k)^T \Delta b^k
= -\bigl(\Delta w^{kT}, \Delta \xi^{kT}, \Delta t^{kT}, \Delta b^k\bigr)\, S
\begin{pmatrix} \Delta w^k \\ \Delta \xi^k \\ \Delta t^k \\ \Delta b^k \end{pmatrix}.
\tag{B.2}
\]
Since the matrix S is positive definite and Δz^k ≠ 0, the last equation implies
\[
\nabla_w \Phi_\mu(z^k)^T \Delta w^k + \nabla_t \Phi_\mu(z^k)^T \Delta t^k
+ \nabla_\xi \Phi_\mu(z^k)^T \Delta \xi^k + \nabla_b \Phi_\mu(z^k)^T \Delta b^k < 0.
\tag{B.3}
\]
Consequently, there exists an ᾱ_k > 0 such that the fourth inequality in (23) is satisfied for all α_k ∈ (0, ᾱ_k]. On the other hand, since (x^k, t^k) is strictly feasible, the point (x^k, t^k) + α(Δx^k, Δt^k) is feasible for all α > 0 sufficiently small. The proof is complete.
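The acceptance rule (23) invoked in Lemma 3 and throughout Appendix C is an Armijo-type backtracking line search [34] combined with a strict-feasibility check. The Python sketch below is a generic illustration of that mechanism, not the authors' exact rule: the merit function phi, the directional derivative grad_dot_dz, the feasibility test, and the parameter values are assumptions introduced for the example.

    import math

    def backtracking_step(phi, grad_dot_dz, z, dz, is_strictly_feasible,
                          tau1=0.1, beta=0.5, max_backtracks=50):
        """Armijo-type backtracking: shrink the step by `beta` until the trial
        point is strictly feasible and a sufficient-decrease condition holds.

        grad_dot_dz : directional derivative of phi at z along dz
                      (negative for a descent direction).
        """
        alpha = 1.0
        phi_z = phi(z)
        for _ in range(max_backtracks):
            trial = [zi + alpha * di for zi, di in zip(z, dz)]
            if is_strictly_feasible(trial) and phi(trial) <= phi_z + tau1 * alpha * grad_dot_dz:
                return alpha, trial
            alpha *= beta
        return alpha, z  # fall back to the current point if no step was accepted

    # Hypothetical usage on a one-dimensional barrier-like merit function:
    phi = lambda v: v[0] ** 2 - 0.1 * math.log(v[0])   # requires v[0] > 0
    alpha, z_new = backtracking_step(phi, -1.9, [1.0], [-1.0], lambda v: v[0] > 0)
    print(alpha, z_new)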

C. Convergence Analysis

This appendix is devoted to the global convergence of the interior point method. We first show the convergence of Algorithm 2 when a fixed μ is applied to the barrier subproblem (14).

Lemma C.1. Let {(z^k, λ^k)} be generated by Algorithm 2. Then the iterates z^k are strictly feasible for problem (10), and the Lagrangian multipliers λ^k are bounded from above.

Proof. For the sake of convenience, we use g_i(z), i = 1, 2, ..., 3m + 2n, to denote the constraint functions of the constrained problem (10).

We first show that the z^k are strictly feasible. Suppose, on the contrary, that there exist an infinite index subset K and an index i ∈ {1, 2, ..., 3m + 2n} such that {g_i(z^k)}_K ↓ 0. By the definition of Φ_μ(z), and since f(z) = Σ_{j=1}^m t_j + C Σ_{i=1}^n ξ_i is bounded from below on the feasible set, it must hold that {Φ_μ(z^k)}_K → ∞. However, the line search rule implies that the sequence {Φ_μ(z^k)} is decreasing, so we get a contradiction. Consequently, for any i ∈ {1, 2, ..., 3m + 2n}, g_i(z^k) is bounded away from zero. The boundedness of {λ^k} then follows from (24)-(27).

Lemma C.2. Let {(z^k, λ^k)} and {(Δz^k, λ̂^{k+1})} be generated by Algorithm 2. If {z^k}_K is a convergent subsequence of {z^k}, then the sequence {(Δz^k, λ̂^{k+1})}_K is bounded.

Proof. Again we use g_i(z), i = 1, 2, ..., 3m + 2n, to denote the constraint functions of the constrained problem (10). We suppose, on the contrary, that there exists an infinite subset K′ ⊆ K such that the subsequence {(Δz^k, λ̂^{k+1})}_{K′} tends to infinity. Let {z^k}_{K′} → z*. It follows from Lemma C.1 that there is an infinite subset K′′ ⊆ K′ such that {λ^k}_{K′′} → λ* and g_i(z*) > 0 for all i ∈ {1, 2, ..., 3m + 2n}. Thus we have
\[
S_k \longrightarrow S_*
\tag{C.1}
\]
as k → ∞ with k ∈ K′′. By Proposition B.1, S_* is positive definite. Since the right-hand side of (18) is bounded and continuous, the unboundedness of {(Δz^k, λ̂^{k+1})}_{K′′} implies that the limit of the coefficient matrices of (18) is singular, which yields a contradiction. The proof is complete.

Lemma C.3. Let {(z^k, λ^k)} and {(Δz^k, λ̂^{k+1})} be generated by Algorithm 2. If {z^k}_K is a convergent subsequence of {z^k}, then one has {Δz^k}_K → 0.

Proof. It follows from Lemma C.2 that the sequence {Δz^k}_K is bounded. Suppose, on the contrary, that there exists an infinite subset K′ ⊆ K such that {Δz^k}_{K′} → Δz* ≠ 0. Since the subsequences {z^k}_{K′}, {λ^k}_{K′}, and {λ̂^k}_{K′} are all bounded, there are points z*, λ*, and λ̂*, as well as an infinite index set K′′ ⊆ K′, such that {z^k}_{K′′} → z*, {λ^k}_{K′′} → λ*, and {λ̂^k}_{K′′} → λ̂*. By Lemma C.1, we have g_i(z*) > 0 for all i ∈ {1, 2, ..., 3m + 2n}. Similar to the proof of Lemma 3, it is not difficult to get
\[
\nabla_w \Phi_\mu(z^*)^T \Delta w^* + \nabla_t \Phi_\mu(z^*)^T \Delta t^*
+ \nabla_\xi \Phi_\mu(z^*)^T \Delta \xi^* + \nabla_b \Phi_\mu(z^*)^T \Delta b^*
= -\bigl(\Delta w^{*T}, \Delta \xi^{*T}, \Delta t^{*T}, \Delta b^*\bigr)\, S
\begin{pmatrix} \Delta w^* \\ \Delta \xi^* \\ \Delta t^* \\ \Delta b^* \end{pmatrix} < 0.
\tag{C.2}
\]
Since g_i(z*) > 0 for all i ∈ {1, 2, ..., 3m + 2n}, there exists an ᾱ ∈ (0, 1] such that, for all α ∈ (0, ᾱ],
\[
g_i\bigl(z^* + \alpha \Delta z^*\bigr) > 0, \quad \forall i \in \{1, 2, \dots, 3m + 2n\}.
\tag{C.3}
\]
Taking into account τ_1 ∈ (0, 1/2), we claim that there exists an α̃ ∈ (0, ᾱ] such that the following inequality holds for all α ∈ (0, α̃]:
\[
\Phi_\mu\bigl(z^* + \alpha \Delta z^*\bigr) - \Phi_\mu(z^*) \le 1.1\,\tau_1 \alpha \nabla_z \Phi_\mu(z^*)^T \Delta z^*.
\tag{C.4}
\]
Let m* = min{ j | β^j ∈ (0, α̃], j = 0, 1, ... } and α* = β^{m*}. It follows from (C.3) that the following inequality is satisfied for all k ∈ K′′ sufficiently large and any i ∈ {1, 2, ..., 3m + 2n}:
\[
g_i\bigl(z^k + \alpha^* \Delta z^k\bigr) > 0.
\tag{C.5}
\]


Moreover, we have
\[
\begin{aligned}
\Phi_\mu\bigl(z^k + \alpha^* \Delta z^k\bigr) - \Phi_\mu(z^k)
&= \Phi_\mu\bigl(z^* + \alpha^* \Delta z^*\bigr) - \Phi_\mu(z^*)
 + \Phi_\mu\bigl(z^k + \alpha^* \Delta z^k\bigr) - \Phi_\mu\bigl(z^* + \alpha^* \Delta z^*\bigr)
 + \Phi_\mu(z^*) - \Phi_\mu(z^k) \\
&\le 1.05\,\tau_1 \alpha^* \nabla_z \Phi_\mu(z^*)^T \Delta z^*
 \le \tau_1 \alpha^* \nabla_z \Phi_\mu(z^k)^T \Delta z^k.
\end{aligned}
\tag{C.6}
\]
By the backtracking line search rule, the last inequality together with (C.5) yields α_k ≥ α* for all k ∈ K′′ large enough. Consequently, when k ∈ K′′ is sufficiently large, we have from (23) and (C.3) that
\[
\Phi_\mu\bigl(z^{k+1}\bigr) \le \Phi_\mu(z^k) + \tau_1 \alpha_k \nabla_z \Phi_\mu(z^k)^T \Delta z^k
\le \Phi_\mu(z^k) + \tau_1 \alpha^* \nabla_z \Phi_\mu(z^k)^T \Delta z^k
\le \Phi_\mu(z^k) + \tfrac{1}{2}\,\tau_1 \alpha^* \nabla_z \Phi_\mu(z^*)^T \Delta z^*.
\tag{C.7}
\]
This shows that {Φ_μ(z^k)}_{K′′} → −∞, contradicting the fact that {Φ_μ(z^k)}_{K′′} is bounded from below. The proof is complete.

We now establish the convergence of Algorithms 2 and 4, that is, Theorems 5 and 6 in Section 3.

C.1. Proof of Theorem 5

Proof. Let (z*, λ̂*) be a limit point of {(z^k, λ̂^k)}, and let the subsequence {(z^k, λ̂^k)}_K converge to (z*, λ̂*). Recall that (Δz^k, λ̂^{k+1}) is the solution of (18) with (z, λ) = (z^k, λ^k). Taking limits on both sides of (18) with (z, λ) = (z^k, λ^k) as k → ∞ with k ∈ K, and using Lemmas C.1 and C.3, we obtain
\[
\begin{gathered}
\hat{\lambda}^{*(1)} - \hat{\lambda}^{*(2)} - X^T Y^T \hat{\lambda}^{*(4)} = 0, \qquad
C e_n - \hat{\lambda}^{*(4)} - \hat{\lambda}^{*(5)} = 0, \\
e_m - 2 T_* \hat{\lambda}^{*(1)} - 2 T_* \hat{\lambda}^{*(2)} - \hat{\lambda}^{*(3)} = 0, \qquad
y^T \hat{\lambda}^{*(4)} = 0, \\
\bigl(T_*^2 - W_*\bigr)\hat{\lambda}^{*(1)} - \mu e_m = 0, \qquad
\bigl(T_*^2 + W_*\bigr)\hat{\lambda}^{*(2)} - \mu e_m = 0, \qquad
T_*\,\hat{\lambda}^{*(3)} - \mu e_m = 0, \\
P_*\,\hat{\lambda}^{*(4)} - \mu e_n = 0, \qquad
\Xi_*\,\hat{\lambda}^{*(5)} - \mu e_n = 0.
\end{gathered}
\tag{C.8}
\]
This shows that (z*, λ̂*) satisfies the first-order optimality conditions (15).

C.2. Proof of Theorem 6

Proof. (i) Without loss of generality, we suppose that the bounded subsequence {(z^j, λ̂^j)}_J converges to some point (z*, λ̂*). It is clear that z* is a feasible point of (10). Since μ_j → 0 and ε_{μ_j} → 0, it follows from (29) and (30) that
\[
\mathrm{Res}\bigl(z^j, \hat{\lambda}^j, \mu_j\bigr) \longrightarrow_{J} \mathrm{Res}\bigl(z^*, \hat{\lambda}^*, 0\bigr) = 0,
\qquad
\hat{\lambda}^j \longrightarrow_{J} \hat{\lambda}^* \ge 0.
\tag{C.9}
\]
Consequently, (z*, λ̂*) satisfies the KKT conditions (13).

(ii) Let ξ_j = max{‖λ̂^j‖_∞, 1} and u^j = ξ_j^{-1} λ̂^j. Obviously, {u^j} is bounded. Hence there exists an infinite subset J′ ⊆ J such that {u^j}_{J′} → u ≠ 0 and ‖u^j‖_∞ = 1 for large j ∈ J′. From (30) we know that u ≥ 0. Dividing both sides of (18), with (z, λ) = (z^j, λ^j), by ξ_j and then taking limits as j → ∞ with j ∈ J′, we get
\[
\begin{gathered}
u^{(1)} - u^{(2)} - X^T Y^T u^{(4)} = 0, \qquad
u^{(4)} + u^{(5)} = 0, \\
2 T_* u^{(1)} + 2 T_* u^{(2)} + u^{(3)} = 0, \qquad
y^T u^{(4)} = 0, \\
\bigl(T_*^2 - W_*\bigr) u^{(1)} = 0, \qquad
\bigl(T_*^2 + W_*\bigr) u^{(2)} = 0, \qquad
T_* u^{(3)} = 0, \qquad
P_* u^{(4)} = 0, \qquad
\Xi_* u^{(5)} = 0.
\end{gathered}
\tag{C.10}
\]
Since z* is feasible, the above equations show that z* is a Fritz John point of problem (10).

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

The authors would like to acknowledge support for this project from the National Science Foundation of China (NSF Grants 11371154, 11071087, 11271069, and 61103202).

References

[1] Y. M. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization," in Proceedings of the 14th International Conference on Machine Learning (ICML '97), vol. 97, pp. 412-420, 1997.
[2] G. Forman, "An extensive empirical study of feature selection metrics for text classification," The Journal of Machine Learning Research, vol. 3, pp. 1289-1305, 2003.
[3] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1-3, pp. 389-422, 2002.
[4] H. H. Zhang, J. Ahn, X. Lin, and C. Park, "Gene selection using support vector machines with non-convex penalty," Bioinformatics, vol. 22, no. 1, pp. 88-95, 2006.
[5] Y. Saeys, I. Inza, and P. Larranaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507-2517, 2007.
[6] J. Weston, F. Perez-Cruz, O. Bousquet, O. Chapelle, A. Elisseeff, and B. Scholkopf, "Feature selection and transduction for prediction of molecular bioactivity for drug design," Bioinformatics, vol. 19, no. 6, pp. 764-771, 2003.
[7] Y. Liu, "A comparative study on feature selection methods for drug discovery," Journal of Chemical Information and Computer Sciences, vol. 44, no. 5, pp. 1823-1828, 2004.
[8] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, Eds., Feature Extraction: Foundations and Applications, vol. 207, Springer, 2006.
[9] A. Rakotomamonjy, "Variable selection using SVM-based criteria," Journal of Machine Learning Research, vol. 3, no. 7-8, pp. 1357-1370, 2003.
[10] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik, "Feature selection for SVMs," in Proceedings of the Conference on Neural Information Processing Systems, vol. 13, pp. 668-674, Denver, Colo, USA, 2000.
[11] D. Peleg and R. Meir, "A feature selection algorithm based on the global minimization of a generalization error bound," in Proceedings of the Conference on Neural Information Processing Systems, Vancouver, Canada, 2004.
[12] P. S. Bradley and O. L. Mangasarian, "Feature selection via concave minimization and support vector machines," in Proceedings of the International Conference on Machine Learning, pp. 82-90, San Francisco, Calif, USA, 1998.
[13] J. Weston, A. Elisseeff, B. Scholkopf, and M. Tipping, "Use of the zero-norm with linear models and kernel methods," Journal of Machine Learning Research, vol. 3, pp. 1439-1461, 2003.
[14] A. B. Chan, N. Vasconcelos, and G. R. G. Lanckriet, "Direct convex relaxations of sparse SVM," in Proceedings of the 24th International Conference on Machine Learning (ICML '07), pp. 145-153, Corvallis, Ore, USA, June 2007.
[15] G. M. Fung and O. L. Mangasarian, "A feature selection Newton method for support vector machine classification," Computational Optimization and Applications, vol. 28, no. 2, pp. 185-202, 2004.
[16] J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani, "1-norm support vector machines," Advances in Neural Information Processing Systems, vol. 16, no. 1, pp. 49-56, 2004.
[17] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society B: Methodological, vol. 58, no. 1, pp. 267-288, 1996.
[18] T. Zhang, "Analysis of multi-stage convex relaxation for sparse regularization," Journal of Machine Learning Research, vol. 11, pp. 1081-1107, 2010.
[19] R. Chartrand, "Exact reconstruction of sparse signals via nonconvex minimization," IEEE Signal Processing Letters, vol. 14, no. 10, pp. 707-710, 2007.
[20] R. Chartrand, "Nonconvex regularization for shape preservation," in Proceedings of the 14th IEEE International Conference on Image Processing (ICIP '07), vol. 1, pp. I-293-I-296, San Antonio, Tex, USA, September 2007.
[21] Z. Xu, H. Zhang, Y. Wang, X. Chang, and Y. Liang, "L1/2 regularization," Science China Information Sciences, vol. 53, no. 6, pp. 1159-1169, 2010.
[22] W. J. Chen and Y. J. Tian, "Lp-norm proximal support vector machine and its applications," Procedia Computer Science, vol. 1, no. 1, pp. 2417-2423, 2010.
[23] J. Liu, J. Li, W. Xu, and Y. Shi, "A weighted Lq adaptive least squares support vector machine classifiers - robust and sparse approximation," Expert Systems with Applications, vol. 38, no. 3, pp. 2253-2259, 2011.
[24] Y. Liu, H. H. Zhang, C. Park, and J. Ahn, "Support vector machines with adaptive Lq penalty," Computational Statistics & Data Analysis, vol. 51, no. 12, pp. 6380-6394, 2007.
[25] Z. Liu, S. Lin, and M. Tan, "Sparse support vector machines with Lp penalty for biomarker identification," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 1, pp. 100-107, 2010.
[26] A. Rakotomamonjy, R. Flamary, G. Gasso, and S. Canu, "lp-lq penalty for sparse linear and sparse multiple kernel multitask learning," IEEE Transactions on Neural Networks, vol. 22, no. 8, pp. 1307-1320, 2011.
[27] J. Y. Tan, Z. Zhang, L. Zhen, C. H. Zhang, and N. Y. Deng, "Adaptive feature selection via a new version of support vector machine," Neural Computing and Applications, vol. 23, no. 3-4, pp. 937-945, 2013.
[28] Z. B. Xu, X. Y. Chang, F. M. Xu, and H. Zhang, "L1/2 regularization: an iterative half thresholding algorithm," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 7, pp. 1013-1027, 2012.
[29] D. Ge, X. Jiang, and Y. Ye, "A note on the complexity of Lp minimization," Mathematical Programming, vol. 129, no. 2, pp. 285-299, 2011.
[30] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 1995.
[31] J. Fan and H. Peng, "Nonconcave penalized likelihood with a diverging number of parameters," The Annals of Statistics, vol. 32, no. 3, pp. 928-961, 2004.
[32] K. Knight and W. Fu, "Asymptotics for lasso-type estimators," The Annals of Statistics, vol. 28, no. 5, pp. 1356-1378, 2000.
[33] B. S. Tian and X. Q. Yang, "An interior-point L1/2-penalty method for nonlinear programming," Technical Report, Department of Applied Mathematics, Hong Kong Polytechnic University, 2013.
[34] L. Armijo, "Minimization of functions having Lipschitz continuous first partial derivatives," Pacific Journal of Mathematics, vol. 16, pp. 1-3, 1966.
[35] F. John, "Extremum problems with inequalities as side-conditions," in Studies and Essays: Courant Anniversary Volume, pp. 187-204, 1948.
[36] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz, "UCI repository of machine learning databases," Technical Report 9702, Department of Information and Computer Science, University of California, Irvine, Calif, USA, 1998, http://archive.ics.uci.edu/ml.

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 8: Research Article An Interior Point Method for -SVM …downloads.hindawi.com/journals/jam/2014/942520.pdfResearch Article An Interior Point Method for 1/2-SVM and Application to Feature

8 Journal of Applied Mathematics

problem was solved by a Successive Linear Approximation(SLA) algorithm All the experiments were run in thepersonal computer (16 GHz of CPU 4GB of RAM) withMATLAB R2010b on 64 bit Windows 7

In the proposed interior point method (Algorithms 2 and4) we set the parameters as 120573 = 06 120591

1= 05 119897 = 000001

1205741

= 100000 and 120573 = 05 The balance parameter 119862 wasselected by 5-fold cross-validation on training set over therange 2

minus10 2minus9 210 After training the weights that didnot satisfy the criteria |119908

119895|max

119894(|119908119894|) ge 10

4 [14] were set tozeroThen the cardinality of the hyperplane was computed asthe number of the nonzero weights

41 Artificial Data First we took an artificial binary linearclassification problem as an example The problem is similarto that in [13] The probability of 119910 = 1 or minus1 is equal Thefirst 6 features are relevant but redundant In 70 samplesthe first three features 119909

1 1199092 1199093were drawn as 119909

119894= 119910119873(119894 1)

and the second three features 1199094 1199095 1199096 as 119909

119894= 119873(0 1)

Otherwise the first three were drawn as 119909119894= 119873(0 1) and the

second three as 119909119894= 119910119873(119894 minus 3 1) The rest features are noise

119909119894= 119873(0 20) 119894 = 7 119898 Here119898 is the dimension of input

features The inputs were scaled to mean zero and standarddeviation In each trial 500 points were generated for testingand the average results were estimated over 30 trials

In the first experiment we consider the cases with thefixed feature size 119898 = 30 and different training sample sizes119899 = 10 20 100 The average results over the 30 trialsare shown in Table 1 and Figure 2 Figure 2 (left) plots theaverage cardinality of each classifier Since the artificial datasets have 2 relevant and nonredundant features the idealaverage cardinality is 2 Figure 2 (left) shows that the threesparse SVMs 119871

1-SVM 119871

12-SVM and 119871

0-SVM can achieve

sparse solution while the 1198712-SVM almost uses full features

in each data set Furthermore the solutions of 11987112

-SVMand 119871

0-SVM are much sparser than 119871

1-SVM As shown in

Table 1 the 1198711-SVM selects more than 6 features in all cases

which implies that some redundant or irrelevant featuresare selected The average cardinalities of 119871

12-SVM and 119871

0-

SVM are similar and close to 2 However when 119899 = 10

and 20 the 1198710-SVM has the average cardinalities of 142

and 187 respectively It means that the 1198710-SVM sometimes

selects only one feature in low sample data set and maybeignores some really relevant feature Consequently with thecardinalities between 205 and 29 119871

12-SVM has the more

reliable solution than 1198710-SVM In short as far as the number

of selected features is concerned the 11987112

-SVM behavesbetter than the other three methods

Figure 2 (right) plots the trend of the prediction accuracyversus the size of the training sample The classificationperformance of all methods is generally improved with theincreasing of the training sample size 119899 119871

1-SVM has the best

prediction performance in all cases and a slightly better than11987112

-SVM 11987112

-SVM shows more accuracy in classificationthan 119871

2-SVM and 119871

0-SVM especially in the case of 119899 =

10 50 As shown in Table 1 when there are only 10 train-ing samples the average accuracy of 119871

12-SVM is 8805

while the results of 1198712-SVM and 119871

0-SVM are 8465 and

7765 respectively Compared with 1198712-SVM and 119871

0-SVM

11987112

-SVM has the average accuracy increased by 34 and104 respectively as can be explained in what follows Tothe 1198712-SVM all features are selected without discrimination

and the prediction would bemisled by the irrelevant featuresTo the 119871

0-SVM few features are selected and some relevant

features are not included which would put negative impacton the prediction result As the tradeoff between119871

2-SVMand

1198710-SVM 119871

12-SVM has better performance than the two

The average results over ten artificial data sets in the firstexperiment are shown in the bottom of Table 1 On averagethe accuracy of 119871

12-SVM is 087 lower than the 119871

1-SVM

while the features selected by 11987112

-SVM are 7414 less than1198711-SVM It indicates that the 119871

12-SVM can achieve much

sparser solution than 1198711-SVM with little cost of accuracy

Moreover the average accuracy of 11987112

-SVMover 10 data setsis 222 higher than 119871

0-SVM with the similar cardinality

To sum up the 11987112

-SVM provides the best balance betweenaccuracy and sparsity among the three sparse SVMs

To further evaluate the feature selection performance of11987112

-SVM we investigate whether the features are correctlyselected For the119871

2-SVM is not designed for feature selection

it is not included in this comparison Since our artificial datasets have 2 best features (119909

3 1199096) the best result should have

the two features (1199093 1199096) ranking on the top according to their

absolute values of weights |119908119895| In the experiment we select

the top 2 features with the maximal |119908119895| for each method

and calculate the frequency that the top 2 features are 1199093

and 1199096in 30 runs The results are listed in Table 2 When the

training sample size is too small it is difficult to discriminatethe two most important features for all sparse SVMs Forexample when 119899 = 10 the selected frequencies of 119871

1-SVM

1198710-SVM and 119871

12-SVM are 7 3 and 9 respectively When 119899

increases all methods tend to make more correct selectionMoreover Table 2 shows that the 119871

12-SVM outperforms the

other two methods in all cases For example when 119899 = 100the selected frequencies of 119871

1-SVM and119871

0-SVM are 22 and

25 respectively and the result of 11987112

-SVM is 27 The 1198711-

SVMselects toomany redundant or irrelevant features whichmay influence the ranking in some extent Therefore 119871

1-

SVM is not so good as 11987112

-SVM at distinguishing the criticalfeatures The 119871

0-SVM has the lower hit frequency than 119871

12-

SVM which is probably due to the excessive small featuresubset it obtained Above all Tables 1 and 2 and Figure 2clearly show that the 119871

12-SVM is a promising sparsity driven

classification methodIn the second simulation we consider the cases with

various dimensions of feature space 119898 = 20 40 180 200

and the fixed training sample size 119899 = 100 The averageresults over 30 trials are shown in Figure 3 and Table 3 Sincethere are only 6 relevant features yet the larger 119898 meansthe more noisy features Figure 3 (left) shows that as thedimension increases from 20 to 200 the number of featuresselected by 119871

1-SVM increases from 826 to 231 However the

cardinalities of 11987112

-SVM and 1198710-SVM keep stable (from 22

to 295) It indicates that the 11987112

-SVM and 1198710-SVM aremore

suitable for feature selection than 1198711-SVM

Figure 3 (right) shows thatwith the increasing of the noisefeatures the accuracy of 119871

2-SVM drops significantly (from

9868 to 8747) On the contrary to the other three sparse

Journal of Applied Mathematics 9

0 20 40 60 80 1000

5

10

15

20

25

30

Number of training samples

Card

inal

ity

0 20 40 60 80 10075

80

85

90

95

100

Number of training samples

Accu

racy

L2-SVML1-SVM

L0-SVML12-SVM

L2-SVML1-SVM

L0-SVML12-SVM

Figure 2 Results comparison on 10 artificial data sets with various training samples

Table 1 Results comparison on 10 artificial data sets with various training sample sizes

1198991198712-SVM 119871

1-SVM 119871

0-SVM 119871

12-SVM

Card Acc Card Acc Card Acc Card Acc10 2997 8465 693 8855 143 7765 225 880520 3000 9221 883 9679 187 9317 235 955030 3000 9427 847 9823 203 9467 205 974140 2997 9599 900 9875 207 9539 210 977550 3000 9661 950 9891 213 9657 245 976860 2997 9725 1000 9894 220 9761 245 979870 3000 9741 1100 9898 220 9756 240 982380 3000 9768 1003 9909 223 9761 260 985090 2997 9789 920 9921 230 9741 290 9830100 3000 9813 1027 9909 250 9793 255 9836On average 2999 9521 932 9765 210 9456 241 9678ldquo119899rdquo is the number of training samples ldquoCardrdquo represents the number of selected features and ldquoAccrdquo is the classification accuracy

Table 2 Frequency of the most important features (1199093 1199096) ranked on top 2 in 30 runs

119899 10 20 30 40 50 60 70 80 90 1001198711-SVM 7 14 16 18 22 23 20 22 21 22

1198710-SVM 3 11 12 15 19 22 22 23 22 25

11987112-SVM 9 16 19 24 24 23 27 26 26 27

Table 3 Average results over 10 artificial data sets with varies dimension

1198712-SVM 119871

1-SVM 119871

0-SVM 119871

12-SVM

Cardinality 10995 1585 238 241Accuracy 9293 9907 9776 9826

10 Journal of Applied Mathematics

Table 4 Feature selection performance comparison on UCI data sets (Cardinality)

No UCI Data Set 119899 1198981198712-SVM 119871

1-SVM 119871

0-SVM 119871

12-SVM

Mean Std Mean Std Mean Std Mean Std1 Pima diabetes 768 8 800 000 740 089 620 110 760 1522 Breast Cancer 683 9 900 000 860 122 640 207 840 1673 Wine (3) 178 13 1300 000 973 069 373 028 407 0494 Image (7) 2100 19 1900 000 820 053 366 097 511 0625 SPECT 267 22 2200 000 1740 527 980 427 9 1176 WDBC 569 30 3000 000 980 152 900 303 560 2057 Ionosphere 351 34 3320 045 2500 335 2460 1021 2180 8448 SPECTF 267 44 4400 000 3880 1098 2400 1132 2560 7759 Sonar 208 60 6000 000 3140 1305 2700 718 1360 100610 Musk1 476 166 16600 000 8580 1885 4860 428 41 1416

Average cardinality 4050 4042 2421 1630 1418

Table 5 Classification performance comparison on UCI data sets (accuaracy)

No UCI Data Set 1198712-SVM 119871

1-SVM 119871

0-SVM 119871

12-SVM

Mean Std Mean Std Mean Std Mean Std1 Pima diabetes 7699 347 7673 085 7673 294 7712 1522 Breast cancer 9618 223 9706 096 9632 138 9647 1593 Wine (3) 9771 373 9829 128 9314 239 9714 2564 Image (7) 8894 149 9127 177 8667 458 9136 1945 SPECT 8340 215 7849 231 7811 338 8234 5106 WDBC 9611 211 9646 131 9558 115 9646 1317 Ionosphere 8343 544 8800 370 8457 538 8743 5948 SPECTF 7849 391 7434 450 7387 325 7849 1899 Sonar 7366 529 7561 582 7415 798 7610 43510 Musk1 8484 173 8211 242 7600 564 8232 338

Average accuracy 8597 8584 8351 8652

SVMs there is little change in the accuracy It reveals thatSVMs can benefit from the features reduction

Table 3 shows the average results over all data sets inthe second experiment On average the solution of 119871

12-

SVM yields much sparser than 1198711-SVM and a slightly better

accuracy than 1198710-SVM

42 UCI Data Sets We further tested the reformulation andthe proposed interior point methods to 119871

12-SVM on 10 UCI

data sets [36] There are 8 binary classification problems and2multiclass problems (wine image) Each feature of the inputdata was normalized to zero mean and unit variance and theinstances with missing value were deletedThen the data wasrandomly split into training set (80) and testing set (20)For the two multiclass problems a one-against-rest methodwas applied to construct a binary classifier for each class Werepeated the training and testing procedure 10 times and theaverage results were shown in Tables 4 and 5 and Figure 4

Tables 4 and 5 summarize the feature selection andclassification performance of the numerical experiments onUCI data sets respectively Here 119899 is the numbers of samplesand 119898 is the number of the input features For the two

multiclass data sets the numbers of the classes are markedbehind their names Sparsity is defined as card119898 and thesmall value of sparsity is preferredThe data sets are arrangedin descending order according to the dimension The lowestcardinality and the best accuracy rate for each problem arebolded

As shown in Tables 4 and 5 the three sparse SVMs canencourage sparsity in all data sets while remaining roughlyidentical accuracy with 119871

2-SVM Among the three sparse

methods the 11987112

-SVM has the lowest cardinality (1418)and the highest classification accuracy (8652) on averageWhile the 119871

1-SVM has the worst feature selection perform-

ance with the highest average cardinality (2421) and the1198710-SVM has the lowest average classification accuracy

(8351)Figure 4 plots the sparsity (left) and classification accu-

racy (right) of each classifier on UCI data sets In three datasets (6 9 10) the 119871

12-SVM has the best performance both

in feature selection and classification among the three sparseSVMs Compared with 119871

1-SVM 119871

12-SVM can achieve

sparser solution in nine data sets For example in the dataset ldquo8 SPECTFrdquo the features selected by 119871

12-SVM are 34

Journal of Applied Mathematics 11

50 100 150 2000

20

40

60

80

100

120

140

160

180

200

Dimension

Card

inal

ity

50 100 150 20080

82

84

86

88

90

92

94

96

98

100

Dimension

Accu

racy

L2-SVML1-SVM

L0-SVML12-SVM

L2-SVML1-SVM

L0-SVML12-SVM

Figure 3 Results comparison on artificial data set with various dimensions

less than 1198711-SVM at the same time the accuracy of 119871

12-

SVM is 41 higher than 1198711-SVM In most data sets (4

5 6 9 10) the cardinality of 11987112

-SVM drops significantly(at least 37) with the equal or a slightly better result inaccuracy Only in the data set ldquo3 Winerdquo the accuracy of11987112

-SVM is decreased by 11 but the sparsity provided by11987112

-SVM leads to 582 improvement over 1198711-SVM In the

rest three data sets (1 2 7) the two methods have similarresults in feature selection and classification As seen abovethe 119871

12-SVM can provide lower dimension representation

than 1198711-SVM with the competitive prediction perform-

anceFigure 4 (right) shows that compared with 119871

0-SVM

11987112

-SVM has the classification accuracy improved in alldata sets For instance in four data sets (3 4 8 10) theclassification accuracy of 119871

12-SVM is at least 40 higher

than 1198710-SVM Especially in the data set ldquo10Muskrdquo 119871

12-SVM

gives a 63 rise in accuracy over 1198710-SVM Meanwhile it

can be observed from Figure 4 (left) that 11987112

-SVM selectsfewer feature than 119871

0-SVM in five data sets (5 6 7 9 10) For

example in the data sets 6 and 9 the cardinalities of 11987112

-SVM are 378 and 496 less than 119871

0-SVM respectively

In summary 11987112

-SVM presents better classification perfor-mance than 119871

0-SVM while it is effective in choosing relevant

features

5 Conclusions

In this paper we proposed a 11987112

regularization techniquefor simultaneous feature selection and classification in theSVMWe have reformulated the 119871

12-SVM into an equivalent

smooth constrained optimization problem The problem

possesses a very simple structure and is relatively easy todevelop numerical methods By the use of this interestingreformulation we proposed an interior point method andestablished its convergence Our numerical results supportedthe reformulation and the proposed method The 119871

12-

SVM can get more sparsity solution than 1198711-SVM with

the comparable classification accuracy Furthermore the11987112

-SVM can achieve more accuracy classification resultsthan 119871

0-SVM (FSV)

Inspired by the good performance of the smooth opti-mization reformulation of 119871

12-SVM there are some inter-

esting topics deserving further research For examples todevelop more efficient algorithms for solving the reformu-lation to study nonlinear 119871

12-SVM and to explore varies

applications of the 11987112

-SVM and further validate its effectiveare interesting research topics Some of them are under ourcurrent investigation

Appendix

A Properties of the Reformulation to11987112

-SVM

Let 119863 be set of all feasible points of the problem (10) Forany 119911 isin 119863 we let 119865119863(119911119863) and 119871119865119863(119911119863) be the set of allfeasible directions and linearized feasible directions of 119863 at119911 Since the constraint functions of (10) are all convex weimmediately have the following lemma

Lemma A1 For any feasible point 119911 of (10) one has119865119863 (119911119863) = 119871119865119863 (119911119863) (A1)

12 Journal of Applied Mathematics

1 2 3 4 5 6 7 8 9 1010

20

30

40

50

60

70

80

90

100

Data set number Data set number

Spar

sity

1 2 3 4 5 6 7 8 9 1070

75

80

85

90

95

100

Accu

racy

L2-SVML1-SVM

L0-SVML12-SVM

L2-SVML1-SVM

L0-SVML12-SVM

Figure 4 Results comparison on UCI data sets

Based on the above Lemma we can easily derive a firstorder necessary condition for (10) The Lagrangian functionof (10) is

119871 (119908 119905 120585 119887 120582 120583 ] 119901 119902)

= 119862

119899

sum

119894=1

120585119894+

119898

sum

119895=1

119905119895minus

119898

sum

119895=1

120582119895(1199052

119895minus 119908119895) minus

119898

sum

119895=1

120583119895(1199052

119895+ 119908119895)

minus

119898

sum

119895=1

]119895119905119895minus

119899

sum

119894=1

119901119894(119910119894119908119879119909119894+ 119910119894119887 + 120585119894minus 1) minus

119899

sum

119894=1

119902119894120585119894

(A2)

where 120582 120583 ] 119901 119902 are the Lagrangian multipliers By the useof Lemma A1 we immediately have the following theoremabout the first order necessary condition

Theorem A2 Let (119908 119905 120585 119887) be a local solution of (10) Thenthere are Lagrangian multipliers (120582 120583 ] 119901 119902) isin 119877

119898+119898+119898+119899+119899

such that the following KKT conditions hold

120597119871

120597119908= 120582119895minus 120583119895minus 119901119894

119899

sum

119894=1

119910119894119909119894= 0 119894 = 1 2 119899

120597119871

120597119905= 1 minus 2120582

119895119905119895minus 2120583119895119905119895minus ]119895= 0 119895 = 1 2 119898

120597119871

120597120585= 119862 minus 119901

119894minus 119902119894= 0 119894 = 1 2 119899

120597119871

120597119887=

119899

sum

119894=1

119901119894119910119894= 0 119894 = 1 2 119899

120582119895ge 0 119905

2

119895minus 119908119895ge 0 120582

119895(1199052

119895minus 119908119895) = 0

119895 = 1 2 119898

120583119895ge 0 119905

2

119895+ 119908119895ge 0 120583

119895(1199052

119895+ 119908119895) = 0

119895 = 1 2 119898

]119895ge 0 119905

119895ge 0 ]

119895119905119895= 0 119895 = 1 2 119898

119901119894ge 0 119910

119894(119908119879119909119894+ 119887) + 120585

119894minus 1 ge 0

119901119894(119910119894(119908119879119909119894+ 119887) + 120585

119894minus 1) = 0 119894 = 1 2 119899

119902119894ge 0 120585

119894ge 0 119902

119894120585119894= 0 119894 = 1 2 119899

(A3)

The following theorem shows that the level set will bebounded

Theorem A3 For any given constant 119888 gt 0 the level set

119882119888= 119911 = (119908 119905 120585) | 119891 (119911) le 119888 cap 119863 (A4)

is bounded

Proof For any (119908 119905 120585) isin Ω119888 we have

119898

sum

119894

119905119894+ 119862

119899

sum

119895

120585119895le 119888 (A5)

Journal of Applied Mathematics 13

Combining with 119905119894ge 0 119894 = 1 119898 and 120585

119895ge 0 119895 = 1 119899

we have

0 le 119905119894le 119888 119894 = 1 119898

0 le 120585119895le

119888

119862 119895 = 1 119899

(A6)

Moreover for any (119908 119905 120585) isin 119863

10038161003816100381610038161199081198941003816100381610038161003816 le 1199052

119894 119894 = 1 119899 (A7)

Consequently 119908 119905 120585 are bounded What is more from thecondition

119910119894(119908119879119909119894+ 119887) ge 1 minus 120585

119894 119894 = 1 119898 (A8)

we have

119908119879119909119894+ 119887 ge 1 minus 120585

119894 if 119910

119894= 1 119894 = 1 119898

119908119879119909119894+ 119887 le minus1 + 120585

119894 if 119910

119894= minus1 119894 = 1 119898

(A9)

Thus if the feasible region119863 is not empty then we have

max119910119894=1

(1 minus 120585119894minus 119908119879119909119894) le 119887 le min

119910119894=minus1

(minus1 + 120585119894minus 119908119879119909119894) (A10)

Hence 119887 is also bounded The proof is complete

B Proof of Lemma 3

Lemma 3 (see Section 3) shows that Algorithm 2 is welldefinedWe first introduce the following proposition to showthat the matrix 119878 defined by (21) is always positive definitewhich ensures Algorithm 2 to be well defined

Proposition B1 Let the inequality (26) hold then the matrix119878 defined by (21) is positive definite

Proof By an elementary deduction we have for any V(1) isin

119877119898 V(2) isin 119877

119899 V(3) isin 119877119898 V(4) isin 119877 with (V(1) V(2) V(3) V(4)) = 0

(V(1)119879 V(2)119879 V(3)119879 V(4)119879) 119878(

V(1)

V(2)

V(3)

V(4))

= V(1)11987911987811V(1) + V(1)119879119878

12V(2) + V(1)119879119878

13V(3)

+ V(1)11987911987814V(4) + V(2)119879119878

21V(1) + V(2)119879119878

22V(2)

+ V(2)11987911987823V(3) + V(2)119879119878

24V(4)

+ V(3)11987911987831V(1) + V(3)119879119878

32V(2)

+ V(3)11987911987833V(3) + V(3)119879119878

34V(4)

+ V(4)11987911987841V(1) + V(4)119879119878

42V(2)

+ V(4)11987911987843V(3) + V(4)119879119878

44V(4)

= V(1)119879 (119880 + 119881 + 119883119879119884119879119875minus11198634119884119883) V(1)

+ V(1)119879 (119883119879119884119879119875minus11198634) V(2)

+ V(1)119879 (minus2 (119880 minus 119881)119879) V(3)

+ V(1)119879 (119883119879119884119879119875minus11198634119910) V(4)

+ V(2)119879119875minus11198634119884119883V(1) + V(2)119879 (119875minus1119863

4+ Ξminus11198635) V(2)

+ V(2)119879119875minus11198634119910V(4)

minus 2V(3)119879 (119880 minus 119881)119879V(1)

+ V(3)119879 (4119879 (119880 + 119881)119879 + 119879minus11198633minus 2 (119863

1+ 1198632)) V(3)

+ V(4)119879119910119879119875minus11198634119884119883V(1)

+ V(4)119879119910119879119875minus11198634V(2) + V(4)119879119910119879119875minus1119863

4119910V(4)

= V(1)119879119880V(1) minus 4V(3)119879119880119879V(1) + 4V(3)119879119879119880119879V(3)

+ V(1)119879119881V(1) + 4V(3)119879119881119879V(1) + 4V(3)119879119879119881119879V(3)

+ V(2)119879Ξminus11198635V(2) + V(3)119879 (119879minus1119863

3minus 21198631minus 21198632) V(3)

+ (119884119883V(1) + V(2) + V(4)119910)119879

119875minus11198634

times (119884119883V(1) + V(2) + V(4)119910)

= 119890119879

119898119880(119863V

1minus 2119879119863V

3)2

119890119898+ 119890119879

119898119881(119863V

1+ 2119879119863V

3)2

119890119898

+ V(2)119879Ξminus11198635V(2)

+ V(3)119879 (119879minus11198633minus 21198631minus 21198632) V(3)

+ (119884119883V(1) + V(2) + V(4)119910)119879

119875minus11198634(119884119883V(1) + V(2) + V(4)119910)

gt 0

(B1)

where 119863V1= diag(V(1)) 119863V

2= diag(V(2)) 119863V

3= diag(V(3))

and119863V4= diag(V(4))

14 Journal of Applied Mathematics

B1 Proof of Lemma 3

Proof It is easy to see that if Δ119911119896= 0 which implies that (23)

holds for all 120572119896ge 0

Suppose Δ119911119896

= 0 We have

nabla119908Φ120583(119911119896)119879

Δ119908119896+ nabla119905Φ120583(119911119896)119879

Δ119905119896

+ nabla120585Φ120583(119911119896)119879

Δ120585119896+ nabla119887Φ120583(119911119896)119879

Δ119887119896

= minus (Δ119908119896119879

Δ120585119896119879

Δ119905119896119879

Δ119887119896) 119878(

Δ119908119896

Δ120585119896

Δ119905119896

Δ119887119896

)

(B2)

Since matrix 119878 is positive definite and Δ119911119896

= 0 the lastequation implies

nabla119908Φ120583(119911119896)119879

Δ119908119896+ nabla119905Φ120583(119911119896)119879

Δ119905119896

+ nabla120585Φ120583(119911119896)119879

Δ120585119896+ nabla119887Φ120583(119911119896)119879

Δ119887119896lt 0

(B3)

Consequently there exists a 119896isin (0

119896] such that the fourth

inequality in (23) is satisfied for all 120572119896isin (0

119896]

On the other hand since (119909119896 119905119896) is strictly feasible the

point (119909119896 119905119896) + 120572(Δ119909 119905119896 Δ119905119896) will be feasible for all 120572 gt 0

sufficiently small The proof is complete

C Convergence Analysis

This appendix is devoted to the global convergence of theinterior point method We first show the convergence ofAlgorithm 2 when a fixed 120583 is applied to the barrier subprob-lem (14)

Lemma C1 Let (119911119896 120582119896) be generated by Algorithm 2 Then119911119896 are strictly feasible for problem (10) and the Lagrangian

multipliers 120582119896 are bounded from above

Proof. For the sake of convenience, we use $g_{i}(z)$, $i=1,2,\ldots,3m+2n$, to denote the constraint functions of the constrained problem (10).

We first show that $\{z^{k}\}$ are strictly feasible. Suppose on the contrary that there exists an infinite index subset $\mathcal{K}$ and an index $i\in\{1,2,\ldots,3m+2n\}$ such that $\{g_{i}(z^{k})\}_{\mathcal{K}}\downarrow 0$. By the definition of $\Phi_{\mu}(z)$ and the fact that $f(z)=\sum_{j=1}^{m}t_{j}+C\sum_{i=1}^{n}\xi_{i}$ is bounded from below on the feasible set, it must hold that $\{\Phi_{\mu}(z^{k})\}_{\mathcal{K}}\rightarrow\infty$.

However, the line search rule implies that the sequence $\{\Phi_{\mu}(z^{k})\}$ is decreasing. So we get a contradiction. Consequently, for any $i\in\{1,2,\ldots,3m+2n\}$, $g_{i}(z^{k})$ is bounded away from zero. The boundedness of $\{\lambda^{k}\}$ then follows from (24)–(27).
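The contradiction used in this proof is easiest to see with the logarithmic barrier in mind. The short illustration below is ours and assumes the standard form $\Phi_{\mu}(z)=f(z)-\mu\sum_{i}\log g_{i}(z)$, which is consistent with how the proof invokes the definition of $\Phi_{\mu}$: if some constraint value tends to zero while $f$ stays bounded below, the barrier value blows up, so a monotonically decreasing sequence $\{\Phi_{\mu}(z^{k})\}$ keeps every $g_{i}(z^{k})$ bounded away from zero.

import numpy as np

def barrier_value(f_val, g_vals, mu):
    # Phi_mu = f - mu * sum(log g_i); defined only at strictly feasible points.
    g_vals = np.asarray(g_vals, dtype=float)
    assert np.all(g_vals > 0), "strict feasibility required"
    return f_val - mu * np.log(g_vals).sum()

mu = 0.1
for g_min in (1e-1, 1e-3, 1e-6, 1e-9):
    # f is bounded below (fixed at 1.0); one constraint value shrinks toward zero.
    print(f"g_min = {g_min:.0e}:  Phi_mu = {barrier_value(1.0, [g_min, 0.5, 2.0], mu):.3f}")
# The printed values grow without bound as g_min -> 0, which is the contradiction
# with the monotone decrease of Phi_mu(z^k) enforced by the line search.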

Lemma C.2. Let $\{(z^{k},\lambda^{k})\}$ and $\{(\Delta z^{k},\hat{\lambda}^{k+1})\}$ be generated by Algorithm 2. If $\{z^{k}\}_{\mathcal{K}}$ is a convergent subsequence of $\{z^{k}\}$, then the sequence $\{(\Delta z^{k},\hat{\lambda}^{k+1})\}_{\mathcal{K}}$ is bounded.

Proof. Again we use $g_{i}(z)$, $i=1,2,\ldots,3m+2n$, to denote the constraint functions of the constrained problem (10).

We suppose on the contrary that there exists an infinite subset $\mathcal{K}'\subseteq\mathcal{K}$ such that the subsequence $\{(\Delta z^{k},\hat{\lambda}^{k+1})\}_{\mathcal{K}'}$ tends to infinity. Let $\{z^{k}\}_{\mathcal{K}'}\rightarrow z^{*}$. It follows from Lemma C.1 that there is an infinite subset $\bar{\mathcal{K}}\subseteq\mathcal{K}'$ such that $\{\lambda^{k}\}_{\bar{\mathcal{K}}}\rightarrow\lambda^{*}$ and $g_{i}(z^{*})>0$, $\forall i\in\{1,2,\ldots,3m+2n\}$. Thus we have
\[
S_{k}\longrightarrow S_{*}
\tag{C.1}
\]
as $k\rightarrow\infty$ with $k\in\bar{\mathcal{K}}$. By Proposition B.1, $S_{*}$ is positive definite. Since the right-hand side of (18) is bounded and continuous, the unboundedness of $\{(\Delta z^{k},\hat{\lambda}^{k+1})\}_{\bar{\mathcal{K}}}$ implies that the limit of the coefficient matrices of (18) is singular, which yields a contradiction. The proof is complete.

Lemma C.3. Let $\{(z^{k},\lambda^{k})\}$ and $\{(\Delta z^{k},\hat{\lambda}^{k+1})\}$ be generated by Algorithm 2. If $\{z^{k}\}_{\mathcal{K}}$ is a convergent subsequence of $\{z^{k}\}$, then one has $\{\Delta z^{k}\}_{\mathcal{K}}\rightarrow 0$.

Proof. It follows from Lemma C.2 that the sequence $\{\Delta z^{k}\}_{\mathcal{K}}$ is bounded. Suppose on the contrary that there exists an infinite subset $\mathcal{K}'\subseteq\mathcal{K}$ such that $\{\Delta z^{k}\}_{\mathcal{K}'}\rightarrow\Delta z^{*}\neq 0$. Since the subsequences $\{z^{k}\}_{\mathcal{K}'}$, $\{\lambda^{k}\}_{\mathcal{K}'}$, and $\{\hat{\lambda}^{k}\}_{\mathcal{K}'}$ are all bounded, there are points $z^{*}$, $\lambda^{*}$, and $\hat{\lambda}^{*}$ as well as an infinite index set $\bar{\mathcal{K}}\subseteq\mathcal{K}'$ such that $\{z^{k}\}_{\bar{\mathcal{K}}}\rightarrow z^{*}$, $\{\lambda^{k}\}_{\bar{\mathcal{K}}}\rightarrow\lambda^{*}$, and $\{\hat{\lambda}^{k}\}_{\bar{\mathcal{K}}}\rightarrow\hat{\lambda}^{*}$. By Lemma C.1 we have $g_{i}(z^{*})>0$, $\forall i\in\{1,2,\ldots,3m+2n\}$. Similar to the proof of Lemma 3, it is not difficult to get

\[
\begin{aligned}
&\nabla_{w}\Phi_{\mu}(z^{*})^{T}\Delta w^{*}
 +\nabla_{t}\Phi_{\mu}(z^{*})^{T}\Delta t^{*}
 +\nabla_{\xi}\Phi_{\mu}(z^{*})^{T}\Delta\xi^{*}
 +\nabla_{b}\Phi_{\mu}(z^{*})^{T}\Delta b^{*} \\
&\qquad = -\begin{pmatrix}\Delta w^{*T} & \Delta\xi^{*T} & \Delta t^{*T} & \Delta b^{*}\end{pmatrix}
 S\begin{pmatrix}\Delta w^{*}\\ \Delta\xi^{*}\\ \Delta t^{*}\\ \Delta b^{*}\end{pmatrix}<0.
\end{aligned}
\tag{C.2}
\]

Since $g_{i}(z^{*})>0$, $\forall i\in\{1,2,\ldots,3m+2n\}$, there exists an $\bar{\alpha}\in(0,1]$ such that, for all $\alpha\in(0,\bar{\alpha}]$,
\[
g_{i}(z^{*}+\alpha\Delta z^{*})>0,\quad \forall i\in\{1,2,\ldots,3m+2n\}.
\tag{C.3}
\]
Taking into account $\tau_{1}\in(0,1/2)$, we claim that there exists an $\tilde{\alpha}\in(0,\bar{\alpha}]$ such that the following inequality holds for all $\alpha\in(0,\tilde{\alpha}]$:
\[
\Phi_{\mu}(z^{*}+\alpha\Delta z^{*})-\Phi_{\mu}(z^{*})
\le 1.1\,\tau_{1}\alpha\nabla_{z}\Phi_{\mu}(z^{*})^{T}\Delta z^{*}.
\tag{C.4}
\]
Let $m^{*}=\min\{j \mid \beta^{j}\in(0,\tilde{\alpha}],\ j=0,1,\ldots\}$ and $\alpha^{*}=\beta^{m^{*}}$. It follows from (C.3) that the following inequality is satisfied for all $k\in\bar{\mathcal{K}}$ sufficiently large and any $i\in\{1,2,\ldots,3m+2n\}$:
\[
g_{i}(z^{k}+\alpha^{*}\Delta z^{k})>0.
\tag{C.5}
\]


Moreover, we have
\[
\begin{aligned}
\Phi_{\mu}(z^{k}+\alpha^{*}\Delta z^{k})-\Phi_{\mu}(z^{k})
&= \Phi_{\mu}(z^{*}+\alpha^{*}\Delta z^{*})-\Phi_{\mu}(z^{*}) \\
&\quad + \Phi_{\mu}(z^{k}+\alpha^{*}\Delta z^{k})-\Phi_{\mu}(z^{*}+\alpha^{*}\Delta z^{*})
 + \Phi_{\mu}(z^{*})-\Phi_{\mu}(z^{k}) \\
&\le 1.05\,\tau_{1}\alpha^{*}\nabla_{z}\Phi_{\mu}(z^{*})^{T}\Delta z^{*} \\
&\le \tau_{1}\alpha^{*}\nabla_{z}\Phi_{\mu}(z^{k})^{T}\Delta z^{k}.
\end{aligned}
\tag{C.6}
\]
By the backtracking line search rule, the last inequality together with (C.5) yields $\alpha_{k}\ge\alpha^{*}$ for all $k\in\bar{\mathcal{K}}$ large enough. Consequently, when $k\in\bar{\mathcal{K}}$ is sufficiently large, we have from (23) and (C.3) that
\[
\begin{aligned}
\Phi_{\mu}(z^{k+1})
&\le \Phi_{\mu}(z^{k})+\tau_{1}\alpha_{k}\nabla_{z}\Phi_{\mu}(z^{k})^{T}\Delta z^{k} \\
&\le \Phi_{\mu}(z^{k})+\tau_{1}\alpha^{*}\nabla_{z}\Phi_{\mu}(z^{k})^{T}\Delta z^{k} \\
&\le \Phi_{\mu}(z^{k})+\tfrac{1}{2}\tau_{1}\alpha^{*}\nabla_{z}\Phi_{\mu}(z^{*})^{T}\Delta z^{*}.
\end{aligned}
\tag{C.7}
\]
This shows that $\{\Phi_{\mu}(z^{k})\}_{\bar{\mathcal{K}}}\rightarrow-\infty$, contradicting the fact that $\{\Phi_{\mu}(z^{k})\}_{\bar{\mathcal{K}}}$ is bounded from below. The proof is complete.

We now establish the convergence of Algorithms 2 and 4, that is, Theorems 5 and 6 in Section 3.

C.1. Proof of Theorem 5

Proof. We let $(z^{*},\hat{\lambda}^{*})$ be a limit point of $\{(z^{k},\hat{\lambda}^{k})\}$ and let the subsequence $\{(z^{k},\hat{\lambda}^{k})\}_{\mathcal{K}}$ converge to $(z^{*},\hat{\lambda}^{*})$.

Recall that $(\Delta z^{k},\hat{\lambda}^{k})$ is the solution of (18) with $(z,\lambda)=(z^{k},\lambda^{k})$. Taking limits on both sides of (18) with $(z,\lambda)=(z^{k},\lambda^{k})$ as $k\rightarrow\infty$ with $k\in\mathcal{K}$, by the use of Lemmas C.1 and C.3 we obtain
\[
\begin{gathered}
\hat{\lambda}^{*(1)}-\hat{\lambda}^{*(2)}-X^{T}Y^{T}\hat{\lambda}^{*(4)}=0,\qquad
Ce_{n}-\hat{\lambda}^{*(4)}-\hat{\lambda}^{*(5)}=0,\\
e_{m}-2T^{*}\hat{\lambda}^{*(1)}-2T^{*}\hat{\lambda}^{*(2)}-\hat{\lambda}^{*(3)}=0,\qquad
y^{T}\hat{\lambda}^{*(4)}=0,\\
\bigl(T^{*2}-W^{*}\bigr)\hat{\lambda}^{*(1)}-\mu e_{m}=0,\qquad
\bigl(T^{*2}+W^{*}\bigr)\hat{\lambda}^{*(2)}-\mu e_{m}=0,\\
T^{*}\hat{\lambda}^{*(3)}-\mu e_{m}=0,\qquad
P^{*}\hat{\lambda}^{*(4)}-\mu e_{n}=0,\qquad
\Xi^{*}\hat{\lambda}^{*(5)}-\mu e_{n}=0.
\end{gathered}
\tag{C.8}
\]
This shows that $(z^{*},\hat{\lambda}^{*})$ satisfies the first-order optimality conditions (15).

C.2. Proof of Theorem 6

Proof. (i) Without loss of generality, we suppose that the bounded subsequence $\{(z^{j},\hat{\lambda}^{j})\}_{\mathcal{J}}$ converges to some point $(z^{*},\hat{\lambda}^{*})$. It is clear that $z^{*}$ is a feasible point of (10). Since $\mu_{j},\epsilon_{\mu_{j}}\rightarrow 0$, it follows from (29) and (30) that
\[
\operatorname{Res}\bigl(z^{j},\hat{\lambda}^{j},\mu_{j}\bigr)\ \xrightarrow{\ \mathcal{J}\ }\ \operatorname{Res}\bigl(z^{*},\hat{\lambda}^{*},0\bigr)=0,
\qquad
\{\hat{\lambda}^{j}\}_{\mathcal{J}}\longrightarrow\hat{\lambda}^{*}\ge 0.
\tag{C.9}
\]
Consequently, $(z^{*},\hat{\lambda}^{*})$ satisfies the KKT conditions (13).

(ii) Let $\xi_{j}=\max\{\|\hat{\lambda}^{j}\|_{\infty},1\}$ and $\tilde{\lambda}^{j}=\xi_{j}^{-1}\hat{\lambda}^{j}$. Obviously, $\{\tilde{\lambda}^{j}\}$ is bounded. Hence there exists an infinite subset $\mathcal{J}'\subseteq\mathcal{J}$ such that $\{\tilde{\lambda}^{j}\}_{\mathcal{J}'}\rightarrow\tilde{\lambda}\neq 0$ and $\|\tilde{\lambda}^{j}\|_{\infty}=1$ for large $j\in\mathcal{J}'$. From (30) we know that $\tilde{\lambda}\ge 0$. Dividing both sides of the first equation of (18) with $(z,\lambda)=(z^{k},\lambda^{k})$ by $\xi_{j}$ and then taking limits as $j\rightarrow\infty$ with $j\in\mathcal{J}'$, we get
\[
\begin{gathered}
\tilde{\lambda}^{(1)}-\tilde{\lambda}^{(2)}-X^{T}Y^{T}\tilde{\lambda}^{(4)}=0,\qquad
\tilde{\lambda}^{(4)}+\tilde{\lambda}^{(5)}=0,\\
2T^{*}\tilde{\lambda}^{(1)}+2T^{*}\tilde{\lambda}^{(2)}+\tilde{\lambda}^{(3)}=0,\qquad
y^{T}\tilde{\lambda}^{(4)}=0,\\
\bigl(T^{*2}-W^{*}\bigr)\tilde{\lambda}^{(1)}=0,\qquad
\bigl(T^{*2}+W^{*}\bigr)\tilde{\lambda}^{(2)}=0,\\
T^{*}\tilde{\lambda}^{(3)}=0,\qquad
P^{*}\tilde{\lambda}^{(4)}=0,\qquad
\Xi^{*}\tilde{\lambda}^{(5)}=0.
\end{gathered}
\tag{C.10}
\]
Since $z^{*}$ is feasible, the above equations show that $z^{*}$ is a Fritz-John point of problem (10).

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

The authors would like to acknowledge support for this project from the National Science Foundation (NSF Grants 11371154, 11071087, 11271069, and 61103202) of China.

References

[1] Y. M. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization," in Proceedings of the 14th International Conference on Machine Learning (ICML '97), vol. 97, pp. 412–420, 1997.
[2] G. Forman, "An extensive empirical study of feature selection metrics for text classification," The Journal of Machine Learning Research, vol. 3, pp. 1289–1305, 2003.
[3] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1–3, pp. 389–422, 2002.
[4] H. H. Zhang, J. Ahn, X. Lin, and C. Park, "Gene selection using support vector machines with non-convex penalty," Bioinformatics, vol. 22, no. 1, pp. 88–95, 2006.
[5] Y. Saeys, I. Inza, and P. Larranaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507–2517, 2007.
[6] J. Weston, F. Perez-Cruz, O. Bousquet, O. Chapelle, A. Elisseeff, and B. Scholkopf, "Feature selection and transduction for prediction of molecular bioactivity for drug design," Bioinformatics, vol. 19, no. 6, pp. 764–771, 2003.
[7] Y. Liu, "A comparative study on feature selection methods for drug discovery," Journal of Chemical Information and Computer Sciences, vol. 44, no. 5, pp. 1823–1828, 2004.
[8] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, Eds., Feature Extraction: Foundations and Applications, vol. 207, Springer, 2006.
[9] A. Rakotomamonjy, "Variable selection using SVM-based criteria," Journal of Machine Learning Research, vol. 3, no. 7-8, pp. 1357–1370, 2003.
[10] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik, "Feature selection for SVMs," in Proceedings of the Conference on Neural Information Processing Systems, vol. 13, pp. 668–674, Denver, Colo, USA, 2000.
[11] D. Peleg and R. Meir, "A feature selection algorithm based on the global minimization of a generalization error bound," in Proceedings of the Conference on Neural Information Processing Systems, Vancouver, Canada, 2004.
[12] P. S. Bradley and O. L. Mangasarian, "Feature selection via concave minimization and support vector machines," in Proceedings of the International Conference on Machine Learning, pp. 82–90, San Francisco, Calif, USA, 1998.
[13] J. Weston, A. Elisseeff, B. Scholkopf, and M. Tipping, "Use of the zero-norm with linear models and kernel methods," Journal of Machine Learning Research, vol. 3, pp. 1439–1461, 2003.
[14] A. B. Chan, N. Vasconcelos, and G. R. G. Lanckriet, "Direct convex relaxations of sparse SVM," in Proceedings of the 24th International Conference on Machine Learning (ICML '07), pp. 145–153, Corvallis, Ore, USA, June 2007.
[15] G. M. Fung and O. L. Mangasarian, "A feature selection Newton method for support vector machine classification," Computational Optimization and Applications, vol. 28, no. 2, pp. 185–202, 2004.
[16] J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani, "1-norm support vector machines," Advances in Neural Information Processing Systems, vol. 16, no. 1, pp. 49–56, 2004.
[17] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society B: Methodological, vol. 58, no. 1, pp. 267–288, 1996.
[18] T. Zhang, "Analysis of multi-stage convex relaxation for sparse regularization," Journal of Machine Learning Research, vol. 11, pp. 1081–1107, 2010.
[19] R. Chartrand, "Exact reconstruction of sparse signals via nonconvex minimization," IEEE Signal Processing Letters, vol. 14, no. 10, pp. 707–710, 2007.
[20] R. Chartrand, "Nonconvex regularization for shape preservation," in Proceedings of the 14th IEEE International Conference on Image Processing (ICIP '07), vol. 1, pp. I293–I296, San Antonio, Tex, USA, September 2007.
[21] Z. Xu, H. Zhang, Y. Wang, X. Chang, and Y. Liang, "$L_{1/2}$ regularization," Science China Information Sciences, vol. 53, no. 6, pp. 1159–1169, 2010.
[22] W. J. Chen and Y. J. Tian, "Lp-norm proximal support vector machine and its applications," Procedia Computer Science, vol. 1, no. 1, pp. 2417–2423, 2010.
[23] J. Liu, J. Li, W. Xu, and Y. Shi, "A weighted $L_q$ adaptive least squares support vector machine classifiers: robust and sparse approximation," Expert Systems with Applications, vol. 38, no. 3, pp. 2253–2259, 2011.
[24] Y. Liu, H. H. Zhang, C. Park, and J. Ahn, "Support vector machines with adaptive $L_q$ penalty," Computational Statistics & Data Analysis, vol. 51, no. 12, pp. 6380–6394, 2007.
[25] Z. Liu, S. Lin, and M. Tan, "Sparse support vector machines with Lp penalty for biomarker identification," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 1, pp. 100–107, 2010.
[26] A. Rakotomamonjy, R. Flamary, G. Gasso, and S. Canu, "$l_p$-$l_q$ penalty for sparse linear and sparse multiple kernel multitask learning," IEEE Transactions on Neural Networks, vol. 22, no. 8, pp. 1307–1320, 2011.
[27] J. Y. Tan, Z. Zhang, L. Zhen, C. H. Zhang, and N. Y. Deng, "Adaptive feature selection via a new version of support vector machine," Neural Computing and Applications, vol. 23, no. 3-4, pp. 937–945, 2013.
[28] Z. B. Xu, X. Y. Chang, F. M. Xu, and H. Zhang, "$L_{1/2}$ regularization: an iterative half thresholding algorithm," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 7, pp. 1013–1027, 2012.
[29] D. Ge, X. Jiang, and Y. Ye, "A note on the complexity of $L_p$ minimization," Mathematical Programming, vol. 129, no. 2, pp. 285–299, 2011.
[30] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 1995.
[31] J. Fan and H. Peng, "Nonconcave penalized likelihood with a diverging number of parameters," The Annals of Statistics, vol. 32, no. 3, pp. 928–961, 2004.
[32] K. Knight and W. Fu, "Asymptotics for lasso-type estimators," The Annals of Statistics, vol. 28, no. 5, pp. 1356–1378, 2000.
[33] B. S. Tian and X. Q. Yang, "An interior-point $l_{1/2}$-penalty method for nonlinear programming," Technical Report, Department of Applied Mathematics, Hong Kong Polytechnic University, 2013.
[34] L. Armijo, "Minimization of functions having Lipschitz continuous first partial derivatives," Pacific Journal of Mathematics, vol. 16, pp. 1–3, 1966.
[35] F. John, "Extremum problems with inequalities as side-conditions," in Studies and Essays: Courant Anniversary Volume, pp. 187–204, 1948.
[36] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz, "UCI repository of machine learning databases," Technical Report 9702, Department of Information and Computer Science, University of California, Irvine, Calif, USA, 1998, http://archive.ics.uci.edu/ml.




10 Journal of Applied Mathematics

Table 4 Feature selection performance comparison on UCI data sets (Cardinality)

No UCI Data Set 119899 1198981198712-SVM 119871

1-SVM 119871

0-SVM 119871

12-SVM

Mean Std Mean Std Mean Std Mean Std1 Pima diabetes 768 8 800 000 740 089 620 110 760 1522 Breast Cancer 683 9 900 000 860 122 640 207 840 1673 Wine (3) 178 13 1300 000 973 069 373 028 407 0494 Image (7) 2100 19 1900 000 820 053 366 097 511 0625 SPECT 267 22 2200 000 1740 527 980 427 9 1176 WDBC 569 30 3000 000 980 152 900 303 560 2057 Ionosphere 351 34 3320 045 2500 335 2460 1021 2180 8448 SPECTF 267 44 4400 000 3880 1098 2400 1132 2560 7759 Sonar 208 60 6000 000 3140 1305 2700 718 1360 100610 Musk1 476 166 16600 000 8580 1885 4860 428 41 1416

Average cardinality 4050 4042 2421 1630 1418

Table 5 Classification performance comparison on UCI data sets (accuaracy)

No UCI Data Set 1198712-SVM 119871

1-SVM 119871

0-SVM 119871

12-SVM

Mean Std Mean Std Mean Std Mean Std1 Pima diabetes 7699 347 7673 085 7673 294 7712 1522 Breast cancer 9618 223 9706 096 9632 138 9647 1593 Wine (3) 9771 373 9829 128 9314 239 9714 2564 Image (7) 8894 149 9127 177 8667 458 9136 1945 SPECT 8340 215 7849 231 7811 338 8234 5106 WDBC 9611 211 9646 131 9558 115 9646 1317 Ionosphere 8343 544 8800 370 8457 538 8743 5948 SPECTF 7849 391 7434 450 7387 325 7849 1899 Sonar 7366 529 7561 582 7415 798 7610 43510 Musk1 8484 173 8211 242 7600 564 8232 338

Average accuracy 8597 8584 8351 8652

SVMs there is little change in the accuracy It reveals thatSVMs can benefit from the features reduction

Table 3 shows the average results over all data sets inthe second experiment On average the solution of 119871

12-

SVM yields much sparser than 1198711-SVM and a slightly better

accuracy than 1198710-SVM

42 UCI Data Sets We further tested the reformulation andthe proposed interior point methods to 119871

12-SVM on 10 UCI

data sets [36] There are 8 binary classification problems and2multiclass problems (wine image) Each feature of the inputdata was normalized to zero mean and unit variance and theinstances with missing value were deletedThen the data wasrandomly split into training set (80) and testing set (20)For the two multiclass problems a one-against-rest methodwas applied to construct a binary classifier for each class Werepeated the training and testing procedure 10 times and theaverage results were shown in Tables 4 and 5 and Figure 4

Tables 4 and 5 summarize the feature selection andclassification performance of the numerical experiments onUCI data sets respectively Here 119899 is the numbers of samplesand 119898 is the number of the input features For the two

multiclass data sets the numbers of the classes are markedbehind their names Sparsity is defined as card119898 and thesmall value of sparsity is preferredThe data sets are arrangedin descending order according to the dimension The lowestcardinality and the best accuracy rate for each problem arebolded

As shown in Tables 4 and 5 the three sparse SVMs canencourage sparsity in all data sets while remaining roughlyidentical accuracy with 119871

2-SVM Among the three sparse

methods the 11987112

-SVM has the lowest cardinality (1418)and the highest classification accuracy (8652) on averageWhile the 119871

1-SVM has the worst feature selection perform-

ance with the highest average cardinality (2421) and the1198710-SVM has the lowest average classification accuracy

(8351)Figure 4 plots the sparsity (left) and classification accu-

racy (right) of each classifier on UCI data sets In three datasets (6 9 10) the 119871

12-SVM has the best performance both

in feature selection and classification among the three sparseSVMs Compared with 119871

1-SVM 119871

12-SVM can achieve

sparser solution in nine data sets For example in the dataset ldquo8 SPECTFrdquo the features selected by 119871

12-SVM are 34

Journal of Applied Mathematics 11

50 100 150 2000

20

40

60

80

100

120

140

160

180

200

Dimension

Card

inal

ity

50 100 150 20080

82

84

86

88

90

92

94

96

98

100

Dimension

Accu

racy

L2-SVML1-SVM

L0-SVML12-SVM

L2-SVML1-SVM

L0-SVML12-SVM

Figure 3 Results comparison on artificial data set with various dimensions

less than 1198711-SVM at the same time the accuracy of 119871

12-

SVM is 41 higher than 1198711-SVM In most data sets (4

5 6 9 10) the cardinality of 11987112

-SVM drops significantly(at least 37) with the equal or a slightly better result inaccuracy Only in the data set ldquo3 Winerdquo the accuracy of11987112

-SVM is decreased by 11 but the sparsity provided by11987112

-SVM leads to 582 improvement over 1198711-SVM In the

rest three data sets (1 2 7) the two methods have similarresults in feature selection and classification As seen abovethe 119871

12-SVM can provide lower dimension representation

than 1198711-SVM with the competitive prediction perform-

anceFigure 4 (right) shows that compared with 119871

0-SVM

11987112

-SVM has the classification accuracy improved in alldata sets For instance in four data sets (3 4 8 10) theclassification accuracy of 119871

12-SVM is at least 40 higher

than 1198710-SVM Especially in the data set ldquo10Muskrdquo 119871

12-SVM

gives a 63 rise in accuracy over 1198710-SVM Meanwhile it

can be observed from Figure 4 (left) that 11987112

-SVM selectsfewer feature than 119871

0-SVM in five data sets (5 6 7 9 10) For

example in the data sets 6 and 9 the cardinalities of 11987112

-SVM are 378 and 496 less than 119871

0-SVM respectively

In summary 11987112

-SVM presents better classification perfor-mance than 119871

0-SVM while it is effective in choosing relevant

features

5. Conclusions

In this paper we proposed an $L_{1/2}$ regularization technique for simultaneous feature selection and classification in the SVM. We reformulated the $L_{1/2}$-SVM as an equivalent smooth constrained optimization problem. The problem possesses a very simple structure for which it is relatively easy to develop numerical methods. By the use of this reformulation, we proposed an interior point method and established its convergence. Our numerical results support the reformulation and the proposed method: the $L_{1/2}$-SVM obtains sparser solutions than the $L_1$-SVM with comparable classification accuracy, and it achieves more accurate classification results than the $L_0$-SVM (FSV).

Inspired by the good performance of the smooth optimization reformulation of the $L_{1/2}$-SVM, there are several interesting topics deserving further research, for example, developing more efficient algorithms for solving the reformulation, studying the nonlinear $L_{1/2}$-SVM, and exploring various applications of the $L_{1/2}$-SVM to further validate its effectiveness. Some of them are under our current investigation.

Appendix

A. Properties of the Reformulation of the $L_{1/2}$-SVM
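For ease of reference in the appendices, the reformulated problem (10) can be read off from the objective $f(z)=\sum_{j=1}^{m}t_j + C\sum_{i=1}^{n}\xi_i$ and the constraint functions used below (this is a restatement for the reader's convenience; the formal statement of (10) appears earlier in the paper):

\begin{aligned}
\min_{w,\,t,\,\xi,\,b}\quad & \sum_{j=1}^{m} t_j + C\sum_{i=1}^{n}\xi_i\\
\text{s.t.}\quad & t_j^{2}-w_j\ge 0,\quad t_j^{2}+w_j\ge 0,\quad t_j\ge 0,\qquad j=1,\dots,m,\\
& y_i\bigl(w^{T}x_i+b\bigr)+\xi_i-1\ge 0,\quad \xi_i\ge 0,\qquad i=1,\dots,n,
\end{aligned}

with $3m+2n$ inequality constraints, denoted $g_i(z)$, $i=1,\dots,3m+2n$, in Appendix C. At any optimal solution the first pair of constraints forces $t_j=\sqrt{|w_j|}$, so the term $\sum_j t_j$ in the objective equals $\sum_j |w_j|^{1/2}$, the $L_{1/2}$ penalty.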

Let $D$ be the set of all feasible points of problem (10). For any $z \in D$, we let $FD(z, D)$ and $LFD(z, D)$ be the sets of feasible directions and linearized feasible directions of $D$ at $z$, respectively. Since the constraint functions of (10) are all convex, we immediately have the following lemma.

Lemma A.1. For any feasible point $z$ of (10), one has
FD(z, D) = LFD(z, D).  (A.1)


[Figure 4: Results comparison on UCI data sets. Left panel: sparsity (%) versus data set number (1 to 10); right panel: accuracy (%) versus data set number; results shown for the L2-SVM, L1-SVM, L0-SVM, and L1/2-SVM.]

Based on the above lemma, we can easily derive a first-order necessary condition for (10). The Lagrangian function of (10) is

L(w, t, \xi, b, \lambda, \mu, \nu, p, q) = C\sum_{i=1}^{n}\xi_i + \sum_{j=1}^{m}t_j - \sum_{j=1}^{m}\lambda_j\,(t_j^2 - w_j) - \sum_{j=1}^{m}\mu_j\,(t_j^2 + w_j) - \sum_{j=1}^{m}\nu_j t_j - \sum_{i=1}^{n}p_i\,(y_i w^T x_i + y_i b + \xi_i - 1) - \sum_{i=1}^{n}q_i \xi_i,  (A.2)

where $\lambda$, $\mu$, $\nu$, $p$, $q$ are the Lagrangian multipliers. By the use of Lemma A.1, we immediately have the following theorem about the first-order necessary condition.

Theorem A.2. Let $(w, t, \xi, b)$ be a local solution of (10). Then there are Lagrangian multipliers $(\lambda, \mu, \nu, p, q) \in R^{m+m+m+n+n}$ such that the following KKT conditions hold:

\frac{\partial L}{\partial w_j} = \lambda_j - \mu_j - \sum_{i=1}^{n} p_i y_i x_{ij} = 0, \quad j = 1, 2, \dots, m,
\frac{\partial L}{\partial t_j} = 1 - 2\lambda_j t_j - 2\mu_j t_j - \nu_j = 0, \quad j = 1, 2, \dots, m,
\frac{\partial L}{\partial \xi_i} = C - p_i - q_i = 0, \quad i = 1, 2, \dots, n,
\frac{\partial L}{\partial b} = \sum_{i=1}^{n} p_i y_i = 0,
\lambda_j \ge 0, \quad t_j^2 - w_j \ge 0, \quad \lambda_j (t_j^2 - w_j) = 0, \quad j = 1, 2, \dots, m,
\mu_j \ge 0, \quad t_j^2 + w_j \ge 0, \quad \mu_j (t_j^2 + w_j) = 0, \quad j = 1, 2, \dots, m,
\nu_j \ge 0, \quad t_j \ge 0, \quad \nu_j t_j = 0, \quad j = 1, 2, \dots, m,
p_i \ge 0, \quad y_i (w^T x_i + b) + \xi_i - 1 \ge 0, \quad p_i (y_i (w^T x_i + b) + \xi_i - 1) = 0, \quad i = 1, 2, \dots, n,
q_i \ge 0, \quad \xi_i \ge 0, \quad q_i \xi_i = 0, \quad i = 1, 2, \dots, n.  (A.3)

The following theorem shows that the level set defined below is bounded.

Theorem A.3. For any given constant $c > 0$, the level set
W_c = \{z = (w, t, \xi) \mid f(z) \le c\} \cap D  (A.4)
is bounded.

Proof. For any $(w, t, \xi) \in W_c$, we have
\sum_{j=1}^{m} t_j + C \sum_{i=1}^{n} \xi_i \le c.  (A.5)


Combining this with $t_j \ge 0$, $j = 1, \dots, m$, and $\xi_i \ge 0$, $i = 1, \dots, n$, we have
0 \le t_j \le c, \quad j = 1, \dots, m, \qquad 0 \le \xi_i \le \frac{c}{C}, \quad i = 1, \dots, n.  (A.6)
Moreover, for any $(w, t, \xi) \in D$,
|w_j| \le t_j^2, \quad j = 1, \dots, m.  (A.7)
Consequently, $w$, $t$, and $\xi$ are bounded. What is more, from the condition
y_i (w^T x_i + b) \ge 1 - \xi_i, \quad i = 1, \dots, n,  (A.8)
we have
w^T x_i + b \ge 1 - \xi_i \quad \text{if } y_i = 1, \qquad w^T x_i + b \le -1 + \xi_i \quad \text{if } y_i = -1.  (A.9)
Thus, if the feasible region $D$ is not empty, then we have
\max_{y_i = 1} (1 - \xi_i - w^T x_i) \le b \le \min_{y_i = -1} (-1 + \xi_i - w^T x_i).  (A.10)
Hence $b$ is also bounded. The proof is complete.

B. Proof of Lemma 3

Lemma 3 (see Section 3) shows that Algorithm 2 is well defined. We first introduce the following proposition, which shows that the matrix $S$ defined by (21) is always positive definite and thereby ensures that Algorithm 2 is well defined.

Proposition B.1. Let inequality (26) hold. Then the matrix $S$ defined by (21) is positive definite.

Proof. By an elementary deduction, we have, for any $v^{(1)} \in R^m$, $v^{(2)} \in R^n$, $v^{(3)} \in R^m$, $v^{(4)} \in R$ with $(v^{(1)}, v^{(2)}, v^{(3)}, v^{(4)}) \ne 0$,

(v^{(1)T}, v^{(2)T}, v^{(3)T}, v^{(4)T})\, S\, (v^{(1)}; v^{(2)}; v^{(3)}; v^{(4)})
 = \sum_{k=1}^{4} \sum_{l=1}^{4} v^{(k)T} S_{kl} v^{(l)}
 = v^{(1)T} (U + V + X^T Y^T P^{-1} D_4 Y X) v^{(1)} + v^{(1)T} (X^T Y^T P^{-1} D_4) v^{(2)} - 2 v^{(1)T} (U - V) T v^{(3)} + v^{(1)T} (X^T Y^T P^{-1} D_4 y) v^{(4)}
 \quad + v^{(2)T} P^{-1} D_4 Y X v^{(1)} + v^{(2)T} (P^{-1} D_4 + \Xi^{-1} D_5) v^{(2)} + v^{(2)T} P^{-1} D_4 y\, v^{(4)}
 \quad - 2 v^{(3)T} T (U - V) v^{(1)} + v^{(3)T} \bigl(4 T (U + V) T + T^{-1} D_3 - 2 (D_1 + D_2)\bigr) v^{(3)}
 \quad + v^{(4)} y^T P^{-1} D_4 Y X v^{(1)} + v^{(4)} y^T P^{-1} D_4 v^{(2)} + (v^{(4)})^2 y^T P^{-1} D_4 y
 = v^{(1)T} U v^{(1)} - 4 v^{(3)T} T U v^{(1)} + 4 v^{(3)T} T U T v^{(3)} + v^{(1)T} V v^{(1)} + 4 v^{(3)T} T V v^{(1)} + 4 v^{(3)T} T V T v^{(3)}
 \quad + v^{(2)T} \Xi^{-1} D_5 v^{(2)} + v^{(3)T} (T^{-1} D_3 - 2 D_1 - 2 D_2) v^{(3)}
 \quad + (Y X v^{(1)} + v^{(2)} + v^{(4)} y)^T P^{-1} D_4 (Y X v^{(1)} + v^{(2)} + v^{(4)} y)
 = e_m^T U (D_{v^{(1)}} - 2 T D_{v^{(3)}})^2 e_m + e_m^T V (D_{v^{(1)}} + 2 T D_{v^{(3)}})^2 e_m
 \quad + v^{(2)T} \Xi^{-1} D_5 v^{(2)} + v^{(3)T} (T^{-1} D_3 - 2 D_1 - 2 D_2) v^{(3)}
 \quad + (Y X v^{(1)} + v^{(2)} + v^{(4)} y)^T P^{-1} D_4 (Y X v^{(1)} + v^{(2)} + v^{(4)} y)
 > 0,  (B.1)

where $D_{v^{(1)}} = \mathrm{diag}(v^{(1)})$, $D_{v^{(2)}} = \mathrm{diag}(v^{(2)})$, $D_{v^{(3)}} = \mathrm{diag}(v^{(3)})$, and $D_{v^{(4)}} = \mathrm{diag}(v^{(4)})$.
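As a purely numerical sanity check of the block structure appearing in (B.1) (this is our own illustration, not part of the paper: the diagonal matrices below are random stand-ins for the quantities defined in (21), and condition (26) is emulated by choosing $D_3$ so that $T^{-1}D_3 - 2D_1 - 2D_2$ is positive definite), the following numpy sketch assembles $S$ from the blocks visible in the expansion and verifies both the chain of equalities and the positive definiteness:

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 4, 6

    def pdiag(k):
        # random positive diagonal matrix (stand-in for U, V, D_1, ..., D_5, P, Xi, T)
        return np.diag(rng.uniform(0.5, 2.0, size=k))

    X = rng.normal(size=(n, m))                 # rows are the samples x_i^T
    y = rng.choice([-1.0, 1.0], size=n)
    Y = np.diag(y)
    U, V, D1, D2, T = pdiag(m), pdiag(m), pdiag(m), pdiag(m), pdiag(m)
    D3 = 2.0 * T @ (D1 + D2) + pdiag(m)         # emulates (26): T^{-1} D3 - 2 D1 - 2 D2 > 0
    D4, D5, P, Xi = pdiag(n), pdiag(n), pdiag(n), pdiag(n)
    Pinv, Tinv, Xiinv = np.linalg.inv(P), np.linalg.inv(T), np.linalg.inv(Xi)

    A = Pinv @ D4                               # shorthand appearing in several blocks
    S11 = U + V + X.T @ Y @ A @ Y @ X
    S12 = X.T @ Y @ A
    S13 = -2.0 * (U - V) @ T
    S14 = (X.T @ Y @ A @ y).reshape(m, 1)
    S22 = A + Xiinv @ D5
    S24 = (A @ y).reshape(n, 1)
    S33 = 4.0 * T @ (U + V) @ T + Tinv @ D3 - 2.0 * (D1 + D2)
    S44 = np.array([[y @ A @ y]])
    S = np.block([
        [S11,   S12,              S13,              S14],
        [S12.T, S22,              np.zeros((n, m)), S24],
        [S13.T, np.zeros((m, n)), S33,              np.zeros((m, 1))],
        [S14.T, S24.T,            np.zeros((1, m)), S44],
    ])

    v1, v2, v3 = rng.normal(size=m), rng.normal(size=n), rng.normal(size=m)
    v4 = rng.normal()
    v = np.concatenate([v1, v2, v3, [v4]])
    t = np.diag(T)
    r = Y @ X @ v1 + v2 + v4 * y
    grouped = ((np.diag(U) * (v1 - 2 * t * v3) ** 2).sum()
               + (np.diag(V) * (v1 + 2 * t * v3) ** 2).sum()
               + v2 @ Xiinv @ D5 @ v2
               + v3 @ (Tinv @ D3 - 2 * D1 - 2 * D2) @ v3
               + r @ A @ r)
    assert np.isclose(v @ S @ v, grouped)       # the chain of equalities in (B.1)
    np.linalg.cholesky(0.5 * (S + S.T))         # raises LinAlgError unless S is positive definite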


B.1. Proof of Lemma 3

Proof. It is easy to see that if $\Delta z^k = 0$, then (23) holds for all $\alpha_k \ge 0$.

Suppose $\Delta z^k \ne 0$. We have

\nabla_w \Phi_\mu(z^k)^T \Delta w^k + \nabla_t \Phi_\mu(z^k)^T \Delta t^k + \nabla_\xi \Phi_\mu(z^k)^T \Delta \xi^k + \nabla_b \Phi_\mu(z^k)^T \Delta b^k = -(\Delta w^{kT}, \Delta \xi^{kT}, \Delta t^{kT}, \Delta b^k)\, S\, (\Delta w^k; \Delta \xi^k; \Delta t^k; \Delta b^k).  (B.2)

Since the matrix $S$ is positive definite and $\Delta z^k \ne 0$, the last equation implies

\nabla_w \Phi_\mu(z^k)^T \Delta w^k + \nabla_t \Phi_\mu(z^k)^T \Delta t^k + \nabla_\xi \Phi_\mu(z^k)^T \Delta \xi^k + \nabla_b \Phi_\mu(z^k)^T \Delta b^k < 0.  (B.3)

Consequently, there exists an $\bar{\alpha}_k \in (0, \hat{\alpha}_k]$ such that the fourth inequality in (23) is satisfied for all $\alpha_k \in (0, \bar{\alpha}_k]$.

On the other hand, since $(x^k, t^k)$ is strictly feasible, the point $(x^k, t^k) + \alpha(\Delta x^k, \Delta t^k)$ will be feasible for all $\alpha > 0$ sufficiently small. The proof is complete.
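The proof only uses two qualitative properties of the line search: strict feasibility is preserved and the merit function $\Phi_\mu$ decreases sufficiently. A minimal Python sketch of such a backtracking rule is given below; it is our illustration of the generic Armijo-with-feasibility pattern, not the paper's exact rule (23), and all names (phi, constraints, tau1, beta) are ours:

    def backtracking_step(z, dz, phi, grad_phi_dot_dz, constraints,
                          tau1=0.25, beta=0.5, max_iter=50):
        """Return a step size alpha with g_i(z + alpha*dz) > 0 for all constraints g_i and
        phi(z + alpha*dz) <= phi(z) + tau1 * alpha * grad_phi_dot_dz (Armijo decrease)."""
        assert grad_phi_dot_dz < 0, "dz must be a descent direction for phi"
        alpha, phi_z = 1.0, phi(z)
        for _ in range(max_iter):
            trial = z + alpha * dz
            strictly_feasible = all(g(trial) > 0 for g in constraints)
            if strictly_feasible and phi(trial) <= phi_z + tau1 * alpha * grad_phi_dot_dz:
                return alpha
            alpha *= beta          # shrink the step and try again
        raise RuntimeError("line search failed")

Here tau1 plays the role of $\tau_1 \in (0, 1/2)$ and beta the role of the backtracking factor $\beta$ used in the proofs of Appendix C.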

C. Convergence Analysis

This appendix is devoted to the global convergence of the interior point method. We first show the convergence of Algorithm 2 when a fixed $\mu$ is applied to the barrier subproblem (14).
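The exact form of (14) and of the merit function $\Phi_\mu$ is given in Section 3; for orientation only, it is of logarithmic-barrier type (this restatement is our reading, consistent with the behaviour used in Lemma C.1 and with the perturbed complementarity equations in (C.8), and may differ from (14) in indexing or constants):

\Phi_\mu(z) = f(z) - \mu \sum_{i=1}^{3m+2n} \ln g_i(z), \qquad f(z) = \sum_{j=1}^{m} t_j + C \sum_{i=1}^{n} \xi_i,

minimized over points $z$ that are strictly feasible for (10), so that $\Phi_\mu(z^k) \to \infty$ whenever some $g_i(z^k) \downarrow 0$.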

Lemma C.1. Let $\{(z^k, \lambda^k)\}$ be generated by Algorithm 2. Then $\{z^k\}$ is strictly feasible for problem (10), and the Lagrangian multipliers $\{\lambda^k\}$ are bounded from above.

Proof. For the sake of convenience, we use $g_i(z)$, $i = 1, 2, \dots, 3m+2n$, to denote the constraint functions of the constrained problem (10).

We first show that $\{z^k\}$ is strictly feasible. Suppose, on the contrary, that there exist an infinite index subset $\mathcal{K}$ and an index $i \in \{1, 2, \dots, 3m+2n\}$ such that $\{g_i(z^k)\}_{\mathcal{K}} \downarrow 0$. By the definition of $\Phi_\mu(z)$ and the fact that $f(z) = \sum_{j=1}^{m} t_j + C \sum_{i=1}^{n} \xi_i$ is bounded from below on the feasible set, it must hold that $\{\Phi_\mu(z^k)\}_{\mathcal{K}} \to \infty$. However, the line search rule implies that the sequence $\{\Phi_\mu(z^k)\}$ is decreasing, so we get a contradiction. Consequently, for any $i \in \{1, 2, \dots, 3m+2n\}$, $\{g_i(z^k)\}$ is bounded away from zero. The boundedness of $\{\lambda^k\}$ then follows from (24)-(27).

Lemma C.2. Let $\{(z^k, \lambda^k)\}$ and $\{(\Delta z^k, \hat{\lambda}^{k+1})\}$ be generated by Algorithm 2. If $\{z^k\}_{\mathcal{K}}$ is a convergent subsequence of $\{z^k\}$, then the sequence $\{(\Delta z^k, \hat{\lambda}^{k+1})\}_{\mathcal{K}}$ is bounded.

Proof. Again, we use $g_i(z)$, $i = 1, 2, \dots, 3m+2n$, to denote the constraint functions of the constrained problem (10). We suppose, on the contrary, that there exists an infinite subset $\mathcal{K}' \subseteq \mathcal{K}$ such that the subsequence $\{(\Delta z^k, \hat{\lambda}^{k+1})\}_{\mathcal{K}'}$ tends to infinity. Let $\{z^k\}_{\mathcal{K}'} \to z^*$. It follows from Lemma C.1 that there is an infinite subset $\bar{\mathcal{K}} \subseteq \mathcal{K}'$ such that $\{\lambda^k\}_{\bar{\mathcal{K}}} \to \lambda^*$ and $g_i(z^*) > 0$ for all $i \in \{1, 2, \dots, 3m+2n\}$. Thus we have

S_k \to S^*  (C.1)

as $k \to \infty$ with $k \in \bar{\mathcal{K}}$. By Proposition B.1, $S^*$ is positive definite. Since the right-hand side of (18) is bounded and continuous, the unboundedness of $\{(\Delta z^k, \hat{\lambda}^{k+1})\}_{\bar{\mathcal{K}}}$ implies that the limit of the coefficient matrices of (18) is singular, which yields a contradiction. The proof is complete.

Lemma C.3. Let $\{(z^k, \lambda^k)\}$ and $\{(\Delta z^k, \hat{\lambda}^{k+1})\}$ be generated by Algorithm 2. If $\{z^k\}_{\mathcal{K}}$ is a convergent subsequence of $\{z^k\}$, then one has $\{\Delta z^k\}_{\mathcal{K}} \to 0$.

Proof. It follows from Lemma C.2 that the sequence $\{\Delta z^k\}_{\mathcal{K}}$ is bounded. Suppose, on the contrary, that there exists an infinite subset $\mathcal{K}' \subseteq \mathcal{K}$ such that $\{\Delta z^k\}_{\mathcal{K}'} \to \Delta z^* \ne 0$. Since the subsequences $\{z^k\}_{\mathcal{K}'}$, $\{\lambda^k\}_{\mathcal{K}'}$, and $\{\hat{\lambda}^{k+1}\}_{\mathcal{K}'}$ are all bounded, there are points $z^*$, $\lambda^*$, and $\hat{\lambda}^*$ as well as an infinite index set $\bar{\mathcal{K}} \subseteq \mathcal{K}'$ such that $\{z^k\}_{\bar{\mathcal{K}}} \to z^*$, $\{\lambda^k\}_{\bar{\mathcal{K}}} \to \lambda^*$, and $\{\hat{\lambda}^{k+1}\}_{\bar{\mathcal{K}}} \to \hat{\lambda}^*$. By Lemma C.1 we have $g_i(z^*) > 0$ for all $i \in \{1, 2, \dots, 3m+2n\}$. Similar to the proof of Lemma 3, it is not difficult to get

\nabla_w \Phi_\mu(z^*)^T \Delta w^* + \nabla_t \Phi_\mu(z^*)^T \Delta t^* + \nabla_\xi \Phi_\mu(z^*)^T \Delta \xi^* + \nabla_b \Phi_\mu(z^*)^T \Delta b^* = -(\Delta w^{*T}, \Delta \xi^{*T}, \Delta t^{*T}, \Delta b^*)\, S\, (\Delta w^*; \Delta \xi^*; \Delta t^*; \Delta b^*) < 0.  (C.2)

Since $g_i(z^*) > 0$ for all $i \in \{1, 2, \dots, 3m+2n\}$, there exists an $\bar{\alpha} \in (0, 1]$ such that, for all $\alpha \in (0, \bar{\alpha}]$,

g_i(z^* + \alpha \Delta z^*) > 0, \quad \forall i \in \{1, 2, \dots, 3m+2n\}.  (C.3)

Taking into account $\tau_1 \in (0, 1/2)$, we claim that there exists an $\tilde{\alpha} \in (0, \bar{\alpha}]$ such that the following inequality holds for all $\alpha \in (0, \tilde{\alpha}]$:

\Phi_\mu(z^* + \alpha \Delta z^*) - \Phi_\mu(z^*) \le 1.1\, \tau_1 \alpha \nabla_z \Phi_\mu(z^*)^T \Delta z^*.  (C.4)

Let $m^* = \min\{j \mid \beta^j \in (0, \tilde{\alpha}],\ j = 0, 1, \dots\}$ and $\alpha^* = \beta^{m^*}$. It follows from (C.3) that the following inequality is satisfied for all $k \in \bar{\mathcal{K}}$ sufficiently large and any $i \in \{1, 2, \dots, 3m+2n\}$:

g_i(z^k + \alpha^* \Delta z^k) > 0.  (C.5)


Moreover, for all $k \in \bar{\mathcal{K}}$ sufficiently large we have

\Phi_\mu(z^k + \alpha^* \Delta z^k) - \Phi_\mu(z^k) = [\Phi_\mu(z^* + \alpha^* \Delta z^*) - \Phi_\mu(z^*)] + [\Phi_\mu(z^k + \alpha^* \Delta z^k) - \Phi_\mu(z^* + \alpha^* \Delta z^*)] + [\Phi_\mu(z^*) - \Phi_\mu(z^k)] \le 1.05\, \tau_1 \alpha^* \nabla_z \Phi_\mu(z^*)^T \Delta z^* \le \tau_1 \alpha^* \nabla_z \Phi_\mu(z^k)^T \Delta z^k.  (C.6)

By the backtracking line search rule, the last inequality together with (C.5) yields the inequality $\alpha_k \ge \alpha^*$ for all $k \in \bar{\mathcal{K}}$ large enough. Consequently, when $k \in \bar{\mathcal{K}}$ is sufficiently large, we have from (23) and (C.3) that

\Phi_\mu(z^{k+1}) \le \Phi_\mu(z^k) + \tau_1 \alpha_k \nabla_z \Phi_\mu(z^k)^T \Delta z^k \le \Phi_\mu(z^k) + \tau_1 \alpha^* \nabla_z \Phi_\mu(z^k)^T \Delta z^k \le \Phi_\mu(z^k) + \frac{1}{2} \tau_1 \alpha^* \nabla_z \Phi_\mu(z^*)^T \Delta z^*.  (C.7)

This shows that $\{\Phi_\mu(z^k)\}_{\mathcal{K}} \to -\infty$, contradicting the fact that $\{\Phi_\mu(z^k)\}_{\mathcal{K}}$ is bounded from below. The proof is complete.

We now establish the convergence of Algorithms 2 and 4, that is, Theorems 5 and 6 in Section 3.
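For orientation only (the precise statements of Algorithms 2 and 4 are given in Section 3), the two-level structure that the following proofs refer to can be sketched as below; the names solve_barrier_subproblem and residual, and the update factors, are ours and are merely placeholders for the inner iteration (Algorithm 2), the residual Res of (29)-(30), and the parameter updates of Algorithm 4:

    def interior_point_outer_loop(z0, lam0, solve_barrier_subproblem, residual,
                                  mu0=1.0, eps0=1.0, sigma=0.1, tol=1e-8, max_outer=50):
        """Two-level scheme: for a decreasing sequence mu_j -> 0, solve the barrier
        subproblem (14) approximately (inner iteration), then tighten mu and the tolerance."""
        z, lam, mu, eps = z0, lam0, mu0, eps0
        for _ in range(max_outer):
            # inner iteration: damped Newton-type steps with mu fixed,
            # stopped once the subproblem residual is below the current tolerance eps
            z, lam = solve_barrier_subproblem(z, lam, mu, eps)
            if residual(z, lam, 0.0) <= tol:   # Res(z, lam, 0) = 0 characterizes a KKT point (13)
                break
            mu, eps = sigma * mu, sigma * eps  # drive mu_j and eps_j to zero, as in Theorem 6
        return z, lam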

C.1. Proof of Theorem 5

Proof. We let $(z^*, \hat{\lambda}^*)$ be a limit point of $\{(z^k, \hat{\lambda}^k)\}$ and let the subsequence $\{(z^k, \hat{\lambda}^k)\}_{\mathcal{K}}$ converge to $(z^*, \hat{\lambda}^*)$.

Recall that $(\Delta z^k, \hat{\lambda}^k)$ is the solution of (18) with $(z, \lambda) = (z^k, \lambda^k)$. Taking limits on both sides of (18) with $(z, \lambda) = (z^k, \lambda^k)$ as $k \to \infty$ with $k \in \mathcal{K}$, by the use of Lemmas C.1 and C.3 we obtain

\hat{\lambda}^{*(1)} - \hat{\lambda}^{*(2)} - X^T Y^T \hat{\lambda}^{*(4)} = 0,
C e_n - \hat{\lambda}^{*(4)} - \hat{\lambda}^{*(5)} = 0,
e_m - 2 T^* \hat{\lambda}^{*(1)} - 2 T^* \hat{\lambda}^{*(2)} - \hat{\lambda}^{*(3)} = 0,
y^T \hat{\lambda}^{*(4)} = 0,
(T^{*2} - W^*) \hat{\lambda}^{*(1)} - \mu e_m = 0,
(T^{*2} + W^*) \hat{\lambda}^{*(2)} - \mu e_m = 0,
T^* \hat{\lambda}^{*(3)} - \mu e_m = 0,
P^* \hat{\lambda}^{*(4)} - \mu e_n = 0,
\Xi^* \hat{\lambda}^{*(5)} - \mu e_n = 0.  (C.8)

This shows that $(z^*, \hat{\lambda}^*)$ satisfies the first-order optimality conditions (15).

C.2. Proof of Theorem 6

Proof. (i) Without loss of generality, we suppose that the bounded subsequence $\{(z^j, \hat{\lambda}^j)\}_{\mathcal{J}}$ converges to some point $(z^*, \hat{\lambda}^*)$. It is clear that $z^*$ is a feasible point of (10). Since $\mu_j, \epsilon_{\mu_j} \to 0$, it follows from (29) and (30) that

\mathrm{Res}(z^j, \hat{\lambda}^j, \mu_j) \xrightarrow{\mathcal{J}} \mathrm{Res}(z^*, \hat{\lambda}^*, 0) = 0, \qquad \{\hat{\lambda}^j\}_{\mathcal{J}} \to \hat{\lambda}^* \ge 0.  (C.9)

Consequently, $(z^*, \hat{\lambda}^*)$ satisfies the KKT conditions (13).

(ii) Let $\xi_j = \max\{\|\hat{\lambda}^j\|_\infty, 1\}$ and $\tilde{\lambda}^j = \xi_j^{-1} \hat{\lambda}^j$. Obviously, $\{\tilde{\lambda}^j\}$ is bounded. Hence there exists an infinite subset $\mathcal{J}' \subseteq \mathcal{J}$ such that $\{\tilde{\lambda}^j\}_{\mathcal{J}'} \to \tilde{\lambda} \ne 0$ and $\|\tilde{\lambda}^j\|_\infty = 1$ for large $j \in \mathcal{J}$. From (30) we know that $\tilde{\lambda} \ge 0$. Dividing both sides of (18) with $(z, \lambda) = (z^j, \lambda^j)$ by $\xi_j$ and then taking limits as $j \to \infty$ with $j \in \mathcal{J}'$, we get

\tilde{\lambda}^{(1)} - \tilde{\lambda}^{(2)} - X^T Y^T \tilde{\lambda}^{(4)} = 0,
\tilde{\lambda}^{(4)} + \tilde{\lambda}^{(5)} = 0,
2 T^* \tilde{\lambda}^{(1)} + 2 T^* \tilde{\lambda}^{(2)} + \tilde{\lambda}^{(3)} = 0,
y^T \tilde{\lambda}^{(4)} = 0,
(T^{*2} - W^*) \tilde{\lambda}^{(1)} = 0,
(T^{*2} + W^*) \tilde{\lambda}^{(2)} = 0,
T^* \tilde{\lambda}^{(3)} = 0,
P^* \tilde{\lambda}^{(4)} = 0,
\Xi^* \tilde{\lambda}^{(5)} = 0.  (C.10)

Since $z^*$ is feasible, the above equations show that $z^*$ is a Fritz John point of problem (10).

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

The authors would like to acknowledge support for this project from the National Science Foundation of China (Grants 11371154, 11071087, 11271069, and 61103202).

References

[1] Y. M. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization," in Proceedings of the 14th International Conference on Machine Learning (ICML '97), vol. 97, pp. 412–420, 1997.
[2] G. Forman, "An extensive empirical study of feature selection metrics for text classification," The Journal of Machine Learning Research, vol. 3, pp. 1289–1305, 2003.
[3] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1–3, pp. 389–422, 2002.
[4] H. H. Zhang, J. Ahn, X. Lin, and C. Park, "Gene selection using support vector machines with non-convex penalty," Bioinformatics, vol. 22, no. 1, pp. 88–95, 2006.
[5] Y. Saeys, I. Inza, and P. Larranaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507–2517, 2007.
[6] J. Weston, F. Perez-Cruz, O. Bousquet, O. Chapelle, A. Elisseeff, and B. Scholkopf, "Feature selection and transduction for prediction of molecular bioactivity for drug design," Bioinformatics, vol. 19, no. 6, pp. 764–771, 2003.
[7] Y. Liu, "A comparative study on feature selection methods for drug discovery," Journal of Chemical Information and Computer Sciences, vol. 44, no. 5, pp. 1823–1828, 2004.
[8] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, Eds., Feature Extraction: Foundations and Applications, vol. 207, Springer, 2006.
[9] A. Rakotomamonjy, "Variable selection using SVM-based criteria," Journal of Machine Learning Research, vol. 3, no. 7-8, pp. 1357–1370, 2003.
[10] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik, "Feature selection for SVMs," in Proceedings of the Conference on Neural Information Processing Systems, vol. 13, pp. 668–674, Denver, Colo, USA, 2000.
[11] D. Peleg and R. Meir, "A feature selection algorithm based on the global minimization of a generalization error bound," in Proceedings of the Conference on Neural Information Processing Systems, Vancouver, Canada, 2004.
[12] P. S. Bradley and O. L. Mangasarian, "Feature selection via concave minimization and support vector machines," in Proceedings of the International Conference on Machine Learning, pp. 82–90, San Francisco, Calif, USA, 1998.
[13] J. Weston, A. Elisseeff, B. Scholkopf, and M. Tipping, "Use of the zero-norm with linear models and kernel methods," Journal of Machine Learning Research, vol. 3, pp. 1439–1461, 2003.
[14] A. B. Chan, N. Vasconcelos, and G. R. G. Lanckriet, "Direct convex relaxations of sparse SVM," in Proceedings of the 24th International Conference on Machine Learning (ICML '07), pp. 145–153, Corvallis, Ore, USA, June 2007.
[15] G. M. Fung and O. L. Mangasarian, "A feature selection Newton method for support vector machine classification," Computational Optimization and Applications, vol. 28, no. 2, pp. 185–202, 2004.
[16] J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani, "1-norm support vector machines," Advances in Neural Information Processing Systems, vol. 16, no. 1, pp. 49–56, 2004.
[17] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society B: Methodological, vol. 58, no. 1, pp. 267–288, 1996.
[18] T. Zhang, "Analysis of multi-stage convex relaxation for sparse regularization," Journal of Machine Learning Research, vol. 11, pp. 1081–1107, 2010.
[19] R. Chartrand, "Exact reconstruction of sparse signals via nonconvex minimization," IEEE Signal Processing Letters, vol. 14, no. 10, pp. 707–710, 2007.
[20] R. Chartrand, "Nonconvex regularization for shape preservation," in Proceedings of the 14th IEEE International Conference on Image Processing (ICIP '07), vol. 1, pp. I293–I296, San Antonio, Tex, USA, September 2007.
[21] Z. Xu, H. Zhang, Y. Wang, X. Chang, and Y. Liang, "$L_{1/2}$ regularization," Science China Information Sciences, vol. 53, no. 6, pp. 1159–1169, 2010.
[22] W. J. Chen and Y. J. Tian, "Lp-norm proximal support vector machine and its applications," Procedia Computer Science, vol. 1, no. 1, pp. 2417–2423, 2010.
[23] J. Liu, J. Li, W. Xu, and Y. Shi, "A weighted $L_q$ adaptive least squares support vector machine classifiers: robust and sparse approximation," Expert Systems with Applications, vol. 38, no. 3, pp. 2253–2259, 2011.
[24] Y. Liu, H. H. Zhang, C. Park, and J. Ahn, "Support vector machines with adaptive $L_q$ penalty," Computational Statistics & Data Analysis, vol. 51, no. 12, pp. 6380–6394, 2007.
[25] Z. Liu, S. Lin, and M. Tan, "Sparse support vector machines with Lp penalty for biomarker identification," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 1, pp. 100–107, 2010.
[26] A. Rakotomamonjy, R. Flamary, G. Gasso, and S. Canu, "$l_p$-$l_q$ penalty for sparse linear and sparse multiple kernel multitask learning," IEEE Transactions on Neural Networks, vol. 22, no. 8, pp. 1307–1320, 2011.
[27] J. Y. Tan, Z. Zhang, L. Zhen, C. H. Zhang, and N. Y. Deng, "Adaptive feature selection via a new version of support vector machine," Neural Computing and Applications, vol. 23, no. 3-4, pp. 937–945, 2013.
[28] Z. B. Xu, X. Y. Chang, F. M. Xu, and H. Zhang, "$L_{1/2}$ regularization: an iterative half thresholding algorithm," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 7, pp. 1013–1027, 2012.
[29] D. Ge, X. Jiang, and Y. Ye, "A note on the complexity of $L_p$ minimization," Mathematical Programming, vol. 129, no. 2, pp. 285–299, 2011.
[30] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 1995.
[31] J. Fan and H. Peng, "Nonconcave penalized likelihood with a diverging number of parameters," The Annals of Statistics, vol. 32, no. 3, pp. 928–961, 2004.
[32] K. Knight and W. Fu, "Asymptotics for lasso-type estimators," The Annals of Statistics, vol. 28, no. 5, pp. 1356–1378, 2000.
[33] B. S. Tian and X. Q. Yang, "An interior-point $l_{1/2}$-penalty method for nonlinear programming," Technical Report, Department of Applied Mathematics, Hong Kong Polytechnic University, 2013.
[34] L. Armijo, "Minimization of functions having Lipschitz continuous first partial derivatives," Pacific Journal of Mathematics, vol. 16, pp. 1–3, 1966.
[35] F. John, "Extremum problems with inequalities as side-conditions," in Studies and Essays: Courant Anniversary Volume, pp. 187–204, 1948.
[36] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz, "UCI repository of machine learning databases," Technical Report 9702, Department of Information and Computer Science, University of California, Irvine, Calif, USA, 1998, http://archive.ics.uci.edu/ml.


Page 11: Research Article An Interior Point Method for -SVM …downloads.hindawi.com/journals/jam/2014/942520.pdfResearch Article An Interior Point Method for 1/2-SVM and Application to Feature

Journal of Applied Mathematics 11

50 100 150 2000

20

40

60

80

100

120

140

160

180

200

Dimension

Card

inal

ity

50 100 150 20080

82

84

86

88

90

92

94

96

98

100

Dimension

Accu

racy

L2-SVML1-SVM

L0-SVML12-SVM

L2-SVML1-SVM

L0-SVML12-SVM

Figure 3 Results comparison on artificial data set with various dimensions

less than 1198711-SVM at the same time the accuracy of 119871

12-

SVM is 41 higher than 1198711-SVM In most data sets (4

5 6 9 10) the cardinality of 11987112

-SVM drops significantly(at least 37) with the equal or a slightly better result inaccuracy Only in the data set ldquo3 Winerdquo the accuracy of11987112

-SVM is decreased by 11 but the sparsity provided by11987112

-SVM leads to 582 improvement over 1198711-SVM In the

rest three data sets (1 2 7) the two methods have similarresults in feature selection and classification As seen abovethe 119871

12-SVM can provide lower dimension representation

than 1198711-SVM with the competitive prediction perform-

anceFigure 4 (right) shows that compared with 119871

0-SVM

11987112

-SVM has the classification accuracy improved in alldata sets For instance in four data sets (3 4 8 10) theclassification accuracy of 119871

12-SVM is at least 40 higher

than 1198710-SVM Especially in the data set ldquo10Muskrdquo 119871

12-SVM

gives a 63 rise in accuracy over 1198710-SVM Meanwhile it

can be observed from Figure 4 (left) that 11987112

-SVM selectsfewer feature than 119871

0-SVM in five data sets (5 6 7 9 10) For

example in the data sets 6 and 9 the cardinalities of 11987112

-SVM are 378 and 496 less than 119871

0-SVM respectively

In summary 11987112

-SVM presents better classification perfor-mance than 119871

0-SVM while it is effective in choosing relevant

features

5 Conclusions

In this paper we proposed a 11987112

regularization techniquefor simultaneous feature selection and classification in theSVMWe have reformulated the 119871

12-SVM into an equivalent

smooth constrained optimization problem The problem

possesses a very simple structure and is relatively easy todevelop numerical methods By the use of this interestingreformulation we proposed an interior point method andestablished its convergence Our numerical results supportedthe reformulation and the proposed method The 119871

12-

SVM can get more sparsity solution than 1198711-SVM with

the comparable classification accuracy Furthermore the11987112

-SVM can achieve more accuracy classification resultsthan 119871

0-SVM (FSV)

Inspired by the good performance of the smooth opti-mization reformulation of 119871

12-SVM there are some inter-

esting topics deserving further research For examples todevelop more efficient algorithms for solving the reformu-lation to study nonlinear 119871

12-SVM and to explore varies

applications of the 11987112

-SVM and further validate its effectiveare interesting research topics Some of them are under ourcurrent investigation

Appendix

A Properties of the Reformulation to11987112

-SVM

Let 119863 be set of all feasible points of the problem (10) Forany 119911 isin 119863 we let 119865119863(119911119863) and 119871119865119863(119911119863) be the set of allfeasible directions and linearized feasible directions of 119863 at119911 Since the constraint functions of (10) are all convex weimmediately have the following lemma

Lemma A1 For any feasible point 119911 of (10) one has119865119863 (119911119863) = 119871119865119863 (119911119863) (A1)

12 Journal of Applied Mathematics

1 2 3 4 5 6 7 8 9 1010

20

30

40

50

60

70

80

90

100

Data set number Data set number

Spar

sity

1 2 3 4 5 6 7 8 9 1070

75

80

85

90

95

100

Accu

racy

L2-SVML1-SVM

L0-SVML12-SVM

L2-SVML1-SVM

L0-SVML12-SVM

Figure 4 Results comparison on UCI data sets

Based on the above Lemma we can easily derive a firstorder necessary condition for (10) The Lagrangian functionof (10) is

119871 (119908 119905 120585 119887 120582 120583 ] 119901 119902)

= 119862

119899

sum

119894=1

120585119894+

119898

sum

119895=1

119905119895minus

119898

sum

119895=1

120582119895(1199052

119895minus 119908119895) minus

119898

sum

119895=1

120583119895(1199052

119895+ 119908119895)

minus

119898

sum

119895=1

]119895119905119895minus

119899

sum

119894=1

119901119894(119910119894119908119879119909119894+ 119910119894119887 + 120585119894minus 1) minus

119899

sum

119894=1

119902119894120585119894

(A2)

where 120582 120583 ] 119901 119902 are the Lagrangian multipliers By the useof Lemma A1 we immediately have the following theoremabout the first order necessary condition

Theorem A2 Let (119908 119905 120585 119887) be a local solution of (10) Thenthere are Lagrangian multipliers (120582 120583 ] 119901 119902) isin 119877

119898+119898+119898+119899+119899

such that the following KKT conditions hold

120597119871

120597119908= 120582119895minus 120583119895minus 119901119894

119899

sum

119894=1

119910119894119909119894= 0 119894 = 1 2 119899

120597119871

120597119905= 1 minus 2120582

119895119905119895minus 2120583119895119905119895minus ]119895= 0 119895 = 1 2 119898

120597119871

120597120585= 119862 minus 119901

119894minus 119902119894= 0 119894 = 1 2 119899

120597119871

120597119887=

119899

sum

119894=1

119901119894119910119894= 0 119894 = 1 2 119899

120582119895ge 0 119905

2

119895minus 119908119895ge 0 120582

119895(1199052

119895minus 119908119895) = 0

119895 = 1 2 119898

120583119895ge 0 119905

2

119895+ 119908119895ge 0 120583

119895(1199052

119895+ 119908119895) = 0

119895 = 1 2 119898

]119895ge 0 119905

119895ge 0 ]

119895119905119895= 0 119895 = 1 2 119898

119901119894ge 0 119910

119894(119908119879119909119894+ 119887) + 120585

119894minus 1 ge 0

119901119894(119910119894(119908119879119909119894+ 119887) + 120585

119894minus 1) = 0 119894 = 1 2 119899

119902119894ge 0 120585

119894ge 0 119902

119894120585119894= 0 119894 = 1 2 119899

(A3)

The following theorem shows that the level set will bebounded

Theorem A3 For any given constant 119888 gt 0 the level set

119882119888= 119911 = (119908 119905 120585) | 119891 (119911) le 119888 cap 119863 (A4)

is bounded

Proof For any (119908 119905 120585) isin Ω119888 we have

119898

sum

119894

119905119894+ 119862

119899

sum

119895

120585119895le 119888 (A5)

Journal of Applied Mathematics 13

Combining with 119905119894ge 0 119894 = 1 119898 and 120585

119895ge 0 119895 = 1 119899

we have

0 le 119905119894le 119888 119894 = 1 119898

0 le 120585119895le

119888

119862 119895 = 1 119899

(A6)

Moreover for any (119908 119905 120585) isin 119863

10038161003816100381610038161199081198941003816100381610038161003816 le 1199052

119894 119894 = 1 119899 (A7)

Consequently 119908 119905 120585 are bounded What is more from thecondition

119910119894(119908119879119909119894+ 119887) ge 1 minus 120585

119894 119894 = 1 119898 (A8)

we have

119908119879119909119894+ 119887 ge 1 minus 120585

119894 if 119910

119894= 1 119894 = 1 119898

119908119879119909119894+ 119887 le minus1 + 120585

119894 if 119910

119894= minus1 119894 = 1 119898

(A9)

Thus if the feasible region119863 is not empty then we have

max119910119894=1

(1 minus 120585119894minus 119908119879119909119894) le 119887 le min

119910119894=minus1

(minus1 + 120585119894minus 119908119879119909119894) (A10)

Hence 119887 is also bounded The proof is complete

B Proof of Lemma 3

Lemma 3 (see Section 3) shows that Algorithm 2 is welldefinedWe first introduce the following proposition to showthat the matrix 119878 defined by (21) is always positive definitewhich ensures Algorithm 2 to be well defined

Proposition B1 Let the inequality (26) hold then the matrix119878 defined by (21) is positive definite

Proof By an elementary deduction we have for any V(1) isin

119877119898 V(2) isin 119877

119899 V(3) isin 119877119898 V(4) isin 119877 with (V(1) V(2) V(3) V(4)) = 0

(V(1)119879 V(2)119879 V(3)119879 V(4)119879) 119878(

V(1)

V(2)

V(3)

V(4))

= V(1)11987911987811V(1) + V(1)119879119878

12V(2) + V(1)119879119878

13V(3)

+ V(1)11987911987814V(4) + V(2)119879119878

21V(1) + V(2)119879119878

22V(2)

+ V(2)11987911987823V(3) + V(2)119879119878

24V(4)

+ V(3)11987911987831V(1) + V(3)119879119878

32V(2)

+ V(3)11987911987833V(3) + V(3)119879119878

34V(4)

+ V(4)11987911987841V(1) + V(4)119879119878

42V(2)

+ V(4)11987911987843V(3) + V(4)119879119878

44V(4)

= V(1)119879 (119880 + 119881 + 119883119879119884119879119875minus11198634119884119883) V(1)

+ V(1)119879 (119883119879119884119879119875minus11198634) V(2)

+ V(1)119879 (minus2 (119880 minus 119881)119879) V(3)

+ V(1)119879 (119883119879119884119879119875minus11198634119910) V(4)

+ V(2)119879119875minus11198634119884119883V(1) + V(2)119879 (119875minus1119863

4+ Ξminus11198635) V(2)

+ V(2)119879119875minus11198634119910V(4)

minus 2V(3)119879 (119880 minus 119881)119879V(1)

+ V(3)119879 (4119879 (119880 + 119881)119879 + 119879minus11198633minus 2 (119863

1+ 1198632)) V(3)

+ V(4)119879119910119879119875minus11198634119884119883V(1)

+ V(4)119879119910119879119875minus11198634V(2) + V(4)119879119910119879119875minus1119863

4119910V(4)

= V(1)119879119880V(1) minus 4V(3)119879119880119879V(1) + 4V(3)119879119879119880119879V(3)

+ V(1)119879119881V(1) + 4V(3)119879119881119879V(1) + 4V(3)119879119879119881119879V(3)

+ V(2)119879Ξminus11198635V(2) + V(3)119879 (119879minus1119863

3minus 21198631minus 21198632) V(3)

+ (119884119883V(1) + V(2) + V(4)119910)119879

119875minus11198634

times (119884119883V(1) + V(2) + V(4)119910)

= 119890119879

119898119880(119863V

1minus 2119879119863V

3)2

119890119898+ 119890119879

119898119881(119863V

1+ 2119879119863V

3)2

119890119898

+ V(2)119879Ξminus11198635V(2)

+ V(3)119879 (119879minus11198633minus 21198631minus 21198632) V(3)

+ (119884119883V(1) + V(2) + V(4)119910)119879

119875minus11198634(119884119883V(1) + V(2) + V(4)119910)

gt 0

(B1)

where 119863V1= diag(V(1)) 119863V

2= diag(V(2)) 119863V

3= diag(V(3))

and119863V4= diag(V(4))

14 Journal of Applied Mathematics

B1 Proof of Lemma 3

Proof It is easy to see that if Δ119911119896= 0 which implies that (23)

holds for all 120572119896ge 0

Suppose Δ119911119896

= 0 We have

nabla119908Φ120583(119911119896)119879

Δ119908119896+ nabla119905Φ120583(119911119896)119879

Δ119905119896

+ nabla120585Φ120583(119911119896)119879

Δ120585119896+ nabla119887Φ120583(119911119896)119879

Δ119887119896

= minus (Δ119908119896119879

Δ120585119896119879

Δ119905119896119879

Δ119887119896) 119878(

Δ119908119896

Δ120585119896

Δ119905119896

Δ119887119896

)

(B2)

Since matrix 119878 is positive definite and Δ119911119896

= 0 the lastequation implies

nabla119908Φ120583(119911119896)119879

Δ119908119896+ nabla119905Φ120583(119911119896)119879

Δ119905119896

+ nabla120585Φ120583(119911119896)119879

Δ120585119896+ nabla119887Φ120583(119911119896)119879

Δ119887119896lt 0

(B3)

Consequently there exists a 119896isin (0

119896] such that the fourth

inequality in (23) is satisfied for all 120572119896isin (0

119896]

On the other hand since (119909119896 119905119896) is strictly feasible the

point (119909119896 119905119896) + 120572(Δ119909 119905119896 Δ119905119896) will be feasible for all 120572 gt 0

sufficiently small The proof is complete

C Convergence Analysis

This appendix is devoted to the global convergence of theinterior point method We first show the convergence ofAlgorithm 2 when a fixed 120583 is applied to the barrier subprob-lem (14)

Lemma C1 Let (119911119896 120582119896) be generated by Algorithm 2 Then119911119896 are strictly feasible for problem (10) and the Lagrangian

multipliers 120582119896 are bounded from above

Proof For the sake of convenience we use 119892119894(119911) 119894 =

1 2 3119898 + 2119899 to denote the constraint functions of theconstrained problem (10)

We first show that 119911119896 are strictly feasible Suppose onthe contrary that there exists an infinite index subset K andan index 119894 isin 1 2 3119898 + 2119899 such that 119892

119894(119911119896)K darr 0

By the definition of Φ120583(119911) and 119891(119911) = sum

119898

119895=1119905119894+ 119862sum

119899

119894=1120585119894

being bounded from below in the feasible set it must holdthat Φ

120583(119911119896)K rarr infin

However the line search rule implies that the sequenceΦ120583(119911119896) is decreasing So we get a contradiction Conse-

quently for any 119894 isin 1 2 3119898+2119899 119892119894(119909119896 119905119896) is bounded

away from zero The boundedness of 120582119896 then follows from(24)ndash(27)

Lemma C2 Let (119911119896 120582119896) and (Δ119911119896 119896+1

) be generated byAlgorithm 2 If 119911119896K is a convergent subsequence of 119911119896 thenthe sequence (Δ119911

119896 119896+1

)K is bounded

Proof Again we use 119892119894(119911) 119894 = 1 2 3119898+2119899 to denote the

constraint functions of the constrained problem (10)We suppose on the contrary that there exists an infinite

subsetK1015840 sube K such that the subsequence (Δ119911119896 119896+1

)K1015840

tends to infinity Let 119911119896K1015840 rarr 119911

lowast It follows fromLemma C1 that there is an infinite subset K sube K1015840 such that120582119896K rarr 120582

lowast and 119892119894(119911lowast) gt 0 forall119894 isin 1 2 3119898 + 2119899 Thus

we have

119878119896997888rarr 119878lowast (C1)

as 119896 rarr infin with 119896 isin K By Proposition B1 119878lowast is positivedefinite Since the right hand size of (18) is bounded andcontinuous the unboundedness of (Δ119911

119896 119896+1

)K impliesthat the limit of the coefficient matrices of (18) is singularwhich yields a contradiction The proof is complete

Lemma C3 Let (119911119896 120582119896) and (Δ119911119896 119896+1

) be generated byAlgorithm 2 If 119911119896K is a convergent subsequence of 119911119896 thenone has Δ119911

119896K rarr 0

Proof It follows from Lemma C1 that the sequence Δ119911119896K

is bounded Suppose on the contrary that there exists aninfinite subsetK1015840 sube K such that Δ119911

119896K1015840 rarr Δ119911

lowast= 0 Since

subsequences 119911119896K1015840 120582

119896K1015840 and

119896K1015840 are all bounded

there are points 119911lowast 120582lowast and

lowast as well as an infinite indexset K sube K1015840 such that 119911119896K rarr 119911

lowast 120582119896K rarr 120582lowast and

119896K rarr

lowast By Lemma C1 we have 119892119894(119911lowast) gt 0 forall119894 isin

1 2 3119898 + 2119899 Similar to the proof of Lemma 3 it is notdifficult to get

nabla119908Φ120583(119911lowast)119879

Δ119908lowast+ nabla119905Φ120583(119911lowast)119879

Δ119905lowast

+ nabla120585Φ120583(119911lowast)119879

Δ120585lowast+ nabla119887Φ120583(119911lowast)119879

Δ119887lowast

= minus (Δ119908lowast119879

Δ120585lowast119879

Δ119905lowast119879

Δ119887lowast) 119878(

Δ119908lowast

Δ120585lowast

Δ119905lowast

Δ119887lowast

) lt 0

(C2)

Since 119892119894(119911lowast) gt 0 forall119894 isin 1 2 3119898 + 2119899 there exists a

isin (0 1] such that for all 120572 isin (0 ]

119892119894(119911lowast+ 120572Δ119911

lowast) gt 0 forall119894 isin 1 2 3119899 (C3)

Taking into account 1205911isin (0 12) we claim that there exists

a isin (0 ] such that the following inequality holds for all120572 isin (0 ]

Φ120583(119911lowast+ 120572Δ119911

lowast) minus Φ120583(119911lowast) le 11120591

1120572nabla119911Φ120583(119911lowast)119879

Δ119911lowast (C4)

Let 119898lowast

= min119895 | 120573119895isin (0 ] 119895 = 0 1 and 120572

lowast= 120573119898lowast

It follows from (C3) that the following inequality is satisfiedfor all 119896 isin K sufficient large and any 119894 isin 1 2 3119898 + 2119899

119892119894(119911119896+ 120572lowastΔ119911119896) gt 0 (C5)

Journal of Applied Mathematics 15

Moreover we have

Φ120583(119911119896+ 120572lowastΔ119911119896) minus Φ120583(119911119896)

= Φ120583(119911lowast+ 120572lowastΔ119911lowast) minus Φ120583(119911lowast)

+ Φ120583(119911119896+ 120572lowastΔ119911119896) minus Φ120583(119911lowast+ 120572lowastΔ119911lowast)

+ Φ120583(119911lowast) minus Φ120583(119911119896)

le 1051205911120572lowastnabla119911Φ120583(119911lowast)119879

Δ119911lowast

le 1205911120572lowastnabla119911Φ120583(119911119896)119879

Δ119911119896

(C6)

By the backtracking line search rule the last inequalitytogether with (C5) yields the inequality120572

119896ge 120572lowastfor all 119896 isin K

large enough Consequently when 119896 isin K is sufficiently largewe have from (23) and (C3) that

Φ120583(119911119896+1

) le Φ120583(119911119896) + 1205911120572119896nabla119911Φ120583(119911119896)119879

Δ119911119896

le Φ120583(119911119896) + 1205911120572lowastnabla119911Φ120583(119911119896)119879

Δ119911119896

le Φ120583(119911119896) +

1

21205911120572lowastnabla119911Φ120583(119911lowast)119879

Δ119911lowast

(C7)

This shows that Φ120583(119911119896)K rarr minusinfin contradicting with the

fact that Φ120583(119911119896)K is bounded from below The proof is

complete

Then we will establish the convergence of Algorithms 2and 4 that is Theorems 5 and 6 in Section 3

C1 Proof of Theorem 5

Proof We let (119911lowast lowast) be a limit point of (119911119896 119896) and thesubsequence (119911119896 119896)K converges to (119911

lowast lowast)

Recall that (Δ119911119896 119896) are the solution of (18) with (119911 120582) =

(119911119896 120582119896) Taking limits in both sides of (18) with (119911 120582) =

(119911119896 120582119896) as 119896 rarr infinwith 119896 isin K by the use of Lemma C1 we

obtain

lowast(1)

minus lowast(2)

minus 119883119879119884119879lowast(4)

= 0

119862 lowast 119890119899minus lowast(4)

minus lowast(5)

= 0

119890119898minus 2119879lowastlowast(1)

minus 2119879lowastlowast(2)

minus lowast(3)

= 0

119910119879lowast(4)

= 0

(119879lowast2

minus 119882lowast) lowast(1)

minus 120583119890119898

= 0

(119879lowast2

+ 119882lowast) lowast(2)

minus 120583119890119898

= 0

119879lowastlowast(3)

minus 120583119890119898

= 0

119875lowastlowast(4)

minus 120583119890119899= 0

Ξlowastlowast(5)

minus 120583119890119899= 0

(C8)

This shows that (119911lowast lowast) satisfies the first-order optimalityconditions (15)

C2 Proof of Theorem 6

Proof (i) Without loss of generality we suppose that thebounded subsequence (119911

119895 119895)J converges to some point

(119911lowast lowast) It is clear that 119911lowast is a feasible point of (10) Since

120583119895 120598120583119895 rarr 0 it follows from (29) and (30) that

Res (119911119895 119895 120583119895)

J997888rarr Res (119911lowast lowast 0) = 0

119895J

997888rarr lowastge 0

(C9)

Consequently the (119911lowast lowast) satisfies the KKT conditions (13)(ii) Let 120585

119895= max119895

infin 1 and

119895= 120585minus1

119895119895 Obviously

119895 is bounded Hence there exists an infinite subsetJ1015840 sube J

such that 1198951015840J rarr = 0 and 119895infin

= 1 for large 119895 isin JFrom (30) we know that ge 0 Dividing both sides of thefirst inequality of (18) with (119911 120582) = (119911

119896 120582119896) by 120585119895 then taking

limits as 119895 rarr infin with 119895 isin J we get

(1)

minus (2)

minus 119883119879119884119879(4)

= 0

(4)

+ (5)

= 0

2119879lowast(1)

+ 2119879lowast(2)

+ (3)

= 0

119910119879(4)

= 0

(119879lowast2

minus 119882lowast) (1)

= 0

(119879lowast2

+ 119882lowast) (2)

= 0

119879lowast(3)

= 0

119875lowast(4)

= 0

Ξlowast(5)

= 0

(C10)

Since 119911lowast is feasible the above equations has shown that 119911lowast is

a Fritz-John point of problem (10)

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

The authors would like to acknowledge support for thisproject from the National Science Foundation (NSF Grants11371154 11071087 11271069 and 61103202) of China

References

[1] Y M Yang and J O Pedersen ldquoA comparative study on featureselection in text categorizationrdquo in Proceedings of the 14th

16 Journal of Applied Mathematics

International Conference on Machine Learning (ICML rsquo97) vol97 pp 412ndash420 1997

[2] G Forman ldquoAn extensive empirical study of feature selectionmetrics for text classificationrdquoThe Journal of Machine LearningResearch vol 3 pp 1289ndash1305 2003

[3] I Guyon J Weston S Barnhill and V Vapnik ldquoGene selec-tion for cancer classification using support vector machinesrdquoMachine Learning vol 46 no 1ndash3 pp 389ndash422 2002

[4] H H Zhang J Ahn X Lin and C Park ldquoGene selection usingsupport vector machines with non-convex penaltyrdquo Bioinfor-matics vol 22 no 1 pp 88ndash95 2006

[5] Y Saeys I Inza and P Larranaga ldquoA review of feature selectiontechniques in bioinformaticsrdquo Bioinformatics vol 23 no 19 pp2507ndash2517 2007

[6] J Weston F Perez-Cruz O Bousquet O Chapelle A Elisseeffand B Scholkopf ldquoFeature selection and transduction forprediction of molecular bioactivity for drug designrdquo Bioinfor-matics vol 19 no 6 pp 764ndash771 2003

[7] Y Liu ldquoA comparative study on feature selection methods fordrug discoveryrdquo Journal of Chemical Information and ComputerSciences vol 44 no 5 pp 1823ndash1828 2004

[8] I Guyon J Weston S Barnhill and V Vapnik Eds FeatureExtraction Foundations and Applications vol 207 Springer2006

[9] A Rakotomamonjy ldquoVariable selection using SVM-based cri-teriardquo Journal of Machine Learning Research vol 3 no 7-8 pp1357ndash1370 2003

[10] J Weston S Mukherjee O Chapelle M Pontil T Poggio andV Vapnik ldquoFeature selection for SVMsrdquo in Proceedings of theConference onNeural Information Processing Systems vol 13 pp668ndash674 Denver Colo USA 2000

[11] D Peleg and R Meir ldquoA feature selection algorithm based onthe global minimization of a generalization error boundrdquo inProceedings of the Conference on Neural Information ProcessingSystems Vancouver Canada 2004

[12] P S Bradley and O L Mangasarian ldquoFeature selection via con-cave minimization and support vector machinesrdquo in Proceed-ings of the International Conference on Machine Learning pp82ndash90 San Francisco Calif USA 1998

[13] JWeston A Elisseeff B Scholkopf andM Tipping ldquoUse of thezero-norm with linear models and kernel methodsrdquo Journal ofMachine Learning Research vol 3 pp 1439ndash1461 2003

[14] A B Chan N Vasconcelos and G R G Lanckriet ldquoDirectconvex relaxations of sparse SVMrdquo in Proceedings of the 24thInternational Conference on Machine Learning (ICML rsquo07) pp145ndash153 Corvallis Ore USA June 2007

[15] GM Fung andO L Mangasarian ldquoA feature selection Newtonmethod for support vector machine classificationrdquo Computa-tional Optimization and Applications vol 28 no 2 pp 185ndash2022004

[16] J Zhu S Rosset T Hastie and R Tibshirani ldquo1-norm supportvector machinesrdquo Advances in Neural Information ProcessingSystems vol 16 no 1 pp 49ndash56 2004

[17] R Tibshirani ldquoRegression shrinkage and selection via the lassordquoJournal of the Royal Statistical Society B Methodological vol 58no 1 pp 267ndash288 1996

[18] T Zhang ldquoAnalysis of multi-stage convex relaxation for sparseregularizationrdquo Journal of Machine Learning Research vol 11pp 1081ndash1107 2010

[19] R Chartrand ldquoExact reconstruction of sparse signals via non-convexminimizationrdquo IEEE Signal Processing Letters vol 14 no10 pp 707ndash710 2007

[20] R Chartrand ldquoNonconvex regularizaron for shape preserva-tionrdquo inProceedings of the 14th IEEE International Conference onImage Processing (ICIP rsquo07) vol 1 pp I293ndashI296 San AntonioTex USA September 2007

[21] Z Xu H Zhang Y Wang X Chang and Y Liang ldquo11987112

regularizationrdquo Science China Information Sciences vol 53 no6 pp 1159ndash1169 2010

[22] W J Chen and Y J Tian ldquoLp-norm proximal support vectormachine and its applicationsrdquo Procedia Computer Science vol1 no 1 pp 2417ndash2423 2010

[23] J Liu J Li W Xu and Y Shi ldquoA weighted L119902adaptive least

squares support vector machine classifiers-Robust and sparseapproximationrdquo Expert Systems with Applications vol 38 no 3pp 2253ndash2259 2011

[24] Y Liu H H Zhang C Park and J Ahn ldquoSupport vector mach-ines with adaptive 119871

119902penaltyrdquo Computational Statistics amp Data

Analysis vol 51 no 12 pp 6380ndash6394 2007[25] Z Liu S Lin andM Tan ldquoSparse support vectormachines with

Lp penalty for biomarker identificationrdquo IEEEACM Transac-tions on Computational Biology and Bioinformatics vol 7 no1 pp 100ndash107 2010

[26] A Rakotomamonjy R Flamary G Gasso and S Canu ldquol119901-l119902

penalty for sparse linear and sparse multiple kernel multitasklearningrdquo IEEE Transactions on Neural Networks vol 22 no 8pp 1307ndash1320 2011

[27] J Y Tan Z Zhang L Zhen C H Zhang and N Y DengldquoAdaptive feature selection via a new version of support vectormachinerdquo Neural Computing and Applications vol 23 no 3-4pp 937ndash945 2013

[28] Z B Xu X Y Chang F M Xu and H Zhang ldquo11987112

regu-larization an iterative half thresholding algorithmrdquo IEEETrans-actions on Neural Networks and Learning Systems vol 23 no 7pp 1013ndash1027 2012

[29] D Ge X Jiang and Y Ye ldquoA note on the complexity of 119871119901

minimizationrdquo Mathematical Programming vol 129 no 2 pp285ndash299 2011

[30] VN VapnikTheNature of Statistical LearningTheory SpringerNew York NY USA 1995

[31] J Fan and H Peng ldquoNonconcave penalized likelihood with adiverging number of parametersrdquo The Annals of Statistics vol32 no 3 pp 928ndash961 2004

[32] K Knight and W Fu ldquoAsymptotics for lasso-type estimatorsrdquoThe Annals of Statistics vol 28 no 5 pp 1356ndash1378 2000

[33] B S Tian and X Q Yang ldquoAn interior-point l 11987112-penalty-

method for nonlinear programmingrdquo Technical ReportDepartment of Applied Mathematics Hong Kong PolytechnicUniversity 2013

[34] L Armijo ldquoMinimization of functions having Lipschitz con-tinuous first partial derivativesrdquo Pacific Journal of Mathematicsvol 16 pp 1ndash3 1966

[35] F John ldquoExtremum problems with inequalities as side-con-ditionsrdquo in Studies and Essays Courant Anniversary Volume pp187ndash1204 1948

[36] D J Newman S Hettich C L Blake and C J Merz ldquoUCIrepository of machine learning databasesrdquo Technical Report9702 Department of Information and Computer Science Uni-verisity of California Irvine Calif USA 1998 httparchiveicsucieduml

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 12: Research Article An Interior Point Method for -SVM …downloads.hindawi.com/journals/jam/2014/942520.pdfResearch Article An Interior Point Method for 1/2-SVM and Application to Feature

12 Journal of Applied Mathematics

1 2 3 4 5 6 7 8 9 1010

20

30

40

50

60

70

80

90

100

Data set number Data set number

Spar

sity

1 2 3 4 5 6 7 8 9 1070

75

80

85

90

95

100

Accu

racy

L2-SVML1-SVM

L0-SVML12-SVM

L2-SVML1-SVM

L0-SVML12-SVM

Figure 4 Results comparison on UCI data sets

Based on the above lemma, we can easily derive a first-order necessary condition for (10). The Lagrangian function of (10) is
\[
\begin{aligned}
L(w,t,\xi,b,\lambda,\mu,\nu,p,q)
&= C\sum_{i=1}^{n}\xi_i + \sum_{j=1}^{m}t_j
- \sum_{j=1}^{m}\lambda_j\bigl(t_j^2 - w_j\bigr)
- \sum_{j=1}^{m}\mu_j\bigl(t_j^2 + w_j\bigr)\\
&\quad- \sum_{j=1}^{m}\nu_j t_j
- \sum_{i=1}^{n}p_i\bigl(y_i w^T x_i + y_i b + \xi_i - 1\bigr)
- \sum_{i=1}^{n}q_i\xi_i,
\end{aligned}
\tag{A.2}
\]
where \(\lambda,\mu,\nu,p,q\) are the Lagrangian multipliers. By the use of Lemma A.1, we immediately have the following theorem about the first-order necessary condition.

Theorem A.2. Let \((w,t,\xi,b)\) be a local solution of (10). Then there are Lagrangian multipliers \((\lambda,\mu,\nu,p,q)\in\mathbb{R}^{m+m+m+n+n}\) such that the following KKT conditions hold:
\[
\begin{aligned}
&\frac{\partial L}{\partial w_j} = \lambda_j - \mu_j - \sum_{i=1}^{n}p_i y_i x_{ij} = 0, && j = 1,2,\dots,m,\\
&\frac{\partial L}{\partial t_j} = 1 - 2\lambda_j t_j - 2\mu_j t_j - \nu_j = 0, && j = 1,2,\dots,m,\\
&\frac{\partial L}{\partial \xi_i} = C - p_i - q_i = 0, && i = 1,2,\dots,n,\\
&\frac{\partial L}{\partial b} = \sum_{i=1}^{n}p_i y_i = 0,\\
&\lambda_j \ge 0,\quad t_j^2 - w_j \ge 0,\quad \lambda_j\bigl(t_j^2 - w_j\bigr) = 0, && j = 1,2,\dots,m,\\
&\mu_j \ge 0,\quad t_j^2 + w_j \ge 0,\quad \mu_j\bigl(t_j^2 + w_j\bigr) = 0, && j = 1,2,\dots,m,\\
&\nu_j \ge 0,\quad t_j \ge 0,\quad \nu_j t_j = 0, && j = 1,2,\dots,m,\\
&p_i \ge 0,\quad y_i\bigl(w^T x_i + b\bigr) + \xi_i - 1 \ge 0,\quad p_i\bigl(y_i(w^T x_i + b) + \xi_i - 1\bigr) = 0, && i = 1,2,\dots,n,\\
&q_i \ge 0,\quad \xi_i \ge 0,\quad q_i\xi_i = 0, && i = 1,2,\dots,n.
\end{aligned}
\tag{A.3}
\]

The following theorem shows that the level set is bounded.

Theorem A.3. For any given constant \(c>0\), the level set
\[
W_c = \{\,z=(w,t,\xi,b)\mid f(z)\le c\,\}\cap D
\tag{A.4}
\]
is bounded.

Proof. For any \((w,t,\xi,b)\in W_c\), we have
\[
\sum_{j=1}^{m}t_j + C\sum_{i=1}^{n}\xi_i \le c.
\tag{A.5}
\]
Combining this with \(t_j\ge 0\), \(j=1,\dots,m\), and \(\xi_i\ge 0\), \(i=1,\dots,n\), we have
\[
0\le t_j\le c,\quad j=1,\dots,m,\qquad
0\le \xi_i\le \frac{c}{C},\quad i=1,\dots,n.
\tag{A.6}
\]
Moreover, for any \((w,t,\xi,b)\in D\),
\[
|w_j|\le t_j^2,\quad j=1,\dots,m.
\tag{A.7}
\]
Consequently, \(w\), \(t\), and \(\xi\) are bounded. What is more, from the condition
\[
y_i\bigl(w^T x_i+b\bigr)\ge 1-\xi_i,\quad i=1,\dots,n,
\tag{A.8}
\]
we have
\[
w^T x_i + b \ge 1-\xi_i \ \text{ if } y_i=1,\qquad
w^T x_i + b \le -1+\xi_i \ \text{ if } y_i=-1,\qquad i=1,\dots,n.
\tag{A.9}
\]
Thus, if the feasible region \(D\) is not empty, then
\[
\max_{y_i=1}\bigl(1-\xi_i-w^T x_i\bigr)\;\le\; b\;\le\;\min_{y_i=-1}\bigl(-1+\xi_i-w^T x_i\bigr).
\tag{A.10}
\]
Hence \(b\) is also bounded. The proof is complete.
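As a concrete illustration of these bounds: with, say, \(C=2\) and \(c=10\), every point of \(W_c\) satisfies \(0\le t_j\le 10\) and \(0\le\xi_i\le 5\), and (A.7) then confines each \(w_j\) to \([-100,100]\); the bound on \(b\) follows from (A.10) once the data are fixed.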

B. Proof of Lemma 3

Lemma 3 (see Section 3) shows that Algorithm 2 is well defined. We first introduce the following proposition, which shows that the matrix \(S\) defined by (21) is always positive definite and thereby ensures that Algorithm 2 is well defined.

Proposition B.1. Let inequality (26) hold. Then the matrix \(S\) defined by (21) is positive definite.

Proof. By an elementary deduction, we have, for any \(v^{(1)}\in\mathbb{R}^m\), \(v^{(2)}\in\mathbb{R}^n\), \(v^{(3)}\in\mathbb{R}^m\), \(v^{(4)}\in\mathbb{R}\) with \((v^{(1)},v^{(2)},v^{(3)},v^{(4)})\ne 0\),
\[
\begin{aligned}
&\bigl(v^{(1)T},v^{(2)T},v^{(3)T},v^{(4)T}\bigr)\,S
\begin{pmatrix} v^{(1)}\\ v^{(2)}\\ v^{(3)}\\ v^{(4)} \end{pmatrix}\\
&\quad= v^{(1)T}S_{11}v^{(1)} + v^{(1)T}S_{12}v^{(2)} + v^{(1)T}S_{13}v^{(3)} + v^{(1)T}S_{14}v^{(4)}
+ v^{(2)T}S_{21}v^{(1)} + v^{(2)T}S_{22}v^{(2)}\\
&\qquad+ v^{(2)T}S_{23}v^{(3)} + v^{(2)T}S_{24}v^{(4)}
+ v^{(3)T}S_{31}v^{(1)} + v^{(3)T}S_{32}v^{(2)}
+ v^{(3)T}S_{33}v^{(3)} + v^{(3)T}S_{34}v^{(4)}\\
&\qquad+ v^{(4)T}S_{41}v^{(1)} + v^{(4)T}S_{42}v^{(2)}
+ v^{(4)T}S_{43}v^{(3)} + v^{(4)T}S_{44}v^{(4)}\\
&\quad= v^{(1)T}\bigl(U+V+X^TY^TP^{-1}D_4YX\bigr)v^{(1)}
+ v^{(1)T}\bigl(X^TY^TP^{-1}D_4\bigr)v^{(2)}
+ v^{(1)T}\bigl(-2(U-V)T\bigr)v^{(3)}\\
&\qquad+ v^{(1)T}\bigl(X^TY^TP^{-1}D_4\,y\bigr)v^{(4)}
+ v^{(2)T}P^{-1}D_4YXv^{(1)}
+ v^{(2)T}\bigl(P^{-1}D_4+\Xi^{-1}D_5\bigr)v^{(2)}\\
&\qquad+ v^{(2)T}P^{-1}D_4\,y\,v^{(4)}
- 2v^{(3)T}(U-V)Tv^{(1)}
+ v^{(3)T}\bigl(4T(U+V)T + T^{-1}D_3 - 2(D_1+D_2)\bigr)v^{(3)}\\
&\qquad+ v^{(4)T}y^TP^{-1}D_4YXv^{(1)}
+ v^{(4)T}y^TP^{-1}D_4v^{(2)}
+ v^{(4)T}y^TP^{-1}D_4\,y\,v^{(4)}\\
&\quad= v^{(1)T}Uv^{(1)} - 4v^{(3)T}TUv^{(1)} + 4v^{(3)T}TUTv^{(3)}
+ v^{(1)T}Vv^{(1)} + 4v^{(3)T}TVv^{(1)} + 4v^{(3)T}TVTv^{(3)}\\
&\qquad+ v^{(2)T}\Xi^{-1}D_5v^{(2)}
+ v^{(3)T}\bigl(T^{-1}D_3-2D_1-2D_2\bigr)v^{(3)}\\
&\qquad+ \bigl(YXv^{(1)}+v^{(2)}+v^{(4)}y\bigr)^TP^{-1}D_4\bigl(YXv^{(1)}+v^{(2)}+v^{(4)}y\bigr)\\
&\quad= e_m^T\,U\bigl(D_{v1}-2TD_{v3}\bigr)^2 e_m
+ e_m^T\,V\bigl(D_{v1}+2TD_{v3}\bigr)^2 e_m
+ v^{(2)T}\Xi^{-1}D_5v^{(2)}\\
&\qquad+ v^{(3)T}\bigl(T^{-1}D_3-2D_1-2D_2\bigr)v^{(3)}
+ \bigl(YXv^{(1)}+v^{(2)}+v^{(4)}y\bigr)^TP^{-1}D_4\bigl(YXv^{(1)}+v^{(2)}+v^{(4)}y\bigr)\\
&\quad> 0,
\end{aligned}
\tag{B.1}
\]
where \(D_{v1}=\operatorname{diag}(v^{(1)})\), \(D_{v2}=\operatorname{diag}(v^{(2)})\), \(D_{v3}=\operatorname{diag}(v^{(3)})\), and \(D_{v4}=\operatorname{diag}(v^{(4)})\). Here every term in the last expression is nonnegative, and, since \((v^{(1)},v^{(2)},v^{(3)},v^{(4)})\ne 0\), at least one of them is strictly positive (for the third-block term this uses inequality (26)); hence the quadratic form is positive and \(S\) is positive definite.
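As a quick numerical sanity check, one can assemble a matrix with the block structure that the expansion in (B.1) reveals and confirm that its smallest eigenvalue is positive. The sketch below is only an illustration under our own assumptions: U, V, D1–D5, T, P, and Ξ are replaced by arbitrary positive diagonal matrices (their precise definitions accompany (18)–(21) in the main text and are not restated here), and the analogue of inequality (26) is enforced by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 8                                  # toy sizes: w in R^m, n training samples

X = rng.standard_normal((n, m))
y = rng.choice([-1.0, 1.0], size=n)
Y = np.diag(y)

def dpos(k):                                 # a random positive diagonal matrix
    return np.diag(rng.uniform(0.5, 2.0, size=k))

U, V = dpos(m), dpos(m)
D1, D2 = dpos(m), dpos(m)
D4, D5 = dpos(n), dpos(n)
T, P, Xi = dpos(m), dpos(n), dpos(n)
# Enforce the analogue of (26): T^{-1} D3 - 2 D1 - 2 D2 is positive definite.
D3 = T @ (2.0 * D1 + 2.0 * D2 + dpos(m))

PD4 = np.linalg.inv(P) @ D4                  # P^{-1} D4 (diagonal)
S11 = U + V + X.T @ Y @ PD4 @ Y @ X
S12 = X.T @ Y @ PD4
S13 = -2.0 * (U - V) @ T
S14 = (X.T @ Y @ PD4 @ y).reshape(m, 1)
S22 = PD4 + np.linalg.inv(Xi) @ D5
S24 = (PD4 @ y).reshape(n, 1)
S33 = 4.0 * T @ (U + V) @ T + np.linalg.inv(T) @ D3 - 2.0 * (D1 + D2)
S44 = np.array([[y @ PD4 @ y]])

S = np.block([
    [S11,   S12,              S13,              S14],
    [S12.T, S22,              np.zeros((n, m)), S24],
    [S13.T, np.zeros((m, n)), S33,              np.zeros((m, 1))],
    [S14.T, S24.T,            np.zeros((1, m)), S44],
])
print(np.linalg.eigvalsh(0.5 * (S + S.T)).min())   # positive => S is positive definite
```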

B.1. Proof of Lemma 3

Proof. It is easy to see that if \(\Delta z^k=0\), then (23) holds for all \(\alpha_k\ge 0\).

Suppose \(\Delta z^k\ne 0\). We have
\[
\begin{aligned}
&\nabla_w\Phi_\mu(z^k)^T\Delta w^k + \nabla_t\Phi_\mu(z^k)^T\Delta t^k
+ \nabla_\xi\Phi_\mu(z^k)^T\Delta\xi^k + \nabla_b\Phi_\mu(z^k)^T\Delta b^k\\
&\quad= -\bigl(\Delta w^{kT},\Delta\xi^{kT},\Delta t^{kT},\Delta b^k\bigr)\,S
\begin{pmatrix}\Delta w^k\\ \Delta\xi^k\\ \Delta t^k\\ \Delta b^k\end{pmatrix}.
\end{aligned}
\tag{B.2}
\]
Since the matrix \(S\) is positive definite and \(\Delta z^k\ne 0\), the last equation implies
\[
\nabla_w\Phi_\mu(z^k)^T\Delta w^k + \nabla_t\Phi_\mu(z^k)^T\Delta t^k
+ \nabla_\xi\Phi_\mu(z^k)^T\Delta\xi^k + \nabla_b\Phi_\mu(z^k)^T\Delta b^k < 0.
\tag{B.3}
\]
Consequently, there exists an \(\bar\alpha_k>0\) such that the fourth inequality in (23) is satisfied for all \(\alpha_k\in(0,\bar\alpha_k]\).

On the other hand, since \(z^k\) is strictly feasible, the point \(z^k+\alpha\Delta z^k\) remains strictly feasible for all \(\alpha>0\) sufficiently small. The proof is complete.
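The step-acceptance mechanism this proof relies on is a standard feasible backtracking (Armijo-type) line search. The following sketch is a generic illustration of that mechanism only, written with our own names (phi for \(\Phi_\mu\), g_list for the constraint functions \(g_i\), tau1 and beta for the line-search parameters); the exact rules used by Algorithm 2 are those in (23)–(27) of the main text.

```python
import numpy as np

def feasible_backtracking(phi, grad_phi, g_list, z, dz,
                          tau1=0.25, beta=0.5, max_trials=50):
    """Backtracking line search that enforces strict feasibility and an
    Armijo-type sufficient decrease of the barrier function."""
    slope = float(grad_phi(z) @ dz)        # negative whenever dz != 0, by (B.3)
    alpha = 1.0
    for _ in range(max_trials):
        z_new = z + alpha * dz
        if all(g(z_new) > 0 for g in g_list) and \
           phi(z_new) <= phi(z) + tau1 * alpha * slope:
            return alpha                   # accepted step length
        alpha *= beta                      # shrink the trial step and retry
    return 0.0                             # no acceptable step found (should not occur for dz != 0)
```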

C. Convergence Analysis

This appendix is devoted to the global convergence of the interior point method. We first show the convergence of Algorithm 2 when a fixed \(\mu\) is applied to the barrier subproblem (14).
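For ease of reference in the proofs below, recall that the objective of (10) is \(f(z)=\sum_{j=1}^{m}t_j + C\sum_{i=1}^{n}\xi_i\); assuming the standard logarithmic barrier form used in (14) (not restated here), the barrier function reads
\[
\Phi_\mu(z) = f(z) - \mu\sum_{i=1}^{3m+2n}\log g_i(z),
\]
where the \(g_i\) collect the \(3m+2n\) inequality constraints of (10).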

Lemma C.1. Let \(\{(z^k,\lambda^k)\}\) be generated by Algorithm 2. Then the \(z^k\) are strictly feasible for problem (10), and the Lagrangian multipliers \(\lambda^k\) are bounded from above.

Proof. For the sake of convenience, we use \(g_i(z)\), \(i=1,2,\dots,3m+2n\), to denote the constraint functions of the constrained problem (10).

We first show that the \(z^k\) are strictly feasible. Suppose on the contrary that there exist an infinite index subset \(\mathcal{K}\) and an index \(i\in\{1,2,\dots,3m+2n\}\) such that \(\{g_i(z^k)\}_{\mathcal{K}}\downarrow 0\). By the definition of \(\Phi_\mu(z)\), and since \(f(z)=\sum_{j=1}^{m}t_j+C\sum_{i=1}^{n}\xi_i\) is bounded from below on the feasible set, it must hold that \(\{\Phi_\mu(z^k)\}_{\mathcal{K}}\to\infty\). However, the line search rule implies that the sequence \(\{\Phi_\mu(z^k)\}\) is decreasing, so we get a contradiction. Consequently, for every \(i\in\{1,2,\dots,3m+2n\}\), \(g_i(z^k)\) is bounded away from zero. The boundedness of \(\{\lambda^k\}\) then follows from (24)–(27).

Lemma C.2. Let \(\{(z^k,\lambda^k)\}\) and \(\{(\Delta z^k,\hat\lambda^{k+1})\}\) be generated by Algorithm 2. If \(\{z^k\}_{\mathcal{K}}\) is a convergent subsequence of \(\{z^k\}\), then the sequence \(\{(\Delta z^k,\hat\lambda^{k+1})\}_{\mathcal{K}}\) is bounded.

Proof. Again we use \(g_i(z)\), \(i=1,2,\dots,3m+2n\), to denote the constraint functions of the constrained problem (10). We suppose on the contrary that there exists an infinite subset \(\mathcal{K}'\subseteq\mathcal{K}\) such that the subsequence \(\{(\Delta z^k,\hat\lambda^{k+1})\}_{\mathcal{K}'}\) tends to infinity. Let \(\{z^k\}_{\mathcal{K}'}\to z^*\). It follows from Lemma C.1 that there is an infinite subset \(\bar{\mathcal{K}}\subseteq\mathcal{K}'\) such that \(\{\lambda^k\}_{\bar{\mathcal{K}}}\to\lambda^*\) and \(g_i(z^*)>0\) for all \(i\in\{1,2,\dots,3m+2n\}\). Thus we have
\[
S_k\longrightarrow S_*
\tag{C.1}
\]
as \(k\to\infty\) with \(k\in\bar{\mathcal{K}}\). By Proposition B.1, \(S_*\) is positive definite. Since the right-hand side of (18) is bounded and continuous, the unboundedness of \(\{(\Delta z^k,\hat\lambda^{k+1})\}_{\bar{\mathcal{K}}}\) would imply that the limit of the coefficient matrices of (18) is singular, which yields a contradiction. The proof is complete.

Lemma C.3. Let \(\{(z^k,\lambda^k)\}\) and \(\{(\Delta z^k,\hat\lambda^{k+1})\}\) be generated by Algorithm 2. If \(\{z^k\}_{\mathcal{K}}\) is a convergent subsequence of \(\{z^k\}\), then one has \(\{\Delta z^k\}_{\mathcal{K}}\to 0\).

Proof. It follows from Lemma C.2 that the sequence \(\{\Delta z^k\}_{\mathcal{K}}\) is bounded. Suppose on the contrary that there exists an infinite subset \(\mathcal{K}'\subseteq\mathcal{K}\) such that \(\{\Delta z^k\}_{\mathcal{K}'}\to\Delta z^*\ne 0\). Since the subsequences \(\{z^k\}_{\mathcal{K}'}\), \(\{\lambda^k\}_{\mathcal{K}'}\), and \(\{\hat\lambda^{k+1}\}_{\mathcal{K}'}\) are all bounded, there are points \(z^*\), \(\lambda^*\), and \(\hat\lambda^*\) as well as an infinite index set \(\bar{\mathcal{K}}\subseteq\mathcal{K}'\) such that \(\{z^k\}_{\bar{\mathcal{K}}}\to z^*\), \(\{\lambda^k\}_{\bar{\mathcal{K}}}\to\lambda^*\), and \(\{\hat\lambda^{k+1}\}_{\bar{\mathcal{K}}}\to\hat\lambda^*\). By Lemma C.1, we have \(g_i(z^*)>0\) for all \(i\in\{1,2,\dots,3m+2n\}\). Similar to the proof of Lemma 3, it is not difficult to get
\[
\begin{aligned}
&\nabla_w\Phi_\mu(z^*)^T\Delta w^* + \nabla_t\Phi_\mu(z^*)^T\Delta t^*
+ \nabla_\xi\Phi_\mu(z^*)^T\Delta\xi^* + \nabla_b\Phi_\mu(z^*)^T\Delta b^*\\
&\quad= -\bigl(\Delta w^{*T},\Delta\xi^{*T},\Delta t^{*T},\Delta b^*\bigr)\,S
\begin{pmatrix}\Delta w^*\\ \Delta\xi^*\\ \Delta t^*\\ \Delta b^*\end{pmatrix} < 0.
\end{aligned}
\tag{C.2}
\]
Since \(g_i(z^*)>0\) for all \(i\in\{1,2,\dots,3m+2n\}\), there exists an \(\bar\alpha\in(0,1]\) such that, for all \(\alpha\in(0,\bar\alpha]\),
\[
g_i\bigl(z^*+\alpha\Delta z^*\bigr)>0,\quad \forall i\in\{1,2,\dots,3m+2n\}.
\tag{C.3}
\]
Taking into account \(\tau_1\in(0,1/2)\), we claim that there exists an \(\tilde\alpha\in(0,\bar\alpha]\) such that the following inequality holds for all \(\alpha\in(0,\tilde\alpha]\):
\[
\Phi_\mu\bigl(z^*+\alpha\Delta z^*\bigr)-\Phi_\mu(z^*)\le 1.1\,\tau_1\,\alpha\,\nabla_z\Phi_\mu(z^*)^T\Delta z^*.
\tag{C.4}
\]
Let \(m^*=\min\{\,j \mid \beta^j\in(0,\tilde\alpha],\ j=0,1,\dots\}\) and \(\alpha^*=\beta^{m^*}\). It follows from (C.3) that the following inequality is satisfied for all \(k\in\bar{\mathcal{K}}\) sufficiently large and any \(i\in\{1,2,\dots,3m+2n\}\):
\[
g_i\bigl(z^k+\alpha^*\Delta z^k\bigr)>0.
\tag{C.5}
\]
Moreover, we have
\[
\begin{aligned}
\Phi_\mu\bigl(z^k+\alpha^*\Delta z^k\bigr)-\Phi_\mu(z^k)
&= \Phi_\mu\bigl(z^*+\alpha^*\Delta z^*\bigr)-\Phi_\mu(z^*)
+ \Phi_\mu\bigl(z^k+\alpha^*\Delta z^k\bigr)-\Phi_\mu\bigl(z^*+\alpha^*\Delta z^*\bigr)\\
&\quad+ \Phi_\mu(z^*)-\Phi_\mu(z^k)\\
&\le 1.05\,\tau_1\,\alpha^*\,\nabla_z\Phi_\mu(z^*)^T\Delta z^*\\
&\le \tau_1\,\alpha^*\,\nabla_z\Phi_\mu(z^k)^T\Delta z^k.
\end{aligned}
\tag{C.6}
\]
By the backtracking line search rule, the last inequality together with (C.5) yields \(\alpha_k\ge\alpha^*\) for all \(k\in\bar{\mathcal{K}}\) large enough. Consequently, when \(k\in\bar{\mathcal{K}}\) is sufficiently large, we have from (23) and (C.3) that
\[
\begin{aligned}
\Phi_\mu\bigl(z^{k+1}\bigr)
&\le \Phi_\mu(z^k)+\tau_1\,\alpha_k\,\nabla_z\Phi_\mu(z^k)^T\Delta z^k\\
&\le \Phi_\mu(z^k)+\tau_1\,\alpha^*\,\nabla_z\Phi_\mu(z^k)^T\Delta z^k\\
&\le \Phi_\mu(z^k)+\tfrac{1}{2}\,\tau_1\,\alpha^*\,\nabla_z\Phi_\mu(z^*)^T\Delta z^*.
\end{aligned}
\tag{C.7}
\]
This shows that \(\{\Phi_\mu(z^k)\}_{\bar{\mathcal{K}}}\to-\infty\), contradicting the fact that \(\{\Phi_\mu(z^k)\}_{\bar{\mathcal{K}}}\) is bounded from below. The proof is complete.

We now establish the convergence of Algorithms 2 and 4, that is, Theorems 5 and 6 in Section 3.

C.1. Proof of Theorem 5

Proof. Let \((z^*,\hat\lambda^*)\) be a limit point of \(\{(z^k,\hat\lambda^k)\}\), and let the subsequence \(\{(z^k,\hat\lambda^k)\}_{\mathcal{K}}\) converge to \((z^*,\hat\lambda^*)\).

Recall that \((\Delta z^k,\hat\lambda^k)\) is the solution of (18) with \((z,\lambda)=(z^k,\lambda^k)\). Taking limits on both sides of (18) with \((z,\lambda)=(z^k,\lambda^k)\) as \(k\to\infty\) with \(k\in\mathcal{K}\), by the use of Lemmas C.1 and C.3 we obtain
\[
\begin{aligned}
&\hat\lambda^{*(1)}-\hat\lambda^{*(2)}-X^TY^T\hat\lambda^{*(4)}=0,\\
&C e_n-\hat\lambda^{*(4)}-\hat\lambda^{*(5)}=0,\\
&e_m-2T_*\hat\lambda^{*(1)}-2T_*\hat\lambda^{*(2)}-\hat\lambda^{*(3)}=0,\\
&y^T\hat\lambda^{*(4)}=0,\\
&\bigl(T_*^2-W_*\bigr)\hat\lambda^{*(1)}-\mu e_m=0,\\
&\bigl(T_*^2+W_*\bigr)\hat\lambda^{*(2)}-\mu e_m=0,\\
&T_*\hat\lambda^{*(3)}-\mu e_m=0,\\
&P_*\hat\lambda^{*(4)}-\mu e_n=0,\\
&\Xi_*\hat\lambda^{*(5)}-\mu e_n=0.
\end{aligned}
\tag{C.8}
\]
This shows that \((z^*,\hat\lambda^*)\) satisfies the first-order optimality conditions (15).

C.2. Proof of Theorem 6

Proof. (i) Without loss of generality, we suppose that the bounded subsequence \(\{(z^j,\hat\lambda^j)\}_{\mathcal{J}}\) converges to some point \((z^*,\hat\lambda^*)\). It is clear that \(z^*\) is a feasible point of (10). Since \(\mu_j\to 0\) and \(\epsilon_{\mu_j}\to 0\), it follows from (29) and (30) that
\[
\operatorname{Res}\bigl(z^j,\hat\lambda^j,\mu_j\bigr)\ \xrightarrow{\ \mathcal{J}\ }\ \operatorname{Res}\bigl(z^*,\hat\lambda^*,0\bigr)=0,
\qquad
\{\hat\lambda^j\}_{\mathcal{J}}\to\hat\lambda^*\ge 0.
\tag{C.9}
\]
Consequently, \((z^*,\hat\lambda^*)\) satisfies the KKT conditions (13).

(ii) Let \(\xi_j=\max\{\|\hat\lambda^j\|_\infty,1\}\) and \(\bar\lambda^j=\xi_j^{-1}\hat\lambda^j\). Obviously, \(\{\bar\lambda^j\}\) is bounded. Hence there exists an infinite subset \(\mathcal{J}'\subseteq\mathcal{J}\) such that \(\{\bar\lambda^j\}_{\mathcal{J}'}\to\bar\lambda\ne 0\) and \(\|\bar\lambda^j\|_\infty=1\) for large \(j\in\mathcal{J}'\). From (30) we know that \(\bar\lambda\ge 0\). Dividing both sides of (18) with \((z,\lambda)=(z^j,\lambda^j)\) by \(\xi_j\) and then taking limits as \(j\to\infty\) with \(j\in\mathcal{J}'\), we get
\[
\begin{aligned}
&\bar\lambda^{(1)}-\bar\lambda^{(2)}-X^TY^T\bar\lambda^{(4)}=0,\\
&\bar\lambda^{(4)}+\bar\lambda^{(5)}=0,\\
&2T_*\bar\lambda^{(1)}+2T_*\bar\lambda^{(2)}+\bar\lambda^{(3)}=0,\\
&y^T\bar\lambda^{(4)}=0,\\
&\bigl(T_*^2-W_*\bigr)\bar\lambda^{(1)}=0,\\
&\bigl(T_*^2+W_*\bigr)\bar\lambda^{(2)}=0,\\
&T_*\bar\lambda^{(3)}=0,\\
&P_*\bar\lambda^{(4)}=0,\\
&\Xi_*\bar\lambda^{(5)}=0.
\end{aligned}
\tag{C.10}
\]
Since \(z^*\) is feasible, the above equations show that \(z^*\) is a Fritz John point of problem (10).

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

The authors would like to acknowledge support for this project from the National Science Foundation of China (Grants 11371154, 11071087, 11271069, and 61103202).

References

[1] Y. M. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization," in Proceedings of the 14th International Conference on Machine Learning (ICML '97), vol. 97, pp. 412–420, 1997.
[2] G. Forman, "An extensive empirical study of feature selection metrics for text classification," The Journal of Machine Learning Research, vol. 3, pp. 1289–1305, 2003.
[3] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1–3, pp. 389–422, 2002.
[4] H. H. Zhang, J. Ahn, X. Lin, and C. Park, "Gene selection using support vector machines with non-convex penalty," Bioinformatics, vol. 22, no. 1, pp. 88–95, 2006.
[5] Y. Saeys, I. Inza, and P. Larranaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507–2517, 2007.
[6] J. Weston, F. Perez-Cruz, O. Bousquet, O. Chapelle, A. Elisseeff, and B. Scholkopf, "Feature selection and transduction for prediction of molecular bioactivity for drug design," Bioinformatics, vol. 19, no. 6, pp. 764–771, 2003.
[7] Y. Liu, "A comparative study on feature selection methods for drug discovery," Journal of Chemical Information and Computer Sciences, vol. 44, no. 5, pp. 1823–1828, 2004.
[8] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, Eds., Feature Extraction: Foundations and Applications, vol. 207, Springer, 2006.
[9] A. Rakotomamonjy, "Variable selection using SVM-based criteria," Journal of Machine Learning Research, vol. 3, no. 7-8, pp. 1357–1370, 2003.
[10] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik, "Feature selection for SVMs," in Proceedings of the Conference on Neural Information Processing Systems, vol. 13, pp. 668–674, Denver, Colo, USA, 2000.
[11] D. Peleg and R. Meir, "A feature selection algorithm based on the global minimization of a generalization error bound," in Proceedings of the Conference on Neural Information Processing Systems, Vancouver, Canada, 2004.
[12] P. S. Bradley and O. L. Mangasarian, "Feature selection via concave minimization and support vector machines," in Proceedings of the International Conference on Machine Learning, pp. 82–90, San Francisco, Calif, USA, 1998.
[13] J. Weston, A. Elisseeff, B. Scholkopf, and M. Tipping, "Use of the zero-norm with linear models and kernel methods," Journal of Machine Learning Research, vol. 3, pp. 1439–1461, 2003.
[14] A. B. Chan, N. Vasconcelos, and G. R. G. Lanckriet, "Direct convex relaxations of sparse SVM," in Proceedings of the 24th International Conference on Machine Learning (ICML '07), pp. 145–153, Corvallis, Ore, USA, June 2007.
[15] G. M. Fung and O. L. Mangasarian, "A feature selection Newton method for support vector machine classification," Computational Optimization and Applications, vol. 28, no. 2, pp. 185–202, 2004.
[16] J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani, "1-norm support vector machines," Advances in Neural Information Processing Systems, vol. 16, no. 1, pp. 49–56, 2004.
[17] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society B: Methodological, vol. 58, no. 1, pp. 267–288, 1996.
[18] T. Zhang, "Analysis of multi-stage convex relaxation for sparse regularization," Journal of Machine Learning Research, vol. 11, pp. 1081–1107, 2010.
[19] R. Chartrand, "Exact reconstruction of sparse signals via nonconvex minimization," IEEE Signal Processing Letters, vol. 14, no. 10, pp. 707–710, 2007.
[20] R. Chartrand, "Nonconvex regularization for shape preservation," in Proceedings of the 14th IEEE International Conference on Image Processing (ICIP '07), vol. 1, pp. I293–I296, San Antonio, Tex, USA, September 2007.
[21] Z. Xu, H. Zhang, Y. Wang, X. Chang, and Y. Liang, "L1/2 regularization," Science China Information Sciences, vol. 53, no. 6, pp. 1159–1169, 2010.
[22] W. J. Chen and Y. J. Tian, "Lp-norm proximal support vector machine and its applications," Procedia Computer Science, vol. 1, no. 1, pp. 2417–2423, 2010.
[23] J. Liu, J. Li, W. Xu, and Y. Shi, "A weighted Lq adaptive least squares support vector machine classifiers—robust and sparse approximation," Expert Systems with Applications, vol. 38, no. 3, pp. 2253–2259, 2011.
[24] Y. Liu, H. H. Zhang, C. Park, and J. Ahn, "Support vector machines with adaptive Lq penalty," Computational Statistics & Data Analysis, vol. 51, no. 12, pp. 6380–6394, 2007.
[25] Z. Liu, S. Lin, and M. Tan, "Sparse support vector machines with Lp penalty for biomarker identification," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 1, pp. 100–107, 2010.
[26] A. Rakotomamonjy, R. Flamary, G. Gasso, and S. Canu, "lp-lq penalty for sparse linear and sparse multiple kernel multitask learning," IEEE Transactions on Neural Networks, vol. 22, no. 8, pp. 1307–1320, 2011.
[27] J. Y. Tan, Z. Zhang, L. Zhen, C. H. Zhang, and N. Y. Deng, "Adaptive feature selection via a new version of support vector machine," Neural Computing and Applications, vol. 23, no. 3-4, pp. 937–945, 2013.
[28] Z. B. Xu, X. Y. Chang, F. M. Xu, and H. Zhang, "L1/2 regularization: an iterative half thresholding algorithm," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 7, pp. 1013–1027, 2012.
[29] D. Ge, X. Jiang, and Y. Ye, "A note on the complexity of Lp minimization," Mathematical Programming, vol. 129, no. 2, pp. 285–299, 2011.
[30] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 1995.
[31] J. Fan and H. Peng, "Nonconcave penalized likelihood with a diverging number of parameters," The Annals of Statistics, vol. 32, no. 3, pp. 928–961, 2004.
[32] K. Knight and W. Fu, "Asymptotics for lasso-type estimators," The Annals of Statistics, vol. 28, no. 5, pp. 1356–1378, 2000.
[33] B. S. Tian and X. Q. Yang, "An interior-point L1/2-penalty method for nonlinear programming," Technical Report, Department of Applied Mathematics, Hong Kong Polytechnic University, 2013.
[34] L. Armijo, "Minimization of functions having Lipschitz continuous first partial derivatives," Pacific Journal of Mathematics, vol. 16, pp. 1–3, 1966.
[35] F. John, "Extremum problems with inequalities as side-conditions," in Studies and Essays: Courant Anniversary Volume, pp. 187–204, 1948.
[36] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz, "UCI repository of machine learning databases," Technical Report 9702, Department of Information and Computer Science, University of California, Irvine, Calif, USA, 1998, http://archive.ics.uci.edu/ml.

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 13: Research Article An Interior Point Method for -SVM …downloads.hindawi.com/journals/jam/2014/942520.pdfResearch Article An Interior Point Method for 1/2-SVM and Application to Feature

Journal of Applied Mathematics 13

Combining with 119905119894ge 0 119894 = 1 119898 and 120585

119895ge 0 119895 = 1 119899

we have

0 le 119905119894le 119888 119894 = 1 119898

0 le 120585119895le

119888

119862 119895 = 1 119899

(A6)

Moreover for any (119908 119905 120585) isin 119863

10038161003816100381610038161199081198941003816100381610038161003816 le 1199052

119894 119894 = 1 119899 (A7)

Consequently 119908 119905 120585 are bounded What is more from thecondition

119910119894(119908119879119909119894+ 119887) ge 1 minus 120585

119894 119894 = 1 119898 (A8)

we have

119908119879119909119894+ 119887 ge 1 minus 120585

119894 if 119910

119894= 1 119894 = 1 119898

119908119879119909119894+ 119887 le minus1 + 120585

119894 if 119910

119894= minus1 119894 = 1 119898

(A9)

Thus if the feasible region119863 is not empty then we have

max119910119894=1

(1 minus 120585119894minus 119908119879119909119894) le 119887 le min

119910119894=minus1

(minus1 + 120585119894minus 119908119879119909119894) (A10)

Hence 119887 is also bounded The proof is complete

B Proof of Lemma 3

Lemma 3 (see Section 3) shows that Algorithm 2 is welldefinedWe first introduce the following proposition to showthat the matrix 119878 defined by (21) is always positive definitewhich ensures Algorithm 2 to be well defined

Proposition B1 Let the inequality (26) hold then the matrix119878 defined by (21) is positive definite

Proof By an elementary deduction we have for any V(1) isin

119877119898 V(2) isin 119877

119899 V(3) isin 119877119898 V(4) isin 119877 with (V(1) V(2) V(3) V(4)) = 0

(V(1)119879 V(2)119879 V(3)119879 V(4)119879) 119878(

V(1)

V(2)

V(3)

V(4))

= V(1)11987911987811V(1) + V(1)119879119878

12V(2) + V(1)119879119878

13V(3)

+ V(1)11987911987814V(4) + V(2)119879119878

21V(1) + V(2)119879119878

22V(2)

+ V(2)11987911987823V(3) + V(2)119879119878

24V(4)

+ V(3)11987911987831V(1) + V(3)119879119878

32V(2)

+ V(3)11987911987833V(3) + V(3)119879119878

34V(4)

+ V(4)11987911987841V(1) + V(4)119879119878

42V(2)

+ V(4)11987911987843V(3) + V(4)119879119878

44V(4)

= V(1)119879 (119880 + 119881 + 119883119879119884119879119875minus11198634119884119883) V(1)

+ V(1)119879 (119883119879119884119879119875minus11198634) V(2)

+ V(1)119879 (minus2 (119880 minus 119881)119879) V(3)

+ V(1)119879 (119883119879119884119879119875minus11198634119910) V(4)

+ V(2)119879119875minus11198634119884119883V(1) + V(2)119879 (119875minus1119863

4+ Ξminus11198635) V(2)

+ V(2)119879119875minus11198634119910V(4)

minus 2V(3)119879 (119880 minus 119881)119879V(1)

+ V(3)119879 (4119879 (119880 + 119881)119879 + 119879minus11198633minus 2 (119863

1+ 1198632)) V(3)

+ V(4)119879119910119879119875minus11198634119884119883V(1)

+ V(4)119879119910119879119875minus11198634V(2) + V(4)119879119910119879119875minus1119863

4119910V(4)

= V(1)119879119880V(1) minus 4V(3)119879119880119879V(1) + 4V(3)119879119879119880119879V(3)

+ V(1)119879119881V(1) + 4V(3)119879119881119879V(1) + 4V(3)119879119879119881119879V(3)

+ V(2)119879Ξminus11198635V(2) + V(3)119879 (119879minus1119863

3minus 21198631minus 21198632) V(3)

+ (119884119883V(1) + V(2) + V(4)119910)119879

119875minus11198634

times (119884119883V(1) + V(2) + V(4)119910)

= 119890119879

119898119880(119863V

1minus 2119879119863V

3)2

119890119898+ 119890119879

119898119881(119863V

1+ 2119879119863V

3)2

119890119898

+ V(2)119879Ξminus11198635V(2)

+ V(3)119879 (119879minus11198633minus 21198631minus 21198632) V(3)

+ (119884119883V(1) + V(2) + V(4)119910)119879

119875minus11198634(119884119883V(1) + V(2) + V(4)119910)

gt 0

(B1)

where 119863V1= diag(V(1)) 119863V

2= diag(V(2)) 119863V

3= diag(V(3))

and119863V4= diag(V(4))

14 Journal of Applied Mathematics

B1 Proof of Lemma 3

Proof It is easy to see that if Δ119911119896= 0 which implies that (23)

holds for all 120572119896ge 0

Suppose Δ119911119896

= 0 We have

nabla119908Φ120583(119911119896)119879

Δ119908119896+ nabla119905Φ120583(119911119896)119879

Δ119905119896

+ nabla120585Φ120583(119911119896)119879

Δ120585119896+ nabla119887Φ120583(119911119896)119879

Δ119887119896

= minus (Δ119908119896119879

Δ120585119896119879

Δ119905119896119879

Δ119887119896) 119878(

Δ119908119896

Δ120585119896

Δ119905119896

Δ119887119896

)

(B2)

Since matrix 119878 is positive definite and Δ119911119896

= 0 the lastequation implies

nabla119908Φ120583(119911119896)119879

Δ119908119896+ nabla119905Φ120583(119911119896)119879

Δ119905119896

+ nabla120585Φ120583(119911119896)119879

Δ120585119896+ nabla119887Φ120583(119911119896)119879

Δ119887119896lt 0

(B3)

Consequently there exists a 119896isin (0

119896] such that the fourth

inequality in (23) is satisfied for all 120572119896isin (0

119896]

On the other hand since (119909119896 119905119896) is strictly feasible the

point (119909119896 119905119896) + 120572(Δ119909 119905119896 Δ119905119896) will be feasible for all 120572 gt 0

sufficiently small The proof is complete

C Convergence Analysis

This appendix is devoted to the global convergence of theinterior point method We first show the convergence ofAlgorithm 2 when a fixed 120583 is applied to the barrier subprob-lem (14)

Lemma C1 Let (119911119896 120582119896) be generated by Algorithm 2 Then119911119896 are strictly feasible for problem (10) and the Lagrangian

multipliers 120582119896 are bounded from above

Proof For the sake of convenience we use 119892119894(119911) 119894 =

1 2 3119898 + 2119899 to denote the constraint functions of theconstrained problem (10)

We first show that 119911119896 are strictly feasible Suppose onthe contrary that there exists an infinite index subset K andan index 119894 isin 1 2 3119898 + 2119899 such that 119892

119894(119911119896)K darr 0

By the definition of Φ120583(119911) and 119891(119911) = sum

119898

119895=1119905119894+ 119862sum

119899

119894=1120585119894

being bounded from below in the feasible set it must holdthat Φ

120583(119911119896)K rarr infin

However the line search rule implies that the sequenceΦ120583(119911119896) is decreasing So we get a contradiction Conse-

quently for any 119894 isin 1 2 3119898+2119899 119892119894(119909119896 119905119896) is bounded

away from zero The boundedness of 120582119896 then follows from(24)ndash(27)

Lemma C2 Let (119911119896 120582119896) and (Δ119911119896 119896+1

) be generated byAlgorithm 2 If 119911119896K is a convergent subsequence of 119911119896 thenthe sequence (Δ119911

119896 119896+1

)K is bounded

Proof Again we use 119892119894(119911) 119894 = 1 2 3119898+2119899 to denote the

constraint functions of the constrained problem (10)We suppose on the contrary that there exists an infinite

subsetK1015840 sube K such that the subsequence (Δ119911119896 119896+1

)K1015840

tends to infinity Let 119911119896K1015840 rarr 119911

lowast It follows fromLemma C1 that there is an infinite subset K sube K1015840 such that120582119896K rarr 120582

lowast and 119892119894(119911lowast) gt 0 forall119894 isin 1 2 3119898 + 2119899 Thus

we have

119878119896997888rarr 119878lowast (C1)

as 119896 rarr infin with 119896 isin K By Proposition B1 119878lowast is positivedefinite Since the right hand size of (18) is bounded andcontinuous the unboundedness of (Δ119911

119896 119896+1

)K impliesthat the limit of the coefficient matrices of (18) is singularwhich yields a contradiction The proof is complete

Lemma C3 Let (119911119896 120582119896) and (Δ119911119896 119896+1

) be generated byAlgorithm 2 If 119911119896K is a convergent subsequence of 119911119896 thenone has Δ119911

119896K rarr 0

Proof It follows from Lemma C1 that the sequence Δ119911119896K

is bounded Suppose on the contrary that there exists aninfinite subsetK1015840 sube K such that Δ119911

119896K1015840 rarr Δ119911

lowast= 0 Since

subsequences 119911119896K1015840 120582

119896K1015840 and

119896K1015840 are all bounded

there are points 119911lowast 120582lowast and

lowast as well as an infinite indexset K sube K1015840 such that 119911119896K rarr 119911

lowast 120582119896K rarr 120582lowast and

119896K rarr

lowast By Lemma C1 we have 119892119894(119911lowast) gt 0 forall119894 isin

1 2 3119898 + 2119899 Similar to the proof of Lemma 3 it is notdifficult to get

nabla119908Φ120583(119911lowast)119879

Δ119908lowast+ nabla119905Φ120583(119911lowast)119879

Δ119905lowast

+ nabla120585Φ120583(119911lowast)119879

Δ120585lowast+ nabla119887Φ120583(119911lowast)119879

Δ119887lowast

= minus (Δ119908lowast119879

Δ120585lowast119879

Δ119905lowast119879

Δ119887lowast) 119878(

Δ119908lowast

Δ120585lowast

Δ119905lowast

Δ119887lowast

) lt 0

(C2)

Since 119892119894(119911lowast) gt 0 forall119894 isin 1 2 3119898 + 2119899 there exists a

isin (0 1] such that for all 120572 isin (0 ]

119892119894(119911lowast+ 120572Δ119911

lowast) gt 0 forall119894 isin 1 2 3119899 (C3)

Taking into account 1205911isin (0 12) we claim that there exists

a isin (0 ] such that the following inequality holds for all120572 isin (0 ]

Φ120583(119911lowast+ 120572Δ119911

lowast) minus Φ120583(119911lowast) le 11120591

1120572nabla119911Φ120583(119911lowast)119879

Δ119911lowast (C4)

Let 119898lowast

= min119895 | 120573119895isin (0 ] 119895 = 0 1 and 120572

lowast= 120573119898lowast

It follows from (C3) that the following inequality is satisfiedfor all 119896 isin K sufficient large and any 119894 isin 1 2 3119898 + 2119899

119892119894(119911119896+ 120572lowastΔ119911119896) gt 0 (C5)

Journal of Applied Mathematics 15

Moreover we have

Φ120583(119911119896+ 120572lowastΔ119911119896) minus Φ120583(119911119896)

= Φ120583(119911lowast+ 120572lowastΔ119911lowast) minus Φ120583(119911lowast)

+ Φ120583(119911119896+ 120572lowastΔ119911119896) minus Φ120583(119911lowast+ 120572lowastΔ119911lowast)

+ Φ120583(119911lowast) minus Φ120583(119911119896)

le 1051205911120572lowastnabla119911Φ120583(119911lowast)119879

Δ119911lowast

le 1205911120572lowastnabla119911Φ120583(119911119896)119879

Δ119911119896

(C6)

By the backtracking line search rule the last inequalitytogether with (C5) yields the inequality120572

119896ge 120572lowastfor all 119896 isin K

large enough Consequently when 119896 isin K is sufficiently largewe have from (23) and (C3) that

Φ120583(119911119896+1

) le Φ120583(119911119896) + 1205911120572119896nabla119911Φ120583(119911119896)119879

Δ119911119896

le Φ120583(119911119896) + 1205911120572lowastnabla119911Φ120583(119911119896)119879

Δ119911119896

le Φ120583(119911119896) +

1

21205911120572lowastnabla119911Φ120583(119911lowast)119879

Δ119911lowast

(C7)

This shows that Φ120583(119911119896)K rarr minusinfin contradicting with the

fact that Φ120583(119911119896)K is bounded from below The proof is

complete

Then we will establish the convergence of Algorithms 2and 4 that is Theorems 5 and 6 in Section 3

C1 Proof of Theorem 5

Proof We let (119911lowast lowast) be a limit point of (119911119896 119896) and thesubsequence (119911119896 119896)K converges to (119911

lowast lowast)

Recall that (Δ119911119896 119896) are the solution of (18) with (119911 120582) =

(119911119896 120582119896) Taking limits in both sides of (18) with (119911 120582) =

(119911119896 120582119896) as 119896 rarr infinwith 119896 isin K by the use of Lemma C1 we

obtain

lowast(1)

minus lowast(2)

minus 119883119879119884119879lowast(4)

= 0

119862 lowast 119890119899minus lowast(4)

minus lowast(5)

= 0

119890119898minus 2119879lowastlowast(1)

minus 2119879lowastlowast(2)

minus lowast(3)

= 0

119910119879lowast(4)

= 0

(119879lowast2

minus 119882lowast) lowast(1)

minus 120583119890119898

= 0

(119879lowast2

+ 119882lowast) lowast(2)

minus 120583119890119898

= 0

119879lowastlowast(3)

minus 120583119890119898

= 0

119875lowastlowast(4)

minus 120583119890119899= 0

Ξlowastlowast(5)

minus 120583119890119899= 0

(C8)

This shows that (119911lowast lowast) satisfies the first-order optimalityconditions (15)

C2 Proof of Theorem 6

Proof (i) Without loss of generality we suppose that thebounded subsequence (119911

119895 119895)J converges to some point

(119911lowast lowast) It is clear that 119911lowast is a feasible point of (10) Since

120583119895 120598120583119895 rarr 0 it follows from (29) and (30) that

Res (119911119895 119895 120583119895)

J997888rarr Res (119911lowast lowast 0) = 0

119895J

997888rarr lowastge 0

(C9)

Consequently the (119911lowast lowast) satisfies the KKT conditions (13)(ii) Let 120585

119895= max119895

infin 1 and

119895= 120585minus1

119895119895 Obviously

119895 is bounded Hence there exists an infinite subsetJ1015840 sube J

such that 1198951015840J rarr = 0 and 119895infin

= 1 for large 119895 isin JFrom (30) we know that ge 0 Dividing both sides of thefirst inequality of (18) with (119911 120582) = (119911

119896 120582119896) by 120585119895 then taking

limits as 119895 rarr infin with 119895 isin J we get

(1)

minus (2)

minus 119883119879119884119879(4)

= 0

(4)

+ (5)

= 0

2119879lowast(1)

+ 2119879lowast(2)

+ (3)

= 0

119910119879(4)

= 0

(119879lowast2

minus 119882lowast) (1)

= 0

(119879lowast2

+ 119882lowast) (2)

= 0

119879lowast(3)

= 0

119875lowast(4)

= 0

Ξlowast(5)

= 0

(C10)

Since 119911lowast is feasible the above equations has shown that 119911lowast is

a Fritz-John point of problem (10)

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

The authors would like to acknowledge support for thisproject from the National Science Foundation (NSF Grants11371154 11071087 11271069 and 61103202) of China

References

[1] Y M Yang and J O Pedersen ldquoA comparative study on featureselection in text categorizationrdquo in Proceedings of the 14th

16 Journal of Applied Mathematics

International Conference on Machine Learning (ICML rsquo97) vol97 pp 412ndash420 1997

[2] G Forman ldquoAn extensive empirical study of feature selectionmetrics for text classificationrdquoThe Journal of Machine LearningResearch vol 3 pp 1289ndash1305 2003

[3] I Guyon J Weston S Barnhill and V Vapnik ldquoGene selec-tion for cancer classification using support vector machinesrdquoMachine Learning vol 46 no 1ndash3 pp 389ndash422 2002

[4] H H Zhang J Ahn X Lin and C Park ldquoGene selection usingsupport vector machines with non-convex penaltyrdquo Bioinfor-matics vol 22 no 1 pp 88ndash95 2006

[5] Y Saeys I Inza and P Larranaga ldquoA review of feature selectiontechniques in bioinformaticsrdquo Bioinformatics vol 23 no 19 pp2507ndash2517 2007

[6] J Weston F Perez-Cruz O Bousquet O Chapelle A Elisseeffand B Scholkopf ldquoFeature selection and transduction forprediction of molecular bioactivity for drug designrdquo Bioinfor-matics vol 19 no 6 pp 764ndash771 2003

[7] Y Liu ldquoA comparative study on feature selection methods fordrug discoveryrdquo Journal of Chemical Information and ComputerSciences vol 44 no 5 pp 1823ndash1828 2004

[8] I Guyon J Weston S Barnhill and V Vapnik Eds FeatureExtraction Foundations and Applications vol 207 Springer2006

[9] A Rakotomamonjy ldquoVariable selection using SVM-based cri-teriardquo Journal of Machine Learning Research vol 3 no 7-8 pp1357ndash1370 2003

[10] J Weston S Mukherjee O Chapelle M Pontil T Poggio andV Vapnik ldquoFeature selection for SVMsrdquo in Proceedings of theConference onNeural Information Processing Systems vol 13 pp668ndash674 Denver Colo USA 2000

[11] D Peleg and R Meir ldquoA feature selection algorithm based onthe global minimization of a generalization error boundrdquo inProceedings of the Conference on Neural Information ProcessingSystems Vancouver Canada 2004

[12] P S Bradley and O L Mangasarian ldquoFeature selection via con-cave minimization and support vector machinesrdquo in Proceed-ings of the International Conference on Machine Learning pp82ndash90 San Francisco Calif USA 1998

[13] JWeston A Elisseeff B Scholkopf andM Tipping ldquoUse of thezero-norm with linear models and kernel methodsrdquo Journal ofMachine Learning Research vol 3 pp 1439ndash1461 2003

[14] A B Chan N Vasconcelos and G R G Lanckriet ldquoDirectconvex relaxations of sparse SVMrdquo in Proceedings of the 24thInternational Conference on Machine Learning (ICML rsquo07) pp145ndash153 Corvallis Ore USA June 2007

[15] GM Fung andO L Mangasarian ldquoA feature selection Newtonmethod for support vector machine classificationrdquo Computa-tional Optimization and Applications vol 28 no 2 pp 185ndash2022004

[16] J Zhu S Rosset T Hastie and R Tibshirani ldquo1-norm supportvector machinesrdquo Advances in Neural Information ProcessingSystems vol 16 no 1 pp 49ndash56 2004

[17] R Tibshirani ldquoRegression shrinkage and selection via the lassordquoJournal of the Royal Statistical Society B Methodological vol 58no 1 pp 267ndash288 1996

[18] T Zhang ldquoAnalysis of multi-stage convex relaxation for sparseregularizationrdquo Journal of Machine Learning Research vol 11pp 1081ndash1107 2010

[19] R Chartrand ldquoExact reconstruction of sparse signals via non-convexminimizationrdquo IEEE Signal Processing Letters vol 14 no10 pp 707ndash710 2007

[20] R Chartrand ldquoNonconvex regularizaron for shape preserva-tionrdquo inProceedings of the 14th IEEE International Conference onImage Processing (ICIP rsquo07) vol 1 pp I293ndashI296 San AntonioTex USA September 2007

[21] Z Xu H Zhang Y Wang X Chang and Y Liang ldquo11987112

regularizationrdquo Science China Information Sciences vol 53 no6 pp 1159ndash1169 2010

[22] W J Chen and Y J Tian ldquoLp-norm proximal support vectormachine and its applicationsrdquo Procedia Computer Science vol1 no 1 pp 2417ndash2423 2010

[23] J Liu J Li W Xu and Y Shi ldquoA weighted L119902adaptive least

squares support vector machine classifiers-Robust and sparseapproximationrdquo Expert Systems with Applications vol 38 no 3pp 2253ndash2259 2011

[24] Y Liu H H Zhang C Park and J Ahn ldquoSupport vector mach-ines with adaptive 119871

119902penaltyrdquo Computational Statistics amp Data

Analysis vol 51 no 12 pp 6380ndash6394 2007[25] Z Liu S Lin andM Tan ldquoSparse support vectormachines with

Lp penalty for biomarker identificationrdquo IEEEACM Transac-tions on Computational Biology and Bioinformatics vol 7 no1 pp 100ndash107 2010

[26] A Rakotomamonjy R Flamary G Gasso and S Canu ldquol119901-l119902

penalty for sparse linear and sparse multiple kernel multitasklearningrdquo IEEE Transactions on Neural Networks vol 22 no 8pp 1307ndash1320 2011

[27] J Y Tan Z Zhang L Zhen C H Zhang and N Y DengldquoAdaptive feature selection via a new version of support vectormachinerdquo Neural Computing and Applications vol 23 no 3-4pp 937ndash945 2013

[28] Z B Xu X Y Chang F M Xu and H Zhang ldquo11987112

regu-larization an iterative half thresholding algorithmrdquo IEEETrans-actions on Neural Networks and Learning Systems vol 23 no 7pp 1013ndash1027 2012

[29] D Ge X Jiang and Y Ye ldquoA note on the complexity of 119871119901

minimizationrdquo Mathematical Programming vol 129 no 2 pp285ndash299 2011

[30] VN VapnikTheNature of Statistical LearningTheory SpringerNew York NY USA 1995

[31] J Fan and H Peng ldquoNonconcave penalized likelihood with adiverging number of parametersrdquo The Annals of Statistics vol32 no 3 pp 928ndash961 2004

[32] K Knight and W Fu ldquoAsymptotics for lasso-type estimatorsrdquoThe Annals of Statistics vol 28 no 5 pp 1356ndash1378 2000

[33] B S Tian and X Q Yang ldquoAn interior-point l 11987112-penalty-

method for nonlinear programmingrdquo Technical ReportDepartment of Applied Mathematics Hong Kong PolytechnicUniversity 2013

[34] L Armijo ldquoMinimization of functions having Lipschitz con-tinuous first partial derivativesrdquo Pacific Journal of Mathematicsvol 16 pp 1ndash3 1966

[35] F John ldquoExtremum problems with inequalities as side-con-ditionsrdquo in Studies and Essays Courant Anniversary Volume pp187ndash1204 1948

[36] D J Newman S Hettich C L Blake and C J Merz ldquoUCIrepository of machine learning databasesrdquo Technical Report9702 Department of Information and Computer Science Uni-verisity of California Irvine Calif USA 1998 httparchiveicsucieduml

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 14: Research Article An Interior Point Method for -SVM …downloads.hindawi.com/journals/jam/2014/942520.pdfResearch Article An Interior Point Method for 1/2-SVM and Application to Feature

14 Journal of Applied Mathematics

B1 Proof of Lemma 3

Proof It is easy to see that if Δ119911119896= 0 which implies that (23)

holds for all 120572119896ge 0

Suppose Δ119911119896

= 0 We have

nabla119908Φ120583(119911119896)119879

Δ119908119896+ nabla119905Φ120583(119911119896)119879

Δ119905119896

+ nabla120585Φ120583(119911119896)119879

Δ120585119896+ nabla119887Φ120583(119911119896)119879

Δ119887119896

= minus (Δ119908119896119879

Δ120585119896119879

Δ119905119896119879

Δ119887119896) 119878(

Δ119908119896

Δ120585119896

Δ119905119896

Δ119887119896

)

(B2)

Since matrix 119878 is positive definite and Δ119911119896

= 0 the lastequation implies

nabla119908Φ120583(119911119896)119879

Δ119908119896+ nabla119905Φ120583(119911119896)119879

Δ119905119896

+ nabla120585Φ120583(119911119896)119879

Δ120585119896+ nabla119887Φ120583(119911119896)119879

Δ119887119896lt 0

(B3)

Consequently there exists a 119896isin (0

119896] such that the fourth

inequality in (23) is satisfied for all 120572119896isin (0

119896]

On the other hand since (119909119896 119905119896) is strictly feasible the

point (119909119896 119905119896) + 120572(Δ119909 119905119896 Δ119905119896) will be feasible for all 120572 gt 0

sufficiently small The proof is complete

C Convergence Analysis

This appendix is devoted to the global convergence of theinterior point method We first show the convergence ofAlgorithm 2 when a fixed 120583 is applied to the barrier subprob-lem (14)

Lemma C1 Let (119911119896 120582119896) be generated by Algorithm 2 Then119911119896 are strictly feasible for problem (10) and the Lagrangian

multipliers 120582119896 are bounded from above

Proof For the sake of convenience we use 119892119894(119911) 119894 =

1 2 3119898 + 2119899 to denote the constraint functions of theconstrained problem (10)

We first show that 119911119896 are strictly feasible Suppose onthe contrary that there exists an infinite index subset K andan index 119894 isin 1 2 3119898 + 2119899 such that 119892

119894(119911119896)K darr 0

By the definition of Φ120583(119911) and 119891(119911) = sum

119898

119895=1119905119894+ 119862sum

119899

119894=1120585119894

being bounded from below in the feasible set it must holdthat Φ

120583(119911119896)K rarr infin

However the line search rule implies that the sequenceΦ120583(119911119896) is decreasing So we get a contradiction Conse-

quently for any 119894 isin 1 2 3119898+2119899 119892119894(119909119896 119905119896) is bounded

away from zero The boundedness of 120582119896 then follows from(24)ndash(27)

Lemma C2 Let (119911119896 120582119896) and (Δ119911119896 119896+1

) be generated byAlgorithm 2 If 119911119896K is a convergent subsequence of 119911119896 thenthe sequence (Δ119911

119896 119896+1

)K is bounded

Proof Again we use 119892119894(119911) 119894 = 1 2 3119898+2119899 to denote the

constraint functions of the constrained problem (10)We suppose on the contrary that there exists an infinite

subsetK1015840 sube K such that the subsequence (Δ119911119896 119896+1

)K1015840

tends to infinity Let 119911119896K1015840 rarr 119911

lowast It follows fromLemma C1 that there is an infinite subset K sube K1015840 such that120582119896K rarr 120582

lowast and 119892119894(119911lowast) gt 0 forall119894 isin 1 2 3119898 + 2119899 Thus

we have

119878119896997888rarr 119878lowast (C1)

as 119896 rarr infin with 119896 isin K By Proposition B1 119878lowast is positivedefinite Since the right hand size of (18) is bounded andcontinuous the unboundedness of (Δ119911

119896 119896+1

)K impliesthat the limit of the coefficient matrices of (18) is singularwhich yields a contradiction The proof is complete

Lemma C3 Let (119911119896 120582119896) and (Δ119911119896 119896+1

) be generated byAlgorithm 2 If 119911119896K is a convergent subsequence of 119911119896 thenone has Δ119911

119896K rarr 0

Proof It follows from Lemma C1 that the sequence Δ119911119896K

is bounded Suppose on the contrary that there exists aninfinite subsetK1015840 sube K such that Δ119911

119896K1015840 rarr Δ119911

lowast= 0 Since

subsequences 119911119896K1015840 120582

119896K1015840 and

119896K1015840 are all bounded

there are points 119911lowast 120582lowast and

lowast as well as an infinite indexset K sube K1015840 such that 119911119896K rarr 119911

lowast 120582119896K rarr 120582lowast and

119896K rarr

lowast By Lemma C1 we have 119892119894(119911lowast) gt 0 forall119894 isin

1 2 3119898 + 2119899 Similar to the proof of Lemma 3 it is notdifficult to get

nabla119908Φ120583(119911lowast)119879

Δ119908lowast+ nabla119905Φ120583(119911lowast)119879

Δ119905lowast

+ nabla120585Φ120583(119911lowast)119879

Δ120585lowast+ nabla119887Φ120583(119911lowast)119879

Δ119887lowast

= minus (Δ119908lowast119879

Δ120585lowast119879

Δ119905lowast119879

Δ119887lowast) 119878(

Δ119908lowast

Δ120585lowast

Δ119905lowast

Δ119887lowast

) lt 0

(C2)

Since 119892119894(119911lowast) gt 0 forall119894 isin 1 2 3119898 + 2119899 there exists a

isin (0 1] such that for all 120572 isin (0 ]

119892119894(119911lowast+ 120572Δ119911

lowast) gt 0 forall119894 isin 1 2 3119899 (C3)

Taking into account 1205911isin (0 12) we claim that there exists

a isin (0 ] such that the following inequality holds for all120572 isin (0 ]

Φ120583(119911lowast+ 120572Δ119911

lowast) minus Φ120583(119911lowast) le 11120591

1120572nabla119911Φ120583(119911lowast)119879

Δ119911lowast (C4)

Let 119898lowast

= min119895 | 120573119895isin (0 ] 119895 = 0 1 and 120572

lowast= 120573119898lowast

It follows from (C3) that the following inequality is satisfiedfor all 119896 isin K sufficient large and any 119894 isin 1 2 3119898 + 2119899

119892119894(119911119896+ 120572lowastΔ119911119896) gt 0 (C5)

Journal of Applied Mathematics 15

Moreover we have

Φ120583(119911119896+ 120572lowastΔ119911119896) minus Φ120583(119911119896)

= Φ120583(119911lowast+ 120572lowastΔ119911lowast) minus Φ120583(119911lowast)

+ Φ120583(119911119896+ 120572lowastΔ119911119896) minus Φ120583(119911lowast+ 120572lowastΔ119911lowast)

+ Φ120583(119911lowast) minus Φ120583(119911119896)

le 1051205911120572lowastnabla119911Φ120583(119911lowast)119879

Δ119911lowast

le 1205911120572lowastnabla119911Φ120583(119911119896)119879

Δ119911119896

(C6)

By the backtracking line search rule the last inequalitytogether with (C5) yields the inequality120572

119896ge 120572lowastfor all 119896 isin K

large enough Consequently when 119896 isin K is sufficiently largewe have from (23) and (C3) that

Φ120583(119911119896+1

) le Φ120583(119911119896) + 1205911120572119896nabla119911Φ120583(119911119896)119879

Δ119911119896

le Φ120583(119911119896) + 1205911120572lowastnabla119911Φ120583(119911119896)119879

Δ119911119896

le Φ120583(119911119896) +

1

21205911120572lowastnabla119911Φ120583(119911lowast)119879

Δ119911lowast

(C7)

This shows that Φ120583(119911119896)K rarr minusinfin contradicting with the

fact that Φ120583(119911119896)K is bounded from below The proof is

complete

Then we will establish the convergence of Algorithms 2and 4 that is Theorems 5 and 6 in Section 3

C1 Proof of Theorem 5

Proof We let (119911lowast lowast) be a limit point of (119911119896 119896) and thesubsequence (119911119896 119896)K converges to (119911

lowast lowast)

Recall that (Δ119911119896 119896) are the solution of (18) with (119911 120582) =

(119911119896 120582119896) Taking limits in both sides of (18) with (119911 120582) =

(119911119896 120582119896) as 119896 rarr infinwith 119896 isin K by the use of Lemma C1 we

obtain

lowast(1)

minus lowast(2)

minus 119883119879119884119879lowast(4)

= 0

119862 lowast 119890119899minus lowast(4)

minus lowast(5)

= 0

119890119898minus 2119879lowastlowast(1)

minus 2119879lowastlowast(2)

minus lowast(3)

= 0

119910119879lowast(4)

= 0

(119879lowast2

minus 119882lowast) lowast(1)

minus 120583119890119898

= 0

(119879lowast2

+ 119882lowast) lowast(2)

minus 120583119890119898

= 0

119879lowastlowast(3)

minus 120583119890119898

= 0

119875lowastlowast(4)

minus 120583119890119899= 0

Ξlowastlowast(5)

minus 120583119890119899= 0

(C8)

This shows that (119911lowast lowast) satisfies the first-order optimalityconditions (15)

C2 Proof of Theorem 6

Proof (i) Without loss of generality we suppose that thebounded subsequence (119911

119895 119895)J converges to some point

(119911lowast lowast) It is clear that 119911lowast is a feasible point of (10) Since

120583119895 120598120583119895 rarr 0 it follows from (29) and (30) that

Res (119911119895 119895 120583119895)

J997888rarr Res (119911lowast lowast 0) = 0

119895J

997888rarr lowastge 0

(C9)

Consequently the (119911lowast lowast) satisfies the KKT conditions (13)(ii) Let 120585

119895= max119895

infin 1 and

119895= 120585minus1

119895119895 Obviously

119895 is bounded Hence there exists an infinite subsetJ1015840 sube J

such that 1198951015840J rarr = 0 and 119895infin

= 1 for large 119895 isin JFrom (30) we know that ge 0 Dividing both sides of thefirst inequality of (18) with (119911 120582) = (119911

119896 120582119896) by 120585119895 then taking

limits as 119895 rarr infin with 119895 isin J we get

(1)

minus (2)

minus 119883119879119884119879(4)

= 0

(4)

+ (5)

= 0

2119879lowast(1)

+ 2119879lowast(2)

+ (3)

= 0

119910119879(4)

= 0

(119879lowast2

minus 119882lowast) (1)

= 0

(119879lowast2

+ 119882lowast) (2)

= 0

119879lowast(3)

= 0

119875lowast(4)

= 0

Ξlowast(5)

= 0

(C10)

Since 119911lowast is feasible the above equations has shown that 119911lowast is

a Fritz-John point of problem (10)

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

The authors would like to acknowledge support for thisproject from the National Science Foundation (NSF Grants11371154 11071087 11271069 and 61103202) of China

References

[1] Y. M. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization," in Proceedings of the 14th International Conference on Machine Learning (ICML '97), vol. 97, pp. 412–420, 1997.

[2] G. Forman, "An extensive empirical study of feature selection metrics for text classification," The Journal of Machine Learning Research, vol. 3, pp. 1289–1305, 2003.

[3] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1–3, pp. 389–422, 2002.

[4] H. H. Zhang, J. Ahn, X. Lin, and C. Park, "Gene selection using support vector machines with non-convex penalty," Bioinformatics, vol. 22, no. 1, pp. 88–95, 2006.

[5] Y. Saeys, I. Inza, and P. Larranaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507–2517, 2007.

[6] J. Weston, F. Perez-Cruz, O. Bousquet, O. Chapelle, A. Elisseeff, and B. Scholkopf, "Feature selection and transduction for prediction of molecular bioactivity for drug design," Bioinformatics, vol. 19, no. 6, pp. 764–771, 2003.

[7] Y. Liu, "A comparative study on feature selection methods for drug discovery," Journal of Chemical Information and Computer Sciences, vol. 44, no. 5, pp. 1823–1828, 2004.

[8] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, Eds., Feature Extraction: Foundations and Applications, vol. 207, Springer, 2006.

[9] A. Rakotomamonjy, "Variable selection using SVM-based criteria," Journal of Machine Learning Research, vol. 3, no. 7-8, pp. 1357–1370, 2003.

[10] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik, "Feature selection for SVMs," in Proceedings of the Conference on Neural Information Processing Systems, vol. 13, pp. 668–674, Denver, Colo, USA, 2000.

[11] D. Peleg and R. Meir, "A feature selection algorithm based on the global minimization of a generalization error bound," in Proceedings of the Conference on Neural Information Processing Systems, Vancouver, Canada, 2004.

[12] P. S. Bradley and O. L. Mangasarian, "Feature selection via concave minimization and support vector machines," in Proceedings of the International Conference on Machine Learning, pp. 82–90, San Francisco, Calif, USA, 1998.

[13] J. Weston, A. Elisseeff, B. Scholkopf, and M. Tipping, "Use of the zero-norm with linear models and kernel methods," Journal of Machine Learning Research, vol. 3, pp. 1439–1461, 2003.

[14] A. B. Chan, N. Vasconcelos, and G. R. G. Lanckriet, "Direct convex relaxations of sparse SVM," in Proceedings of the 24th International Conference on Machine Learning (ICML '07), pp. 145–153, Corvallis, Ore, USA, June 2007.

[15] G. M. Fung and O. L. Mangasarian, "A feature selection Newton method for support vector machine classification," Computational Optimization and Applications, vol. 28, no. 2, pp. 185–202, 2004.

[16] J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani, "1-norm support vector machines," Advances in Neural Information Processing Systems, vol. 16, no. 1, pp. 49–56, 2004.

[17] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society B: Methodological, vol. 58, no. 1, pp. 267–288, 1996.

[18] T. Zhang, "Analysis of multi-stage convex relaxation for sparse regularization," Journal of Machine Learning Research, vol. 11, pp. 1081–1107, 2010.

[19] R. Chartrand, "Exact reconstruction of sparse signals via nonconvex minimization," IEEE Signal Processing Letters, vol. 14, no. 10, pp. 707–710, 2007.

[20] R. Chartrand, "Nonconvex regularization for shape preservation," in Proceedings of the 14th IEEE International Conference on Image Processing (ICIP '07), vol. 1, pp. I293–I296, San Antonio, Tex, USA, September 2007.

[21] Z. Xu, H. Zhang, Y. Wang, X. Chang, and Y. Liang, "L_{1/2} regularization," Science China Information Sciences, vol. 53, no. 6, pp. 1159–1169, 2010.

[22] W. J. Chen and Y. J. Tian, "Lp-norm proximal support vector machine and its applications," Procedia Computer Science, vol. 1, no. 1, pp. 2417–2423, 2010.

[23] J. Liu, J. Li, W. Xu, and Y. Shi, "A weighted Lq adaptive least squares support vector machine classifiers – robust and sparse approximation," Expert Systems with Applications, vol. 38, no. 3, pp. 2253–2259, 2011.

[24] Y. Liu, H. H. Zhang, C. Park, and J. Ahn, "Support vector machines with adaptive Lq penalty," Computational Statistics & Data Analysis, vol. 51, no. 12, pp. 6380–6394, 2007.

[25] Z. Liu, S. Lin, and M. Tan, "Sparse support vector machines with Lp penalty for biomarker identification," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 1, pp. 100–107, 2010.

[26] A. Rakotomamonjy, R. Flamary, G. Gasso, and S. Canu, "lp-lq penalty for sparse linear and sparse multiple kernel multitask learning," IEEE Transactions on Neural Networks, vol. 22, no. 8, pp. 1307–1320, 2011.

[27] J. Y. Tan, Z. Zhang, L. Zhen, C. H. Zhang, and N. Y. Deng, "Adaptive feature selection via a new version of support vector machine," Neural Computing and Applications, vol. 23, no. 3-4, pp. 937–945, 2013.

[28] Z. B. Xu, X. Y. Chang, F. M. Xu, and H. Zhang, "L_{1/2} regularization: an iterative half thresholding algorithm," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 7, pp. 1013–1027, 2012.

[29] D. Ge, X. Jiang, and Y. Ye, "A note on the complexity of Lp minimization," Mathematical Programming, vol. 129, no. 2, pp. 285–299, 2011.

[30] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 1995.

[31] J. Fan and H. Peng, "Nonconcave penalized likelihood with a diverging number of parameters," The Annals of Statistics, vol. 32, no. 3, pp. 928–961, 2004.

[32] K. Knight and W. Fu, "Asymptotics for lasso-type estimators," The Annals of Statistics, vol. 28, no. 5, pp. 1356–1378, 2000.

[33] B. S. Tian and X. Q. Yang, "An interior-point l_{1/2}-penalty method for nonlinear programming," Technical Report, Department of Applied Mathematics, Hong Kong Polytechnic University, 2013.

[34] L. Armijo, "Minimization of functions having Lipschitz continuous first partial derivatives," Pacific Journal of Mathematics, vol. 16, pp. 1–3, 1966.

[35] F. John, "Extremum problems with inequalities as side-conditions," in Studies and Essays: Courant Anniversary Volume, pp. 187–204, 1948.

[36] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz, "UCI repository of machine learning databases," Technical Report 9702, Department of Information and Computer Science, University of California, Irvine, Calif, USA, 1998, http://archive.ics.uci.edu/ml.



