
Neurocomputing 72 (2008) 471–479


The hidden neurons selection of the wavelet networks using support vector machines and ridge regression

Min Han, Jia Yin

School of Electronic and Information Engineering, Dalian University of Technology, 2 Linggong Lu, Ganjingzi Qu, Dalian 116023, China

Received 23 March 2007; received in revised form 3 December 2007; accepted 6 December 2007

Communicated by J. Tin-Yau Kwok

Available online 31 December 2007

Abstract

A 1-norm support vector machine stepwise (SVMS) algorithm is proposed for the selection of the hidden neurons of wavelet networks (WNs). In this new algorithm, the linear programming support vector machine (LPSVM) is employed to pre-select the hidden neurons, and a stepwise selection algorithm based on ridge regression is then introduced to select hidden neurons from the pre-selection. The main advantages of the new algorithm are that it avoids the influence of ill conditioning of the matrix and that it can deal with problems involving a large number of candidate neurons or a large number of samples. Four examples are provided to illustrate the efficiency of the new algorithm.

© 2007 Elsevier B.V. All rights reserved.

Keywords: Wavelet network; Support vector machine; Hidden neurons selection; Ridge regression

1. Introduction

Dynamical system modeling and control [13,17,19] using artificial neural networks (ANNs) have been studied widely. Wavelet theory has also been studied extensively in recent years and has been applied widely in various areas of science and engineering. The idea of combining wavelets with neural networks has led to the development of wavelet networks (WNs) [18,22,6], in which wavelets are introduced as the activation functions of the hidden neurons of traditional feedforward neural networks with a linear output neuron.

The wavelet analysis procedure is implemented with dilated and translated versions of a mother wavelet. In theory, the dilation of a wavelet function can be any positive real value and the translation can be an arbitrary real number; this is referred to as the continuous wavelet transform. In practice, in order to improve computational efficiency, the values of the two parameters are often limited to some discrete lattices. However, in nonlinear and high-dimensional dynamical modeling, wavelet transforms often contain much redundant information. Therefore, it is necessary to use an efficient method to select the hidden neurons of the WNs. Several methods have been developed for selecting hidden neurons. Battiti [2] used mutual information to select the hidden neurons, Gomm and Yu [10] proposed a piecewise linearization based on Taylor decomposition, and Alonge et al. [1] applied a genetic algorithm to select the wavelet functions. Mallat and Zhifeng [14] developed the residual-based selection (RBS) algorithm. However, these methods are themselves complicated, and some even increase the computational burden. In recent years, Billings et al. [7,16,3] applied the forward orthogonal least-squares (OLS) algorithm to select the hidden neurons. It is known that ill-conditioning problems can be solved effectively by using the forward OLS algorithm [20,4]. However, this method may be time consuming, especially for high-dimensional problems. Xu and Ho [21] proposed an orthogonalized residual-based selection (ORBS) algorithm, which reduces the calculation of the OLS algorithm at the cost of precision. As discussed in Ref. [21], the computational burden of the selection algorithms is determined by the number of samples and the size of the basis library.

This work is inspired by Fung and Mangasarian [9], who proposed a fast Newton method for a linear programming formulation of support vector classifiers. This 1-norm support vector machine (1-norm SVM) can handle classification problems in very high-dimensional spaces and generates a classifier that depends on very few input features. A similar 1-norm SVM is therefore employed in the new algorithm to pre-select the neurons from the basis library. Note, however, that some research [11] has shown that support vector machines (SVMs) are not always able to construct parsimonious structures in system identification.

In this paper, a 1-norm support vector machine stepwise (SVMS) algorithm is proposed for selecting the hidden neurons of WNs, where the 1-norm SVM is first used to pre-select significant neurons and a stepwise selection algorithm based on ridge regression is then applied to select the most important neurons. The paper is organized as follows. In Section 2, a brief review of WNs and the SVM is given. The SVMS selection algorithm is described in Section 3. In Section 4, four examples are simulated to illustrate the performance of the new algorithm. Finally, conclusions are given in Section 5.

2. A brief review on wavelet networks and 1-norm SVM

2.1. Wavelet networks

Let $\psi(x)$ be a mother wavelet, and assume that there exists a denumerable family derived from $\psi(x)$,

$$X = \left\{ \psi_{(a_t, b_t)} : \psi_{(a_t, b_t)}(x) = \frac{1}{\sqrt{a_t}}\,\psi\!\left(\frac{x - b_t}{a_t}\right),\ a_t \in \mathbb{R}^{+},\ b_t \in \mathbb{R} \right\},$$

where $a_t$ is the scale factor, $b_t$ is the translation factor, and the factor $1/\sqrt{a_t}$ provides energy normalization across different scales. An arbitrary signal can be reconstructed from the wavelet basis functions [3]:

$$f(x) = \sum_{t \in C} c_t\,\psi_{(a_t, b_t)}(x) = \sum_{t \in C} c_t\,\frac{1}{\sqrt{a_t}}\,\psi\!\left(\frac{x - b_t}{a_t}\right), \qquad (1)$$

where $c_t$ stands for the wavelet transform coefficients and $C$ is an index set which may be finite or infinite. Eq. (1) is called the wavelet frame decomposition. In practice, the discrete wavelet decomposition is often chosen to improve the computational efficiency. The most popular approach is to restrict the dilation and translation parameters to dyadic lattices, $a_t = 2^{-j}$ and $b_t = k2^{-j}$ with $j, k \in \mathbb{Z}$ ($\mathbb{Z}$ is the set of all integers). In this case, Eq. (1) becomes

$$f(x) = \sum_{j}\sum_{k} c_{j,k}\,\psi_{j,k}(x), \qquad (2)$$

where $c_{j,k}$ stands for the wavelet transform coefficient and $\psi_{j,k}(x) = 2^{j/2}\,\psi(2^{j}x - k)$ with $j, k \in \mathbb{Z}$.

In practical applications of WNs, it is unnecessary and impossible to represent a signal using infinitely many wavelets. Therefore, Eq. (2) can be expressed as

$$f(x) = \sum_{j = j_0}^{j_{\max}} \sum_{k \in K_j} c_{j,k}\,\psi_{j,k}(x), \qquad (3)$$

where $j_0$ is the coarsest resolution, $j_{\max}$ is the finest resolution, and $K_j$, which depends on the dilation parameter, is a subset of $\mathbb{Z}$. The family $X$ then becomes

$$X_1 = \{\psi_{j,k} : j = j_0, j_0 + 1, \ldots, j_{\max};\ k \in K_j\}. \qquad (4)$$

Assume that the number of elements in $X_1$ is $M$. Then Eq. (3) can be expressed as

$$f(x) = \sum_{m=1}^{M} c_m\,2^{j_m/2}\,\psi\!\left(2^{j_m}x - k_m\right). \qquad (5)$$

According to Eq. (5), the structure of the WN used in this paper is similar to that of radial basis function (RBF) networks, except that the activation functions are wavelet bases $\psi_m(\cdot)$ rather than RBFs. The new algorithm for node selection is introduced in the next section; the WN itself can be trained using least-squares methods.

The result for the one-dimensional (1-D) case described above can be extended to higher dimensions [6,3]. First, an n-dimensional wavelet function can be expressed as

$$\psi^{[n]}_{j,k}(\mathbf{x}) = \psi^{[n]}_{j,k}(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n}\psi_{j,k}(x_i), \qquad \mathbf{x} = [x_1, x_2, \ldots, x_n],\quad i = 1, 2, \ldots, n, \qquad (6)$$

where $\mathbf{x}$ is the n-dimensional input and the superscript $[n]$ denotes the dimension of the wavelet function. The analysis of variance (ANOVA) decomposition [3] is then used to simplify the n-dimensional wavelet function. The main idea of ANOVA, Eq. (7), is to decompose a high-dimensional function into lower-dimensional ones:

$$f(\mathbf{x}) = f(x_1, x_2, \ldots, x_n) = f_0 + \sum_{i=1}^{n} f_i(x_i) + \sum_{1 \le i < j \le n} f_{ij}(x_i, x_j) + \sum_{1 \le i < j < k \le n} f_{ijk}(x_i, x_j, x_k) + \cdots + f_{12\cdots n}(x_1, x_2, \ldots, x_n) + e, \qquad (7)$$

where $e$ is the error of the ANOVA decomposition, and $f_i(\cdot)$, $f_{ij}(\cdot)$, $f_{ijk}(\cdot)$ and $f_{12\cdots n}(\cdot)$ are one-dimensional (1-D), two-dimensional (2-D), three-dimensional (3-D) and n-dimensional functions, respectively.
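To make the construction of a finite candidate library from Eqs. (3)–(7) concrete, the following Python sketch enumerates dyadic 1-D wavelet neurons and tensor-product 2-D ANOVA terms. It is a minimal, illustrative sketch under assumed choices (Mexican hat mother wavelet, resolution ranges, lattice truncation, random test inputs); it is not the authors' implementation, which was written in MATLAB.

```python
import numpy as np
from itertools import combinations

def mexican_hat(x):
    """1-D Mexican hat mother wavelet, psi(x) = (1 - x^2) exp(-x^2 / 2)."""
    return (1.0 - x**2) * np.exp(-0.5 * x**2)

def dyadic_lattice_1d(j_coarsest=0, j_finest=3, support=(0.0, 1.0)):
    """Enumerate dyadic (j, k) lattice points whose wavelets act on `support`."""
    lattice = []
    for j in range(j_coarsest, j_finest + 1):
        # keep translations k for which 2^j * x - k stays near the wavelet's effective support
        k_min = int(np.floor(2**j * support[0])) - 2
        k_max = int(np.ceil(2**j * support[1])) + 2
        lattice += [(j, k) for k in range(k_min, k_max + 1)]
    return lattice

def neuron_response(x_col, j, k):
    """Response of the dyadic wavelet neuron psi_{j,k} to one input column."""
    return 2.0**(j / 2.0) * mexican_hat(2.0**j * x_col - k)

def build_candidate_library(X, j_finest_1d=3, j_finest_2d=1):
    """Candidate hidden-neuron matrix P (N samples x M neurons) from 1-D and 2-D ANOVA terms."""
    N, n = X.shape
    columns, tags = [], []
    for i in range(n):                                   # 1-D terms f_i(x_i)
        for (j, k) in dyadic_lattice_1d(0, j_finest_1d):
            columns.append(neuron_response(X[:, i], j, k))
            tags.append(("1d", i, j, k))
    for i1, i2 in combinations(range(n), 2):             # 2-D terms f_ij(x_i, x_j)
        for (j, k) in dyadic_lattice_1d(0, j_finest_2d):
            columns.append(neuron_response(X[:, i1], j, k) * neuron_response(X[:, i2], j, k))
            tags.append(("2d", (i1, i2), j, k))
    return np.column_stack(columns), tags

# Example: a 6-lag input matrix normalized to [0, 1]
X = np.random.rand(500, 6)
P, tags = build_candidate_library(X)
print(P.shape)  # (500, M) -- M candidate neurons before any selection
```

In practice only the neurons whose effective support overlaps the normalized data range are kept, which is what the lattice truncation above approximates; it is this redundant library that the selection algorithms of the following sections prune.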

2.2. 1-norm support vector machines and its Newton algorithm

Given a set of data $\{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_N, y_N)\}$, where $\mathbf{x}_i \in \mathbb{R}^{m}$ and $y_i \in \mathbb{R}$, the SVM is initially used to map the input data $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}$ into a high-dimensional feature space $F$ by using a function column vector $\Phi(\cdot)$, and a linear regression is then performed in this feature space, so that

$$f(\mathbf{X}) = \mathbf{w}^{\mathrm{T}}\cdot\Phi(\mathbf{X}) + b, \qquad (8)$$


where $\mathbf{w} \in F$, $b$ is the threshold, $\mathbf{w}^{\mathrm{T}}\cdot\Phi(\mathbf{X})$ denotes the dot product of $\mathbf{w}^{\mathrm{T}}$ and $\Phi(\mathbf{X})$, and the superscript 'T' denotes the transpose. In this paper, $\mathbf{X} = \Phi(\mathbf{X})$ is applied to reduce the calculation of the SVM, so Eq. (8) can be rewritten as

$$y = \mathbf{w}^{\mathrm{T}}\mathbf{X} + b. \qquad (9)$$

To solve for the two unknown variables $(\mathbf{w}, b)$ in Eq. (9), the following function is minimized:

$$\min\ \|\mathbf{w}\|_1 + C\left(\mathbf{e}^{\mathrm{T}}\mu(\xi_i) + \mathbf{e}^{\mathrm{T}}\mu(\xi_i^{*})\right)\quad \text{s.t.}\quad y_i - f(\mathbf{x}_i) \le \xi_i,\quad f(\mathbf{x}_i) - y_i \le \xi_i^{*},\quad \xi_i,\ \xi_i^{*} \ge 0, \qquad (10)$$

where $\|\cdot\|_1$ denotes the 1-norm, $C$ is the regularization factor, which can be learned by cross-validation, $\xi_i, \xi_i^{*}$ are the training errors, and $\mu(\cdot)$ is the loss function. A linear loss function is used in the present study:

$$\mu(\xi) = \xi, \qquad (11)$$

where $\xi$ stands for the training error $\xi_i$ or $\xi_i^{*}$.

However, the optimization problem is difficult to solve in the form of Eq. (10). Given $\bar{\varepsilon} > 0$ and any $\varepsilon \in (0, \bar{\varepsilon})$, the problem can be reformulated as

$$\min_{\mathbf{u} \in \mathbb{R}^{2m}} f(\mathbf{u}) = \varepsilon\,[\,\mathbf{y}^{\mathrm{T}}\ \ {-\mathbf{y}^{\mathrm{T}}}\,]\,\mathbf{u} + \tfrac{1}{2}\Big( \big\|\big([\,\mathbf{I}\ \ \mathbf{0}\,]\mathbf{u} - C\mathbf{e}\big)_{+}\big\|^2 + \big\|\big([\,\mathbf{0}\ \ \mathbf{I}\,]\mathbf{u} - C\mathbf{e}\big)_{+}\big\|^2 + \big\|\big([\,{-\mathbf{X}^{\mathrm{T}}}\ \ \mathbf{X}^{\mathrm{T}}\,]\mathbf{u} - \mathbf{e}\big)_{+}\big\|^2 + \big\|\big([\,\mathbf{X}^{\mathrm{T}}\ \ {-\mathbf{X}^{\mathrm{T}}}\,]\mathbf{u} - \mathbf{e}\big)_{+}\big\|^2 + \big([\,{-\mathbf{e}^{\mathrm{T}}}\ \ \mathbf{e}^{\mathrm{T}}\,]\mathbf{u}\big)^2 + \big\|(-\mathbf{u})_{+}\big\|^2 \Big), \qquad (12)$$

where $\mathbf{u}$ denotes the dual variables and $\mathbf{e}$ is a column vector of ones of appropriate dimension. Then $(\mathbf{w}, b)$ can be calculated as follows:

$$\mathbf{w} = \mathbf{p} - \mathbf{q} = \frac{1}{\varepsilon}\left([\,{-\mathbf{X}^{\mathrm{T}}}\ \ \mathbf{X}^{\mathrm{T}}\,]\mathbf{u} - \mathbf{e}\right)_{+} - \frac{1}{\varepsilon}\left([\,\mathbf{X}^{\mathrm{T}}\ \ {-\mathbf{X}^{\mathrm{T}}}\,]\mathbf{u} - \mathbf{e}\right)_{+}, \qquad b = \frac{1}{\varepsilon}\,[\,{-\mathbf{e}^{\mathrm{T}}}\ \ \mathbf{e}^{\mathrm{T}}\,]\,\mathbf{u}.$$

See the Appendix for more details of the 1-norm linear programming SVM.

3. The 1-norm SVM stepwise (SVMS) selection algorithm

How to select the hidden neurons has been studied extensively in recent years. Several OLS algorithms have been well developed, such as the classical Gram–Schmidt (CGS) algorithm, the modified Gram–Schmidt (MGS) algorithm [8] and the Householder algorithm. However, these algorithms are not only complicated, but also sensitive to the ill conditioning of the matrices and to the round-off error of the computer. The ORBS algorithm was proposed based on the OLS and RBS algorithms and reduces the calculation of the OLS algorithm, but the ORBS algorithm [21] may select more hidden neurons than the OLS algorithm. The new algorithm selects the hidden neurons in two steps. First, a pre-selection is performed based on the 1-norm SVM, and then a new stepwise selection algorithm is applied to select the optimal neurons.

3.1. The pre-selection based on 1-norm SVM

By minimizing an exterior penalty function of the dual of a linear programming formulation of the 1-norm SVM, for a finite value of the penalty parameter, an exact least 2-norm solution to the SVM regression is obtained. Our approach is based on the 1-norm SVM formulation (10), which generates very sparse solutions corresponding to the candidate hidden neurons. Assume that the size of the hidden neuron library is $M$ and the number of training samples is $N$. Every candidate neuron then forms an $N$-dimensional vector $\mathbf{p}_i = [p_{i1}, p_{i2}, \ldots, p_{iN}]^{\mathrm{T}}$ ($i = 1, 2, \ldots, M$), and a matrix $\mathbf{P}$ consisting of the $\mathbf{p}_i$ is formed. According to Eq. (5), the output of the WN can be expressed as

$$\mathbf{y} = \sum_{i=1}^{M} \mathbf{p}_i\,w_i = \mathbf{P}\mathbf{w}, \qquad \mathbf{y} = [y_1, y_2, \ldots, y_N]^{\mathrm{T}}, \qquad (13)$$

where $\mathbf{y}$ is the desired output of the WN and $w_i$ ($i = 1, 2, \ldots, M$) is the weight of the $i$th neuron. Identifying the weight vector $\mathbf{w}$ in Eq. (13) with the $\mathbf{w}$ in Eq. (9), Eq. (13) takes the same form as Eq. (9). So $\mathbf{P}$ can be used as the input data of the SVM and $\mathbf{y}$ as the desired output of the SVM. Therefore, the techniques introduced in Section 2 can be applied to select support vectors from the $\mathbf{p}_i$ ($i = 1, 2, \ldots, M$):

$$\min_{\mathbf{u} \in \mathbb{R}^{2m}} f(\mathbf{u}) = \varepsilon\,[\,\mathbf{y}^{\mathrm{T}}\ \ {-\mathbf{y}^{\mathrm{T}}}\,]\,\mathbf{u} + \tfrac{1}{2}\Big( \big\|\big([\,\mathbf{I}\ \ \mathbf{0}\,]\mathbf{u} - C\mathbf{e}\big)_{+}\big\|^2 + \big\|\big([\,\mathbf{0}\ \ \mathbf{I}\,]\mathbf{u} - C\mathbf{e}\big)_{+}\big\|^2 + \big\|\big([\,{-\mathbf{P}^{\mathrm{T}}}\ \ \mathbf{P}^{\mathrm{T}}\,]\mathbf{u} - \mathbf{e}\big)_{+}\big\|^2 + \big\|\big([\,\mathbf{P}^{\mathrm{T}}\ \ {-\mathbf{P}^{\mathrm{T}}}\,]\mathbf{u} - \mathbf{e}\big)_{+}\big\|^2 + \big([\,{-\mathbf{e}^{\mathrm{T}}}\ \ \mathbf{e}^{\mathrm{T}}\,]\mathbf{u}\big)^2 + \big\|(-\mathbf{u})_{+}\big\|^2 \Big). \qquad (14)$$

A Newton–Armijo algorithm [9,12] is applied to solve the problem formulated by Eq. (14). The gradient of the function $f(\mathbf{u})$ is given by

$$\nabla f(\mathbf{u}) = \varepsilon\begin{bmatrix}\mathbf{y}\\ -\mathbf{y}\end{bmatrix} + (\mathbf{u} - C\mathbf{e})_{+} + \begin{bmatrix}-\mathbf{P}\\ \mathbf{P}\end{bmatrix}\big([\,{-\mathbf{P}^{\mathrm{T}}}\ \ \mathbf{P}^{\mathrm{T}}\,]\mathbf{u} - \mathbf{e}\big)_{+} + \begin{bmatrix}\mathbf{P}\\ -\mathbf{P}\end{bmatrix}\big([\,\mathbf{P}^{\mathrm{T}}\ \ {-\mathbf{P}^{\mathrm{T}}}\,]\mathbf{u} - \mathbf{e}\big)_{+} + \begin{bmatrix}-\mathbf{e}\\ \mathbf{e}\end{bmatrix}[\,{-\mathbf{e}^{\mathrm{T}}}\ \ \mathbf{e}^{\mathrm{T}}\,]\mathbf{u} - (-\mathbf{u})_{+} \qquad (15)$$


and its generalized Hessian is defined as

$$\partial^2 f(\mathbf{u}) = \begin{bmatrix}\mathbf{P}\\ -\mathbf{P}\end{bmatrix}\operatorname{diag}\Big(\big([\,\mathbf{P}^{\mathrm{T}}\ \ {-\mathbf{P}^{\mathrm{T}}}\,]\mathbf{u} - \mathbf{e}\big)_{*}\Big)[\,\mathbf{P}^{\mathrm{T}}\ \ {-\mathbf{P}^{\mathrm{T}}}\,] + \begin{bmatrix}-\mathbf{e}\\ \mathbf{e}\end{bmatrix}[\,{-\mathbf{e}^{\mathrm{T}}}\ \ \mathbf{e}^{\mathrm{T}}\,] + \operatorname{diag}\big((\mathbf{u} - C\mathbf{e})_{*} + (-\mathbf{u})_{*}\big), \qquad (16)$$

where $(\cdot)_{*}$ denotes the step function applied componentwise, i.e. $(z)_{*} = 1$ if $z > 0$ and $0$ otherwise.

Set the parameter values $C$, $\varepsilon$, $\delta$, the tolerance $sse$ and $i_{\max}$ (typically $\varepsilon = 4\times10^{-4}$ for the 1-norm SVM [15], while $C$ and $\delta$ are set according to the problem studied). Start with any $\mathbf{u}^{0} \in \mathbb{R}^{2m}$. For $i = 0, 1, \ldots$:

Step 1: $\mathbf{u}^{i+1} = \mathbf{u}^{i} - \lambda_i\big(\partial^2 f(\mathbf{u}^{i}) + \delta\mathbf{I}\big)^{-1}\nabla f(\mathbf{u}^{i}) = \mathbf{u}^{i} + \lambda_i\mathbf{d}^{i}$, where the Armijo stepsize $\lambda_i = \max\{1, \tfrac{1}{2}, \tfrac{1}{4}, \ldots\}$ is such that

$$f(\mathbf{u}^{i}) - f(\mathbf{u}^{i} + \lambda_i\mathbf{d}^{i}) \ge -\frac{\lambda_i}{4}\,\nabla f(\mathbf{u}^{i})^{\mathrm{T}}\mathbf{d}^{i},$$

and $\mathbf{d}^{i}$ is the modified Newton direction

$$\mathbf{d}^{i} = -\big(\partial^2 f(\mathbf{u}^{i}) + \delta\mathbf{I}\big)^{-1}\nabla f(\mathbf{u}^{i}).$$

Step 2: Stop if $\|\mathbf{u}^{i} - \mathbf{u}^{i+1}\| \le sse$ or $i = i_{\max}$. Otherwise, set $i = i + 1$ and go to Step 1.

Step 3: Define the 2-norm solution of the linear programming SVM (8) by Eq. (12) with $\mathbf{u} = \mathbf{u}^{i}$. Then $\mathbf{w}$ and $b$ can be calculated from this solution.

According to Eq. (9), the SVM can be rewritten as

$$\mathbf{y} = \mathbf{P}\mathbf{w} + b. \qquad (17)$$
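Before turning to the output-variance criterion below, the Newton–Armijo iteration of Steps 1–3 can be prototyped directly from the penalty (14), its gradient (15) and generalized Hessian (16) as reconstructed above. The following Python/NumPy sketch is illustrative only (the paper reports a MATLAB implementation); the helper names, the dense Hessian assembly and the random test data are assumptions, not the authors' code.

```python
import numpy as np

def plus(z):
    """Plus function (z)_+ = max(z, 0), applied componentwise."""
    return np.maximum(z, 0.0)

def step(z):
    """Subgradient of the plus function: 1 where z > 0, else 0."""
    return (z > 0.0).astype(float)

def make_penalty(P, y, C, eps):
    """Exterior penalty f(u), its gradient and generalized Hessian for Eq. (14).
    P is N x M (samples x candidate neurons); u lives in R^{2N}."""
    N = P.shape[0]
    A = np.hstack([-P.T, P.T])            # [-P^T  P^T], shape M x 2N
    c = np.concatenate([y, -y])           # linear term is eps * c^T u
    d = np.concatenate([-np.ones(N), np.ones(N)])
    eM = np.ones(P.shape[1])

    def f(u):
        return (eps * (c @ u)
                + 0.5 * (np.sum(plus(u - C)**2)          # the two [I 0]/[0 I] blocks combined
                         + np.sum(plus(A @ u - eM)**2)
                         + np.sum(plus(-A @ u - eM)**2)
                         + (d @ u)**2
                         + np.sum(plus(-u)**2)))

    def grad(u):
        return (eps * c + plus(u - C)
                + A.T @ plus(A @ u - eM)
                - A.T @ plus(-A @ u - eM)
                + d * (d @ u)
                - plus(-u))

    def hess(u):
        # Generalized Hessian assembled directly from the penalty terms above
        s = step(A @ u - eM) + step(-A @ u - eM)
        return (A.T * s) @ A + np.outer(d, d) + np.diag(step(u - C) + step(-u))

    return f, grad, hess, A, d

def newton_armijo(P, y, C=1.0, eps=4e-4, delta=1e-3, sse=1e-6, imax=25):
    """Damped Newton iteration with Armijo backtracking for the penalty problem."""
    f, grad, hess, A, d = make_penalty(P, y, C, eps)
    u = np.zeros(2 * P.shape[0])
    for _ in range(imax):
        g = grad(u)
        direction = np.linalg.solve(hess(u) + delta * np.eye(u.size), -g)
        lam = 1.0
        while f(u) - f(u + lam * direction) < -0.25 * lam * (g @ direction):
            lam *= 0.5
            if lam < 1e-8:
                break
        u_new = u + lam * direction
        if np.linalg.norm(u_new - u) <= sse:
            u = u_new
            break
        u = u_new
    # Recover the sparse weights and the bias as in the closing formulas of Section 2.2
    w = (plus(A @ u - np.ones(A.shape[0])) - plus(-A @ u - np.ones(A.shape[0]))) / eps
    b = (d @ u) / eps
    return w, b, u

# Tiny smoke test with random data (columns of P stand in for candidate neuron outputs)
rng = np.random.default_rng(0)
P = rng.standard_normal((50, 20))
y = P[:, :3] @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.standard_normal(50)
w, b, _ = newton_armijo(P, y, C=10.0)
print(np.flatnonzero(np.abs(w) > 1e-6))   # indices of pre-selected candidate neurons
```

The division by $\varepsilon$ in the recovery step mirrors the exterior-penalty construction; in practice the resulting $\mathbf{w}$ is thresholded and only the neurons with nonzero weight are carried forward to the stepwise stage.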

Then the output variance can be expressed as

$$\mathbf{y}^{\mathrm{T}}\mathbf{y} = (\mathbf{P}\mathbf{w} + b)^{\mathrm{T}}(\mathbf{P}\mathbf{w} + b) = \mathbf{w}^{\mathrm{T}}\mathbf{P}^{\mathrm{T}}\mathbf{P}\mathbf{w} + b^{\mathrm{T}}\mathbf{P}\mathbf{w} + \mathbf{w}^{\mathrm{T}}\mathbf{P}^{\mathrm{T}}b + b^{\mathrm{T}}b. \qquad (18)$$

Since $b^{\mathrm{T}}\mathbf{P}\mathbf{w} = \mathbf{w}^{\mathrm{T}}\mathbf{P}^{\mathrm{T}}b$, substituting Eq. (17) into Eq. (18) gives

$$\mathbf{y}^{\mathrm{T}}\mathbf{y} = \mathbf{w}^{\mathrm{T}}\mathbf{P}^{\mathrm{T}}\mathbf{P}\mathbf{w} + 2b^{\mathrm{T}}\mathbf{y} - b^{\mathrm{T}}b = \sum_{i=1}^{M} w_i\,\mathbf{p}_i^{\mathrm{T}}(\mathbf{y} - b) + 2b^{\mathrm{T}}\mathbf{y} - b^{\mathrm{T}}b. \qquad (19)$$

Note that the output variance consists of two parts, the polynomial $\sum_{i=1}^{M} w_i\,\mathbf{p}_i^{\mathrm{T}}(\mathbf{y} - b)$ and the constant $(2b^{\mathrm{T}}\mathbf{y} - b^{\mathrm{T}}b)$. Thus, $w_i\,\mathbf{p}_i^{\mathrm{T}}(\mathbf{y} - b)$ is the increment to the desired output variance brought by $\mathbf{p}_i$, and the $i$th $ERR_i$ criterion, introduced by $\mathbf{p}_i$, can be defined as

$$ERR_i = \frac{w_i\,\mathbf{p}_i^{\mathrm{T}}(\mathbf{y} - b)}{\mathbf{y}^{\mathrm{T}}\mathbf{y} - 2b^{\mathrm{T}}\mathbf{y} + b^{\mathrm{T}}b} \propto \frac{w_i\,\mathbf{p}_i^{\mathrm{T}}(\mathbf{y} - b)}{\mathbf{y}^{\mathrm{T}}\mathbf{y}}. \qquad (20)$$

According to Eq. (20), the $i$th candidate hidden neuron makes no contribution to the desired output variance if $w_i = 0$. However, the $ERR_i$ criterion cannot exactly represent the contribution of $\mathbf{p}_i$ to the output variance, so only the candidate hidden neurons with $w_i \neq 0$ are selected. It is also unwise to select the hidden neurons directly by the parameter $ERR_i$, because the sum of all nonzero $ERR_i$ is not equal to 1.
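As a small illustration of this pre-selection step, the snippet below evaluates the ERR criterion of Eq. (20) and keeps only the neurons with nonzero weight; the weight vector w and bias b are assumed to come from a 1-norm SVM solver such as the sketch above, and the tolerance used to declare a weight "nonzero" is an assumption.

```python
import numpy as np

def preselect_neurons(P, y, w, b, tol=1e-8):
    """Keep candidate neurons with nonzero 1-norm-SVM weight; report ERR_i of Eq. (20)."""
    resid = y - b                               # (y - b) in Eq. (20)
    err = w * (P.T @ resid) / (y @ y)           # ERR_i proportional to w_i p_i^T (y - b) / y^T y
    keep = np.flatnonzero(np.abs(w) > tol)      # pre-selected neuron indices
    order = keep[np.argsort(-np.abs(err[keep]))]
    return order, err

# Usage (P, y, w, b as in the previous sketch):
# idx, err = preselect_neurons(P, y, w, b)
```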

3.2. Stepwise selection algorithm based on ridge regression

Assume that $M_0$ vectors $\mathbf{p}_i^{*}$ ($i = 1, 2, \ldots, M_0$; with the use of the 1-norm SVM, $M_0 \le M$) have been selected as support vectors. However, the pre-selected neurons are still large in number and cannot be used directly to construct the WN. Therefore, a simple but efficient method is introduced in this paper to minimize $M_0$. Assume that $\mathbf{y}$ ($\mathbf{y} = [y_1, y_2, \ldots, y_N]^{\mathrm{T}}$) belongs to an $N$-dimensional hyperspace, and that the $\mathbf{p}_i^{*}$ selected by the SVM also belong to this $N$-dimensional hyperspace. Through the pre-selection of the 1-norm SVM, Eq. (13) can be rewritten as

$$\mathbf{y} = \sum_{i=1}^{M_0} \mathbf{p}_i^{*}\,w_i, \qquad (21)$$

where $w_i$ is the weight corresponding to the hidden neuron. The output $\mathbf{y}$ can then be viewed as a linear combination of the $\mathbf{p}_i^{*}$. The goal of the WN is to use a few vectors $\mathbf{p}_i^{*}w_i$ to approximate $\mathbf{y}$ in the $N$-dimensional hyperspace, and the new algorithm selects the vector (neuron) $\mathbf{p}_i^{*}w_i$ that is closest to the destination $\mathbf{D}$. Initially, $\mathbf{D}_0$ is equal to the desired output $\mathbf{y}$, and thereafter $\mathbf{D}_{j+1} = \mathbf{D}_j - \mathbf{p}_j^{*}w_j$. Let the set $M_0$ consist of the pre-selected hidden neurons; the procedure of the stepwise selection algorithm is as follows.

Initialization: Initialize the regularization factor $\lambda$, the penalty factor $\lambda_n$ of $\lambda$ ($0 < \lambda_n < 1$), the goal $\mathbf{D}_0 = \mathbf{y}$, the desired learning accuracy $sse$, and $j = 0$.

Step 1: Calculate the criterion $d_{\bar{j}}$:

$$d_{\bar{j}} = \min_i\left\{ d_i : d_i = \left\|\mathbf{D}_j - \mathbf{p}_i^{*}w_i\right\|^2 + \frac{\lambda}{2}\,\|w_i\|^2,\ i \in M_0 \right\}, \qquad (22)$$

where $\lambda$ is a regularization coefficient which balances the accuracies of learning and generalization; $d_i$ and $w_i$ are calculated by setting the derivative of Eq. (22) with respect to $w_i$ equal to zero, which gives $w_i = \mathbf{p}_i^{*\mathrm{T}}\mathbf{D}_j / (\mathbf{p}_i^{*\mathrm{T}}\mathbf{p}_i^{*} + \lambda/2)$.

Step 2: If $\|\mathbf{D}_j\| > \|\mathbf{D}_j - \mathbf{p}_{\bar{j}}^{*}w_{\bar{j}}\|$, set $\mathbf{D}_{j+1} = \mathbf{D}_j - \mathbf{p}_{\bar{j}}^{*}w_{\bar{j}}$, remove the index $\bar{j}$ from $M_0$ and store it in $M_1$; otherwise, set $\lambda = \lambda_n\lambda$ and go to Step 1.

Step 3: If $\|\mathbf{D}_{j+1}\| < sse$ or $M_0 = \varnothing$, stop; otherwise, set $j = j + 1$ and go to Step 1. Here $sse$ is a pre-determined threshold that limits the training error of the WN, and $M_1$ is the set of the final hidden neurons; the value of $sse$ is set according to the problem studied by the WN. In this algorithm, the norm of the goal vector $\mathbf{D}$ should become smaller and smaller as new neurons are selected from $M_0$; otherwise, the limitation on the weight is relaxed by reducing $\lambda$. The larger $\lambda$ is, the larger the effect of the weight $w_i$ on the learning accuracy. When the norm of $\mathbf{D}$ is smaller than $sse$, the procedure stops and the selection of the hidden neurons is complete.
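A compact version of this stepwise procedure might look as follows. It is a sketch under the reconstruction above (in particular the ridge term λ/2 and the resulting closed-form weight), not the authors' code, and the stopping parameters and the max_neurons guard are illustrative assumptions.

```python
import numpy as np

def stepwise_ridge_select(P_sel, y, lam=0.1, lam_pen=0.5, sse=1e-3, max_neurons=None):
    """Greedy stepwise selection over pre-selected neuron responses (columns of P_sel).
    At each step the neuron with the smallest ridge-regularized residual d_i (Eq. (22))
    is moved from the candidate set M0 to the final set M1."""
    N, M0 = P_sel.shape
    candidates = list(range(M0))
    selected, weights = [], []
    D = y.copy()                                   # D_0 = y
    max_neurons = max_neurons or M0
    while candidates and np.linalg.norm(D) > sse and len(selected) < max_neurons:
        best_i, best_d, best_w = None, np.inf, 0.0
        for i in candidates:
            p = P_sel[:, i]
            w = (p @ D) / (p @ p + lam / 2.0)      # minimizer of ||D - p w||^2 + (lam/2) w^2
            d = np.sum((D - p * w)**2) + 0.5 * lam * w**2
            if d < best_d:
                best_i, best_d, best_w = i, d, w
        new_D = D - P_sel[:, best_i] * best_w
        if np.linalg.norm(new_D) < np.linalg.norm(D):
            D = new_D
            selected.append(best_i)
            weights.append(best_w)
            candidates.remove(best_i)
        else:
            lam *= lam_pen                         # relax the ridge penalty and retry
            if lam < 1e-12:
                break
    return selected, np.array(weights), D

# Usage: columns of P restricted to the SVM-pre-selected indices
# idx, w_sel, residual = stepwise_ridge_select(P[:, keep_idx], y)
```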

Fig. 1. Prediction with SVMS for Mackey–Glass series.

In fact, the stepwise selection algorithm cannot select the best subset from $M_0$, because only one vector at a time is considered in Eq. (22); it is possible that a combination of two or more of the original vectors would give a smaller $d_{\bar{j}}$ than a single vector. However, this algorithm avoids the influence of the ill conditioning of the matrix while retaining a modest accuracy.

4. Simulations

Four examples are provided to illustrate the performance of the new algorithm. It should be noted that the original data set used for the WN is first normalized, so the WN operates on normalized variables. The normalization is based on the operating domain of the wavelet activation functions used in the WN rather than on physical insight. All the simulations are carried out in the MATLAB 7 environment running on a Pentium 4 with a 1.6 GHz CPU and 512 MB of memory.

The performance of the WN can be evaluated by the root mean square error ($E_{\mathrm{RMSE}}$)

$$E_{\mathrm{RMSE}} = \left(\frac{1}{N-1}\sum_{t=1}^{N}\big[y(t) - d(t)\big]^2\right)^{1/2}, \qquad (23)$$

where $y(t)$ is the predicted value, $d(t)$ is the desired value and $N$ is the number of test samples. $E_{\mathrm{RMSE}}$ reflects the absolute deviation between the predicted value and the desired value; in the ideal situation, if there is no error in the prediction, $E_{\mathrm{RMSE}}$ is equal to 0.
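For reference, the error measure of Eq. (23) is straightforward to compute; the 1/(N-1) normalization below follows the equation as printed.

```python
import numpy as np

def rmse(y_pred, y_true):
    """Root mean square error as in Eq. (23), with the 1/(N-1) normalization."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    return np.sqrt(np.sum((y_pred - y_true)**2) / (len(y_true) - 1))
```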

4.1. The Mackey–Glass delay-differential equation

This data set is generated by the Mackey–Glass delay-differential equation

$$\frac{\mathrm{d}x(t)}{\mathrm{d}t} = -0.1\,x(t) + \frac{0.2\,x(t-\tau)}{1 + x^{10}(t-\tau)}, \qquad (24)$$

where the time delay $\tau$ is chosen to be 30 in this example. Setting the initial condition $x(t) = 0.9$ for $0 \le t \le \tau$, a Runge–Kutta integration algorithm is applied to solve Eq. (24) with an integration step $\Delta t = 0.01$, and 1000 equispaced samples $x(t)$ ($t = 1, 2, \ldots, 1000$) are extracted with a sampling interval of $T = 0.06$ time units. The data set is normalized into the interval $[0, 1]$ using the information that $0.2 \le x(t) \le 1.4$.
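The series can be reproduced approximately with the following sketch; it uses a simple fixed-step Euler scheme for the delay-differential equation rather than the Runge–Kutta integration mentioned above, and the sampling and normalization details are taken from the description in this subsection.

```python
import numpy as np

def mackey_glass(n_samples=1000, tau=30.0, dt=0.01, sample_T=0.06, x0=0.9):
    """Approximate Mackey-Glass series of Eq. (24) via explicit Euler with a history buffer."""
    delay_steps = int(round(tau / dt))
    sample_every = int(round(sample_T / dt))
    n_steps = n_samples * sample_every + delay_steps
    x = np.empty(n_steps)
    x[:delay_steps + 1] = x0                       # x(t) = 0.9 for 0 <= t <= tau
    for k in range(delay_steps, n_steps - 1):
        x_tau = x[k - delay_steps]
        x[k + 1] = x[k] + dt * (-0.1 * x[k] + 0.2 * x_tau / (1.0 + x_tau**10))
    series = x[delay_steps::sample_every][:n_samples]
    return (series - 0.2) / (1.4 - 0.2)            # normalize using 0.2 <= x(t) <= 1.4

x = mackey_glass()
train, test = x[:500], x[500:]                     # 500 points for training, the rest for testing
```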

The data set is divided into two parts: 500 data points are used to train the WN, and the others are used to test the network. Following [3], the model order is set to 6. The model can be expressed as

$$y(t) = \sum_{1 \le l_1 \le 6} f^{[1]}_{l_1}(x_{l_1}) + \sum_{1 \le l_1 \le l_2 \le 6} f^{[2]}_{l_1 l_2}(x_{l_1}, x_{l_2}) + \sum_{1 \le l_1 \le l_2 \le l_3 \le 6} f^{[3]}_{l_1 l_2 l_3}(x_{l_1}, x_{l_2}, x_{l_3}),$$

where $x_{l_i} = y(t - l_i)$, $i = 1, 2, 3$, and $1 \le l_1 \le l_2 \le l_3 \le 6$. The 1-D, 2-D and 3-D compactly supported Mexican hat wavelets are used in this example to approximate the univariate functions $f^{[1]}_{l_1}(\cdot)$, the bivariate functions $f^{[2]}_{l_1 l_2}(\cdot)$ and the trivariate functions $f^{[3]}_{l_1 l_2 l_3}(\cdot)$, respectively, with the coarsest resolutions $j_1 = j_2 = j_3 = 0$ and the finest resolutions $J_1 = 3$, $J_2 = 1$ and $J_3 = 0$. The 1-D and n-dimensional mother wavelets used in this example are given by Eqs. (26) and (27).

To facilitate the comparisons, an evaluation criterion, the relative error [6,3], is introduced to measure the performance of the identified WN. This criterion is defined as

$$E_{\mathrm{rel}} = \frac{|y(t) - d(t)|}{|d(t)|}, \qquad (25)$$

where $y(t)$ and $d(t)$ are the associated one-step-ahead predictions and the desired data generated by Eq. (24), respectively.

The original series is normalized to $[0, 1]$, so the regularization factor $\lambda$, the penalty factor $\lambda_n$ and the desired learning accuracy $sse$ are initialized to 0.1, 1 and 0.001, respectively. With the use of the 1-norm SVM, only 316 candidate neurons out of about 5000 are selected; this greatly reduces the calculation of the WN. The one-step-ahead predictions of the WN were compared with the desired data, and the results of the SVMS algorithm are shown in Fig. 1. The relative errors $E_{\mathrm{rel}}$ of the two algorithms (SVMS and OLS) are shown in Fig. 2. The details of the two algorithms' performance are listed in Table 1. These results clearly show that the new algorithm is more effective than the classical OLS algorithm.

4.2. Approximation of the 'SinC' function with noise

In this example, three algorithms (OLS, ORBS and SVMS) are used to approximate the 'SinC' function, a popular choice in the literature for illustrating neural networks for regression:


Fig. 2. Relative errors of the OLS and SVMS algorithm.

Table 1
Comparison of classic OLS and SVMS algorithm

Algorithm   Hidden neurons   E_RMSE
OLS         14               3.5 × 10^-3
SVMS        14               3.0 × 10^-3

Table 2
Performance comparison for learning the noisy function SinC

Algorithm   E_RMSE   Time (s)   Nodes
SVMS        0.0065   4.6503     18
OLS         0.0077   5.0938     16
ORBS        0.0071   1.2034     39

Fig. 3. Outputs of the SVMS algorithm.

Fig. 4. Prediction of the sunspot series with SVMS.


$$y(x) = \begin{cases} \sin(x)/x, & x \neq 0,\\ 1, & x = 0. \end{cases}$$

A training set $(x_i, y_i)$ and a testing set $(x_i, y_i)$, each with 5000 data points, are created, where the $x$ values are uniformly randomly distributed on the interval $[-10, 10]$. In order to make the regression problem 'real', large uniform noise distributed in $[-0.2, 0.2]$ has been added to all the training samples, while the testing data remain noise-free.
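A data set of this kind can be generated as follows; this is merely an illustrative sketch of the experimental setup described above (the seed and array layout are assumptions).

```python
import numpy as np

def make_sinc_data(n=5000, noise=0.2, rng=np.random.default_rng(0)):
    """Noisy 'SinC' training set and noise-free test set on [-10, 10]."""
    x_train = rng.uniform(-10.0, 10.0, n)
    x_test = rng.uniform(-10.0, 10.0, n)
    sinc = lambda x: np.where(x == 0.0, 1.0, np.sin(x) / np.where(x == 0.0, 1.0, x))
    y_train = sinc(x_train) + rng.uniform(-noise, noise, n)   # uniform noise in [-0.2, 0.2]
    y_test = sinc(x_test)                                     # test targets remain noise-free
    return (x_train, y_train), (x_test, y_test)
```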

Sixteen hidden neurons are selected from about 80 candidate ones by the SVMS algorithm. Twenty trials were conducted for this algorithm, and the average results are shown in Table 2. Figs. 3 and 4 show the true and the approximated function of the SVMS algorithm. It can be seen from Table 2 that the SVMS algorithm spent 4.6503 s of CPU time to obtain a testing error ($E_{\mathrm{RMSE}}$) of 0.0065, whereas the OLS algorithm takes 5.0938 s of CPU time to reach a higher testing error of 0.0077. Although the ORBS algorithm is much faster than the SVMS and OLS algorithms, it selects more nodes and gives a lower precision.

4.3. High-dimensional regression problem—ailerons

This data set, containing 13,750 samples, addresses a control problem, namely flying an F16 aircraft. The attributes describe the status of the aeroplane, while the goal is to predict the control action on the ailerons of the aircraft. The data set contains 40 continuous attributes and, according to [20], only 20 attributes have been selected to construct the high-dimensional Gauss wavelet. The data set has been normalized into the interval $[-1, 1]$. The 1-D and n-dimensional mother wavelets used in this example are

$$\psi^{[1]}(x) = (1 - x^2)\,\mathrm{e}^{-x^2/2}, \qquad (26)$$


Table 3
Performance comparison for real-world regression

Algorithm   E_RMSE   Time (s)   Nodes
SVMS        0.0900   174        30
OLS         0.0932   1486       24
ORBS        0.0954   353        53

Table 4
Comparison of classic OLS and SVMS selection algorithm

Algorithm   E_RMSE    Time (s)   Nodes
OLS         16.4899   7.2344     24
ORBS        18.5115   1.5630     26
SVMS        14.5075   9.9219     25


$$\psi^{[n]}(\mathbf{x}) = (n - \|\mathbf{x}\|^2)\,\mathrm{e}^{-\|\mathbf{x}\|^2/2}, \qquad (27)$$

where $x$ is the 1-D input, $\mathbf{x}$ is the n-dimensional input, and $\|\cdot\|$ is the Euclidean norm (the 2-norm is used in this simulation). The model order is then 40 and the model can be expressed as

$$y(t) = \sum_{1 \le l_1 \le 40} f^{[1]}_{l_1}(x_{l_1}) + \sum_{1 \le l_1 \le l_2 \le 20} f^{[2]}_{l_1 l_2}(x_{l_1}, x_{l_2}),$$

where the 1-D and 2-D Gauss wavelets are used in this example to approximate the univariate functions $f^{[1]}_{l_1}(\cdot)$ and the bivariate functions $f^{[2]}_{l_1 l_2}(\cdot)$, respectively, with the coarsest resolutions $j_1 = j_2 = 0$ and the finest resolutions $J_1 = 3$ and $J_2 = 0$.

To reduce the calculation of the whole WN, 500 samples are used to select the hidden neurons, 5000 samples are used to obtain the weights of the WN, and the rest are reserved for testing the WN. Table 3 shows the results of the three algorithms (SVMS, OLS and ORBS).

From Tables 3 and 4, it can be seen that the SVMS algorithm is the fastest in this example because its calculation depends on the number of samples used to select nodes (500), whereas the calculation of the OLS and ORBS algorithms depends not only on the number of training samples but also on the size of the candidate hidden neuron library, which is about 9000.

4.4. The sunspot time series

The sunspot time series considered in this example consists of 300 annually recorded Wolf sunspot numbers for the period from 1700 to 1999. The objective here is to build a WN model to produce one-step-ahead predictions of the sunspot series. In this example, the Gaussian wavelet is applied as the activation function, with the operating domain in the interval $[-0.6, 0.6]$; therefore, the original data set is normalized into the interval $[-0.6, 0.6]$:

$$\psi^{[1]}(x) = x\,\mathrm{e}^{-x^2/2}, \qquad (28)$$

$$\psi^{[n]}(\mathbf{x}) = \psi^{[n]}(x_1, x_2, \ldots, x_n) = x_1 x_2 \cdots x_n\,\mathrm{e}^{-\|\mathbf{x}\|^2/2}, \qquad (29)$$

where $\psi^{[1]}(x)$ and $\psi^{[n]}(\mathbf{x})$ are, respectively, the 1-D Gaussian mother wavelet and the n-dimensional Gaussian mother wavelet. The data set was separated into two parts: the training set consists of 270 data points corresponding to the period 1700–1969, and the remaining data points are used to test the WN.

Following [3], the model order was chosen to be $n = 9$, so the input data are $\mathbf{x} = [x_1, x_2, \ldots, x_9] = [y(t-1), y(t-2), \ldots, y(t-9)]$, and the most significant variables are $y(t-1)$, $y(t-2)$ and $y(t-9)$. Therefore, the original model of the WN can be expressed as

$$y(t) = f(\mathbf{x}) = f\big(y(t-1), y(t-2), \ldots, y(t-9)\big) = \sum_{l_1=1}^{9} f^{[1]}_{l_1}(x_{l_1}) + \sum_{l_1=1}^{2}\sum_{l_2=2}^{3} f^{[2]}_{l_1 l_2}(z_{l_1}, z_{l_2}) + f^{[3]}_{129}(x_1, x_2, x_9), \qquad (30)$$

where $z_1 = y(t-1)$, $z_2 = y(t-2)$ and $z_3 = y(t-9)$. The 1-D, 2-D and 3-D Gaussian wavelet functions are used in this example to approximate the univariate functions $f^{[1]}_{l_1}(\cdot)$, the bivariate functions $f^{[2]}_{l_1 l_2}(\cdot)$ and the trivariate function $f^{[3]}_{129}(\cdot)$, respectively, with the coarsest resolutions $j_1 = j_2 = j_3 = 0$ and the finest resolutions $J_1 = 2$, $J_2 = J_3 = 0$.

The original series is normalized to $[-0.6, 0.6]$, so the regularization factor $\lambda$, the penalty factor $\lambda_n$ and the desired learning accuracy $sse$ are initialized to 0.1, 0.1 and 0.001, respectively. With the use of the 1-norm SVM, only 26 hidden neurons out of 500 are selected in this example; this greatly reduces the calculation required by the original stepwise algorithm.

From Table 4, the SVMS selection algorithm gives a better performance than the OLS and ORBS algorithms, although it spends more time than they do. This example shows that the new algorithm can capture the characteristics of a real-world chaotic system even though the stepwise selection of this algorithm is not optimal.

5. Conclusions

A new algorithm has been introduced for the selection of the hidden neurons of WNs. The SVMS algorithm can cope with problems containing large numbers of candidate hidden neurons by using the 1-norm SVM, and it avoids the influence of the ill conditioning of the matrix. In this paper, the problem of hidden neuron selection is converted into a problem of feature selection by using the 1-norm SVM, which can generate very sparse solutions [15]. A new criterion $ERR_i$ is introduced to rank the contribution of the neurons to the desired output; according to Eq. (20), the nodes with $w_i = 0$ make no contribution to the output of the network. It is unwise to use the coefficient $ERR_i$ directly to select the hidden neurons, because the sum of the $ERR_i$ is not equal to 1, and it is also hard to control the number of neurons through the parameter $C$, because the L1-SVM gives very sparse solutions. Therefore, a stepwise method has been employed to find the final nodes. The learning time of the standard SVM partly depends on the number of training samples, but the 1-norm SVM used in the SVMS algorithm applies a Newton–Armijo algorithm to solve the problem, which makes the learning time depend on the smaller of the number of training samples and the size of the neuron library. The variable $i_{\max}$ in Section 3 stands for the maximum number of iterations of the Newton–Armijo algorithm, which is set to 25 in this paper to reduce the learning time. The results obtained from the control problem of the F16 aircraft, with a large neuron library and few training samples, demonstrate these advantages of the new algorithm.

Acknowledgments

This research is supported by project 60674073 of the National Natural Science Foundation of China, project 2006BAB14B05 of the National Key Technology R&D Program of China, and project 2006CB403405 of the National Basic Research Program of China (973 Program). All of this support is appreciated.

Appendix. The linear programming support vector regression

In a linear programming SVM formulation [9,5] of the standard SVM, the term $\tfrac{1}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w}$ is replaced by $\|\mathbf{w}\|_1$ in Eq. (10). Empirical evidence [5] indicates that the 1-norm formulation has the advantage of generating very sparse solutions, which implies that many input space features do not play a role in determining the linear regressor. This makes the approach suitable for hidden neuron selection in the SVMS algorithm. Eq. (10) can then be expressed as

$$\begin{aligned} \min_{\boldsymbol{\xi},\hat{\boldsymbol{\xi}},\mathbf{p},\mathbf{q}}\quad & C(\mathbf{e}^{\mathrm{T}}\boldsymbol{\xi} + \mathbf{e}^{\mathrm{T}}\hat{\boldsymbol{\xi}}) + \mathbf{e}^{\mathrm{T}}(\mathbf{p} + \mathbf{q})\\ \text{s.t.}\quad & \mathbf{X}(\mathbf{p} - \mathbf{q}) + b - \mathbf{y} \ge -\boldsymbol{\xi},\\ & \mathbf{y} - \mathbf{X}(\mathbf{p} - \mathbf{q}) - b \ge -\hat{\boldsymbol{\xi}},\\ & \boldsymbol{\xi} \ge 0,\ \hat{\boldsymbol{\xi}} \ge 0, \end{aligned} \qquad (31)$$

where the following substitution for $\mathbf{w}$ has been made:

$$\mathbf{w} = \mathbf{p} - \mathbf{q}, \qquad \mathbf{p} \ge 0,\ \mathbf{q} \ge 0.$$

Converting Eq. (31) to standard linear programming form gives

$$\begin{aligned} \min_{\boldsymbol{\xi},\hat{\boldsymbol{\xi}},\mathbf{p},\mathbf{q}}\quad & [\,C\mathbf{e}^{\mathrm{T}}\ \ C\mathbf{e}^{\mathrm{T}}\ \ \mathbf{e}^{\mathrm{T}}\ \ \mathbf{e}^{\mathrm{T}}\,]\,[\,\boldsymbol{\xi}^{\mathrm{T}}\ \ \hat{\boldsymbol{\xi}}^{\mathrm{T}}\ \ \mathbf{p}^{\mathrm{T}}\ \ \mathbf{q}^{\mathrm{T}}\,]^{\mathrm{T}}\\ \text{s.t.}\quad & \begin{bmatrix}\mathbf{I} & \mathbf{0} & -\mathbf{X} & \mathbf{X}\\ \mathbf{0} & \mathbf{I} & \mathbf{X} & -\mathbf{X}\end{bmatrix}[\,\boldsymbol{\xi}^{\mathrm{T}}\ \ \hat{\boldsymbol{\xi}}^{\mathrm{T}}\ \ \mathbf{p}^{\mathrm{T}}\ \ \mathbf{q}^{\mathrm{T}}\,]^{\mathrm{T}} + \begin{bmatrix}-\mathbf{e}\\ \mathbf{e}\end{bmatrix}b \ge \begin{bmatrix}-\mathbf{y}\\ \mathbf{y}\end{bmatrix},\\ & \boldsymbol{\xi} \ge 0,\ \hat{\boldsymbol{\xi}} \ge 0,\ \mathbf{p} \ge 0,\ \mathbf{q} \ge 0. \end{aligned} \qquad (32)$$

The dual of the linear program (32) is the following:

$$\begin{aligned} \max_{\mathbf{u} \in \mathbb{R}^{2m}}\quad & [\,{-\mathbf{y}^{\mathrm{T}}}\ \ \mathbf{y}^{\mathrm{T}}\,]\,\mathbf{u}\\ \text{s.t.}\quad & \mathbf{u} \le C\mathbf{e},\\ & [\,{-\mathbf{X}^{\mathrm{T}}}\ \ \mathbf{X}^{\mathrm{T}}\,]\mathbf{u} \le \mathbf{e},\\ & [\,\mathbf{X}^{\mathrm{T}}\ \ {-\mathbf{X}^{\mathrm{T}}}\,]\mathbf{u} \le \mathbf{e},\\ & [\,{-\mathbf{e}^{\mathrm{T}}}\ \ \mathbf{e}^{\mathrm{T}}\,]\mathbf{u} \le 0,\\ & \mathbf{u} \ge 0. \end{aligned} \qquad (33)$$

The asymptotic exterior penalty problem for this linear program is Eq. (12), as used in Section 3, which introduces an approximation to the solution, and the Newton–Armijo algorithm described in this paper is applied to obtain the solution $\mathbf{u}$. Then $(\mathbf{w}, b)$ can be obtained as

$$\mathbf{w} = \mathbf{p} - \mathbf{q} = \frac{1}{\varepsilon}\left([\,{-\mathbf{X}^{\mathrm{T}}}\ \ \mathbf{X}^{\mathrm{T}}\,]\mathbf{u} - \mathbf{e}\right)_{+} - \frac{1}{\varepsilon}\left([\,\mathbf{X}^{\mathrm{T}}\ \ {-\mathbf{X}^{\mathrm{T}}}\,]\mathbf{u} - \mathbf{e}\right)_{+}, \qquad b = \frac{1}{\varepsilon}\,[\,{-\mathbf{e}^{\mathrm{T}}}\ \ \mathbf{e}^{\mathrm{T}}\,]\,\mathbf{u},$$

where the plus function $(\mathbf{x})_{+}$ is defined componentwise as $((\mathbf{x})_{+})_i = \max\{0, x_i\}$, $i = 1, \ldots, n$.
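As a cross-check of this appendix, the primal linear program (31)–(32) can also be handed to an off-the-shelf LP solver. The sketch below uses scipy.optimize.linprog; it is an illustrative alternative to the Newton–Armijo penalty approach of the paper, and the test data are assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def lp_svm_regression(X, y, C=1.0):
    """Solve the 1-norm LP SVR of Eqs. (31)-(32): variables z = [xi, xi_hat, p, q, b]."""
    N, m = X.shape
    c = np.concatenate([C * np.ones(N), C * np.ones(N), np.ones(m), np.ones(m), [0.0]])
    # Constraints of (31) written as A_ub z <= b_ub:
    #   y - X(p - q) - b <= xi        ->  -xi - Xp + Xq - b <= -y
    #   X(p - q) + b - y <= xi_hat    ->  -xi_hat + Xp - Xq + b <=  y
    A_ub = np.block([
        [-np.eye(N), np.zeros((N, N)), -X,  X, -np.ones((N, 1))],
        [np.zeros((N, N)), -np.eye(N),  X, -X,  np.ones((N, 1))],
    ])
    b_ub = np.concatenate([-y, y])
    bounds = [(0, None)] * (2 * N + 2 * m) + [(None, None)]   # the bias b is free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    z = res.x
    p, q, b = z[2 * N:2 * N + m], z[2 * N + m:2 * N + 2 * m], z[-1]
    return p - q, b                                           # w = p - q

# Tiny example: w tends to be sparse because of the 1-norm term e^T (p + q)
rng = np.random.default_rng(1)
X = rng.standard_normal((60, 10))
y = 2.0 * X[:, 0] - X[:, 3] + 0.05 * rng.standard_normal(60)
w, b = lp_svm_regression(X, y, C=10.0)
print(np.round(w, 3), round(b, 3))
```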

References

[1] F. Alonge, F. D'Ippolito, F.M. Raimondi, System identification via optimised wavelet-based neural networks, IEE Proc. Control Theory Appl. 150 (2003) 147–154.
[2] R. Battiti, Using mutual information for selecting features in supervised neural net learning, IEEE Trans. Neural Networks 5 (1994) 537–550.
[3] S.A. Billings, H.L. Wei, A new class of wavelet networks for nonlinear system identification, IEEE Trans. Neural Networks 16 (2005) 862–874.
[4] S.A. Billings, H.L. Wei, The wavelet-NARMAX representation: a hybrid model structure combining polynomial models with multiresolution wavelet decompositions, Int. J. Syst. Sci. 36 (2005) 137–152.
[5] P.S. Bradley, O.L. Mangasarian, Feature selection via concave minimization and support vector machines, in: Machine Learning: Proceedings of the 15th International Conference, San Francisco, CA, 1998, pp. 82–90.
[6] L.Y. Cao, Y.G. Hong, H.P. Fang, G.W. He, Predicting chaotic time-series with wavelet networks, Physica D 85 (1995) 225–238.
[7] S. Chen, S.A. Billings, W. Luo, Orthogonal least squares methods and their application to non-linear system identification, Int. J. Control 50 (1989) 1873–1896.
[8] A. Dax, A modified Gram–Schmidt algorithm with iterative orthogonalization and column pivoting, Linear Algebra Appl. 310 (2000) 25–42.
[9] G.M. Fung, O.L. Mangasarian, A feature selection Newton method for support vector machine classification, Comput. Optim. Appl. 28 (2004) 185–202.
[10] J.B. Gomm, D.L. Yu, Order and delay selection for neural network modelling by identification of linearized models, Int. J. Syst. Sci. 31 (2000) 1273–1283.
[11] K.L. Lee, S.A. Billings, Time series prediction using support vector machines, the orthogonal and the regularized orthogonal least-squares algorithms, Int. J. Syst. Sci. 33 (2002) 811–821.
[12] Y.J. Lee, W.F. Hsieh, C.M. Huang, epsilon-SSVR: a smooth support vector machine for epsilon-insensitive regression, IEEE Trans. Knowl. Data Eng. 17 (2005) 678–685.
[13] T.N. Lin, B.G. Horne, P. Tino, C.L. Giles, Learning long-term dependencies in NARX recurrent neural networks, IEEE Trans. Neural Networks 7 (1996) 1329–1338.
[14] S.G. Mallat, Z. Zhifeng, Matching pursuits with time-frequency dictionaries, IEEE Trans. Signal Process. 41 (1993) 3397–3415.
[15] O.L. Mangasarian, Exact 1-norm support vector machines via unconstrained convex differentiable minimization, J. Mach. Learn. Res. 7 (2006) 1517–1530.
[16] K.Z. Mao, S.A. Billings, Algorithms for minimal model structure detection in nonlinear dynamic system identification, Int. J. Control 68 (1997) 311–330.
[17] I. Rivals, L. Personnaz, Neural-network construction and selection in nonlinear modeling, IEEE Trans. Neural Networks 14 (2003) 804–819.
[18] H.H. Szu, B. Telfer, S. Kadambe, Neural network adaptive wavelets for signal representation and classification, Opt. Eng. 31 (1992) 1907–1916.
[19] R.J. Wai, H.H. Chang, Backstepping wavelet neural network control for indirect field-oriented induction motor drive, IEEE Trans. Neural Networks 15 (2004) 367–382.
[20] H.L. Wei, S.A. Billings, J. Liu, Term and variable selection for non-linear system identification, Int. J. Control 77 (2004) 86–110.
[21] J.H. Xu, D.W.C. Ho, A basis selection algorithm for wavelet neural networks, Neurocomputing 48 (2002) 681–689.
[22] Q. Zhang, A. Benveniste, Wavelet networks, IEEE Trans. Neural Networks 3 (1992) 889–898.

Min Han received her B.S. and M.S. degrees from the Department of Electrical Engineering, Dalian University of Technology, Liaoning, China, in 1982 and 1993, respectively, and the M.S. and Ph.D. degrees from Kyushu University, Fukuoka, Japan, in 1996 and 1999, respectively. She is a Professor at the School of Electronic and Information Engineering, Dalian University of Technology. Her current research interests are neural networks and chaos and their applications to control and identification.

Jia Yin received his B.S. degree from Shandong University at Weihai, China. He is currently studying for his M.S. degree at the School of Electronic and Information Engineering, Dalian University of Technology, Liaoning, China. His major interests are pattern recognition and intelligent systems.