Multiple optimal learning factors for the multi-layer perceptron

Sanjeev S. Malalur, Michael T. Manry, Praveen Jesudhas. University of Texas at Arlington, Department of Electrical Engineering, Nedderman Hall, Room 517, Arlington, TX 76019, USA.

Article info

Article history: Received 10 April 2010; Received in revised form 14 June 2014; Accepted 21 August 2014; Available online 11 September 2014. Communicated by G. Thimm.

Keywords: Multilayer perceptron; Newton's method; Hessian; Orthogonal least squares; Multiple optimal learning factor; Whitening transform

Abstract

A batch training algorithm is developed for a fully connected multi-layer perceptron, with a single hidden layer, which uses two stages per iteration. In the first stage, Newton's method is used to find a vector of optimal learning factors (OLFs), one for each hidden unit, which is used to update the input weights. Linear equations are solved for the output weights in the second stage. Elements of the new method's Hessian matrix are shown to be weighted sums of elements from the Hessian of the whole network. The effects of linearly dependent inputs and hidden units on training are analyzed, and an improved version of the batch training algorithm is developed. In several examples, the improved method performs better than first order training methods like backpropagation and scaled conjugate gradient, with minimal computational overhead, and performs almost as well as Levenberg–Marquardt, a second order training method, with several orders of magnitude fewer multiplications.

© 2014 Elsevier B.V. All rights reserved.

1. Introduction

Multi-layer perceptron (MLP) neural networks are widely used for regression and classification applications in the areas of parameter estimation [1,2], document analysis and recognition [3], finance and manufacturing [4], and data mining [5]. Due to its layered, parallel architecture, the MLP has several favorable properties, such as universal approximation [6] and the ability to mimic Bayes discriminants [7] and maximum a-posteriori (MAP) estimates [8]. However, actual MLP performance is adversely affected by the limitations of the available training algorithms.

Common batch training algorithms for the MLP include first order methods, such as backpropagation (BP) [9] and conjugate gradient [10], and second order learning methods related to Newton's method. First order methods generally use fewer operations per iteration but require more iterations than second order methods. Newton's method for training the MLP often has non-positive definite [11,12] or even singular Hessians. Hence Levenberg–Marquardt (LM) [13,14] and other quasi-Newton methods are used instead. Unfortunately, second order methods do not scale well. Although first order methods scale better, they are sensitive to input means and gains [15], since they lack affine invariance. Layer-by-layer approaches [16] also exist for optimizing the MLP. Such approaches (i) divide the network weights into disjoint subsets, and (ii) train the subsets separately. Some network weight subsets have nonsingular Hessians, allowing them to be trained via Newton's method. For example, solving linear equations for output weights [15,17-19] is actually Newton's algorithm for the output weights. The optimal learning factor (OLF) [20] for BP training is found using a one by one Hessian. If the learning factor and momentum term gain in BP are found using a two by two Hessian [21], the resulting algorithm is very similar to conjugate gradient.

The primary purpose of this paper is to present a family of learning algorithms targeted towards training a fixed-architecture, fully connected multi-layer perceptron with a single hidden layer, capable of learning from regression/approximation type applications and data. In [22] a two-stage training method was introduced, which uses Newton's method to obtain a vector of optimal learning factors, one for each MLP hidden unit. These learning factors are optimal, hence it was named multiple optimal learning factors (MOLF). A variation of MOLF called the variable optimal learning factors (VOLF) was presented in [23]. As a learning method, VOLF looks promising; however, the original MOLF was still better in terms of training performance. In this paper, we explain in detail the motivation behind the original MOLF algorithm using the concept of equivalent networks. We analyze the structure of the MOLF Hessian matrix and demonstrate how linear dependency among inputs and hidden units can impact the structure, ultimately affecting training performance. This leads to the development of an improved version of the MOLF whose performance is not impacted by the presence of linear dependencies and has an overall performance better than the original MOLF.

The rest of the paper is organized as follows. Section 2 reviews MLP notation, a simple first order training method, and an expression for the OLF. The MOLF method is presented in Section 3, motivated by a discussion of equivalent networks. Section 4 analyzes the effect of linearly dependent inputs and hidden units on MOLF learning. An improvement to the MOLF algorithm is presented in Section 5 that is immune to the presence of linearly dependent inputs and hidden units. In Section 6, our methods are compared to various existing first and second order algorithms. Section 7 summarizes our conclusions.

2. Review of multi-layer perceptron

First, we introduce the notation used in the rest of the paper and review a simple two-stage algorithm that makes limited use of Newton's method. Observations from the two-stage method are used to motivate the MOLF approach.

2.1. MLP notation

A fully connected feed-forward MLP network with a single hidden layer is shown in Fig. 1. It consists of a layer of inputs, a single layer of hidden units, and a layer of outputs. The input weights w(k, n) connect the nth input to the kth hidden unit. Output weights w_oh(m, k) connect the kth hidden unit's activation o_p(k) to the mth output y_p(m), which has a linear activation. The bypass weight w_oi(m, n) connects the nth input to the mth output. The training data, described by the set {x_p, t_p}, consists of N-dimensional input vectors x_p and M-dimensional desired output vectors t_p. The pattern number p varies from 1 to N_v, where N_v denotes the number of training vectors present in the data set. In order to handle thresholds in the hidden and output layers, the input vectors are augmented by an extra element x_p(N+1), where x_p(N+1) = 1, so x_p = [x_p(1), x_p(2), ..., x_p(N+1)]^T. Let N_h denote the number of hidden units. The vector of hidden layer net functions n_p and the output vector of the network y_p can be written as

n_p = W x_p,    y_p = W_{oi} x_p + W_{oh} o_p    (1)

where the kth element of the hidden unit activation vector o_p is calculated as o_p(k) = f(n_p(k)) and f(·) denotes the hidden layer activation function, such as the sigmoid activation [9], which is the one used throughout this paper. In this paper, the cost function used in MLP training is the classic mean squared error between the desired and the actual network outputs, defined as

E = \frac{1}{N_v} \sum_{p=1}^{N_v} \sum_{m=1}^{M} [t_p(m) - y_p(m)]^2    (2)
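The quantities in (1) and (2) map directly onto array operations. The following sketch is our own illustration (the array names X, T, W, Woi, Woh and the NumPy implementation are assumptions, not code from the paper); it evaluates the forward pass of (1) and the MSE of (2) for a batch of N_v patterns.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def forward_mse(X, T, W, Woi, Woh):
    """Forward pass of eq. (1) and the MSE of eq. (2).
    X: Nv x (N+1) augmented inputs (last column all ones),
    T: Nv x M targets, W: Nh x (N+1), Woi: M x (N+1), Woh: M x Nh."""
    Np = X @ W.T                 # net functions n_p, Nv x Nh
    O = sigmoid(Np)              # hidden activations o_p
    Y = X @ Woi.T + O @ Woh.T    # outputs y_p, Nv x M
    E = np.mean(np.sum((T - Y) ** 2, axis=1))   # eq. (2)
    return Y, O, E
```

The later sketches in this transcript reuse the same array shapes.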

2.2. Training using output weight optimization-backpropagation

Here, we introduce output weight optimization-backpropagation (OWO-BP) [17], a first order algorithm that groups the weights in the network into two subsets and trains them separately. In a given iteration of OWO-BP, the training proceeds from (i) finding the output weight matrices W_oh and W_oi connected to the network outputs to (ii) separately training the input weights W using BP.

Output weight optimization (OWO) is a technique for calculating the network's output weights [18]. From (1) it is clear that our MLP's output layer activations are linear. Therefore, finding the weights connected to the outputs is equivalent to solving a system of linear equations. The expression for the actual outputs given in (1) can be re-written as

y_p = W_o x_p    (3)

where x_p = [x_p^T, o_p^T]^T is the augmented input column vector, or basis vector, of size N_u, where N_u = N + N_h + 1. The M by N_u output weight matrix W_o, formed as [W_{oi} : W_{oh}], contains all the output weights. The output weights can be found by setting \partial E / \partial W_o = 0, which leads to a set of linear equations given by

C_x = R_x W_o^T    (4)

where

C_x = \frac{1}{N_v} \sum_{p=1}^{N_v} x_p t_p^T,    R_x = \frac{1}{N_v} \sum_{p=1}^{N_v} x_p x_p^T

Eq. (4) comprises M sets of N_u linear equations in N_u unknowns and is most easily solved using orthogonal least squares (OLS) [24]. In the second half of an OWO-BP iteration, the input weight matrix W is updated as

W \leftarrow W + z\, G    (5)

where G is the generic representation of a direction matrix that contains information about the direction of learning, and the learning factor z contains information about the step length to be taken in the direction G. For backpropagation, the direction matrix is simply the N_h by (N+1) negative input weight Jacobian matrix, computed as

G = -\frac{\partial E}{\partial W} = \frac{1}{N_v} \sum_{p=1}^{N_v} \delta_p\, x_p^T    (6)

Here \delta_p = [\delta_p(1), \delta_p(2), \ldots, \delta_p(N_h)]^T is the N_h by 1 column vector of hidden unit delta functions [9]. A brief outline of training using OWO-BP is given below. For every training iteration:

1. In the first stage, solve the system of linear equations in (4) using OLS and update the output weights W_o;
2. In the second stage, begin by finding the negative Jacobian matrix G described in Eq. (6);
3. Then, update the input weights W using Eq. (5).

This method is attractive for a couple of reasons. First, the training is faster than BP, since training weights connected to the outputs is equivalent to solving linear equations [16]; and second, it helps us avoid some local minima.

Fig. 1. A fully connected multi-layer perceptron.
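As a concrete reading of the two stages just outlined, the sketch below (our own; np.linalg.lstsq stands in for the OLS solver of [24], and the fixed learning factor z is an assumption) performs one OWO-BP iteration: it solves (4) for the output weights and then takes the BP step of (5)-(6) on the input weights.

```python
import numpy as np

def owo_bp_iteration(X, T, W, Woi, Woh, z=0.1):
    """One OWO-BP iteration: stage 1 solves eq. (4) for the output weights,
    stage 2 updates the input weights with eqs. (5)-(6).
    X: Nv x (N+1) augmented inputs, T: Nv x M targets."""
    Nv = X.shape[0]
    O = 1.0 / (1.0 + np.exp(-(X @ W.T)))            # hidden activations
    # Stage 1: output weight optimization (OWO)
    Xa = np.hstack([X, O])                          # basis vector [x_p^T, o_p^T]^T
    R = (Xa.T @ Xa) / Nv                            # R_x of eq. (4)
    C = (Xa.T @ T) / Nv                             # C_x of eq. (4)
    WoT, *_ = np.linalg.lstsq(R, C, rcond=None)     # tolerates a singular R_x
    Wo = WoT.T                                      # Wo = [Woi : Woh]
    Woi, Woh = Wo[:, :X.shape[1]], Wo[:, X.shape[1]:]
    # Stage 2: backpropagation on the input weights
    Y = X @ Woi.T + O @ Woh.T
    Delta = 2.0 * ((T - Y) @ Woh) * O * (1.0 - O)   # hidden delta functions
    G = (Delta.T @ X) / Nv                          # negative Jacobian, eq. (6)
    return W + z * G, Woi, Woh                      # eq. (5)
```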

2.3. Optimal learning factor

The choice of learning factor z in Eq. (5) has a direct effect on the convergence rate of OWO-BP. Early steepest descent methods used a fixed constant, with slow convergence, while later methods used a heuristic scaling approach to modify the learning factor between iterations and thus speed up the rate of convergence. However, using Taylor's series for the error E in Eq. (2), a non-heuristic optimal learning factor (OLF) for OWO-BP can be derived [20] as

z = - \frac{\partial E / \partial z}{\partial^2 E / \partial z^2} \Big|_{z=0}    (7)

where the numerator and denominator derivatives are evaluated at z = 0. The expression for the second derivative of the error with respect to the OLF is found using (2) as

\frac{\partial^2 E}{\partial z^2} = \sum_{k=1}^{N_h} \sum_{j=1}^{N_h} \sum_{n=1}^{N+1} g(k,n) \sum_{i=1}^{N+1} \frac{\partial^2 E}{\partial w(k,n)\, \partial w(j,i)}\, g(j,i) = \sum_{k=1}^{N_h} \sum_{j=1}^{N_h} g_k^T H_R^{k,j} g_j    (8)

where column vector g_k contains the elements g(k, n) of G, for all values of n. H_R is the input weight Hessian with N_iw rows and columns, where N_iw = (N+1) N_h is the number of input weights. Clearly, H_R is reduced in size compared to H, the Hessian for all weights in the entire network. H_R^{k,j} contains elements of H_R for all input weights connected to the jth and kth hidden units and has size (N+1) by (N+1). When Gauss–Newton [12] updates are used, elements of H_R are computed as

\frac{\partial^2 E}{\partial w(j,i)\, \partial w(k,n)} = \frac{2}{N_v} u(j,k) \sum_{p=1}^{N_v} x_p(i)\, x_p(n)\, o'_p(j)\, o'_p(k),    u(j,k) = \sum_{m=1}^{M} w_{oh}(m,j)\, w_{oh}(m,k)    (9)

where o'_p(k) denotes the first partial derivative of o_p(k) with respect to its net function. Because (9) represents the Gauss–Newton approximation to the Hessian, it is positive semi-definite. Eq. (8) shows that (i) the OLF can be obtained from elements of the Hessian H_R, (ii) H_R contains useful information even when it is singular, and (iii) a smaller non-singular Hessian \partial^2 E / \partial z^2 can be constructed using H_R.
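Eq. (7) requires only the two scalar derivatives of E along the fixed direction G, which can also be estimated numerically when the closed-form expressions (8)-(9) are inconvenient. The sketch below is a finite-difference stand-in of our own (error_fn is an assumed callable returning the MSE of (2) for a given input weight matrix), not the paper's implementation.

```python
def optimal_learning_factor(error_fn, W, G, eps=1e-4):
    """OLF of eq. (7): z = -(dE/dz) / (d2E/dz2) evaluated at z = 0,
    with both derivatives estimated by central differences along G."""
    e0 = error_fn(W)
    ep = error_fn(W + eps * G)
    em = error_fn(W - eps * G)
    d1 = (ep - em) / (2.0 * eps)           # dE/dz at z = 0
    d2 = (ep - 2.0 * e0 + em) / eps ** 2   # d2E/dz2 at z = 0
    return -d1 / d2 if d2 > 0.0 else 0.0   # fall back when curvature is not positive
```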

2.4. Motivation for further work

Even though the OWO stage of OWO-BP is equivalent to Newton's algorithm for the output weights, the overall algorithm is only marginally better than standard BP, and so it is rarely used. This occurs because the BP stage is first order. The one by one Hessian used in the OLF calculations is the only one associated with the input weight matrix W. Therefore, improving the performance of OWO-BP may require that we develop a larger Hessian matrix associated with W.

3. Multiple optimal learning factor algorithm

In this section, we attempt to improve input weight training by deriving a vector of optimal learning factors, z, which contains one element for each hidden unit. The resulting improvement to OWO-BP is called the multiple optimal learning factors (MOLF) training method. We begin by first introducing the concept of equivalent networks and identify the conditions under which two networks can generate the same outputs.

3.1. Discussion of equivalent networks

Let MLP 1 have a net function vector n_p and output vector y_p as discussed previously in Section 2.1. In a second network called MLP 2, which is trained similarly, the net function vector and the output vector are denoted by n'_p and y'_p, respectively. The bypass and output weight matrices, along with the hidden unit activation functions, are considered to be equal for both MLP 1 and MLP 2. Based on this information, the output vectors for MLP 1 and MLP 2 are, respectively,

y_p(i) = \sum_{n=1}^{N+1} w_{oi}(i,n)\, x_p(n) + \sum_{k=1}^{N_h} w_{oh}(i,k)\, f(n_p(k))    (10)

y'_p(i) = \sum_{n=1}^{N+1} w_{oi}(i,n)\, x_p(n) + \sum_{k=1}^{N_h} w_{oh}(i,k)\, f(n'_p(k))    (11)

In order to make MLP 2 strongly equivalent to MLP 1, their respective output vectors y_p and y'_p have to be equal. Based on Eqs. (10) and (11), this can be achieved only when their respective net function vectors are equal [25]. MLP 2 can be made strongly equivalent to MLP 1 by linearly transforming its net function vector n'_p before passing it as an argument to the activation function f(·). This process is equivalent to directly using the net function n_p in MLP 2, denoted as

n_p = C n'_p    (12)

The linear transformation matrix C used to transform n'_p in Eq. (12) is of size N_h by N_h. This process of transforming the net function vector of MLP 2 is represented in Fig. 2.

On applying the linear transformation to the net function vector n'_p in Eq. (12), MLP 2 is made strongly equivalent to MLP 1. We then get

y'_p(i) = y_p(i) = \sum_{n=1}^{N+1} w_{oi}(i,n)\, x_p(n) + \sum_{k=1}^{N_h} w_{oh}(i,k)\, f\left( \sum_{m=1}^{N_h} c(k,m)\, n'_p(m) \right)    (13)

After making MLP 2 strongly equivalent to MLP 1, the net function vector n_p can be related to its input weights W and the net function vector n'_p as

n_p(k) = \sum_{n=1}^{N+1} w(k,n)\, x_p(n) = \sum_{m=1}^{N_h} c(k,m)\, n'_p(m)    (14)

Similarly, the MLP 2 net function vector n'_p can be related to its weights W' as

n'_p(m) = \sum_{n=1}^{N+1} w'(m,n)\, x_p(n)    (15)

Fig. 2. Graphical representation of net function vectors.

On substituting (15) into (14) we get

n_p(k) = \sum_{n=1}^{N+1} w(k,n)\, x_p(n) = \sum_{n=1}^{N+1} \sum_{m=1}^{N_h} c(k,m)\, w'(m,n)\, x_p(n)    (16)

In (16) we see that the input weight matrices of MLPs 1 and 2 are linearly related as

W = C W'    (17)

If the elements of the C matrix are found through an optimality criterion, then optimal input weights W' can be computed as

W' = C^{-1} W    (18)
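The chain (12)-(18) is easy to verify numerically: for any nonsingular C, setting W' = C^{-1} W makes the transformed net functions of MLP 2 match those of MLP 1 exactly. The few lines below are our own illustration with arbitrary sizes, not an experiment from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, Nh, Nv = 5, 4, 3
X = np.hstack([rng.normal(size=(Nv, N)), np.ones((Nv, 1))])  # augmented inputs
W = rng.normal(size=(Nh, N + 1))                             # MLP 1 input weights
C = rng.normal(size=(Nh, Nh))                                # any nonsingular C
W_prime = np.linalg.solve(C, W)                              # eq. (18): W' = C^{-1} W
n1 = X @ W.T                                                 # MLP 1 net functions
n2 = (X @ W_prime.T) @ C.T                                   # eq. (12): n_p = C n'_p
assert np.allclose(n1, n2)
```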

3.2. Transformation of net function vectors

In Section 3.1 we showed a method for generating a network, MLP 2, which is strongly equivalent to the original network, MLP 1. In this section, we show that equivalent networks train differently. The output of the transformed network is given in Eq. (14). The elements of the negative Jacobian matrix G' of MLP 2, found from Eqs. (2) and (14), are defined as

g'(u,v) = -\frac{\partial E}{\partial w'(u,v)} = \frac{2}{N_v} \sum_{p=1}^{N_v} \sum_{i=1}^{M} [t_p(i) - y_p(i)]\, \frac{\partial y_p(i)}{\partial w'(u,v)},
\frac{\partial y_p(i)}{\partial w'(u,v)} = \sum_{k=1}^{N_h} w_{oh}(i,k)\, o'_p(k)\, c(k,u)\, x_p(v)    (19)

where o'_p(k) denotes the first partial derivative of o_p(k). Rearranging terms in (19) results in

g'(u,v) = \sum_{k=1}^{N_h} c(k,u)\, \frac{2}{N_v} \sum_{p=1}^{N_v} \sum_{i=1}^{M} [t_p(i) - y_p(i)]\, w_{oh}(i,k)\, o'_p(k)\, x_p(v) = \sum_{k=1}^{N_h} c(k,u) \left( -\frac{\partial E}{\partial w(k,v)} \right)    (20)

This is abbreviated as

G' = C^T G    (21)

The input weight update equation for MLP 2, based on its negative Jacobian G', is given by

W' \leftarrow W' + z\, G'

On pre-multiplying this equation by C and using Eqs. (18) and (21) we obtain

W \leftarrow W + z\, G''    (22)

where

G'' = R\, G,    R = C\, C^T    (23)

Lemma 1. For a given R matrix in Eq. (23), the number of C matrices is infinite.

Continuing, the transformed Jacobian G'' of MLP 1 is given in terms of the original Jacobian G as

g''(u,v) = \sum_{k=1}^{N_h} r(u,k)\, g(k,v),
g(k,v) = -\frac{\partial E}{\partial w(k,v)} = \frac{2}{N_v} \sum_{p=1}^{N_v} \sum_{i=1}^{M} [t_p(i) - y_p(i)]\, w_{oh}(i,k)\, o'_p(k)\, x_p(v)    (24)

where o'_p(k) denotes the derivative of o_p(k) with respect to its net function. Eqs. (22) and (23) show that valid input weight update matrices G'' can be generated by simply transforming the weight update matrix G from MLP 1, using an autocorrelation matrix R. Since uncountably many matrices R exist, it is natural to wonder which one is optimal.

3.3. Multiple optimal learning factors for input weights

In this subsection, we explore the idea of finding an optimal transform matrix R for the negative Jacobian G. However, finding a dense R matrix would result in an algorithm with almost the computational complexity of Levenberg–Marquardt (LM). We avoid this by letting the R matrix in Eq. (23) be diagonal. In this case, Eq. (22) becomes

w(k,n) \leftarrow w(k,n) + z\, r(k)\, g(k,n)    (25)

On comparing Eqs. (22) and (25), the expression for the different optimal learning factors z_k can be given as

z_k = z\, r(k)    (26)

Eq. (26) shows that using the MOLF approach for training an MLP is equivalent to transforming the net function vector using a diagonal transform matrix. Next, we use Newton's method to develop our approach for finding z.

Assume that an MLP is being trained using OWO-BP. Also assume that a separate OLF z_k is being used to update each hidden unit's input weights w(k, n), where 1 ≤ n ≤ N+1. The error function to be minimized is given by (2). The predicted output y_p(m) is given by

y_p(m) = \sum_{n=1}^{N+1} w_{oi}(m,n)\, x_p(n) + \sum_{k=1}^{N_h} w_{oh}(m,k)\, f\left( \sum_{n=1}^{N+1} (w(k,n) + z_k\, g(k,n))\, x_p(n) \right)

where g(k, n) is an element of the negative Jacobian matrix G and f(·) denotes the hidden layer activation function. The negative first partial derivative of E with respect to z_j is

g_{molf}(j) = -\frac{\partial E}{\partial z_j} = \frac{2}{N_v} \sum_{p=1}^{N_v} \sum_{m=1}^{M} \left[ t'_p(m) - \sum_{k=1}^{N_h} w_{oh}(m,k)\, o_p(z_k) \right] w_{oh}(m,j)\, o'_p(j)\, \Delta n_p(j)    (27)

where

t'_p(m) = t_p(m) - \sum_{n=1}^{N+1} w_{oi}(m,n)\, x_p(n),
\Delta n_p(j) = \sum_{n=1}^{N+1} x_p(n)\, g(j,n),
o_p(z_k) = f\left( \sum_{n=1}^{N+1} (w(k,n) + z_k\, g(k,n))\, x_p(n) \right)

The elements of the Gauss–Newton Hessian H_{molf} are

h_{molf}(l,j) \equiv \frac{\partial^2 E}{\partial z_l\, \partial z_j} = \frac{2}{N_v} \sum_{m=1}^{M} w_{oh}(m,l)\, w_{oh}(m,j) \sum_{p=1}^{N_v} o'_p(l)\, o'_p(j)\, \Delta n_p(l)\, \Delta n_p(j)
= \sum_{i=1}^{N+1} \sum_{n=1}^{N+1} \left[ \frac{2}{N_v} u(l,j) \sum_{p=1}^{N_v} x_p(i)\, x_p(n)\, o'_p(l)\, o'_p(j) \right] g(l,i)\, g(j,n)    (28)

where u(l, j) is defined in (9). The Gauss–Newton update guarantees that H_{molf} is non-negative definite. Given H_{molf} and the negative gradient vector g_{molf}, defined as

g_{molf} = [-\partial E/\partial z_1, -\partial E/\partial z_2, \ldots, -\partial E/\partial z_{N_h}]^T

we minimize E with respect to the vector z using Newton's method. If H_{molf} and g_{molf} are the Hessian and negative gradient, respectively, of the error with respect to z, then the multiple optimal learning factors are computed by solving

H_{molf}\, z = g_{molf}    (29)


A brief outline of training using the MOLF algorithm is given below. In each iteration of the training algorithm:

1. Solve linear equations for all output weights using (4);
2. Calculate the negative input weight Jacobian G using BP, as given by (6);
3. Calculate z using Newton's method, as shown in (29), and update the input weights as

w(k,n) \leftarrow w(k,n) + z_k\, g(k,n)    (30)

Here, the MOLF procedure has been inserted into the OWO-BP algorithm, resulting in an algorithm we denote as MOLF-BP. The MOLF procedure can be inserted into other algorithms as well.
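The gradient (27), Hessian (28) and update (29)-(30) vectorize naturally. The sketch below is our own reading of those equations for sigmoid hidden units, evaluated at z = 0, with np.linalg.lstsq standing in for the OLS solve of the N_h by N_h system; it is not the authors' code.

```python
import numpy as np

def molf_input_weight_update(X, T, W, Woi, Woh):
    """One MOLF update of the input weights via eqs. (27)-(30)."""
    Nv = X.shape[0]
    O = 1.0 / (1.0 + np.exp(-(X @ W.T)))      # hidden activations, Nv x Nh
    Od = O * (1.0 - O)                        # o'_p(k) for sigmoid units
    Y = X @ Woi.T + O @ Woh.T                 # network outputs, eq. (1)
    E = T - Y                                 # output errors
    Delta = 2.0 * (E @ Woh) * Od              # hidden delta functions, Nv x Nh
    G = (Delta.T @ X) / Nv                    # negative Jacobian, eq. (6)
    Dn = X @ G.T                              # Δn_p(j) = Σ_n x_p(n) g(j,n), Nv x Nh
    g_molf = (2.0 / Nv) * np.sum((E @ Woh) * Od * Dn, axis=0)   # eq. (27) at z = 0
    U = Woh.T @ Woh                           # u(l, j) of eq. (9)
    A = Od * Dn                               # o'_p(j) Δn_p(j), Nv x Nh
    H_molf = (2.0 / Nv) * U * (A.T @ A)       # eq. (28)
    z, *_ = np.linalg.lstsq(H_molf, g_molf, rcond=None)         # eq. (29)
    return W + z[:, None] * G                 # eq. (30)
```

Using lstsq rather than a plain inverse mirrors the observation in Section 4.2 that zero rows and columns of the MOLF Hessian should cause no difficulty.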

3.4. Analysis of the MOLF Hessian

This subsection presents an analysis of the structure of the MOLF Hessian and its relation to the reduced Hessian H_R of the input weights. Comparing (28) to (9), we see that the term within the square brackets is an element of H_R. Hence,

h_{molf}(l,j) = \frac{\partial^2 E}{\partial z_l\, \partial z_j} = \sum_{i=1}^{N+1} \sum_{n=1}^{N+1} \frac{\partial^2 E}{\partial w(l,i)\, \partial w(j,n)}\, g(l,i)\, g(j,n)    (31)

For fixed (l, j), h_{molf}(l,j) can be expressed as

\frac{\partial^2 E}{\partial z_l\, \partial z_j} = \sum_{i=1}^{N+1} g_l(i) \sum_{n=1}^{N+1} h_R^{l,j}(i,n)\, g_j(n)    (32)

In matrix notation,

h_{molf}(l,j) = g_l^T H_R^{l,j} g_j    (33)

where column vector g_l contains the G elements g(l, n) for all values of n, and the (N+1) by (N+1) matrix H_R^{l,j} contains the elements of H_R for weights connected to the lth and jth hidden units. The reduced size Hessian H_R^{l,j} in (33) uses four indices (l, i, j, n) and can be viewed as a 4-dimensional array, represented by H_R^4, whose elements h_R^{l,j}(i,n) form an array in R^{(N+1) \times N_h \times N_h \times (N+1)}. Now, (33) can be rewritten as

H_{molf}^4 = G\, H_R^4\, G^T    (34)

From (32), we see that h_{molf}(l,j) = h_{molf}^4(l,j,l,j), i.e., the 4-dimensional array H_{molf}^4 is reduced to the 2-dimensional matrix H_{molf} by equating its first and third indices and its second and fourth indices. Each element of the MOLF Hessian combines the information from (N+1) rows and columns of the reduced Hessian H_R^{l,j}. This makes MOLF less sensitive to input conditions. Note the similarities between (8) and (33).

4. Effect of dependence on the MOLF algorithm

In this section, we analyze the performance of MOLF applied to OWO-BP in the presence of linear dependence.

4.1. Dependence in the input layer

A linearly dependent input can be modeled as a linear combination of any or all inputs as

x_p(N+2) = \sum_{n=1}^{N+1} b(n)\, x_p(n)    (35)

During OWO, the weights from the dependent input feeding the outputs will be set to zero, and the output weight adaptation will not be affected. During the input weight adaptation, the expression for the gradient given by (27) can be re-written as

g_{molf}(j) = -\frac{\partial E}{\partial z_j} = \frac{2}{N_v} \sum_{p=1}^{N_v} \sum_{m=1}^{M} \left[ t'_p(m) - \sum_{k=1}^{N_h} w_{oh}(m,k)\, o_p(z_k) \right] w_{oh}(m,j)\, o'_p(j) \left[ \Delta n_p(j) + g(j, N+2) \sum_{n=1}^{N+1} b(n)\, x_p(n) \right]    (36)

and the expression for an element of the Hessian in (28) can be re-written as

\frac{\partial^2 E}{\partial z_l\, \partial z_j} = \frac{2}{N_v} \sum_{m=1}^{M} w_{oh}(m,l)\, w_{oh}(m,j) \sum_{p=1}^{N_v} o'_p(l)\, o'_p(j) \Big[ \Delta n_p(l)\, \Delta n_p(j) + \Delta n_p(l)\, g(j, N+2) \sum_{n=1}^{N+1} b(n)\, x_p(n) + \Delta n_p(j)\, g(l, N+2) \sum_{i=1}^{N+1} b(i)\, x_p(i) + g(j, N+2)\, g(l, N+2) \sum_{n=1}^{N+1} \sum_{i=1}^{N+1} b(i)\, x_p(i)\, b(n)\, x_p(n) \Big]    (37)

Let H'_{molf} be the Hessian when the extra dependent input x_p(N+2) is included. Then

h'_{molf}(l,j) = h_{molf}(l,j) + \frac{2}{N_v} u(l,j) \sum_{p=1}^{N_v} x_p(N+2)\, o'_p(l)\, o'_p(j) \Big[ g(l, N+2) \sum_{n=1}^{N+1} x_p(n)\, g(j,n) + g(j, N+2) \sum_{i=1}^{N+1} x_p(i)\, g(l,i) + x_p(N+2)\, g(l, N+2)\, g(j, N+2) \Big]    (38)

Comparing (27) with (36), and (28) with (37) and (38), we see some additional terms that appear within the square brackets in the expressions for the gradient and Hessian in the presence of a linearly dependent input. Clearly, these parasitic cross-terms will cause training using MOLF to be different for the case of linearly dependent inputs. This leads to the following lemma.

Lemma 2. Linearly dependent inputs, when added to the network, do not force H'_{molf} to be singular.

As seen in (37) and (38), each element h'_{molf}(l, j) simply gains some first and second degree terms in the variables b(n).

4.2. Dependence in the hidden layer

Here we look at how a dependent hidden unit affects the performance of MOLF. Assume that some hidden unit activations are linearly dependent upon others, as

o_p(N_h+1) = \sum_{k=1}^{N_h} c(k)\, o_p(k)    (39)

Further assume that OLS is used to solve for the output weights. The weights in the network are updated during every training iteration, and it is quite possible that this could cause some hidden units to be linearly dependent upon each other. The dependence could manifest in hidden units being identical or a linear combination of other hidden unit outputs.

Consider one of the hidden units to be dependent. The autocorrelation matrix R_x will be singular, and since we are using OLS to solve for the output weights, all the weights connecting the dependent hidden unit to the outputs will be forced to zero. This will ensure that the dependent hidden unit does not contribute to the output. To see if this has any impact on learning in MOLF, we can look at the expressions for the gradient and Hessian, given by (27) and (28) respectively. Both equations have a sum of product terms containing output weights. Since OWO sets the output weights for the dependent hidden unit to zero, this will also set the corresponding gradient and Hessian elements to zero. In general, any dependence in the hidden layer will cause the corresponding learning factor to be zero and will not affect the performance of MOLF.

Lemma 3. For each dependent hidden unit, the corresponding row and column of H_{molf} are zero-valued.

This follows from (28). The zero-valued rows and columns in H_{molf} cause no difficulties if (29) is solved using OLS [26].

5. Improving the MOLF algorithm

Section 4.1 showed that the presence of linearly dependent inputs affects MOLF training, primarily the input weight adaptation. Dependent inputs manifest themselves as additional rows and columns in the input weight Jacobian matrix G. If we can find a way to set these additional rows and columns to zero, then we can ensure that the weights connected to the dependent inputs will not get updated and the training will be unaffected by the presence of dependent inputs. In this section, we present a way to mitigate the effect of linearly dependent inputs on training input weights.

Without loss of generality, let a linearly dependent input x_p(N+2) be modeled as in (35), where the first (N+1) inputs are linearly independent. Using the singular value decomposition (SVD), the input autocorrelation matrix R of dimension (N+2) by (N+2) is written as U \Sigma U^T, so that its (pseudo) inverse can be written as A^T A, where the transformation matrix A is defined as

A = \Sigma^{-1/2} U^T

Both the last row and the last column of A are zero. Now, when the input vector with the dependent input is transformed as

x'_p = A x_p    (40)

the last element x'_p(N+2) will be zero. This type of transformation is termed a whitening transformation [27] in the linear algebra and statistics literature. Now, let us investigate the effect of dependent inputs on adjusting the input weights using MOLF when the whitening transform is incorporated into training. We need to verify that the transformed input vector x'_p, with zero values for the dependent inputs, causes no problems. The N_h by (N+2) gradient matrix G' for the transformed data is

G' = G A^T    (41)

where G is the negative Jacobian matrix before the inputs are transformed. Although the last column of G is dependent upon the other columns, the whitening transform causes the last column of G' to be zero. The fact that x'_p(N+2) and g'(k, N+2) are zero causes (36) to revert to (27) and H'_{molf} in (38) to equal H_{molf}. When OLS is used to find z and then the output weights, x_p(N+2) has no effect, and weights connected to it are not trained.

Eqs. (40) and (41) now provide us with a way to remove linearly dependent inputs during training, so their presence has no effect. This leads us to develop an improved version of MOLF. The transformation applied to the Jacobian matrix is a whitening transformation (WT). A whitening transformation has two desirable properties: (i) decorrelation, which causes the transformed data to have a diagonal covariance matrix, and (ii) whitening, which forces that covariance to be the identity matrix. We call the improved algorithm MOLF-WT. The steps to training an MLP with MOLF-WT are listed below:

1. Compute the input autocorrelation matrix R as given by (4).
2. Use the SVD to compute the whitening transformation matrix A as described above (see the sketch after this list).
3. In each iteration of training:
   a. Calculate the negative input weight Jacobian G using BP, as given by (6).
   b. Calculate the whitening-transformed Jacobian G' as described in (41), to eliminate any linear dependency in the inputs.
   c. Calculate z using Newton's method as before, using (29), and update the input weights as described in (30), except that G is replaced by G'.
   d. Solve linear equations for all output weights as before, using (4).
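Steps 1-2 and 3(b) reduce to a few array operations. The helper below is our own sketch (the tolerance and the pseudo-inverse treatment of zero singular values are assumptions); it builds A = \Sigma^{-1/2} U^T from the input autocorrelation matrix, and the trailing comment shows the transform of step 3(b).

```python
import numpy as np

def whitening_matrix(X, tol=1e-10):
    """A = Sigma^{-1/2} U^T from the SVD of the input autocorrelation matrix.
    Singular values below tol are treated as zero, so linearly dependent
    inputs are mapped to zero rather than amplified."""
    R = (X.T @ X) / X.shape[0]                     # input autocorrelation
    U, s, _ = np.linalg.svd(R)
    inv_sqrt = np.where(s > tol, 1.0 / np.sqrt(np.where(s > tol, s, 1.0)), 0.0)
    return inv_sqrt[:, None] * U.T                 # rows of U^T scaled by sigma^{-1/2}

# In each training iteration (step 3b), the BP Jacobian is transformed per eq. (41):
# G_wt = G @ A.T
```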

6. Performance comparison and results

This section compares the performance of MOLF-WT with several popular existing feed-forward batch learning algorithms. We begin by listing the various algorithms being compared. We then identify the metrics used for comparing the performances of the various algorithms and provide details of the experimental procedure, along with a description of the data files used for obtaining the numerical results.

6.1. List of competing algorithms

The performance of MOLF-WT is compared with four other existing methods, namely:

1. Output weight optimization-backpropagation (OWO-BP)
2. Multiple optimal learning factors applied to OWO-BP (MOLF-BP)
3. Levenberg–Marquardt (LM)
4. Scaled conjugate gradient (SCG) [28]

All algorithms employ the fully connected feed-forward network architecture described in Fig. 1. The hidden units have sigmoid activation functions, while the output units all have linear activations. SCG and LM employ one-stage learning, where all weights in the network are updated simultaneously in each iteration. OWO-BP, MOLF-BP, and MOLF-WT employ two-stage learning, where we first update the input weights and subsequently solve linear equations for the output weights. A single OLF was used for training in the OWO-BP and LM algorithms, while multiple OLFs were used for MOLF-BP and MOLF-WT.

6.2. Experimental procedure

Primarily, we make use of the k-fold methodology for our experiments (k = 10 for our simulations). Each data set is split into k non-overlapping parts of equal size. Of these, k-2 parts (roughly 80%) are combined and used for training, while each of the remaining two parts is used for validation and testing, respectively (roughly 10% each). Validation is performed once per training iteration (to prevent over-training), and the network with the minimum validation error is saved. After the training is completed, the network with the minimum validation error is used with the testing data to obtain the final testing error, which measures the network's ability to generalize. The maximum number of iterations per training session is fixed at 2500 and is the same for all algorithms. There is also an early stopping criterion. Training is repeated until all k combinations have been exhausted (i.e., 10 times).

It is known that the initial network plays a role in the network's ability to learn. In order to make our performance comparisons independent of the initial network, we carry out the k-fold methodology 5 times, each time with a different initial network. This leads to a total of 50 different training sessions per algorithm, with 50 different initial networks being used. We call this the 5k-fold method. For a given data set, even though the initial weights are different in each training session, we use the same 50 initial weight sets across all algorithms. This is done to allow for a fair comparison between the various training methods.

We use net control [19] to adjust the weights before training begins. In net control, the randomly initialized input weights and hidden unit thresholds are scaled and biased so that each hidden unit's net function has the same mean and standard deviation. Specifically, the net function means are equal to 0.5 and the corresponding standard deviations are equal to 1. In addition, OWO is used to find the initial output weights. This means that all the algorithms listed above have the same starting point (a different one for each session) for all data files. This will be evident in the plots presented later in this section, where we can see that the MSE for the first iteration is the same for all algorithms.
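A minimal sketch of net control follows, assuming the target mean of 0.5 and standard deviation of 1 quoted above and a convention that the last column of W multiplies the constant input x_p(N+1) = 1; the function name and structure are ours, not from [19].

```python
import numpy as np

def net_control(W, X, target_mean=0.5, target_std=1.0):
    """Rescale each hidden unit's input weights and adjust its threshold so that
    its net function over the training inputs X has the stated mean and std."""
    net = X @ W.T                            # Nv x Nh net functions
    m, s = net.mean(axis=0), net.std(axis=0)
    scale = target_std / s
    W = W * scale[:, None]                   # fixes the standard deviation
    W[:, -1] += target_mean - scale * m      # shifts the mean via the threshold
    return W
```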

At the end of the 5k-fold procedure, we compute (i) the average training error, (ii) the average minimum validation error, (iii) the average testing error, and (iv) the average number of cumulative multiplies required for training, for each algorithm and for each data file. The average is computed over 50 sessions. These quantities form the metrics for comparison and are subsequently used to generate the plots and tables that compare the performance of the different learning algorithms.

6.3. Computational cost

The proposed MOLF algorithm involves inverting a Hessian matrix, which has a general complexity of O(N^3) for a square matrix of size N. However, compared to Newton's method or LM, the size of the Hessian is much smaller. Updating weights using Newton's method or LM requires a Hessian with N_w rows and columns, where N_w = M N_u + (N+1) N_h and N_u = N + N_h + 1. In comparison, the Hessian used in the proposed MOLF has only N_h rows and columns. The number of multiplies required to solve for the output weights using OLS is given by

M_{ols} = N_u (N_u + 1) \left[ M + \frac{1}{6} N_u (2 N_u + 1) + \frac{3}{2} \right]    (42)

The numbers of multiplies per training iteration for OWO-BP, LM, SCG and MOLF are given below:

M_{owo-bp} = N_v \left[ 2 N_h (N+2) + M (N_u + 1) + \frac{N_u (N_u + 1)}{2} + M (N + 6 N_h + 4) \right] + M_{ols} + N_h (N+1)    (43)

M_{lm} = N_v \left[ M N_u + 2 N_h (N+1) + M (N + 6 N_h + 4) + M N_u (N_u + 3 N_h (N+1)) + 4 N_h^2 (N+1)^2 \right] + N_w^3 + N_w^2    (44)

M_{scg} = 4 N_v [ N_h (N+1) + M N_u ] + 10 [ N_h (N+1) + M N_u ]    (45)

M_{molf} = M_{owo-bp} + N_v [ N_h (N+4) - M (N + 6 N_h + 4) ] + N_h^3    (46)

Note that M_{molf} consists of M_{owo-bp} plus the required multiplies for calculating the optimal learning factors. Similarly, M_{lm} consists of M_{bp} plus the required multiplies for calculating and inverting the Hessian matrix. We will see later in the plots that, despite the additional complexity of Hessian inversion, the number of cumulative multiplies needed to train a network using MOLF is almost the same as for the OWO-BP or SCG algorithms, which do not involve any matrix inversion operation. We will also see that LM, on the other hand, has a far greater overhead due to the inversion of a much larger Hessian matrix.

6.4. Training data sets

Table 1 lists the seven data sets used for simulations and performance comparisons.

Table 1. Data set description.

Data set name      | No. of inputs | No. of outputs | No. of patterns | No. of hidden units
Prognostics        | 17 | 9 | 4745   | 13
Federal reserve    | 15 | 1 | 1049   | 34
Housing            | 16 | 1 | 22,784 | 30
Concrete           | 8  | 1 | 1030   | 13
Remote sensing     | 16 | 3 | 5992   | 25
White wine quality | 11 | 1 | 4898   | 24
Parkinson's        | 16 | 2 | 5875   | 17

All the data sets are publicly available. In all data sets, the inputs have been normalized to be zero-mean and unit variance. This way, it becomes clear that MOLF is not equivalent to a mere normalization of the data.

The number of hidden units for each data set is selected by first training a multi-layer perceptron with a large number of hidden units, followed by step-wise pruning with validation [29]. The least useful hidden unit is removed at each step until there is only one hidden unit left. The number of hidden units corresponding to the smallest validation error is used for training on that data set.

6.5. Results

Numerical results obtained from our training, validation and testing methodology for the proposed MOLF-WT algorithm are compared with those of OWO-BP, MOLF-BP, LM and SCG, using the data files listed in Table 1. Two plots are generated for each data file: (i) the average mean square error (MSE) for training from 10-fold training is plotted versus the number of iterations (shown on a log10 scale) and (ii) the average training MSE from 10-fold training is plotted versus the cumulative number of multiplies (also shown on a log10 scale) for each algorithm. Together, the two plots show not only the final average MSE achieved by training but also the computation consumed in getting there.

6.5.1. Prognostics data set

This data file is available at the Image Processing and Neural Networks Lab repository [30]. It consists of parameters that are available in the Bell Helicopter health and usage monitoring system (HUMS), which performs flight load synthesis, a form of prognostics [31]. The data set contains 17 input features and 9 output parameters.

We trained an MLP having 13 hidden units. In Fig. 3-a, the average mean square error (MSE) for 10-fold training is plotted versus the number of iterations for each algorithm. In Fig. 3-b, the average 10-fold training MSE is plotted versus the required number of multiplies (shown on a log10 scale). From Fig. 3-a, LM and the MOLF-based algorithms have almost identical performance (LM is marginally better than MOLF-WT, which in turn is marginally better than MOLF-BP). However, the performance of LM comes with a significantly higher computational demand, as shown in Fig. 3-b. Also, the MOLF-based algorithms are better than OWO-BP and SCG.

Fig. 3. Prognostics data set: average error vs. (a) iterations and (b) multiplies per iteration.

6.5.2. Federal reserve economic data set

This file contains some economic data for the USA from 01/04/1980 to 02/04/2000 on a weekly basis. From the given features, the goal is to predict the 1-Month CD Rate [32]. For this data file, called TR on its webpage [33], we trained an MLP having 34 hidden units. From Fig. 4-a, LM has the best overall performance, followed closely by MOLF-WT. The MOLF-WT algorithm is better than MOLF-BP. Fig. 4-b shows the computational cost of achieving this performance. The proposed MOLF algorithms consume slightly more computation than OWO-BP and SCG. However, all the first order algorithms utilized about two orders of magnitude fewer computations than LM.

Fig. 4. Federal reserve data: average error vs. (a) iterations and (b) multiplies per iteration.

6.5.3. Housing data set

This data file is available on the DELVE data set repository [34]. The data was collected as part of the 1990 US census. For the purpose of this data set, a State-Place level was used. Data from all states were obtained. Most of the counts were changed into appropriate proportions. These are all concerned with predicting the median price of houses in a region based on demographic composition and the state of the housing market in the region. The House-16H data was used in our simulation, with 'H' standing for high difficulty. Tasks with high difficulty have had their attributes chosen to make the modeling more difficult due to higher variance or lower correlation of the inputs to the target. We trained an MLP having 30 hidden units. From Figs. 5-a and 5-b, we see that LM had the best performance, followed closely by SCG. The proposed MOLF-WT algorithm has a slight edge over MOLF-BP.

Fig. 5. Housing data: average error vs. (a) iterations and (b) multiplies per iteration.

6.5.4. Concrete compressive strength data set

This data file is available on the UCI Machine Learning Repository [35]. It contains the actual concrete compressive strength (MPa) for a given mixture at a specific age (days), determined in the laboratory. The concrete compressive strength is a highly nonlinear function of age and ingredients. We trained an MLP having 13 hidden units. For this data set, Fig. 6 shows that LM has the best overall average training error, followed very closely by the improved MOLF-WT, which is marginally better than MOLF-BP. SCG attains a performance almost identical to MOLF-WT.

Fig. 6. Concrete data: average error vs. (a) iterations and (b) multiplies per iteration.

6.5.5. Remote sensing data set

This data file is available on the Image Processing and Neural Networks Lab repository [30]. It consists of 16 inputs and 3 outputs and represents the training set for inversion of the surface permittivity, the normalized surface rms roughness, and the surface correlation length found in back-scattering models from randomly rough dielectric surfaces [36]. We trained an MLP having 25 hidden units. From Fig. 7-a, MOLF-WT has the best overall average training MSE.

Fig. 7. Remote sensing data: average error vs. (a) iterations and (b) multiplies per iteration.

6.5.6. White wine quality data set

The Wine Quality-White data set [37] was created using white wine samples. The inputs include objective tests (e.g. pH values) and the output is based on sensory data (the median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). This data set is available on the UCI data repository [35]. For this data file, the number of hidden units was chosen to be 24. For this data set, Fig. 8 shows that LM and the improved MOLF-WT have almost identical performance and are better than the rest.

Fig. 8. Wine quality data: average error vs. (a) iterations and (b) multiplies per iteration.

6.5.7. Parkinson's disease data set

The Parkinson's data set [38] is available on the UCI data repository [35]. It is composed of a range of biomedical voice measurements from 42 people with early-stage Parkinson's disease recruited to a six-month trial of a telemonitoring device for remote symptom progression monitoring. The main aim of the data is to predict the motor and total UPDRS scores ('motor_UPDRS' and 'total_UPDRS') from the 16 voice measures. For this data set, the number of hidden units was chosen to be 17. From Fig. 9, LM attains the best overall training error, followed by the improved MOLF-WT. We can see again that the improved MOLF-WT performs better than MOLF-BP.

Fig. 9. Parkinson's disease data: average error vs. (a) iterations and (b) multiplies per iteration.

Table 2 is a compilation of the metrics used for comparing the performance of all algorithms on all data sets. It comprises the average minimum testing and validation errors. Based on the results obtained on the selected data sets, we can make a few observations.

Table 2. Statistics of testing and validation errors.

Average minimum testing error:

Data file name  | OWO-BP      | SCG         | MOLF-BP     | MOLF-WT     | LM
Prognostics     | 2.4187E7    | 2.6845E7    | 1.5985E7    | 1.4973E7    | 1.4509E7
Federal reserve | 0.108636974 | 0.33502068  | 0.105951897 | 0.120668845 | 0.103284696
Housing         | 1.7329E9    | 1.7441E9    | 1.2758E9    | 1.3599E9    | 1.1627E9
Concrete        | 33.48613532 | 83.78832466 | 32.57090477 | 33.72180127 | 29.73279111
Remote sensing  | 1.568023633 | 1.51843944  | 2.189078998 | 0.136484007 | 0.637828985
White wine      | 0.513739373 | 1.25867336  | 0.507925435 | 0.514729807 | 0.499294581
Parkinson's     | 139.0323995 | 125.5411318 | 129.4456136 | 127.3415428 | 122.2485104

Average minimum validation error:

Data file name  | OWO-BP      | SCG         | MOLF-BP     | MOLF-WT     | LM
Prognostics     | 2.4764E7    | 2.6675E7    | 1.5999E7    | 1.5155E7    | 1.4329E7
Federal reserve | 0.108442923 | 0.17653648  | 0.09675038  | 0.10759092  | 0.09371866
Housing         | 1.3512E9    | 1.1683E9    | 1.2414E9    | 1.2109E9    | 1.0662E9
Concrete        | 33.43842214 | 48.63268454 | 29.52707558 | 28.72108048 | 26.4212992
Remote sensing  | 1.534430917 | 0.71842796  | 0.57845174  | 0.15130356  | 0.37738708
White wine      | 0.503644249 | 0.50386288  | 0.49644024  | 0.49550544  | 0.49058676
Parkinson's     | 131.2226476 | 124.3035235 | 124.0881855 | 122.3187088 | 117.3367678

From the training plots and the data in Table 2, we can say that

1. LM is the top performer on 4 out of the 7 data sets. However, LM being a second order method, this performance comes at a significant cost of computation, almost two orders of magnitude greater than the rest;
2. Both SCG and OWO-BP consistently appear in the last two places in terms of average training error on 5 out of 7 data sets (the Housing and Concrete data sets being the exceptions). However, being first order methods, they set the bar for the lowest cost of computation;
3. Both MOLF-BP and MOLF-WT always perform better than their common predecessor, OWO-BP, on all data sets, and they are better than SCG except for the housing data set. This performance is achieved with minimal computational overhead compared to both SCG and OWO-BP, as evident in the plots of error vs. number of cumulative multiplications;
4. MOLF-BP is almost never better than LM, except for the remote sensing data set, where it is marginally better;
5. MOLF-WT always performs better than its immediate predecessor MOLF-BP (although MOLF-BP comes very close on 3 of the 7 data sets);
6. MOLF-WT achieves a better performance than LM on one data set (remote sensing) and has almost identical performance on two other data sets (wine and prognostics). It is also a close second on 3 of the remaining 4 data sets. The same cannot be said of MOLF-BP;
7. LM is again consistently the top performer, with the best average testing error on 6 out of 7 data sets, while the MOLF-based algorithms figure consistently in the top two performers (note that the network with the smallest validation error is used to obtain the testing error).

In general, we can see that

1. Both MOLF-BP and the improved MOLF-WT algorithms consistently outperform OWO-BP in all three phases of learning (i.e., MOLF inserted into OWO-BP does help improve its performance in training, validation and testing);
2. Both MOLF-BP and the improved MOLF-WT algorithms consistently outperform SCG in terms of the average minimum testing error;
3. MOLF-WT very often has a performance comparable to LM, and it attains that with minimal computational overhead;
4. While MOLF-BP is an improvement over OWO-BP, it is not as good as MOLF-WT.

7. Conclusions

Starting with the concept of equivalent networks and its implications, a class of hybrid learning algorithms is presented that uses first and second order information to simultaneously calculate optimal learning factors for all hidden units and train the input weights. A detailed analysis of the structure of the Hessian matrix for the proposed algorithm in relation to the full Newton Hessian matrix is presented, showing how the smaller MOLF Hessian could potentially be less ill-conditioned. Elements of the reduced size Hessian are shown to be weighted sums of the elements of the entire network's Hessian. Limitations of the basic MOLF procedure in the presence of linearly dependent inputs are shown mathematically. The limitations are used as a launch pad to develop a version of MOLF that incorporates a whitening-transformation type operation to mitigate the effect of linearly dependent inputs. This version is called MOLF-WT.

A detailed experimental procedure is presented, and the performance of MOLF-BP and MOLF-WT is compared to that of three other existing feed-forward batch learning algorithms on seven publicly available data sets. For the data sets investigated, both MOLF-BP and MOLF-WT are about as computationally efficient per iteration as OWO-BP and SCG. MOLF-BP consistently performs better than OWO-BP. In effect, the MOLF procedure improves OWO-BP by using an N_h by N_h Hessian in place of the normal 1 by 1 Hessian used to calculate a scalar OLF. MOLF-WT has a performance comparable to, and in some cases better than, that of LM. From the results, MOLF-WT stays close to the computational cost of OWO-BP and SCG while achieving a performance comparable with LM, and it may be a viable alternative to LM in terms of efficiency of training, minimum validation error and testing errors.

Since they use smaller Hessian matrices, the MOLF methods presented in this paper require fewer multiplies per iteration than traditional second order methods based on Newton's method. Finally, it is clear from the figures that MOLF-BP and MOLF-WT decrease the training error much faster than the other algorithms.

In future work, the MOLF approach will be applied to other training algorithms and to networks having additional hidden layers. An alternate approach, in which each network layer has its own optimal learning factor, will also be tried. In addition, the MOLF algorithm will be extended to handle classification type applications.

References

[1] R.C. Odom, P. Pavlakos, S.S. Diocee, S.M. Bailey, D.M. Zander, J.J. Gillespie, Shaly sand analysis using density-neutron porosities from a cased-hole pulsed neutron system, SPE Rocky Mountain Regional Meeting Proceedings, Society of Petroleum Engineers, 1999, pp. 467–476.
[2] A. Khotanzad, M.H. Davis, A. Abaye, D.J. Maratukulam, An artificial neural network hourly temperature forecaster with applications in load forecasting, IEEE Trans. Power Syst. 11 (2) (1996) 870–876.
[3] S. Marinai, M. Gori, G. Soda, Artificial neural networks for document analysis and recognition, IEEE Trans. Pattern Anal. Mach. Intell. 27 (1) (2005) 23–35.
[4] J. Kamruzzaman, R.A. Sarker, R. Begg, Artificial Neural Networks: Applications in Finance and Manufacturing, Idea Group Inc (IGI), Hershey, PA, 2006.
[5] L. Wang, X. Fu, Data Mining With Computational Intelligence, Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2005.
[6] G. Cybenko, Approximations by superposition of a sigmoidal function, Math. Control Signals Syst. (MCSS) 2 (1989) 303–314.
[7] D.W. Ruck, et al., The multi-layer perceptron as an approximation to a Bayes optimal discriminant function, IEEE Trans. Neural Netw. 1 (4) (1990) 296–298.
[8] Q. Yu, S.J. Apollo, M.T. Manry, MAP estimation and the multi-layer perceptron, in: Proceedings of the 1993 IEEE Workshop on Neural Networks for Signal Processing, Linthicum Heights, Maryland, September 6–9, 1993, pp. 30–39.
[9] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning internal representations by error propagation, in: D.E. Rumelhart, J.L. McClelland (Eds.), Parallel Distributed Processing, vol. I, The MIT Press, Cambridge, Massachusetts, 1986.
[10] J.P. Fitch, S.K. Lehman, F.U. Dowla, S.Y. Lu, E.M. Johansson, D.M. Goodman, Ship wake-detection procedure using conjugate gradient trained artificial neural networks, IEEE Trans. Geosci. Remote Sens. 29 (5) (1991) 718–726.
[11] S. McLoone, G. Irwin, A variable memory quasi-Newton training algorithm, Neural Process. Lett. 9 (1999) 77–89.
[12] A.J. Shepherd, Second-Order Methods for Neural Networks, Springer-Verlag New York, Inc., 1997.
[13] K. Levenberg, A method for the solution of certain problems in least squares, Q. Appl. Math. 2 (1944) 164–168.
[14] D. Marquardt, An algorithm for least-squares estimation of nonlinear parameters, SIAM J. Appl. Math. 11 (1963) 431–441.
[15] Changhua Yu, Michael T. Manry, Jiang Li, Effects of nonsingular pre-processing on feed-forward network training, Int. J. Pattern Recognit. Artif. Intell. 19 (2) (2005) 217–247.
[16] S. Ergezinger, E. Thomsen, An accelerated learning algorithm for multilayer perceptrons: optimization layer by layer, IEEE Trans. Neural Netw. 6 (1) (1995) 31–42.
[17] M.T. Manry, S.J. Apollo, L.S. Allen, W.D. Lyle, W. Gong, M.S. Dawson, A.K. Fung, Fast training of neural networks for remote sensing, Remote Sens. Rev. 9 (1994) 77–96.
[18] S.A. Barton, A matrix method for optimizing a neural network, Neural Comput. 3 (3) (1991) 450–459.
[19] J. Olvera, X. Guan, M.T. Manry, Theory of monomial networks, in: Proceedings of SINS'92, Automation and Robotics Research Institute, Fort Worth, Texas, 1992, pp. 96–101.
[20] G.D. Magoulas, M.N. Vrahatis, G.S. Androulakis, Improving the convergence of the backpropagation algorithm using learning rate adaptation methods, Neural Comput. 11 (7) (1999) 1769–1796.
[21] Amit Bhaya, Eugenius Kaszkurewicz, Steepest descent with momentum for quadratic functions is a version of the conjugate gradient method, Neural Netw. 17 (1) (2004) 65–71.
[22] S.S. Malalur, M.T. Manry, Multiple optimal learning factors for feed-forward networks, in: Proceedings of SPIE: Independent Component Analyses, Wavelets, Neural Networks, Biosystems, and Nanoengineering VIII, vol. 7703, Orlando, Florida, April 7–9, 2010, 77030F.
[23] P. Jesudhas, M.T. Manry, R. Rawat, S. Malalur, Analysis and improvement of multiple optimal learning factors for feed-forward networks, in: Proceedings of the International Joint Conference on Neural Networks (IJCNN), San Jose, California, USA, July 31–August 5, 2011, pp. 2593–2600.
[24] W. Kaminski, P. Strumillo, Kernel orthonormalization in radial basis function neural networks, IEEE Trans. Neural Netw. 8 (5) (1997) 1177–1183.
[25] R.P. Lippmann, An introduction to computing with neural nets, IEEE ASSP Magazine (April 1987).
[26] R. Battiti, First- and second-order methods for learning: between steepest descent and Newton's method, Neural Comput. 4 (1992) 141–166.
[27] S. Raudys, Statistical and Neural Classifiers: An Integrated Approach to Design, Springer-Verlag, London, 2001.
[28] M.F. Moller, A scaled conjugate gradient algorithm for fast supervised learning, Neural Netw. 6 (1993) 525–533.
[29] P.L. Narasimha, W.H. Delashmit, M.T. Manry, Jiang Li, F. Maldonado, An integrated growing-pruning method for feed-forward network training, Neurocomputing 71 (2008) 2831–2847.
[30] University of Texas at Arlington, Training Data Files, ⟨http://www.uta.edu/faculty/manry/new_mapping.html⟩.
[31] M.T. Manry, H. Chandrasekaran, C.-H. Hsieh, Signal processing applications of the multilayer perceptron, in: Yu Hen Hu, Jenq-Neng Hwang (Eds.), Handbook of Neural Network Signal Processing, CRC Press, Boca Raton, 2001.
[32] US Census Bureau ⟨http://www.census.gov⟩ (under Lookup Access ⟨http://www.census.gov/cdrom/lookup⟩: Summary Tape File 1).
[33] Bilkent University, Function Approximation Repository, ⟨http://funapp.cs.bilkent.edu.tr/DataSets/⟩.
[34] University of Toronto, Delve Data Sets, ⟨http://www.cs.toronto.edu/~delve/data/datasets.html⟩.
[35] A. Frank, A. Asuncion, UCI Machine Learning Repository, University of California, School of Information and Computer Science, Irvine, CA, 2010, ⟨http://archive.ics.uci.edu/ml⟩.
[36] A.K. Fung, Z. Li, K.S. Chen, Backscattering from a randomly rough dielectric surface, IEEE Trans. Geosci. Remote Sens. 30 (2) (1992).
[37] P. Cortez, A. Cerdeira, F. Almeida, T. Matos, J. Reis, Modeling wine preferences by data mining from physicochemical properties, Decision Support Systems 47 (2009) 547–553.
[38] A. Tsanas, M.A. Little, P.E. McSharry, L.O. Ramig, Accurate telemonitoring of Parkinson's disease progression by non-invasive speech tests, IEEE Trans. Biomed. Eng. 57 (4) (2009) 884–893.

Sanjeev S. Malalur received his M.S. and Ph.D. in Electrical Engineering from The University of Texas at Arlington, mentored by Prof. Michael T. Manry. For his doctoral dissertation, he developed a family of fast and robust training algorithms for multi-layer perceptron feed-forward neural networks. He developed a fingerprint verification system as a part of his Master's thesis. He is currently employed by Hunter Well Science, where he continues to extend his research by developing techniques for modeling, analyzing and interpreting well log data. His research interests include machine learning algorithms, with an emphasis on neural networks, signal and image processing techniques, statistical pattern recognition and biometrics. He has authored several peer-reviewed publications in his field of research.

Michael T. Manry is a professor at the University of Texas at Arlington (UTA) and his experience includes work in neural networks for the past 25 years. He is currently the director of the Image Processing and Neural Networks Lab (IPNNL) at UTA. He has published more than 150 papers and he is a senior member of the IEEE. He is an associate editor of Neurocomputing, and also a referee for several journals. His recent work, sponsored by the Advanced Technology Program of the state of Texas, Bell Helicopter Textron, E-Systems, Mobil Research, and NASA, has involved the development of techniques for the analysis and fast design of neural networks for image processing, parameter estimation, and pattern classification. He has served as a consultant for the Office of Missile Electronic Warfare at White Sands Missile Range, MICOM (Missile Command) at Redstone Arsenal, NSF, Texas Instruments, Geophysics International, Halliburton Logging Services, Mobil Research, Williams-Pyro and Verity Instruments. Dr. Manry is President of Neural Decision Lab LLC, which is a member of the Arlington Technology Incubator. The company commercializes software developed by the IPNNL.

Praveen Jesudhas received the B.E. in Electrical Engineering from Anna University, India in 2008 and the M.S. degree in Electrical Engineering from the University of Texas at Arlington, USA in 2010. In his university research, Praveen worked on problems in iris detection, iris feature extraction, and second order training algorithms for neural nets. As a Pattern Recognition Intern at FastVDO LLC in Maryland, he developed particle filter-based vehicle tracking algorithms and Adaboost-based vehicle detection software. Currently he is employed at Egrabber, India as a Research Engineer, where he uses machine learning to solve problems related to automated online information retrieval. His current interests include neural networks, machine learning, natural language processing and big data analytics. He is a member of Tau Beta Pi.
