
Transcript of [IEEE 2014 Recent Advances and Innovations in Engineering (ICRAIE) - Jaipur, India...


IEEE International Conference on Recent Advances and Innovations in Engineering (ICRAIE-2014), May 09-11, 2014, Jaipur, India

A Skewed Derivative Activation Function for SFFANNs

Pravin Chandra* and Sartaj Singh Sodhi†

University School of Information and Communication Technology, Guru Gobind Singh Indraprastha University, Sector 16C, Dwarka, Delhi (INDIA) - 110078

*[email protected], [email protected]; †[email protected], [email protected]

Abstract—In the current paper, a new activation function is proposed for use in constructing sigmoidal feedforward artificial neural networks. The suitability of the proposed activation function is established. The proposed activation function has a skewed derivative, whereas the derivatives of the usually utilized activation functions are symmetric about the y-axis (as for the log-sigmoid or the hyperbolic tangent function). The efficiency and efficacy of the proposed activation function is demonstrated on six function approximation tasks. The obtained results indicate that a network using the proposed activation function in the hidden layer, when trained, converges to deeper minima of the error functional, generalizes better and converges faster as compared to networks using the standard log-sigmoid activation function at the hidden layer.

I. INTRODUCTION

Sigmoidal Feedforward Artificial Neural Networks (SFFANNs) have been used to solve a variety of complex learning tasks ranging over a broad spectrum like pattern recognition, function approximation, beam-forming etc. [1], [2], [3]. The fundamental result that allows the usage of these networks to solve learning tasks is encapsulated in the universal approximation property (UAP) results for these networks [4], [5], [6], [7]. These results assert that there exists a network with a sufficient number of non-linear nodes in at least one hidden layer that can approximate any continuous function arbitrarily well. The non-linearity at the hidden node is generally required to be a sigmoidal class function. The generally used non-linearities of this class are the hyperbolic tangent function:

\sigma_1(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}, \qquad \frac{d\sigma_1(x)}{dx} = 1 - \sigma_1^2(x)    (1)

and the logistic or the log-sigmoid function:

\sigma_2(x) = \frac{1}{1 + e^{-x}}, \qquad \frac{d\sigma_2(x)}{dx} = \sigma_2(x)\,\bigl(1 - \sigma_2(x)\bigr)    (2)

Both these nonlinear sigmoidal functions and other functions generally used in the SFFANN literature (see [8] for a survey) have derivatives that are symmetric about the y-axis (Fig. 1). In this paper we propose a new sigmoidal class function whose derivative is asymmetric about the y-axis, and demonstrate its efficiency and efficacy over a set of six function approximation tasks. The obtained results allow us to assert that SFFANNs using the proposed activation function as the hidden node activation can be trained faster (to achieve the same goal) and have better generalization capability if trained for an equal number of epochs. The paper is organized as follows:


Section II describes the architecture of the SFFANN, the nomenclature used and the proposed activation function. Section III describes the design of the experiment. Section IV presents the results of the experiments while conclusions are given in Section V.

Fig. 1. The standard sigmoidal functions and their derivatives: (a) the hyperbolic tangent function (1); (b) the log-sigmoid function (2).

II. NETWORK ARCHITECTURE AND ACTIVATION FUNCTION

A. Network Architecture

The smallest network that possesses the UAP is required to have at least one hidden layer of non-linear nodes. And, since a network with multiple outputs can always be treated as an ensemble of networks, all with the same inputs (that is, sharing the inputs but having different hidden layer nodes), this is the architecture used to demonstrate the effect of the usage of the proposed activation function in this paper. The schematic diagram representing such a network is shown in Fig. 2. The inputs are represented by x_i, where i ∈ {1, 2, ..., I}, with I being the input space dimensionality.


The desired output for any specific input tuple is represented by y. The number of hidden nodes is represented by H. The connection strength between the ith hidden node and the jth input is represented by w_ij. The threshold of the ith hidden node is represented by θ_i. Then the net input to the ith hidden node is given by:

n_i = \sum_{j=1}^{I} w_{ij} x_j + \theta_i    (3)

Fig. 2. The schematic diagram of a single hidden layer network.

The hidden node net input is non-linearly transformed using a sigmoidal function (σ); this function is known variously as the transfer function, the activation function or the squashing function. The output from the ith hidden node (for a specific input) may be represented by h_i, that is, h_i = σ(n_i). Then the output from the network is given by:

y = \sum_{i=1}^{H} \alpha_i h_i + \gamma    (4)

That is, the threshold of the output node is represented by γ, while the connection strength between the ith hidden node and the output node is α_i. Moreover, the transfer function of the output node is taken to be a linear (identity) function.
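As a concrete illustration of (3) and (4), a minimal NumPy sketch of the forward pass might look as follows (the variable names and the use of the log-sigmoid σ2 at the hidden nodes are illustrative choices, not taken from the paper):

```python
import numpy as np

def logsigmoid(x):
    # sigma_2(x) = 1 / (1 + exp(-x)), equation (2)
    return 1.0 / (1.0 + np.exp(-x))

def sffann_forward(X, W, theta, alpha, gamma, act=logsigmoid):
    """Forward pass of a single-hidden-layer SFFANN.

    X     : (N, I) batch of inputs
    W     : (H, I) input-to-hidden weights w_ij
    theta : (H,)   hidden-node thresholds, equation (3)
    alpha : (H,)   hidden-to-output weights
    gamma : scalar output-node threshold, equation (4)
    """
    n = X @ W.T + theta          # net input n_i to each hidden node, eq. (3)
    h = act(n)                   # hidden-node outputs h_i = sigma(n_i)
    return h @ alpha + gamma     # linear output node, eq. (4)

# Example: a network with I = 2 inputs and H = 10 hidden nodes,
# weights/thresholds initialized uniformly in [-1, 1] as in Section III-B.
rng = np.random.default_rng(0)
I_dim, H = 2, 10
W = rng.uniform(-1, 1, size=(H, I_dim))
theta = rng.uniform(-1, 1, size=H)
alpha = rng.uniform(-1, 1, size=H)
gamma = rng.uniform(-1, 1)
X = rng.uniform(-1, 1, size=(5, I_dim))
print(sffann_forward(X, W, theta, alpha, gamma))
```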

B. Activation Function

The universal approximation property existence proofs require the following conditions to be imposed on the non-linear function used as the activation:

1) The function should be bounded [5], [6], [7], [9].

2) The function should be continuous [5], [6], [9] and differentiable [3], as the training algorithm(s) require the calculation of the derivative of the activation function.

3) The function (say σ(x)) should be sigmoidal [5], [6], [7], [9], or the following limiting values should exist [5], [6], [7], [9], [10]:

\lim_{x \to \infty} \sigma(x) = \alpha    (5)

\lim_{x \to -\infty} \sigma(x) = \beta    (6)

with α > β. The usual values are α = 1 and β = 0 or β = -1, but any finite scaling of the limits would also lead to a sigmoidal function. Moreover, the existence of the limits, for any function, asserts that if the function is monotone and continuous then the boundedness property is satisfied.

Thus, for any function to be considered for usage as an activation function, we need to verify the properties of continuity, monotonicity and the relations (5) and (6). It can be trivially seen that the hyperbolic tangent function (1) and the log-sigmoid function (2) are monotonically increasing, continuous, differentiable and bounded sigmoidal functions.

Fig. 3. The proposed activation function σ3 (7) and its derivative (10).

Let R be the set of all real numbers. If two mappings (say f and g, respectively) exist such that f : R → [β, α] and g : R → [β, α] with 0 ≤ β, and moreover for all x₁, x₂ ∈ R with x₁ < x₂ we have f(x₁) < f(x₂) and g(x₁) < g(x₂), then the two functions are monotonically increasing. Since the function values are non-negative, the two inequalities may be multiplied term by term to obtain f(x₁)g(x₁) < f(x₂)g(x₂); that is, the function s(x) ≡ f(x)g(x) is also a monotonically increasing function. A similar result can be obtained for the continuity and boundedness properties. Thus we define the function:

\sigma_3(x) = \frac{1}{(1 + e^{-x})(1 + e^{-2x})}    (7)

The function σ3 (7) is a product of two log-sigmoid functions with different slopes, and since the log-sigmoid function is monotone, this function is also monotone. Moreover, the function satisfies the following limits:

\lim_{x \to \infty} \sigma_3(x) = \lim_{x \to \infty} \sigma_2(x) = 1    (8)

\lim_{x \to -\infty} \sigma_3(x) = \lim_{x \to -\infty} \sigma_2(x) = 0    (9)


Fig. 4. Functions f1 (11) to f6 (16).

Thus, the function σ3 is monotonically increasing, continuous (as it is the product of two continuous functions), differentiable and bounded. Hence, σ3 is a sigmoidal function that can be utilized as an activation function at the hidden nodes of a SFFANN. And, networks using this non-linearity at the hidden nodes will possess the universal approximation property (UAP). The derivative of this function is:

\frac{d\sigma_3(x)}{dx} = \frac{e^{-x} + 2e^{-2x} + 3e^{-3x}}{(e^{-x} + 1)^2 (e^{-2x} + 1)^2}    (10)

The function σ3 and its derivative are shown in Fig. 3. From the figure it is apparent that the maximum of the derivative does not occur when the argument is zero, that is, the derivative is skewed.
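A quick numerical check of (7) and (10) (an illustrative sketch, not from the paper) confirms the skew: the derivative of σ3 peaks at a positive argument rather than at zero, and the limiting values (8) and (9) are approached at the ends of the evaluated range:

```python
import numpy as np

def sigma3(x):
    # Proposed activation: product of two log-sigmoids, equation (7)
    return 1.0 / ((1.0 + np.exp(-x)) * (1.0 + np.exp(-2.0 * x)))

def dsigma3(x):
    # Its derivative, equation (10)
    num = np.exp(-x) + 2.0 * np.exp(-2.0 * x) + 3.0 * np.exp(-3.0 * x)
    den = (np.exp(-x) + 1.0) ** 2 * (np.exp(-2.0 * x) + 1.0) ** 2
    return num / den

x = np.linspace(-6.0, 6.0, 120001)
peak = x[np.argmax(dsigma3(x))]
print(f"argmax of dsigma3/dx ~ {peak:.3f}")   # positive, not 0: the derivative is skewed
print(f"sigma3(-6) ~ {sigma3(-6.0):.4f}, sigma3(6) ~ {sigma3(6.0):.4f}")  # approaching 0 and 1
```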

III. DESIGN OF EXPERIMENT

A. Function Approximation Tasks

The approximation of the following two-dimensional input functions was used as the learning tasks (the inputs and the outputs are normalized to the interval [0,1] for diagrammatic representation in Fig. 4):

1) Two dimensional input function adapted from [11], [12], [2]:

f_1(x_1, x_2) = 42.659\left(0.1 + x_1\left(0.05 + x_1^4 - 10 x_1^2 x_2^2 + 5 x_2^4\right)\right)    (11)

where x1 ∈ [-0.5, 0.5] and x2 ∈ [-0.5, 0.5]. See Fig. 4a.

2) Two dimensional input function from [11], [12], [2]:

f_2(x_1, x_2) = 1.3356\left[1.5(1 - x_1) + e^{2x_1 - 1}\sin\!\left(3\pi(x_1 - 0.6)^2\right) + e^{3(x_2 - 0.5)}\sin\!\left(4\pi(x_2 - 0.9)^2\right)\right]    (12)

where x1 ∈ [0, 1] and x2 ∈ [0, 1]. See Fig. 4b.

3) Two dimensional input function from [11], [12], [2]:

f_3(x_1, x_2) = 1.9\left[1.35 + e^{x_1}\sin\!\left(13(x_1 - 0.6)^2\right)\, e^{-x_2}\sin(7x_2)\right]    (13)

where x1 ∈ [0, 1] and x2 ∈ [0, 1]. See Fig. 4c.

4) Two dimensional input function from [13], [12], [2]:

f_4(x_1, x_2) = \sin(x_1 x_2)    (14)

where x1 ∈ [-2, 2] and x2 ∈ [-2, 2]. See Fig. 4d.

5) Two dimensional input function adapted from [12], [2]:

f_5(x_1, x_2) = \frac{1 + \sin(2x_1 + 3x_2)}{3.5 + \sin(x_1 - x_2)}    (15)

where x1 ∈ [-2, 2] and x2 ∈ [-2, 2]. See Fig. 4e.

6) Two dimensional input function adapted from [12], [2]:

f_6(x_1, x_2) = \sin\!\left(2\pi\sqrt{x_1^2 + x_2^2}\right)    (16)

where x1 ∈ [-1, 1] and x2 ∈ [-1, 1]. See Fig. 4f.

Data sets are generated for each function approximation task by uniform random sampling of the input domain. A data set of size 250 (called TrS) is used for training the networks. A further sample of 1500 data points is used as the test set (called TeS). All data (both inputs and outputs) are scaled to [-1, 1] for training and further reporting.
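The sampling procedure just described might be sketched as follows for the task f4 of (14) (an illustration only; the helper names and the per-column min-max scaling used here are our assumptions, since the paper does not state the exact scaling formula):

```python
import numpy as np

def f4(x1, x2):
    # Benchmark task (14): f4(x1, x2) = sin(x1 * x2), with x1, x2 in [-2, 2]
    return np.sin(x1 * x2)

def scale_to_unit_interval(v):
    # Linearly rescale a vector to [-1, 1]
    vmin, vmax = v.min(), v.max()
    return 2.0 * (v - vmin) / (vmax - vmin) - 1.0

def make_dataset(n, rng, lo=-2.0, hi=2.0):
    # Uniform random sampling of the input domain
    X = rng.uniform(lo, hi, size=(n, 2))
    y = f4(X[:, 0], X[:, 1])
    return X, y

rng = np.random.default_rng(42)
X_tr, y_tr = make_dataset(250, rng)    # training set TrS, 250 points
X_te, y_te = make_dataset(1500, rng)   # test set TeS, 1500 points

# Scale each training input column and the training target to [-1, 1]
X_tr_s = np.column_stack([scale_to_unit_interval(X_tr[:, j]) for j in range(2)])
y_tr_s = scale_to_unit_interval(y_tr)
print(X_tr_s.shape, y_tr_s.shape, X_te.shape)
```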


B. SFFANN Architecture Used

The network architecture for a specific learning task is specified by the number of hidden layer nodes used in the experiments. The number of hidden nodes for a specific function approximation task was decided by exploratory experiments wherein the number of hidden nodes was varied between 2 and 30 in steps of 1, and the hidden layer size (number of nodes in the layer) for which the error of the network on the training set was satisfactory was taken as the appropriate size for that task. The summary of the architectures is given in Table I.
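The exploratory search over hidden-layer sizes described above might be sketched as follows (illustrative only; `train_and_mse` is a hypothetical helper standing in for a full training run, and `mse_threshold` is an assumed "satisfactory error" parameter):

```python
def select_hidden_size(train_and_mse, mse_threshold, h_min=2, h_max=30):
    """Return the smallest hidden-layer size whose training-set MSE is satisfactory.

    train_and_mse(h) is assumed to train a network with h hidden nodes
    and return its mean squared error on the training set.
    """
    for h in range(h_min, h_max + 1):   # sizes 2..30 in steps of 1
        if train_and_mse(h) <= mse_threshold:
            return h
    return h_max                        # fall back to the largest size tried
```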

All weights/thresholds of the networks were initialized to uniform random values in the range [-1, 1]. Since the range of the proposed activation function (7) is [0, 1], the activation function used in the experiments for comparison with the proposed activation function is the log-sigmoid activation (2). This choice is made because the range of the proposed activation function σ3 (7) and that of the standard log-sigmoid activation σ2 (2) is the same, that is [0, 1], whereas the range of the activation function σ1 (1) is [-1, 1].

TABLE I: NETWORK SIZE SUMMARY.

Sr.No.  Function  Inputs  Hidden Layer Size  Outputs
1.      f1        2       17                 1
2.      f2        2       10                 1
3.      f3        2       12                 1
4.      f4        2       20                 1
5.      f5        2       25                 1
6.      f6        2       21                 1

C. Experiment Procedure

For each activation function and for each function approximation task, a set of thirty networks is trained on the training data set for 1000 epochs, and the mean squared error (MSE) on the training set and the test set is measured after training. This allows us to measure the goodness of training (the value of the error achieved during training, a lower value being better) when the measurement is done over the training data, while the error over the test set measures the generalization capability of the networks. To study the speed of training, we set the goal of training to be twice the worse of the two mean (over networks) MSE values achieved by the two activation functions' network ensembles for a specific task.

Specifically, for each of the function approximation tasks the following statistics are measured and reported for the part of the experiments conducted to measure the training and generalization error reached when the two activation functions are used:

1) For each function approximation task, for each network of the ensemble for the task and for each activation function (σ2 and σ3), we measure the mean squared error (MSE). Thus, in each configuration (task, activation function, network) we measure the MSE over the training set (TrS) and the test set (TeS).

2) We report the average MSE over the networks (Mean), the standard deviation of the MSEs (Std.Dev.), the minimum MSE¹ (Min.), the maximum MSE² (Max.) and the median of the MSEs for the ensemble (Median).

3) We identify, for each function approximation task and activation function, the trained network for which the validation error is the least, and report the test data set (TeS) error for this network. The summary statistics reported are the MSE and the standard deviation of the errors (Std.Dev.).

To measure the speed of training when the two activation functions are used at the hidden layers of the networks, we set the goal of training to twice the mean MSE (Mean) of whichever activation function's network ensemble has the higher value. In this case we report the average number of epochs required to converge (Mean), the standard deviation of the convergent networks' training epochs (Std.), the minimum number of training epochs required by any network of the ensemble (Min.), the maximum number of training epochs required (Max.) and the median of the epochs required to converge (Median). These values are reported only for the networks that converge. We also report the number of networks that do not converge within 1000 epochs of training (NCN). A particular activation function would be called a "better activation function" if the networks using it are able to achieve the training goal in a smaller number of epochs on average and if fewer networks are non-convergent within the maximum number of epochs (that is, 1000 epochs of training). This part of the experiment is called the speed of convergence experiment.
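The ensemble statistics and the non-convergence count (NCN) reported in Tables II-VI can be computed along the following lines (an illustrative sketch with our own function names; the use of the sample standard deviation is an assumption, since the paper does not state which estimator is used):

```python
import numpy as np

def ensemble_summary(mse_values):
    """Mean, Std.Dev., Min., Max. and Median of an ensemble's MSEs."""
    v = np.asarray(mse_values, dtype=float)
    return {"Mean": v.mean(), "Std.Dev.": v.std(ddof=1),
            "Min.": v.min(), "Max.": v.max(), "Median": np.median(v)}

def convergence_summary(epochs_to_goal, max_epochs=1000):
    """Epoch statistics over the convergent networks plus the NCN count."""
    e = np.asarray(epochs_to_goal, dtype=float)
    converged = e[e <= max_epochs]
    ncn = int((e > max_epochs).sum())          # non-convergent networks
    stats = ensemble_summary(converged)
    stats["NCN"] = ncn
    return stats

# Example with made-up epoch counts (not data from the paper):
print(convergence_summary([120, 240, 310, 1001, 95, 400]))
```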

The training algorithm used is the Resilient Backpropagation (RPROP) method [14]. The implementation in MATLAB version 2013a is used for the experiments, with the default settings of the algorithm, and the experiments were performed on an Intel 64-bit i7 processor running Microsoft Windows 7 (64-bit).
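For reference, a simplified sign-based sketch of an RPROP-style weight update is given below (closest to the iRPROP− variant described in the literature, with the commonly quoted default constants η+ = 1.2 and η− = 0.5; it is not the MATLAB implementation the authors used, and it omits details such as the choice of initial step sizes):

```python
import numpy as np

def rprop_step(w, grad, prev_grad, delta,
               eta_plus=1.2, eta_minus=0.5, delta_min=1e-6, delta_max=50.0):
    """One simplified RPROP-style update; all arguments are arrays of the same shape.

    w         : current weights
    grad      : current gradient of the error w.r.t. w
    prev_grad : gradient from the previous step
    delta     : per-weight step sizes, adapted multiplicatively
    """
    sign_change = grad * prev_grad
    # Grow the step where the gradient kept its sign, shrink it where the sign flipped
    delta = np.where(sign_change > 0, np.minimum(delta * eta_plus, delta_max), delta)
    delta = np.where(sign_change < 0, np.maximum(delta * eta_minus, delta_min), delta)
    # After a sign change the gradient is zeroed, so no update is applied that step
    grad = np.where(sign_change < 0, 0.0, grad)
    # The update depends only on the sign of the gradient, not its magnitude
    w = w - np.sign(grad) * delta
    return w, grad, delta
```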

IV. RESULTS AND DISCUSSION

For the sake of brevity, only summary data is reported. The summary of the training, validation and test set statistics is shown in Table II for the activation function σ2(·) (2), while the corresponding table for the activation function σ3(·) (7) is Table III. From the results we may infer the following:

1) From the mean of the MSE values across the networks, for all function approximation tasks, it is observed that the networks using the activation function σ3 are able to achieve significantly deeper minima of the network error functional on average during training (see the TrS Mean value in the two cases), as compared to the networks using the activation function σ2.

¹That is, the least MSE obtained by any of the networks of a particular ensemble.
²That is, the largest MSE obtained by any of the networks of a particular ensemble.



2) From the mean of the MSE values across the networks, for all function approximation tasks, it is observed that the networks using the activation function σ3 are able to achieve substantially deeper minima of the network error functional on average over the test set (see the TeS Mean value in the two cases), as compared to the networks using the activation function σ2. Thus, we assert that (at least for these tasks) networks using the activation function σ3 generalize better on average.

3) A similar trend is seen for the values of the median MSE across networks.

From these observations we assert that (at least for the six function approximation tasks considered in this paper) the networks using the activation function σ3 (7) are able to achieve deeper minima of the training error and the generalization error on average, as compared to the activation function σ2 (2), and thus should be preferred.

Since, for every function approximation task, it is observed that the ensemble mean MSE for the networks using the activation function σ2 was significantly higher (see Table II and Table III), the goals for the speed of convergence experiments are taken as twice the mean MSE (across networks) for the activation function σ2 (see Table II); the goals are shown in Table IV.

The summary data for the speed of convergence of networks using the activation functions σ2 (2) and σ3 (7) are shown in Table V and Table VI respectively.

From Tables V and VI, the following may be observed:

1) The average number of epochs required for convergence to the specified goal (Table IV) is much smaller in the case of networks using the activation function σ3 as compared to the networks using the activation function σ2.

2) A similar statement can be made on the basis of the median number of epochs required for convergence.

3) The spread of the epochs required for convergence (as measured by the standard deviation over the network ensemble) is significantly smaller in the case of networks using the activation function σ3 as compared to the networks using the activation function σ2.

TABLE II: SUMMARY OF RESULTS FOR ACTIVATION FUNCTION σ2(·) (2). THE STATISTICS REPORTED ARE THE MEAN (MEAN) OF THE MEAN SQUARED ERROR (MSE), THE STANDARD DEVIATION OF THE MSES (STD.DEV.), THE MINIMUM MSE (MIN.), THE MAXIMUM MSE (MAX.) AND THE MEDIAN OF THE MSES OVER ALL THIRTY NETWORKS FOR A PARTICULAR FUNCTION APPROXIMATION TASK. ALL VALUES OF THE STATISTICS ARE REPORTED ×10⁻³.

                Training Set (TrS)                                    Test Set (TeS)
Sr.No.  Fn.     Mean     Std.Dev.  Min.     Max.      Median         Mean     Std.Dev.  Min.     Max.      Median
1       f1      11.0109  4.7884    5.0383   21.0117   8.2593         14.0593  5.6648    8.0094   25.5143   11.0578
2       f2      4.4891   1.9707    1.7897   12.6925   4.3284         5.0821   2.2293    1.9812   14.6386   4.7843
3       f3      8.6069   5.3121    2.1238   22.0605   6.7473         11.6720  5.6475    4.1136   24.6425   9.9517
4       f4      12.2701  2.3349    7.5277   17.5280   12.3570        20.1406  3.5212    11.7774  33.0842   20.2013
5       f5      23.8283  14.7643   8.4793   82.9934   19.6850        26.6793  22.4855   8.7566   130.1140  21.2406
6       f6      45.4219  25.3671   18.4509  180.5005  40.7812        85.9034  35.6388   32.5316  279.3768  79.8568

TABLE III: SUMMARY OF RESULTS FOR ACTIVATION FUNCTION σ3(·) (7). THE STATISTICS REPORTED ARE THE MEAN (MEAN) OF THE MEAN SQUARED ERROR (MSE), THE STANDARD DEVIATION OF THE MSES (STD.DEV.), THE MINIMUM MSE (MIN.), THE MAXIMUM MSE (MAX.) AND THE MEDIAN OF THE MSES OVER ALL THIRTY NETWORKS FOR A PARTICULAR FUNCTION APPROXIMATION TASK. ALL VALUES OF THE STATISTICS ARE REPORTED ×10⁻³.

                Training Set (TrS)                                    Test Set (TeS)
Sr.No.  Fn.     Mean     Std.Dev.  Min.     Max.      Median         Mean     Std.Dev.  Min.      Max.      Median
1       f1      5.6359   1.5568    2.9762   9.5600    5.4126         9.0460   2.2524    4.6126    14.2973   8.7018
2       f2      3.9332   2.4977    1.9427   19.3699   3.4007         4.4490   2.7084    2.0147    20.7375   3.8891
3       f3      6.5240   4.5437    2.2593   21.9673   4.8598         9.4654   4.9002    3.5062    24.7284   7.4699
4       f4      10.1274  2.3023    4.8448   15.9685   10.0493        17.1332  2.9299    10.8013   23.0125   16.93585
5       f5      13.8148  4.8125    5.2290   26.2111   13.1338        16.0789  5.7761    7.5124    28.4302   1-1-8201
6       f6      33.3913  10.3740   18.7541  72.4151   32.1962        71.9396  1**2199   41-70^-1  U2-5721   71.2927

TABLE IV: TRAINING GOALS FOR SPEED OF CONVERGENCE EXPERIMENT. ALL VALUES ARE ×10⁻³.

Sr.No.  Fn.  Goal
1       f1   22.0217
2       f2   8.9781
3       f3   17.2139
4       f4   24.5401
5       f5   47.6566
6       f6   90.8438

TABLE V: SUMMARY DATA FOR SPEED OF CONVERGENCE EXPERIMENT FOR ACTIVATION FUNCTION σ2(·) (2).

SNo.  Fn.  Mean    Std.    Min.  Max.  Median  NCN
1     f1   189.04  70.59   93    495   178.50  0
2     f2   385.79  172.91  176   936   330.00  3
3     f3   455.46  158.72  194   870   441.50  4
4     f4   263.84  96.66   154   538   237.00  0
5     f5   532.19  152.86  285   945   520.00  3
6     f6   534.98  160.73  337   968   509.00  1

Page 6: [IEEE 2014 Recent Advances and Innovations in Engineering (ICRAIE) - Jaipur, India (2014.5.9-2014.5.11)] International Conference on Recent Advances and Innovations in Engineering

IEEE International Conference on Recent Advances and Innovations in Engineering (ICRAIE-2014),May 09-11, 2014, Jaipur, India

4) The number of non-convergent cases, that is, networks failing to reach the specified goal within the specified number of training epochs (1000), is substantially smaller in the case of the networks using the activation function σ3 as compared to the networks using the activation function σ2.

The above observations allow us to assert (on the basis of the data for these six function approximation tasks) that the networks using the activation function σ3 (7) train faster as compared to the networks trained using the standard log-sigmoid function σ2 (2).

V. CONCLUSIONS

In this paper we propose a skewed derivative sigmoidal function, σ3 (7), that can be used as an activation function for sigmoidal feedforward artificial neural networks. The motivation for this work is that the UAP results [4], [5], [6], [7] for SFFANNs do not specify any "preferred" activation function, though the initial choices made, including the choice of the activation function, have been reported to have an effect on the speed of training [15], [8].

TABLE VI: SUMMARY DATA FOR SPEED OF CONVERGENCE EXPERIMENT FOR ACTIVATION FUNCTION σ3(·) (7).

SNo.  Fn.  Mean    Std.    Min.  Max.  Median  NCN
1     f1   104.72  29.98   57    236   103.00  0
2     f2   280.73  121.27  153   784   252.00  1
3     f3   278.48  157.01  111   901   257.00  2
4     f4   203.54  48.64   129   340   206.50  0
5     f5   369.22  99.77   199   599   346.00  0
6     f6   317.46  106.72  171   758   318.50  0

The proposed activation function has a range of [0, 1], similar to the log-sigmoid function σ2 (2). To demonstrate the efficiency and efficacy of the usage of the proposed activation function, a set of six function approximation tasks is used. The obtained results suggest that the networks using the proposed activation function σ3 (7) achieve deeper minima of the error functional during training for an equal number of epochs, as compared to the networks using the standard log-sigmoid function σ2 (2). The generalization error is also lower in the case of trained networks using the activation function σ3 (7). Moreover, the speed of convergence experiment results demonstrate that (at least on these six function approximation tasks) the number of epochs required for convergence to a specified goal is significantly smaller for networks using the activation function σ3 (7).

But, before a definite statement is made about the efficiency and efficacy of the usage of the proposed activation function σ3 (7) as a "preferred" activation function, further experiments are required on more learning tasks. In the experiments the RPROP algorithm was used; experiments need to be conducted with other training algorithms like the class of conjugate-gradient algorithms, the Levenberg-Marquardt algorithm and others [16], [17], [18], [19]. Other skewed derivative activation functions have been proposed [20], [21], [22]; the proposed activation function needs to be compared with these also. Experiments are being performed in this regard and would be reported separately.

REFERENCES

[1] M. T. Hagan, H. B. Demuth, and M. H. Beale, Neural Network Design. Boston, MA: PWS Publishing Company, 1996.

[2] V. Cherkassky and F. Mulier, Learning from Data - Concepts, Theory and Methods. New York: John Wiley, 1998.

[3] S. Haykin, Neural Networks: A Comprehensive Foundation. Englewood Cliffs, NJ: Prentice-Hall, 1999.

[4] A. R. Gallant and H. White, "There exists a neural network that does not make avoidable mistakes," in International Joint Conference on Neural Networks, vol. 1, 1988, pp. 593-606.

[5] G. Cybenko, "Approximation by superpositions of a sigmoidal function, " Math. Control, Signals, and Systems, vol. 2, pp. 303-314, 1989.

[6] K. Funahashi, "On the approximate realization of continuous mapping by neural networks, " Neural Networks, vol. 2, pp. 183-192, 1989.

[7] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators, " Neural Networks, vol. 2, pp. 359-366, 1989.

[8] W. Duch and N. Jankowski, "Survey of neural transfer functions," Neural Computing Surveys, vol. 2, pp. 163-212, 1999. [Online]. Available: http://www.icsi.berkeley.edu/~jagota/NCS

[9] S. M. Carroll and B. W. Dickinson, "Construction of neural nets using the Radon transform," in International Joint Conference on Neural Networks. IEEE, 1989, pp. 607-611.

[10] P. Chandra and Y. Singh, "Feedforward sigmoidal networks -equicontinuity and fault-tolerance, " IEEE Transactions on Neural Networks, vol. 15, no. 6, pp. 1350-1366, 2004.

[11] M. Maechler, D. Martin, J. Schimert, M. Csoppenszky, and J. Hwang, "Projection pursuit learning networks for regression," in Proc. of the 2nd International IEEE Conference on Tools for Artificial Intelligence, 1990, pp. 350-358.

[12] V. Cherkassky, D. Gehring, and F. Mulier, "Comparison of adaptive methods for function estimation from samples," IEEE Transactions on Neural Networks, vol. 7, no. 4, pp. 969-984, 1996.

[13] L. Breiman, "The PI method for estimating multivariate functions from noisy data," Technometrics, vol. 33, no. 2, pp. 125-160, 1991.

[14] M. Riedmiller and H. Braun, "A direct adaptive method for faster backpropagation learning: The RPROP algorithm," in Proc. of the IEEE International Conference on Neural Networks, vol. 1, San Francisco, 1993, pp. 586-591.

[15] J. F. Kolen and J. B. Pollack, "Back propagation is sensitive to initial conditions, " in Proc. of the 1990 conference on Advances in neural information processing systems 3. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1990, pp. 860-867.

[16] R. Battiti and F. Masulli, "BFGS optimization for faster and automated supervised learning," in Proc. of the International Neural Network Conference (INNC 90), vol. 2, Paris, 1990, pp. 757-760.

[17] M. F. Moller, "A scaled conjugate gradient algorithm for fast supervised learning, " Neural Networks, vol. 6, pp. 525-533, 1993.

[18] R. Battiti, "First and second order methods for learning: between steepest descent and Newton's method," Neural Computation, vol. 4, no. 2, pp. 141-166, 1992.

[19] F. D. Foresee and M. T. Hagan, "Gauss-Newton approximation to Bayesian regularization," in Proc. International Joint Conference on Neural Networks, vol. 3, Houston, Texas, 1997, pp. 1930-1935.

[20] Y. Singh and P. Chandra, "A class +1 sigmoidal activation functions for FFANNs," Journal of Economic Dynamics and Control, vol. 28, no. 1, pp. 183-187, 2003.

[21] P. Chandra, "Sigmoidal function classes for feedforward artificial neural networks, " Neural Processing Letters, vol. 18, no. 3, pp. 205-215, 2003.

[22] P. Chandra and Y. Singh, "A case for the self-adaptation of activation functions in FFANNs," Neurocomputing, vol. 56, pp. 447-454, 2004.