

Biol. Cybern. 70, 163-175 (1993) © Springer-Verlag 1993

Artificial neural networks with quasipolynomial synapses and product synaptic contacts

Ping Liang 1,*, Nadeem Jamali 2

1 College of Engineering, University of California, Riverside, CA 92521, USA 2 Department of Computer Science, State University of New York at Buffalo, Buffalo, NY 14214, USA

Received: 17 August 1992/Accepted in revised form: 8 June 1993

Abstract. A neural network model with quasipolynomial synapses and product contacts is investigated. The model further generalizes the sigma-pi and product unit models. What and how many quasipolynomial terms, both for individual variables and for cross-product terms, are learned, not predetermined, subject to hardware constraint. Three possible cases are considered. In case 1, the number of learnable parameters needed is determined in learning. It can be considered another method of "growing" a network for a given task, although the graph of the network is fixed. Mechanisms preventing the network from growing too many parameters are designed. In cases 2 and 3, the number of parameters allowed or available is fixed. Cases 2 and 3 may offer both some control on the generalizability of learning and flexibility in functional representation, and may provide a compromise between the complexity of loading and generalizability of learning. Gradient-descent algorithms for training feedforward networks with polynomial synapses and product contacts are developed. Hardware issues are considered, and experimental results are presented.

1 Introduction

The prevalent models currently used in neural network studies and electronic implementations make some or all of the following three simplifying assumptions:

1. The interaction of all postsynaptic signals is a simple summation in a latent period.

2. A synapse only linearly amplifies or attenuates the presynaptic signal. This is modeled by a connection weight that multiplies the presynaptic signal to produce the postsynaptic signal.

* On leave from the School of Computer Science, Technical University of Nova Scotia, Canada. Correspondence to: P. Liang, College of Engineering, University of California, Riverside, CA 92521-0425, USA

3. There is only one synaptic contact between two neurons.

These models are the results of further simplifications of the McCulloch-Pitts model (McCulloch and Pitts 1943). Because of the above simplifications, all the nonlinearity necessary to achieve a desired transformation is restricted to the neurons, i.e., the threshold or the sigmoidal transfer functions of the neurons. Neurophysiological data show that the above three assumptions are not true in a real neural network, e.g., see Bullock et al. (1977) and Shepherd (1983). In real neural networks, the interaction between postsynaptic signals is more complicated than a simple summation. Several synapses may come into contact together before acting on a neuron, as with presynaptic inhibitory synapses. The transformation between the presynaptic and postsynaptic signals is not linear. It is known that the transfer function of a synapse may be better modeled by a nonlinear transformation than by a simple multiplication by a linear coefficient. Also, a neuron may have more than one synaptic contact with another neuron. However, because of the first two assumptions, such multiple synaptic connections are assumed to be subsumed by a single connection weight.

Several models removing some of the above simplifying assumptions have been reported. The sigma-pi units (Rumelhart et al. 1986) and the product units (Durbin and Rumelhart 1989) allow multiplication of synaptic signals. The product units allow both the coefficients and the exponents of a product term to be learned. The product term is in the form of a polynomial term. However, the exponents may be any integer or real number. We refer to this type of term as a quasipolynomial term. Giles and Maxwell (1987) investigated a special case of the sigma-pi units, called high-order neural networks. The high-order networks use sigma-pi units only at the input layer. Encouraging results have been reported for these models. These models can be further generalized along the following lines:

1. The synapse transfer function in these models is still only a linear amplification.


2. The nonlinear terms in the high-order networks and sigma-pi and product unit networks must be "handcrafted" in the network using a priori knowledge before learning. Although this is not necessarily a disadvantage, since embedding a priori knowledge is a way to alleviate the load of the learning algorithm, it is still desirable to have a network that is able to learn nonlinear synapse transformations and to determine by learning what and how many nonlinear terms, both individual variable and cross-product terms, should be present when there is no a priori knowledge available.

3. The number of learnable parameters is fixed in the sigma-pi units, product units, high-order networks, and most other network models (except the growth networks discussed later). Again, this is not a disadvantage if a priori knowledge is available to choose the right number of learnable parameters. The number of learnable parameters should be kept minimal to ensure good generalization (Baum and Haussler 1989). However, when there is no a priori knowledge available, it is desirable to have a network that will gradually increase its total number of learnable parameters, subject to hardware constraint, if the training examples cannot be successfully loaded with a given number of learnable parameters.

This paper proposes a neural network model along the lines suggested above. Only feedforward networks are considered in this paper. Section 2 presents the quasipolynomial model of nonlinear synapses and product synaptic contacts. The associated backpropagation learning algorithms are developed. Three possible cases are identified. The relation between our nonlinear synapse network model and sigma-pi units, product units, high-order networks, and the linear synapse growth networks is discussed. The generalization of learning of the new model and its hardware implementation are considered in Sect. 3. Section 4 presents some experimental results. Conclusions are given in Sect. 5.

2 Quasipolynomial synapses and product synaptic contacts

From a computational point of view, by restricting the synapses to linear transformations and restricting all the nonlinearity to the neurons, the freedom to introduce and change independent nonlinear terms of the inputs is lost. Functions that involve nonlinear terms of individual inputs and cross-product terms will not be readily learned. This limits the functions that can be realized by a given network. Note that although a three-layer network can arbitrarily closely approximate any L1 function in a finite interval (Carroll and Dickinson 1989; Cybenko 1989; Hornik et al. 1989), it requires an increasing number of hidden units to do so. For networks of fixed size, different models differ in how well they can approximate a given function.

We propose to introduce a transformation at each synapse, which may be linear or nonlinear. The actual transformation should be learned. The transformation should accommodate both linear and nonlinear transformations by adjusting its parameters. Moreover, the form of the transformation should be amenable to analysis and derivation of the learning algorithm. Extending the work on product units (Durbin and Rumelhart 1989), we propose to use quasipolynomials to model the nonlinearity of synapses. By a quasipolynomial we mean a function having the same form as a polynomial but with exponents that can be any integer or real numbers. Obviously, a quasipolynomial is more general than a polynomial. Then, the postsynaptic signal becomes a quasipolynomial of the presynaptic signal instead of a simple amplification by multiplying a coefficient. Observe that a quasipolynomial synapse is now a computational unit, similar to a neuron, but with only one input. A postsynaptic signal now may have a product contact with other postsynaptic signals at a neuron, or may have a summation contact with a neuron. For a product contact, all the postsynaptic signals involved are multiplied together before being summed with other signals. For a summation contact, the signal is simply summed with other signals. The symbols for nonlinear synapses and nonlinear contacts are shown in Fig. 1. The number of terms, the exponents, and the coefficients of the quasipolynomials at the synapses are moved in the direction of least error by a learning mechanism that is derived using the gradient descent method similar to that of Rumelhart et al. (1986). What such a network implements is a quasipolynomial classifier.
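To make the two contact types concrete, here is a minimal sketch in Python; the function names and numeric values are ours, purely illustrative, since the paper defines the contacts only diagrammatically in Fig. 1:

```python
from functools import reduce
import operator

def summation_contact(postsynaptic_signals):
    # A summation contact: signals are simply summed.
    return sum(postsynaptic_signals)

def product_contact(postsynaptic_signals):
    # A product contact: signals are multiplied together
    # before being summed with the other inputs to the neuron.
    return reduce(operator.mul, postsynaptic_signals, 1.0)

# Two postsynaptic signals f1, f2 meeting at a product contact,
# plus one signal f3 arriving through a summation contact:
f1, f2, f3 = 0.5, 0.4, 0.3
net_input = product_contact([f1, f2]) + summation_contact([f3])
# net_input = 0.5 * 0.4 + 0.3 = 0.5
```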

Another possibility is to use polynomial synapses, i.e., to allow exponents only to be updated by integer amounts. There is no difference in the learning algorithm formulations except forcing increments to the exponents to be integers. This may cause discontinuities in the improvements of the network performance. The effect of such a polynomial algorithm is yet to be investigated. We suspect that there is no fundamental difference in the performance of a trained network with quasipolynomial synapses vs. one with polynomial synapses. Properties of polynomial classifiers were investigated by Schürmann (1976, 1977).


Fig. 1. Symbols for nonlinear synapses, summation and product contacts. C1 and C3 are summation contacts, and C2 is a product contact

Durbin and Rumelhart (1989) defend their product units against the following possible criticism: introducing more nonlinearity to a network is a trivial exercise of improving the representation capabilities of the network. "One can always improve a fit to data by making a model more complex, and this is rarely worth the price of throwing away elegance." In addition to the argument of being a natural extension, as in Durbin and Rumelhart (1989), we justify the new model for the following three possible cases.

• Case 1. More parameters can be made available to the network (either in hardware implementation or in simulation) as long as the algorithm calls for more parameters.¹ In this case, the new model is effectively another way of "growing" a network for a given task. It is known that a network with a fixed number of parameters may not be able to learn some functions or may require an exponential amount of time to do so (Judd 1990; Blum and Rivest 1988). A network that not only learns the parameters, but also learns the number of parameters (or the structure) of the network is desirable, provided that some mechanism is present that prevents the network from having too many degrees of freedom, leading to poor generalization (Baum and Haussler 1989).

• Case 2. The total number of parameters available to the network is fixed, but the distribution of the parameters to each synapse can be determined in learning. The network learns what quasipolynomial terms, and to some extent, how many terms, should be present at each synapse, subject to the limit of the total number of parameters available. Synapses requiring more learnable parameters will get more. Synapses may eliminate terms with small coefficients to release the learnable parameters they grabbed, so that terms with more significant effects can be added. This is effectively a network which learns both its parameters and the structure (i.e., distribution of the learnable parameters), although it may appear to have a fixed-graph representation of the network. The idea is very much like the "blackboard system" in symbolic artificial intelligence (Engelmore and Morgan 1988).

In both cases 1 and 2, the model adaptively learns the distribution of the parameters. This is effectively adapting the structure of a network, although the network appears to have a fixed graph.

• Case 3. The number of parameters available at each synapse is fixed. However, as in case 2, what quasipolynomial terms, and to some extent, how many terms (subject to the limit of parameters available at each synapse), are present at each synapse is still determined by learning, although with less flexibility than in cases 1 and 2. This is an alternative to using multilayer linear synapse networks with many neurons

1 Of course, in practice this growth has to be limited by the maximum resources available. Here it is assumed that the network is part of a larger system, and it will not use up all the parameters the larger system can supply.


and connections to realize highly nonlinear functions. The advantage may be fewer connections if suitable hardware is designed to realize the quasipolynomial synapses. As in case 2, a synapse may eliminate terms with small coefficients to release the learnable parameters, so that terms with more significant effects can be learned.

In both cases 2 and 3, terms with small coefficients can be deleted during learning to allow higher-degree quasipolynomial terms to be introduced without exceeding the number of parameters specified. Even for case 1, this method can also be applied to reduce the number of parameters.
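The term-deletion idea might be sketched as follows; the function name and the threshold value are our illustrative choices, not specified in the paper:

```python
def prune_small_terms(coeffs, threshold=1e-3):
    # coeffs maps an integer exponent a to the coefficient of y^a.
    # Terms whose coefficients are negligible are deleted, releasing
    # their learnable parameters for more significant terms.
    return {a: w for a, w in coeffs.items() if abs(w) >= threshold}

kept = prune_small_terms({1: 0.8, 2: 1e-5, 3: -0.2})
# only the exponent-1 and exponent-3 terms survive
```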

In case 3, some synapses may require more parameters than are available to achieve faster convergence, whereas others may not use all the parameters allocated. Case 2 is more flexible than case 3. The price is more overhead in managing the distribution of parameters to synapses.

For case 1, and for cases 2 and 3 when the number of parameters allocated is much larger than the task requires, there are mechanisms to prevent the use of too many parameters, which would lead to poor generalization. Note that when the number of parameters allocated is much larger than the task requires, cases 2 and 3 are equivalent to case 1. If that is the case, the network may not use all the parameters allocated in cases 2 and 3.

Case 1 may occur when a network intended as a general system for learning complicated functions is applied to a simple function. It may be possible that even with the mechanisms to prevent too many parameters from being "grown", the generalization capability may still suffer since more parameters than necessary are used. To avoid this, we can limit the number of parameters to suit a given task. This leads us to cases 2 and 3. For both cases 2 and 3, the generalization capability of a network can be controlled by limiting the total number of parameters allowed. Since cases 2 and 3 still have flexibility in learning what quasipolynomial terms should be included in the synapses, it seems that they offer both some control on the generalizability of learning and flexibility in functional representation, and may provide a compromise between the complexity of loading and generalizability of learning. However, note that fixing the number of parameters may not totally decide the representational capacity in cases 2 and 3. This is because the flexibility of the distribution of the parameters to the synapses makes the networks able to implement many more functions than fixed parameters at fixed places. It is the flexibility, or degrees of freedom, in function representation of a network, not the number of parameters, that determines the generalization of a network. Determining the function representation capacity of the new networks, or their VC dimension, could be a difficult problem. Such determination is necessary for a complete understanding of the generalization capability of the new networks. Nevertheless, the number of parameters is one of the determining factors of the generalization capacity. Fixing the number of parameters in cases 2 and 3 does offer some control of the generalization capability of the network. In


order to apply cases 2 and 3, efficient algorithms capable of dynamic parameter release and assignment during learning need to be developed. The complexity of such algorithms should be investigated.

Hereafter, it should be understood that the phrase "what and how many quasipolynomial terms, or the number of learnable parameters, are learned" is subject to the limit of the number of parameters allowed or available.

Our model is most related to the product units proposed by Durbin and Rumelhart (1989). In both their product units and our model, the exponents of a presynaptic signal can be learned. The places where products can take place are predetermined in both models. Our model is a further generalization of Durbin and Rumelhart's work. Some differences between our model and networks using sigma-pi units (Rumelhart et al. 1986), product units (Durbin and Rumelhart 1989), or high-order networks (Giles and Maxwell 1987) are summarized below:

1. Nonlinearity is introduced at each synapse. Each synapse has a quasipolynomial transfer function. Higher-order terms of a presynaptic signal can be added by learning. The postsynaptic signal may comprise several quasipolynomial terms of the presynaptic signal instead of just one linear or nonlinear term.

2. The product terms in our model are produced as a result of the product of two or more postsynaptic signals after quasipolynomial synapses. Since a postsynaptic signal may have several quasipolynomial terms of an individual presynaptic signal and these terms are learned, multiplying two or more postsynaptic signals has two consequences: (1) one product contact may produce many cross-product terms of different degrees, and (2) what and how many cross-product terms are present is not totally predetermined, but learned. Although what presynaptic signals can be involved in the quasipolynomial cross-product terms is determined by where the product contacts are "handcrafted", what and how many cross-product terms are actually there is not predetermined. In the work of Durbin and Rumelhart (1989), a product unit produces only one quasipolynomial term whose coefficient and exponents are learned.

3. For case 1, considering that how many learnable parameters are needed is determined by learning, our network is much like the growth networks (Frean 1990; Marchand et al. 1990; Mezard and Nadal 1990; Liang 1990; Yin and Liang 1991). The growth networks in general have poor generalization. There are several mechanisms in the new model that contribute to a better generalization performance than the existing growth networks.

4. Unlike Durbin and Rumelhart (1989), who treat a product as a unit (neuron), we treat a product of postsynaptic signals as a synaptic contact, i.e., where several synapses come into contact before acting on a neuron. The difference is that the neuron transfer function acts uniformly on the product and summation terms in our model.
In a product unit (Durbin and Rumelhart 1989), each product term is transformed by a neuron transfer function before being summed with other signals.

In the remaining part of this paper, mainly case 1 is considered. Algorithms and hardware designs for dynamically assigning and releasing parameters for the more interesting cases 2 and 3 are currently under investigation.

2.1 Quasipolynomial model of nonlinear synapses and its backpropagation learning algorithm

The quasipolynomial synapse model is described below. The transformation at a synapse from neuron i to neuron j [neurons at the lowest level perform a sigmoidal transformation to scale all inputs to within (0, 1), as explained below] is of the form

$$f_{ij} = f_{ij}(y_i) = w'_{ij}\, y_i^{p_{ij}} + \sum_{a=1}^{m_{ij}} w_{aij}\, y_i^{a} \qquad (1)$$

where y_i is the output of neuron i, and m_ij is a nonnegative integer which is updated in learning. The term p_ij is the only exponent that is updated in each iteration during learning. At any time, terms involving each of the integer exponents up to m_ij, in addition to y_i^p_ij, are present. In the beginning, m_ij = 0 and p_ij = 1. During learning, only p_ij changes. When it exceeds 2 for the first time, m_ij is set to 1, and a linear term is permanently added. Later on, whenever an integer value larger than m_ij + 2 is surpassed, a new term with an exponent of m_ij + 1 is introduced. The coefficient of the new term is set to zero when it is first introduced. In our current implementation, during learning and after learning converges, p_ij may be equal to or less than m_ij, since the gradient descent updating rule may decrease as well as increase p_ij. The algorithm could be modified so that when p_ij is reduced below m_ij, terms with exponents higher than p_ij are removed.
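The growth rule just described can be sketched as follows (the function name and parameter packaging are ours; coefficients of new integer-exponent terms start at zero, as stated above):

```python
def maybe_grow_terms(p_ij, m_ij, coeffs):
    # coeffs[a-1] holds the coefficient of the integer-exponent term y^a.
    # Whenever the learned exponent p_ij surpasses m_ij + 2, a new term
    # with exponent m_ij + 1 is added, with its coefficient set to zero.
    while p_ij > m_ij + 2:
        m_ij += 1
        coeffs.append(0.0)
    return m_ij, coeffs

# Initially m_ij = 0; once p_ij exceeds 2 for the first time,
# the linear term is permanently added:
m, c = maybe_grow_terms(2.3, 0, [])
# m == 1, c == [0.0]
```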

Let the sum of the quasipolynomials at a unit be

$$F_j = F_j(f_{1j}, f_{2j}, \ldots, f_{n_j j}) = \sum_{i=1}^{n_j} f_{ij} + c_j \qquad (2)$$

where n_j is the number of inputs coming to j, and c_j is the threshold for the unit. Then the output of neuron j is given as

$$y_j = \frac{1}{1 + \exp(-\lambda F_j)} \qquad (3)$$

where λ is the scaling factor and is assumed to be 1 hereafter, to simplify the presentation. For every neuron j, the learnable parameters are w_aij, w'_ij, p_ij, and c_j for a = 1, 2, ..., m_ij, i = 1, 2, ..., n_j.
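A forward pass through one neuron with quasipolynomial synapses, following (1)-(3), might look like this sketch; the parameter packaging and names are ours:

```python
import math

def forward_neuron(y_inputs, synapses, c_j, lam=1.0):
    # synapses[i] = (w_prime, p, [w_1, ..., w_m]) for the synapse from
    # input i, i.e. w'_ij, p_ij and the integer-exponent coefficients.
    F = c_j
    for y_i, (w_prime, p, ws) in zip(y_inputs, synapses):
        f_ij = w_prime * y_i ** p              # varying-exponent term, Eq. (1)
        for a, w_a in enumerate(ws, start=1):  # permanent integer terms
            f_ij += w_a * y_i ** a
        F += f_ij                              # summation contact, Eq. (2)
    return 1.0 / (1.0 + math.exp(-lam * F))   # sigmoidal output, Eq. (3)

# One input y = 0.5 with w' = 1, p = 1 and no integer terms reduces to
# an ordinary linear-synapse neuron: sigmoid(0.5).
y = forward_neuron([0.5], [(1.0, 1.0, [])], 0.0)
```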

In the remaining part of this section, we will develop the gradient-descent algorithm for training such a network.

2.1.1 Modified backpropagation algorithm for quasipolynomial synapses. The gradient-descent formulation and associated computational arrangements for networks with quasipolynomial synapses are presented below. The procedure is similar to that of the standard backpropagation algorithm (Rumelhart et al. 1986).

Define the learning error to be minimized as

$$E = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{M} [y_j(i) - d_j(i)]^2 \qquad (4)$$

where d_j is the desired output of unit j, N is the number of training samples, and M is the number of output units.

Similar to Rumelhart et al. (1986), we remove one summation in the derivation for simplicity:

$$E = \frac{1}{2} \sum_{j=1}^{M} (y_j - d_j)^2 \qquad (5)$$

Suppose neuron k is at the output layer. The partial derivatives of the error E with respect to every learnable parameter associated with neuron k are first computed. If m_jk > 0, we have

= ~ \ d F k ] kdf~k] \dWajk/ (6)

The first term on the right-hand side is the partial derivative of the error E with respect to the output of neuron k, y_k, and is given as

$$\frac{\partial E}{\partial y_k} = (y_k - d_k) \qquad (7)$$

The second term is the derivative of y_k with respect to the neuron transfer function F_k, and is of the form

$$\frac{\partial y_k}{\partial F_k} = (1 - y_k)\, y_k \qquad (8)$$

The third term is the partial derivative of the neuron transfer function F_k with respect to the postsynaptic signal f_jk:

$$\frac{\partial F_k}{\partial f_{jk}} = 1 \qquad (9)$$

Finally, the partial derivative of the postsynaptic signal with respect to the weight is a power of the presynaptic signal, which is either the output of some other neuron or an input from the outside world:

$$\frac{\partial f_{jk}}{\partial w_{ajk}} = y_j^a, \quad \text{if } m_{jk} > 0,\ a = 1, \ldots, m_{jk} \qquad (10)$$

Therefore, we have

$$\frac{\partial E}{\partial w_{ajk}} = (y_k - d_k)(1 - y_k)\, y_k\, y_j^a, \quad \text{if } m_{jk} > 0 \qquad (11)$$

Similarly, we also have the following partial derivatives for the remaining three learnable parameters.

$$\frac{\partial E}{\partial w'_{jk}} = \frac{\partial E}{\partial f_{jk}} \frac{\partial f_{jk}}{\partial w'_{jk}} = (y_k - d_k)(1 - y_k)\, y_k\, y_j^{p_{jk}} \qquad (12)$$

$$\frac{\partial E}{\partial p_{jk}} = \frac{\partial E}{\partial f_{jk}} \frac{\partial f_{jk}}{\partial p_{jk}} = (y_k - d_k)(1 - y_k)\, y_k\, w'_{jk}\, y_j^{p_{jk}} \ln(y_j) \qquad (13)$$

$$\frac{\partial E}{\partial c_k} = \frac{\partial E}{\partial F_k} \frac{\partial F_k}{\partial c_k} = (y_k - d_k)(1 - y_k)\, y_k \qquad (14)$$
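The output-layer derivatives (11)-(14) share the common factor (y_k - d_k)(1 - y_k)y_k, which the following sketch exploits; the function and variable names are ours:

```python
import math

def output_layer_grads(y_k, d_k, y_j, w_prime_jk, p_jk):
    # delta is the common factor (y_k - d_k)(1 - y_k) y_k of Eqs. (11)-(14).
    delta = (y_k - d_k) * (1.0 - y_k) * y_k
    dE_dwprime = delta * y_j ** p_jk                          # Eq. (12)
    dE_dp = delta * w_prime_jk * y_j ** p_jk * math.log(y_j)  # Eq. (13); needs y_j > 0
    dE_dc = delta                                             # Eq. (14)
    return dE_dwprime, dE_dp, dE_dc
```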


Note that the term ln(y_j) in the derivatives requires that y_j > 0. In intermediate layers, this does not cause a problem since 0 < y_j < 1. At the input layer, this is taken care of by scaling the input to (0, 1). In the product units, Durbin and Rumelhart (1989) allow negative numbers and ignore the imaginary part of the logarithm of negative numbers. The effects of the two different treatments have not been investigated.

Next, partial derivatives of the error E with respect to the learnable parameters of neurons at the next layer are computed. Let neuron j be a neuron at this layer. We start with the weights again, assuming m_ij > 0.

$$\frac{\partial E}{\partial w_{aij}} = \frac{\partial E}{\partial y_j} \frac{\partial y_j}{\partial F_j} \frac{\partial F_j}{\partial f_{ij}} \frac{\partial f_{ij}}{\partial w_{aij}} \qquad (15)$$

We can find the individual derivatives as follows:

$$\frac{\partial E}{\partial y_j} = \sum_k \frac{\partial E}{\partial F_k} \frac{\partial F_k}{\partial y_j} = \sum_k \frac{\partial E}{\partial F_k} \left( \sum_{a=1}^{m_{jk}} a\, w_{ajk}\, y_j^{a-1} + w'_{jk}\, p_{jk}\, y_j^{p_{jk}-1} \right) \qquad (16)$$

where ∂E/∂F_k has been computed in the layer above and can be propagated backward from the higher layer.

The rest of the partial derivatives can be derived in much the same way as we did for node k.

$$\frac{\partial y_j}{\partial F_j} = (1 - y_j)\, y_j \qquad (17)$$

$$\frac{\partial F_j}{\partial f_{ij}} = 1 \qquad (18)$$

$$\frac{\partial f_{ij}}{\partial w_{aij}} = y_i^a \qquad (19)$$

Substituting the above derivatives in (15) yields,

$$\frac{\partial E}{\partial w_{aij}} = \sum_k \frac{\partial E}{\partial F_k} \left( \sum_{a=1}^{m_{jk}} a\, w_{ajk}\, y_j^{a-1} + w'_{jk}\, p_{jk}\, y_j^{p_{jk}-1} \right) (1 - y_j)\, y_j\, y_i^a \qquad (20)$$

Similarly, the partial derivatives of E with respect to the other learnable parameters are derived as follows:

$$\frac{\partial E}{\partial w'_{ij}} = \frac{\partial E}{\partial f_{ij}} \frac{\partial f_{ij}}{\partial w'_{ij}} = \frac{\partial E}{\partial f_{ij}}\, y_i^{p_{ij}} \qquad (21)$$

$$\frac{\partial E}{\partial p_{ij}} = \frac{\partial E}{\partial f_{ij}} \frac{\partial f_{ij}}{\partial p_{ij}} = \frac{\partial E}{\partial f_{ij}}\, w'_{ij}\, y_i^{p_{ij}} \ln(y_i) \qquad (22)$$

$$\frac{\partial E}{\partial c_j} = \frac{\partial E}{\partial F_j} \frac{\partial F_j}{\partial c_j} = \frac{\partial E}{\partial F_j} \qquad (23)$$
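The backpropagated error of (16) can be sketched as follows; the packaging of the synapse parameters and the function name are ours:

```python
def dE_dy_j(dE_dF_list, synapses, y_j):
    # Eq. (16): sum over units k in the layer above of dE/dF_k times
    # the derivative of the postsynaptic signal f_jk with respect to y_j.
    # synapses[k] = (w_prime_jk, p_jk, [w_1jk, ..., w_mjk]).
    total = 0.0
    for dE_dFk, (w_prime, p, ws) in zip(dE_dF_list, synapses):
        df_dy = w_prime * p * y_j ** (p - 1.0)
        for a, w_a in enumerate(ws, start=1):
            df_dy += a * w_a * y_j ** (a - 1)
        total += dE_dFk * df_dy
    return total
```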

The above process is then repeated for all the other layers down to the input layer. In this way, the partial derivatives with respect to all the learnable parameters in the network are computed. After that, the learnable parameters are updated using the gradient descent rule. Let


p denote all the learnable parameters organized in a vector form. Then the partial derivatives give the gradient ∇p. The learnable parameters p are updated according to

$$p(t+1) = p(t) - \varepsilon\, \nabla p(t) \qquad (24)$$

where ε is the step size. Many variations of the gradient descent rule can be applied as well in the learning. For example, the momentum method can be used, yielding the following learning rule:

$$p(t+1) = p(t) - \varepsilon\, [\nabla p(t) + \alpha \nabla p(t-1)] \qquad (25)$$

where 0 < α < 1 is an exponentially decaying factor.
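The update rules (24) and (25) amount to a few lines; here the parameter vector is represented as a plain Python list, which is our choice for the sketch:

```python
def momentum_step(p, grad, grad_prev, eps=0.1, alpha=0.9):
    # Eq. (25); with grad_prev all zeros (or alpha = 0) this reduces
    # to the plain gradient descent rule of Eq. (24).
    return [pi - eps * (gi + alpha * gpi)
            for pi, gi, gpi in zip(p, grad, grad_prev)]

p_new = momentum_step([1.0], [0.5], [0.0], eps=0.1)
# p_new == [1.0 - 0.1 * 0.5] == [0.95]
```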

2.1.2 Scaling of the input. Each external input is connected to a unit at the first layer which scales the input to within (0, 1) for each dimension. This is done because of several considerations. First, when computing the partial derivative with respect to the varying exponent of an input, a logarithm of the input is required. Since the logarithm of a negative value yields a complex number, we choose to scale the input to avoid complex numbers. Secondly, in many problems, connections from the inputs directly to the intermediate layers are used to accelerate the learning. This is because the error signal is attenuated each time it passes through a layer. As a result, if there is no direct connection between inputs and the higher layers, learning is slowed due to the slow adaptation of the lower layers. The direct connections provide a larger error signal to the higher layers, thus accelerating the convergence. The outputs of the intermediate neurons are always within (0, 1). In a network with direct connections, if the inputs are not scaled, they could be much larger in magnitude than 1. Higher-order terms of the inputs could then overpower the influence of the outputs of the intermediate neurons. The outputs of the intermediate neurons have already gone through a few layers of transformation and should be given at least equal importance as direct inputs. Therefore, scaling of the inputs to within (0, 1) seems to be a reasonable choice. Thirdly, as a side product, scaling the inputs to within (0, 1) has the desirable effect of limiting the growth of the number of terms of the quasipolynomial at a synapse. This is because higher-order terms of a variable within (0, 1) have a smaller influence than lower-order terms. This becomes a built-in mechanism to prevent the algorithm from producing impractically high-degree quasipolynomials.

There are two possible ways to scale the inputs to within (0, 1). In both cases, the relative relations between the samples are not affected. In the first case, each input x_i is connected to a neuron to produce an output y_i:

$$y_i = \frac{1}{1 + \exp[-\lambda(x_i - \bar{x}_i)]} \qquad (26)$$

where $\bar{x}_i$ is an estimate of the average of feature $x_i$ over all samples. It is known that a good estimate of the average of the samples can accelerate learning. The scaling factor $\lambda$ should be chosen so that most samples fall within the almost linear segment of the sigmoidal function, i.e., so that they have an output in the range of, say, [0.15, 0.85].
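As a sketch, the sigmoidal scaling of (26) might look as follows. The `choose_lambda` heuristic is our illustration of the "almost linear segment" guideline, not a rule prescribed by the paper:

```python
import math

def scale_input(x, x_mean, lam):
    """Sigmoidal scaling of one input feature to (0, 1), as in (26).
    x_mean is an estimate of the feature's average over all samples;
    lam is the scaling factor lambda."""
    return 1.0 / (1.0 + math.exp(-lam * (x - x_mean)))

def choose_lambda(span):
    """Hedged heuristic (ours, not the paper's): pick lambda so a sample
    half a span away from the mean lands at sigmoid output 0.85, keeping
    most samples inside the near-linear segment [0.15, 0.85].
    sigmoid(z) = 0.85  =>  z = ln(0.85/0.15)."""
    return math.log(0.85 / 0.15) / (span / 2.0)

lam = choose_lambda(span=4.0)            # samples spread over width ~4
y = scale_input(1.0, x_mean=0.0, lam=lam)
assert 0.15 < y < 0.85                   # inside the near-linear segment
```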

If the number of samples is finite and an upper estimate of the maximum span of the samples along each dimension is known, another type of scaling can be applied. In this case, all the samples are first shifted to the positive half space. Then, each component of the sample is divided by the upper estimate of the maximum span along that dimension.
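The second, span-based scaling can be sketched as follows (function and variable names, and the sample data, are illustrative):

```python
def scale_by_span(samples, span_upper):
    """Second scaling method: shift all samples into the positive half
    space, then divide each component by an upper estimate of the
    maximum span along that dimension. span_upper[d] must be >= the
    true span for the result to stay within [0, 1)."""
    dims = len(samples[0])
    mins = [min(s[d] for s in samples) for d in range(dims)]
    return [[(s[d] - mins[d]) / span_upper[d] for d in range(dims)]
            for s in samples]

samples = [[-2.0, 1.0], [2.0, 3.0], [0.0, 2.0]]
scaled = scale_by_span(samples, span_upper=[5.0, 3.0])
# with a correct upper estimate, every component stays in [0, 1)
assert all(0.0 <= v < 1.0 for row in scaled for v in row)
```

As the text notes, if `span_upper` underestimates the true span, components larger than 1 can survive the scaling.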

The first scaling method is more widely applicable since only a rough estimate of the mean is required, whereas in the second case, if the estimate of the maximum span is much smaller than the actual one, or the translation to change all the negative components to positive is not proper, the scaled inputs may still contain components that are negative and/or larger than 1.

There are some undesirable side effects of scaling. First, because we want to limit all signals to be within (0, 1) in our implementation, negative exponents are not allowed. The current framework could actually realize synapses with negative exponents, thereby allowing both zeros and poles at each synapse. Since negative exponents of signals in (0, 1) could yield extremely large values, allowing them may lead to a fast increase in the magnitude of the negative exponents. Recall that positive exponents are inherently prevented from growing too large because higher-order terms of a signal in (0, 1) have less influence than lower-order ones. The second problem with scaling is that it may lead to slower learning. This is because all samples are condensed in the unit hypercube, and the differences between samples are small.

The above limitations can be removed if the input is not scaled but only shifted to the positive half space. The problem then may be a fast increase in the magnitude of the exponents, and hence of the terms in the quasipolynomials. This needs to be further investigated.

2.2 Product synaptic contacts and the generalized backpropagation algorithm

Neurophysiological data suggest that the interaction between postsynaptic signals is more than just a linear summation. In real neural networks, there are multiple synaptic connections between two neurons. Such multiple connections do not exist in neural network models using linear synapses and summation contacts. This is because all synapses are linear, and if there are multiple connections, their effects are equally well modeled by a single synapse with its weight equal to the sum of all the connection weights. These disagreements between linear synapse neural network models and real neural networks prompted us to investigate nonlinear synaptic contacts and the effects of multiple synaptic connections between two neurons.

Following the product units (Durbin and Rumelhart 1989), we modify the linear summation neuron model by


allowing both summation and product of postsynaptic signals, that is, (2) is modified to

$$F_j = F_j(f_{1j}, f_{2j}, \ldots, f_{n_j j}) = \sum_{i=1}^{m_j} \prod_{k=1}^{k_i} f_{kj} + c_j \qquad (27)$$

where $m_j$ is the total number of product terms, and $k_i$ is the number of postsynaptic signals involved in the $i$th product term. Note that if $k_i = 1$ for a given $i$, the corresponding product term becomes a simple summation term. Recall that a product of postsynaptic signals is modeled to take place at a synaptic contact, i.e., where several synapses come into contact before acting on a neuron, rather than at a neuron. Note that some of the postsynaptic signals in the summation term and the product term may come from the same neuron. Also, some of the postsynaptic signals may have both a summation contact and one or more product contacts with a neuron, thus appearing in both the summation and product terms. The sigmoidal transfer function of a neuron is then applied to $F_j$, defined in the above equation.
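A minimal sketch of computing the net input (27), treating each term as a list of postsynaptic signals (this data layout is our assumption, not the paper's):

```python
def neuron_net_input(terms, c_j):
    """Net input F_j of (27): a sum of product terms, each term a product
    of one or more postsynaptic signals f_kj. A term containing a single
    signal (k_i = 1) degenerates to an ordinary summation contact."""
    total = c_j
    for term in terms:            # term: list of postsynaptic signal values
        prod = 1.0
        for f in term:
            prod *= f
        total += prod
    return total

# Example matching (29): F_j = f1 + f2 + f2*f3, with c_j = 0
f1, f2, f3 = 0.2, 0.5, 0.4
F = neuron_net_input([[f1], [f2], [f2, f3]], c_j=0.0)
assert abs(F - (0.2 + 0.5 + 0.5 * 0.4)) < 1e-12
```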

A postsynaptic signal that contributes a summation term in (27) is said to have a linear or summation synaptic contact, or summation contact for short, with neuron j. Similarly, a postsynaptic signal that contributes to a product term in (27) is said to have a nonlinear or product synaptic contact, or product contact for short, with neuron j. Note that product terms in our model are produced as a result of the product of two or more postsynaptic signals after quasipolynomial synapses. This has two consequences: first, one product contact may produce many product terms of different degrees; second, what and how many product terms are present is not predetermined, but learned.

From a computational point of view, allowing both summation and product contacts provides a network with the freedom of adding independent cross-product terms in the quasipolynomial $F_j$. The gradient descent formulation is generalized to include the product contacts.

2.2.1 Learning algorithm for networks with both summation and product contacts. The propagation of error from the output layer down to the input layer in a network with product contacts is the same as in a network with only linear contacts. The only difference is in finding the partial derivatives of the product terms. In the following, the partial derivatives of the error with respect to all the learnable parameters associated with a neuron that has both summation and product contacts are derived. To simplify the presentation, we assume that there are three postsynaptic signals $f_{1j}$, $f_{2j}$, and $f_{3j}$ to neuron j, where

$$f_{ij} = f_{ij}(y_i) = w'_{ij}\, y_i^{p_{ij}} + \sum_{a=1}^{m_{ij}} w_{aij}\, y_i^{a} + c_{ij} \qquad (28)$$

where $m_{ij}$ is the same as defined in (1). Without loss of generality, assume that there are two summation contacts and one product contact, as given below:

$$F_j = F(f_{1j}, f_{2j}, f_{3j}) = f_{1j} + f_{2j} + f_{2j} f_{3j} \qquad (29)$$

Note that it is possible that the same synapse may have both a linear and a product contact with a neuron, as shown in (29). The output of neuron j is then given as

$$y_j = \frac{1}{1 + \exp(-\lambda F_j)} \qquad (30)$$

The learnable parameters associated with neuron j are $w_{aij}$, $w'_{ij}$, $p_{ij}$, and $c_{ij}$, for $i = 1, 2, 3$.
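The forward pass through (28)-(30) for this three-synapse example can be sketched as follows; all numeric values are illustrative, not learned weights from the paper:

```python
import math

def synapse_output(y, w_prime, p, ws, c):
    """Quasipolynomial synapse of (28):
    f_ij = w'_ij * y^p_ij + sum_{a=1..m_ij} w_aij * y^a + c_ij,
    with ws holding the coefficients for integer powers a = 1..m_ij."""
    return w_prime * y ** p + sum(w * y ** (a + 1) for a, w in enumerate(ws)) + c

def neuron_output(f1, f2, f3, lam):
    """Neuron j of (29)-(30): F_j = f1 + f2 + f2*f3, y_j = sigmoid(lam*F_j)."""
    F = f1 + f2 + f2 * f3
    return 1.0 / (1.0 + math.exp(-lam * F))

# Illustrative parameter values (our assumption):
f1 = synapse_output(0.6, w_prime=0.8, p=1.7, ws=[0.3], c=0.0)
f2 = synapse_output(0.4, w_prime=-0.5, p=2.3, ws=[0.2], c=0.1)
f3 = synapse_output(0.9, w_prime=1.1, p=0.9, ws=[], c=0.0)
y_j = neuron_output(f1, f2, f3, lam=1.0)
assert 0.0 < y_j < 1.0   # sigmoid keeps the output within (0, 1)
```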

The partial derivatives for the parameters associated with $f_{1j}$ can be found in exactly the same way as in Sect. 2.1.1 for summation contacts. Only the partial derivatives involving the signals in product contacts need to be reconsidered. The only difference is in $\partial F_j/\partial f_{ij}$. Instead of being equal to 1, $\partial F_j/\partial f_{2j}$ and $\partial F_j/\partial f_{3j}$ are

$$\frac{\partial F_j}{\partial f_{2j}} = 1 + f_{3j}, \qquad \frac{\partial F_j}{\partial f_{3j}} = f_{2j} \qquad (31)$$

The partial derivatives with respect to all the learnable parameters associated with $f_{2j}$ and $f_{3j}$ can then be found by substituting the above equation into the corresponding partial derivative expressions in Sect. 2.1.1.
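The partials in (31) can be checked numerically with finite differences; this verification is ours, not part of the paper's derivation:

```python
def F(f1, f2, f3):
    """F_j of (29): two summation contacts plus one product contact."""
    return f1 + f2 + f2 * f3

def grads(f1, f2, f3):
    """Analytic partials: dF/df1 = 1 (pure summation contact),
    and from (31): dF/df2 = 1 + f3, dF/df3 = f2."""
    return 1.0, 1.0 + f3, f2

# Sanity check against central finite differences:
f1, f2, f3, eps = 0.3, 0.7, 0.2, 1e-6
g1, g2, g3 = grads(f1, f2, f3)
num2 = (F(f1, f2 + eps, f3) - F(f1, f2 - eps, f3)) / (2 * eps)
num3 = (F(f1, f2, f3 + eps) - F(f1, f2, f3 - eps)) / (2 * eps)
assert abs(g2 - num2) < 1e-6 and abs(g3 - num3) < 1e-6
```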

3 Generalization of learning and hardware considerations

Cases 1, 2, and 3 are considered separately.

3.1 Relation between quasipolynomial synapse networks and linear synapse growth networks

Case 1 (including cases 2 and 3 when the number of parameters available is much larger than that required by the task): from the viewpoint of increasing the function representation capacity, increasing the number of terms of the quasipolynomials at the synapses of a nonlinear synapse network is equivalent to increasing the size of a linear synapse network. In the literature there have been several algorithms that grow a network for a given classification task (Frean 1990; Liang 1990; Marchand et al. 1990; Mezard and Nadal 1990; Yin and Liang 1991). We refer to this type of network (algorithm) as a growth network (algorithm). The idea in these algorithms is to learn the structure, i.e., the graph of the network, and the weights simultaneously. If the training samples cannot be learned within a specified length of time, the growth algorithms add more neurons and connections. Growth algorithms are guaranteed to converge. The problem is that the algorithm may grow an overly large network and may lead to poor generalization. We claim that increasing the nonlinearity of synapses should yield better generalization than adding neurons and connections. Our claim is supported by experimental results. This can be attributed to the following reasons:

1. Most growth algorithms use a hard-limit threshold unit. The network in this paper uses sigmoidal functions at neurons. Information is not lost as a result of early immature binary decisions from layer to layer, as in the


growth algorithms (Sethi 1990). Therefore, fewer parameters may be required than when using hard-limit units [see the analog factor in Abu-Mostafa (1989)]. Of course, growth algorithms using units with a sigmoidal transfer function could be developed as well.
2. In our network, the inputs are scaled to (0, 1) at the first layer. As a result, higher-order terms have less influence because all signals have a magnitude less than one. This effectively prevents the degrees of the quasipolynomials from increasing to impractically high values. Hence, the total number of parameters in the network is prevented from growing too large. This limiting capability is inherent and is not of a hard-limit type (i.e., it does not prespecify a fixed degree for the synapses' quasipolynomials).
3. Growth algorithms may end up adding one neuron (and the corresponding connections) for a single training sample. In general, growth algorithms use only the training samples that cannot be loaded by existing units to train newly added units and connections. We call this localized learning. Localized learning normally yields faster learning, but at the price of generalization. In our algorithm, the number of quasipolynomial terms and all the added parameters of the quasipolynomial of a synapse are affected by all training samples. The exponents may go up or down, depending on the direction of the gradient. This is global learning, as opposed to the localized learning of the growth algorithms.
4. Terms with small coefficients in the quasipolynomials can be deleted during learning to reduce the number of parameters. Elimination of units or connections during learning may be difficult in the localized-learning type of growth algorithms.

Case 1, and cases 2 and 3 when the number of parameters available is much larger than that required by the task, may occur when the network is intended as a general system for learning complicated functions but is applied to a simple function. Even with the above mechanisms, the generalization capability may still suffer, since more parameters than necessary are used. To avoid this, we can limit the number of parameters to suit a given task. This leads us to cases 2 and 3. Recall that for cases 2 and 3, although the number of parameters is fixed, what quasipolynomial terms are present, and to some extent how many are present, is still determined by learning.

3.2 Limiting the number of parameters

In addition to the mechanisms listed in Sect. 3.1 for case 1, the generalization capability for cases 2 and 3 can be further controlled by limiting the total number of parameters. This is especially true for hardware implementation, since the total number of parameters available in a hardware implementation is always limited. Case 2 is more flexible than case 3. The price is more overhead in managing the distribution of parameters to synapses.

It is observed in experiments that some terms in the learned quasipolynomials may have small coefficients.

These terms can be deleted during learning. In this way, if the number of parameters at a synapse is fixed, a higher-degree quasipolynomial can be realized without increasing the number of terms. This is important in cases 2 and 3, since the maximum number of parameters available to synapses is limited.
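A sketch of the term-deletion step; the threshold value, the (coefficient, exponent) data layout, and the small illustrative term are our assumptions:

```python
def prune_terms(terms, threshold):
    """Delete quasipolynomial terms whose coefficient magnitude falls
    below `threshold`, releasing those parameters for reuse by
    higher-degree terms. Each term is a (coefficient, exponent) pair."""
    return [(w, p) for (w, p) in terms if abs(w) >= threshold]

# First three terms echo the learned synapse 1 of pattern 1; the
# 0.003-coefficient term is a made-up example of a prunable term.
terms = [(-6.830, 4.728), (0.003, 2.0), (-1.807, 3.0), (6.386, 2.0)]
kept = prune_terms(terms, threshold=0.01)
assert len(kept) == 3 and (0.003, 2.0) not in kept
```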

Hardware that recursively computes the quasipolynomial at a synapse is easy to design. Using such a design to implement the quasipolynomial synapses makes case 3 an alternative way to realize highly nonlinear functions with fewer connections but more complex synapses. For case 2, a hardware implementation scheme is required that dynamically distributes the learnable parameters to the quasipolynomial synapses during learning. Synapses requiring more learnable parameters (exponents and coefficients for a higher-degree quasipolynomial) will get more. Synapses may eliminate terms with small coefficients to release learnable parameters they grabbed but no longer need. This is effectively a network which learns both its parameters and its structure (i.e., the distribution of the learnable parameters), although it may appear to have a fixed graph. Detailed discussions of hardware designs for cases 2 and 3 are the topic of another paper.

4 Experimental results

Networks of various structures with quasipolynomial synapses and linear contacts, and networks with quasipolynomial synapses and both summation and product contacts, are trained using the gradient descent algorithms described in this paper. The learning algorithms and the networks are tested on many examples. Some of the test results are presented below and are compared with networks with linear synapses and summation-only contacts. Only experiments on case 1 are reported below.

In the following discussion, one iteration means the presentation of one training sample to the learning algorithm. If a network is able to learn the correct classification of the training samples, or in other words, to load the training samples, in fewer than 1 000 000 iterations, we say that it converges. Otherwise, the network is considered unable to load the training samples.

Figure 2 gives eight examples of the two-dimensional training patterns used in training the networks described below. They will be referred to as patterns 1 to 8. Each pattern is a sampling of points on a regular 10 × 10 square grid in (0, 1) × (0, 1). Each training point in Fig. 2 is a 2 × 1 vector of the coordinates of the point in (0, 1) × (0, 1). A point in Fig. 2 is labeled 1 if it belongs to class 1, and labeled 0 otherwise. Points not labeled are not used in the training. The first example shows how much more powerful a single neuron can be when using quasipolynomial synapses. A single neuron with quasipolynomial synapses and summation-only contacts as depicted in Fig. 3 (network 1) is able to load training

171

[Figure 2: eight 10 × 10 grids of points labeled 1 (class 1) or 0 (class 0); unlabeled grid positions are unused in training.]

Fig. 2. Training samples of patterns (1) to (8)

Fig. 3. Network 1: a single neuron with quasipolynomial synapses

patterns 1, 4, 5, and 6. For pattern 1, the synaptic transformations learned are

Synapse 1: $-6.830x_1^{4.728} - 1.807x_1^{3} + 6.386x_1^{2}$

Synapse 2: $-0.849x_2^{\ldots}$; bias: $-1.310$

As is obvious, a single neuron with only linear synapses cannot learn any one of the eight patterns, since they are not linearly separable. Figure 4 shows a single neuron with quasipolynomial synapses and both summation and product contacts (network 2). The contacts with synapses 1 and 2 are of the summation type, and the contacts with synapses 3 and 4 are of the product type. The network is able to load training patterns 1, 2, 4, 5, and 6. The synaptic transformations learned are omitted due to space limits. The highest exponent in the synapse


Fig. 4. Network 2: a single neuron with four quasipolynomial synapses and a product contact

quasipolynomials is 4.896, and the synapse with the highest-degree quasipolynomial has five terms. Network 3 in Fig. 5 is derived from network 2 by removing one quasipolynomial synapse. Network 3 is able to load training patterns 1, 4, 5, 6, and 7. The multilayer network in Fig. 6 (network 4, with quasipolynomial synapses and summation contacts) is able to load training patterns 1, 3, 4, 5, 6, and 8. The number of iterations for convergence for networks 1 to 4 is shown in Table 1. Also shown in Table 1 are the results of training the linear synapse network (network 5 in Fig. 7) with the same structure as network 4. In comparison with network 4, the multilayer linear synapse network, network 5, is also able to load training patterns 1, 4, 5, and 6, but is unable to load training patterns 3 and 8, i.e., it does not converge in 1 000 000 iterations. Figure 8 shows examples of the classification functions learned by the nonlinear synapse networks. Each panel in Fig. 8 is produced by


using all the 10 × 10 regularly spaced sampling points in (0, 1) × (0, 1) (including training points) as testing samples. Points classified to class 1 are labeled 1, and points classified to class 0 are labeled 0.
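Generating the 10 × 10 test grid used to plot the learned classification functions might look as follows; the exact grid offsets within (0, 1) are an assumption:

```python
def grid_points(n=10):
    """Regularly spaced n x n test points in (0,1) x (0,1), as used to
    produce the classification maps of Fig. 8. Points are placed at
    cell centers (the paper's precise offsets are our assumption)."""
    step = 1.0 / n
    return [((i + 0.5) * step, (j + 0.5) * step)
            for i in range(n) for j in range(n)]

pts = grid_points()
assert len(pts) == 100
assert all(0.0 < x < 1.0 and 0.0 < y < 1.0 for x, y in pts)
```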

The next five training patterns are classifications in a four-dimensional space. The examples are generated by


Fig. 5. Network 3: a single neuron with three quasipolynomial synapses and a product contact

a multidimensional polynomial function. The polynomials for patterns 9 to 13 are

$$P_9{:}\quad x_1^2 + x_2^2 + x_3^2 + x_4 - 3 = 0 \qquad (32)$$

$$P_{10}{:}\quad -2x_1x_4^2 + x_3 + 2x_1x_3 + x_2 - 2 = 0 \qquad (33)$$

$$P_{11}{:}\quad x_1^3 + x_2^3 + x_3^3 + x_4^3 - 3 = 0 \qquad (34)$$

$$P_{12}{:}\quad x_1^2x_2^2 - x_2x_3 + x_3 + 2x_2x_3x_4 - 2x_1 + 3 = 0 \qquad (35)$$

$$P_{13}{:}\quad -2x_1x_3x_4 + x_1^2x_2^4 - x_1x_2 + 2x_3x_4^2 - 3x_1x_2x_3x_4 - 3 = 0 \qquad (36)$$

A total of 225 or so training samples are randomly generated from the integer points in the $[-2, 2]^4$ cube. Samples on the negative side of the surface determined by a polynomial $P_i$ belong to one class, and samples on the positive side belong to another class. A two-layer network (not counting the input layer), shown in Fig. 9, is used to learn $P_9$, $P_{10}$, $P_{11}$, and $P_{12}$. Three-layer networks

Fig. 6. Network 4: a two-layer quasipolynomial synapse network
Fig. 7. Network 5: a two-layer linear synapse network having the same graph as network 4

Table 1. Two-dimensional experimental results

Pattern  Number of iterations to convergence
No.      Network 1       Network 2       Network 3       Network 4       Network 5
1        20 511          5 284           4 426           2 944           51 720
2        Not convergent  101 465         Not convergent  Not convergent  Not convergent
3        Not convergent  Not convergent  Not convergent  10 073          Not convergent
4        140 167         21 234          17 048          6 762           10 647
5        92 350          6 452           4 489           6 230           27 998
6        634 786         244 626         108 286         38 436          62 736
7        Not convergent  Not convergent  241 930         Not convergent  Not convergent
8        Not convergent  Not convergent  Not convergent  40 513          Not convergent

of similar structure (both a nonlinear synapse and summation-contact network, and a linear synapse network) are used to learn $P_{13}$.
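The sample generation for the four-dimensional patterns can be sketched as follows, using the cubic polynomial recoverable from (34) for $P_{11}$. The random-sampling details and the labeling of points lying exactly on the surface are our assumptions:

```python
import itertools
import random

def generate_samples(poly, n=225, seed=0):
    """Draw ~n training samples at random from the integer points of
    [-2, 2]^4 and label each by the sign of the generating polynomial,
    as done for patterns 9-13. Points with poly == 0 are grouped with
    the negative side here (an assumption; the paper does not say)."""
    rng = random.Random(seed)
    points = list(itertools.product(range(-2, 3), repeat=4))  # 5^4 = 625
    chosen = rng.sample(points, n)
    return [(p, 1 if poly(*p) > 0 else 0) for p in chosen]

# P_11 of (34): x1^3 + x2^3 + x3^3 + x4^3 - 3 = 0
p11 = lambda x1, x2, x3, x4: x1**3 + x2**3 + x3**3 + x4**3 - 3
samples = generate_samples(p11)
assert len(samples) == 225
assert {label for _, label in samples} <= {0, 1}
```

The remaining integer points of the cube (about 400) then serve as the generalization test set, as described for Table 2.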

Table 2 shows the number of iterations for convergence for both the multilayer nonlinear and linear synapse networks. The linear synapse networks with the same structure are unable to converge for patterns 10, 11, and 12 in 1 000 000 iterations. The nonlinear synapse networks are able to load all five training patterns. The learned synapse functions are not shown due to space limitations. The highest exponents of the quasipolynomials are 4.027 for pattern 9, 6.014 for pattern 10, 6.891 for pattern 11, 6.630 for pattern 12, and 7.848 for pattern 13. Some lower-order terms have small coefficients and can be deleted; therefore, not all lower-order quasipolynomial terms are needed at synapses with high degrees. For patterns 9 and 11, most of the synapses have only one term, with an exponent in the interval (1, 2). Note that although the nonlinear synapse networks have the same graph as the linear synapse networks, after convergence the former use more parameters than the latter.

Also shown in Table 2 are the results of the generalization test, i.e., testing of the learned function with newly generated samples. The testing samples are the integer points in the $[-2, 2]^4$ cube minus the 225 or so training samples. The class of the new samples is determined using the same functions in (32)-(36) that generated the training patterns. There are about 400 testing samples in each case. The first number in the generalization column is the number of test samples correctly classified, and the second number is the total number of testing samples (excluding training samples). For example, the trained nonlinear synapse network correctly


Fig. 9. A two-layer four-input quasipolynomial synapse network. All input units are directly connected to the output unit

[Figure 8: 10 × 10 grids of 0/1 class labels over (0, 1) × (0, 1), one grid per learned classification function.]

Fig. 8. Typical classification functions learned by networks 1 to 5 from the training samples in Fig. 2

Table 2. Four-dimensional experimental results

Pattern  Training     Nonlinear synapse network    Linear synapse network
No.      sample No.   Iterations  Generalization   Iterations      Generalization
9        228          891         383/397          5 296           362/397
10       229          7 350       329/396          Not convergent  -
11       221          1 903       367/404          Not convergent  -
12       226          3 745       343/399          Not convergent  -
13       226          12 631      330/399          28 723          325/399


classifies over 96% of the testing samples in the best case, and over 83% in the worst case.

Note that when both the linear synapse network and the quasipolynomial synapse network converge, the number of iterations required by the nonlinear synapse network is much smaller, sometimes by several orders of magnitude. Note that for a quasipolynomial synapse network, each synapse has more parameters to be updated than in a linear synapse network. These computations are done serially in the simulation. In a hardware implementation, all parameters at a synapse can be updated at the same time.

Of course, in all the above examples where a linear synapse network fails to converge, a larger linear synapse network may be able to learn the same function. The problem is that one may not know what size network to start with in order to get convergence, and how to "hand-craft" product units or high-order units to speed up convergence.

5 Conclusion and discussion

A neural network model has been proposed that introduces nonlinearity at synapses and makes the coefficients, the exponents, and the number of terms of the quasipolynomials at synapses learnable, subject to the limit on the number of parameters allowed. This can be useful when no a priori knowledge is available to determine what higher-order terms should be present. Since the number of learnable parameters is itself learned in case 1, the new model in case 1 is much like the growth networks (Frean 1990; Liang 1990; Marchand et al. 1990; Mezard and Nadal 1990; Yin and Liang 1991), although the graph of the network is fixed. It is known that a network with a fixed number of parameters may not be able to learn some functions, or may require an exponential amount of time to do so (Blum and Rivest 1988; Judd 1990). A network that not only learns the parameters, but also learns the number of parameters (or the structure) of the network is desirable. Several mechanisms are designed to prevent the network from growing too many parameters, for the purpose of valid generalization.

Even when the number of parameters allowed or available is fixed, as in cases 2 and 3, there is still flexibility in learning what quasipolynomial terms, and to some extent (in case 2) how many terms, should be present. This is achieved by deleting terms with small coefficients, thereby dynamically assigning and releasing the parameters for quasipolynomial terms of different degrees. This flexibility in representation is gained without increasing the number of parameters in the network. Therefore, cases 2 and 3 may offer both some control on the generalizability of learning and flexibility in functional representation, and may provide a compromise between the complexity of loading and the generalizability of learning. In case 3, this is achieved with a fixed graph. Note that fixing the number of parameters may not totally decide the representational capacity in cases 2 and 3. This is because the flexibility of the parameters makes the networks able to implement many more functions than a

fixed number of parameters at fixed places. It is the flexibility in function representation of a network, not the number of parameters, that determines the generalization of a network. Determining the function representation capacity of the new networks, or their V-C dimensions, could be a difficult problem. Such a determination is necessary for a complete understanding of the generalization capability of the new networks. Nevertheless, the number of parameters is one of the determining factors of the generalization capacity. Fixing the number of parameters in cases 2 and 3 does offer some control of the generalization capability of the network. Efficient algorithms for dynamically assigning and releasing parameters during learning in cases 2 and 3 need to be developed. The complexity of such an approach is an important question that needs to be answered.

Gradient-descent learning algorithms are developed for quasipolynomial synapse networks and for networks with product contacts. Linear synapse networks of the same structure are not able to converge in most of the experiments for which the nonlinear (quasipolynomial) synapse networks (with or without product contacts) are able to converge to the correct classification. In cases where both the linear synapse network and the nonlinear synapse network converge, the latter takes far fewer iterations, usually by several orders of magnitude, to converge than the former.

References

Abu-Mostafa Y (1989) Complexity in neural systems. In: Mead C (ed) Analog VLSI and neural systems. Addison-Wesley, Reading, Mass

Baum EB, Haussler D (1989) What size net gives valid generalization? Neural Comput 1:151-160

Blum A, Rivest RL (1988) Training a 3-node neural network is NP-complete. In: Proceedings 1988 Workshop on Computational Learning Theory. Morgan Kaufmann, San Mateo, Calif, pp 9-18

Bullock TH, Orkand R, Grinnell A (1977) Introduction to nervous systems. Freeman, San Francisco

Carroll SM, Dickinson BW (1989) Construction of neural nets using the Radon transform. In: Proceedings of International Joint Conference on Neural Networks, vol 1, pp 607-611

Cybenko G (1989) Approximation by superpositions of a sigmoidal function. Math Control Signals Syst 2:303-314

Durbin R, Rumelhart DE (1989) Product units: a computationally powerful and biologically plausible extension to backpropagation networks. Neural Comput 1:133-142

Engelmore R, Morgan T (eds) (1988) Blackboard systems. Addison- Wesley, Reading, Mass

Frean M (1990) The upstart algorithm: a method for constructing and training feedforward networks. Neural Comput 2:198-209

Giles CL, Maxwell T (1987) Learning, invariance, and generalization in high-order neural networks. Appl Opt 26:4972-4978

Hornik K, Stinchcombe M, White H (1989) Multilayer feedforward networks are universal approximators. Neural Networks 2:359-366

Judd JS (1990) Neural network design and the complexity of learning. MIT Press, Cambridge, Mass

Liang P (1990) Problem decomposition and subgoaling in artificial neural networks. In: Proceedings 1990 IEEE Conference on Systems, Man, and Cybernetics, Nov 4-7, 1990, Los Angeles, Calif, pp 178-181

Marchand M, Golea M, Rujan P (1990) A convergence theorem for sequential learning in two-layer Perceptrons. Europhys Lett 11:487-492

McCulloch WS, Pitts W (1943) A logical calculus of ideas immanent in nervous activity. Bull Math Biophys 5:115-133

Mezard M, Nadal J (1990) Learning in feedforward layered networks: the tiling algorithm. J Phys A 22:2191

Rumelhart DE, Hinton GE, Williams RJ (1986) Learning internal representations by error propagation. In: Rumelhart DE, McClelland JL (eds) Parallel distributed processing. MIT Press, Cambridge, Mass, pp 318-362

Schürmann J (1976) Multifont word recognition system with application to postal address reading. In: Proceedings of International Joint Conference on Pattern Recognition, pp 658-662

Schürmann J (1977) Polynomklassifikatoren für die Zeichenerkennung. Oldenbourg, München

Sethi IK (1990) Entropy nets: from decision trees to neural networks. Proc IEEE 78:1605-1613

Shepherd GM (1983) Neurobiology. Oxford University Press, New York
Yin HF, Liang P (1991) A connectionist expert system combining production system and associative memory. Int J Pattern Recognition Artif Intell 5:523-544