An algorithm to generate radial basis function (RBF)-like nets for classification problems


CONTRIBUTED ARTICLE 0893-6080(94)00064-6

Neural Networks, Vol. 8, No. 2, pp. 179-201, 1995
Copyright © 1995 Elsevier Science Ltd. Printed in the USA. All rights reserved.
0893-6080/95 $9.50 + .00

An Algorithm to Generate Radial Basis Function (RBF)-Like Nets for Classification Problems

ASIM ROY, SANDEEP GOVIL, AND RAYMOND MIRANDA

Arizona State University

(Received 3 August 1993; revised and accepted 6 June 1994 )

Abstract--This paper presents a new algorithm for generating radial basis function (RBF)-like nets for classification problems. The method uses linear programming (LP) models to train the RBF-like net. Polynomial time complexity of the method is proven and computational results are provided for many well-known problems. The method can also be implemented as an on-line adaptive algorithm.

Keywords--Radial basis function-like nets, Classification problems, Linear programming models.

1. INTRODUCTION--A ROBUST AND EFFICIENT LEARNING THEORY

The science of artificial neural networks needs a robust theory for generating neural networks and for adaptation. Lack of a robust learning theory has been a significant impediment to the successful application of neural networks. A good, rigorous theory for artificial neural networks should include learning methods that adhere to the following stringent performance criteria and tasks.

1. Perform network design task: A neural network learning method must be able to design an appropriate network for a given problem, because it is a task performed by the brain. A predesigned net should not be provided to the method as part of its external input, because it never is an external input to the brain.

2. Robustness in learning: The method must be robust so as not to have the local minima problem, the problems of oscillation, and catastrophic forgetting or similar learning difficulties. The brain does not exhibit such problems.

3. Quickness in learning: The method must be quick in its learning and learn rapidly from only a few examples, much as humans do. For example, a method that learns from only 10 examples (on-line) learns faster than one that needs 100 or 1000 examples (on-line) for the same level of error. "Told him a million times, and he still doesn't understand" is a typical remark often heard of a slow learner.

Acknowledgement: This research was supported, in part, by National Science Foundation grant IRI-9113370 and College of Business Summer Grants.

Requests for reprints should be sent to Asim Roy, Department of Decision and Information Systems, Arizona State University, Tempe, AZ 85287.

4. Efficiency in learning: The method must be computationally efficient in its learning when provided with a finite number of training examples. It must be able to both design and train an appropriate net in polynomial time. That is, given P examples, the learning time should be a polynomial function of P.

5. Generalization in learning: The method must be able to generalize reasonably well so that only a small amount of network resources is used. That is, it must try to design the smallest possible net. This characteristic must be an explicit part of the algorithm.

This theory defines algorithmic characteristics that are obviously much more brain-like than those of classical connectionist theory, which is characterized by predefined nets, local learning laws, and memoryless learning. Judging by these characteristics, classical connectionist learning is not very powerful or robust. First of all, it does not even address the issue of network design, a task that should be central to any neural network learning theory. It is also plagued by efficiency problems (lack of polynomial time complexity, need for an excessive number of teaching examples) and robustness problems (local minima, oscillation, catastrophic forgetting), problems that are partly acquired from its attempt to learn without using memory. Classical connectionist learning, therefore, is not very brain-like at all.

Several algorithms have recently been developed that follow these learning principles (Roy & Mukhopadhyay, 1991; Roy, Kim, & Mukhopadhyay, 1993; Mukhopadhyay et al., 1993; Govil & Roy, 1993). The algorithm presented here also satisfies the brain-like properties described above. Successful and reliable on-line self-learning machines can be developed only if learning algorithms adhere to these learning principles.

2. RADIAL BASIS FUNCTION NETS--BACKGROUND

Radial basis function (RBF) nets belong to the group of kernel function nets that utilize simple kernel functions, distributed in different neighborhoods of the input space, for which responses are essentially local in nature. The architecture consists of one hidden and one output layer. This shallow architecture has a great advantage in terms of computing speed compared to multiple hidden layer nets.

Each hidden node in an RBF net represents one of the kernel functions. An output node generally computes the weighted sum of the hidden node outputs. A kernel function is a local function and the range of its effect is determined by its center and width. Its output is high when the input is close to the center and it decreases rapidly to zero as the input's distance from the center increases. The Gaussian function is a popular kernel function and will be used in this algorithm. The design and training of an RBF net consists of 1) determining how many kernel functions to use, 2) finding their centers and widths, and 3) finding the weights that connect them to an output node.
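To make the computation concrete, the following is a minimal Python/NumPy sketch of such a forward pass for a single output node, assuming Gaussian kernels; the array names (centers, widths, weights) are illustrative and this is not code from the paper.

import numpy as np

def rbf_forward(x, centers, widths, weights):
    """Single-output RBF net forward pass.
    x: (N,) input pattern; centers: (Q, N); widths: (Q,); weights: (Q,)."""
    # Squared Euclidean distance from x to every kernel center
    d2 = np.sum((centers - x) ** 2, axis=1)
    # Local Gaussian response of each hidden node (high near its center)
    g = np.exp(-0.5 * d2 / widths ** 2)
    # Output node: weighted sum of the hidden node outputs
    return weights @ g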

Several radial basis function algorithms have been proposed recently for both classification and real-valued function approximation. Significant contributions include those by Powell (1987), Moody and Darken (1988, 1989), Broomhead and Lowe (1988), Poggio and Girosi (1990), Baldi (1990), Musavi et al. (1992), Platt (1991), and others. In Moody and Darken's algorithm, an adaptive K-means clustering method and a "P-nearest neighbor" heuristic are used to create the Gaussian units (i.e., determine their centers and widths) and the LMS gradient descent rule (Widrow & Hoff, 1960) is used to determine their weights. Musavi et al. (1992) present an algorithm that forms clusters (Gaussian units) that contain patterns of the same class only. They attempt to minimize the number of Gaussian units used and thereby increase the generalization ability. Platt (1991) has developed an RBF algorithm for function approximation where a new RBF unit is allocated whenever an unusual pattern is presented. His network, using the standard LMS method, learns much faster than do those using back propagation. Vrckovnik, Carter, and Haykin (1990) have recently applied RBF nets to classify impulse radar waveforms and Renals and Rohwer (1989) to phoneme classification.

3. BASIC IDEAS AND THE ALGORITHM

The following notation is used. An input pattern is represented by the N-dimensional vector x, x = (x_1, x_2, ..., x_N); x_n denotes the nth element of vector x. The pattern space, which is the set of all possible values that x may assume, is represented by Ω_x. K denotes the total number of classes and k is a class. The method is for supervised learning, where the training set x_1, x_2, ..., x_P is a set of P sample patterns with known classification; x_p denotes the pth vector.

In this method, a particular subset of the hidden nodes, associated with class k, is connected to the kth output node and those class k hidden nodes are not connected to the other output nodes. Therefore, mathematically, the input F_k(x) to the kth output node (i.e., the class k output node) is given by

F_k(x) = \sum_{q=1}^{Q_k} h_q^k G_q^k(x),   (1)

G_q^k(x) = R(\|x - C_q^k\| / w_q^k).   (2)

Here, Q_k is the number of hidden nodes associated with class k, q refers to the qth class k hidden node, G_q^k(x) is the response function of the qth hidden node for class k, R is a radially symmetric kernel function, C_q^k = (C_{q1}^k, ..., C_{qN}^k) and w_q^k are the center and width of the qth kernel function for class k, and h_q^k is the weight connecting the qth hidden node for class k to the kth output node. Generally, a Gaussian with unit normalization is chosen as the kernel function:

G_q^k(x) = \exp\left[-\tfrac{1}{2} \sum_{n=1}^{N} (C_{qn}^k - x_n)^2 / (w_q^k)^2\right].   (3)

The basic idea of the proposed method is to cover a class region with a set of Gaussians of varying widths and centers. The output function F_k(x) for class k, a linear combination of Gaussians, is said to cover the region of class k if it is slightly positive (F_k(x) ≥ ε) for patterns in that class and zero or negative for patterns outside that class. Suppose Q_k Gaussians are required to cover the region of class k in this fashion. The covering (masking) function F_k(x) for class k is given by eqn (1). An input pattern x, therefore, may be determined to be in class k if F_k(x) ≥ ε and not in class k if F_k(x) ≤ 0. This condition, however, is not sufficient, and stronger conditions are stated later.

When the effect of a Gaussian unit is small, it can be safely ignored. This idea of ignoring small Gaussian outputs leads to the definition of a truncated Gaussian unit as

\bar{G}_q^k(x) = \begin{cases} G_q^k(x) & \text{if } G_q^k(x) > \phi, \\ 0 & \text{otherwise,} \end{cases}   (4)

where \bar{G}_q^k(x) is the truncated Gaussian function and φ is a small constant. In computational experiments, φ was set to 10^{-3}. Thus, the function F_k(x) is redefined in terms of \bar{G}_q^k(x) as

F_k(x) = \sum_{q=1}^{Q_k} h_q^k \bar{G}_q^k(x),   (5)


where F_k(x) now corresponds to the output of an RBF net with "truncated" RBF units.

So, in general, Q_k is the number of Gaussians required to cover class k, k = 1, ..., K, F_k(x) is the covering function (mask) for class k, and \bar{G}_1^k(x), ..., \bar{G}_{Q_k}^k(x) are the corresponding Gaussians. Then an input pattern x' will belong to class k iff its mask F_k(x') is at least slightly positive, and the masks for all other classes are zero or negative. This is the necessary and sufficient condition for x' to belong to class k. Here, each mask F_k(x), k = 1, ..., K, will have its own threshold value ε_k as determined during its construction. Expressed in mathematical notation, an input pattern x' is in class k iff F_k(x') ≥ ε_k and F_j(x') ≤ 0 for all j ≠ k, j = 1, ..., K. If all masks have values equal to or below zero, the input cannot be classified. If masks from two or more classes have values above their ε-thresholds, then also the input cannot be classified, unless the maximum of the mask values is used to determine class ("ambiguity rejection").
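As an illustration only (assumed names, not the authors' implementation), the following Python sketch evaluates the truncated-Gaussian mask of eqns (4)-(5) and applies the classification rule above; masks is a hypothetical list holding (centers, widths, weights, eps_k) for each class.

import numpy as np

def mask_value(x, centers, widths, weights, phi=1e-3):
    """F_k(x): weighted sum of truncated Gaussian units, eqns (4)-(5)."""
    d2 = np.sum((centers - x) ** 2, axis=1)
    g = np.exp(-0.5 * d2 / widths ** 2)
    g[g <= phi] = 0.0                      # truncation, eqn (4)
    return weights @ g                     # eqn (5)

def classify(x, masks, ambiguity_by_max=True):
    """masks[k] = (centers_k, widths_k, weights_k, eps_k). Returns a class index
    or None when the pattern cannot be classified."""
    values = np.array([mask_value(x, c, w, h) for (c, w, h, _) in masks])
    eps = np.array([m[3] for m in masks])
    above = np.where(values >= eps)[0]
    if len(above) == 0:
        return None                        # no mask reaches its threshold
    if len(above) == 1 and np.all(np.delete(values, above[0]) <= 0):
        return int(above[0])               # necessary and sufficient condition
    # otherwise ambiguous; optionally decide by the largest mask value
    return int(np.argmax(values)) if ambiguity_by_max else None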

Let TR_k be the set of pattern vectors of any class k for which masking is desired and \overline{TR}_k be the corresponding set of non-class-k vectors, where TR = TR_k ∪ \overline{TR}_k is the total training set. As before, suppose Q_k Gaussians of varying widths and centers are available to cover class k. The following linear program is solved to determine the Q_k weights h^k = (h_1^k, ..., h_{Q_k}^k) for the Q_k Gaussians that minimize the classification error:

minimize   α \sum_{x_i ∈ TR_k} d(x_i) + β \sum_{x_i ∈ \overline{TR}_k} d(x_i)   (6)

subject to

F_k(x_i) + d(x_i) ≥ ε_k,   x_i ∈ TR_k,   (7)
F_k(x_i) − d(x_i) ≤ 0,   x_i ∈ \overline{TR}_k,   (8)
d(x_i) ≥ 0,   x_i ∈ TR,   (9)
ε_k ≥ a small positive constant,   (10)
h^k in F_k(x) unrestricted in sign,   (11)

where the d(x_i) are external deviation variables and α and β are the weights for the in-class and out-of-class deviations, respectively.
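A minimal sketch of how LP (6)-(11) could be set up with scipy.optimize.linprog is given below, purely for illustration (the experiments in this paper used the OB1 interior point code). The response matrix G, the boolean in_class vector, and the parameter names are assumptions of the sketch.

import numpy as np
from scipy.optimize import linprog

def solve_mask_lp(G, in_class, alpha=1.0, beta=1.0, eps_min=1e-2):
    """Solve LP (6)-(11) for one class mask.
    G: (P, Q) matrix of (truncated) Gaussian responses G_q(x_i);
    in_class: length-P boolean vector, True for class-k patterns.
    Variables: Q free weights h, P deviations d >= 0, and one epsilon >= eps_min."""
    P, Q = G.shape
    n = Q + P + 1
    c = np.zeros(n)
    c[Q:Q + P] = np.where(in_class, alpha, beta)   # objective (6): deviations only
    A, b = [], []
    for i in range(P):
        row = np.zeros(n)
        if in_class[i]:
            # (7): sum_q h_q G_q(x_i) + d_i >= eps  ->  -G_i h - d_i + eps <= 0
            row[:Q] = -G[i]; row[Q + i] = -1.0; row[-1] = 1.0
        else:
            # (8): sum_q h_q G_q(x_i) - d_i <= 0
            row[:Q] = G[i]; row[Q + i] = -1.0
        A.append(row); b.append(0.0)
    bounds = [(None, None)] * Q + [(0, None)] * P + [(eps_min, None)]   # (9)-(11)
    res = linprog(c, A_ub=np.array(A), b_ub=np.array(b), bounds=bounds, method="highs")
    return res.x[:Q], res.x[-1]            # weights h^k and threshold eps_k

With alpha = beta = 1 this corresponds to the equal weighting of in-class and out-of-class deviations used in the experiments of Section 4.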

3.1. Generation of Gaussian Units

The network constructed by this algorithm deviates from a typical RBF net. For example, there is truncation at the hidden nodes and the output nodes use a hard-limiting nonlinearity [for the kth output node, the output is 1 if F_k(x) ≥ ε_k, and 0 otherwise]. In addition, the Gaussians here are not viewed as purely local units, because that generally results in a very large net. An explicit attempt is made by the algorithm to obtain good generalization. For that purpose, a variety of overlapping Gaussians (different centers and widths) are created to act both as global and local feature detectors and to help map out the territory of each class with the least number of Gaussians. Though both "fat" (i.e., ones with large widths) and "narrow" Gaussians can be created, the "fat" ones, which detect global features, are created and explored first to see how well the broad territorial features work. The Gaussians, therefore, are generated incrementally and they become narrow local feature detectors in later stages. As new Gaussians are generated for a class at each stage, the LP model [eqns (6)-(11)] is solved using all of the Gaussians generated till that stage and the resulting mask evaluated. Whenever the incremental change in the error rate (training and testing) becomes small or overfitting occurs on the training set, masking of the class is determined to be complete and the appropriate solution for the weights retrieved.

The Gaussians for a class k are generated incrementally (in stages) and several Gaussians can be generated in a stage. Let h (= 1, 2, 3, ...) denote a stage of this process. A stage is characterized by its majority criterion, a parameter that controls the nature of the Gaussians generated (fat or narrow). A majority criterion of 60% for a stage implies that a randomly generated pattern cluster at that stage, which is to be used to define a Gaussian for class k, must have at least 60% of its patterns belonging to class k. Let θ_h denote the majority criterion for stage h. In the algorithm, θ_h starts at 50% (stage 1 majority criterion) and can increase up to 100% in, say, increments of 10%. Thus, the method will have a maximum of six stages (θ_h = 50%, 60%, ..., 100%) when the increment is 10%. A 50% majority criterion allows for the creation of "fatter" Gaussians compared to, say, a 90% majority criterion, and thus can detect global features in the pattern set that might not otherwise be detected by narrow Gaussians of a higher majority criterion.

The Gaussians for a given class k at any stage h are randomly selected in the following way. Randomly pick a pattern vector x_i of class k from the training set and search for all pattern vectors in an expanding δ-neighborhood of x_i. The δ-neighborhood of x_i is expanded as long as class k patterns in the expanded neighborhood retain the minimum majority of θ_h for stage h. The neighborhood expansion is stopped when class k loses its required majority or when a certain maximum neighborhood radius δ_max is reached. When the expansion stops, the class k patterns in the last δ-neighborhood are used to define a Gaussian and are then removed from the training set. To define the next Gaussian, another class k pattern x_i is randomly selected from the remaining training set and its δ-neighborhood is similarly grown to its limits, as explained above. This process of randomly picking a pattern vector x_i of class k from the remaining training set and searching for pattern vectors in an expanding neighborhood of x_i to define the next Gaussian is then repeated until the remaining training set is empty of class k vectors.

The process of generating a Gaussian starts with an initial neighborhood of radius δ_0 and then enlarges the neighborhood in fixed increments of Δδ (δ_r = δ_{r-1} + Δδ). Here δ_r is the neighborhood radius at the rth growth step. Let V_j^r be the set of pattern vectors within the δ_r-neighborhood of starting vector x_i for the jth Gaussian being generated at stage h. A neighborhood size can be increased only if the current pattern set V_j^r from the δ_r-neighborhood satisfies the majority criterion and if δ_r < δ_max. Otherwise, further expansion is stopped. At any growth step r, if the current pattern set V_j^r fails the majority criterion, the previous set V_j^{r-1} (if there is one) is used to create the Gaussian.

When a Gaussian is created from either V_j^r or V_j^{r-1}, the centroid of the class k pattern vectors in the set becomes the center C_q^k, and the standard deviation of their distances from C_q^k becomes w_q^k, assuming the Gaussian being defined is the qth Gaussian for class k, where q is the cumulative total number of Gaussians generated over all of the past and current stages. When the number of patterns in a set V_j^r or V_j^{r-1} is less than a certain minimum, no Gaussian is created; however, the class k patterns in the set are removed from the remaining training set.
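The following Python sketch illustrates one way to implement this neighborhood-growing step under assumed inputs (a patterns array, an integer labels array, and the parameters delta0, d_delta, delta_max, theta_h); it is a simplification for illustration, not the authors' code.

import numpy as np

def grow_gaussian(patterns, labels, k, theta_h, delta0, d_delta, delta_max, rng=None):
    """Grow a delta-neighborhood around a random class-k seed and derive a Gaussian.
    Returns (center, width, class_k_member_indices); center and width are None when
    even the initial neighborhood fails the majority criterion, in which case the
    class-k points are still reported so that the caller can remove them."""
    rng = rng or np.random.default_rng()
    seed = rng.choice(np.where(labels == k)[0])
    dist = np.linalg.norm(patterns - patterns[seed], axis=1)
    accepted, delta = None, delta0
    while delta <= delta_max:
        members = np.where(dist <= delta)[0]
        if np.mean(labels[members] == k) < theta_h:
            break                                  # majority criterion violated
        accepted = members                         # last acceptable neighborhood
        delta += d_delta                           # expand the neighborhood
    used = accepted if accepted is not None else np.where(dist <= delta0)[0]
    in_k = used[labels[used] == k]
    if accepted is None:
        return None, None, in_k
    center = patterns[in_k].mean(axis=0)           # centroid of the class-k members
    width = float(np.std(np.linalg.norm(patterns[in_k] - center, axis=1)))
    return center, width, in_k

In the full procedure, sets with fewer than the minimum number of patterns would not produce a Gaussian, and the class-k members of the accepted set would in every case be removed from the remaining training set.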

3.2. The Algorithm

The algorithm is stated below. The following notation is used. I and R denote the initial and remaining training sets, respectively. δ_max is the maximum neighborhood radius, δ_r is the neighborhood radius at the rth growth step, and Δδ is the δ_r increment at each growth step. V_j^r is the set of pattern vectors within the δ_r-neighborhood of starting vector x_i for the jth Gaussian of any stage h. PC_j^r(k) denotes the percentage of class k members in V_j^r. N_j^r denotes the number of vectors in V_j^r. h is the stage counter, θ_h is the minimum percentage of class k members in stage h, and Δθ is the increment for θ_h at each stage. S_k corresponds to the cumulative set of Gaussians created for class k. C_q^k and w_q^k are the center and width, respectively, of the qth Gaussian for class k. TRE_h and TSE_h are the training and testing set errors, respectively, at the hth stage for the class being masked. β is the minimum number of patterns required in V_j^r to form a Gaussian and ρ is the maximum of the class standard deviations, which are the standard deviations of the distances from the centroid of the patterns of each class. δ_max is set to some multiple of ρ. The fixed increment Δδ is set to some fraction of δ_max − δ_0, Δδ = (δ_max − δ_0)/s, where s is the desired number of growth steps. s was set to 25 and δ_max was set to 10ρ for computational purposes.

The Gaussian Masking (GM) Algorithm

(0) Initialize constants: δ_max = 10ρ, Δθ = some constant (e.g., 10%), δ_0 = some constant (e.g., 0 or 0.1ρ), Δδ = (δ_max − δ_0)/s.

(1) Initialize class counter: k = 0.

(2) Increment class counter: k = k + 1. If k > K, stop. Else, initialize cumulative Gaussian counters: S_k = 0 (empty set), q = 0.

(3) Initialize stage counter: h = 0.

(4) Increment stage counter: h = h + 1. Increase majority criterion: if h > 1, θ_h = θ_{h−1} + Δθ; otherwise θ_h = 50%. If θ_h > 100%, go to (2) to mask next class.

(5) Select Gaussian units for the hth stage: j = 0, R = I.
  (a) Set j = j + 1, r = 1, δ_r = δ_0.
  (b) Select an input pattern vector x_i of class k at random from R, the remaining training set.
  (c) Search for all pattern vectors in R within a δ_r radius of x_i. Let this set of vectors be V_j^r.
    (i) if PC_j^r(k) < θ_h and r > 1, set r = r − 1, go to (e);
    (ii) if PC_j^r(k) ≥ θ_h and r > 1, go to (d) to expand neighborhood;
    (iii) if PC_j^r(k) < θ_h and r = 1, go to (h);
    (iv) if PC_j^r(k) ≥ θ_h and r = 1, go to (d) to expand neighborhood.
  (d) Set r = r + 1, δ_r = δ_{r−1} + Δδ. If δ_r > δ_max, set r = r − 1, go to (e). Else, go to (c).
  (e) Remove class k patterns of the set V_j^r from R. If N_j^r < β, go to (g).
  (f) Set q = q + 1. Compute the center C_q^k and width w_q^k of the qth Gaussian for class k. Add the qth Gaussian to the set S_k. C_q^k = centroid of the class k patterns in the set V_j^r, and w_q^k = standard deviation of the distances of the class k patterns in V_j^r from the centroid C_q^k.
  (g) If R is not empty of class k patterns, go to (a); else go to (6).
  (h) Remove class k patterns of the set V_j^r from R. If R is not empty of class k patterns, go to (a); else go to (6).

(6) From the set S_k, eliminate similar Gaussians (i.e., those with very close centers and widths). Let Q_k be the number of Gaussians after this elimination.

(7) Solve LP (6)-(11) for the class k mask using the Q_k Gaussians.

(8) Compute TSE_h and TRE_h for class k. If h = 1, go to (4). Else:
  (a) If TSE_h < TSE_{h−1}, go to (4).
  (b) If TSE_h > TSE_{h−1} and TRE_h > TRE_{h−1}, go to (4).
  (c) Otherwise, overfitting has occurred. Use the mask generated in the previous stage as the class k mask. Go to (2) to mask next class.

Other stopping criteria, like maximum number of Gaussians used or incremental change in TSE, can also be used.
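The outer stage loop of steps (4)-(8) can then be sketched as below, reusing the grow_gaussian, solve_mask_lp, and mask_value sketches given earlier; step (6), the elimination of near-duplicate Gaussians, is omitted, and all names are illustrative rather than the authors' implementation.

import numpy as np

def gm_mask_class(train_x, train_y, test_x, test_y, k,
                  delta0, d_delta, delta_max, beta_min=2, d_theta=0.10, phi=1e-3):
    """Illustrative GM stage loop for one class k (steps 4-8 of the algorithm)."""
    centers, widths = [], []
    best, prev_tre, prev_tse = None, None, None
    theta = 0.50                                             # stage 1 majority criterion
    while theta <= 1.0 + 1e-9:
        remaining = np.ones(len(train_x), dtype=bool)        # R = I, step (5)
        while np.any(remaining & (train_y == k)):
            idx = np.where(remaining)[0]
            c, w, members = grow_gaussian(train_x[idx], train_y[idx], k,
                                          theta, delta0, d_delta, delta_max)
            remaining[idx[members]] = False                  # steps (e)/(h): remove covered points
            if c is not None and len(members) >= beta_min and w > 0:
                centers.append(c); widths.append(w)          # step (f)
        if centers:
            C, W = np.array(centers), np.array(widths)
            G = np.array([np.exp(-0.5 * np.sum((train_x - c) ** 2, axis=1) / w ** 2)
                          for c, w in zip(C, W)]).T
            G[G <= phi] = 0.0
            h, eps = solve_mask_lp(G, train_y == k)          # step (7)
            tre = np.mean([(mask_value(x, C, W, h) >= eps) != (y == k)
                           for x, y in zip(train_x, train_y)])
            tse = np.mean([(mask_value(x, C, W, h) >= eps) != (y == k)
                           for x, y in zip(test_x, test_y)])
            # step (8): stop when the control-test error worsens without the training error worsening
            if prev_tse is not None and not tse < prev_tse and not (tse > prev_tse and tre > prev_tre):
                return best
            best, prev_tre, prev_tse = (C, W, h, eps), tre, tse
        theta += d_theta                                     # step (4): stricter majority criterion
    return best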

The GM algorithm needs a representative set of examples, other than the training set, to design and train an appropriate net. This set can be called the validation set or control test set. This control test set can sometimes be created by setting aside examples from the training set itself, when enough training examples are available. That has been done for the four overlapping Gaussian distribution test problems in Section 4, where independent control and test sets were created to test the error rate independently during and after training. For the other four test problems, because the number of training examples is limited, the test sets themselves were used as control test sets. If the training and control test sets are representative of the population, no overfitting to a particular (control) test set should occur. Furthermore, the algorithm does not try to minimize the error on the control test set. Explicit error minimization is done only on the training set. The experimental results show that the GM algorithm does not always get the best test result compared to other algorithms.

For multiclass problems, all classes are masked according to the algorithm. For a K class problem, one needs to test a pattern for only (K − 1) classes. If the pattern is not from one of the tested classes, it is assigned to the remaining class. Thus, for a K class problem, one needs to construct a net with the optimal set of Gaussians from the (K − 1) classes that produce the best error rate. The best (K − 1) class combination is selected by testing out all different combinations. For a two-class problem, both classes are masked and the best class selected to construct the net.

3.3. Polynomial Time Convergence of the Algorithm

Polynomial time convergence of the GM algorithm is proved next.

PROPOSITION. The Gaussian Masking (GM) algorithm terminates in polynomial time.

PROOF. Let M be the number of completed stages of the GM algorithm at termination for a class. The largest number of linear programs is solved when θ_h reaches its terminal value of 100%. In this worst case, with a fixed increment of Δθ for θ_h, the number of completed stages M would be (100% − 50%)/Δθ = 50%/Δθ. At each stage, new Gaussian units are generated. Let β = 2, so that only a minimum of two points is required per Gaussian. Suppose, in a worst case scenario, only two-point Gaussians that satisfy the majority criterion are generated at each stage and, without any loss of generality, assume P/K, the average number of patterns per class, is even. Thus, in this worst case, a total of P/2K 100%-majority Gaussians are produced from the P/K examples of a class if θ_h > 50%. (For θ_h = 50%, P/K Gaussians can be produced, but that case is ignored to simplify this derivation.) At the hth stage, a total of hP/2K Gaussians are accumulated. Assume that these hP/2K Gaussians, which are randomly generated, have different centers and widths, and, therefore, are unique. Further assume that to obtain each two-point Gaussian, the neighborhood radius δ_r needs to grow, at most, s times.

In this worst case, therefore, to generate the P/2K two-point Gaussians at each stage,

Z = s\left[P + (P - 2) + (P - 4) + \cdots + \{P - (P/K - 2)\} - P/2K\right]
  = \frac{Ps}{4K^2}\left[(P - 2)K^2 - 2K - (K - 1)\{P(K - 1) - 2K\}\right]

distances are computed and compared with δ_r to find the points within the δ_r-neighborhood. For M stages, MZ distances are computed and compared, which is a polynomial function of P.

Because only Gaussian units are used for masking, the number of variables in the LP (6)-(11) is equal to (hP/2K + P) at the hth stage. The hP/2K LP variables represent the weights of the Gaussians and the other P variables correspond to the deviation variables. Let T_h = hP/2K + P = t_h P (where t_h = h/2K + 1) be the total number of variables and L_h be the binary encoding length of the input data for the LP in stage h. For the LP in eqns (6)-(11), the number of constraints is always equal to P. The binary encoding length of the input data for each constraint is proportional to T_h in stage h if Gaussian truncation is ignored. Hence, L_h ∝ P T_h ∝ t_h P^2, or L_h = a t_h P^2, where a is a proportionality constant.

Khachian's method (1979) solves a linear program in O(L^2 T^4) arithmetic operations, where L is the binary encoding length of the input data and T is the number of variables in the LP. Karmarkar's method (1984) solves a linear program in O(L T^{3.5}) arithmetic operations. More recent algorithms (Todd & Ye, 1990; Monteiro & Adler, 1989) have a complexity of O(L T^3). Using more recent results, the worst case total LP solution time for all M stages is proportional to

\sum_{h=1}^{M} O[L_h T_h^3] = \sum_{h=1}^{M} O[(a t_h P^2)(t_h P)^3] = \sum_{h=1}^{M} O[t_h^4 P^5],

which again is a polynomial function of P. Thus, both Gaussian generation and LP solutions can be done in polynomial time. ■

In practice, LP solution times have been found to grow at a much slower rate, perhaps about O(P^3), if not less.

4. COMPUTATIONAL RESULTS

This section presents computational results on a variety of problems that have appeared in the literature. All problems were solved on a SUN Sparc 2 workstation. Linear programs were solved using Roy Marsten's OB1 interior point code from the Georgia Institute of Technology. OB1 has a number of interior point methods implemented and the dual log barrier penalty method was used. The weights α and β in the LP (6)-(11) were set to 1 in all cases.

The problems were also solved with the RCE method of Reilly, Cooper, and Elbaum (1982), the standard RBF method of Moody and Darken (1988, 1989), and the conjugate gradient method for multilayer perceptrons (MLPs) (Rumelhart, Hinton, & Williams, 1986), and the computational results are reported. One of the commercial versions of back propagation was tried on several of the test problems and the results were miserable. It was then decided to formulate the MLP weight training problem as a purely nonlinear unconstrained optimization problem, and the Polak-Ribiere conjugate gradient method was used to solve it (Luenberger, 1984). A two-layer, fully connected net (a single hidden layer net) and a standard number of hidden nodes (2, 5, 10, 15, and 20) were used on all problems. The starting weights were generated randomly in all cases except for the breast cancer problem, which, for some reason, worked only with zero starting weights; otherwise it got stuck close to the initial solution. The sonar and vowel recognition problems could not be solved even after trying various starting weights and different numbers of hidden units. Lee (1989) reports similar difficulties on the vowel recognition problem with the conjugate gradient method. For standard RBF, K-means clustering was used to obtain Gaussian centers and the LMS rule was used for weight training. The width or standard deviation of a Gaussian was set to some multiple (1, 2, and 3 were the multipliers used) of the distance of its center from the center of its nearest neighbor. All problems were solved with a standard set of RBF nodes (5, 10, ..., 100). The RCE algorithm was run with different levels of pruning; the pruning criterion was specified by the minimum number of points per hypersphere, which was set to some percentage of the in-class points. For all three algorithms, the runs were thus standardized. That is, the choices for the number of hidden nodes (RBF, MLP), the RBF Gaussian widths, and the pruning level for RCE were all standardized. The main reason for this is that, without standardized runs, one gets into a trial-and-error manual optimization process for these net parameters [number of hidden nodes (RBF, MLP), RBF node width, pruning level for RCE] to obtain the best error rate on a test set. That is not a desirable way to evaluate and compare algorithms. This also implies that the best possible results were not obtained for these algorithms. The paper, therefore, wherever possible, quotes better results obtained with these algorithms by other researchers that are known to the authors. All algorithms were run on a SUN Sparc 2 workstation.
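For reference, a rough Python sketch of such a standard RBF baseline (K-means centers, widths set to a multiple of the nearest-neighbor center distance, LMS weight training) is shown below; it is an approximation under assumed parameter names and is not the code used for the reported experiments.

import numpy as np
from scipy.cluster.vq import kmeans2

def train_standard_rbf(X, y, n_gaussians=20, width_mult=2, lr=1e-4, epochs=100):
    """Baseline RBF sketch for a two-class problem with 0/1 targets y."""
    centers, _ = kmeans2(X.astype(float), n_gaussians, minit="++")
    # Width = multiple of the distance to the nearest other center
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    widths = np.maximum(width_mult * d.min(axis=1), 1e-9)
    # Hidden-layer responses for all training patterns
    G = np.exp(-0.5 * ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2) / widths ** 2)
    w = np.zeros(n_gaussians)
    for _ in range(epochs):                 # LMS (Widrow-Hoff) updates of the output weights
        for g, t in zip(G, y):
            w += lr * (t - w @ g) * g
    return centers, widths, w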

4.1. Overlapping Gaussian Distributions

The algorithm was tested on a class of problems where the patterns of a class are normally distributed in the multiple input dimensions. They are two-class problems with overlapping regions. All problems were tried with randomly generated training and test sets. Because they are two-class problems, only one of the classes needs to be masked.

Problem 1: The I-I Problem. A simple two-class problem where the classes are described by Gaussian distributions with different means and identity covariance matrices. A four-dimensional problem with mean vectors [0 0 0 0] and [1 1 1 1] was tried. The Bayes error is about 15.86% in this case. Tables 1A and 9 show that an error rate of 17.90% was obtained by the GM algorithm using 18 Gaussians. Tables 1B-D and 9 show the results for the other algorithms. MLP had the best error rate of 16.51%.

Problem 2: The I-4I Problem. Another four-dimensional two-class problem where the classes are described by Gaussian distributions with zero mean vectors and covariance matrices equal to I and 4I. The optimal classifier here is quadratic, and the optimal Bayes error is 17.64%. Tables 2A and 9 show that an error rate of 17.97% was obtained by the GM algorithm using five Gaussians. Tables 2B-D and 9 show the results for the other algorithms.

Problem 3. This problem is similar to Problem 2, except that it is eight-dimensional instead of four.

TABLE 1 Overlapping Gaussian Distributions: Problem 1

1A: GM Algorithm Results, Problem 1 (Mask Class 0)

Majority       No. of      Cumulative No.   Training Set Error      Control Test Set Error    Independent Test Set Error (10,000 Pts)   Time (s)
Criterion (%)  Gaussians   of Gaussians     Total  Incl.  Outcl.    Total  Incl.  Outcl.      Total  Incl.  Outcl.                      Ph. I  Ph. II
50             1           1                179    15     164       500    500    0           3471   348    3123                        9      383
60             5           6                78     51     27        178    118    60          1891   1249   642                         9      807
70             5           11               73     49     24        178    104    74          1813   1200   613                         8      1050
80             7           18               71     48     23        176    117    59          1790   1169   621                         11     1176
90             12          30               67     43     24        188    120    68          1830   1126   704                         19     1275


TABLE 1 Continued

1B: RCE Results, Problem 1

Pruning                      No. of          Training Set Error      Independent Test Set Error (10,000 Pts)   Training
                             Hyperspheres    Total  Incl.  Outcl.    Total  Incl.  Outcl.                      Time (s)
No pruning (min. 1 point)    109             62     23     39        2204   1054   1150                        3.3
2% pruning (min. 5 points)   53              79     62     17        1947   1327   620                         5.4
4% pruning (min. 10 points)  37              85     74     11        2013   1555   458                         5.4
6% pruning (min. 15 points)  23              94     86     8         2174   1790   384                         5.5

1C: Standard RBF Results, Problem 1

Test Set Error Training Set Error (10,000 Examples) Time (s)

No. of Width Learning Gaussians Multiplier Rate Total Incl. Outcl. Total Incl. Outcl. Ph. I Ph. II

5 1 0.0001 180 51 129 3782 1008 2774 1 7 10 1 0.0001 170 49 121 3544 964 2580 2 9 20 1 0.0001 151 44 107 3218 886 2332 4 19 30 1 0.0001 136 47 89 2750 941 1809 7 30 40 1 0.0001 148 44 104 3031 847 2184 12 29 50 1 0.0001 148 37 111 3175 803 2372 9 38 60 1 0.0001 129 41 88 2788 849 1909 11 60 70 1 0.0001 144 39 105 3039 791 2248 17 57 80 1 0.0001 109 37 72 2422 833 1589 16 108 90 1 0.0001 132 40 92 2896 854 2042 14 85

100 1 0.0001 250 0 250 5000 0 5000 14 22 5 2 0.0001 249 9 240 4969 120 4849 1 3

10 2 0.0001 202 94 108 4212 1798 2414 2 6 20 2 0.0001 98 56 42 2053 1108 945 4 22 30 2 0.0001 98 55 43 2154 1115 1039 7 18 40 2 0.0001 92 50 42 1987 1046 941 12 23 50 2 0.0001 99 52 47 2150 1057 1093 9 22 60 2 0.0001 89 40 49 1868 1013 855 11 30 70 2 0.0001 94 49 45 2047 1033 1014 17 28 80 2 0.0001 90 49 41 1898 1023 875 16 34 90 2 0.0001 86 47 39 1890 1004 886 14 36

100 2 0.0001 95 49 46 2081 1079 1002 22 28 5 3 0.0001 251 10 241 4993 127 4866 2 3

10 3 0.0001 254 12 242 5035 121 4914 2 3 20 3 0.0001 111 65 46 2174 1217 957 4 24 30 3 0.0001 96 54 42 2057 1117 940 7 21 40 3 0.0001 88 49 39 1917 1015 902 12 22 50 3 0.0001 74 42 32 1743 945 798 9 33 60 3 0.0001 81 44 37 1727 895 832 11 29 70 3 0.0001 221 191 30 4241 3663 578 17 12 80 3 0.0001 230 226 4 4561 4450 111 16 13 90 3 0.0001 195 194 1 3838 3784 54 14 19

100 3 0.0001 161 7 154 2876 95 2781 22 46

1D: Multilayer Perceptron Results, Problem 1

No. of Hidden   Training Set Error      Test Set Error (10,000 Examples)   Training
Nodes           Total  Incl.  Outcl.    Total  Incl.  Outcl.               Time (s)
2               76     38     38        1665   890    775                  51
5               66     39     27        1653   900    753                  188
10              66     38     28        1651   903    748                  637
15              76     38     38        1664   895    769                  2190
20              288    38     250       5898   848    5000                 2682

Description: two classes {0, 1 }, four dimensions, different centers. Training examples = 250 + 250 = 500; control test set = 500 + 500 = 1000; independent test set = 5000 + 5000 = 10,000. Minimum number of points per Gaussian = 4% of in-class points = 10. Nomenclature: Incl., number of in-class points classified as out-of-class; Outcl., number of out-of-class points classified as in-class; Total, total number of errors; Ph. I, generation of Gaussians (centers and widths) (GM and standard RBF algorithms); Ph. II, determination of weights (by LP or LMS method, as appropriate).


TABLE 2 Overlapping Gaussian Distributions: Problem 2

2A: GM Algorithm Results, Problem 2 (Mask Class 0)

Majority       No. of      Cumulative No.   Training Set Error      Control Test Set Error    Independent Test Set Error (10,000 Pts)   Time (s)
Criterion (%)  Gaussians   of Gaussians     Total  Incl.  Outcl.    Total  Incl.  Outcl.      Total  Incl.  Outcl.                      Ph. I  Ph. II
50             1           1                90     10     80        218    15     203         2151   159    1992                        9      385
60             1           2                91     49     42        195    95     100         1855   889    966                         4      578
70             1           3                74     27     47        193    63     130         1799   573    1226                        4      524
80             2           5                76     29     47        190    61     129         1797   584    1213                        5      676
90             5           10               74     28     46        204    75     129         1819   677    1142                        14     771

2B: RCE Results, Problem 2

Pruning                      No. of          Training Set Error      Independent Test Set Error (10,000 Pts)   Training
                             Hyperspheres    Total  Incl.  Outcl.    Total  Incl.  Outcl.                      Time (s)
None, min. 1 point           209             48     1      47        2329   623    1706                        6
2%, min. 5 points            121             65     36     29        2270   892    1378                        10
4%, min. 10 points           54              81     67     14        2446   1542   904                         10
6%, min. 15 points           34              112    102    10        2773   2147   626                         10.6

2C: Standard RBF Results, Problem 2

No. of Gaussians

Width Multiplier

Learning Rate

Test Set Error Training Set Error (10,000 Examples) Time (s)

Total Incl. Outcl. Total Incl. Outcl. Ph. I Ph. II

5 1 0.0001 250 0 250 5000 0 5000 1 22 10 1 0.0001 249 0 249 5015 35 4980 3 48 20 1 0.0001 287 6 231 4903 202 4701 4 62 30 1 0.0001 248 0 248 4997 7 4990 4 53 40 1 0.0001 208 22 186 4529 428 4101 8 97 50 1 0.0001 176 25 151 4109 587 3522 15 133 60 1 0.0001 193 20 173 4413 550 3863 10 138 70 1 0.0001 194 18 176 4505 571 3934 14 175

8 0 1 0.0001 187 21 166 4399 674 3725 17 198 90 1 0.0001 220 11 209 4793 330 4463 16 192

100 1 0.0001 199 17 182 4613 542 4071 15 253 5 2 0.0001 250 0 250 5000 0 5000 2 4

10 2 0.0001 250 0 250 5000 0 5000 2 3 20 2 0.0001 106 17 89 2621 292 2329 4 54 30 2 0.0001 116 18 98 2743 279 2464 4 58 40 2 0.0001 92 24 68 2429 443 1986 8 44 50 2 0.0001 102 25 77 2453 428 2025 15 50 60 2 0.0001 90 29 61 2130 562 1568 10 42 70 2 0.0001 100 36 64 2382 659 1723 14 44 80 2 0.0001 86 31 55 2148 603 1545 18 55 90 2 0.0001 101 33 68 2280 610 1670 16 49

100 2 0.0001 103 31 72 2315 555 1760 16 51 5 3 0.0001 250 0 250 5000 0 5000 2 3

10 3 0.0001 250 0 250 5000 0 5000 3 3 20 3 0.0001 250 0 250 5000 0 5000 4 3 30 3 0.0001 270 22 248 5427 476 4951 4 3 40 3 0.0001 250 0 250 5000 0 5000 8 5 50 3 0.0001 250 0 250 5000 0 5000 14 5 60 3 0.0001 272 27 245 5525 612 4913 10 10 70 3 0.0001 282 250 32 5649 4999 650 14 9 80 3 0.0001 259 250 9 5187 5000 187 18 6 90 3 0.0001 260 250 10 5186 5000 186 16 7

100 3 0.0001 250 0 250 5000 0 5000 16 5


TABLE 2 Continued

2D: Multilayer Perceptron Results, Problem 2

No. of Hidden   Training Set Error      Test Set Error (10,000 Examples)   Training
Nodes           Total  Incl.  Outcl.    Total  Incl.  Outcl.               Time (s)
2               137    40     97        3500   1036   2464                 129
5               90     33     57        2454   775    1679                 739
10              70     25     45        2077   734    1343                 873
15              71     25     46        2031   731    1300                 1314
20              61     25     36        2017   766    1251                 2842

Description: two classes {0, 1 }, four dimensions, same center. Training examples = 250 + 250 = 500; control test set = 500 + 500 = 1000; independent test set = 5000 + 5000 = 10,000. Minimum number of points per Gaussian = 4% of in-class points = 10.

The optimal Bayes error is 9% for this problem. Tables 3A and 9 show that an error rate of 9.19% was obtained by the GM algorithm using four Gaussians. Tables 3B-D and 9 show the results for the other algorithms. Musavi et al. (1992) solved this problem with their RBF method and achieved an error rate of 12% with 128 Gaussian units. The error rate was 13% with their MNN method (Musavi et al., 1993). They also report solving this problem with Specht's (1990) PNN method for an error rate of 25.69%, and that the back-propagation algorithm did not converge for over 40 training points. For converged BP nets, the error rate was higher than with their MNN method. Hush and Salas (1990) report that the back-propagation algorithm obtained an error rate of 14.5% with 800 training examples and 30 hidden nodes (it also had the same error rate with 50 hidden nodes), and an error rate of 11% with 6400 training examples and 55-60 hidden nodes.

Problem 4. This is a two-class, two-dimensional problem where the first class has a zero mean vector with identity covariance matrix and the second class has a mean vector [1, 2] and a diagonal covariance matrix with entries of 0.01 and 4.0. The estimated optimal error rate for this problem is 6% (Musavi et al., 1992). Tables 4A and 9 show that an error rate of 7.70% was obtained by the GM algorithm using 16 Gaussians. Tables 4B-D and 9 show the results for the other algorithms. Musavi et al. (1992) solved this problem with their RBF method and achieved an error rate of 9.26% with 86 Gaussian units. Musavi et al. (1993) also solved this problem with their MNN method and Specht's (1990) PNN method, both of which achieved an error rate of about 8% when trained with 300 points. They again report that the back-propagation algorithm failed to converge, regardless of the number of layers and nodes, when provided with over 40 training samples.

4.2. Medical Diagnosis

Breast Cancer Detection. The breast cancer diagnosis problem is described in Mangasarian, Setiono, and Wolberg (1990). The data are from the University of Wisconsin Hospitals and contain 608 cases. Each case has nine measurements made on a fine needle aspirate (fna) taken from a patient's breast. Each measurement is assigned an integer value between 1 and 10, with larger numbers indicating a greater likelihood of malignancy. Of the 608 cases, 379 were benign and the rest malignant. Four hundred five of the cases were used for training and the rest were used for testing. Tables 5A and 9 show that an error rate of 3.94% was obtained by the GM algorithm using 11 Gaussians. Tables 5B-D and 9 show the results for the other algorithms. Bennett and Mangasarian (1992) report average error rates of 2.56% and 6.10% with their MSM1 and MSM methods, respectively.

Heart Disease Diagnosis. This problem is described in Detrano et al. (1989). This database contains 297 cases. Each case has 13 real-valued measurements and is classified as either positive or negative. One hundred ninety-eight of the cases were used for training and the rest were used for testing. Tables 6A and 9 show that an error rate of 18.18% was obtained by the GM algorithm using 24 Gaussians. Tables 6B-D and 9 show the results for the other algorithms. Bennett and Mangasarian (1992) report a test error of 16.53% with their MSM1 method, 25.92% with their MSM method, and about 25% error with back propagation.

4.3. Speech Classification

The vowel classification problem is described in Lippmann (1988). The data were generated from the spectrographic analysis of vowels in words formed by "h", followed by a vowel, followed by a "d", and consist of two-dimensional patterns. The words were spoken by 67 persons, including men, women, and children. The data on 10 vowels were split into two sets for training and testing. Lippmann (1988) tested four classifiers (KNN, Gaussian, two-layer perceptron, and feature map) on this dataset. All classifiers had similar error rates, ranging from 18% to 22.8%.


TABLE 3 Overlapping Gaussian Distributions: Problem 3

3A: GM Algorithm Results, Problem 3 (Mask Class 0)

Majority       No. of      Cumulative No.   Training Set Error      Control Test Set Error    Independent Test Set Error (10,000 Pts)   Time (s)
Criterion (%)  Gaussians   of Gaussians     Total  Incl.  Outcl.    Total  Incl.  Outcl.      Total  Incl.  Outcl.                      Ph. I  Ph. II
50             1           1                41     9      32        88     19     69          937    228    709                         16     398
60             1           2                48     27     21        89     51     38          979    582    397                         9      1199
70             1           3                43     22     21        89     44     45          937    468    469                         9      745
80             1           4                42     19     23        88     36     52          919    375    544                         7      999
90             2           6                42     19     23        84     33     51          927    384    543                         19     1084

3B: RCE Results, Problem 3

Pruning                      No. of          Training Set Error      Independent Test Set Error (10,000 Pts)   Training
                             Hyperspheres    Total  Incl.  Outcl.    Total  Incl.  Outcl.                      Time (s)
None, min. 1 point           176             37     2      35        1322   219    1103                        9.2
2%, min. 5 points            137             39     8      31        1306   255    1051                        16.8
4%, min. 10 points           96              42     17     25        1273   333    940                         15.3
6%, min. 15 points           69              40     20     20        1286   448    888                         15.3

3C: Standard RBF Results, Problem 3

No. of Gaussians

Width Multiplier

Learning Rate

Test Set Error Training Set Error (10,000 Examples) Time (s)

Total Incl. Outcl. Total Incl. Outcl. Ph. I Ph. II

5 1 0.0001 250 0 250 5000 0 5000 3 309 10 1 0.0001 250 0 250 5000 0 5000 5 153 20 1 0.0001 248 0 248 5002 10 4992 9 132 30 1 0.0001 167 15 152 3508 474 3034 13 247 40 1 0.0001 209 8 201 4431 181 4250 10 276 50 1 0.0001 139 14 125 3150 501 2599 13 373 60 1 0.0001 178 15 163 3653 431 3222 17 303 70 1 0.0001 156 13 143 3207 395 2812 19 289 80 1 0.0001 105 27 78 2283 621 1662 26 362 90 1 0.0001 146 19 127 3016 482 2534 34 351

100 1 0.0001 130 20 110 2963 572 2391 24 378 5 2 0.0001 252 2 250 5060 60 5000 3 8

10 2 0.0001 252 2 250 5089 90 4999 5 7 20 2 0.0001 254 4 250 5150 151 4999 9 9 30 2 0.0001 88 3 85 1872 90 1782 13 132 40 2 0.0001 62 2 60 1439 111 1328 10 204 50 2 0.0001 71 2 69 1411 90 1321 13 181 60 2 0.0001 77 4 73 1575 71 1504 17 102 70 2 0.0001 250 0 250 5000 0 12 19 12 80 2 0.0001 250 0 250 5000 0 5000 26 20 90 2 0.0001 308 250 58 6090 5000 1090 34 19

100 2 0.0001 290 250 40 5646 5000 646 24 13 5 3 0.0001 251 1 250 5026 26 5000 3 6

10 3 0.0001 251 1 250 5046 46 5000 5 6 20 3 0.0001 253 3 250 5108 109 4999 9 5 30 3 0.0001 275 26 249 5523 528 4995 13 6 40 3 0.0001 250 0 250 5000 0 5000 10 8 50 3 0.0001 250 0 250 5000 0 5000 13 17 60 3 0.0001 250 250 0 5000 5000 0 17 8 70 3 0.0001 250 0 250 5000 0 5000 19 7 80 3 0.0001 250 0 250 5000 0 5000 26 8 90 3 0.0001 250 0 250 5000 0 5000 34 9

100 3 0.0001 250 0 250 5000 0 5000 24 9

TABLE 3 Continued

3D: Multilayer Perceptron Results, Problem 3

No. of Hidden   Training Set Error      Test Set Error (10,000 Examples)   Training
Nodes           Total  Incl.  Outcl.    Total  Incl.  Outcl.               Time (s)
2               119    48     71        3404   1389   2015                 73
5               59     24     35        2264   838    1426                 6238
10              26     9      17        1683   570    1113                 8325
15              16     2      14        1787   612    1175                 41713
20              14     2      12        1555   580    975                  39440

Description: two classes {0, 1 }, eight dimensions, same center. Training examples = 250 + 250 = 500; control test set = 500 + 500 = 1000; independent test set = 5000 + 5000 = 10,000. Minimum number of points per Gaussian = 4% of in-class points = 10.

TABLE 4 Overlapping Gaussian Distributions: Problem 4

4A: GM Algorithm Results, Problem 4 (Mask Class 0)

Majority       No. of      Cumulative No.   Training Set Error      Control Test Set Error    Independent Test Set Error (10,000 Pts)   Time (s)
Criterion (%)  Gaussians   of Gaussians     Total  Incl.  Outcl.    Total  Incl.  Outcl.      Total  Incl.  Outcl.                      Ph. I  Ph. II
50             2           2                47     8      39        615    36     579         2151   397    1754                        1      49
60             2           4                35     29     6         280    236    44          1359   1190   169                         1      83
70             5           9                16     13     3         245    240    5           891    793    98                          1      96
80             7           16               14     14     0         165    135    30          770    594    176                         1      103
90             7           20               7      6      1         172    144    28          900    519    381                         1      121

4B: RCE Results, Problem 4

Pruning                      No. of          Training Set Error      Independent Test Set Error (10,000 Pts)   Training
                             Hyperspheres    Total  Incl.  Outcl.    Total  Incl.  Outcl.                      Time (s)
None, min. 1 point           15              16     10     6         1269   679    590                         0.2
2%, min. 2 points            15              16     10     6         1269   679    590                         0.3
4%, min. 4 points            11              20     14     6         1235   748    487                         0.3
6%, min. 6 points            11              20     14     6         1235   748    487                         0.4

4C: Standard RBF Results, Problem 4

Test Set Error Training Set Error (10,000 Examples) Time (s)

No. of Width Learning Gaussians Multiplier Rate Total Incl. Outcl. Total Incl. Outcl. Ph. I Ph. II

5 1 0.0001 41 7 34 2417 578 1839 1 4 10 1 0.0001 45 5 40 2408 362 2046 1 7 20 1 0.0001 44 5 39 2352 346 2006 1 15 30 1 0.0001 31 6 25 1970 444 1526 2 30 40 1 0.0001 61 1 60 3114 49 3065 1 24 50 1 0.0001 29 5 24 1878 341 1537 1 55 60 1 0.0001 54 1 53 2810 46 2764 3 37 70 1 0.0001 51 4 47 2519 141 2378 2 51 80 1 0.0001 51 4 47 2519 141 2378 2 59 90 1 0.0001 71 0 71 3674 5 3669 8 55

100 1 0.0001 81 0 81 3958 7 3951 3 68 5 2 0.0001 38 4 34 2335 306 2029 1 5

10 2 0.0001 57 7 50 2889 560 2329 1 6 20 2 0.0001 74 8 66 3646 407 3239 1 6 30 2 0.0001 44 12 32 2402 825 1577 2 13 40 2 0.0001 85 4 81 4388 261 4127 1 10

Continued


TABLE 4 Continued

4C: Standard RBF Results, Problem 4

Test Set Error Training Set Error (10,000 Examples) Time (s)

No. of Width Learning Gaussians Multiplier Rate Total Incl. Outcl. Total Incl. Outcl. Ph. I Ph. II

50 2 0.0001 71 9 62 3507 481 3026 2 16 60 2 0.0001 53 8 45 2665 498 2167 3 19 70 2 0.0001 64 10 54 2968 615 2353 2 28 80 2 0.0001 70 8 62 3442 498 2944 2 28 90 2 0.0001 45 7 38 2393 441 1952 8 30

100 2 0.0001 46 9 37 2505 575 1930 3 49 5 3 0.0001 43 2 41 2510 115 2395 1 8

10 3 0.0001 96 0 96 4877 70 4807 1 5 20 3 0.0001 51 8 43 2691 499 2192 1 9 30 3 0.0001 70 6 64 3389 417 2972 2 7 40 3 0.0001 76 11 65 3635 605 3030 1 10 50 3 0.0001 72 10 62 3550 575 2975 2 14 60 3 0.0001 50 14 36 2510 834 1676 3 20 70 3 0.0001 68 12 56 3198 724 2474 2 22 80 3 0.0001 73 11 62 3465 609 2856 2 23 90 3 0.0001 35 16 19 2072 947 1125 8 32

100 3 0.0001 59 14 45 2777 823 1954 3 38

4D: Multilayer Perceptron Results, Problem 4

No. of Hidden   Training Set Error      Test Set Error (10,000 Examples)   Training
Nodes           Total  Incl.  Outcl.    Total  Incl.  Outcl.               Time (s)
2               33     18     15        1737   987    750                  46
5               31     18     13        1670   1016   654                  98
10              31     20     11        1627   1033   594                  571
15              33     18     15        1736   988    748                  922
20              36     20     16        1875   1045   830                  1236

Description: two classes {0, 1 }, two dimensions, different centers. Training examples = 100 + 100 = 200; control test set = 1000 + 1000 = 2000; independent test set = 5000 + 5000 = 10,000. Minimum number of points per Gaussian = 4% of in-class points = 4.

TABLE 5 Breast Cancer Problem

5A: GM Algorithm Results, Breast Cancer Problem (Mask Class 0)

Majority       No. of      Cumulative No.   Training Set Error      Test Set Error          Time (s)
Criterion (%)  Gaussians   of Gaussians     Total  Incl.  Outcl.    Total  Incl.  Outcl.    Ph. I  Ph. II
50             1           1                15     12     3         10     7      3         16     118
60             1           1                15     12     3         10     7      3         15     118
70             1           2                14     10     4         10     7      3         15     151
80             1           3                14     10     4         10     7      3         12     168
90             1           4                14     10     4         9      6      3         11     222
100            7           11               10     9      1         8      7      1         18     236

5B: RCE Results, Breast Cancer Problem

Pruning                      No. of          Training Set Error      Test Set Error          Training
                             Hyperspheres    Total  Incl.  Outcl.    Total  Incl.  Outcl.    Time (s)
None, min. 1 point           27              13     3      10        10     4      6         1.3
2%, min. 4 points            19              14     10     4         10     5      5         2.2
4%, min. 8 points            17              16     14     2         9      7      2         2.2
6%, min. 12 points           17              16     14     2         9      7      2         2.2


TABLE 5 Continued

5C: Standard RBF Results, Breast Cancer Problem

Training Set Error Test Set Error Time (s) No. of Width Learning

Gaussians Multiplier Rate Total Incl. Outcl. Total Incl. Outcl. Ph. I Ph. II

5 1 0.0001 10 1 0.0001 20 1 0.0001 30 1 0.0001 40 1 0.0001 50 1 0.0001 60 1 0.0001 70 1 0.0001 80 1 0.0001 90 1 0.0001

100 1 0.0001 5 2 0.0001

10 2 0.0001 20 2 0.0001 3O 2 0.0001 40 2 0.0001 50 2 0.0001 60 2 0.0001 70 2 0.0001 80 2 0.0001 90 2 0.0001

100 2 0.0001 5 3 0.0001

10 3 0.0001 2O 3 0.0001 30 3 0.0001 40 3 0.0001 5O 3 0.0001 60 3 0.0001 70 3 0.0001 8O 3 0.0001 90 3 0.0001

100 3 0.0001

26 9 17 11 3 8 1 24 78 7 71 31 3 28 1 40 3 10 23 12 3 9 4 86

32 10 22 18 3 15 7 99 35 10 25 19 3 16 7 175 78 8 70 35 3 32 13 143 44 9 35 21 3 18 12 222 78 9 69 31 3 28 10 194 33 10 23 18 3 15 13 304 31 9 22 12 3 9 37 208 40 9 31 17 3 14 20 348 30 3 27 15 0 15 1 16 21 14 7 15 7 8 1 26 21 11 10 11 5 6 4 22 25 9 16 13 3 10 7 21 13 11 2 11 6 5 7 41 14 11 3 10 5 5 13 51 15 12 3 10 5 5 12 59 14 12 2 10 5 5 10 85 22 11 11 14 5 9 13 68 13 11 2 6 4 2 37 58 23 9 14 11 3 8 20 60 44 2 42 19 0 19 1 27 17 10 7 12 5 7 1 17 14 10 4 8 5 3 4 18 13 10 3 8 5 3 7 18 13 10 3 8 5 3 7 25 14 10 4 11 6 5 13 31 13 11 2 9 7 2 12 37 13 11 2 9 7 2 10 53 14 12 2 12 7 5 13 48 13 11 2 9 6 3 37 37 14 11 3 10 6 4 20 46

5D: Multilayer Perceptron Results, Breast Cancer Problem

No. of Hidden   Training Set Error      Test Set Error          Training
Nodes           Total  Incl.  Outcl.    Total  Incl.  Outcl.    Time (s)
2               13     10     3         6      4      2         563
5               13     10     3         6      4      2         2977
10              15     12     3         6      5      1         2324
15              12     10     2         6      4      2         4524
20              405    251    154       203    128    75        5125

Description: two classes {0, 1 }, nine dimensions. Training examples = 405; testing examples = 203. Minimum number of points per Gaussian = 4% of in-class points = 8 points.

Tables 7A and 9 show that an error rate of 24.3% was obtained by the GM algorithm using 92 Gaussians. Tables 7B, 7C, and 9 show the results for the other algorithms. The conjugate gradient method could not solve this MLP training problem even with various starting points and hidden nodes. Lee (1989) also reports serious local minima problems with conjugate gradient training of MLPs for the vowel problem. Lee (1989) reports obtaining an error rate of 21% with standard back-propagation training and 21.9% with an adaptive step size variation of BP on a single-layer net with 50 hidden nodes. Lee (1989) also reports that a hypersphere algorithm, where hyperspheres are allowed to both expand and contract, had an error rate of 23.1% with 55 hyperspheres after proper pruning.


TABLE 6 Heart Disease Problem

6A: GM Algorithm Results, Heart Disease Problem (Mask Class 0)

Majority       No. of      Cumulative No.   Training Set Error      Test Set Error          Time (s)
Criterion (%)  Gaussians   of Gaussians     Total  Incl.  Outcl.    Total  Incl.  Outcl.    Ph. I  Ph. II
50             2           2                73     65     8         42     41     1         4      41
60             5           7                38     26     12        29     22     7         5      71
70             3           10               32     22     10        28     22     6         8      85
80             4           14               27     18     9         22     14     8         10     90
90             10          24               20     12     8         18     13     5         13     131
100            10          30               19     11     8         18     13     5         15     210

6B: RCE Results, Heart Disease Problem

Pruning                      No. of          Training Set Error      Test Set Error          Training
                             Hyperspheres    Total  Incl.  Outcl.    Total  Incl.  Outcl.    Time (s)
None, min. 1 point           53              19     4      15        29     18     11        2
2%, min. 2 points            53              19     4      15        29     18     11        3.5
4%, min. 4 points            32              25     19     6         29     21     8         3.0
6%, min. 6 points            25              31     27     4         33     25     8         2.9

6C: Standard RBF Results, Heart Disease Problem

No. of Gaussians

Width Multiplier

Learning Rate

Training Set Error Test Set Error Time (s)

Total Incl. Outcl. Total Incl. Outcl. Ph. I Ph. II

5 1 0.0001 77 0 77 52 0 52 7 18 10 1 0.0001 79 0 79 54 0 54 2 24 20 1 0.0001 79 0 79 52 0 52 4 44 30 1 0.0001 77 0 77 52 0 52 3 54 40 1 0.0001 62 1 61 49 1 48 6 74 50 1 0.0001 57 1 56 47 1 46 7 118 60 1 0.0001 75 0 75 52 0 52 9 166 70 1 0.0001 75 0 75 52 0 52 10 187 80 1 0.0001 76 0 76 51 0 51 12 169 90 1 0.0001 74 0 74 51 0 51 13 201

100 1 0.0001 74 0 74 51 0 51 17 223 5 2 0.0001 70 1 69 50 0 50 1 36

10 2 0.0001 73 0 73 56 0 56 2 73 20 2 0.0001 61 2 59 46 1 45 4 42 30 2 0.0001 73 0 73 50 0 50 3 49 40 2 0.0001 46 3 43 41 2 39 6 71 50 2 0.0001 66 0 66 49 1 48 7 78 60 2 0.0001 78 0 78 52 0 52 9 52 70 2 0.0001 68 1 67 98 0 98 10 60 80 2 0.0001 65 2 63 48 1 47 12 62 90 2 0.0001 49 5 44 43 3 40 13 89

100 2 0.0001 71 1 70 50 0 50 17 66 5 3 0.0001 72 0 72 50 0 50 1 17

10 3 0.0001 61 0 61 45 0 45 2 258 20 3 0.0001 51 1 50 41 1 40 4 70 30 3 0.0001 47 3 44 39 3 36 3 81 40 3 0.0001 51 1 50 41 0 41 6 76 50 3 0.0001 50 1 49 42 1 41 7 78 60 3 0.0001 44 4 40 36 2 34 9 43 70 3 0.0001 75 0 75 51 0 51 10 65 80 3 0.0001 52 2 50 44 2 42 12 79 90 3 0.0001 50 5 45 42 3 39 13 91

100 3 0.0001 47 7 40 37 4 33 17 103


TABLE 6 Continued

6D: Multilayer Perceptron Results, Heart Disease Problem

No. of Hidden   Training Set Error      Test Set Error          Training
Nodes           Total  Incl.  Outcl.    Total  Incl.  Outcl.    Time (s)
2               29     12     17        18     6      12        53
5               81     0      81        56     0      56        116
10              29     11     18        19     7      12        627
15              28     12     16        20     7      13        1224
20              198    117    81        99     43     56        1976

Description: two classes {0, 1 }, 13 dimensions. Training examples = 198; testing examples = 99. Minimum number of points per Gaussian = 4% of in-class points = 4.

TABLE 7 Vowel Recognition Problem

7A: GM Algorithm Results, Vowel Recognition Problem

Majority       Cumulative No.   Training Set Error       Test Set Error
Criterion (%)  of Gaussians     Total   Percentage       Total   Percentage
50             37               227     67.2             244     73.2
60             62               132     39.0             117     35.1
70             92               78      23.1             81      24.32
80             111              73      21.6             104     31.2

7B: RCE Results, Vowel Recognition Problem

Pruning                      No. of          Training Set Error       Test Set Error
                             Hyperspheres    Total   Percentage       Total   Percentage
None, min. 1 point           250             116     34.32            191     57.36
2%, min. 1 point             250             116     34.32            191     57.36
4%, min. 2 points            250             116     34.32            191     57.36
6%, min. 3 points            175             79      23.37            143     42.94
8%, min. 3 points            175             79      23.37            143     42.94
10%, min. 4 points           126             56      16.57            110     33.03

7C: Standard RBF Results, Vowel Recognition Problem

Training Set Error No. of Width Learning

Gaussians Multiplier Rate Total Percentage Total

Test Set Error

Percentage

5 1 0.0001 283 83.73 282 84.7 10 1 0.0001 204 60.36 204 61.3 20 1 0.0001 167 49.4 142 42.6 30 1 0.0001 150 44.38 145 43.54 40 1 0.0001 155 45.86 151 45.35 50 1 0.0001 153 45.27 141 42.34 60 1 0.0001 153 45.27 147 44.14 70 1 0.0001 154 45.56 148 44.44 80 1 0.0001 116 34.32 118 35.44 90 1 0.0001 106 31.36 112 33.63

100 1 0.0001 150 44.38 157 47.15 5 2 0.0001 320 94.68 310 93.1

10 2 0.0001 266 78.7 269 80.8 20 2 0.0001 204 60.36 191 5 7 . 3 6 30 2 0.0001 179 52.96 162 48.65 40 2 0.0001 170 50.3 157 47.15 50 2 0.0001 163 48.23 140 42 60 2 0.0001 163 48.23 140 42 70 2 0.0001 164 48.52 143 42.9

Continued


TABLE 7 Continued

7C: Standard RBF Results, Vowel Recognition Problem

Training Set Error Test Set Error No. of Width Learning

Gaussians Multiplier Rate Total Percentage Total Percentage

80 2 0.0001 137 40.53 120 36 90 2 0.0001 150 44.38 140 42

100 2 0.0001 173 51.18 154 46.25 5 3 0.0001 336 99.41 332 99.7

10 3 0.0001 306 90.53 298 89.5 20 3 0.0001 239 70.71 236 70.9 30 3 0.0001 206 60.95 192 57.7 40 3 0.0001 191 56.51 186 55.86 50 3 0.0001 181 53.55 174 52.3 60 3 0.0001 177 52.36 168 50.5 70 3 0.0001 175 51.78 164 49.25 80 3 0.0001 155 45.86 152 45.65 90 3 0.0001 147 43.49 138 41.4

100 3 0.0001 126 37.28 121 36.34

Description: 10 classes {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; two dimensions. Training examples = 338; testing examples = 333. Minimum number of points per Gaussian = 4% of in-class points = 2.

problem with BP with 50 hidden nodes and achieving an error rate of 19.8%. Moody and Darken (1989) report obtaining an error rate of 18% with 100 hidden nodes with their RBF method.

4.4. Target Recognition

The sonar target classification problem is described in Gorman and Sejnowski (1988). The task is to discriminate between sonar signals bounced off a metal cylinder and those bounced off a roughly cylindrical rock. The patterns were obtained by bouncing sonar signals off the two cylinder types at various angles and under various conditions. Each pattern is a set of 60 numbers between 0.0 and 1.0. The training and test sets each have 104 members. Gorman and Sejnowski (1988) experimented with a no-hidden-layer perceptron and single hidden layer perceptrons with 2, 3, 6, 12, and 24 hidden units. Each network was trained 10 times by the back-propagation algorithm over 300 epochs. The error rate decreased from 26.9% for zero hidden units (with standard deviation of error = 4.8) to 9.6% for 12 hidden units (with standard deviation of error = 1.8). They also report that a KNN classifier had an error rate of 17.3%. Roy et al. (1993) had reported that the training set is actually linearly separable. Tables 8A and 9 show that an error rate of 21.15% was obtained

TABLE 8 Sonar Problem

8A: GM Algorithm Results, Sonar Problem (Mask Class 0)

Majority Criterion (%) | No. of Gaussians | Cumulative No. of Gaussians | Training Set Error (Total, Incl., Outcl.) | Test Set Error (Total, Incl., Outcl.) | Time (s) (Ph. I, Ph. II)
50 | 7 | 7 | 37, 6, 31 | 30, 9, 21 | 58, 13
60 | 6 | 13 | 21, 11, 10 | 29, 18, 11 | 87, 19
70 | 8 | 21 | 13, 11, 2 | 24, 20, 4 | 111, 26
80 | 10 | 28 | 11, 8, 3 | 24, 23, 1 | 135, 30
90 | 11 | 32 | 7, 5, 2 | 22, 14, 8 | 156, 34
100 | 11 | 32 | 7, 5, 2 | 22, 14, 8 | 160, 34

8B: RCE Results, Sonar Problem

Pruning | No. of Hyperspheres | Training Set Error (Total, Incl., Outcl.) | Test Set Error (Total, Incl., Outcl.) | Training Time (s)
None, min. 1 point | 20 | 21, 10, 11 | 28, 17, 11 | 3
2%, min. 1 point | 20 | 21, 10, 11 | 28, 17, 11 | 5.7
4%, min. 2 points | 20 | 21, 10, 11 | 28, 17, 11 | 4.3
6%, min. 3 points | 12 | 22, 16, 6 | 34, 25, 9 | 5.6


TABLE 8 Continued

8C: Standard RBF Results, Sonar Problem

No. of Gaussians | Width Multiplier | Learning Rate | Training Set Error (Total, Incl., Outcl.) | Test Set Error (Total, Incl., Outcl.) | Time (s) (Ph. I, Ph. II)
5 | 1 | 0.0001 | 52, 52, 0 | 44, 39, 5 | 4, 41
10 | 1 | 0.0001 | 55, 55, 0 | 45, 41, 4 | 5, 163
20 | 1 | 0.0001 | 55, 55, 0 | 42, 42, 0 | 7, 155
30 | 1 | 0.0001 | 55, 55, 0 | 41, 41, 0 | 10, 204
40 | 1 | 0.0001 | 55, 55, 0 | 41, 41, 0 | 15, 197
50 | 1 | 0.0001 | 55, 55, 0 | 41, 41, 0 | 21, 195
60 | 1 | 0.0001 | 55, 55, 0 | 42, 42, 0 | 22, 423
70 | 1 | 0.0001 | 55, 55, 0 | 42, 42, 0 | 26, 407
80 | 1 | 0.0001 | 55, 55, 0 | 42, 42, 0 | 30, 328
90 | 1 | 0.0001 | 55, 55, 0 | 42, 42, 0 | 29, 535
100 | 1 | 0.0001 | 55, 55, 0 | 42, 42, 0 | 39, 517
5 | 2 | 0.0001 | 54, 54, 0 | 39, 38, 1 | 4, 7
10 | 2 | 0.0001 | 54, 54, 0 | 41, 38, 3 | 5, 11
20 | 2 | 0.0001 | 55, 55, 0 | 41, 41, 0 | 7, 12
30 | 2 | 0.0001 | 55, 55, 0 | 42, 42, 0 | 10, 15
40 | 2 | 0.0001 | 55, 55, 0 | 42, 42, 0 | 15, 17
50 | 2 | 0.0001 | 55, 55, 0 | 42, 42, 0 | 21, 18
60 | 2 | 0.0001 | 55, 55, 0 | 42, 42, 0 | 22, 21
70 | 2 | 0.0001 | 55, 55, 0 | 42, 42, 0 | 26, 20
80 | 2 | 0.0001 | 55, 55, 0 | 42, 42, 0 | 30, 23
90 | 2 | 0.0001 | 55, 55, 0 | 42, 42, 0 | 29, 27
100 | 2 | 0.0001 | 55, 55, 0 | 42, 42, 0 | 39, 29
5 | 3 | 0.0001 | 55, 55, 0 | 41, 41, 0 | 4, 5
10 | 3 | 0.0001 | 55, 55, 0 | 41, 41, 0 | 5, 7
20 | 3 | 0.0001 | 55, 55, 0 | 42, 42, 0 | 7, 8
30 | 3 | 0.0001 | 55, 55, 0 | 42, 42, 0 | 10, 10
40 | 3 | 0.0001 | 55, 55, 0 | 42, 42, 0 | 15, 10
50 | 3 | 0.0001 | 55, 55, 0 | 42, 42, 0 | 21, 13
60 | 3 | 0.0001 | 55, 55, 0 | 42, 42, 0 | 22, 15
70 | 3 | 0.0001 | 55, 55, 0 | 42, 42, 0 | 26, 17
80 | 3 | 0.0001 | 55, 55, 0 | 42, 42, 0 | 30, 16
90 | 3 | 0.0001 | 55, 55, 0 | 42, 42, 0 | 29, 17
100 | 3 | 0.0001 | 55, 55, 0 | 42, 42, 0 | 39, 18

Description: two classes {0, 1}, 60 dimensions. Training examples = 104; testing examples = 104. Minimum number of points per Gaussian = 4% of in-class points = 2.

by the GM algorithm using 32 Gaussians. Tables 8B, 8C, and 9 show the results for the other algorithms. The conjugate gradient method ran into local minima problems and could not solve this problem, even with various starting points and numbers of hidden nodes.

Table 9 summarizes the results for all of the algorithms and test problems. Figures 1-4 show how the classification boundary develops as the algorithm progresses for the two-dimensional overlapping Gaussian distribution problem (problem 4).

TABLE 9 Summary of Results: All Four Algorithms

Problem Type | Test Error Rate (%) (GM, RBF, RCE, MLP) | Network Size: No. of Hidden Nodes or Hyperspheres (GM, RBF, RCE, MLP)
Overlapping Gaussians, Problem 1 | 17.90, 17.27, 19.47, 16.51 | 18, 60, 53, 10
Problem 2 | 17.97, 21.30, 22.7, 20.17 | 5, 60, 121, 20
Problem 3 | 9.19, 14.11, 12.73, 15.55 | 4, 50, 96, 20
Problem 4 | 7.70, 18.78, 12.35, 16.27 | 16, 50, 11, 10
Breast cancer | 3.94, 2.96, 4.43, 2.96 | 11, 90, 17, 2
Heart disease | 18.18, 36.36, 29.29, 18.18 | 24, 60, 32, 2
Vowel recognition | 24.32, 33.63, 33.03, N/A | 92, 90, 126, N/A
Sonar | 21.15, 37.51, 26.92, N/A | 32, 5, 20, N/A


FIGURE 1. Decision boundary for Gaussian problem 4 with 50% mask.

4.5. GM Algorithm With LMS Training

Another algorithmic possibility is to generate the Gaussians with the GM algorithm and then use them to train a regular RBF net with the LMS algorithm or by matrix inversion. A regular RBF net in this case means one without the thresholding at the output nodes that is used in the GM algorithm. Tables 10-16 show the results of this modified algorithm for some of the test problems, and Table 17 compares it with the GM algorithm. As can be observed, the modified algorithm is indeed quite fast, but the GM algorithm provides better quality solutions most of the time.
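To make this two-stage idea concrete, the sketch below shows one way the output weights of such a regular RBF net could be fitted once the GM algorithm has fixed the Gaussians. It is an illustration only, not the implementation used for the tables: the function names, the one-of-c target coding, the winner-take-all readout, and the arrays centers and widths (assumed to come from the GM phase) are our own assumptions, and the learning rate is an arbitrary placeholder.

    import numpy as np

    def rbf_activations(X, centers, widths):
        # Gaussian response of every basis function (columns) to every example (rows).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-d2 / (2.0 * widths[None, :] ** 2))

    def lms_output_weights(X, Y, centers, widths, rate=0.0001, epochs=100):
        # Fit the output-layer weights of a regular (unthresholded) RBF net by
        # Widrow-Hoff LMS updates; Y is a 0/1 matrix with one column per class.
        H = rbf_activations(X, centers, widths)      # P x K design matrix
        W = np.zeros((H.shape[1], Y.shape[1]))       # K x c output weights
        for _ in range(epochs):
            for h, y in zip(H, Y):                   # one pass per training example
                W += rate * np.outer(h, y - h @ W)   # LMS correction toward the target
        return W

    def pinv_output_weights(X, Y, centers, widths):
        # Direct least-squares alternative to the LMS loop ("matrix inversion").
        H = rbf_activations(X, centers, widths)
        return np.linalg.pinv(H) @ Y

    def classify(X, centers, widths, W):
        # Predicted class = index of the largest output-node response.
        return np.argmax(rbf_activations(X, centers, widths) @ W, axis=1)

The pseudo-inverse routine corresponds to the matrix-inversion option mentioned above; either route yields the same kind of unthresholded linear output layer.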

FIGURE 3. Decision boundary for Gaussian problem 4 with 70% mask.

5. ADAPTIVE LEARNING WITH MEMORY--AN ALGORITHM

In the current neural network learning paradigm, on-line adaptive algorithms like back propagation, RCE, and RBF, which attempt to generate the general characteristics of a problem from the training examples, are not supposed to store or remember any particular information or example. These algorithms can observe and learn whatever they can from an example, but must forget that example completely thereafter. This process of learning is obviously very memory efficient. However, the process does slow down learning because it does not allow for learning by comparison. Humans learn

FIGURE 2. Decision boundary for Gaussian problem 4 with 60% mask.

" , . , . ~ o . . . ° " ° : : '"

..................... . i . o ° . .o o-~',:;e io , . . . .

r : F :

- 4 - 2 0 2 4

FIGURE 4. Decision boundary for Gaussian problem 4 with 80% mask.


TABLE 10 GM Algorithm With LMS Weight Training, Problem 1, Overlapping Gaussians


Majority Criterion (%) | No. of Gaussians | Cumulative No. of Gaussians | Training Set Error (Total, Incl., Outcl.) | Independent Test Set Error, 10,000 Examples (Total, Incl., Outcl.) | Time (s) (Ph. I, Ph. II)
50 | 1 | 1 | 171, 146, 25 | 3250, 2671, 569 | 9, 1
60 | 5 | 6 | 111, 73, 38 | 2135, 1326, 809 | 9, 34
70 | 5 | 11 | 87, 40, 47 | 1678, 847, 831 | 8, 71
80 | 7 | 18 | 75, 37, 38 | 1724, 886, 838 | 11, 167
90 | 12 | 30 | 80, 43, 37 | 1699, 957, 742 | 19, 79
100 | 12 | 42 | 78, 40, 38 | 1787, 912, 875 | 31, 51

TABLE 11 GM Algorithm With LMS Weight Training, Problem 2, Overlapping Gaussians

Majority Criterion (%) | No. of Gaussians | Cumulative No. of Gaussians | Training Set Error (Total, Incl., Outcl.) | Independent Test Set Error, 10,000 Examples (Total, Incl., Outcl.) | Time (s) (Ph. I, Ph. II)
50 | 1 | 1 | 141, 139, 2 | 3060, 3038, 22 | 9, 1
60 | 1 | 2 | 78, 61, 17 | 2012, 1695, 317 | 4, 33
70 | 1 | 3 | 71, 55, 16 | 1901, 1543, 358 | 4, 32
80 | 2 | 5 | 145, 60, 85 | 3058, 1294, 1764 | 5, 87
90 | 5 | 10 | 114, 62, 52 | 2626, 1479, 1147 | 14, 81
100 | 9 | 19 | 92, 27, 65 | 2300, 909, 1391 | 26, 202

TABLE 12 GM Algorithm With LMS Weight Training, Problem 3, Overlapping Gaussians

Majority Criterion (%) | No. of Gaussians | Cumulative No. of Gaussians | Training Set Error (Total, Incl., Outcl.) | Independent Test Set Error, 10,000 Examples (Total, Incl., Outcl.) | Time (s) (Ph. I, Ph. II)
50 | 1 | 1 | 38, 28, 10 | 931, 669, 262 | 16, 2
60 | 1 | 2 | 67, 13, 54 | 1708, 274, 1434 | 9, 2
70 | 1 | 3 | 80, 71, 9 | 1553, 1321, 232 | 9, 2
80 | 1 | 4 | 90, 79, 11 | 1618, 1408, 210 | 7, 2
90 | 2 | 6 | 87, 54, 33 | 2020, 1045, 975 | 19, 3
100 | 25 | 31 | 89, 65, 24 | 2137, 1356, 781 | 91, 9

TABLE 13 GM Algorithm With LMS Weight Training, Problem 4, Overlapping Gaussians

Majority Criterion (%) | No. of Gaussians | Cumulative No. of Gaussians | Training Set Error (Total, Incl., Outcl.) | Independent Test Set Error, 10,000 Examples (Total, Incl., Outcl.)
50 | 2 | 2 | 67, 66, 1 | 3304, 3242, 62
60 | 2 | 4 | 71, 9, 62 | 3253, 271, 2982
70 | 5 | 9 | 53, 33, 20 | 2134, 1453, 681
80 | 7 | 16 | 40, 28, 12 | 2065, 1379, 686
90 | 7 | 20 | 42, 26, 16 | 1948, 1197, 751
100 | 7 | 24 | 39, 15, 24 | 1401, 477, 924


TABLE 14 GM Algorithm With LMS Weight Training, Breast Cancer Problem

Majority Criterion (%) | No. of Gaussians | Cumulative No. of Gaussians | Training Set Error (Total, Incl., Outcl.) | Test Set Error (Total, Incl., Outcl.) | Time (s) (Ph. I, Ph. II)
50 | 1 | 1 | 16, 3, 13 | 10, 3, 7 | 16, 1
60 | 1 | 1 | 16, 3, 13 | 10, 3, 7 | 15, 1
70 | 1 | 2 | 15, 5, 10 | 10, 3, 7 | 15, 1
80 | 1 | 3 | 32, 29, 3 | 17, 16, 1 | 12, 1
90 | 1 | 4 | 19, 4, 15 | 10, 2, 8 | 11, 1
100 | 7 | 11 | 19, 9, 10 | 9, 4, 5 | 18, 120

TABLE 15 GM Algorithm With LMS Weight Training, Heart Disease Problem

Majority Criterion (%) | No. of Gaussians | Cumulative No. of Gaussians | Training Set Error (Total, Incl., Outcl.) | Test Set Error (Total, Incl., Outcl.) | Time (s) (Ph. I, Ph. II)
50 | 2 | 2 | 132, 69, 63 | 68, 46, 22 | 4, 3
60 | 5 | 7 | 70, 55, 15 | 30, 24, 6 | 5, 34
70 | 3 | 10 | 80, 50, 30 | 29, 22, 7 | 8, 29
80 | 4 | 14 | 57, 40, 17 | 23, 17, 6 | 10, 165
90 | 10 | 24 | 69, 57, 12 | 37, 34, 3 | 13, 37
100 | 10 | 30 | 48, 37, 11 | 22, 19, 3 | 15, 211

TABLE 16 GM Algorithm With LMS Weight Training, Sonar Problem

Majority Criterion (%) | No. of Gaussians | Cumulative No. of Gaussians | Training Set Error (Total, Incl., Outcl.) | Test Set Error (Total, Incl., Outcl.) | Time (s) (Ph. I, Ph. II)
50 | 7 | 7 | 44, 17, 27 | 58, 12, 46 | 58, 40
60 | 6 | 13 | 42, 18, 24 | 47, 13, 34 | 87, 67
70 | 8 | 21 | 36, 14, 22 | 50, 12, 38 | 111, 153
80 | 10 | 28 | 37, 15, 22 | 48, 12, 36 | 135, 227
90 | 11 | 32 | 38, 16, 22 | 47, 12, 35 | 156, 268
100 | 11 | 32 | 38, 16, 22 | 47, 12, 35 | 160, 268

TABLE 17 Summary of Results: GM Algorithm and GM Algorithm With LMS Weight Training

Problem Type | Test Error Rate (%) (GM, GM with LMS Training) | Network Size: No. of Hidden Nodes (GM, GM with LMS Training)
Overlapping Gaussians, Problem 1 | 17.90, 16.78 | 18, 11
Problem 2 | 17.97, 19.01 | 5, 3
Problem 3 | 9.19, 9.31 | 4, 1
Problem 4 | 7.70, 14.01 | 16, 24
Breast cancer | 3.94, 4.43 | 11, 11
Heart disease | 18.18, 22.22 | 24, 30
Sonar | 21.15, 45.19 | 32, 13

rapidly when allowed to compare the objects/concepts to be learned, as it provides very useful extra information. If one, for example, is expected to learn to pronounce a thousand Chinese characters, it would definitely help to see very similar ones together, side-by-side, to properly discriminate between them. If one is denied this opportunity, and shown only one character at a time, with the others hidden away, this same task becomes much more difficult and the learning could take longer. In the case of learning new medical diagnosis and treatment, if medical researchers were not allowed to remember, recall, compare, and discuss different cases and their diagnostic procedures, treatments, and results, the learning of new medical treatments would be slow and in serious jeopardy. Remembering relevant facts and examples is very much a part of the


human learning process because it facilitates comparison of facts and information that forms the basis for rapid learning.

To simulate on-line adaptive learning with no memory, generally a fixed set of training examples is repeatedly cycled through an algorithm. The supposition, however, is that a new example is being observed on-line each time. If an algorithm has to observe n examples p times for such simulated learning, it implies requiring np different training examples on-line. So, if a net is trained with 100 examples over 100 epochs, it implies that it was required to observe 10,000 examples on-line. On-line adaptive learning with no memory is an inefficient form of learning because it requires observing many more examples for the same level of error-free learning than those that use memory. For example, Govil and Roy (1993) report that their RBF algorithm for function approximation learned to predict the logistic map function very accurately (0.129% error) with just 100 training examples. In comparison, Moody and Darken (1989) report training a back-propagation net on the same problem with 1000 training examples that took 200 iterations (line minimizations) of the conjugate gradient method and obtaining a prediction accuracy of 0.59%. This means that the back-propagation algorithm, in a real on-line adaptive mode, would have required at least 1000 x 200 = 200,000 on-line examples to learn this map, which is 199,900 examples more than that required by the RBF algorithm of Govil and Roy (1993), an algorithm that uses memory for quick and efficient learning. If the examples were being generated in a very slow and costly process, which is often the case, this would have meant a long and costly wait before the back-propagation algorithm learned. On the other hand, in such a situation, the net generated by the memory-based RBF algorithm could have been operational after only 100 observations. This essentially implies that in many critical applications, where training examples are in short supply, and costly and hard to generate, an on-line no-memory adaptive algorithm could be a potential disaster, because it cannot learn quickly from only a few examples. For example, it might be too risky to employ such a system on-line to detect credit card fraud. New fraudulent practices may not be properly detected by such a no-memory system for quite some time and can result in significant losses to a company. The "thieves" would be allowed to enjoy their new inventions for quite some time.

An on-line adaptive learning algorithm based on the GM method is proposed here. The basic idea is as follows. Suppose that some memory is available to the algorithm to store examples. It uses part of the memory to store some testing examples and the remaining part to store training examples. Assume it first collects and stores some test examples. Training examples are then collected and stored incrementally, and the GM algorithm is used on the available training set at each stage to generate (regenerate) an RBF-like net. Once the training and testing set errors converge and stabilize, on-line training is complete. During the operational phase, the system continues to monitor its error rate by collecting and testing batches of incoming examples. If the test error is found to have increased beyond a certain level, it proceeds to retrain itself in the manner described above.

The following notation is used to describe the proposed algorithm. M_A denotes the maximum number of examples that can be stored by the algorithm. N_TS corresponds to the number of testing examples stored and N_TR to the number of training examples stored, where N_TR + N_TS <= M_A. η is the incremental addition to the training set N_TR. tse_j and tre_j correspond to the testing and training set errors, respectively, after the jth incremental addition to the training set. tse_old denotes the test set error after completion of training and tse_new the test error on a new batch of on-line examples. μ is the tolerance for the difference between tse_new and tse_old, and ρ is the tolerance for the error rate difference during incremental learning or adaptation. The adaptive algorithm is summarized below; a short code sketch follows the steps.

On-Line Adaptive Learning with Fixed Memory

(1) Collect N_TS examples on-line for testing.
(2) Initialize counters and constants: j = 0, N_TR = 0, μ.
(3) Increment the collection counter: j = j + 1.
(4) Collect η (additional) examples for training and add them to the training set; N_TR = N_TR + η. If N_TR + N_TS > M_A, go to (7).
(5) Regenerate the RBF net with the GM algorithm using the N_TR training examples and the N_TS testing examples.
(6) Compute tse_j and tre_j. If j = 1, go to (3). If |tse_j - tse_(j-1)| <= ρ and |tre_j - tre_(j-1)| <= ρ, go to (7); else go to (3).
(7) The current adaptation is complete. Set tse_old = tse_j. Test the system continuously for any significant change in its error rate.
(8) Collect N_TS new examples on-line for testing; test and compute tse_new.
(9) If |tse_new - tse_old| <= μ, go to (8). Otherwise, it is time to retrain; go to (2).
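The loop below is a minimal sketch of these nine steps in code, under the notation defined above. It is illustrative only: next_batch (the on-line data source), train_gm (a wrapper that regenerates the RBF-like net with the GM algorithm and reports its training and test error rates), and the net's error_rate method are hypothetical placeholders, not part of the published method.

    def adapt_with_fixed_memory(next_batch, train_gm, M_A, N_TS, eta, rho, mu):
        # next_batch(k) is assumed to return the next k labelled on-line examples;
        # train_gm(train_set, test_set) is assumed to regenerate the RBF-like net
        # and return (net, training_error_rate, test_error_rate).
        assert N_TS + eta <= M_A, "memory must hold the test set plus one training increment"
        test_set = next_batch(N_TS)                        # step (1)
        while True:                                        # each pass is one (re)adaptation
            train_set, j = [], 0                           # step (2)
            prev_tse = prev_tre = None
            while True:
                j += 1                                     # step (3)
                train_set.extend(next_batch(eta))          # step (4)
                if len(train_set) + len(test_set) > M_A:
                    break                                  # memory full: go to step (7)
                net, tre_j, tse_j = train_gm(train_set, test_set)   # step (5)
                if (prev_tse is not None and
                        abs(tse_j - prev_tse) <= rho and
                        abs(tre_j - prev_tre) <= rho):
                    break                                  # step (6): errors have stabilized
                prev_tse, prev_tre = tse_j, tre_j
            tse_old = tse_j                                # step (7): adaptation complete
            while True:                                    # operational monitoring phase
                tse_new = net.error_rate(next_batch(N_TS)) # step (8)
                if abs(tse_new - tse_old) > mu:            # step (9): significant drift
                    break                                  # retrain: back to step (2)

In an actual deployment this loop would run indefinitely; the sketch simply drops back into the retraining phase whenever the monitored test error drifts by more than μ.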

Table 18 shows how the overlapping Gaussian distribution problem 4 is adapted on-line. RBF nets were generated with the GM algorithm at increments of 100 examples, and ρ was set to 1.5%. With this ρ, adaptation is complete within 300 examples. If ρ is reduced further, adaptation would take longer (i.e., would need more examples). If the GM algorithm used a different stopping rule, such as |tre_j - tse_j| within some bound, then


TABLE 18 On-Line Adaptation of Overlapping Gaussian Distributions, Problem 4

Collection Counter | N_TR, Cumulative No. of Training Points | Cumulative No. of Gaussians | LP Solution Time (s) | tre_j (%) | tse_j (%) | |tse_j - tse_(j-1)| (%) | |tre_j - tre_(j-1)| (%) | |tre_j - tse_j| (%)
1 | 100 | 11 | 9 | 5 | 9.65 | -- | -- | 4.65
2 | 200 | 13 | 69 | 6.5 | 7.15 | 2.5 | 1.5 | 0.65
3 | 300 | 8 | 213 | 7.67 | 7.9 | 0.75 | 1.17 | 0.23
4 | 400 | 17 | 555 | 6 | 6.15 | 1.75 | 1.67 | 0.15
5 | 500 | 17 | 1129 | 7.4 | 6.75 | 0.5 | 1.4 | 0.65

it could have stopped at 400 examples and obtained close to the optimal error rate of 6.15% on the test set.

6. CONCLUSION

The paper has defined a set of robust and computationally efficient learning principles for neural network algorithms. The algorithm presented here, along with some of the previous ones (Govil and Roy, 1993; Mukhopadhyay et al., 1993; Roy & Mukhopadhyay, 1991; Roy et al., 1993), has been based on these learning principles. These learning principles differ substantially from classical connectionist learning.

Extensive computational experiments show that the algorithm presented here works quite well. Work is underway to improve these methods and extend them to other types of neural networks. They will be reported in the future.

REFERENCES

Baldi, P. (1990). Computing with arrays of bell-shaped and sigmoid functions. Proceedings of IEEE Neural Information Processing Systems, 3, 728-734.
Bennett, K. P., & Mangasarian, O. L. (1992). Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software, 1, 23-34.
Blum, A. L., & Rivest, R. L. (1992). Training a 3-node neural network is NP-complete. Neural Networks, 5(1), 117-127.
Broomhead, D., & Lowe, D. (1988). Multivariable function interpolation and adaptive networks. Complex Systems, 2, 321-355.
Detrano, R., Janosi, A., Steinbrunn, W., Pfisterer, M., Schmid, J., Sandhu, S., Guppy, K., Lee, S., & Froelicher, V. (1989). International application of a new probability algorithm for the diagnosis of coronary artery disease. American Journal of Cardiology, 64, 304-310.
Gorman, R. P., & Sejnowski, T. J. (1988). Analysis of hidden units in a layered network trained to classify sonar targets. Neural Networks, 1, 75-89.
Govil, S., & Roy, A. (1993). Generating a radial basis function net in polynomial time for function approximation. Working paper.
Hush, D. R., & Salas, J. M. (1990). Classification with neural networks: A comparison (UNM Tech. Rep. No. EECE 90-004). University of New Mexico.
Judd, J. S. (1990). Neural network design and the complexity of learning. Cambridge, MA: MIT Press.
Karmarkar, N. (1984). A new polynomial time algorithm for linear programming. Combinatorica, 4, 373-395.
Khachian, L. G. (1979). A polynomial algorithm in linear programming. Doklady Akademii Nauk SSSR, 244(5), 1093-1096.
Lee, Y. (1989). Classifiers: Adaptive modules in pattern recognition systems. Master's thesis, Department of Electrical Engineering and Computer Science, MIT, Cambridge, MA.
Lippmann, R. P. (1988). Neural network classifiers for speech recognition. The Lincoln Laboratory Journal, 1(1), 107-128.
Luenberger, D. (1984). Linear and nonlinear programming. Reading, MA: Addison-Wesley.
Mangasarian, O. L., Setiono, R., & Wolberg, W. H. (1990). Pattern recognition via linear programming: Theory and application to medical diagnosis. In T. F. Coleman & Y. Li (Eds.), Proceedings of the Workshop on Large-Scale Numerical Optimization, Cornell University, Ithaca, NY, October 19-20, 1989 (pp. 22-31). Philadelphia, PA: SIAM.
Monteiro, R. C., & Adler, I. (1989). Interior path following primal-dual algorithms. Part I: Linear programming. Mathematical Programming, 44, 27-41.
Moody, J., & Darken, C. (1988). Learning with localized receptive fields. In D. Touretzky, G. Hinton, & T. Sejnowski (Eds.), Proceedings of the 1988 Connectionist Models Summer School (pp. 133-143). San Mateo, CA: Morgan Kaufmann.
Moody, J., & Darken, C. (1989). Fast learning in networks of locally-tuned processing units. Neural Computation, 1(2), 281-294.
Mukhopadhyay, S., Roy, A., Kim, L. S., & Govil, S. (1993). A polynomial time algorithm for generating neural networks for pattern classification--its stability properties and some test results. Neural Computation, 5(2), 225-238.
Musavi, M. T., Ahmed, W., Chan, K. H., Faris, K. B., & Hummels, D. M. (1992). On the training of radial basis function classifiers. Neural Networks, 5(4), 595-603.
Musavi, M. T., Kalantri, K., Ahmed, W., & Chan, K. H. (1993). A minimum error neural network (MNN). Neural Networks, 6, 397-407.
Platt, J. (1991). A resource-allocating network for function interpolation. Neural Computation, 3(2), 213-225.
Poggio, T., & Girosi, F. (1990). Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247, 978-982.
Powell, M. J. D. (1987). Radial basis functions for multivariable interpolation: A review. In J. C. Mason & M. G. Cox (Eds.), Algorithms for approximation. Oxford: Clarendon Press.
Reilly, D., Cooper, L., & Elbaum, C. (1982). A neural model for category learning. Biological Cybernetics, 45, 35-41.
Renals, S., & Rohwer, R. (1989). Phoneme classification experiments using radial basis functions. Proceedings of the International Joint Conference on Neural Networks, I, 461-467.
Roy, A., & Mukhopadhyay, S. (1991). Pattern classification using linear programming. ORSA Journal on Computing, 3(1), 66-80.
Roy, A., Kim, L. S., & Mukhopadhyay, S. (1993). A polynomial time algorithm for the construction and training of a class of multilayer perceptrons. Neural Networks, 6(4), 535-545.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition, Vol. 1: Foundations (pp. 318-362). Cambridge, MA: MIT Press.
Specht, D. F. (1990). Probabilistic neural networks. Neural Networks, 3(1), 109-118.
Todd, M., & Ye, Y. (1990). A centered projective algorithm for linear programming. Mathematics of Operations Research, 15(3), 508-529.
Vrckovnik, G., Carter, C. R., & Haykin, S. (1990). Radial basis function classification of impulse radar waveforms. Proceedings of the International Joint Conference on Neural Networks, I, 45-50.
Widrow, B., & Hoff, M. (1960). Adaptive switching circuits. In 1960 IRE WESCON Convention Record (pp. 96-104). New York: IRE.