
Transcript of A polynomial time algorithm for the construction and training of a class of multilayer perceptrons

Neural Networks, Vol. 6, pp. 535-545, 1993 0893-6080/93 $6.00 + .00 Printed in the USA. All rights reserved. Copyright © 1993 Pergamon Press Ltd.

ORIGINAL CONTRIBUTION

A Polynomial Time Algorithm for the Construction and Training of a Class of Multilayer Perceptrons

ASIM ROY, LARK SANG KIM, AND SOMNATH MUKHOPADHYAY

Arizona State University

(Received 4 March 1991; revised and accepted 6 October 1992)

Abstract--This paper presents a polynomial time algorithm for the construction and training of a class of multilayer perceptrons for classification. It uses linear programming models to incrementally generate the hidden layer in a restricted higher-order perceptron. Polynomial time complexity of the method is proven. Computational results are provided for several well-known applications in the areas of speech recognition, medical diagnosis, and target detection. In all cases, very small nets were created that had error rates similar to those reported so far.

Keywords--Polynomial time algorithm, Multilayer perceptrons, Linear programming, Classification algorithm, Supervised learning, Clustering, Net design.

1. INTRODUCTION

This paper presents a polynomial time algorithm for the construction and training of a class of multilayer perceptrons for classification. This work is an extension of the earlier work of Roy and Mukhopadhyay (1991) on using linear programming (LP) models for pattern classification. This paper presents a number of modifications to that algorithm, proves polynomial time complexity of the method, and provides computational results on several well-known applications.

The following notation is used in the paper. An input pattern is represented by the N-dimensional vector x, x = (X_1, X_2, ..., X_N). The pattern space, which is the set of all possible values that x may assume, is represented by Ω_x. K denotes the total number of classes. The method is for supervised learning, where the training set x_1, x_2, ..., x_n is a set of sample patterns with known classifications.

The paper is organized as follows. Section 2 briefly discusses the basic linear programming formulation ideas. Some material from Roy and Mukhopadhyay (1991) is repeated here to make the paper self-contained. In Section 3, construction of multilayer perceptrons and other related issues are discussed. Section 4

Acknowledgement: This research was supported, in part, by the National Science Foundation grant IRI-9113370.

Requests for reprints should be sent to Asim Roy, Dept. of Decision & Information Systems, Arizona State University, Tempe, AZ 85287, USA.

presents the new outlier detection procedure. The two-phase algorithm is summarized in Section 5. The proof of polynomial time complexity of the method is also in that section. Section 6 has computational results on several well-known applications.


2. COVERING CLASS REGIONS BY THE LP METHOD--BASIC IDEAS

Linear programming models have been used in many classification algorithms. Significant contributions include those by Glover, Keene, and Duea (1988), Freed and Glover (1981, 1986), Mangasarian (1965), Mangasarian, Setiono, and Wolberg (1990), Smith (1968), and many others. Tou (1974) has a lucid presentation of pattern classification using linear programming models. Most of them derive hyperplanes to separate the classes.

One of the fundamental ideas in pattern classification is to draw proper boundaries to separate the class regions. This method uses the idea of "masking" or "covering" a class region. Any complex nonconvex region can be covered by a set of elementary convex forms of varying size, such as hyperspheres and hyperellipsoids in the N-dimensional case (see Figure 1). The idea of using elementary convex covers is not new in pattern classification or neural networks. The hypersphere classifier (Reilly, Cooper, & Elbaum, 1982) and associated classifiers use hyperspheres of varying size to cover a class region. A multilayer perceptron is based on the same concept, and the proof (by construction) of its ability to handle arbitrarily complex regions is based on partitioning the desired class region into small hypercubes or some other arbitrarily shaped convex regions.

FIGURE 1. Covering a complex nonconvex class region such as A by elementary convex covers.

As with the hypersphere and other classifiers, there may exist overlap among the elementary convex covers in order to provide complete and adequate coverage of a region. Nonconvex covers are also used here when there is no problem in doing so. How these covers are constructed and used is explained in the next section.

2.1. Classifying Input Patterns by Means of Covers (Masks)

Let p elementary covers or masks (henceforth generally referred to as masks) be required to cover a certain class P region. To classify an input pattern as being in class P, it is necessary to determine if it falls within the area covered by one of the p masks. If the pattern space is two-dimensional and one of the p masks is a circle centered at (a, b) with a radius r, a simple test (or "masking") function can be created to determine if an input pattern falls in the territory of the designated mask. Let

f(X_1, X_2) = r^2 − [(X_1 − a)^2 + (X_2 − b)^2]

be the masking function for this circular mask. If (X'_1, X'_2) is an input pattern, then:

if f(X'_1, X'_2) ≥ 0, then (X'_1, X'_2) is inside this mask and (X'_1, X'_2) belongs to class P;

if f(X'_1, X'_2) < 0, then (X'_1, X'_2) is not inside this mask and other masks will have to be tested before a conclusion can be reached as to the classification of (X'_1, X'_2).

A similar masking function for an elliptical mask will be f(X_1, X_2) = 1 − [(X_1 − a)^2/c^2 + (X_2 − b)^2/d^2] in the usual notation.

In constructing a masking function f(x), the procedure requires f(x) to be at least slightly positive (f(x) ≥ ε) for training samples covered by the mask and at least slightly negative (f(x) ≤ −ε) for those not covered. So, in general, let p_k be the number of masking functions required to cover class k, k = 1, ..., K. Let f_1^k(x), ..., f_{p_k}^k(x) denote these masking functions for class k. Then an input pattern x' will belong to class j if and only if one or more of its masks is at least slightly positive, i.e., equal to or above a small positive threshold value ε, and the masks for all other classes are at least slightly negative, i.e., below a threshold value −ε. Each mask will have its own threshold value as determined during its construction. Expressed in mathematical notation, an input pattern x' is in class j if and only if

f_i^j(x') ≥ ε_i^j for at least one mask i, i = 1, ..., p_j, and
f_i^k(x') ≤ −ε_i^k for all k ≠ j and i = 1, ..., p_k.     (1)

If all masks are at least slightly negative (i.e., below their individual −ε thresholds), the input pattern cannot be classified, unless one uses a nearest-mask notion. If masks from two or more classes are at least slightly positive, then the input pattern also cannot be classified, unless once again one uses an "interior-most" in a mask or some similar notion. In the second situation, however, an indication can be obtained about the possible contenders. Such cases would possibly arise when the training set or the nature of the class regions does not prevent the formation of overlapped masks belonging to different classes.
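For concreteness, the decision rule in eqn (1), together with the fall-through cases just described, can be sketched as follows. This is a minimal Python sketch under the assumption that the masks are available as ordinary functions; the names (classify, masks_by_class) are illustrative and not part of the paper's implementation.

    # Sketch of the classification rule in eqn (1). masks_by_class maps each
    # class label to a list of masking functions (callables); eps is the small
    # positive threshold used during mask construction.
    def classify(x, masks_by_class, eps=0.001):
        pos = [cls for cls, masks in masks_by_class.items()
               if any(f(x) >= eps for f in masks)]
        if not pos:
            return None              # all masks negative: unclassified
        if len(pos) > 1:
            return pos               # masks from several classes fire: the contenders
        j = pos[0]
        others_negative = all(f(x) <= -eps
                              for cls, masks in masks_by_class.items() if cls != j
                              for f in masks)
        return j if others_negative else None   # eqn (1) also needs all other masks negative

    # Using the masks of the circular example developed in Section 2.2 below:
    #   masks = {"A": [lambda x: 1 - x[0]**2 - x[1]**2 + 0.001],
    #            "B": [lambda x: x[0]**2 + x[1]**2 - 1 - 0.001]}
    #   classify((0.2, 0.3), masks)  ->  "A"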

2.2. Construction of Masking Functions

A variety of elementary convex masks (such as squares, rectangles, triangles, and so on for the two-dimensional case) can be used to cover a class region. A quadratic function,

f(x) = Σ_{i=1}^{N} a_i X_i + Σ_{i=1}^{N} b_i X_i^2 + Σ_{i=1}^{N} Σ_{j=i+1}^{N} c_{ij} X_i X_j + d     (2)

can generate hyperellipsoids, hyperspheres, etc., as masks in the N-dimensional case. It can also generate nonconvex shapes (masks), which is acceptable as long as they help to cover a class region properly. A quadratic is used here as the standard masking function. If N, the size of the pattern vector x, is large, the cross-product terms are usually dropped or only a few are used. The algorithm basically determines the coefficients a_i, b_i, c_{ij} (i = 1, ..., N; j = i + 1, ..., N) and d of these masks. In constructing masks, it is ensured that the generated mask covers only sample patterns of its designated class and not of others. Next is explained the construction of these masks; that is, how to determine the number of masks required and how to solve for the parameters (coefficients) of a masking function.
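As an illustration of how a quadratic mask of the form (2) is represented and evaluated, the following minimal Python sketch builds the restricted higher-order terms for a pattern and computes f(x) from given coefficients. The function names and the coefficient layout are illustrative assumptions, not the paper's implementation.

    # Sketch: the higher-order terms of eqn (2) for a pattern x, and the value
    # of a quadratic mask given its coefficients a (linear), b (square),
    # C (cross products over i < j, optional), and d (constant).
    import numpy as np

    def quad_terms(x, with_cross=True):
        x = np.asarray(x, dtype=float)
        terms = [x, x**2]
        if with_cross:
            i, j = np.triu_indices(len(x), k=1)
            terms.append(x[i] * x[j])            # X_i * X_j for j > i
        return np.concatenate(terms)

    def mask_value(x, a, b, C=None, d=0.0):
        with_cross = C is not None               # cross terms may be dropped for large N
        coeffs = [np.asarray(a, float), np.asarray(b, float)]
        if with_cross:
            coeffs.append(np.asarray(C, float))
        return float(np.concatenate(coeffs) @ quad_terms(x, with_cross) + d)

    # Example: the class A mask derived in Section 2.2 (eqn (6)) corresponds to
    # a = (0, 0), b = (-1, -1), no cross terms, d = 1 + eps.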

Consider a simple two-class problem shown in Figure 2 in which class A is bounded by a unit circle centered at the origin and class B is the rest of the two-dimensional space. A total of 18 sample patterns have been taken as the training set--8 patterns from class A and 10 patterns from class B--as shown in the figure.



FIGURE 2. Two classes, A and B, and the training samples.

A priori, it is not known how many elementary masks will suffice for any of the class regions. Thus, an attempt is first made to define a single elementary mask which will cover the whole class region. If this fails, the sample patterns in that class are generally split into two or more clusters (using a clustering procedure that can produce a prespecified number of clusters) and then attempts are made to define separate masks for each of these clusters. If that should fail, or if only some are masked, then the unmasked clusters are further split for separate masking until masks are provided for each ultimate cluster. The general idea is to define as large a mask as possible to include as many of the sample patterns within a class in a given mask as is feasibly possible, thereby minimizing the total number of masks required to cover a given region. When it is not feasible to cover with a certain number of masks, the region is subdivided into smaller pieces and masking is attempted for each piece. That is, the unmasked sample patterns are successively subdivided into smaller clusters for masking. At any stage of this iterative procedure, there will be a number of clusters to be masked. It might be feasible to mask some of them, thereby necessitating the breakup only of the remaining unmasked clusters. This "divide and conquer" procedure is a heuristic procedure. One can explore many variations of it, some of which are discussed later.

Going back to the example, one first tries to mask class A with a single masking function. Let a mask of the form

f_A(x) = a_1 X_1 + a_2 X_2 + b_1 X_1^2 + b_2 X_2^2 + c     (3)

be tried such that for input patterns in class A, f_A(x) is at least slightly positive. As noted before, the masks are constructed so that they are slightly positive for the boundary points in the mask. This ensures a finite separation between the classes and prevents the formation of common boundaries. To determine the parameters a_1, a_2, b_1, b_2, and c of the masking function in eqn (3), a linear program is set up which essentially states the following: "construct a masking function such that sample patterns from class A are at least slightly positive and those from class B are at least slightly negative." A linear programming model can be used because polynomials are linear functions of their parameters. The LP set up in this case is as follows:

Minimize ε
s.t.
f_A(x_i) ≥ ε     for all sample patterns x_i belonging to class A,
f_A(x_i) ≤ −ε    for all sample patterns x_i belonging to class B, and
ε ≥ a small positive constant.     (4)

Generally, a lower bound of 0.001 is used for ε. In terms of the masking function parameters, the LP will be as follows:

Minimize ε
s.t.
a_2 + b_2 + c ≥ ε     for pattern x_1
a_1 + b_1 + c ≥ ε     for pattern x_2
−a_2 + b_2 + c ≥ ε    for pattern x_3
−a_1 + b_1 + c ≥ ε    for pattern x_4
...
a_1 + a_2 + b_1 + b_2 + c ≤ −ε     for pattern x_15
a_1 − a_2 + b_1 + b_2 + c ≤ −ε     for pattern x_16
−a_1 − a_2 + b_1 + b_2 + c ≤ −ε    for pattern x_17
−a_1 + a_2 + b_1 + b_2 + c ≤ −ε    for pattern x_18
ε ≥ a small positive constant.     (5)

The solution to this LP is a_1 = 0, a_2 = 0, b_1 = −1, b_2 = −1, and c = 1 + ε, with ε at its lower bound. The single masking function for class A, therefore, is

f_A(x) = 1 − X_1^2 − X_2^2 + ε.     (6)

For any pattern in class A, f_A(x) ≥ ε, and for any pattern in class B, f_A(x) ≤ −ε. In this example, it is easy to determine the masking function for class B--set f_B(x) to −f_A(x). So,

f_B(x) = X_1^2 + X_2^2 − 1 − ε.     (7)

For any pattern in class B, f_B(x) ≥ ε, and for any pattern in class A, f_B(x) ≤ −ε. The masking function for class B can also be determined by setting up a linear program similar to eqns (4) and (5). Note that the single mask for class B is a nonconvex mask (it covers the whole space except the unit circle) and prevents any subdivision of class B for masking purposes.
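To make the LP formulations in eqns (4) and (5) concrete, the following is a minimal sketch using scipy's linprog (the paper itself used the simplex code in SAS; the polynomial time LP methods are cited only for the complexity argument). The training pattern coordinates below are illustrative stand-ins for those in Figure 2, so the solver may return a different, equally feasible mask than eqn (6).

    # Sketch of the class A masking LP (eqns (4)-(5)) using scipy's linprog.
    # The coordinates are illustrative stand-ins for the Figure 2 training set.
    import numpy as np
    from scipy.optimize import linprog

    class_A = [(0, 1), (1, 0), (0, -1), (-1, 0)]        # on the unit circle
    class_B = [(1, 1), (1, -1), (-1, -1), (-1, 1)]      # outside the circle

    def terms(p):
        # Terms multiplying a1, a2, b1, b2, c in eqn (3).
        return [p[0], p[1], p[0]**2, p[1]**2, 1.0]

    # Decision variables: [a1, a2, b1, b2, c, eps]; objective: minimize eps.
    obj = [0, 0, 0, 0, 0, 1]
    A_ub, b_ub = [], []
    for p in class_A:       # f_A(p) >= eps   rewritten as  -f_A(p) + eps <= 0
        A_ub.append([-t for t in terms(p)] + [1.0]); b_ub.append(0.0)
    for p in class_B:       # f_A(p) <= -eps  rewritten as   f_A(p) + eps <= 0
        A_ub.append(list(terms(p)) + [1.0]); b_ub.append(0.0)
    bounds = [(None, None)] * 5 + [(0.001, None)]       # eps >= 0.001

    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    a1, a2, b1, b2, c, eps = res.x
    # a1 = a2 = 0, b1 = b2 = -1, c = 1 + eps (eqn (6)) is one optimal solution;
    # the solver is free to return another mask satisfying the same constraints.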

FIGURE 3. A problem with disjoint class regions.

Consider another problem, shown in Figure 3, conceptually only. Here the class A territory consists of four disjoint regions (A1, A2, A3, and A4) and class B comprises the rest of the space. According to the masking procedure, the first effort should be to define a single mask for each of the two classes. To define a single mask for class A, an LP is set up as follows:

Minimize ε
s.t.
f_A(x_i) ≥ ε     for all sample patterns x_i belonging to class A,
f_A(x_i) ≤ −ε    for all sample patterns x_i belonging to class B, and
ε ≥ a small positive constant,     (8)

where f_A(x), suppose, is a quadratic. This LP has no feasible solution when B sample patterns occur between the A regions. This indicates that a single mask cannot be constructed for class A. So, in the next step, the class A sample patterns are divided, suppose, into two clusters for further masking attempts. As shown in Figure 3, let sample patterns from the regions A1 and A2 form one cluster (identified as A') and those from A3 and A4 form the second cluster (identified as A"). Now, an attempt is made to mask the A' and A" clusters separately. The LP for the A' cluster is:

Minimize ε
s.t.
f_A'(x_i) ≥ ε     for all sample patterns x_i belonging to cluster A',
f_A'(x_i) ≤ −ε    for all sample patterns x_i belonging to class B, and
ε ≥ a small positive constant.     (9)

Note that the A" sample patterns are ignored in the above LP model since it does not matter whether or not they are covered by this mask. Similarly, the LP for cluster A" is:

Minimize ε
s.t.
f_A"(x_i) ≥ ε     for all sample patterns x_i belonging to cluster A",
f_A"(x_i) ≤ −ε    for all sample patterns x_i belonging to class B, and
ε ≥ a small positive constant.     (10)

Here, the A' sample patterns are ignored for the same reason. Both of these LPs are infeasible when B patterns occur between the A regions, and further clustering must be done to break up clusters A' and A". Let the splitting of each of these clusters into two produce the four clusters which correspond to the four original disjoint regions A1, A2, A3, and A4. Four different LPs must then be set up for these four clusters, in a manner similar to eqns (9) and (10), to define their masks. The LPs are feasible now (since there are no intervening B patterns present anymore) and four masking functions are defined for class A which correspond to the four disjoint regions. This is the "divide and conquer" procedure required to define operable masks.

3. CONSTRUCTING A MULTILAYER PERCEPTRON FROM THE MASKS

The masking procedure actually generates a multilayer perceptron. Figure 4 shows how a multilayer perceptron is constructed from the masking functions when quadratics are used as masks. Suppose class A has k masking functions and class B has p. Each masking function is evaluated in parallel at nodes A1 through Ak and B1 through Bp. The output of a node is 1 if the mask is at least slightly positive (≥ ε) and zero otherwise. A hard limiting nonlinearity (linear threshold unit) is used at these nodes. Class A hidden nodes A1 through Ak are connected to the final output node A for the class, and likewise for class B. The output of node A is 1 if at least one of the inputs is 1 and zero otherwise, and likewise for node B. Again, hard limiting nonlinearities are used at these output nodes. An input pattern is in class A if the output of node A is 1 and that of node B is zero, and vice versa. The masking function coefficients correspond to the connection weights and are placed on the connections between the input nodes and hidden layer nodes. The higher-order product and power terms have been shown as direct inputs to the network. One more layer is actually needed at the input end to compute these higher-order terms.

FIGURE 4. Masking functions generate a multilayer perceptron.

As can be seen, the masking procedure constructs a restricted higher-order network (Giles & Maxwell, 1987; Nilsson, 1965). This restricted net is allowed to grow laterally in the hidden layer as more masking functions are added for the required coverage. Adding a mask is equivalent to adding a node in the hidden layer. This incremental growth of the net in the hidden layer is, in spirit, similar to many other neural network algorithms, such as the Adaptive Resonance Theory (ART) (Carpenter & Grossberg, 1987), the reduced Coulomb energy (RCE) classifier (Reilly et al., 1982), and the group method of data handling (GMDH) method (Farlow, 1984), that add nodes when necessary. The masking procedure expands the net only when it is necessary.

The net generated has few interconnects from the hidden layer to the output layer since each hidden node connects to one output node only. Weights are determined for the input layer to hidden layer connections only (see Figure 4). As such, the net can be visualized as a single-layer perceptron that is allowed to grow laterally.
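A minimal sketch of the resulting net's forward pass is given below, assuming that each mask is stored as a coefficient vector over the quadratic terms without cross products (linear terms, square terms, and a constant); the names and data layout are illustrative, not the paper's code.

    # Sketch of the constructed net's forward pass: each hidden node hard-limits
    # one masking function, and each class output node ORs its hidden nodes.
    import numpy as np

    def net_output(x, masks_by_class, eps=0.001):
        """masks_by_class: dict mapping class -> list of coefficient vectors of
        length 2N + 1 over the terms [X_1..X_N, X_1^2..X_N^2, 1]."""
        x = np.asarray(x, dtype=float)
        phi = np.concatenate([x, x**2, [1.0]])      # higher-order input terms
        outputs = {}
        for cls, masks in masks_by_class.items():
            hidden = [1 if float(np.dot(coef, phi)) >= eps else 0 for coef in masks]
            outputs[cls] = int(any(hidden))         # output node: 1 if any hidden node fires
        return outputs                              # e.g., {"A": 1, "B": 0} -> class A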

3.1. Learning by Comparison

Like GMDH, decision tree (Quinlan, 1986), and other such methods, this procedure requires simultaneous access to all training examples, but uses them differently. One can think of this procedure as learning by comparison, a technique used very effectively by humans. Humans learn rapidly when allowed to compare the objects/concepts to be learnt. If one, for example, is expected to learn one thousand Chinese characters, it would help to see them next to each other to observe the differences in shapes. If one is denied this opportunity, the task becomes extremely difficult and learning is obviously slowed. By defining a constraint for each sample pattern, this procedure facilitates an explicit view (comparison) of the class territories. This allows rapid learning since the class boundaries can be developed quickly with this positional view of the classes.

4. OUTLIERS IN CLASSIFICATION PROBLEMS

Many classification problems, by their very nature, generate pattern vectors that fall outside their core class regions. Let these pattern vectors be called "outliers". A basic concept underlying many classification systems is to extract the core regions of a class from the information provided (the training set) and ignore its outliers. The idea used here to extract the core class regions is similar to the ones used in several other algorithms. Conceptually, the procedure divides the pattern space covered by the training examples into small neighborhoods (by means of a clustering procedure) and assigns each neighborhood or cluster to the class with the majority of its members. The remaining minority sample patterns in these neighborhoods (clusters) are treated as outliers and discarded. This is very similar to what is done in the k-nearest neighbor (KNN) (Duda & Hart, 1973) and other such algorithms. The KNN algorithm assigns a neighborhood to the class with the most members. This amounts to discarding (ignoring) the other minority members, or, in other words, treating them as outliers. The feature map (Kohonen, 1984; Huang & Lippmann, 1988) and learning vector quantizer (LVQ) (Kohonen, 1988) algorithms are similar in spirit. Humans, for example, don't attempt to learn the infrequently encountered foreign pronunciations of words--they are treated as anomalies and forgotten. So, filtering of the noise (outliers), as is done here, does replicate human behavior.

This territory assignment idea is implemented in a three-step procedure in this method. In the first step, territories are allocated on the basis of a significant majority (e.g., more than a two-thirds majority in the cluster). More mixed clusters (neighborhoods) are dealt with in the next two steps by defining smaller clusters and using a simpler majority rule (e.g., more than half). One can view this procedure as doing the easy, noncontroversial allocations first and taking on harder cases later. The last two steps examine gradually smaller neighborhoods (clusters) for assignment decisions. The procedure also performs a two-point sensitivity analysis in the first step. The sensitivity analysis verifies the consistency of the assignments under two different neighborhood sizes (i.e., two different average cluster sizes). All inconsistent assignments are carried forward to the next step. For example, when a sample pattern is found to be an outlier under one average cluster size but not the other, it is carried forward as "unassigned" to the next step for a closer examination. The procedure is outlined in more detail next.

Let m_1, m_2, and m_3 (where m_1 ≥ m_2 ≥ m_3) be the average cluster sizes used in the three steps. For instance, one can set m_1 = 7, m_2 = 5, and m_3 = 3. The number of sample patterns remaining unassigned in any step (n_i) is divided by the corresponding average cluster size (m_i) to determine the number of clusters to form (P_i) in that step, using a clustering algorithm that can produce a prespecified number of clusters, such as k-means and hierarchical clustering. So P_i = n_i/m_i, and n_1 = n, n being the total number of training samples. In the first step, the training set is broken up into (1 + c)P_1 and (1 − c)P_1 clusters for sensitivity analysis, where c is a factor usually between 0.2 and 0.3. In this step, sample patterns in a cluster are labeled as "outliers" if their class has less than one-third of the members, and they are labeled as "core patterns" if their class has at least two-thirds of the members. Otherwise, they are simply labeled "unassigned" for further processing in the next step. Consistency of these categorizations is checked across the two different breakup schemes ((1 + c)P_1 and (1 − c)P_1 clusters). If a pattern's category is not consistent, it is relabeled as "unassigned". Only "unassigned" patterns are carried forward to the next step--they comprise the n_2 patterns for step 2. The "core patterns" are kept aside to be masked and the "outliers" discarded.

In steps 2 and 3, the essential scheme of breaking up the remaining unassigned sample patterns into small clusters and labeling them using a majority rule remains the same. In these steps, an attempt is made to quickly resolve the remaining cases by relaxing the majority rule. In step 2, sample patterns are labeled as "outliers" if their class has less than 50% of the members in a cluster, as "core patterns" if it has over 50%, and as "unassigned" if it has exactly 50%. The step 2 unassigned patterns are carried over to step 3. In step 3, if a class has less than 50% of the members in a cluster, its patterns are labeled "outliers". The remaining patterns are retained for masking and nothing is left unassigned.

5. AN ALGORITHM FOR THE CONSTRUCTION AND TRAINING OF A CLASS OF MULTILAYER PERCEPTRONS

Some notation, introduced in the last section, is defined more formally here. n_i is the number of sample patterns remaining unassigned at the start of the i-th step of the outlier detection phase (phase I). m_i is the average breakup cluster size and P_i, where P_i = n_i/m_i, is the number of clusters to form in the same i-th step of phase I. c is a factor used in the sensitivity analysis in the first step of phase I; the training set is broken up into (1 + c)P_1 and (1 − c)P_1 clusters for this sensitivity analysis. q is the minimum cluster size for masking.

5.1. Phase I--Weed Out Outliers

Step 1

1. Break up the training set into (1 + c)P_1 and (1 − c)P_1 small clusters using a clustering procedure that can produce a prespecified number of clusters (e.g., hierarchical clustering).

2. For each set of clusters, label a pattern as an "outlier" if its class has less than one-third of the members in its cluster; label it as a "core pattern" if its class has at least two-thirds of the members, and as "unassigned" otherwise.

3. Check the consistency of these labels across the two sets of clusters. Any inconsistently labeled pattern is relabeled as "unassigned".

4. "Unassigned" patterns are carried over to step 2, the "outlier" patterns are discarded and the "core patterns" retained for masking.


Step 2

1. Break up the n_2 remaining "unassigned" patterns into P_2 (= n_2/m_2) small clusters using a clustering procedure that can produce a prespecified number of clusters.

2. Discard a pattern as "outlier" if its class has less than 50% of the members in its cluster, save it as "core pattern" if its class has more than 50% of the members and carry it over to step 3 as "unassigned" otherwise.

Step 3

1. Break up the n_3 remaining "unassigned" patterns into P_3 (= n_3/m_3) small clusters using a clustering procedure that can produce a prespecified number of clusters.

2. Discard a pattern as "outlier" if its class has less than 50% of the members in its cluster and save it as "core pattern" otherwise.
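A minimal sketch of phase I is given below. It assumes numpy arrays for the data, scikit-learn's KMeans in place of the hierarchical average linkage clustering used in the paper, and the parameter values m = (7, 5, 3) and c = 0.25 as examples; the helper names are illustrative.

    # Sketch of phase I (steps 1-3). X is an (n, N) numpy array of training
    # patterns and y an array of class labels. KMeans stands in for the
    # hierarchical average linkage clustering used in the paper.
    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_class_fraction(X, y, n_clusters, seed=0):
        """For each pattern, the fraction of its cluster sharing its class."""
        lab = KMeans(n_clusters=max(1, int(n_clusters)), n_init=10,
                     random_state=seed).fit_predict(X)
        frac = np.empty(len(y))
        for c in np.unique(lab):
            idx = np.where(lab == c)[0]
            for i in idx:
                frac[i] = np.mean(y[idx] == y[i])
        return frac

    def phase1(X, y, m=(7, 5, 3), c=0.25):
        n = len(y)
        P1 = n // m[0]
        # Step 1: two breakups for the sensitivity analysis; 2/3 and 1/3 rules.
        f_big = cluster_class_fraction(X, y, (1 + c) * P1, seed=0)
        f_small = cluster_class_fraction(X, y, (1 - c) * P1, seed=1)
        def rule1(f):
            return np.where(f >= 2/3, "core",
                            np.where(f < 1/3, "outlier", "unassigned"))
        la, lb = rule1(f_big), rule1(f_small)
        labels = np.where(la == lb, la, "unassigned")   # inconsistent -> unassigned
        # Step 2: simple majority (>50% core, <50% outlier, exactly 50% unassigned).
        u = np.where(labels == "unassigned")[0]
        if len(u):
            f = cluster_class_fraction(X[u], y[u], len(u) // m[1])
            labels[u] = np.where(f > 0.5, "core",
                                 np.where(f < 0.5, "outlier", "unassigned"))
        # Step 3: <50% outlier; everything else is kept, nothing stays unassigned.
        u = np.where(labels == "unassigned")[0]
        if len(u):
            f = cluster_class_fraction(X[u], y[u], len(u) // m[2])
            labels[u] = np.where(f < 0.5, "outlier", "core")
        keep = labels == "core"
        return X[keep], y[keep]        # outliers are simply discarded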

5.2. Phase II--Construct Masking Functions

1. Initialize class index i (i = 0).

2. Let i = i + 1. If i > K, where K is the total number of classes, stop. Otherwise, set j = 1 and KL_j = 1, where KL_j is the number of unmasked class i clusters at the j-th stage of breaking up unmasked patterns. Let these unmasked clusters be indexed as C_{j,1}, ..., C_{j,KL_j}.

3. Using a linear masking function f(x), set up an LP as follows for each unmasked cluster C_{j,k}, k = 1, ..., KL_j:

   Minimize ε
   s.t.
   f^i_{C_{j,k}}(x_p) ≥ ε     for all pattern vectors x_p in cluster C_{j,k} of class i,
   f^i_{C_{j,k}}(x_p) ≤ −ε    for all pattern vectors x_p in classes other than class i,
   ε ≥ a small positive constant,     (11)

   where f^i_{C_{j,k}}(x) is the masking function for cluster C_{j,k} of class i. Solve the LP for each unmasked cluster. If all LP solutions are feasible and optimal, the masking of class i is complete; go to step 2 to mask the next class. Otherwise, when some or all LPs are infeasible, save all feasible masking functions obtained and go to step 4.

4. Let KL'_j be the number of clusters with infeasible LP solutions at the j-th stage. Subdivide (break up) the sample patterns in these infeasible clusters into KL_{j+1} (where KL_{j+1} > KL'_j) small clusters using a clustering procedure that can produce a prespecified number of clusters. Discard as "outliers" all sample patterns that are from clusters of size q or less. Set j = j + 1. KL_j is the number of unmasked class i clusters at this new stage. These unmasked clusters are indexed as before as C_{j,1}, C_{j,2}, ..., C_{j,KL_j}. Go back to step 3.
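The following is a minimal sketch of phase II for a single class, assuming a quadratic mask without cross terms, scipy's linprog as the LP solver, scikit-learn's KMeans for the breakup step, and the slowest reclustering strategy (one extra cluster per stage); the function names are illustrative, not the paper's code.

    # Sketch of phase II for one class: try to mask each unmasked cluster with
    # the LP of eqn (11); pool the clusters whose LP is infeasible, recluster
    # them into one extra cluster, and retry. X_i holds the class i core
    # patterns, X_other the core patterns of all other classes.
    import numpy as np
    from scipy.optimize import linprog
    from sklearn.cluster import KMeans

    def quad_features(X):
        """Linear and square terms plus a constant: [X_1..X_N, X_1^2..X_N^2, 1]."""
        return np.hstack([X, X**2, np.ones((len(X), 1))])

    def solve_mask_lp(X_in, X_out, eps_min=0.001):
        """LP (11): f(x) >= eps inside the cluster, f(x) <= -eps outside.
        Returns the mask's coefficient vector, or None if the LP is infeasible."""
        Fin, Fout = quad_features(X_in), quad_features(X_out)
        n_coef = Fin.shape[1]
        c = np.zeros(n_coef + 1); c[-1] = 1.0          # variables: [coefficients, eps]
        A_ub = np.vstack([np.hstack([-Fin, np.ones((len(Fin), 1))]),    # -f(x) + eps <= 0
                          np.hstack([Fout, np.ones((len(Fout), 1))])])  #  f(x) + eps <= 0
        b_ub = np.zeros(len(A_ub))
        bounds = [(None, None)] * n_coef + [(eps_min, None)]
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
        return res.x[:-1] if res.success else None

    def mask_class(X_i, X_other, q=2):
        """Return the list of masks (coefficient vectors) covering class i."""
        clusters, masks = [X_i], []
        while clusters:
            unmasked = []
            for C in clusters:
                mask = solve_mask_lp(C, X_other)
                if mask is not None:
                    masks.append(mask)               # feasible: keep this mask
                else:
                    unmasked.append(C)               # infeasible: split further
            if not unmasked:
                break
            pts = np.vstack(unmasked)
            k = min(len(unmasked) + 1, len(pts))     # slowest strategy: one extra cluster
            lab = KMeans(n_clusters=k, n_init=10).fit_predict(pts)
            clusters = [pts[lab == j] for j in range(k)]
            clusters = [C for C in clusters if len(C) > q]   # size <= q: discard as outliers
        return masks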

5.3. Comments on the Algorithm

1. If noisy patterns are absent in a problem, phase I should be skipped.

2. In step 3 of phase II, one can use any suitable linear masking function. Generally, a single higher-order polynomial mask can cover very complex regions and thus prevent the further splitting of a class region that occurs with the use of simple quadratic masks. However, higher-order polynomials can cause numerical difficulties for the LP solver. One can always experiment with different masking functions.

3. Many different reclustering strategies can be used in step 4 of phase II. Essentially, one can break up quickly or slowly, depending on the size of the cluster at hand. A large cluster can be broken up faster (e.g., into four to six clusters) to avoid unnecessary intermediate masking attempts. If one desires a minimum number of hidden nodes, then a minimal breakup strategy of splitting into one extra cluster each time should be used.

4. Step 4 of phase II detects and deletes outliers that remain after phase I; q is usually 2 to 3 in this step. Some fractionation of masks can result from these remaining outliers. One can rerun phase II, once these outliers are discarded, to get better masks and fewer of them. This second pass through phase II can be seen as a clean-up pass that compacts the hidden layer to obtain better generalization.

5. In step 4 of phase II, one can remove patterns from infeasible clusters if they are already covered by an existing mask of that class. This will reduce the set of unmasked patterns.

6. If a class has a large pattern set, one can break it up into some number of clusters before attempting any masking (step 2 of phase II). So KL_1 need not be 1.

7. In the classification phase, mask boundaries can be slightly relaxed to compensate for numerical inaccuracies in the evaluation of masking functions. In the computational studies reported here, a 2ε relaxation was used. That is, a pattern was considered to be in a mask f(x) if f(x) ≥ −ε instead of f(x) ≥ ε.


5.4. Polynomial Time Convergence

In this procedure, the basic purpose of clustering is to dissect the data (and not to uncover "real" clusters). A variety of methods is available for clustering into a prespecified number of clusters (Everitt, 1980; Hartigan, 1975) and many of them have polynomial time complexity. For instance, the time requirement for the ultrametric methods of hierarchical clustering is roughly proportional to n^2, where n is the number of observations, and for the density methods it is somewhere between n ln(n) and n^2 (Everitt, 1980; Hartigan, 1975). The time also depends on the number of input variables (size of the pattern vector) and is roughly proportional to it.

In phase I, clustering into a prespecified number of clusters is done only four times. It remains to be shown whether the iterative procedure of phase II is also of polynomial time complexity.

PROPOSITION. In phase II, the classes can be masked in polynomial time.

Proof. Let M_i, i = 1, ..., K, be the number of class i pattern vectors to be masked in phase II. From a computational point of view, let the worst case scenario be where a separate mask is required for each sample pattern. Phase II then would have to proceed through many breakups before final masking. Thus, in the worst case, all of the linear programs (11) for a class will be infeasible until the class is broken up into single point clusters (M_i clusters). (For this to happen, outlier deletion in step 4 is suppressed by setting q < 1.) All single point clusters will produce feasible LP solutions and masking of the class will be complete. In this worst case scenario, further assume that the slowest breakup strategy is used in step 4, in which the M_i pattern vectors are broken up into one extra cluster at each stage. Thus, phase II proceeds as follows: initially a class i has just one cluster and masking is attempted on it and it fails; the M_i patterns are then broken into two clusters and masking is attempted on these and it too fails; they are then broken into three clusters and so on. For each class i, this requires solving M_i(M_i + 1)/2 linear programs (feasible and infeasible combined) and clustering of M_i patterns M_i times. For all classes combined, a total of Σ_{i=1}^{K} M_i(M_i + 1)/2 linear programs are solved and clustering is performed Σ_{i=1}^{K} M_i = n times. Since each linear program can be solved in polynomial time (Karmarkar, 1984; Khachian, 1979) and each clustering operation to obtain a prespecified number of clusters can also be performed in polynomial time, it follows that phase II masking can be completed in polynomial time even in the worst case. ∎
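As a rough numerical illustration of this worst-case count, with hypothetical class sizes that are not taken from the paper's experiments:

    \[
    K = 2,\quad M_1 = M_2 = 100 \;\Longrightarrow\;
    \sum_{i=1}^{K} \frac{M_i (M_i + 1)}{2} = 2 \cdot \frac{100 \cdot 101}{2} = 10{,}100
    \ \text{linear programs and}\ \sum_{i=1}^{K} M_i = 200 \ \text{clustering runs.}
    \]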

Khachian's method solves a linear program in O(L^2 n^4) arithmetic operations, where L is the binary encoding length of the input data and n is the dimension of the problem (for the masking LPs in eqn (11), n corresponds to the number of mask parameters to solve for). Karmarkar's method solves a linear program in O(L^2 n^3.5) arithmetic operations. Computational results have shown that Karmarkar's method behaves remarkably well in practice and that the number of iterations is insensitive to the size of the LP problem (it increases at most by the factor ln n). In the computational studies reported here, none of the polynomial time algorithms were actually used.

6. COMPUTATIONAL RESULTS

All computational results have been obtained by implementing this algorithm on SAS® (SAS/OR User's Guide, 1990). The problems were solved on an IBM 3090 operating under the multiple virtual storage (MVS) operating system. The linear programming algorithm implemented in SAS LP is the simplex method (Dantzig, 1963), which is not a polynomial time method. SAS LP also is not one of the fastest implementations of the simplex method, and the reported LP solution times can be improved substantially with a faster LP code. For clustering, the average linkage method of hierarchical clustering (Sokal & Michener, 1958) was used.

6.1. Vowel Classification

The vowel classification problem is described in Lippmann (1988) and has been used to compare different classifiers. It is based on the vowel formant data of Peterson and Barney (1952). The data was generated from the spectrographic analysis of vowels in words formed by "h", followed by a vowel, followed by a "d", and consists of two-dimensional patterns. The words were spoken by 67 persons, including men, women, and children. The data on 10 vowels was split into two sets--338 examples each for the training and test sets. Four classifiers--KNN, Gaussian, two-layer perceptron, and feature map--were compared on this dataset (Lippmann, 1988). All classifiers had similar error rates ranging from 18 to 22.8%. The feature map classifier combined supervised and unsupervised training with 100 nodes in the hidden layer. The first level weights were trained using Kohonen's feature map learning. The top level was trained using the back propagation or maximum likelihood method. The two-layer perceptron used 50 hidden nodes.

Table 1 shows the results of using different phase I cluster sizes in this procedure. The overall error rate is quite stable across the variety of phase I cluster sizes and is in the 19.2 to 22.8% range, with one being 24.6%. This fact is often evident in other problems too, as will be shown, where the same cluster size combinations are used. The total number of outliers found (phases I and II combined) is also very consistent across the cluster size combinations. Because of outliers found in phase II, clean-up phase II runs were made. T1 is the total LP solution time for the first phase II pass and T2 for the second. Both passes combined, the total LP solution time ranged from 177 to 394 seconds. Compared to other classifiers reported, about the same error rate is generally achieved with only 12 hidden nodes (masking functions). So, a much smaller net is created by this method. Figure 5 is a plot of the masking functions for one of the phase I cluster size combinations (M = 7, 5, 3).

TABLE 1
Results of Using Different Cluster Sizes in Phase I for the Vowel Classification Problem

Cluster Sizes     Outliers Found   Outliers Found   Phase I + II            LP Time   LP Time   # of Masking   Error
M = m1, m2, m3    in Phase I       in Phase II      Clustering Time (s)     T1 (s)    T2 (s)    Functions      (%)
M = 6, 4, 3       57               17               7.1                     261       54        14             24.6
M = 6, 4, 4       52               27               7.2                     285       49        12             22.5
M = 6, 5, 3       61               13               7.2                     189       53        12             22.8
M = 6, 5, 4       58               20               7.1                     219       49        12             20.4
M = 7, 4, 3       58               17               7.3                     242       49        12             21
M = 7, 4, 4       53               24               7.4                     317       77        15             22.5
M = 7, 5, 3       61               14               7.4                     192       52        12             21.3
M = 7, 5, 4       58               21               7.3                     231       48        12             19.8
M = 7, 6, 3       64               11               7                       125       52        12             21.3
M = 7, 6, 4       63               11               7.1                     128       53        12             21.3
M = 8, 4, 3       58               16               7.1                     209       50        12             22.5
M = 8, 4, 4       53               26               7.2                     260       46        12             21.9
M = 8, 5, 3       61               13               7.1                     180       51        12             19.2
M = 8, 5, 4       58               20               7.1                     201       45        12             20.7

6.2. Classifying Sonar Targets

The sonar target classification problem is described in Gorman and Sejnowski (1988). The task is to discriminate between sonar signals bounced off a metal cylinder and those bounced off a roughly cylindrical rock. The pattern vectors were obtained by bouncing sonar signals off the two cylinder types at various angles and under various conditions. Each pattern is a set of 60 numbers in the range 0.0 to 1.0. In their aspect-angle dependent series of experiments, they tried to balance the training and test sets so that each would have a representative number of samples from all aspect angles. The training and test sets each had 104 members. They experimented with a no hidden layer perceptron and single hidden layer perceptrons with 2, 3, 6, 12, and 24 hidden units. Each network was trained by the back propagation algorithm over 300 epochs. In the angle-dependent experiments, the error rate decreased from 14.3% for the two hidden unit net to 9.6% for the 12 hidden unit one. They further report that a KNN classifier had an error rate of 17.3% on the same data set.

Our experiments showed that the training set is linearly separable. So, there was no reason to use phase I. Three different masking functions were tried--a linear one, one with linear and square terms, and a third one with linear, square, and six cross terms. The error rate with the linear mask was 9.62%, with the square one 12.5%, and with the cross term one 10.6%. The error rates are comparable to the ones reported, and all were achieved with just one hidden node. The total LP solution times were 9, 15, and 16 seconds for the linear, square, and cross term masks, respectively.

6.3. Medical Diagnosis (Breast Cancer Detection)

The breast cancer diagnosis problem is described in Mangasarian et al. (1990). The data is from the University of Wisconsin Hospitals and has 369 pattern vectors. Each pattern vector has nine measurements made on a fine needle aspirate (fna) taken from a patient's breast. They are clump thickness, size uniformity, shape uniformity, marginal adhesion, cell size, bare nuclei, bland chromatin, normal nucleoli, and mitosis. Each measurement is assigned an integer value between 1 and 10, with larger numbers indicating a greater likelihood of malignancy. Of the 369 pattern vectors, 201 were from patients with no breast malignancy, while 168 were from ones with confirmed malignancy. Mangasarian et al. (1990), using their linear programming method, successfully classified all 45 patterns in their test set. Their LP method constructs a sequence of paired hyperplanes to separate patterns in the training set, and none are considered to be "noisy" patterns. The test set considered here (a total of 118 patterns) is larger than the one used in Mangasarian et al. (1990).

Using this procedure, the total number of outliers detected was again found to be very consistent across the different cluster size combinations. A single mask with linear and square terms was used to mask the benign class, and the total LP solution time in phase II was about 27 seconds. The total clustering time in phase I was a little more than 9 seconds. The error rate was 1.7% (two errors) and was achieved with only one hidden node. Mangasarian et al. (1990) used four pairs of hyperplanes.

FIGURE 5. Plot of masking functions for all classes over the training set (M = 7, 5, 3). Numbers correspond to classes.

7. CONCLUSION

An alternative method for training a class of multilayer perceptrons for classification problems has been presented. Research is continuing to explore its behavior and performance on a variety of other problems. Modifications are being made to the algorithm to solve large classification problems efficiently. These will be reported in the future.

Note that the masking functions can also be generated in parallel for each class, and within a class, by defining and solving the different LPs on parallel machines, such as a number of workstations or PCs on a network sharing common data files. This parallelization will drastically reduce the total time taken to construct and train an appropriate perceptron.

REFERENCES

Carpenter, G. A., & Grossberg, S. (1987). ART 2: Self-organization of stable category recognition codes for analog input patterns. Applied Optics, 26(4), 919-930.

Dantzig, G. B. (1963). Linear programming and extensions. Princeton, NJ: Princeton University Press.

Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: John Wiley & Sons.

Everitt, B. S. (1980). Cluster analysis (2nd ed.). London: Heinemann Educational Books Ltd.

Farlow, S. (1984). Self-organizing methods in modeling. New York: Marcel-Dekker.

Freed, N., & Glover, F. (1981). A linear programming approach to the discriminant problem. Decision Sciences, 12, 68-74.

Freed, N., & Glover, F. (1986). Evaluating alternative linear programming models to solve the two-group discriminant problem. Decision Sciences, 17, 151-162.

Giles, C. L., & Maxwell, T. (1987). Learning, invariance, and generalization in high-order networks. Applied Optics, 26(4), 972-978.

Glover, F., Keene, S., & Duea, B. (1988). A new class of models for the discriminant problem. Decision Sciences, 19, 269-280.

Gorman, R. P., & Sejnowski, T. J. (1988). Analysis of hidden units in a layered network trained to classify sonar targets. Neural Networks, 1, 75-89.

Hartigan, J. A. (1975). Clustering algorithms. New York: John Wiley & Sons.

Huang, W. Y., & Lippmann, R. P. (1988). Neural net and traditional classifiers. In D. Anderson (Ed.), Neural information processing systems (pp. 387-396). New York: American Institute of Physics.

Karmarkar, N. (1984). A new polynomial time algorithm for linear programming. Combinatorica, 4, 373-395.

Khachian, L. G. (1979). A polynomial algorithm in linear programming. Soviet Mathematics Doklady, 20, 191-194.

Kohonen, T. (1984). Self-organization and associative memory. Berlin: Springer-Verlag.

Kohonen, T. (1988). An introduction to neural computing. Neural Networks, 1, 3-16.

Lippmann, R. P. (1988). Neural network classifiers for speech recognition. The Lincoln Laboratory Journal, 1(1), 107-128.

Mangasarian, O. L. (1965). Linear and nonlinear separation of patterns by linear programming. Operations Research, 13, 444-452.

Mangasarian, O. L., Setiono, R., & Wolberg, W. H. (1990). Pattern recognition via linear programming: Theory and application to medical diagnosis. In T. F. Coleman & Y. Li (Eds.), Large scale numerical optimization (pp. 22-30). Philadelphia: SIAM.

Nilsson, N. J. (1965). Learning machines. New York: McGraw Hill.

Peterson, G. E., & Barney, H. L. (1952). Control methods used in a study of vowels. Journal of the Acoustical Society of America, 24, 175.

Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81-106.

Reilly, D. L., Cooper, L. N., & Elbaum, C. (1982). A neural model for category learning. Biological Cybernetics, 45, 35-41.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition. Vol. 1: Foundations (pp. 318-362). Cambridge, MA: MIT Press.

Roy, A., & Mukhopadhyay, S. (1991). Pattern classification using linear programming. ORSA Journal on Computing, 3(1), 66-80.

SAS Institute Inc. (1990). SAS/OR User's Guide, Version 6. Cary, NC: SAS Institute Inc.

Smith, F. W. (1968). Pattern classifier design by linear programming. IEEE Transactions on Computers, C-17, 367-372.

Sokal, R. R., & Michener, C. D. (1958). A statistical method for evaluating systematic relationships. University of Kansas Science Bulletin, 38, 1409-1438.

Tou, J. T. (1974). Pattern recognition principles. Reading, MA: Addison-Wesley Publishing Co.