
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 4, NO. 4, JULY 1993

Neural-Gas Network for Vector Quantization and its Application to Time-Series Prediction

Thomas M. Martinetz, Member, IEEE, Stanislav G. Berkovich, and Klaus J. Schulten

Abstract - As a data compression technique, vector quantization requires the minimization of a cost function - the distortion error - which, in general, has many local minima. In this paper, a neural network algorithm based on a "soft-max" adaptation rule is presented that exhibits good performance in reaching the optimum, or at least coming close. The soft-max rule employed is an extension of the standard K-means clustering procedure and takes into account a neighborhood ranking of the reference (weight) vectors. It is shown that the dynamics of the reference (weight) vectors during the input-driven adaptation procedure 1) is determined by the gradient of an energy function whose shape can be modulated through a neighborhood determining parameter, and 2) resembles the dynamics of Brownian particles moving in a potential determined by the data point density. The network is employed to represent the attractor of the Mackey-Glass equation and to predict the Mackey-Glass time series, with additional local linear mappings for generating output values. The results obtained for the time-series prediction compare very favorably with the results achieved by back-propagation and radial basis function networks.

    I. INTRODUCTION

Artificial as well as biological information processing systems that are involved in storing or transferring large amounts of data often require the application of coding techniques for data compression. In a wide range of applications, including speech and image processing, the data compression procedure is based on vector quantization techniques (for a review see [1]).

Vector quantization techniques encode a data manifold, e.g., a submanifold V ⊆ R^D, utilizing only a finite set w = (w_1, ..., w_N) of reference or codebook vectors (also called cluster centers) w_i ∈ R^D, i = 1, ..., N. A data vector v ∈ V is described by the best-matching or "winning" reference vector w_{i(v)} of w for which the distortion error d(v, w_{i(v)}), e.g., the squared error ||v - w_{i(v)}||^2, is minimal. This procedure divides the manifold V into a number of subregions

$$V_i = \bigl\{\, v \in V \;:\; \|v - w_i\| \le \|v - w_j\| \;\; \forall j \,\bigr\} \qquad (1)$$

called Voronoi polygons or Voronoi polyhedra, out of which each data vector v is described by the corresponding reference vector w_i. If the probability distribution of data vectors over the manifold V is described by P(v), then the average

Manuscript received November 11, 1991; revised September 13, 1992. This work was supported by the National Science Foundation under Grant DIR-90-15561 and by a Fellowship of the Volkswagen Foundation to T. M. Martinetz.

The authors are with the Beckman Institute and the Department of Physics, University of Illinois at Urbana-Champaign, Urbana, IL 61801.

IEEE Log Number 9204650.

distortion or reconstruction error is determined by

$$E = \int d^D v\; P(v)\,\bigl(v - w_{i(v)}\bigr)^2 \qquad (2)$$

and has to be minimized through an optimal choice of reference vectors w_i.

The straightforward approach to minimizing (2) would be a gradient descent on E, which leads to Lloyd's and MacQueen's well-known K-means clustering algorithm [2], [3]. In its on-line version, which is applied if the data point distribution P(v) is not given a priori, but instead a stochastic sequence of incoming sample data points v(t = 1), v(t = 2), v(t = 3), ..., governed by P(v), drives the adaptation procedure, the adjustment steps for the reference vectors or cluster centers w_i are determined by

$$\Delta w_i = \varepsilon\,\delta_{i\,i(v)}\,\bigl(v(t) - w_i\bigr), \qquad i = 1, \ldots, N \qquad (3)$$

with ε as the step size and δ_{ij} as the Kronecker delta. However, adaptation step (3), as a stochastic gradient descent on E, can in general only provide suboptimal choices of reference vectors w_i for nontrivial data distributions P(v) and nontrivial numbers of reference vectors, due to the fact that the error surface E has many local minima.
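As an illustration (not from the paper), one on-line K-means step (3) can be written in a few lines of NumPy; the function name and array layout are our own choices:

```python
import numpy as np

def kmeans_online_step(w, v, eps):
    """On-line K-means step (3): only the winning reference vector moves.

    w   : (N, D) array of reference vectors
    v   : (D,)  data vector drawn according to P(v)
    eps : step size epsilon
    """
    winner = np.argmin(np.sum((w - v) ** 2, axis=1))  # i(v): index of the closest w_i
    w[winner] += eps * (v - w[winner])                # Delta w_i = eps * (v - w_i)
    return w
```

Iterating this over a stream of samples v(t = 1), v(t = 2), ... performs the stochastic gradient descent on E described above.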

To avoid confinement to local minima during the adaptation procedure, a common approach is to introduce a "soft-max" adaptation rule that not only adjusts the winning reference vector i(v) as in (3), but affects all cluster centers depending on their proximity to v. One interesting approach of this kind, to which we will refer as maximum-entropy clustering [4], employs the adaptation step

$$\Delta w_i = \varepsilon\;\frac{e^{-\beta (v - w_i)^2}}{\sum_{j=1}^{N} e^{-\beta (v - w_j)^2}}\;\bigl(v - w_i\bigr) \qquad (4)$$

which corresponds to a stochastic gradient descent on the cost function

$$E_{\mathrm{mec}} = -\frac{1}{2\beta} \int d^D v\; P(v)\, \ln \sum_{i=1}^{N} e^{-\beta (v - w_i)^2} \qquad (5)$$

instead of (2). As we can see in (4), with a data vector v not only the winning reference vector w_{i(v)} but each w_i is updated, with a step size that decreases with the distance ||v - w_i|| between w_i and the currently presented v. For


β → ∞ the cost function E_mec becomes equivalent to E. Rose et al. [4] have shown what the form of (5) suggests, namely that adaptation step (4) is, in fact, a deterministic annealing procedure with β as the inverse temperature and E_mec as the free energy. By starting at a high temperature, which during the adaptation process is carefully lowered to zero, local minima in E are avoided. However, to obtain good results, the decrement of the temperature has to be very slow. In theory it must hold β ∝ ln t, with t as the number of iteration steps (4) performed [5]. Adaptation step (4) is also used in the context of maximum likelihood approaches where, as reported in [6], the goal is to model the data distribution P(v) through overlapping radial basis functions (RBFs) (adaptation step (4) corresponds to Gaussians as RBFs).
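A sketch of maximum-entropy step (4), again in NumPy and with our own naming; raising β during training according to the schedule quoted in Section III is an assumption here:

```python
import numpy as np

def maxent_step(w, v, eps, beta):
    """Maximum-entropy clustering step (4): every w_i moves toward v, weighted
    by a softmax in the squared distance; beta is the inverse temperature."""
    d2 = np.sum((w - v) ** 2, axis=1)        # squared distances ||v - w_i||^2
    g = np.exp(-beta * (d2 - d2.min()))      # shifted for numerical stability
    g /= g.sum()                             # softmax weights, sum to one
    w += eps * g[:, None] * (v - w)          # Delta w_i = eps * g_i * (v - w_i)
    return w

# deterministic annealing: beta is raised slowly during training,
# e.g., beta(t) = beta_i * (beta_f / beta_i) ** (t / t_max) as in Section III.
```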

Kohonen's topology-conserving feature map algorithm [7]-[9] is another well-known model that has been applied to the task of vector quantization ([10]-[15]) and that incorporates a soft-max adaptation rule. In Kohonen's model every reference vector w_i is assigned to a site i of a lattice A. Each time a data vector v is presented, not only the winning reference vector is adjusted according to (3), but also the reference vectors w_j that are assigned to lattice sites j adjacent to i(v) are updated, with a step size that decreases with the lattice distance between j and i(v). The corresponding adaptation step is of the form

$$\Delta w_i = \varepsilon\, h_\sigma\bigl(i, i(v)\bigr)\,\bigl(v - w_i\bigr), \qquad i = 1, \ldots, N \qquad (6)$$

with h_σ(i, j) as a unimodal function that decreases monotonically for increasing ||i - j|| with a characteristic decay constant σ. For σ = 0, h_σ(i, j) = δ_{ij}, and (6) becomes equivalent to the K-means adaptation rule (3). Kohonen's model is particularly interesting since, through (6), a mapping between the data space V and the chosen lattice A emerges that maintains the topology of the input space V as completely as possible [9], [19]. Kohonen's topology-conserving maps have been employed successfully in a wide range of applications, from the modeling of biological feature maps found in the cortex [20] to technical applications like speech recognition [10], [11], [15], image processing [13], [14], and the control of robot arms [12], [16]-[18], [21].
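A corresponding sketch of Kohonen step (6), assuming a two-dimensional lattice and the Gaussian h_σ quoted later in Section III (our illustration, not the authors' code):

```python
import numpy as np

def kohonen_step(w, grid, v, eps, sigma):
    """Kohonen feature-map step (6): all w_i move toward v, weighted by the
    lattice distance between their site and the winner's site i(v).

    w    : (N, D) reference vectors
    grid : (N, 2) lattice coordinates of the N sites of the external lattice A
    """
    winner = np.argmin(np.sum((w - v) ** 2, axis=1))         # i(v)
    lat_d2 = np.sum((grid - grid[winner]) ** 2, axis=1)      # ||i - i(v)||^2 on A
    h = np.exp(-lat_d2 / (2.0 * sigma ** 2))                 # h_sigma(i, i(v))
    w += eps * h[:, None] * (v - w)
    return w
```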

Adaptation step (6) provides reasonable distributions of the reference vectors with only a small number of iteration steps, which is essential especially in technical applications like the control of robot arms [12], [16]-[18], [21]. However, to obtain good results with Kohonen's algorithm as a vector quantizer, the topology of the lattice A has to match the topology of the space V which is to be represented. In addition, there exists no cost function that yields Kohonen's adaptation rule as its gradient [22], [23]. In fact, as long as σ > 0, one cannot specify a cost function that is minimized by (6).

II. THE NEURAL-GAS ALGORITHM

In this paper, we present a neural network model which, applied to the task of vector quantization, 1) converges quickly to low distortion errors, 2) reaches a distortion error E lower than that resulting from K-means clustering, from maximum-entropy clustering (for practically feasible numbers of iteration steps), and from Kohonen's feature map, and 3) at the same time obeys a gradient descent on an energy surface (like the maximum-entropy clustering, and in contrast to Kohonen's feature map algorithm). For reasons we will give later, we call this network model the "neural-gas" network. Similar to the maximum-entropy clustering and Kohonen's feature map, the neural-gas network also uses a soft-max adaptation rule. However, instead of the distance ||v - w_i|| or of the arrangement of the w_i's within an external lattice, it utilizes a neighborhood ranking of the reference vectors w_i for the given data vector v.

Each time a data vector v is presented, we determine the neighborhood ranking (w_{i_0}, w_{i_1}, ..., w_{i_{N-1}}) of the reference vectors, with w_{i_0} being closest to v, w_{i_1} being second closest to v, and w_{i_k}, k = 0, ..., N - 1, being the reference vector for which there are k vectors w_j with ||v - w_j|| < ||v - w_{i_k}||.

For t_max > 8000, all three other procedures perform worse than the neural-gas algorithm. For a total number of 100 000 adaptation steps the distortion error of the maximum-entropy clustering is more than twice as large as the distortion error achieved by the neural-gas algorithm. Theoretically, for the maximum-entropy approach the performance measure (E - E_0)/E_0 should converge to zero for t_max → ∞. However, as mentioned already in the Introduction, the convergence might be extremely slow. Indeed, all four clustering procedures, including the maximum-entropy approach and the neural-gas algorithm, do not improve their final distortion error significantly further within the range


[Fig. 2 legend: K-means, Kohonen map, maximum-entropy, and neural-gas; horizontal axis: total number of adaptation steps t_max.]

Fig. 2. The performance of the neural-gas algorithm in minimizing the distortion error E for the model distribution of data points which is described in the text and an example of which is shown in Fig. 1. Depicted is the result for different total numbers of adaptation steps t_max. The performance measure is (E - E_0)/E_0, with E_0 as the minimal distortion error that can be achieved for the type of data distribution and the number of reference vectors we used for the test. For comparison we also show the results obtained with the standard K-means clustering, maximum-entropy clustering, and Kohonen's feature map algorithm. Up to t_max = 8000 only the distortion error of the K-means clustering is slightly smaller than the distortion error of the neural-gas algorithm. For t_max > 8000, the three other approaches perform worse than the neural-gas model. For a total number of 100 000 adaptation steps the distortion error of the neural-gas algorithm is smaller by more than a factor of two than the distortion error achieved by the maximum-entropy procedure.

100 000 < t_max < 500 000. t_max = 500 000 is the limit up to which the tests were made.
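Both the neighborhood ranking introduced in Section II and the comparison above rest on the neural-gas adaptation step (7), Δw_i = ε h_λ(k_i(v, w))(v - w_i) with h_λ(k) = exp(-k/λ) (see below). A minimal NumPy sketch of one such step, with our own naming:

```python
import numpy as np

def neural_gas_step(w, v, eps, lam):
    """Neural-gas step (7): w_i moves toward v with weight exp(-k_i / lambda),
    where k_i is the number of reference vectors closer to v than w_i."""
    d2 = np.sum((w - v) ** 2, axis=1)     # squared distances to the current v
    k = np.argsort(np.argsort(d2))        # neighborhood ranks k_i (0 = closest)
    h = np.exp(-k / lam)                  # h_lambda(k_i)
    w += eps * h[:, None] * (v - w)       # Delta w_i = eps * h_lambda(k_i) * (v - w_i)
    return w
```

The double argsort is the N log N ranking step discussed in Section V.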

Fig. 2 demonstrates that the convergence of the neural-gas algorithm is faster than the convergence of the three other approaches. This is important for practical applications where adaptation steps are expensive, e.g., in robot control, where each adaptation step corresponds to a trial movement of the robot arm. In applications that require the learning of input-output relations, vector quantization networks establish a representation of the input space that can then be used for generating output values, either through discrete output values [26], local linear mappings [12], or through radial basis functions [27]. Kohonen's topology-conserving map as a vector quantizer, together with local linear mappings for generating output values, has been studied for a number of robot control tasks [12], [16]-[18]. However, for the reason of faster convergence, we took the neural-gas algorithm for an implementation of the learning algorithms [16]-[18] on an industrial robot arm [28]. Compared to the versions that are based on Kohonen's feature map and require about 6000 adaptation steps (trial movements of the robot arm) to reach the minimal positioning error [18], only 3000 steps are sufficient when the neural-gas network is employed [28].

For the simulations of the neural-gas network as presented in Fig. 2 we chose h_λ(k) = exp(-k/λ), with λ decreasing exponentially with the number of adaptation steps t as λ(t) = λ_i (λ_f/λ_i)^{t/t_max}, with λ_i = 10, λ_f = 0.01, and t_max ∈ [0, 100 000]. Compared to other choices for the neighborhood function h_λ(k), e.g., Gaussians, h_λ(k) = exp(-k/λ) provided the best result. The step size ε has the same time dependence as λ, i.e., ε(t) = ε_i (ε_f/ε_i)^{t/t_max}, with ε_i = 0.5 and ε_f = 0.005. The similarity of the neural-gas network and the Kohonen algorithm motivated the time dependence z(t) = z_i (z_f/z_i)^{t/t_max} for ε and λ. This time dependence has provided good results in applications of the Kohonen network [16]-[18]. The particular choice for λ_i, λ_f, ε_i, and ε_f is not very critical and was optimized by trial and error. The only simulation parameter of the adaptive K-means clustering is the step size ε, which was chosen in our simulations identical to that of the neural-gas algorithm. In contrast to the three other vector quantization procedures, the final result of the K-means clustering depends very much on the quality of the initial distribution of the reference vectors w_i. Therefore, to avoid a comparison in favor of the neural-gas network, we initialized the K-means algorithm more prestructured by exploiting a priori knowledge about the data distribution. Rather than initializing the w_i's totally at random, they were randomly assigned to vectors lying within the 15 clusters. This choice prevents some of the codebook vectors from remaining unused.
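All schedules above share the form z(t) = z_i (z_f/z_i)^{t/t_max}; a one-line helper (our own, for illustration):

```python
def schedule(z_i, z_f, t, t_max):
    """Exponential interpolation z(t) = z_i * (z_f / z_i) ** (t / t_max),
    used for eps, lambda, beta, and sigma in the simulations."""
    return z_i * (z_f / z_i) ** (t / t_max)

# e.g., the neural-gas parameters of Fig. 2:
#   lam = schedule(10.0, 0.01, t, 100_000)
#   eps = schedule(0.5, 0.005, t, 100_000)
```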

For the Monte Carlo simulations of the maximum-entropy clustering the step size ε was also chosen as for the neural-gas algorithm. The inverse temperature β had the time dependence β(t) = β_i (β_f/β_i)^{t/t_max}, with β_i = 1 and β_f = 10 000. This scheduling of β provided the best results for the range of total numbers of adaptation steps t_max that was investigated. Also for Kohonen's feature map algorithm the step size ε was chosen as for the neural-gas algorithm and the other two clustering procedures. The function h_σ(i, j) that determines the neighborhood relation between site i and site j of the lattice A of Kohonen's feature map algorithm was chosen to be a Gaussian of the form h_σ(i, j) = exp(-||i - j||^2 / 2σ^2) [9], [16]-[18]. The decay constant σ, like λ, decreased with


the adaptation steps t according to σ(t) = σ_i (σ_f/σ_i)^{t/t_max}, with σ_i = 2 and σ_f = 0.01. The values for σ_i and σ_f were optimized.

IV. GAS-LIKE DYNAMICS OF THE REFERENCE VECTORS

In this section we clarify the name "neural-gas" and give a quantitative expression for the density distribution of the reference vectors.

We define the density ρ(u) of reference vectors at location u of V ⊆ R^D through ρ(u) = 1/F_{i(u)}, with i(u) being the reference vector closest to u and F_{i(u)} being the volume of the Voronoi polygon V_{i(u)}. According to the definition of V_i, which was given in (1), u ∈ V_{i(u)} is valid. Hence, ρ(u) is a step function that is constant on each Voronoi polygon V_i. In the following, we study the case where the Voronoi polygons change their size F_i slowly from one Voronoi polygon to the next. Then we can regard ρ(u) as being continuous, which allows us to derive an expression for ρ(u)'s dependence on the density distribution of data points.

For a given v, the density distribution ρ(u) determines the numbers k_i(v, w), i = 1, ..., N, which are necessary for an adjustment of the reference vectors w_i. k_i(v, w) is the number of reference vectors within a sphere centered at v with radius ||v - w_i||, i.e.,

$$k_i(v, w) = \int_{\|u\| \le \|v - w_i\|} \rho(v + u)\; d^D u. \qquad (9)$$


time required for an adaptation step of the neural-gas algorithm increases faster with N than the corresponding computation time for a step of the K-means clustering. Determining the k_i, i = 1, ..., N, in a serial implementation corresponds to sorting the distances ||v - w_i||, i = 1, ..., N, which scales like N log N. Searching for the smallest distance ||v - w_{i_0}|| to perform a step of the K-means clustering scales only linearly with the number of reference vectors.

VI. APPLICATION TO TIME-SERIES PREDICTION

A very interesting learning problem is the prediction of a deterministic, but chaotic, time series, which we want to take as an application example of the neural-gas network. The particular time series we choose is the one generated by the Mackey-Glass equation [30]. The prediction task requires learning an input-output mapping y = f(v) of a current state v of the time series (a vector of D consecutive time-series values) into a prediction of a future time-series value y. If one chooses D large enough, i.e., D = 4 in the case of the Mackey-Glass equation, the D-dimensional state vectors v all lie within a limited part of the D-dimensional space and form the attractor V of the Mackey-Glass equation. In order to approximate the input-output relation y = f(v) we partition the attractor's domain into N smaller subregions V_i, i = 1, ..., N, and complete the mapping by choosing local linear mappings to generate the output values y on each subregion V_i. To achieve optimal results with this approach, the partitioning of V into subregions has to be optimized by choosing V_i's the overall size of which is as small as possible.

A way of partitioning V is to employ a vector quantization procedure and take the resulting Voronoi polygons as subregions V_i. To break up the domain region of an input-output relation by employing a vector quantization procedure and to approximate the input-output relation by local linear mappings was suggested in [12]. Based on Kohonen's feature map algorithm, this approach has been applied successfully to various robot control tasks [12], [16]-[18] and has also been applied to the task of predicting time series [31]. However, as we have shown in Section III, for intricately structured input manifolds the Kohonen algorithm leads to suboptimal partitionings and, therefore, provides only suboptimal approximations of the input-output relation y = f(v). The attractor of the Mackey-Glass equation forms such a manifold. Its topology and fractal dimension of 2.1 for the parameters chosen make it impossible to specify a corresponding lattice structure. For this reason, we employ the neural-gas network for partitioning the input space V, which allows us to achieve good or even optimal subregions V_i also in the case of topologically intricately structured input spaces.

A hybrid approximation procedure that also uses a vector quantization technique for preprocessing the input domain to obtain a convenient coding for generating output values has been suggested by Moody and Darken [27]. In their approach, preprocessing the input signals, for which they used K-means clustering, serves the task of distributing the centers w_i of a set of radial basis functions, i.e., Gaussians, over the input domain. The approximation of the input-output relation is then achieved through superpositions of the Gaussians. Moody and Darken demonstrated the performance of their approach also for the problem of predicting the Mackey-Glass time series. A comparison of their result with the performance we achieve with the neural-gas network combined with local linear mappings is given below.

A. Adaptive Local Linear Mappings

The task is to adaptively approximate the function y = f(v), with v ∈ V ⊆ R^D and y ∈ R. V denotes the function's domain region. Our network consists of N computational units, each containing a reference or weight vector w_i (for the neural-gas algorithm) together with a constant y_i and a D-dimensional vector a_i. The neural-gas procedure assigns each unit i to a subregion V_i as defined in (1), and the coefficients y_i and a_i define a linear mapping

$$\tilde y(v) = y_i + a_i \cdot \bigl(v - w_i\bigr) \qquad (14)$$

from R^D to R over each of the Voronoi polyhedra V_i. Hence, the function y = f(v) is approximated by ỹ = f̃(v) with

$$\tilde f(v) = y_{i(v)} + a_{i(v)} \cdot \bigl(v - w_{i(v)}\bigr). \qquad (15)$$

i(v) denotes the computational unit i with its w_i closest to v.
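Evaluating the approximation (15) amounts to a winner lookup followed by one affine map; a minimal sketch, assuming arrays w (reference vectors), y0 (constants y_i), and a (coefficient vectors a_i):

```python
import numpy as np

def predict(w, y0, a, v):
    """Piecewise-linear prediction (15): find the unit i(v) whose w_i is closest
    to v and return y_i + a_i . (v - w_i)."""
    i = np.argmin(np.sum((w - v) ** 2, axis=1))   # winner unit i(v)
    return y0[i] + np.dot(a[i], v - w[i])
```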

To learn the input-output mapping we perform a series of training steps by presenting input-output pairs (v, y = f(v)). The reference vectors w_i are adjusted according to adaptation step (7) of the neural-gas algorithm. To obtain adaptation rules for the output coefficients y_i and a_i, for each i we require the mean squared error ∫_{V_i} d^D v P(v) (f(v) - f̃(v))^2 between the actual and the required output, averaged over subregion V_i, to be minimal. A gradient descent with respect to y_i and a_i yields

$$\Delta y_i = \varepsilon' \int_{V_i} d^D v\; P(v)\,\bigl(y - y_i - a_i \cdot (v - w_i)\bigr) = \varepsilon' \int_V d^D v\; P(v)\,\delta_{i\,i(v)}\,\bigl(y - y_i - a_i \cdot (v - w_i)\bigr) \qquad (16)$$

and

$$\Delta a_i = \varepsilon' \int_{V_i} d^D v\; P(v)\,\bigl(y - y_i - a_i \cdot (v - w_i)\bigr)\,(v - w_i) = \varepsilon' \int_V d^D v\; P(v)\,\delta_{i\,i(v)}\,\bigl(y - y_i - a_i \cdot (v - w_i)\bigr)\,(v - w_i). \qquad (17)$$

For λ → 0 in adaptation step (7) the neural-gas algorithm provides an equilibrium distribution of the w_i's for which ∫_{V_i} d^D v P(v)(v - w_i) = 0 for each i, i.e., w_i denotes the center of gravity of the subregion V_i. Hence, ∫_{V_i} d^D v P(v) a_i · (v - w_i) in (16) vanishes and the adaptation step for the y_i's takes on a form similar to the adaptation step of the w_i's, except that only the output of the winner unit is adjusted with a training step. To obtain a significantly faster convergence of the output weights y_i and a_i, we replace δ_{i i(v)} in (16) and (17) by h_{λ'}(k_i(v, w)), which has the same form as h_λ(k_i(v, w)) in adaptation step (7), except that the decay constant λ' might be of a different value than λ. By this replacement we achieve that the y_i and a_i of each unit are updated in a training step,


Fig. 3. The temporal evolution of the neural-gas while adapting to a representation of the Mackey-Glass attractor. Presented is a three-dimensional projection of the distribution of the reference vectors (big dots) at four different stages of the adaptation procedure, starting with a very early state at the top left part of this figure. The final state is shown at the bottom right. The small dots depict already presented inputs v, forming the attractor which has to be partitioned.

with a step size that decreases with the unit's neighborhood rank to the current input v. In the beginning of the training procedure, λ' is large and the range of input signals that affect the weights of a unit is large. As the number of training steps increases, λ' decreases to zero and the fine tuning of the output weights to the local shape of f(v) takes place. Hence, in their on-line formulation, the adaptation steps we use for adjusting y_i and a_i are given by

$$\Delta y_i = \varepsilon'\, h_{\lambda'}\bigl(k_i(v, w)\bigr)\,\bigl(y - y_i - a_i \cdot (v - w_i)\bigr), \qquad \Delta a_i = \varepsilon'\, h_{\lambda'}\bigl(k_i(v, w)\bigr)\,\bigl(y - y_i - a_i \cdot (v - w_i)\bigr)\,(v - w_i). \qquad (18)$$
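Our reading of the on-line adaptation steps (18), again as a NumPy sketch with assumed names (eps_out and lam_out stand for ε' and λ'):

```python
import numpy as np

def output_step(w, y0, a, v, y, eps_out, lam_out):
    """Rank-weighted update of the output coefficients y_i and a_i, eq. (18):
    every unit is corrected toward the target y, weighted by h_{lambda'}(k_i)."""
    d2 = np.sum((w - v) ** 2, axis=1)
    k = np.argsort(np.argsort(d2))                   # neighborhood ranks k_i
    h = np.exp(-k / lam_out)                         # h_{lambda'}(k_i)
    err = y - (y0 + np.sum(a * (v - w), axis=1))     # residual of each local map
    y0 += eps_out * h * err                          # Delta y_i
    a += eps_out * (h * err)[:, None] * (v - w)      # Delta a_i
    return y0, a
```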

B. Prediction of the Mackey-Glass Time Series

The time series we want to predict with our network algorithm is generated by the Mackey-Glass equation

$$\frac{dx(t)}{dt} = \alpha\,\frac{x(t - \tau)}{1 + x(t - \tau)^{10}} + \beta\, x(t)$$

with parameters α = 0.2, β = -0.1, and τ = 17 [30]. x(t) is quasi-periodic and chaotic with a fractal attractor dimension of 2.1 for the parameter values we chose. The characteristic time constant of x(t) is t_char = 50, which makes it particularly difficult to forecast x(t + Δt) with Δt > 50.

The input v of our network algorithm consists of four past values of x(t), i.e.,

v = (x(t), x(t - 6), x(t - 12), x(t - 18)).

Embedding a set of time-series values in a state vector is common to several approaches, including those of Moody and Darken [27], Lapedes and Farber [32], and Sugihara and May [33]. The time span we want to forecast into the future is Δt = 90. For that purpose we iteratively predict x(t + 6), x(t + 12), etc., until we have reached x(t + 90) after 15 such iterations. Because of this iterative forecasting, the output y which corresponds to v and which is used for training the network is the true value of x(t + 6).
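The paper does not describe how the series was integrated; the following sketch uses a simple Euler step with dt = 1 (an assumption of ours) to generate data, build the four-component embedding, and form the training targets y = x(t + 6):

```python
import numpy as np

def mackey_glass(n, alpha=0.2, beta=-0.1, tau=17, dt=1.0, x0=1.2):
    """Generate n samples of the Mackey-Glass series by Euler integration;
    the delayed value x(t - tau) is taken tau steps back."""
    x = np.full(n + tau, x0)
    for t in range(tau, n + tau - 1):
        x_tau = x[t - tau]
        x[t + 1] = x[t] + dt * (alpha * x_tau / (1.0 + x_tau ** 10) + beta * x[t])
    return x[tau:]

x = mackey_glass(5000)
# embedding v = (x(t), x(t-6), x(t-12), x(t-18)) and training target y = x(t+6)
T = np.arange(18, len(x) - 6)
inputs = np.stack([x[T], x[T - 6], x[T - 12], x[T - 18]], axis=1)
targets = x[T + 6]
# for a Delta t = 90 forecast, predict x(t+6), shift it into the embedding,
# and repeat the prediction 15 times.
```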

We studied several different training procedures. First, we trained several networks of different sizes using 100 000 to 200 000 training steps and 100 000 training pairs (v = (x(t), x(t - 6), x(t - 12), x(t - 18)), y = x(t + 6)). One could deem this training as on-line because of the abundant supply of data. Fig. 3 presents the temporal evolution of the neural-gas adapting to a representation of the Mackey-Glass attractor. We show a three-dimensional projection of the four-dimensional input space. The initialization of the reference vectors, presented in the top left part of Fig. 3, is random. After 500 training steps, the reference vectors have contracted coarsely to the relevant part of the input space (top right). With further training steps (20 000, bottom left), the network assumes the general shape of the attractor, and at the end of the adaptation procedure (100 000 training steps, bottom right), the reference vectors are distributed homogeneously over the Mackey-Glass attractor. The small dots depict already-presented inputs v, whose distribution is given by the shape of the attractor.



Fig. 4. The normalized prediction error versus the size of the network. The neural-gas algorithm combined with local linear mappings (1) compares well against Moody and Darken's K-means RBF method (2) and against back-propagation (3). To obtain the same prediction error, the K-means RBF method requires about 10 times more weights than the neural-gas algorithm with local linear mappings.

Fig. 4 shows the normalized prediction error as a function of network size. The size of a network is the number of its weights, with nine weights per computational unit (four for each w_i and a_i, plus one for y_i). The prediction error is determined by the rms value of the absolute prediction error for Δt = 90, divided by the standard deviation of x(t). As we can see in Fig. 4, compared to the results Moody and Darken obtained with K-means clustering plus radial basis functions [27], the neural-gas network combined with local linear mappings requires about 10 times fewer weights to achieve the same prediction error. The horizontal line in Fig. 4 shows the prediction error obtained by Lapedes and Farber with the back-propagation algorithm [32]. Lapedes and Farber tested only one network size. On a set of only 500 data points they achieved a normalized prediction error of about 0.05. However, their learning time was on the order of an hour on a Cray X-MP. For comparison, we trained a network using 1000 data points and obtained the same prediction error of 0.05; however, training took only 90 s on a Silicon Graphics IRIS, which achieves 4 MFlops for LINPACK benchmarks. To achieve comparable results, Moody and Darken employed about 13 000 data points, which required 1800 s at 90 kFlops. Hence, compared to our learning procedure, Moody and Darken's approach requires a much larger data set but is twice as fast. However, because of possible variations in operating systems and other conditions, both speeds can be considered comparable.
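The error measure used here (rms absolute prediction error divided by the standard deviation of the series) is simple to reproduce; a small sketch:

```python
import numpy as np

def normalized_prediction_error(x_true, x_pred):
    """rms of the absolute prediction error divided by the standard deviation
    of the time series (the quantity plotted in Figs. 4 and 5)."""
    x_true, x_pred = np.asarray(x_true), np.asarray(x_pred)
    return np.sqrt(np.mean((x_true - x_pred) ** 2)) / np.std(x_true)
```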

Fig. 5 shows the results of a study of our algorithm's performance in an off-line or scarce-data environment. We trained networks of various sizes through 200 epochs (or 200 000 steps, whichever is smaller) on different sizes of training sets. Due to the effect of overfitting, small networks achieve a better performance than large networks if the training set of data points is small. With increasingly large amounts of data the prediction error for the different network sizes saturates and approaches its lower bound.

[Fig. 5: curves of the normalized prediction error (logarithmic scale) for several network sizes; horizontal axis: log(Size of Data Set).]

Fig. 5. The normalized prediction error versus training set size for different network sizes. Due to the effect of overfitting, small networks achieve a better performance than large networks if the training set of data points is small. With increasingly large amounts of data, the prediction error for the different network sizes approaches its lower bound.

As in the simulations described in Section III, the parameters ε, λ, ε', and λ' had the time dependence z(t) = z_i (z_f/z_i)^{t/t_max}, with t as the current and t_max as the total number of training steps. The initial and final values for the simulation parameters were ε_i = 0.99, ε_f = 0.001; λ_i = N/3, λ_f = 0.0001; ε'_i = 0.5, ε'_f = 0.05; λ'_i = N/6, and λ'_f = 0.05. As in the simulations of Section III, the particular choice for these parameter values is not very critical and was optimized by trial and error. Both h_λ(k) and h_{λ'}(k) decreased exponentially with k.

VII. DISCUSSION

In this paper we presented a novel approach to the task of

minimizing the distortion error of vector quantization coding. The goal was to present an approach that does not require any prior knowledge about the set of data points and, at the same time, converges quickly to optimal or at least near-optimal distortion errors. The adaptation rule we employed is a soft-max version of the K-means clustering algorithm and resembles to a certain extent both the adaptation rule of the maximum-entropy clustering and Kohonen's feature map algorithm. However, compared to the maximum-entropy clustering, it is a distance ranking instead of the absolute distance of the reference vectors to the current data vector that determines the adaptation step. Compared to Kohonen's feature map algorithm, it is not the neighborhood ranking of the reference vectors within an external lattice but the neighborhood ranking within the input space that is taken into account.

We compared the performance of the neural-gas approach with K-means clustering, maximum-entropy clustering, and Kohonen's feature map algorithm on a model data distribution which consisted of a number of separated data clusters. On the model data distribution we found that 1) the neural-gas algorithm converges faster and 2) reaches smaller distortion errors than the three other clustering procedures. The price for the faster convergence to smaller distortion errors, however,


is a higher computational effort. In a serial implementation the computation time of the neural-gas algorithm scales like N log N with the number N of reference vectors, whereas the three other clustering procedures all scale only linearly with N. Nonetheless, in a highly parallel implementation the computation time required for the neural-gas algorithm becomes the same as for the three other approaches, namely O(log N).

We showed that, in contrast to Kohonen's feature map algorithm, the neural-gas algorithm minimizes a global cost function. The shape of the cost function depends on the neighborhood parameter λ, which determines the range of the global adaptation of the reference vectors. The form of the cost function relates the neural-gas algorithm to fuzzy clustering, with an assignment degree of a data point to a reference vector that depends on the reference vector's neighborhood rank to this data point. Through an analysis of the average change of the reference vectors for small but finite λ, we could demonstrate that the dynamics of the neural-gas network resembles the dynamics of a set of particles diffusing in a potential. The potential is given by the negative density distribution of the data points, which leads to a higher density of reference vectors in those regions where the data point density is high. A quantitative relation between the density of reference vectors and the density of data points could be derived.

To demonstrate the performance of the neural-gas algorithm, we chose the problem of predicting the chaotic time series generated by the Mackey-Glass equation. The neural-gas network had to form an efficient representation of the underlying attractor, which has a fractal dimension of 2.1. The representation (as a discretization of the relevant parts of the input space) was utilized to learn the required output, i.e., a forecast of the time series, by using local linear mappings. A comparison with the performance of K-means clustering combined with radial basis functions showed that the neural-gas network requires an order of magnitude fewer weights to achieve the same prediction error. Also, the generalization capabilities of the neural-gas algorithm combined with local linear mappings compared favorably with the generalization capabilities of the RBF approach. To achieve identical accuracy, the RBF approach requires a training data set that is larger by an order of magnitude than the training data set which is sufficient for a neural-gas network with local linear mappings.

APPENDIX I

We show that the dynamics of the neural-gas network, described by adaptation step (7), corresponds to a stochastic gradient descent on a potential function. For this purpose we prove the following:

Theorem: For a set of reference vectors w = (w_1, ..., w_N), w_i ∈ R^D, and a density distribution P(v) of data points v ∈ R^D over the input space V ⊆ R^D, the relation

$$\int_V d^D v\; P(v)\, h_\lambda\bigl(k_i(v, w)\bigr)\,\bigl(v - w_i\bigr) = -\frac{\partial E}{\partial w_i} \qquad (19)$$

with

$$E = \frac{1}{2} \sum_{j=1}^{N} \int_V d^D v\; P(v)\, h_\lambda\bigl(k_j(v, w)\bigr)\,\bigl(v - w_j\bigr)^2 \qquad (20)$$

is valid. k_j(v, w) denotes the number of reference vectors w_l with ||v - w_l|| < ||v - w_j||.

Proof: For notational convenience we define d_i(v) = v - w_i. This yields

$$\frac{\partial E}{\partial w_i} = R_i - \int_V d^D v\; P(v)\, h_\lambda\bigl(k_i(v, w)\bigr)\,\bigl(v - w_i\bigr) \qquad (21)$$

with

$$R_i = \frac{1}{2} \sum_{j=1}^{N} \int_V d^D v\; P(v)\, h'_\lambda\bigl(k_j(v, w)\bigr)\,\frac{\partial k_j(v, w)}{\partial w_i}\, d_j^2.$$

h'_λ(·) denotes the derivative of h_λ(·). We have to show that R_i vanishes for each i = 1, ..., N.

For k_j(v, w) the relation

$$k_j(v, w) = \sum_{l=1}^{N} \theta\bigl(d_j^2 - d_l^2\bigr) \qquad (22)$$

is valid, with θ(·) as the Heaviside step function

$$\theta(x) = \begin{cases} 1, & \text{for } x > 0 \\ 0, & \text{for } x \le 0. \end{cases} \qquad (23)$$

The derivative of the Heaviside step function θ(x) is the delta distribution δ(x), with δ(x) = 0 for x ≠ 0 and ∫ δ(x) dx = 1. This yields

$$R_i = -\int_V d^D v\; P(v)\, h'_\lambda\bigl(k_i(v, w)\bigr)\, d_i^2\, d_i \sum_{l=1}^{N} \delta\bigl(d_i^2 - d_l^2\bigr) + \sum_{j=1}^{N} \int_V d^D v\; P(v)\, h'_\lambda\bigl(k_j(v, w)\bigr)\, d_j^2\, d_i\, \delta\bigl(d_j^2 - d_i^2\bigr). \qquad (24)$$

Each of the N integrands in the second term of (24) is nonvanishing only for those v for which d_j^2 = d_i^2 is valid, respectively. For these v we can write

$$k_j(v, w) = \sum_{l=1}^{N} \theta\bigl(d_j^2 - d_l^2\bigr) = \sum_{l=1}^{N} \theta\bigl(d_i^2 - d_l^2\bigr) = k_i(v, w) \qquad (25)$$


and, hence, we obtain

$$R_i = -\int_V d^D v\; P(v)\, h'_\lambda\bigl(k_i(v, w)\bigr)\, d_i^2\, d_i \sum_{l=1}^{N} \delta\bigl(d_i^2 - d_l^2\bigr) + \int_V d^D v\; P(v)\, h'_\lambda\bigl(k_i(v, w)\bigr)\, d_i^2\, d_i \sum_{j=1}^{N} \delta\bigl(d_j^2 - d_i^2\bigr). \qquad (26)$$

Since δ(x) = δ(-x) is valid, R_i vanishes for each i = 1, ..., N.
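The theorem lends itself to a quick numerical check (ours, not part of the paper): replacing the integral over P(v) by an average over samples, the finite-difference gradient of the sample version of (20) should agree with the sample average of the left side of (19):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, M, lam = 8, 2, 4000, 1.5
w = rng.normal(size=(N, D))            # reference vectors
samples = rng.normal(size=(M, D))      # data points standing in for P(v)

def h(k):                              # h_lambda(k) = exp(-k / lambda)
    return np.exp(-k / lam)

def ranks(w):                          # k_j(v, w) for every sample and unit
    d2 = ((samples[:, None, :] - w[None, :, :]) ** 2).sum(-1)
    return np.argsort(np.argsort(d2, axis=1), axis=1)

def E(w):                              # sample version of the cost function (20)
    d2 = ((samples[:, None, :] - w[None, :, :]) ** 2).sum(-1)
    return 0.5 * np.mean(np.sum(h(ranks(w)) * d2, axis=1))

i = 0
lhs = np.mean(h(ranks(w))[:, i, None] * (samples - w[i]), axis=0)   # left side of (19)
rhs = np.empty(D)
for d in range(D):                     # finite-difference estimate of -dE/dw_i
    dw = np.zeros_like(w)
    dw[i, d] = 1e-5
    rhs[d] = -(E(w + dw) - E(w - dw)) / 2e-5
print(lhs, rhs)                        # the two vectors agree closely
```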

APPENDIX II

In the following we provide a derivation of the relation, quoted in Section IV, between the density of reference vectors and the density of data points. The average change ⟨Δw_i⟩ of a reference vector with adaptation step (7) is given by

$$\langle \Delta w_i \rangle = \varepsilon \int_V d^D v\; P(v)\, h_\lambda\bigl(k_i(v, w)\bigr)\,\bigl(v - w_i\bigr) \qquad (27)$$

with h_λ(k_i(v, w)) being a factor that determines the size of the adaptation step and that depends on the number k_i of reference vectors w_j being closer to v than w_i. We assume h_λ(k_i) to be unity for k_i = 0 and to decrease to zero for increasing k_i with a characteristic decay constant λ. We can express k_i(v, w) by

$$k_i(v, w) = \int_{\|u\| \le \|v - w_i\|} \rho(v + u)\; d^D u \qquad (28)$$

with ρ(u) as the density distribution of the reference vectors in the input space V = R^D.

For a given set w = (w_1, ..., w_N) of reference vectors, k_i(v, w) depends only on r = v - w_i and, therefore, we introduce

$$x(r) = \frac{r}{\|r\|}\,\bigl(k_i(r)\bigr)^{1/D}. \qquad (29)$$

In the following we assume the limit of a continuous distribution of reference vectors w_i, i.e., we assume ρ(u) to be analytical and nonzero over the input space V. Then, since ρ(u) > 0, for a fixed direction of r, x(r) increases strictly monotonically with ||r|| and, therefore, the inverse of x(r), denoted by r(x), exists. This yields for (27)

$$\langle \Delta w_i \rangle = \varepsilon \int_V P\bigl(w_i + r(x)\bigr)\, h_\lambda\bigl(x^D\bigr)\, r(x)\, J(x)\; d^D x \qquad (30)$$

with J(x) = det(∂r_μ/∂x_ν), μ, ν = 1, ..., D, and x = ||x||.

We assume h_λ(k_i(v - w_i)) to decrease rapidly to zero with increasing ||v - w_i||, i.e., we assume the range of h_λ(k_i(r)) within V to be small enough to be able to neglect any higher derivatives of P(u) and ρ(u) within this range. Then we may replace P(w_i + r(x)) and J(x) in (30) by the first terms of their Taylor expansions around x = 0, yielding

$$\langle \Delta w_i \rangle = \varepsilon \int \Bigl[ P(w_i) + \bigl(r(x)\cdot\partial\bigr) P(w_i) \Bigr]\, h_\lambda\bigl(x^D\bigr)\, r(x)\, \Bigl[ J(0) + \bigl(x\cdot\partial_x\bigr) J\big|_{x=0} \Bigr]\; d^D x. \qquad (31)$$

The Taylor expansion of r(x) can be obtained by determining the inverse of x(r) in (29) for small ||r||. For small ||r|| (it holds ||u|| ≤ ||r|| in (28)), the density ρ(v + u) in (28) is given by

$$\rho(v + u) = \rho(w_i + r + u) = \rho(w_i) + \bigl((r + u)\cdot\partial_r\bigr)\rho(w_i) + O(r^2) \qquad (32)$$

with ∂_r = ∂/∂r. Together with (29), this yields

$$x(r) = r\,\bigl(T_D\,\rho\bigr)^{1/D}\,\Bigl(1 + \frac{r\cdot\partial_r\rho}{D\,\rho} + O(r^2)\Bigr) \qquad (33)$$

with

$$T_D = \frac{\pi^{D/2}}{\Gamma\bigl(\frac{D}{2} + 1\bigr)} \qquad (34)$$

as the volume of a sphere with radius one in dimension D. As the inverse of (33) we obtain, for small ||x||,

$$r(x) = x\,\bigl(T_D\,\rho\bigr)^{-1/D}\,\Bigl(1 - \bigl(T_D\,\rho\bigr)^{-1/D}\,\frac{x\cdot\partial_r\rho}{D\,\rho} + O(x^2)\Bigr) \qquad (35)$$

which yields the first terms of the Taylor expansion of r(x) around x = 0.

Because of (35) it holds

$$\frac{\partial r_\mu(x)}{\partial x_\nu} = \delta_{\mu\nu}\,\bigl(T_D\,\rho\bigr)^{-1/D} - \bigl(T_D\,\rho\bigr)^{-2/D}\,\frac{\delta_{\mu\nu}\,(x\cdot\partial_r\rho) + x_\mu\,\partial_\nu\rho}{D\,\rho} + O(x^2) \qquad (36)$$

and, therefore,

$$\frac{\partial P}{\partial x} = \frac{\partial P}{\partial r}\,\frac{\partial r}{\partial x} = \bigl(T_D\,\rho\bigr)^{-1/D}\,\frac{\partial P}{\partial r} \qquad (37)$$

is valid at x = 0. The off-diagonal terms in (36) contribute only in quadratic order to J(x) = det(∂r_μ/∂x_ν). Considering the diagonal terms up to linear order in x yields

$$J(x) = \bigl(T_D\,\rho\bigr)^{-1}\,\Bigl(1 - \bigl(T_D\,\rho\bigr)^{-1/D}\,\frac{D + 1}{D}\,\frac{x\cdot\partial_r\rho}{\rho}\Bigr) + O(x^2) \qquad (38)$$

and, therefore,

$$\frac{\partial J}{\partial x}\Big|_{x=0} = -\bigl(T_D\,\rho\bigr)^{-1-1/D}\,\frac{D + 1}{D}\,\frac{\partial_r\rho}{\rho} \qquad (39)$$

is valid at x = 0. Taking into account (35), (37)-(39) we obtain, for (31),

$$\langle \Delta w_i \rangle = \varepsilon\,\bigl(T_D\,\rho\bigr)^{-1-1/D} \int d^D x\; h_\lambda\bigl(x^D\bigr)\, x\,\Bigl[P + \bigl(T_D\,\rho\bigr)^{-1/D}\,\bigl(x\cdot\partial_r P\bigr)\Bigr]\,\Bigl[1 - \bigl(T_D\,\rho\bigr)^{-1/D}\,\frac{D + 2}{D}\,\frac{x\cdot\partial_r\rho}{\rho} + \ldots\Bigr]. \qquad (40)$$


The integral over the terms of odd order in x vanishes because of the rotational symmetry of h_λ(x^D). Considering only the leading nonvanishing term (leading in λ), we finally obtain

$$\langle \Delta w_i \rangle = \frac{\varepsilon}{D}\, C(\lambda, D)\,\bigl(T_D\,\rho\bigr)^{-1-2/D}\,\Bigl(\partial_r P - \frac{1}{\gamma}\, P\,\frac{\partial_r\rho}{\rho}\Bigr) \qquad (41)$$

with

$$C(\lambda, D) = \int d^D x\; h_\lambda\bigl(x^D\bigr)\, x^2 \qquad (42)$$

and

$$\gamma = \frac{D}{D + 2}. \qquad (43)$$

    ACKNOWLEDGMENT

The authors would like to thank J. Walter for many discussions on the problem of predicting chaotic time series and for providing parts of the prediction software. They also thank R. Der and T. Villmann for valuable remarks on the theorem in Appendix I.

    REFERENCES

[1] R. M. Gray, "Vector quantization," IEEE ASSP Mag., vol. 1, no. 2, pp. 4-29, 1984.
[2] S. P. Lloyd, "Least squares quantization in PCM," IEEE Trans. Inform. Theory, vol. IT-28, no. 2, 1982.
[3] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proc. 5th Berkeley Symp. on Mathematics, Statistics, and Probability, L. M. LeCam and J. Neyman, Eds., 1967, pp. 281-297.
[4] K. Rose, E. Gurewitz, and G. Fox, "Statistical mechanics and phase transitions in clustering," Physical Rev. Lett., vol. 65, no. 8, pp. 945-948, 1990.
[5] S. Geman and D. Geman, "Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images," IEEE Trans. Patt. Anal. Machine Intell., vol. PAMI-6, pp. 721-741, 1984.
[6] S. J. Nowlan, "Maximum likelihood competitive learning," in Advances in Neural Information Processing Systems 2, D. Touretzky, Ed. New York: Morgan Kaufmann, 1990, pp. 574-582.
[7] T. Kohonen, "Self-organized formation of topologically correct feature maps," Biological Cybern., vol. 43, pp. 59-69, 1982.
[8] T. Kohonen, "Analysis of a simple self-organizing process," Biological Cybern., vol. 44, pp. 135-140, 1982.
[9] T. Kohonen, Self-Organization and Associative Memory (Springer Series in Information Sciences 8). Heidelberg: Springer.
[10] T. Kohonen, K. Makisara, and T. Saramaki, "Phonotopic maps - insightful representation of phonological features for speech recognition," in Proc. 7th Int. Conf. on Pattern Recognition, Montreal, 1984, pp. 182-185.
[11] J. Makhoul, S. Roucos, and H. Gish, "Vector quantization in speech coding," Proc. IEEE, vol. 73, pp. 1551-1588, 1985.
[12] H. Ritter and K. Schulten, "Topology conserving mappings for learning motor tasks," in Neural Networks for Computing, J. S. Denker, Ed., AIP Conf. Proc. 151, Snowbird, UT, 1986, pp. 376-380.
[13] N. M. Nasrabadi and R. A. King, "Image coding using vector quantization: A review," IEEE Trans. Commun., vol. 36, no. 8, pp. 957-971, 1988.
[14] N. M. Nasrabadi and Y. Feng, "Vector quantization of images based upon the Kohonen self-organizing feature maps," in IEEE Int. Conf. on Neural Networks, San Diego, CA, 1988, pp. 1101-1108.
[15] J. Naylor and K. P. Li, "Analysis of a neural network algorithm for vector quantization of speech parameters," in Proc. 1st Annu. INNS Meet., 1988, p. 310.
[16] H. Ritter, T. Martinetz, and K. Schulten, "Topology-conserving maps for learning visuomotor-coordination," Neural Networks, vol. 2, pp. 159-168, 1988.
[17] T. Martinetz, H. Ritter, and K. Schulten, "Three-dimensional neural net for learning visuomotor-coordination of a robot arm," IEEE Trans. Neural Networks, vol. 1, no. 1, pp. 131-136.
[18] T. Martinetz, H. Ritter, and K. Schulten, "Learning of visuomotor coordination of a robot arm with redundant degrees of freedom," in Proc. Int. Conf. on Parallel Processing in Neural Systems and Computers 1990, R. Eckmiller, G. Hartmann, and G. Hauske, Eds., pp. 431-434; and Proc. 3rd Int. Symp. on Robotics and Manufacturing, Vancouver, BC, Canada, 1990, pp. 521-526.
[19] H. Ritter and K. Schulten, "Convergence properties of Kohonen's topology conserving maps: Fluctuations, stability and dimension selection," Biological Cybern., vol. 60, pp. 59-71, 1989.
[20] K. Obermayer, H. Ritter, and K. Schulten, "A principle for the formation of the spatial structure of cortical feature maps," Proc. Nat. Acad. Sci. USA, vol. 87, pp. 8345-8349, 1990.
[21] D. H. Graf and W. R. LaLonde, "A neural controller for collision-free movement of general robot manipulators," in IEEE Int. Conf. on Neural Networks, San Diego, CA, 1988, pp. 77-84.
[22] T. Kohonen, "Self-organizing maps: Optimization approaches," in Artificial Neural Networks, vol. II, T. Kohonen et al., Eds. Amsterdam: North Holland, 1991, pp. 981-990.
[23] E. Erwin, K. Obermayer, and K. Schulten, "Self-organizing maps: Ordering, convergence properties, and energy functions," Biological Cybern., vol. 67, pp. 47-55, 1992.
[24] J. C. Dunn, "A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters," J. Cybern., vol. 3, no. 3, pp. 32-57, 1974.
[25] J. C. Bezdek, "A convergence theorem for the fuzzy ISODATA clustering algorithms," IEEE Trans. Patt. Anal. Machine Intell., vol. 2, no. 1, pp. 1-8, 1980.
[26] R. Hecht-Nielsen, "Applications of counterpropagation networks," Neural Networks, vol. 1, no. 2, pp. 131-141, 1988.
[27] J. Moody and C. J. Darken, "Fast learning in networks of locally tuned processing units," Neural Computation, vol. 1, pp. 281-294, 1989.
[28] J. Walter, T. Martinetz, and K. Schulten, "Industrial robot learns visuomotor coordination by means of neural-gas network," in Artificial Neural Networks, vol. I, T. Kohonen et al., Eds. Amsterdam: North Holland, 1991, pp. 357-364.
[29] P. L. Zador, "Asymptotic quantization error of continuous signals and the quantization dimension," IEEE Trans. Inform. Theory, vol. IT-28, no. 2, pp. 139-149, 1982.
[30] M. Mackey and L. Glass, "Oscillation and chaos in physiological control systems," Science, vol. 197, p. 287, 1977.
[31] J. Walter, H. Ritter, and K. Schulten, "Nonlinear prediction with self-organizing maps," in IJCNN-90 Conf. Proc., vol. III, San Diego, CA, 1990, pp. 589-594.
[32] A. Lapedes and R. Farber, "Nonlinear signal processing using neural networks: Prediction and system modelling," Technical Report LA-UR-87-2662, Los Alamos National Laboratory, Los Alamos, NM, 1988.
[33] G. Sugihara and R. May, "Nonlinear forecasting as a way of distinguishing chaos from measurement error in time-series," Nature, vol. 344, pp. 734-740, 1990.

Thomas M. Martinetz (M'91) was born in 1962. He received the Diplom degree in physics and the Ph.D. degree in theoretical physics, both from the Technical University of Munich, Germany, in 1988 and 1992, respectively.

From 1989 to 1991 he was with the Beckman Institute and the Department of Physics at the University of Illinois at Urbana-Champaign. Currently, he is a member of the Neural Networks Group of the Corporate Research and Development of Siemens AG, Munich. Since 1987 his research has been in the area of neural networks. His main interests are in the design and analysis of neural network architectures for adaptive signal processing, motor coordination, and process control. He is the author of several papers and coauthor of a textbook on neural networks.


Stanislav G. Berkovich was born in Moscow, Russia, on September 2, 1970. He received the B.S. degree in computer engineering from the University of Illinois at Urbana-Champaign.

Since October 1991 he has been with the Parallel Distributed Processing Department of the Sony Telecommunication and Information Systems Research Laboratory, Tokyo, Japan. His research interests are in the areas of topology-preserving maps and other neural network approaches for perception and motor coordination.


Klaus J. Schulten was born in 1947. He received the Diplom degree in physics from the University of Munster, Germany, and the Ph.D. degree in chemical physics from Harvard University, Cambridge, MA, in 1969 and 1974, respectively.

In 1974 he joined the Max-Planck-Institute for Biophysical Chemistry, Gottingen, Germany. He became Professor of Theoretical Physics at the Technical University of Munich, Germany, in 1980. In 1988 he became Professor of Physics, Chemistry, and Biophysics at the University of Illinois at Urbana-Champaign, where he is also a member of the Beckman Institute. His research is in the areas of statistical physics, molecular biophysics, and computational neural science. In the last area, he currently studies visuomotor control and brain maps.