
Generalized Relevance Learning Vector Quantization

Barbara Hammer
University of Osnabrück, Department of Mathematics/Computer Science, Albrechtstraße 28, 49069 Osnabrück, Germany

Thomas Villmann
University of Leipzig, Clinic for Psychotherapy and Psychosomatic Medicine, Karl-Tauchnitz-Straße 25, 04107 Leipzig, Germany

March 11, 2002

Abstract

We propose a new scheme for enlarging generalized learning vector quantization (GLVQ) with weighting factors for the input dimensions. The factors allow an appropriate scaling of the input dimensions according to their relevance. They are adapted automatically during training according to the specific classification task, whereby training can be interpreted as stochastic gradient descent on an appropriate error function. This method leads to a more powerful classifier and to an adaptive metric with little extra cost compared to standard GLVQ. Moreover, the size of the weighting factors indicates the relevance of the input dimensions, which suggests a scheme for automatically pruning irrelevant input dimensions. The algorithm is verified on artificial data sets and the Iris data from the UCI repository. Afterwards, the method is compared to several well-known algorithms which determine the intrinsic data dimension on real-world satellite image data.

Keywords: clustering, learning vector quantization, adaptive metric, relevance determination.

    1 Introduction

Self-organizing methods such as the self-organizing map (SOM) or vector quantization (VQ) as introduced by Kohonen provide a successful and intuitive method of processing data for easy access [18]. Assuming the data are labeled, an automatic clustering can be learned by attaching maps to the SOM or by enlarging VQ with a supervised component, yielding so-called learning vector quantization (LVQ) [19, 23]. Various modifications of LVQ exist which ensure faster convergence, a better adaptation of the receptive fields to the optimum Bayesian decision, or an adaptation to complex data structures, to name just a few [19, 29, 33].

A common feature of unsupervised algorithms and LVQ consists in the fact that information is provided by the distance structure between the data points, which is determined by the chosen metric. Learning heavily relies on the commonly used Euclidean metric and hence crucially depends on the fact that the Euclidean metric is appropriate for the respective learning task. Therefore data have to be preprocessed and scaled appropriately such that the input dimensions have approximately the same importance for the classification. In particular, the important features for the respective problem have to be found, which is usually done by experts or with rules of thumb. Of course, this may be time consuming and requires prior knowledge which is often not available. Hence methods have been proposed which adapt the metric during training. Distinction sensitive LVQ (DSLVQ), as an example, automatically determines weighting factors for the input dimensions of the training data [26]; the algorithm adapts LVQ3 for the weighting factors according to plausible heuristics. The approaches [17, 32] enhance unsupervised clustering algorithms with the possibility of integrating auxiliary information such as a labeling into the metric structure. Alternatively, one could use information geometric methods in order to adapt the metric, as in [14].

Concerning the SOM, another major problem consists in finding an appropriate topology of the initial lattice of prototypes such that the prior topology of the neural architecture mirrors the intrinsic topology of the data. Hence various heuristics exist to measure the degree of topology preservation, to adapt the topology to the data, to define the lattice a posteriori, or to evolve structures which are appropriate for real-world data [2, 7, 20, 27, 37]. In all these tasks the intrinsic dimensionality of the data plays a crucial role, since it determines an important aspect of the optimum neural network: the topological structure, i.e., the lattice for the SOM. Moreover, superfluous data dimensions slow down training for LVQ as well. They may even cause a decrease in accuracy, since they add possibly noisy or misleading terms to the Euclidean metric on which LVQ is based. Hence a data dimension as small as possible is desirable for the above-mentioned methods in general, for the sake of efficiency, accuracy, and simplicity of neural network processing. Therefore various algorithms exist which make it possible to estimate the intrinsic dimension of the data: PCA and ICA constitute well-established methods which are often used for adequate preprocessing of data and which can be implemented with neural methods [15, 25]. A Grassberger-Procaccia analysis estimates the dimensionality of attractors in a dynamic system [12]. SOMs which adapt the dimensionality of the lattice during training, like the growing SOM (GSOM), automatically determine the approximate dimensionality of the data [2]. Naturally, all adaptation schemes which determine weighting factors or relevance terms for the input dimensions constitute an alternative method for determining the dimensionality: the dimensions which are ranked as least important, i.e. those possessing the smallest relevance terms, can be dropped. The intrinsic dimensionality is reached when an appropriate quality measure such as an error term changes significantly. There exists a wide variety of input relevance determination methods in statistics and in the field of supervised neural networks, e.g. pruning algorithms for feedforward networks as proposed in [10], the application of automatic relevance determination for the support vector machine or Gaussian processes [9, 24, 31], or adaptive ridge regression and the incorporation of penalizing functions as proposed in [11, 28, 30]. However, note that our focus lies on improving metric-based algorithms by means of an adaptive metric which allows dimensionality reduction as a byproduct. The above-mentioned methods do not yield a metric which could be used in self-organizing algorithms; they primarily pursue the goal of sparsity and dimensionality reduction in neural network architectures or alternative classifiers.

In the following, we will focus on LVQ since it combines the elegance of simple and intuitive updates in unsupervised algorithms with the accuracy of supervised methods. We will propose a possibility of automatically scaling the input dimensions and hence adapting the Euclidean metric to the specific training problem. As a byproduct, this leads to a pruning algorithm for irrelevant data dimensions and to the possibility of computing the intrinsic data dimension. Approaches like [16] clearly indicate that often a considerable reduction of the data dimension is possible without loss of information. The main idea of our approach is to introduce weighting factors for the data dimensions which are adapted automatically such that the classification error becomes minimal. Like in LVQ, the update formulas are intuitive and can be interpreted as Hebbian learning. From a mathematical point of view, the dynamics constitute a stochastic gradient descent on an appropriate error surface. Small factors in the result indicate that the respective data dimension is irrelevant and can be pruned. This idea can be applied to any generalized LVQ (GLVQ) scheme as introduced in [29] or to other plausible error measures such as the Kullback-Leibler divergence. With the error measure of GLVQ, a robust and efficient method results which can push the classification borders close to the optimum Bayesian decision. This method, generalized relevance LVQ (GRLVQ), generalizes relevance LVQ (RLVQ) [3], which is based on simple Hebbian learning and leads to worse and unstable results in the case of noisy real-life data. However, like RLVQ, GRLVQ has the advantage of an intuitive update rule and allows efficient input pruning, compared to other approaches which adapt the metric to the data by means of additional transformations, as proposed in [8, 13, 34], or which depend on less intuitive differentiable approximations of the original dynamics [21]. Moreover, it is based on a gradient dynamics, in contrast to heuristic methods like DSLVQ [26].

We will verify our method on various small data sets. Moreover, we will apply GRLVQ to classify a real-life satellite image with several million data points. As already mentioned, weighting factors allow us to approximately determine the intrinsic data dimensionality. An alternative method is the growing SOM (GSOM), which automatically adapts the lattice of neurons to the data and hence also gives hints about the intrinsic dimensionality. We compare our GRLVQ experiments to the results provided by GSOM. In addition, we relate them to a Grassberger-Procaccia analysis. We obtain comparable results concerning the intrinsic dimensionality of our data. In the following, we will first introduce our method GRLVQ, then present applications to simple artificial and real-life data, and finally discuss the results for the satellite data.

    2 The GRLVQ Algorithm

Assume a finite training set $X = \{(x^i, y^i) \in \mathbb{R}^n \times \{1,\dots,C\} \mid i = 1,\dots,m\}$ of training data is given and the clustering of the data into the $C$ classes is to be learned. We denote the components of a vector $x \in \mathbb{R}^n$ by $(x_1, \dots, x_n)$ in the following. GLVQ chooses a fixed number of vectors in $\mathbb{R}^n$ for each class, the so-called prototypes. Denote the set of prototypes by $\{w^1, \dots, w^M\}$ and assign the label $c_r = c$ to $w^r$ iff $w^r$ belongs to class $c$.


The GLVQ error function (1) compares, for every training point, the distance to the closest prototype with the same label with the distance to the closest prototype with a different label. This choice is attractive since it combines adaptation near the optimum Bayesian borders, like LVQ2.1, while prohibiting the possible divergence of LVQ2.1 as reported in [29]. Writing $d^{+}$ for the squared Euclidean distance of a data point $x$ to the closest prototype $w^{+}$ with the same label, $d^{-}$ for the squared Euclidean distance to the closest prototype $w^{-}$ with a different label, $\mu(x) = (d^{+}-d^{-})/(d^{+}+d^{-})$, and sgd for the logistic function, we refer to the resulting stochastic gradient descent on (1) as the GLVQ update:

$$\Delta w^{+} = \epsilon\,\mathrm{sgd}'\!\big(\mu(x)\big)\,\frac{d^{-}}{(d^{+}+d^{-})^{2}}\,(x - w^{+}), \qquad \Delta w^{-} = -\,\epsilon\,\mathrm{sgd}'\!\big(\mu(x)\big)\,\frac{d^{+}}{(d^{+}+d^{-})^{2}}\,(x - w^{-}), \tag{3}$$

where $\epsilon > 0$ is the learning rate and constant factors are absorbed in $\epsilon$.
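For illustration, the following minimal sketch implements one stochastic GLVQ step along the lines of (3) in NumPy; the function names, the choice of the logistic function for sgd, and the learning rate are our illustrative choices rather than anything prescribed by the paper.

    import numpy as np

    def sgd(t):
        """Logistic function used as the sigmoidal transfer function."""
        return 1.0 / (1.0 + np.exp(-t))

    def glvq_step(x, y, prototypes, labels, lr=0.05):
        """One stochastic GLVQ update (3) for a sample x with label y.
        prototypes: float array (M, n), modified in place; labels: array (M,)."""
        d = np.sum((prototypes - x) ** 2, axis=1)          # squared Euclidean distances
        same, other = labels == y, labels != y
        jp = np.where(same)[0][np.argmin(d[same])]         # closest correct prototype
        jm = np.where(other)[0][np.argmin(d[other])]       # closest wrong prototype
        dp, dm = d[jp], d[jm]
        s = sgd((dp - dm) / (dp + dm))
        g = s * (1.0 - s)                                  # derivative of the logistic at mu(x)
        prototypes[jp] += lr * g * dm / (dp + dm) ** 2 * (x - prototypes[jp])
        prototypes[jm] -= lr * g * dp / (dp + dm) ** 2 * (x - prototypes[jm])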

Obviously, the success of GLVQ crucially depends on the fact that the Euclidean metric is appropriate for the data, i.e. that the input dimensions are approximately equally scaled and equally important. Here, we introduce input weights $\lambda_1, \dots, \lambda_n \ge 0$ in order to allow a different scaling of the input dimensions, hence making possibly time-consuming preprocessing of the data superfluous. Substituting the squared Euclidean metric $|x - w|^2$ by its scaled variant

$$|x - w|_{\lambda}^{2} = \sum_{i=1}^{n} \lambda_i \,(x_i - w_i)^{2}, \tag{4}$$

the receptive field of prototype $w^{r}$ becomes

$$R^{r}_{\lambda} = \{\, x \in X \mid \forall w^{s}: \ |x - w^{r}|_{\lambda} \le |x - w^{s}|_{\lambda} \,\}.$$

Using these weighted receptive fields in the error function (1) yields a different weighting of the input dimensions and hence an adaptive metric. Appropriate weighting factors $\lambda_i$ can be determined automatically via a stochastic gradient descent as well. Hence the rule (2), with the relevance factors $\lambda$ integrated into the metric, is accompanied by the update

$$\lambda_j := \begin{cases} \lambda_j - \epsilon_1 \,(x_j - w^{l}_j)^{2} & \text{if } x \text{ is classified correctly by its closest prototype } w^{l}, \\ \lambda_j + \epsilon_1 \,(x_j - w^{l}_j)^{2} & \text{otherwise,} \end{cases} \tag{5}$$

for each $j$, where $\epsilon_1 \in (0,1)$. We add a normalization such that $\sum_j \lambda_j = 1$, so that numerical instabilities of the weighting factors are avoided. This update constitutes RLVQ as proposed in [3].
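A corresponding sketch of the weighted distance (4) and the relevance update (5) could look as follows; the helper names, the clipping of negative weights, and the learning rate are again our illustrative choices.

    import numpy as np

    def weighted_sq_dist(x, w, lam):
        """Scaled squared Euclidean distance (4): sum_i lam_i * (x_i - w_i)^2."""
        return np.sum(lam * (x - w) ** 2)

    def rlvq_relevance_step(x, y, prototypes, labels, lam, lr_rel=0.01):
        """Relevance update (5) followed by clipping and normalization."""
        d = np.array([weighted_sq_dist(x, w, lam) for w in prototypes])
        winner = np.argmin(d)                       # closest prototype w.r.t. the weighted metric
        contrib = (x - prototypes[winner]) ** 2     # per-dimension contribution to that distance
        if labels[winner] == y:                     # correct classification: decrease the factors
            lam = lam - lr_rel * contrib
        else:                                       # wrong classification: increase them
            lam = lam + lr_rel * contrib
        lam = np.maximum(lam, 0.0)                  # keep the weights non-negative
        return lam / lam.sum()                      # normalize so that the weights sum to one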

We remark that this update can be interpreted in a Hebbian way: assume the nearest prototype $w^{l}$ is correct; then only those weighting factors are decreased slightly for which the term $(x_j - w^{l}_j)^{2}$ is small. Taking the normalization of the weighting factors into account, the weighting factors are increased in this situation iff they contribute to the correct classification. Conversely, those factors are increased most for which the term $(x_j - w^{l}_j)^{2}$ is large if the classification is wrong; hence, if the classification is wrong, precisely those weighting factors are increased which do not contribute to the wrong classification. Since the error function is not continuous in this case, this yields merely a plausible explanation of the update rule. However, it is not surprising that the method shows instabilities for large data sets which are subject to noise, as we will see later.

We can apply the same idea to GLVQ. The modification of (3) which involves the relevance factors $\lambda$ of the metric is then accompanied by

$$\lambda_j := \lambda_j - \epsilon_1\,\mathrm{sgd}'\!\big(\mu_{\lambda}(x)\big)\left( \frac{d^{-}_{\lambda}}{(d^{+}_{\lambda}+d^{-}_{\lambda})^{2}}\,(x_j - w^{+}_j)^{2} \;-\; \frac{d^{+}_{\lambda}}{(d^{+}_{\lambda}+d^{-}_{\lambda})^{2}}\,(x_j - w^{-}_j)^{2} \right) \tag{6}$$

for each $j$, $w^{+}$ and $w^{-}$ being the closest correct and wrong prototype, respectively, and $d^{+}_{\lambda}$ and $d^{-}_{\lambda}$ the respective squared distances in the weighted Euclidean metric, with $\mu_{\lambda}(x) = (d^{+}_{\lambda}-d^{-}_{\lambda})/(d^{+}_{\lambda}+d^{-}_{\lambda})$. Again, this is followed by normalization. We term this generalization of RLVQ and GLVQ generalized relevance learning vector quantization, GRLVQ for short. Note that the update can be motivated intuitively by the Hebb paradigm, taking the normalization into account: it comprises the same terms as (5). Hence those weighting factors are reinforced most whose coefficients are closest to the respective data point if this point is classified correctly; otherwise, if $x$ is classified wrongly, those factors are reinforced most whose coefficients are far away. The difference between (6) and (5) consists in appropriate situation-dependent weightings of the two terms and in the simultaneous update according to the closest correct and the closest wrong prototype. Besides, the update rule obeys a gradient dynamics on the corresponding error function (1), as we show in the appendix.

Obviously, the same idea could be applied to any gradient dynamics. We could, for example, minimize a different error function such as the Kullback-Leibler divergence between the distribution which is to be learned and the distribution which is implemented by the vector quantizer. Moreover, this approach is not limited to supervised tasks: we could enlarge unsupervised methods which obey a gradient dynamics, such as the neural gas algorithm [20], with weighting factors in order to obtain an adaptive metric.
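Putting the pieces together, one GRLVQ training step, i.e. the prototype update (3) taken with respect to the weighted metric (4) together with the relevance update (6), can be sketched as follows; the learning rates, the clipping of negative weights, and the in-place updates are our choices, and constant factors are absorbed in the learning rates.

    import numpy as np

    def sgd(t):
        return 1.0 / (1.0 + np.exp(-t))

    def grlvq_step(x, y, prototypes, labels, lam, lr=0.05, lr_rel=0.005):
        """One stochastic GRLVQ step: weighted-metric prototype update plus
        relevance update (6); prototypes is a float array (M, n), lam has shape (n,)."""
        d = np.sum(lam * (prototypes - x) ** 2, axis=1)    # weighted squared distances (4)
        same, other = labels == y, labels != y
        jp = np.where(same)[0][np.argmin(d[same])]         # closest correct prototype
        jm = np.where(other)[0][np.argmin(d[other])]       # closest wrong prototype
        dp, dm = d[jp], d[jm]
        s = sgd((dp - dm) / (dp + dm))
        g = s * (1.0 - s)                                  # sgd'(mu_lambda(x))
        xi_p = dm / (dp + dm) ** 2
        xi_m = dp / (dp + dm) ** 2
        cp, cm = x - prototypes[jp], x - prototypes[jm]    # offsets taken before the updates
        prototypes[jp] += lr * g * xi_p * lam * cp         # lam enters via the weighted metric
        prototypes[jm] -= lr * g * xi_m * lam * cm
        lam -= lr_rel * g * (xi_p * cp ** 2 - xi_m * cm ** 2)
        np.maximum(lam, 0.0, out=lam)                      # clip and renormalize the relevances
        lam /= lam.sum()
        return prototypes, lam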

    3 Relation to previous research

The main characteristics of GRLVQ as proposed in the previous section are as follows: the method allows an adaptive metric via scaling the input dimensions, the metric being restricted to a diagonal matrix. The advantages are the efficiency of the method, the interpretability of the matrix elements as relevance factors, and the related possibility of pruning. The update proposed in GRLVQ is intuitive and efficient, and at the same time a thorough mathematical foundation is available due to the gradient dynamics. As we will see in the next section, GRLVQ provides a robust classification system which is appropriate for real-life data.

Naturally, various approaches in the literature consider the questions of an adaptive metric, input pruning, and dimensionality determination, too. The most similar approach we are aware of is distinction sensitive LVQ (DSLVQ) [26]. This method introduces weighting factors, too, and is based on LVQ3. The main advantages of our iterative update scheme compared to the DSLVQ update are threefold: our update is very intuitive and can be explained as Hebbian learning; our method is more efficient, since each DSLVQ update step requires normalizing twice; and, what we believe is the most important difference, our update constitutes a gradient descent on an error function, hence the dynamics can be analyzed mathematically and a clear objective can be identified.

Recently, Kaski et al. proposed two different approaches which allow an adaptive metric for unsupervised clustering if additional information in an auxiliary space is available [17, 32]. Their focus lies on unsupervised clustering, and they use a Bayesian framework in order to derive appropriate algorithms. The approach in [17] explicitly adapts the metric; however, it needs a model explaining the auxiliary data, hence we cannot apply it for our purpose, where the explicit clustering itself constitutes the model to be developed. In [32] an explicit model is no longer necessary; however, the method relies on several statistical assumptions and is derived for soft clustering instead of exact LVQ. One could borrow ideas from [32]; alternatively to this statistical scenario, GRLVQ offers another direct, efficient, and intuitive approach.

Methods as proposed in [13] and variations thereof allow an adaptive metric for other clustering algorithms such as fuzzy clustering. The algorithm in [13] even allows a more flexible metric with non-vanishing entries outside the diagonal; however, such algorithms are naturally less efficient and require, for example, a matrix inversion. In addition, well-known methods like RBF networks can be put in the same line, since they can provide a clustering with an adaptive metric as well; commonly, their training is less intuitive and less efficient than GRLVQ. Moreover, a more flexible metric which is not restricted to a diagonal matrix no longer suggests a natural pruning scheme.

Apart from the flexibility due to an adaptive metric, GRLVQ provides a simple way of determining which data dimensions are relevant: we can just drop the dimensions with the lowest weighting factors until a considerable increase of the classification error is observed. This is a common feature of all methods which determine weighting factors describing the metric. Alternatively, one can use general methods for determining the dimensionality of the data which are not tailored to the LVQ classifier. The most popular approaches are probably ICA and PCA, as already mentioned [15, 25]. Alternatively, one could use the above-mentioned GSOM algorithm [2]; however, because of its remaining hypercubical structure, the results may be inaccurate. Another method is to apply a Grassberger-Procaccia analysis to determine the intrinsic dimension; this method is unfortunately sensitive to noise [12, 38]. A wide variety of relevance determination methods exists in statistics and in the supervised neural network literature, e.g. [9, 10, 11, 24, 28, 30, 31]. These methods mostly focus on the task of obtaining sparse classifiers and do not yield an adaptive metric which could be used in self-organizing, metric-based algorithms like LVQ and SOM. Hence a comparison with our method, which primarily focuses on an adaptive metric for self-organizing algorithms, would be interesting but is beyond the scope of this article.
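As an illustration of this relevance-based pruning, the following sketch drops the dimensions with the lowest weighting factors one at a time until the accuracy degrades noticeably; the callback eval_accuracy, which is assumed to retrain and evaluate a classifier (e.g. GRLVQ) on the reduced data, and the tolerance are hypothetical.

    import numpy as np

    def prune_by_relevance(X, y, lam, eval_accuracy, tol=0.02):
        """Greedy pruning guided by the relevance profile lam."""
        order = np.argsort(lam)                  # dimensions from least to most relevant
        kept = list(range(X.shape[1]))
        baseline = eval_accuracy(X, y)
        for dim in order[:-1]:                   # always keep at least one dimension
            candidate = [d for d in kept if d != dim]
            acc = eval_accuracy(X[:, candidate], y)
            if baseline - acc > tol:             # considerable increase of the error: stop
                break
            kept = candidate
        return kept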

    4 Experiments

    Artificial data

We first tested GRLVQ on two artificial data sets from [3] in order to compare it to RLVQ. We refer to the sets as data 1 and data 2, respectively. Each set comprises three classes with two clusters each, with small or large overlap of the clusters, respectively, in two dimensions, as depicted in Fig. 1. We embed the points in $\mathbb{R}^{10}$ as follows: assume $(x_1, x_2)$ is a data point; then we add eight dimensions, obtaining a point $(x_1, \dots, x_{10})$. We choose $x_3 = x_1 + \eta_1, \dots, x_6 = x_1 + \eta_4$, where the $\eta_i$ comprise Gaussian noise of increasing variance. The dimensions $x_7, \dots, x_{10}$ contain pure noise which is either uniformly distributed (in two different ranges) or Gaussian (with two different variances). We refer to the resulting noisy data sets as data 3 and data 4, respectively.

Figure 1: Artificial data sets consisting of three classes with two clusters each and small or large overlap, respectively; only the first two dimensions are depicted.
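Since the exact noise parameters did not survive in this transcript, the following sketch only illustrates the type of construction described above; the variances, the uniform ranges, and the cluster stand-in are made-up values.

    import numpy as np

    def embed_with_noise(points, rng, gauss_sds=(0.05, 0.1, 0.2, 0.5)):
        """Append noisy copies of x1 and pure-noise dimensions to 2-D points."""
        n = len(points)
        x1 = points[:, :1]
        noisy_copies = x1 + rng.normal(0.0, gauss_sds, size=(n, len(gauss_sds)))
        uniform_noise = rng.uniform(-0.5, 0.5, size=(n, 2))   # two pure-noise dimensions
        gaussian_noise = rng.normal(0.0, 0.5, size=(n, 2))    # two more pure-noise dimensions
        return np.hstack([points, noisy_copies, uniform_noise, gaussian_noise])

    rng = np.random.default_rng(0)
    clusters = rng.uniform(0.0, 1.0, size=(300, 2))           # stand-in for the clustered 2-D data
    X = embed_with_noise(clusters, rng)                       # shape (300, 10)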

In each run, the data are randomly separated into a training and a test set of the same size. The learning rate $\epsilon$ for the prototypes is chosen as a small constant and the learning rate $\epsilon_1$ for the weighting terms is chosen smaller: since the weighting factors are updated in each step, just as the prototypes are, the learning rate for the weighting terms should be smaller than the learning rate for the prototypes. Pre-training with simple LVQ until the prototypes nearly converge is mandatory for RLVQ; otherwise, the classification error is usually large and the results are not stable. It is advisable to train the prototypes with GLVQ for a few epochs before using GRLVQ, too, in order to avoid instabilities. The number of prototypes per class is chosen according to the a priori known cluster structure. The results on training and test set are comparable in all runs, i.e. the test set accuracy is not worse, or only slightly worse, than the accuracy on the training set. GRLVQ obtains about the same accuracy as RLVQ on all data sets (see Tab. 1) and clearly indicates which dimensions are less important by assigning small weighting factors to them; in these examples the irrelevant dimensions are known by construction. Typical weighting vectors found by RLVQ and GRLVQ clearly separate the important first two data dimensions from the remaining eight dimensions, of which the first four, the noisy copies of $x_1$, still contain some information; this is pointed out by a comparably large third weighting term for the second data set. The remaining four dimensions contain no information at all and receive very small weighting factors. However, GRLVQ shows a faster convergence and larger stability compared to RLVQ, in particular on noisy data sets with a large overlap of the classes such as data 4; there, the separation of the important dimensions is clearer with GRLVQ than with RLVQ. Concerning RLVQ, pre-training with LVQ and small learning rates were mandatory in order to ensure good results; these issues turn out to be less critical for GRLVQ, although it is advisable to choose the learning rate for the weighting terms an order of magnitude smaller than the learning rate for the prototype update.


           data 1    data 2    data 3    data 4
    LVQ    91 - 96   81 - 89   79 - 86   56 - 70
    RLVQ   91 - 96   90 - 96   80 - 86   79 - 86
    GRLVQ  94 - 97   93 - 97   83 - 87   83 - 86

Table 1: Percentage of correctly classified patterns (maximum 100) on the two artificial training data sets, data 1 and data 2, without and with additional noisy dimensions (data 3 and data 4), for LVQ, RLVQ, and GRLVQ, respectively.

These results indicate that GRLVQ is particularly well suited for noisy real-life data sets. Based on the above weighting factors, one can obtain a ranking of the input dimensions and drop all but the first two dimensions without increasing the classification error.

    Iris data

In a second test we applied GRLVQ to the well-known Iris data set provided in the UCI repository of machine learning [4]. The task is to predict three classes of plants based on four numerical attributes for 150 instances, i.e., we deal with data points in $\mathbb{R}^{4}$ with labels in $\{1, 2, 3\}$. Both LVQ and RLVQ obtain roughly the same accuracy on a training and a test set if trained with a few prototypes for each class. RLVQ shows a slightly cyclic behaviour in the limit, the accuracy oscillating between two nearby values. The weighting factors computed by RLVQ indicate that, based on the last dimension alone, a very good classification would be possible. If more dimensions were taken into account, a better accuracy would be possible, as reported in the literature; we could not produce such a solution with LVQ or RLVQ. Moreover, a perfect recognition of 100% would correspond to overfitting, since the data comprise a small amount of noise, as reported in the literature. GRLVQ yields a better accuracy on the training as well as the test set and obtains weighting factors indicating that the last dimension is most important, as already found by RLVQ, and that dimension 3 contributes to a better accuracy, which had not been pointed out by RLVQ. Note that the result obtained by GRLVQ coincides with results obtained, e.g., with rule extraction from feedforward networks [6].
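As a usage example, this experiment can be replayed along the following lines with the grlvq_step sketch from Section 2 and scikit-learn's load_iris; the standardization, the initialization, three prototypes per class, and the number of epochs are our choices and not necessarily the setting used by the authors.

    import numpy as np
    from sklearn.datasets import load_iris

    X, y = load_iris(return_X_y=True)
    X = (X - X.mean(axis=0)) / X.std(axis=0)              # simple standardization
    rng = np.random.default_rng(1)

    per_class = 3                                          # prototypes per class (our choice)
    classes = np.unique(y)
    labels = np.repeat(classes, per_class)
    prototypes = np.vstack([X[y == c][rng.choice(np.sum(y == c), per_class, replace=False)]
                            for c in classes]).astype(float)
    lam = np.full(X.shape[1], 1.0 / X.shape[1])

    for epoch in range(100):
        for i in rng.permutation(len(X)):
            prototypes, lam = grlvq_step(X[i], y[i], prototypes, labels, lam)

    print("relevance profile:", np.round(lam, 3))          # large entries mark important attributes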

    Satellite data

Finally, we applied the algorithm to a large real-world data set: a multi-spectral LANDSAT TM satellite image of the Colorado area (thanks to M. Augusteijn, University of Colorado, for providing this image). Satellites of the LANDSAT TM type produce pictures of the earth in seven different spectral bands.

                      LVQ      RLVQ     GLVQ     GRLVQ
    mean (train)      85.21    86.1     87.32    91.08
    variance (train)  0.59     0.18     0.17     0.11
    mean (test)       85.2     86.36    87.28    91.04
    variance (test)   0.46     0.16     0.1      0.13

Table 2: Percentage of correctly classified patterns (maximum 100) and variance of the runs on the satellite data, obtained in a 10-fold cross-validation.

The ground resolution is 30 m x 30 m for bands 1-5 and band 7. Band 6 (the thermal band) has a resolution of 120 m x 120 m only and is therefore often dropped. The spectral bands represent useful domains of the whole spectrum for detecting and discriminating vegetation, water, rock formations, and cultural features [5, 22]. Hence the spectral information, i.e. the intensities of the bands associated with each pixel of a LANDSAT scene, is represented by a vector in $\mathbb{R}^{6}$ once band 6 is dropped. Generally, the bands are highly correlated [1, 35]. Additionally, the Colorado image is completely labeled by experts; the labels describe different vegetation types and geological formations, and the label probabilities vary over a wide range [36]. The image comprises several million pixels.

We trained RLVQ and GRLVQ with a fixed number of prototypes for each class on a subset of the data set until convergence; the algorithms converged within a moderate number of cycles with $\epsilon$ and $\epsilon_1$ chosen as before. RLVQ yields an accuracy of about 86% on the training data as well as on the entire data set; however, it does not provide a ranking of the input dimensions, i.e. all weighting terms remain close to their initial value. GRLVQ leads to the better accuracy of about 91% on the training set as well as on the entire data set and provides a clear ranking of the data dimensions. See Table 2 for a comparison of the results obtained by the various algorithms. In all experiments, one dimension is consistently ranked as least important, with a weighting factor close to 0, and the weighting factors approximate the same profile in several runs; this weighting clearly separates the first two dimensions via small weighting factors. If we prune the three dimensions with the smallest weighting factors, a high accuracy can still be achieved; hence the intrinsic data dimension is at most three. Pruning one additional data dimension still allows a good accuracy, indicating that the intrinsic dimension may be even lower and that the relevant directions are not parallel to the axes or are even curved. These results are visualized in Fig. 2, where the misclassified pixels in the respective cases are colored black and the remaining pixels are colored according to their respective class.

For comparison we applied a Grassberger-Procaccia analysis and the GSOM approach. GSOM generates a lattice whose shape indicates an intrinsic dimension between two and three, and the Grassberger-Procaccia analysis yields an estimate in the same range. Both agree well with the observation that pruning further dimensions with GRLVQ leads to a drastic loss of information.
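For reference, the correlation dimension underlying a Grassberger-Procaccia analysis [12] can be estimated roughly as in the following sketch; the subsampling, the radius grid, and the use of a single global fit are generic choices and not the procedure applied by the authors.

    import numpy as np
    from scipy.spatial.distance import pdist

    def correlation_dimension(X, radii, sample=2000, seed=0):
        """Fit the slope of log C(r) vs. log r, where C(r) is the fraction of
        point pairs closer than r (the Grassberger-Procaccia correlation integral)."""
        rng = np.random.default_rng(seed)
        radii = np.asarray(radii, dtype=float)
        if len(X) > sample:                          # subsample to keep pdist tractable
            X = X[rng.choice(len(X), sample, replace=False)]
        dists = pdist(X)
        c = np.array([np.mean(dists < r) for r in radii])
        mask = c > 0                                 # avoid log(0) for very small radii
        slope, _ = np.polyfit(np.log(radii[mask]), np.log(c[mask]), 1)
        return slope

    # Example (assuming X holds one spectral vector per pixel):
    # print(correlation_dimension(X, np.logspace(-2, 0, 10)))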


Figure 2: Colorado satellite image: the pixels are colored according to the labels; above left: original labeling; above right: GRLVQ without pruning; below left: GRLVQ with the three least relevant dimensions pruned; below right: GRLVQ with one further dimension pruned. Misclassified pixels in the GRLVQ-generated images are colored black. (A colored version of the image can be obtained from the authors on request.)

    5 Conclusions

The presented clustering algorithm GRLVQ provides a new robust method for automatically adapting the Euclidean metric used for clustering to the data, for determining the relevance of the several input dimensions for the overall classifier, and for estimating the intrinsic dimension of the data. It reduces the input to the essential dimensions, which is required for obtaining optimal network structures. This is an important feature if the network is used to reduce the amount of data passed to subsequent systems in complex data analysis tasks, as found, for example, in medical applications (image analysis) or satellite remote sensing systems. Here, reducing the data to be transferred while preserving the essential information in the data is one of the most important requirements.

The GRLVQ algorithm was successfully tested on artificial as well as real-world data, including a large and noisy multi-spectral satellite image. A comparison with other approaches validates the results even in real-life applications.

It should be noted that the GRLVQ algorithm can easily be adapted to other types of neural vector quantizers such as neural gas or the SOM, to mention just a few. Furthermore, if we assume an unknown probability distribution of the labels for a given data set, the variant of GRLVQ discussed here tries to maximize the Kullback-Leibler divergence; hence, with respect to this feature, our approach shows some similarities to the work of Kaski [17, 32].

Further development of GRLVQ should incorporate information-theoretic approaches such as entropy maximization to improve the capabilities of the network.

    References

[1] M. F. Augusteijn, K. A. Shaw, and R. J. Watson. A study of neural network input data for ground cover identification in satellite images. In S. Gielen and B. Kappen, editors, Proc. ICANN'93, Int. Conf. on Artificial Neural Networks, pages 1010-1013, London, UK, 1993. Springer.

[2] H.-U. Bauer and T. Villmann. Growing a hypercubical output space in a self-organizing feature map. IEEE Transactions on Neural Networks, 8(2):218-226, 1997.

[3] T. Bojer, B. Hammer, D. Schunk, and K. Tluk von Toschanowitz. Relevance determination in learning vector quantization. In Proc. of European Symposium on Artificial Neural Networks (ESANN'01), pages 271-276, Brussels, Belgium, 2001. D-facto publications.

[4] C. L. Blake and C. J. Merz. UCI Repository of Machine Learning Databases. Irvine, CA: University of California, Department of Information and Computer Science.

[5] J. Campbell. Introduction to Remote Sensing. The Guilford Press, U.S.A., 1996.

[6] W. Duch, R. Adamczak, and K. Grabczewski. A new method of extraction, optimization and application of crisp and fuzzy logical rules. IEEE Transactions on Neural Networks, 12:277-306, 2001.

[7] B. Fritzke. Growing grid: a self-organizing network with constant neighborhood range and adaptation strength. Neural Processing Letters, 2(5):9-13, 1995.

[8] I. Gath and A. Geva. Unsupervised optimal fuzzy clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11:773-781, 1989.

[9] T. van Gestel, J. A. K. Suykens, B. de Moor, and J. Vandewalle. Automatic relevance determination for least squares support vector machine classifiers. In M. Verleysen, editor, European Symposium on Artificial Neural Networks, pages 13-18, 2001.

[10] Y. Grandvalet. Anisotropic noise injection for input variables relevance determination. IEEE Transactions on Neural Networks, 11(6):1201-1212, 2000.

[11] Y. Grandvalet. Least absolute shrinkage is equivalent to quadratic penalization. In L. Niklasson, M. Boden, and T. Ziemke, editors, ICANN'98, volume 1 of Perspectives in Neural Computing, pages 201-206. Springer, 1998.

[12] P. Grassberger and I. Procaccia. Measuring the strangeness of strange attractors. Physica D, 9:189-208, 1983.

[13] D. Gustafson and W. Kessel. Fuzzy clustering with a fuzzy covariance matrix. In Proceedings of IEEE CDC'79, pages 761-766, 1979.

[14] T. Hofmann. Learning the similarity of documents: An information geometric approach to document retrieval and categorization. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems, volume 12, pages 914-920. MIT Press, 2000.

[15] A. Hyvärinen and E. Oja. A fast fixed-point algorithm for independent component analysis. Neural Computation, 9(7):1483-1492, 1997.

[16] S. Kaski. Dimensionality reduction by random mapping: fast similarity computation for clustering. In Proceedings of IJCNN'98, pages 413-418, 1998.

[17] S. Kaski. Bankruptcy analysis with self-organizing maps in learning metrics. To appear in IEEE Transactions on Neural Networks.

[18] T. Kohonen. Learning vector quantization. In M. Arbib, editor, The Handbook of Brain Theory and Neural Networks, pages 537-540. MIT Press, 1995.

[19] T. Kohonen. Self-Organizing Maps. Springer, 1997.

[20] T. Martinetz and K. Schulten. Topology representing networks. Neural Networks, 7(3):507-522, 1993.

[21] U. Matecki. Automatische Merkmalsauswahl für Neuronale Netze mit Anwendung in der pixelbezogenen Klassifikation von Bildern. Shaker, 1999.

[22] E. Merényi. The challenges in spectral image analysis: An introduction and review of ANN approaches. In Proc. of European Symposium on Artificial Neural Networks (ESANN'99), pages 93-98, Brussels, Belgium, 1999. D-facto publications.

[23] A. Meyering and H. Ritter. Learning 3D-shape-perception with local linear maps. In Proceedings of IJCNN'92, pages 432-436, 1992.

[24] R. Neal. Bayesian Learning for Neural Networks. Springer, 1996.

[25] E. Oja. Principal component analysis. In M. Arbib, editor, The Handbook of Brain Theory and Neural Networks, pages 753-756. MIT Press, 1995.

[26] M. Pregenzer, G. Pfurtscheller, and D. Flotzinger. Automated feature selection with distinction sensitive learning vector quantization. Neurocomputing, 11:19-29, 1996.

[27] H. Ritter. Self-organizing maps in non-Euclidean spaces. In E. Oja and S. Kaski, editors, Kohonen Maps, pages 97-108. Springer, 1999.

[28] V. Roth. Sparse kernel regressors. In G. Dorffner, H. Bischof, and K. Hornik, editors, Artificial Neural Networks - ICANN 2001, pages 339-346. Springer, 2001.

[29] A. Sato and K. Yamada. Generalized learning vector quantization. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems, volume 7, pages 423-429. MIT Press, 1995.

[30] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267-288, 1996.

[31] M. Tipping. The relevance vector machine. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems, volume 12, pages 652-658. MIT Press, 2000.

[32] J. Sinkkonen and S. Kaski. Clustering based on conditional distribution in an auxiliary space. To appear in Neural Computation.

[33] P. Somervuo and T. Kohonen. Self-organizing maps and learning vector quantization for feature sequences. Neural Processing Letters, 10(2):151-159, 1999.

[34] M.-K. Tsay, K.-H. Shyu, and P.-C. Chang. Feature transformation with generalized learning vector quantization for hand-written Chinese character recognition. IEICE Transactions on Information and Systems, E82-D(3):687-692, 1999.

[35] T. Villmann and E. Merényi. Extensions and modifications of the Kohonen-SOM and applications in remote sensing image analysis. In U. Seiffert and L. C. Jain, editors, Self-Organizing Maps: Recent Advances and Applications, pages 121-145. Springer, 2001.

[36] T. Villmann. Benefits and limits of the self-organizing map and its variants in the area of satellite remote sensing processing. In Proc. of European Symposium on Artificial Neural Networks (ESANN'99), pages 111-116, Brussels, Belgium, 1999. D-facto publications.

[37] T. Villmann, R. Der, M. Herrmann, and T. Martinetz. Topology preservation in self-organizing feature maps: Exact definition and measurement. IEEE Transactions on Neural Networks, 8(2):256-266, 1997.

[38] W. Wienholt. Entwurf Neuronaler Netze. Verlag Harri Deutsch, Frankfurt/M., Germany, 1996.

    Appendix

The general error function (1) here takes the special form

$$E_{\lambda} = \sum_{i=1}^{m} \mathrm{sgd}\!\left(\frac{d^{+}_{\lambda}(x^{i}) - d^{-}_{\lambda}(x^{i})}{d^{+}_{\lambda}(x^{i}) + d^{-}_{\lambda}(x^{i})}\right),$$

$d^{+}_{\lambda}$ and $d^{-}_{\lambda}$ being the quadratic weighted distances to the closest correct and wrong prototype, $w^{+}$ and $w^{-}$, respectively. For convenience we denote $d^{r}_{\lambda}(x) = |x - w^{r}|^{2}_{\lambda} = \sum_{i=1}^{n} \lambda_i (x_i - w^{r}_i)^{2}$ for every prototype $w^{r}$. Assume data come from a distribution $P$ on the input space $\mathbb{R}^{n}$ together with a labeling function $c: \mathbb{R}^{n} \to \{1, \dots, C\}$. Then the continuous version of the error function reads

$$E_{\lambda} = \int \mathrm{sgd}\!\left(\frac{d^{+}_{\lambda}(x) - d^{-}_{\lambda}(x)}{d^{+}_{\lambda}(x) + d^{-}_{\lambda}(x)}\right) dP(x).$$

We assume that the sets $\{x \mid c(x) = c\}$ are measurable. Then we can write the error term in the following way:

$$E_{\lambda} = \sum_{c=1}^{C} \int_{c(x)=c} \sum_{r \in W_{c}} \sum_{s \in \bar W_{c}} \mathrm{sgd}\!\left(\frac{d^{r}_{\lambda}(x) - d^{s}_{\lambda}(x)}{d^{r}_{\lambda}(x) + d^{s}_{\lambda}(x)}\right) \chi^{c}_{r}(x)\,\bar\chi^{c}_{s}(x)\, dP(x), \tag{7}$$

where $W_{c}$ denotes the set of indices of prototypes labeled with $c$, $\bar W_{c}$ the set of indices of prototypes not labeled with $c$, $\chi^{c}_{r}(x)$ is the indicator function of the event that $w^{r}$ is the closest prototype to $x$ among those labeled with $c$, and $\bar\chi^{c}_{s}(x)$ is the indicator function of the event that $w^{s}$ is the closest prototype to $x$ among those not labeled with $c$. Denote by $H$ the Heaviside function and by $|W_{c}|$ and $|\bar W_{c}|$ the number of prototypes labeled with $c$ and not labeled with $c$, respectively. Then we find

$$\chi^{c}_{r}(x) = H\!\left(\sum_{k \in W_{c}} H\!\left(d^{k}_{\lambda}(x) - d^{r}_{\lambda}(x)\right) - |W_{c}|\right) \quad\text{and}\quad \bar\chi^{c}_{s}(x) = H\!\left(\sum_{k \in \bar W_{c}} H\!\left(d^{k}_{\lambda}(x) - d^{s}_{\lambda}(x)\right) - |\bar W_{c}|\right),$$

with the convention $H(0) = 1$. The derivative of the Heaviside function is the delta distribution $\delta$, which is symmetric and vanishes for all arguments different from $0$.

We are interested in the derivative of (7) with respect to every $w^{r}$ and every $\lambda_j$. Assume $c_r$ is the label of $w^{r}$, and write $\mu_{rs}(x) = (d^{r}_{\lambda} - d^{s}_{\lambda})/(d^{r}_{\lambda} + d^{s}_{\lambda})$. The derivative of (7) with respect to $w^{r}$ splits into four contributions: the two terms in which the derivative acts on the sigmoidal factor,

$$\int_{c(x)=c_r} \sum_{s \in \bar W_{c_r}} \mathrm{sgd}'\!\big(\mu_{rs}(x)\big)\, \frac{2\, d^{s}_{\lambda}}{(d^{r}_{\lambda} + d^{s}_{\lambda})^{2}}\; \frac{\partial d^{r}_{\lambda}}{\partial w^{r}}\; \chi^{c_r}_{r}(x)\,\bar\chi^{c_r}_{s}(x)\, dP(x) \tag{8}$$

and

$$-\sum_{c \neq c_r} \int_{c(x)=c} \sum_{k \in W_{c}} \mathrm{sgd}'\!\big(\mu_{kr}(x)\big)\, \frac{2\, d^{k}_{\lambda}}{(d^{k}_{\lambda} + d^{r}_{\lambda})^{2}}\; \frac{\partial d^{r}_{\lambda}}{\partial w^{r}}\; \chi^{c}_{k}(x)\,\bar\chi^{c}_{r}(x)\, dP(x), \tag{9}$$

with $\partial d^{r}_{\lambda}/\partial w^{r} = -2\,\lambda \odot (x - w^{r})$, and the two terms (10) and (11) in which the derivative acts on the indicator functions $\chi^{c_r}_{r}$ and $\bar\chi^{c}_{r}$, respectively. (8) and (9) correspond, up to a constant factor, to the prototype update (3) taken with respect to the weighted metric, performed as a stochastic gradient step for every sampled pattern. (10) and (11) vanish for the following reason: differentiating the indicator functions yields delta distributions $\delta\big(d^{k}_{\lambda}(x) - d^{r}_{\lambda}(x)\big)$ which are non-vanishing only on the borders where two weighted distances coincide; since $\delta$ is symmetric, the contributions of the prototypes $w^{k}$ and $w^{r}$ on such a border cancel, so that each integrand of (10) and (11) vanishes.

The derivative of (7) with respect to $\lambda_j$ can be computed in the same manner. Using $\partial d^{r}_{\lambda}/\partial \lambda_j = (x_j - w^{r}_j)^{2}$, the terms in which the derivative acts on the sigmoidal factor read

$$\sum_{c=1}^{C} \int_{c(x)=c} \sum_{r \in W_{c}} \sum_{s \in \bar W_{c}} \mathrm{sgd}'\!\big(\mu_{rs}(x)\big) \left( \frac{2\, d^{s}_{\lambda}}{(d^{r}_{\lambda} + d^{s}_{\lambda})^{2}}\,(x_j - w^{r}_j)^{2} - \frac{2\, d^{r}_{\lambda}}{(d^{r}_{\lambda} + d^{s}_{\lambda})^{2}}\,(x_j - w^{s}_j)^{2} \right) \chi^{c}_{r}(x)\,\bar\chi^{c}_{s}(x)\, dP(x); \tag{12, 13}$$

they correspond, up to a constant factor, to the relevance update (6). The remaining terms (14) and (15), in which the derivative acts on the indicator functions, again contain only delta distributions of differences of weighted distances; by the same symmetry argument as above, their integrands vanish. Hence the update of GRLVQ constitutes a stochastic gradient descent method with appropriate choices of the learning rates.
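As a quick numerical sanity check of this gradient property (not part of the paper), one can compare the analytic derivative of $\mathrm{sgd}(\mu_{\lambda}(x))$ with respect to $\lambda_j$, which is what (6) descends up to constants, against a central finite difference for a single pattern with the winners held fixed; the toy data below are arbitrary.

    import numpy as np

    def sgd(t):
        return 1.0 / (1.0 + np.exp(-t))

    def cost_term(x, wp, wm, lam):
        """sgd(mu_lambda(x)) for fixed closest correct/wrong prototypes wp, wm."""
        dp = np.sum(lam * (x - wp) ** 2)
        dm = np.sum(lam * (x - wm) ** 2)
        return sgd((dp - dm) / (dp + dm))

    rng = np.random.default_rng(0)
    x, wp, wm = rng.normal(size=(3, 5))
    lam = np.full(5, 0.2)

    # analytic derivative with respect to lambda_j
    dp = np.sum(lam * (x - wp) ** 2)
    dm = np.sum(lam * (x - wm) ** 2)
    s = sgd((dp - dm) / (dp + dm))
    analytic = s * (1 - s) * (2 * dm * (x - wp) ** 2 - 2 * dp * (x - wm) ** 2) / (dp + dm) ** 2

    # central finite differences
    eps = 1e-6
    numeric = np.empty(5)
    for j in range(5):
        hi, lo = lam.copy(), lam.copy()
        hi[j] += eps
        lo[j] -= eps
        numeric[j] = (cost_term(x, wp, wm, hi) - cost_term(x, wp, wm, lo)) / (2 * eps)

    print(np.max(np.abs(analytic - numeric)))    # should be close to zero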
