8/3/2019 Barbara Hammer and Thomas Villmann- Generalized Relevance Learning Vector Quantization
Generalized Relevance Learning Vector
Quantization
Barbara Hammer
Thomas Villmann
March 11, 2002
Abstract
We propose a new scheme for enlarging generalized learning vector quantiza-
tion (GLVQ) with weighting factors for the input dimensions. The factors allow
an appropriate scaling of the input dimensions according to their relevance. They
are adapted automatically during training according to the specific classification
task whereby training can be interpreted as stochastic gradient descent on an ap-
propriate error function. This method leads to a more powerful classifier and to an
adaptive metric with little extra cost compared to standard GLVQ. Moreover, the
size of the weighting factors indicates the relevance of the input dimensions. This
suggests a scheme for automatically pruning irrelevant input dimensions. The algorithm
is verified on artificial data sets and the Iris data from the UCI repository.
Afterwards, the method is compared to several well known algorithms which de-
termine the intrinsic data dimension on real world satellite image data.
Keywords: clustering, learning vector quantization, adaptive metric, relevance de-
termination.
1 Introduction
Self-organizing methods such as the self-organizing map (SOM) or vector quantization
(VQ) as introduced by Kohonen provide a successful and intuitive method of processing
data for easy access [18]. Assuming that the data are labeled, an automatic clustering can
be learned by attaching maps to the SOM or by enlarging VQ with a supervised component
to obtain so-called learning vector quantization (LVQ) [19, 23]. Various modifications of
LVQ exist which ensure faster convergence, a better adaptation of the receptive fields
to the optimum Bayesian decision, or an adaptation for complex data structures, to name
just a few [19, 29, 33].

A common feature of unsupervised algorithms and LVQ is that information is provided by
the distance structure between the data points, which is determined by the chosen
metric. Learning heavily relies on the commonly used Euclidean
University of Osnabrück, Department of Mathematics/Computer Science, Albrechtstraße 28, 49069 Osnabrück, Germany
University of Leipzig, Clinic for Psychotherapy and Psychosomatic Medicine, Karl-Tauchnitz-Straße 25, 04107 Leipzig, Germany
metric and hence crucially depends on the Euclidean metric being appropriate for the
respective learning task. Therefore data have to be preprocessed and scaled
appropriately such that the input dimensions have approximately the same importance
for the classification. In particular, the important features for the respective problem
have to be found, which is usually done by experts or with rules of thumb. Of course,
this may be time consuming and requires prior knowledge which is often not available.
Hence methods have been proposed which adapt the metric during training. Distinction
sensitive LVQ (DSLVQ), as an example, automatically determines weighting
factors for the input dimensions of the training data [26]. The algorithm adapts LVQ3
for the weighting factors according to plausible heuristics. The approaches [17, 32]
enhance unsupervised clustering algorithms by the possibility of integrating auxiliary
information such as a labeling into the metric structure. Alternatively, one could use
information geometric methods in order to adapt the metric such as in [14].
Concerning SOM, another major problem consists in finding an appropriate topology
of the initial lattice of prototypes such that the prior topology of the neural
architecture mirrors the intrinsic topology of the data. Hence various heuristics exist to
measure the degree of topology preservation, to adapt the topology to the data, to de-
fine the lattice a posteriori, or to evolve structures which are appropriate for real world
data [2, 7, 20, 27, 37]. In all tasks the intrinsic dimensionality of data plays a cru-
cial role since it determines an important aspect of the optimum neural network: the
topological structure, i.e., the lattice for SOM. Moreover, superfluous data dimensions
slow down the training for LVQ as well. They may even cause a decrease in accuracy
since they add possibly noisy or misleading terms to the Euclidean metric on which
LVQ is based. Hence a data dimension as small as possible is desirable for the
above mentioned methods in general, for the sake of efficiency, accuracy, and simplicity
of neural network processing. Therefore various algorithms exist which allow
estimating the intrinsic dimension of the data: PCA and ICA constitute well established
methods which are often used for adequate preprocessing of data and which can
be implemented with neural methods [15, 25]. A Grassberger-Procaccia analysis esti-
mates the dimensionality of attractors in a dynamic system [12]. SOMs which adapt
the dimensionality of the lattice during training like the growing SOM (GSOM) au-
tomatically determine the approximate dimensionality of the data [2]. Naturally, all
adaptation schemes which determine weighting factors or relevance terms for the input
dimensions constitute an alternative method for determining the dimensionality: the
dimensions which are ranked as least important, i.e. those with the smallest relevance
terms, can be dropped. The intrinsic dimensionality is reached when an appropriate
quality measure such as an error term changes significantly. There exists a wide va-
riety of input relevance determination methods in statistics and the field of supervised
neural networks, e.g. pruning algorithms for feedforward networks as proposed in [10],
the application of adaptive relevance determination for the support vector machine or
Gaussian processes [9, 24, 31], or adaptive ridge regression and the incorporation of
penalizing functions as proposed in [11, 28, 30]. However, note that our focus lies on
improving metric based algorithms via involving an adaptive metric which allows di-
mensionality reduction as a byproduct. The above mentioned methods do not yield
a metric which could be used in self-organizing algorithms but primarily investigate
the goal of sparsity and dimensionality reduction in neural network architectures or
alternative classifiers.
In the following, we will focus on LVQ since it combines the elegance of simple
and intuitive updates in unsupervised algorithms with the accuracy of supervised methods.
We will propose a possibility of automatically scaling the input dimensions and
hence adapting the Euclidean metric to the specific training problem. As a byproduct,
this leads to a pruning algorithm for irrelevant data dimensions and the possibility
of computing the intrinsic data dimension. Approaches like [16] clearly indicate that
often a considerable reduction of the data dimension is possible without loss of
information. The main idea of our approach is to introduce weighting factors to the data
dimensions which are adapted automatically such that the classification error becomes
minimal. Like LVQ, the update formulas are intuitive and can be interpreted as Hebbian
learning. From a mathematical point of view, the dynamics constitute a stochastic
gradient descent on an appropriate error surface. Small factors in the result indicate
that the respective data dimension is irrelevant and can be pruned. This idea can be
applied to any generalized LVQ (GLVQ) scheme as introduced in [29] or other plausible
error measures such as the Kullback-Leibler divergence. With the error measure
of GLVQ, a robust and efficient method results which can push the classification borders
close to the optimum Bayesian decision. This method, generalized relevance LVQ
(GRLVQ), generalizes relevance LVQ (RLVQ) [3], which is based on simple Hebbian
learning and leads to worse and unstable results in case of noisy real life data. However,
like RLVQ, GRLVQ has the advantage of an intuitive update rule and allows efficient
input pruning compared to other approaches which adapt the metric to the data involving
additional transformations as proposed in [8, 13, 34] or depend on less intuitive
differentiable approximations of the original dynamics [21]. Moreover, it is based on a
gradient dynamics, in contrast to heuristic methods like DSLVQ [26].
We will verify our method on various small data sets. Moreover, we will apply
GRLVQ to classify a real life satellite image with approx. [...] million data points. As
already mentioned, weighting factors allow us to approximately determine the intrinsic
data dimensionality. An alternative method is the growing SOM (GSOM) which
automatically adapts the lattice of neurons to the data and hence gives hints about the
intrinsic dimensionality as well. We compare our GRLVQ experiments to the results
provided by GSOM. In addition, we relate it to a Grassberger-Procaccia analysis. We
obtain comparable results concerning the intrinsic dimensionality of our data. In the
following, we will first introduce our method GRLVQ, present applications to simple
artificial and real life data, and finally discuss the results for the satellite data.
2 The GRLVQ Algorithm
Assume a finite training set $X = \{(x^i, y^i) \in \mathbb{R}^n \times \{1, \ldots, C\} \mid i = 1, \ldots, m\}$ of
training data is given and the clustering of the data into $C$ classes is to be learned. We
denote the components of a vector $x \in \mathbb{R}^n$ by $(x_1, \ldots, x_n)$ in the following. GLVQ
chooses a fixed number of vectors in $\mathbb{R}^n$ for each class, so-called prototypes. Denote
the set of prototypes by $\{w^1, \ldots, w^M\}$ and assign the label $c_i = c$ to $w^i$ iff $w^i$ belongs [...]
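since it combines adaptation near the optimum Bayesian borders like LVQ2.1, while
preventing the possible divergence of LVQ2.1 as reported in [29]. We refer to the
update as GLVQ:
\[
\Delta w^{+} = \epsilon \cdot \frac{\mathrm{sgd}'(\mu(x)) \, d^{-}}{(d^{+} + d^{-})^{2}} \, (x - w^{+}), \qquad
\Delta w^{-} = -\epsilon \cdot \frac{\mathrm{sgd}'(\mu(x)) \, d^{+}}{(d^{+} + d^{-})^{2}} \, (x - w^{-})
\tag{3}
\]
Here $w^{+}$ and $w^{-}$ denote the closest correct and the closest wrong prototype for
the data point $x$, and $d^{+}$ and $d^{-}$ the respective squared distances.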
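As a rough illustration of update (3), the following Python sketch performs one stochastic GLVQ step. It assumes that sgd is the logistic function and that $\mu(x) = (d^{+} - d^{-})/(d^{+} + d^{-})$ with squared Euclidean distances, as in the GLVQ formulation of [29]. The function names, the list-based data layout, and the default learning rate are illustrative choices and not taken from the text.

```python
import math

def sgd(t):
    """Logistic function (assumed form of sgd)."""
    return 1.0 / (1.0 + math.exp(-t))

def sq_dist(x, w):
    """Squared Euclidean distance between two equally long lists."""
    return sum((a - b) ** 2 for a, b in zip(x, w))

def glvq_step(x, y, prototypes, labels, eps=0.1):
    """One GLVQ update for pattern x with label y.

    Modifies `prototypes` in place and returns mu(x) computed before the
    update. Assumes every class, and at least one other class, has a
    prototype and that d+ + d- > 0.
    """
    dists = [sq_dist(x, w) for w in prototypes]
    # Closest prototype with the correct label and closest with a wrong label.
    jp = min((j for j, c in enumerate(labels) if c == y), key=lambda j: dists[j])
    jm = min((j for j, c in enumerate(labels) if c != y), key=lambda j: dists[j])
    dp, dm = dists[jp], dists[jm]
    mu = (dp - dm) / (dp + dm)
    s = sgd(mu)
    g = s * (1.0 - s)              # derivative of the logistic function at mu
    denom = (dp + dm) ** 2
    wp, wm = prototypes[jp], prototypes[jm]
    # Attract the closest correct prototype, repel the closest wrong one.
    prototypes[jp] = [w + eps * g * dm / denom * (a - w) for a, w in zip(x, wp)]
    prototypes[jm] = [w - eps * g * dp / denom * (a - w) for a, w in zip(x, wm)]
    return mu
```

A negative return value means that the pattern was classified correctly before the step. Constant factors of the gradient are absorbed into `eps` in this sketch.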
Obviously, the success of GLVQ crucially depends on the fact that the Euclidean
metric is appropriate for the data and the input dimensions are approximately equally
scaled and equally important. Here, we introduce input weights $\lambda = (\lambda_1, \ldots, \lambda_n)$,
$\lambda_i \geq 0$, in order to allow a different scaling of the input dimensions, hence making
possibly time consuming preprocessing of the data superfluous. Substituting the Euclidean
metric $|x - y|$ by its scaled variant
\[
|x - y|^2_\lambda = \sum_{i=1}^{n} \lambda_i (x_i - y_i)^2
\tag{4}
\]
the receptive field of prototype $w^i$ becomes
\[
R^i_\lambda = \{ x \in X \mid \forall w^j : |x - w^i|_\lambda \leq |x - w^j|_\lambda \}.
\]
Replacing $R^i$ by $R^i_\lambda$ in the error function in (1) yields a different weighting of the
input dimensions and hence an adaptive metric. Appropriate weighting factors $\lambda$ can
be determined automatically via a stochastic gradient descent as well. Hence the rule
(2), where the relevance factors $\lambda$ of the metric are integrated, is accompanied by the
update
\[
\lambda_l :=
\begin{cases}
\lambda_l - \epsilon_1 (x_l - w^j_l)^2 & \text{if } c_j = y \\
\lambda_l + \epsilon_1 (x_l - w^j_l)^2 & \text{otherwise}
\end{cases}
\tag{5}
\]
for each $l$, where $w^j$ is the prototype closest to the data point $x$ with label $y$ and
$\epsilon_1 \in (0, 1)$. We add a normalization to obtain $|\lambda| = 1$ such that we
avoid numerical instabilities for the weighting factors. This update constitutes RLVQ
as proposed in [3].
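The scaled metric (4) and the relevance update (5) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the normalization is taken to mean that the factors sum to one, negative factors are clipped to zero before normalizing (a safeguard added here, not stated in the text), and the function names and the default value of `eps1` are hypothetical.

```python
def weighted_sq_dist(x, w, lam):
    """Scaled squared distance (4): sum_i lam_i * (x_i - w_i)^2."""
    return sum(l * (a - b) ** 2 for l, a, b in zip(lam, x, w))

def rlvq_relevance_step(x, y, prototypes, labels, lam, eps1=0.01):
    """One relevance update in the spirit of (5); returns the new factors.

    The winner is the prototype closest to x in the weighted metric. Its
    squared coordinate differences are subtracted from lam if the winner
    carries the correct label and added otherwise.
    """
    dists = [weighted_sq_dist(x, w, lam) for w in prototypes]
    j = min(range(len(prototypes)), key=lambda k: dists[k])
    sign = -1.0 if labels[j] == y else 1.0
    new = [l + sign * eps1 * (a - b) ** 2
           for l, a, b in zip(lam, x, prototypes[j])]
    new = [max(l, 0.0) for l in new]   # assumption: keep factors non-negative
    total = sum(new)                   # assumption: normalize to sum one
    if total == 0.0:
        return list(lam)               # degenerate case: keep the old factors
    return [l / total for l in new]
```

In a full training loop this step would follow the prototype update for the same pattern, typically with `eps1` chosen smaller than the prototype learning rate.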
We remark that this update can be interpreted in a Hebbian way: Assume the nearest
prototype $w^j$ is correct. Then those weighting factors are decreased only slightly for
which the term $(x_l - w^j_l)^2$ is small. Taking the normalization of the weighting factors
into account, the weighting factors are increased in this situation iff they contribute to
the correct classification. Conversely, those factors are increased most for which the
term $(x_l - w^j_l)^2$ is large if the classification is wrong. Hence if the classification is
wrong, precisely those weighting factors are increased which do not contribute to the
wrong classification. Since the error function is not continuous in this case, this yields
merely a plausible explanation of the update rule. However, it is not surprising that the
method shows instabilities for large data sets which are subject to noise, as we will see
later.
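We can apply the same idea to GLVQ. Then the modification of (3) which involves
the relevance factors $\lambda$ of the metric is accompanied by
\[
\lambda_l := \lambda_l - \epsilon_1 \, \mathrm{sgd}'(\mu(x))
\left( \frac{d^{-}}{(d^{+} + d^{-})^{2}} (x_l - w^{+}_l)^2
     - \frac{d^{+}}{(d^{+} + d^{-})^{2}} (x_l - w^{-}_l)^2 \right)
\tag{6}
\]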
for each $l$, $w^{+}$ and $w^{-}$ being the closest correct or wrong prototype, respectively, and
$d^{+}$ and $d^{-}$ the respective squared distances in the weighted Euclidean metric. Again,
this is followed by normalization. We term this generalization of RLVQ and GLVQ
generalized relevance learning vector quantization or GRLVQ, for short. Note that
the update can be motivated intuitively by the Hebb paradigm taking the normalization
into account: it comprises the same terms as in (5). Hence those weighting factors
are reinforced most whose coefficients are closest to the respective data point $x$ if
this point is classified correctly; otherwise, if $x$ is classified wrongly, those factors are
reinforced most whose coefficients are far away. The difference in (6) compared to
(5) consists in appropriate situation dependent weightings for the two terms and in the
simultaneous update according to the next correct and next wrong prototype. Besides,
the update rule obeys a gradient dynamics on the corresponding error function (1) as
we show in the appendix.
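The relevance update (6) can be sketched as follows. The sketch assumes that sgd is the logistic function, that the contribution of one pattern to the error function is $\mathrm{sgd}(\mu(x))$ with $\mu(x) = (d^{+} - d^{-})/(d^{+} + d^{-})$, and that normalization means non-negative factors summing to one. All function names are hypothetical. Under these assumptions the bracketed term of (6), multiplied by $\mathrm{sgd}'(\mu(x))$, equals half the partial derivative of $\mathrm{sgd}(\mu(x))$ with respect to $\lambda_l$, so the constant factor 2 is absorbed into the learning rate.

```python
import math

def sgd(t):
    """Logistic function (assumed form of sgd)."""
    return 1.0 / (1.0 + math.exp(-t))

def weighted_sq_dist(x, w, lam):
    """Scaled squared distance (4)."""
    return sum(l * (a - b) ** 2 for l, a, b in zip(lam, x, w))

def closest_pair(x, y, prototypes, labels, lam):
    """Indices and distances of the closest correct and closest wrong prototype."""
    dists = [weighted_sq_dist(x, w, lam) for w in prototypes]
    jp = min((j for j, c in enumerate(labels) if c == y), key=lambda j: dists[j])
    jm = min((j for j, c in enumerate(labels) if c != y), key=lambda j: dists[j])
    return jp, jm, dists[jp], dists[jm]

def error_term(x, y, prototypes, labels, lam):
    """Assumed per-pattern error sgd(mu(x))."""
    _, _, dp, dm = closest_pair(x, y, prototypes, labels, lam)
    return sgd((dp - dm) / (dp + dm))

def relevance_direction(x, y, prototypes, labels, lam):
    """sgd'(mu(x)) times the bracketed term of (6), one entry per dimension."""
    jp, jm, dp, dm = closest_pair(x, y, prototypes, labels, lam)
    s = sgd((dp - dm) / (dp + dm))
    g = s * (1.0 - s)
    denom = (dp + dm) ** 2
    return [g * (dm / denom * (a - p) ** 2 - dp / denom * (a - m) ** 2)
            for a, p, m in zip(x, prototypes[jp], prototypes[jm])]

def grlvq_relevance_step(x, y, prototypes, labels, lam, eps1=0.01):
    """Apply (6), clip at zero (assumption) and normalize to sum one (assumption)."""
    direction = relevance_direction(x, y, prototypes, labels, lam)
    new = [max(l - eps1 * d, 0.0) for l, d in zip(lam, direction)]
    total = sum(new)
    return [l / total for l in new] if total > 0.0 else list(lam)
```

Comparing `relevance_direction` with a finite-difference approximation of `error_term` is a quick way to check an implementation against the gradient interpretation.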
Obviously, the same idea could be applied to any gradient dynamics. We could, for
example, minimize a different error function such as the Kullback-Leibler divergence
of the distribution which is to be learned and the distribution which is implemented by
the vector quantizer. Moreover, this approach is not limited to supervised tasks: we
could enlarge unsupervised methods which obey a gradient dynamics, like the neural gas
algorithm [20], with weighting factors in order to obtain an adaptive metric.
3 Relation to previous research
The main characteristics of GRLVQ as proposed in the previous section are as follows:
The method allows an adaptive metric via scaling the input dimensions. The metric is
restricted to a diagonal matrix. The advantages are the efficiency of the method, inter-
pretability of the matrix elements as relevance factors, and the correlated possibility of
pruning. The update proposed in GRLVQ is intuitive and efficient; at the same time, a
thorough mathematical foundation can be found due to the gradient dynamics. As we
will see in the next section, GRLVQ provides a robust classification system which is
appropriate for real-life data.
Naturally, various approaches in the literature consider the questions of an adaptive
metric, input pruning, and dimensionality determination, too. The most similar
approach we are aware of is distinction sensitive LVQ (DSLVQ) [26]. The
method introduces weighting factors, too, and is based on LVQ3. The main advantages
of our iterative update scheme compared to the DSLVQ update are threefold: Our update
is very intuitive and can be explained with Hebbian learning; our method is more
efficient since in DSLVQ each update step requires normalization twice; and, which
we believe is the most important difference, our update constitutes a gradient descent
on an error function, hence the dynamics can be mathematically analyzed and a clear
objective can be identified.
Recently, Kaski et al. proposed two different approaches which allow an adaptive
metric for unsupervised clustering if additional information in an auxiliary space
is available [17, 32]. Their focus lies on unsupervised clustering and they use the
Bayesian framework in order to derive appropriate algorithms. The approach in [17]
explicitly adapts the metric; however, it needs a model for explaining the auxiliary
data. Hence, we cannot apply the method for our purpose, explicit clustering, i.e.
developing the model. In [32] an explicit model is no longer necessary. However, the
method relies on several statistical assumptions and is derived for soft clustering
instead of exact LVQ. One could borrow ideas from [32]. Alternatively to the statistical
scenario, GRLVQ proposes another direct, efficient, and intuitive approach.
Methods as proposed in [13] and variations allow an adaptive metric for other clus-
tering algorithms like fuzzy clustering. The algorithm in [13] even allows a more flex-
ible metric with non-vanishing entries outside the diagonal; however, the algorithms
are naturally less efficient and require a matrix inversion, for example. In addition,
well known methods like RBF networks can be put in the same line since they can
provide a clustering with adaptive metric as well. Commonly, training is less intuitive
and efficient than GRLVQ. Moreover, a more flexible metric which is not restricted to
a diagonal matrix no longer suggests a natural pruning scheme.
Apart from the flexibility due to an adaptive metric, GRLVQ provides a simple way
of determining which data dimensions are relevant: we can just drop those dimensions
with the lowest weighting factor until a considerable increase of the classification error is
observed. This is a common feature for all methods which determine weighting factors
describing the metric. Alternatively, one can use general methods for determining the
dimensionality of the data which are not fitted to the classifier LVQ. The most popular
approaches are probably ICA and PCA, as already mentioned [15, 25]. Alternatively,
one could use the above mentioned GSOM algorithm [2]. However, because of its
remaining hypercubical structure the results may be inaccurate. Another method is
to apply a Grassberger-Procaccia-analysis to determine the intrinsic dimension. This
method is unfortunately sensitive to noise [12, 38]. A wide variety of relevance deter-
mination methods exists in statistics or in the supervised neural network literature, e.g.
[9, 10, 11, 24, 28, 30, 31]. These methods mostly focus on the task of obtaining sparse
classifications and they do not yield an adaptive metric which could be used in self-organizing metric-based algorithms like LVQ and SOM. Hence a comparison with our
method which primarily focuses on an adaptive metric for self-organizing algorithms
would be interesting, but beyond the scope of this article.
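The pruning scheme mentioned at the beginning of this discussion, dropping the dimensions with the lowest weighting factors until the classification error increases considerably, can be sketched as below. The callback `error_fn`, the threshold `tolerance`, and the function name are hypothetical; in practice `error_fn` would evaluate a trained classifier restricted to the kept dimensions, and what counts as a considerable increase is a choice left to the user.

```python
def prune_by_relevance(lam, error_fn, tolerance=0.02):
    """Return the indices of the dimensions kept after pruning.

    lam       -- list of relevance factors, one per input dimension
    error_fn  -- callable taking a list of kept dimension indices and
                 returning the classification error (hypothetical callback)
    tolerance -- largest accepted increase over the error with all dimensions
    """
    kept = list(range(len(lam)))
    baseline = error_fn(kept)
    # Visit dimensions from least to most relevant.
    for i in sorted(range(len(lam)), key=lambda k: lam[k]):
        if len(kept) == 1:
            break
        candidate = [k for k in kept if k != i]
        if error_fn(candidate) - baseline > tolerance:
            break  # considerable increase: stop pruning here
        kept = candidate
    return kept
```

The number of dimensions that remain gives a rough upper bound on the intrinsic dimension as seen by the classifier, under the assumption that relevant directions are parallel to the axes.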
4 Experiments
Artificial data
We first tested GRLVQ on two artificial data sets from [3] in order to compare it to
RLVQ. We refer to the sets as data 1 and data [...], respectively. The data comprise
clusters with small or large overlap, respectively, in two dimensions as
depicted in Fig. 1. We embed the points in $\mathbb{R}^{10}$ as follows: Assume $(x_1, x_2)$ is one
data point. Then we add 8 dimensions, obtaining a point $(x_1, \ldots, x_{10})$. We choose
$x_3 = x_1 + \eta_1, \ldots, x_6 = x_1 + \eta_4$, where $\eta_i$ comprises noise with a Gaussian
distribution with variances [...], [...], [...], and [...], respectively. $x_7, \ldots, x_{10}$ contain
pure noise which is uniformly distributed in [...] and [...] or distributed according to
Gaussian noise with variances [...] and [...], respectively. We refer to the noisy data as
data [...] and data [...], respectively. In each run, data are randomly separated into a training
[Two scatter plots, both axes ranging from 0 to 1, with legend entries class 1, class 2, class 3]
Figure 1: Artificial data sets consisting of three classes with two clusters each and small
or large overlap, respectively; only the first two dimensions are depicted.
and test set of the same size. $\epsilon$ is chosen as constant [...], $\epsilon_1$ is chosen as [...]. Since the
weighting factors are updated in each step compared to the prototypes, the learning rate
for the weighting terms should be smaller than the learning rate for the prototypes.
Pre-training with simple LVQ until the prototypes nearly converge is mandatory for RLVQ;
otherwise, the classification error is usually large and the results are not stable. It is
advisable to train the prototypes with GLVQ for a few [...] epochs before using GRLVQ
as well, in order to avoid instabilities. We use [...] prototypes for each class according to
the a priori known distribution. The results on training and test set are comparable in
all runs, i.e. the test set accuracy is not worse or only slightly worse compared to the
accuracy on the training set. GRLVQ obtains about the same accuracy as RLVQ on
all data sets (see Tab. 1) and clearly indicates which dimensions are less important via
assigning small weighting factors to the less important dimensions which are known in
these examples. Typical weighting factors are the vectors
RLVQ: $\lambda$ = ([...])
GRLVQ: $\lambda$ = ([...])
for data [...] or the vectors
RLVQ: $\lambda$ = ([...])
GRLVQ: $\lambda$ = ([...])
for data [...], hence clearly separating the important first two data dimensions from the
remaining 8 dimensions of which the first 4 contain some information. This is pointed
out via a comparably large third weighting term for the second data set. The remaining
four dimensions contain no information at all. However, GRLVQ shows a faster
convergence and larger stability compared to RLVQ, in particular if used for noisy data
sets with large overlap of the classes as for data [...]. There the separation of the
important dimensions is clearer in GRLVQ than in RLVQ. Concerning RLVQ, pre-training with
LVQ and small learning rates were mandatory in order to ensure good results; the same
situations turn out to be less critical for GRLVQ, although it is advisable to choose the
learning rate for the weighting terms an order of magnitude smaller than the learning
         data 1    data [...]  data [...]  data [...]
LVQ      91 - 96   81 - 89     79 - 86     56 - 70
RLVQ     91 - 96   90 - 96     80 - 86     79 - 86
GRLVQ    94 - 97   93 - 97     83 - 87     83 - 86
Table 1: Percentage of correctly classified patterns (maximum 100) for the two artificial
training data, data 1 and data [...], with and without additional noisy dimensions and
RLVQ or GRLVQ, respectively.
rate for the prototype update. These results indicate that GRLVQ is particularly well
suited for noisy real life data sets. Based on the above weighting factors one can obtain
a ranking of the input dimensions and drop all but the first two dimensions without
increasing the classification error.
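The embedding of the two-dimensional points into ten dimensions described at the start of this subsection can be sketched as follows. The numerical noise levels are not legible in the present text, so every default value below is a placeholder assumption, as are the function and parameter names. The sketch also assumes that two of the pure-noise dimensions are uniform and two are Gaussian.

```python
import math
import random

def embed_point(point, rng,
                copy_variances=(0.05, 0.1, 0.2, 0.5),   # placeholder values
                uniform_halfwidths=(0.5, 0.2),          # placeholder values
                gaussian_variances=(0.5, 0.2)):         # placeholder values
    """Embed a 2-d point (x1, x2) into 10 dimensions.

    Dimensions 3 to 6 are noisy copies x1 + eta_i of the first coordinate,
    dimensions 7 to 10 contain pure noise (uniform or Gaussian).
    """
    x1, x2 = point
    noisy_copies = [x1 + rng.gauss(0.0, math.sqrt(v)) for v in copy_variances]
    uniform_noise = [rng.uniform(-h, h) for h in uniform_halfwidths]
    gaussian_noise = [rng.gauss(0.0, math.sqrt(v)) for v in gaussian_variances]
    return [x1, x2] + noisy_copies + uniform_noise + gaussian_noise

# Usage: embed one point with a seeded generator for reproducibility.
example = embed_point((0.3, 0.7), random.Random(0))
```

With such data, a relevance learning scheme would be expected to assign the largest factors to the first two dimensions and factors close to zero to the last four.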
Iris data
In a second test we applied GRLVQ to the well known Iris data set provided in the UCI
repository of machine learning [4]. The task is to predict three classes of plants based
on 4 numerical attributes in 150 instances, i.e., we deal with data points in $\mathbb{R}^4$ with
labels in $\{1, 2, 3\}$. Both LVQ and RLVQ obtain an accuracy of about [...] for a training
and test set if trained with [...] prototypes for each class. RLVQ shows a slightly cyclic
behavior in the limit, the accuracy changing between [...] and [...]. The computed
weighting factors for RLVQ are
RLVQ: $\lambda$ = ([...])
indicating that based on the last dimension a very good classification would be possible.
If more dimensions were taken into account, a better accuracy of about 1.0 would
be possible as reported in the literature. We could not produce such a solution with LVQ
or RLVQ. Moreover, a perfect recognition of [...] would correspond to overfitting since
the data comprise small noise as reported in the literature. GRLVQ yields the better
accuracy of at least [...] on the training as well as the test set and obtains weighting
factors of the form
GRLVQ: $\lambda$ = ([...])
hence indicating that the last dimension is most important, as already found by RLVQ,
and that dimension [...] contributes to a better accuracy, which has not been pointed out by
RLVQ. Note that the result obtained by GRLVQ agrees with results obtained
e.g. with rule extraction from feedforward networks [6].
Satellite data
Finally, we applied the algorithm to a large real world data set: a multi-spectral
LANDSAT TM satellite image of the Colorado area.¹ Satellites of LANDSAT-TM type
produce pictures of the earth in 7 different spectral bands. The ground resolution in meters
¹ Thanks to M. Augusteijn (University of Colorado) for providing this image.
                  LVQ     RLVQ    GLVQ    GRLVQ
mean (train)      85.21   86.1    87.32   91.08
variance (train)  0.59    0.18    0.17    0.11
mean (test)       85.2    86.36   87.28   91.04
variance (test)   0.46    0.16    0.1     0.13
Table 2: Percentage of correctly classified patterns (maximum 100) and variance of the
runs on the satellite data obtained in a 10-fold cross-validation.
is [...] × [...] for the bands 1-5 and band 7. Band 6 (thermal band) has a resolution of
[...] × [...] only and, therefore, it is often dropped. The spectral bands represent useful
domains of the whole spectrum in order to detect and discriminate vegetation, water,
rock formations and cultural features [5, 22]. Hence, the spectral information, i.e., the
intensity of the bands associated with each pixel of a LANDSAT scene, is represented
by a vector in $\mathbb{R}^n$ with $n$ = [...]. Generally, the bands are highly correlated [1, 35].
Additionally, the Colorado image is completely labeled by experts. There are [...] labels
describing different vegetation types and geological formations. Thereby, the label
probability varies in a wide range [36]. The size of the image is [...] × [...] pixels.
We trained RLVQ and GRLVQ with [...] prototypes ([...] for each class) on [...] of
the data set till convergence. The algorithm converged in less than [...] cycles if $\epsilon$ and
$\epsilon_1$ were chosen as [...] and [...], respectively, as before. RLVQ yields an accuracy
of about [...] on the training data as well as the entire data set; however, it does not
provide a ranking of the input dimensions, i.e. all weighting terms are close to their initial
value [...]. GRLVQ leads to the better accuracy of [...] on the training set as well
as the entire data set and provides a clear ranking of the individual data dimensions. See
Table 2 for a comparison of the results obtained by the various algorithms. In all
experiments, dimension [...] is ranked as least important with weighting factor close to [...].
The weighting factors approximate
GRLVQ: $\lambda$ = ([...])
in several runs. This weighting clearly separates the first two dimensions via a small
weighting factor. If we prune dimensions [...], [...], and [...], still an accuracy of [...] can be
achieved. Hence this indicates that the intrinsic data dimension is at most [...]. Pruning
one additional data dimension, dimension [...], still allows an accuracy of more than [...],
hence indicating that the intrinsic dimension may be even lower and the relevant
directions are not parallel to the axes or even curved. These results are visualized in Fig.
2 where the misclassified pixels in the respective cases are colored in black, the other
pixels are colored corresponding to their respective class.

For comparison we applied a Grassberger-Procaccia analysis and the GSOM approach.
The first estimates the intrinsic dimension as [...] whereas GSOM
generates a lattice of shape [...] × [...] × [...], hence indicating an intrinsic dimension between
[...] and [...]. These methods show a good agreement with the drastic loss of information if
more than [...] dimensions are pruned with GRLVQ.
Figure 2: Colorado satellite image: the pixels are colored according to the labels;
above-left: original labeling; above-right: GRLVQ without pruning; below-left: GRLVQ
with pruning of dimensions [...], [...], [...]; below-right: GRLVQ with pruning of [...]
dimensions. Misclassified pixels in the GRLVQ-generated images are colored black. (A
colored version of the image can be obtained from the authors on request.)
5 Conclusions
The presented clustering algorithm GRLVQ provides a new robust method for automatically
adapting the Euclidean metric used for clustering to the data, determining
the relevance of the individual input dimensions for the overall classifier, and estimating
the intrinsic dimension of data. It reduces the input dimensions to the essential
parameters, which is required to obtain optimal network structures. This is an important
feature if the network is used to reduce the amount of data passed to subsequent systems
in complex data analysis tasks as we can find in medical applications (image analysis)
or satellite remote sensing systems, for example. Here, reducing the data to
be transferred while preserving the essential information in the data is one of the most
important features.
The GRLVQ-algorithm was successfully tested on artificial as well as real world
data, a large and noisy satellite multi-spectral image. A comparison with other ap-
proaches validates the results even in real life applications.
It should be noted that the GRLVQ algorithm can be easily adapted to other types
of neural vector quantizers such as neural gas or SOM, to mention just a few. Furthermore, it
is clear that if we assume an unknown probability distribution of the labels for a given
data set, the variant of GRLVQ discussed here tries to maximize the Kullback-Leibler
divergence. Hence, we can state for this feature some similarities of our approach to
the work of Kaski [17, 32].
Further considerations of GRLVQ should incorporate information theory approaches
like entropy maximization to improve the capabilities of the network.
References
[1] M. F. Augusteijn, K. A. Shaw, and R. J. Watson. A study of neural network
input data for ground cover identification in satellite images. In S. Gielen and
B. Kappen, editors, Proc. ICANN93, Int. Conf. on Artificial Neural Networks,
pages 1010-1013, London, UK, 1993. Springer.
[2] H.-U. Bauer and T. Villmann. Growing a Hypercubical Output Space in a Self
Organizing Feature Map. IEEE Transactions on Neural Networks, 8(2):218-226,
1997.
[3] T. Bojer, B. Hammer, D. Schunk, and K. Tluk von Toschanowitz. Relevance
determination in learning vector quantization. In Proc. of European Symposium
on Artificial Neural Networks (ESANN01), pages 271-276, Brussels, Belgium,
2001. D-facto publications.
[4] C. L. Blake and C. J. Merz. UCI Repository of machine learning databases. Irvine,
CA: University of California, Department of Information and Computer Science.
[5] J. Campbell. Introduction to Remote Sensing. The Guilford Press, U.S.A., 1996.
[6] W. Duch, R. Adamczak, and K. Grabczewski. A new method of extraction, optimization
and application of crisp and fuzzy logical rules. IEEE Transactions on Neural
Networks, 12:277-306, 2001.
[7] B. Fritzke. Growing grid: a self-organizing network with constant neighborhood
range and adaptation strength. Neural Processing Letters, 2(5):9-13, 1995.
[8] I. Gath and A. Geva. Unsupervised optimal fuzzy clustering. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 11:773-791, 1989.
[9] T. van Gestel, J. A. K. Suykens, B. de Moor, and J. Vandewalle. Automatic
relevance determination for least squares support vector machine classifiers. In
M. Verleysen, editor, European Symposium on Artificial Neural Networks, pages 13-18,
2001.
[10] Y. Grandvalet. Anisotropic noise injection for input variables relevance determination.
IEEE Transactions on Neural Networks, 11(6):1201-1212, 2000.
[11] Y. Grandvalet. Least absolute shrinkage is equivalent to quadratic penalization.
In L. Niklasson, M. Boden, and T. Ziemke, editors, ICANN'98, volume 1 of
Perspectives in Neural Computing, pages 201-206. Springer, 1998.
[12] P. Grassberger and I. Procaccia. Measuring the strangeness of strange attractors.
Physica, 9D:189-208, 1983.
[13] D. Gustafson and W. Kessel. Fuzzy clustering with a fuzzy covariance matrix. In
Proceedings of IEEE CDC'79, pages 761-766, 1979.
[14] T. Hofmann. Learning the similarity of documents: An information geometric
approach to document retrieval and categorization. In S. A. Solla, T. K. Leen,
and K.-R. Müller, editors, Advances in Neural Information Processing Systems,
volume 12, pages 914-920. MIT Press, 2000.
[15] A. Hyvärinen and E. Oja. A fast fixed-point algorithm for independent component
analysis. Neural Computation, 9(7):1483-1492, 1997.
[16] S. Kaski. Dimensionality reduction by random mapping: fast similarity
computation for clustering. In Proceedings of IJCNN'98, pages 413-418, 1998.
[17] S. Kaski. Bankruptcy analysis with self-organizing maps in learning metrics. To
appear in IEEE Transactions on Neural Networks.
[18] T. Kohonen. Learning vector quantization. In M. Arbib, editor, The Handbook of
Brain Theory and Neural Networks, pages 537-540. MIT Press, 1995.
[19] T. Kohonen. Self-Organizing Maps. Springer, 1997.
[20] T. Martinetz and K. Schulten. Topology representing networks. Neural Networks,
7(3):507-522, 1993.
[21] U. Matecki. Automatische Merkmalsauswahl für Neuronale Netze mit Anwendung
in der pixelbezogenen Klassifikation von Bildern. Shaker, 1999.
[22] E. Merényi. The challenges in spectral image analysis: An introduction and
review of ANN approaches. In Proc. of European Symposium on Artificial Neural
Networks (ESANN'99), pages 93-98, Brussels, Belgium, 1999. D facto publications.
[23] A. Meyering and H. Ritter. Learning 3D-shape-perception with local linear maps.
In Proceedings of IJCNN'92, pages 432-436, 1992.
[24] R. Neal. Bayesian Learning for Neural Networks. Springer, 1996.
[25] E. Oja. Principal component analysis. In M. Arbib, editor, The Handbook of
Brain Theory and Neural Networks, pages 753-756. MIT Press, 1995.
[26] M. Pregenzer, G. Pfurtscheller, and D. Flotzinger. Automated feature selection
with distinction sensitive learning vector quantization. Neurocomputing, 11:19-29,
1996.
[27] H. Ritter. Self-organizing maps in non-euclidean spaces. In E. Oja and S. Kaski,
editors, Kohonen Maps, pages 97-108. Springer, 1999.
[28] V. Roth. Sparse kernel regressors. In G. Dorffner, H. Bischof, and K. Hornik,
editors, Artificial Neural Networks - ICANN 2001, pages 339-346. Springer, 2001.
[29] A. S. Sato and K. Yamada. Generalized learning vector quantization. In
G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information
Processing Systems, volume 7, pages 423-429. MIT Press, 1995.
[30] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the
Royal Statistical Society, Series B, 58:267-288, 1996.
[31] M. Tipping. The relevance vector machine. In S. A. Solla, T. K. Leen, and
K.-R. Müller, editors, Advances in Neural Information Processing Systems,
volume 12, pages 652-658. Cambridge MIT Press, 2000.
[32] J. Sinkkonen and S. Kaski. Clustering based on conditional distribution in an
auxiliary space. To appear in Neural Computation.
[33] P. Somervuo and T. Kohonen. Self-organizing maps and learning vector
quantization for feature sequences. Neural Processing Letters, 10(2):151-159, 1999.
[34] M.-K. Tsay, K.-H. Shyu, and P.-C. Chang. Feature transformation with
generalized learning vector quantization for hand-written Chinese character
recognition. IEICE Transactions on Information and Systems, E82-D(3):687-692, 1999.
[35] T. Villmann and E. Merényi. Extensions and modifications of the Kohonen-SOM
and applications in remote sensing image analysis. In U. Seiffert and L. C. Jain,
editors, Self-Organizing Maps: Recent Advances and Applications, pages 121-145.
Springer, 2001.
[36] T. Villmann. Benefits and limits of the self-organizing map and its variants in the
area of satellite remote sensoring processing. In Proc. of European Symposium
on Artificial Neural Networks (ESANN'99), pages 111-116, Brussels, Belgium,
1999. D facto publications.
[37] T. Villmann, R. Der, M. Herrmann, and T. Martinetz. Topology Preservation
in Self-Organizing Feature Maps: Exact Definition and Measurement. IEEE
Transactions on Neural Networks, 8(2):256-266, 1997.
[38] W. Wienholt. Entwurf Neuronaler Netze. Verlag Harri Deutsch, Frankfurt/M.,
Germany, 1996.
Appendix
The general error function (1) has here the special form

$$E = \sum_{i} \mathrm{sgd}\!\left(\frac{d_{r^+}(x^i) - d_{r^-}(x^i)}{d_{r^+}(x^i) + d_{r^-}(x^i)}\right),$$

$d_{r^+}(x^i)$ and $d_{r^-}(x^i)$ being the quadratic weighted distance of the training point $x^i$ to the closest correct or wrong prototype, $w^{r^+}$ and $w^{r^-}$, respectively. For convenience we denote $|x - w|^2_\lambda = \sum_m \lambda_m (x_m - w_m)^2$ and $d_r(x) = |x - w^r|^2_\lambda$. Assume data come from a distribution $P$ on the input space $X$ and a labeling function $y : X \to \{1, \ldots, C\}$. Then the continuous version of the error function reads as

$$\int_X \mathrm{sgd}\!\left(\frac{d_{r^+}(x) - d_{r^-}(x)}{d_{r^+}(x) + d_{r^-}(x)}\right) P(dx).$$
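To make the error function concrete, the following minimal sketch evaluates its discrete form on labeled points. It assumes that sgd is the logistic function 1/(1 + exp(-t)) and that the quadratic weighted distance is sum_m lambda_m (x_m - w_m)^2. All function names and the toy values are hypothetical and serve only as an illustration.

```python
import numpy as np

def weighted_sq_dist(x, w, lam):
    """Quadratic weighted distance sum_m lam_m * (x_m - w_m)**2."""
    return float(np.sum(lam * (x - w) ** 2))

def sgd(t):
    """Logistic function, assumed here as the sigmoid sgd."""
    return 1.0 / (1.0 + np.exp(-t))

def sample_cost(x, label, protos, proto_labels, lam):
    """sgd((d+ - d-) / (d+ + d-)) for one labeled point."""
    d = np.array([weighted_sq_dist(x, w, lam) for w in protos])
    correct = proto_labels == label
    d_plus = d[correct].min()    # distance to the closest correct prototype
    d_minus = d[~correct].min()  # distance to the closest wrong prototype
    return sgd((d_plus - d_minus) / (d_plus + d_minus))

def empirical_error(X, y, protos, proto_labels, lam):
    """Sum of the per-sample costs over a finite training set."""
    return sum(sample_cost(x, c, protos, proto_labels, lam) for x, c in zip(X, y))

# Toy setting: one prototype per class, equal relevance factors.
protos = np.array([[0.0, 0.0], [1.0, 1.0]])
proto_labels = np.array([0, 1])
lam = np.array([0.5, 0.5])
X = np.array([[0.0, 0.0], [0.0, 0.0]])
y = [0, 1]  # the same point, once labeled correctly and once wrongly
E = empirical_error(X, y, protos, proto_labels, lam)
```

In this toy setting the first point sits on its own prototype and contributes sgd(-1), while the second sits on the wrong prototype and contributes sgd(1), so small values of the cost indicate a correct classification with a large margin.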
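We assume that the sets $X_c = \{x \in X \mid y(x) = c\}$ are measurable. Then we can write the error term in the following way:

$$\sum_{c=1}^{C} \int_{X_c} \sum_{r \in R_c} \sum_{s \in \bar R_c} \mathrm{sgd}\!\left(\frac{d_r(x) - d_s(x)}{d_r(x) + d_s(x)}\right) \phi_r(x)\, \psi_s(x)\, P(dx) \qquad (7)$$

where $R_c$ denotes the indices of prototypes labeled with $c$, $\bar R_c$ denotes the indices of prototypes not labeled with $c$, $\phi_r(x)$ is an indicator function for $w^r$ being the closest prototype to $x$ among those labeled with $c$, and $\psi_s(x)$ is an indicator function for $w^s$ being the closest prototype to $x$ among those not labeled with $c$. Denote by $H$ the Heaviside function, with $H(t) = 1$ for $t \ge 0$ and $H(t) = 0$ otherwise. Denote by $|R_c|$ or $|\bar R_c|$ the number of prototypes labeled with $c$ or not labeled with $c$, respectively. Then we find

$$\phi_r(x) = H\Big(\sum_{k \in R_c} H\big(d_k(x) - d_r(x)\big) - |R_c|\Big)$$

and

$$\psi_s(x) = H\Big(\sum_{k \in \bar R_c} H\big(d_k(x) - d_s(x)\big) - |\bar R_c|\Big).$$

The derivative of the Heaviside function is the delta function $\delta$, which is a symmetric function with $\delta(t) = 0$ for $t \ne 0$ and $\int \delta(t)\, dt = 1$.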
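The Heaviside representation of the indicator functions can be checked numerically. The sketch below assumes the convention H(0) = 1 and a hypothetical helper `closest_indicator`. It computes H(sum_k H(d_k - d_r) - n) for every prototype r in one group of n prototypes, where the group is either the prototypes labeled with c or those not labeled with c.

```python
import numpy as np

def heaviside(t):
    """H(t) = 1 for t >= 0 and 0 otherwise (convention H(0) = 1 assumed)."""
    return np.where(np.asarray(t) >= 0, 1.0, 0.0)

def closest_indicator(d_group):
    """Indicator H(sum_k H(d_k - d_r) - n) for each r in one prototype group."""
    d_group = np.asarray(d_group, dtype=float)
    n = d_group.size
    # counts[r] is the number of prototypes k in the group with d_k >= d_r
    counts = heaviside(d_group[None, :] - d_group[:, None]).sum(axis=1)
    return heaviside(counts - n)

# Distances of one point x to three prototypes of the same group.
d = np.array([0.7, 0.2, 1.5])
phi = closest_indicator(d)
```

The indicator equals one exactly for the prototype with the smallest distance. If several prototypes are tied for the minimum, all of them receive the value one, which happens only on a set where the delta functions in the derivation are concentrated.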
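We are interested in the derivative of (7) with respect to every $w^j$ and every $\lambda_m$, respectively. Assume $c$ is the label of $w^j$. Then the derivative of (7) with respect to $w^j$ yields, writing $d_r$ for $d_r(x)$:

$$\int_{X_c} \sum_{s \in \bar R_c} \mathrm{sgd}'\!\left(\frac{d_j - d_s}{d_j + d_s}\right) \frac{\partial}{\partial w^j}\!\left(\frac{d_j - d_s}{d_j + d_s}\right) \phi_j\, \psi_s\, P(dx) \qquad (8)$$

$$+ \sum_{c' \ne c} \int_{X_{c'}} \sum_{r \in R_{c'}} \mathrm{sgd}'\!\left(\frac{d_r - d_j}{d_r + d_j}\right) \frac{\partial}{\partial w^j}\!\left(\frac{d_r - d_j}{d_r + d_j}\right) \phi_r\, \psi_j\, P(dx) \qquad (9)$$

$$+ \int_{X_c} \sum_{r \in R_c} \sum_{s \in \bar R_c} \mathrm{sgd}\!\left(\frac{d_r - d_s}{d_r + d_s}\right) \frac{\partial \phi_r}{\partial w^j}\, \psi_s\, P(dx) \qquad (10)$$

$$+ \sum_{c' \ne c} \int_{X_{c'}} \sum_{r \in R_{c'}} \sum_{s \in \bar R_{c'}} \mathrm{sgd}\!\left(\frac{d_r - d_s}{d_r + d_s}\right) \phi_r\, \frac{\partial \psi_s}{\partial w^j}\, P(dx) \qquad (11)$$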
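The derivatives of the two quotients that occur in (8) and (9) can be written out explicitly for a prototype $w^j$. The following worked computation assumes the quadratic weighted distance $d_j(x) = \sum_m \lambda_m (x_m - w^j_m)^2$ and writes $\Lambda$ for the diagonal matrix with entries $\lambda_m$; the symbol $\Lambda$ is notation introduced here for brevity.

```latex
\begin{align*}
\frac{\partial d_j}{\partial w^j} &= -2\,\Lambda\,(x - w^j), \\
\frac{\partial}{\partial w^j}\left(\frac{d_j - d_s}{d_j + d_s}\right)
  &= \frac{2\,d_s}{(d_j + d_s)^2}\,\frac{\partial d_j}{\partial w^j}
   = -\frac{4\,d_s}{(d_j + d_s)^2}\,\Lambda\,(x - w^j), \\
\frac{\partial}{\partial w^j}\left(\frac{d_r - d_j}{d_r + d_j}\right)
  &= -\frac{2\,d_r}{(d_r + d_j)^2}\,\frac{\partial d_j}{\partial w^j}
   = \frac{4\,d_r}{(d_r + d_j)^2}\,\Lambda\,(x - w^j).
\end{align*}
```

Since $\mathrm{sgd}'$ is positive, a gradient descent step therefore moves the closest correct prototype towards $x$ and the closest wrong prototype away from $x$, each scaled by the relevance factors.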
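(8) and (9) correspond up to a constant factor to the update (3). (10) and (11) vanish due to the following reason: Denote by $\Gamma_r(x)$ the term $\sum_{s \in \bar R_c} \mathrm{sgd}\!\left(\frac{d_r - d_s}{d_r + d_s}\right) \psi_s(x)$ and by $A_r = \sum_{k \in R_c} H(d_k - d_r) - |R_c|$ the argument of the outer Heaviside function in $\phi_r$. The integrand in (10) yields

$$\sum_{r \in R_c} \Gamma_r\, \frac{\partial \phi_r}{\partial w^j} = \sum_{r \in R_c} \Gamma_r\, \delta(A_r) \sum_{k \in R_c} \delta(d_k - d_r) \left(\frac{\partial d_k}{\partial w^j} - \frac{\partial d_r}{\partial w^j}\right)$$

$$= \sum_{r \in R_c,\; r \ne j} \Big(\Gamma_r\, \delta(A_r)\, \delta(d_j - d_r) - \Gamma_j\, \delta(A_j)\, \delta(d_r - d_j)\Big) \frac{\partial d_j}{\partial w^j},$$

where the second line uses that $\partial d_k / \partial w^j$ is zero unless $k = j$. This term vanishes since $\delta$ is symmetric and non-vanishing only for $d_r = d_j$, in which case $\Gamma_r = \Gamma_j$ and $A_r = A_j$. In the same way, it can be seen that each integrand of (11) vanishes.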
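The derivative of (7) with respect to $\lambda_m$ can be computed as

$$\sum_{c=1}^{C} \int_{X_c} \sum_{r \in R_c} \sum_{s \in \bar R_c} \mathrm{sgd}'\!\left(\frac{d_r - d_s}{d_r + d_s}\right) \phi_r\, \psi_s \qquad (12)$$

$$\cdot \left(\frac{2\, d_s}{(d_r + d_s)^2}\, (x_m - w^r_m)^2 - \frac{2\, d_r}{(d_r + d_s)^2}\, (x_m - w^s_m)^2\right) P(dx) \qquad (13)$$

$$+ \sum_{c=1}^{C} \int_{X_c} \sum_{r \in R_c} \sum_{s \in \bar R_c} \mathrm{sgd}\!\left(\frac{d_r - d_s}{d_r + d_s}\right) \qquad (14)$$

$$\cdot \left(\frac{\partial \phi_r}{\partial \lambda_m}\, \psi_s + \phi_r\, \frac{\partial \psi_s}{\partial \lambda_m}\right) P(dx) \qquad (15)$$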
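The term in (12) and (13) can be checked against a finite-difference approximation for a single sample. The sketch below assumes the logistic function for sgd, fixes the closest correct prototype `w_plus` and the closest wrong prototype `w_minus`, and uses hypothetical toy values. Away from ties the indicator functions are locally constant, so only the part corresponding to (12) and (13) contributes.

```python
import numpy as np

def sgd(t):
    """Logistic function, assumed here as the sigmoid sgd."""
    return 1.0 / (1.0 + np.exp(-t))

def cost(x, w_plus, w_minus, lam):
    """sgd((d+ - d-) / (d+ + d-)) with quadratic weighted distances."""
    d_plus = np.sum(lam * (x - w_plus) ** 2)
    d_minus = np.sum(lam * (x - w_minus) ** 2)
    return sgd((d_plus - d_minus) / (d_plus + d_minus))

def grad_lambda(x, w_plus, w_minus, lam):
    """Analytic gradient of the cost with respect to lam, as in (12)-(13)."""
    d_plus = np.sum(lam * (x - w_plus) ** 2)
    d_minus = np.sum(lam * (x - w_minus) ** 2)
    s = sgd((d_plus - d_minus) / (d_plus + d_minus))
    denom = (d_plus + d_minus) ** 2
    return s * (1.0 - s) * (2.0 * d_minus / denom * (x - w_plus) ** 2
                            - 2.0 * d_plus / denom * (x - w_minus) ** 2)

def numeric_grad_lambda(x, w_plus, w_minus, lam, h=1e-6):
    """Central finite differences in each relevance factor."""
    g = np.zeros_like(lam)
    for m in range(lam.size):
        e = np.zeros_like(lam)
        e[m] = h
        g[m] = (cost(x, w_plus, w_minus, lam + e)
                - cost(x, w_plus, w_minus, lam - e)) / (2.0 * h)
    return g

x = np.array([0.3, -1.2, 0.8])
w_plus = np.array([0.0, -1.0, 1.0])
w_minus = np.array([1.0, 0.5, -0.5])
lam = np.array([0.5, 0.3, 0.2])
analytic = grad_lambda(x, w_plus, w_minus, lam)
numeric = numeric_grad_lambda(x, w_plus, w_minus, lam)
```

The two gradients agree up to the discretization error of the finite differences, which is what one would expect if the indicator terms in (14) and (15) contribute nothing.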
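(12) and (13) correspond, up to a constant factor, to the update for $\lambda$ in (5). (14) and (15) vanish since we obtain for the integrand the following equation, where $\Gamma_r = \sum_{s \in \bar R_c} \mathrm{sgd}\!\left(\frac{d_r - d_s}{d_r + d_s}\right) \psi_s$, $\Delta_s = \sum_{r \in R_c} \mathrm{sgd}\!\left(\frac{d_r - d_s}{d_r + d_s}\right) \phi_r$, and $A_r$ and $B_s$ denote the arguments of the outer Heaviside functions in $\phi_r$ and $\psi_s$:

$$\sum_{r \in R_c} \Gamma_r\, \delta(A_r) \sum_{k \in R_c} \delta(d_k - d_r) \left(\frac{\partial d_k}{\partial \lambda_m} - \frac{\partial d_r}{\partial \lambda_m}\right) + \sum_{s \in \bar R_c} \Delta_s\, \delta(B_s) \sum_{k \in \bar R_c} \delta(d_k - d_s) \left(\frac{\partial d_k}{\partial \lambda_m} - \frac{\partial d_s}{\partial \lambda_m}\right)$$

$$= \sum_{r \in R_c} \sum_{k \in R_c} \Gamma_r\, \delta(A_r)\, \delta(d_k - d_r)\, \frac{\partial d_k}{\partial \lambda_m} - \sum_{r \in R_c} \sum_{k \in R_c} \Gamma_r\, \delta(A_r)\, \delta(d_k - d_r)\, \frac{\partial d_r}{\partial \lambda_m}$$

$$+ \sum_{s \in \bar R_c} \sum_{k \in \bar R_c} \Delta_s\, \delta(B_s)\, \delta(d_k - d_s)\, \frac{\partial d_k}{\partial \lambda_m} - \sum_{s \in \bar R_c} \sum_{k \in \bar R_c} \Delta_s\, \delta(B_s)\, \delta(d_k - d_s)\, \frac{\partial d_s}{\partial \lambda_m}.$$

Again, this is zero because of the symmetry of $\delta$ and the fact that $\delta$ is non-vanishing only for $d_k = d_r$ and $d_k = d_s$, respectively. Hence the update of GRLVQ constitutes a stochastic gradient descent method with appropriate choices of the learning rates.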
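A single stochastic gradient step of this kind can be sketched as follows. This is an illustration under stated assumptions rather than the exact update rules (3) and (5): sgd is taken to be the logistic function, the learning rates `eps_w` and `eps_lam` are hypothetical values, and the relevance factors are clipped at zero and renormalized to sum to one after each step, which is an assumption of this sketch.

```python
import numpy as np

def sgd(t):
    """Logistic function, assumed here as the sigmoid sgd."""
    return 1.0 / (1.0 + np.exp(-t))

def grlvq_step(x, label, protos, proto_labels, lam, eps_w=0.05, eps_lam=0.005):
    """One gradient descent step on sgd((d+ - d-) / (d+ + d-)) for one sample.

    Updates protos in place and returns the new relevance vector.
    """
    d = np.sum(lam * (x - protos) ** 2, axis=1)
    correct = proto_labels == label
    idx = np.arange(len(protos))
    jp = idx[correct][np.argmin(d[correct])]    # closest correct prototype
    jm = idx[~correct][np.argmin(d[~correct])]  # closest wrong prototype
    dp, dm = d[jp], d[jm]
    s = sgd((dp - dm) / (dp + dm))
    g = s * (1.0 - s)                           # derivative of the logistic function
    denom = (dp + dm) ** 2
    diff_p, diff_m = x - protos[jp], x - protos[jm]
    # Gradient with respect to lam, evaluated before the prototypes move.
    grad_lam = g * (2.0 * dm / denom * diff_p ** 2
                    - 2.0 * dp / denom * diff_m ** 2)
    # Descent on the prototypes: attract the correct one, repel the wrong one.
    protos[jp] += eps_w * g * 4.0 * dm / denom * lam * diff_p
    protos[jm] -= eps_w * g * 4.0 * dp / denom * lam * diff_m
    # Assumed post-processing: keep lam nonnegative and normalized.
    new_lam = np.clip(lam - eps_lam * grad_lam, 0.0, None)
    return new_lam / new_lam.sum()

# Toy setting: the two prototypes differ only in the first input dimension.
protos = np.array([[0.0, 0.0], [2.0, 0.0]])
before = protos.copy()
x = np.array([0.5, 1.0])
new_lam = grlvq_step(x, 0, protos, np.array([0, 1]), np.array([0.5, 0.5]))
```

In this toy setting only the first dimension separates the two prototypes, and the step accordingly raises the first relevance factor relative to the second. Choosing `eps_lam` smaller than `eps_w` is a design choice of the sketch, made so that the metric adapts more slowly than the prototypes.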