
Generalized Relevance Learning Vector Quantization

Barbara Hammer
University of Osnabrück, Department of Mathematics/Computer Science, Albrechtstraße 28, 49069 Osnabrück, Germany

Thomas Villmann
University of Leipzig, Clinic for Psychotherapy and Psychosomatic Medicine, Karl-Tauchnitz-Straße 25, 04107 Leipzig, Germany

March 11, 2002

Abstract

We propose a new scheme for enlarging generalized learning vector quantization (GLVQ) with weighting factors for the input dimensions. The factors allow an appropriate scaling of the input dimensions according to their relevance. They are adapted automatically during training according to the specific classification task, whereby training can be interpreted as stochastic gradient descent on an appropriate error function. This method leads to a more powerful classifier and to an adaptive metric with little extra cost compared to standard GLVQ. Moreover, the size of the weighting factors indicates the relevance of the input dimensions, which suggests a scheme for automatically pruning irrelevant input dimensions. The algorithm is verified on artificial data sets and the Iris data from the UCI repository. Afterwards, the method is compared to several well-known algorithms which determine the intrinsic data dimension on real-world satellite image data.

Keywords: clustering, learning vector quantization, adaptive metric, relevance determination.

    1 Introduction

Self-organizing methods such as the self-organizing map (SOM) or vector quantization (VQ) as introduced by Kohonen provide a successful and intuitive method of processing data for easy access [18]. Assuming the data are labeled, an automatic clustering can be learned by attaching maps to the SOM or by enlarging VQ with a supervised component, yielding so-called learning vector quantization (LVQ) [19, 23]. Various modifications of LVQ exist which ensure faster convergence, a better adaptation of the receptive fields to the optimum Bayesian decision, or an adaptation to complex data structures, to name just a few [19, 29, 33].

A common feature of unsupervised algorithms and LVQ consists in the fact that information is provided by the distance structure between the data points, which is determined by the chosen metric. Learning heavily relies on the commonly used Euclidean metric and hence crucially depends on the fact that the Euclidean metric is appropriate for the respective learning task. Therefore data have to be preprocessed and scaled appropriately such that the input dimensions have approximately the same importance for the classification. In particular, the important features for the respective problem have to be found, which is usually done by experts or with rules of thumb. Of course, this may be time consuming and requires prior knowledge which is often not available. Hence methods have been proposed which adapt the metric during training. Distinction sensitive LVQ (DSLVQ), as an example, automatically determines weighting factors for the input dimensions of the training data [26]; the algorithm adapts LVQ3 for the weighting factors according to plausible heuristics. The approaches [17, 32] enhance unsupervised clustering algorithms with the possibility of integrating auxiliary information such as a labeling into the metric structure. Alternatively, one could use information geometric methods in order to adapt the metric, as in [14].

Concerning the SOM, another major problem consists in finding an appropriate topology of the initial lattice of prototypes such that the prior topology of the neural architecture mirrors the intrinsic topology of the data. Hence various heuristics exist to measure the degree of topology preservation, to adapt the topology to the data, to define the lattice a posteriori, or to evolve structures which are appropriate for real-world data [2, 7, 20, 27, 37]. In all these tasks the intrinsic dimensionality of the data plays a crucial role, since it determines an important aspect of the optimum neural network: the topological structure, i.e., the lattice for the SOM. Moreover, superfluous data dimensions slow down training for LVQ as well. They may even cause a decrease in accuracy, since they add possibly noisy or misleading terms to the Euclidean metric on which LVQ is based. Hence a data dimension as small as possible is desirable for the above-mentioned methods in general, for the sake of efficiency, accuracy, and simplicity of neural network processing. Therefore various algorithms exist which make it possible to estimate the intrinsic dimension of the data: PCA and ICA constitute well-established methods which are often used for adequate preprocessing of data and which can be implemented with neural methods [15, 25]. A Grassberger-Procaccia analysis estimates the dimensionality of attractors in a dynamic system [12]. SOMs which adapt the dimensionality of the lattice during training, like the growing SOM (GSOM), automatically determine the approximate dimensionality of the data [2]. Naturally, all adaptation schemes which determine weighting factors or relevance terms for the input dimensions constitute an alternative method for determining the dimensionality: the dimensions which are ranked as least important, i.e. those possessing the smallest relevance terms, can be dropped. The intrinsic dimensionality is reached when an appropriate quality measure such as an error term changes significantly. There exists a wide variety of input relevance determination methods in statistics and in the field of supervised neural networks, e.g. pruning algorithms for feedforward networks as proposed in [10], the application of automatic relevance determination for the support vector machine or Gaussian processes [9, 24, 31], or adaptive ridge regression and the incorporation of penalizing functions as proposed in [11, 28, 30]. However, note that our focus lies on improving metric-based algorithms by means of an adaptive metric which allows dimensionality reduction as a byproduct. The above-mentioned methods do not yield a metric which could be used in self-organizing algorithms; they primarily pursue the goal of sparsity and dimensionality reduction in neural network architectures or alternative classifiers.

In the following, we will focus on LVQ since it combines the elegance of simple and intuitive updates in unsupervised algorithms with the accuracy of supervised methods. We will propose a possibility of automatically scaling the input dimensions and hence adapting the Euclidean metric to the specific training problem. As a byproduct, this leads to a pruning algorithm for irrelevant data dimensions and to the possibility of computing the intrinsic data dimension. Approaches like [16] clearly indicate that often a considerable reduction of the data dimension is possible without loss of information. The main idea of our approach is to introduce weighting factors for the data dimensions which are adapted automatically such that the classification error becomes minimal. Like in LVQ, the update formulas are intuitive and can be interpreted as Hebbian learning. From a mathematical point of view, the dynamics constitute a stochastic gradient descent on an appropriate error surface. Small factors in the result indicate that the respective data dimension is irrelevant and can be pruned. This idea can be applied to any generalized LVQ (GLVQ) scheme as introduced in [29] or to other plausible error measures such as the Kullback-Leibler divergence. With the error measure of GLVQ, a robust and efficient method results which can push the classification borders close to the optimum Bayesian decision. This method, generalized relevance LVQ (GRLVQ), generalizes relevance LVQ (RLVQ) [3], which is based on simple Hebbian learning and leads to worse and unstable results in the case of noisy real-life data. However, like RLVQ, GRLVQ has the advantage of an intuitive update rule and allows efficient input pruning, compared to other approaches which adapt the metric to the data by means of additional transformations, as proposed in [8, 13, 34], or which depend on less intuitive differentiable approximations of the original dynamics [21]. Moreover, it is based on a gradient dynamics, in contrast to heuristic methods like DSLVQ [26].

We will verify our method on various small data sets. Moreover, we will apply GRLVQ to classify a real-life satellite image with several million data points. As already mentioned, weighting factors allow us to approximately determine the intrinsic data dimensionality. An alternative method is the growing SOM (GSOM), which automatically adapts the lattice of neurons to the data and hence also gives hints about the intrinsic dimensionality. We compare our GRLVQ experiments to the results provided by GSOM. In addition, we relate them to a Grassberger-Procaccia analysis. We obtain comparable results concerning the intrinsic dimensionality of our data. In the following, we will first introduce our method GRLVQ, then present applications to simple artificial and real-life data, and finally discuss the results for the satellite data.

    2 The GRLVQ Algorithm

Assume a finite training set $X = \{(x^i, y^i) \in \mathbb{R}^n \times \{1,\dots,C\} \mid i = 1,\dots,m\}$ of training data is given and the clustering of the data into the $C$ classes is to be learned. We denote the components of a vector $x \in \mathbb{R}^n$ by $(x_1, \dots, x_n)$ in the following. GLVQ chooses a fixed number of vectors in $\mathbb{R}^n$ for each class, the so-called prototypes. Denote the set of prototypes by $\{w^1, \dots, w^M\}$ and assign the label $c_r = c$ to $w^r$ iff $w^r$ belongs to class $c$.


The GLVQ error function (1) compares, for every training point, the distance to the closest prototype with the same label with the distance to the closest prototype with a different label. This choice is attractive since it combines adaptation near the optimum Bayesian borders, like LVQ2.1, while prohibiting the possible divergence of LVQ2.1 as reported in [29]. Writing $d^{+}$ for the squared Euclidean distance of a data point $x$ to the closest prototype $w^{+}$ with the same label, $d^{-}$ for the squared Euclidean distance to the closest prototype $w^{-}$ with a different label, $\mu(x) = (d^{+}-d^{-})/(d^{+}+d^{-})$, and sgd for the logistic function, we refer to the resulting stochastic gradient descent on (1) as the GLVQ update:

$$\Delta w^{+} = \epsilon\,\mathrm{sgd}'\!\big(\mu(x)\big)\,\frac{d^{-}}{(d^{+}+d^{-})^{2}}\,(x - w^{+}), \qquad \Delta w^{-} = -\,\epsilon\,\mathrm{sgd}'\!\big(\mu(x)\big)\,\frac{d^{+}}{(d^{+}+d^{-})^{2}}\,(x - w^{-}), \tag{3}$$

where $\epsilon > 0$ is the learning rate and constant factors are absorbed in $\epsilon$.
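For illustration, the following minimal sketch implements one stochastic GLVQ step along the lines of (3) in NumPy; the function names, the choice of the logistic function for sgd, and the learning rate are our illustrative choices rather than anything prescribed by the paper.

    import numpy as np

    def sgd(t):
        """Logistic function used as the sigmoidal transfer function."""
        return 1.0 / (1.0 + np.exp(-t))

    def glvq_step(x, y, prototypes, labels, lr=0.05):
        """One stochastic GLVQ update (3) for a sample x with label y.
        prototypes: float array (M, n), modified in place; labels: array (M,)."""
        d = np.sum((prototypes - x) ** 2, axis=1)          # squared Euclidean distances
        same, other = labels == y, labels != y
        jp = np.where(same)[0][np.argmin(d[same])]         # closest correct prototype
        jm = np.where(other)[0][np.argmin(d[other])]       # closest wrong prototype
        dp, dm = d[jp], d[jm]
        s = sgd((dp - dm) / (dp + dm))
        g = s * (1.0 - s)                                  # derivative of the logistic at mu(x)
        prototypes[jp] += lr * g * dm / (dp + dm) ** 2 * (x - prototypes[jp])
        prototypes[jm] -= lr * g * dp / (dp + dm) ** 2 * (x - prototypes[jm])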

Obviously, the success of GLVQ crucially depends on the fact that the Euclidean metric is appropriate for the data, i.e. that the input dimensions are approximately equally scaled and equally important. Here, we introduce input weights $\lambda_1, \dots, \lambda_n \ge 0$ in order to allow a different scaling of the input dimensions, hence making possibly time-consuming preprocessing of the data superfluous. Substituting the squared Euclidean metric $|x - w|^2$ by its scaled variant

$$|x - w|_{\lambda}^{2} = \sum_{i=1}^{n} \lambda_i \,(x_i - w_i)^{2}, \tag{4}$$

the receptive field of prototype $w^{r}$ becomes

$$R^{r}_{\lambda} = \{\, x \in X \mid \forall w^{s}: \ |x - w^{r}|_{\lambda} \le |x - w^{s}|_{\lambda} \,\}.$$

Using these weighted receptive fields in the error function (1) yields a different weighting of the input dimensions and hence an adaptive metric. Appropriate weighting factors $\lambda_i$ can be determined automatically via a stochastic gradient descent as well. Hence the rule (2), with the relevance factors $\lambda$ integrated into the metric, is accompanied by the update

$$\lambda_j := \begin{cases} \lambda_j - \epsilon_1 \,(x_j - w^{l}_j)^{2} & \text{if } x \text{ is classified correctly by its closest prototype } w^{l}, \\ \lambda_j + \epsilon_1 \,(x_j - w^{l}_j)^{2} & \text{otherwise,} \end{cases} \tag{5}$$

for each $j$, where $\epsilon_1 \in (0,1)$. We add a normalization such that $\sum_j \lambda_j = 1$, so that numerical instabilities of the weighting factors are avoided. This update constitutes RLVQ as proposed in [3].
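A corresponding sketch of the weighted distance (4) and the relevance update (5) could look as follows; the helper names, the clipping of negative weights, and the learning rate are again our illustrative choices.

    import numpy as np

    def weighted_sq_dist(x, w, lam):
        """Scaled squared Euclidean distance (4): sum_i lam_i * (x_i - w_i)^2."""
        return np.sum(lam * (x - w) ** 2)

    def rlvq_relevance_step(x, y, prototypes, labels, lam, lr_rel=0.01):
        """Relevance update (5) followed by clipping and normalization."""
        d = np.array([weighted_sq_dist(x, w, lam) for w in prototypes])
        winner = np.argmin(d)                       # closest prototype w.r.t. the weighted metric
        contrib = (x - prototypes[winner]) ** 2     # per-dimension contribution to that distance
        if labels[winner] == y:                     # correct classification: decrease the factors
            lam = lam - lr_rel * contrib
        else:                                       # wrong classification: increase them
            lam = lam + lr_rel * contrib
        lam = np.maximum(lam, 0.0)                  # keep the weights non-negative
        return lam / lam.sum()                      # normalize so that the weights sum to one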

We remark that this update can be interpreted in a Hebbian way: assume the nearest prototype $w^{l}$ is correct; then only those weighting factors are decreased slightly for which the term $(x_j - w^{l}_j)^{2}$ is small. Taking the normalization of the weighting factors into account, the weighting factors are increased in this situation iff they contribute to the correct classification. Conversely, those factors are increased most for which the term $(x_j - w^{l}_j)^{2}$ is large if the classification is wrong; hence, if the classification is wrong, precisely those weighting factors are increased which do not contribute to the wrong classification. Since the error function is not continuous in this case, this yields merely a plausible explanation of the update rule. However, it is not surprising that the method shows instabilities for large data sets which are subject to noise, as we will see later.

We can apply the same idea to GLVQ. The modification of (3) which involves the relevance factors $\lambda$ of the metric is then accompanied by

$$\lambda_j := \lambda_j - \epsilon_1\,\mathrm{sgd}'\!\big(\mu_{\lambda}(x)\big)\left( \frac{d^{-}_{\lambda}}{(d^{+}_{\lambda}+d^{-}_{\lambda})^{2}}\,(x_j - w^{+}_j)^{2} \;-\; \frac{d^{+}_{\lambda}}{(d^{+}_{\lambda}+d^{-}_{\lambda})^{2}}\,(x_j - w^{-}_j)^{2} \right) \tag{6}$$

for each $j$, $w^{+}$ and $w^{-}$ being the closest correct and wrong prototype, respectively, and $d^{+}_{\lambda}$ and $d^{-}_{\lambda}$ the respective squared distances in the weighted Euclidean metric, with $\mu_{\lambda}(x) = (d^{+}_{\lambda}-d^{-}_{\lambda})/(d^{+}_{\lambda}+d^{-}_{\lambda})$. Again, this is followed by normalization. We term this generalization of RLVQ and GLVQ generalized relevance learning vector quantization, GRLVQ for short. Note that the update can be motivated intuitively by the Hebb paradigm, taking the normalization into account: it comprises the same terms as (5). Hence those weighting factors are reinforced most whose coefficients are closest to the respective data point if this point is classified correctly; otherwise, if $x$ is classified wrongly, those factors are reinforced most whose coefficients are far away. The difference between (6) and (5) consists in appropriate situation-dependent weightings of the two terms and in the simultaneous update according to the closest correct and the closest wrong prototype. Besides, the update rule obeys a gradient dynamics on the corresponding error function (1), as we show in the appendix.

Obviously, the same idea could be applied to any gradient dynamics. We could, for example, minimize a different error function such as the Kullback-Leibler divergence between the distribution which is to be learned and the distribution which is implemented by the vector quantizer. Moreover, this approach is not limited to supervised tasks: we could enlarge unsupervised methods which obey a gradient dynamics, such as the neural gas algorithm [20], with weighting factors in order to obtain an adaptive metric.
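Putting the pieces together, one GRLVQ training step, i.e. the prototype update (3) taken with respect to the weighted metric (4) together with the relevance update (6), can be sketched as follows; the learning rates, the clipping of negative weights, and the in-place updates are our choices, and constant factors are absorbed in the learning rates.

    import numpy as np

    def sgd(t):
        return 1.0 / (1.0 + np.exp(-t))

    def grlvq_step(x, y, prototypes, labels, lam, lr=0.05, lr_rel=0.005):
        """One stochastic GRLVQ step: weighted-metric prototype update plus
        relevance update (6); prototypes is a float array (M, n), lam has shape (n,)."""
        d = np.sum(lam * (prototypes - x) ** 2, axis=1)    # weighted squared distances (4)
        same, other = labels == y, labels != y
        jp = np.where(same)[0][np.argmin(d[same])]         # closest correct prototype
        jm = np.where(other)[0][np.argmin(d[other])]       # closest wrong prototype
        dp, dm = d[jp], d[jm]
        s = sgd((dp - dm) / (dp + dm))
        g = s * (1.0 - s)                                  # sgd'(mu_lambda(x))
        xi_p = dm / (dp + dm) ** 2
        xi_m = dp / (dp + dm) ** 2
        cp, cm = x - prototypes[jp], x - prototypes[jm]    # offsets taken before the updates
        prototypes[jp] += lr * g * xi_p * lam * cp         # lam enters via the weighted metric
        prototypes[jm] -= lr * g * xi_m * lam * cm
        lam -= lr_rel * g * (xi_p * cp ** 2 - xi_m * cm ** 2)
        np.maximum(lam, 0.0, out=lam)                      # clip and renormalize the relevances
        lam /= lam.sum()
        return prototypes, lam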

    3 Relation to previous research

The main characteristics of GRLVQ as proposed in the previous section are as follows: the method allows an adaptive metric via scaling the input dimensions, the metric being restricted to a diagonal matrix. The advantages are the efficiency of the method, the interpretability of the matrix elements as relevance factors, and the related possibility of pruning. The update proposed in GRLVQ is intuitive and efficient, and at the same time a thorough mathematical foundation is available due to the gradient dynamics. As we will see in the next section, GRLVQ provides a robust classification system which is appropriate for real-life data.

Naturally, various approaches in the literature consider the questions of an adaptive metric, input pruning, and dimensionality determination, too. The most similar approach we are aware of is distinction sensitive LVQ (DSLVQ) [26]. This method introduces weighting factors, too, and is based on LVQ3. The main advantages of our iterative update scheme compared to the DSLVQ update are threefold: our update is very intuitive and can be explained as Hebbian learning; our method is more efficient, since each DSLVQ update step requires normalizing twice; and, what we believe is the most important difference, our update constitutes a gradient descent on an error function, hence the dynamics can be analyzed mathematically and a clear objective can be identified.

Recently, Kaski et al. proposed two different approaches which allow an adaptive metric for unsupervised clustering if additional information in an auxiliary space is available [17, 32]. Their focus lies on unsupervised clustering, and they use a Bayesian framework in order to derive appropriate algorithms. The approach in [17] explicitly adapts the metric; however, it needs a model explaining the auxiliary data, hence we cannot apply it for our purpose, where the explicit clustering itself constitutes the model to be developed. In [32] an explicit model is no longer necessary; however, the method relies on several statistical assumptions and is derived for soft clustering instead of exact LVQ. One could borrow ideas from [32]; alternatively to this statistical scenario, GRLVQ offers another direct, efficient, and intuitive approach.

Methods as proposed in [13] and variations thereof allow an adaptive metric for other clustering algorithms such as fuzzy clustering. The algorithm in [13] even allows a more flexible metric with non-vanishing entries outside the diagonal; however, such algorithms are naturally less efficient and require, for example, a matrix inversion. In addition, well-known methods like RBF networks can be put in the same line, since they can provide a clustering with an adaptive metric as well; commonly, their training is less intuitive and less efficient than GRLVQ. Moreover, a more flexible metric which is not restricted to a diagonal matrix no longer suggests a natural pruning scheme.

Apart from the flexibility due to an adaptive metric, GRLVQ provides a simple way of determining which data dimensions are relevant: we can just drop the dimensions with the lowest weighting factors until a considerable increase of the classification error is observed. This is a common feature of all methods which determine weighting factors describing the metric. Alternatively, one can use general methods for determining the dimensionality of the data which are not tailored to the LVQ classifier. The most popular approaches are probably ICA and PCA, as already mentioned [15, 25]. Alternatively, one could use the above-mentioned GSOM algorithm [2]; however, because of its remaining hypercubical structure, the results may be inaccurate. Another method is to apply a Grassberger-Procaccia analysis to determine the intrinsic dimension; this method is unfortunately sensitive to noise [12, 38]. A wide variety of relevance determination methods exists in statistics and in the supervised neural network literature, e.g. [9, 10, 11, 24, 28, 30, 31]. These methods mostly focus on the task of obtaining sparse classifiers and do not yield an adaptive metric which could be used in self-organizing, metric-based algorithms like LVQ and SOM. Hence a comparison with our method, which primarily focuses on an adaptive metric for self-organizing algorithms, would be interesting but is beyond the scope of this article.
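As an illustration of this relevance-based pruning, the following sketch drops the dimensions with the lowest weighting factors one at a time until the accuracy degrades noticeably; the callback eval_accuracy, which is assumed to retrain and evaluate a classifier (e.g. GRLVQ) on the reduced data, and the tolerance are hypothetical.

    import numpy as np

    def prune_by_relevance(X, y, lam, eval_accuracy, tol=0.02):
        """Greedy pruning guided by the relevance profile lam."""
        order = np.argsort(lam)                  # dimensions from least to most relevant
        kept = list(range(X.shape[1]))
        baseline = eval_accuracy(X, y)
        for dim in order[:-1]:                   # always keep at least one dimension
            candidate = [d for d in kept if d != dim]
            acc = eval_accuracy(X[:, candidate], y)
            if baseline - acc > tol:             # considerable increase of the error: stop
                break
            kept = candidate
        return kept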

    4 Experiments

    Artificial data

We first tested GRLVQ on two artificial data sets from [3] in order to compare it to RLVQ. We refer to the sets as data 1 and data 2, respectively. Each set comprises three classes with two clusters each, with small or large overlap of the clusters, respectively, in two dimensions, as depicted in Fig. 1. We embed the points in $\mathbb{R}^{10}$ as follows: assume $(x_1, x_2)$ is a data point; then we add eight dimensions, obtaining a point $(x_1, \dots, x_{10})$. We choose $x_3 = x_1 + \eta_1, \dots, x_6 = x_1 + \eta_4$, where the $\eta_i$ comprise Gaussian noise of increasing variance. The dimensions $x_7, \dots, x_{10}$ contain pure noise which is either uniformly distributed (in two different ranges) or Gaussian (with two different variances). We refer to the resulting noisy data sets as data 3 and data 4, respectively.

Figure 1: Artificial data sets consisting of three classes with two clusters each and small or large overlap, respectively; only the first two dimensions are depicted.
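Since the exact noise parameters did not survive in this transcript, the following sketch only illustrates the type of construction described above; the variances, the uniform ranges, and the cluster stand-in are made-up values.

    import numpy as np

    def embed_with_noise(points, rng, gauss_sds=(0.05, 0.1, 0.2, 0.5)):
        """Append noisy copies of x1 and pure-noise dimensions to 2-D points."""
        n = len(points)
        x1 = points[:, :1]
        noisy_copies = x1 + rng.normal(0.0, gauss_sds, size=(n, len(gauss_sds)))
        uniform_noise = rng.uniform(-0.5, 0.5, size=(n, 2))   # two pure-noise dimensions
        gaussian_noise = rng.normal(0.0, 0.5, size=(n, 2))    # two more pure-noise dimensions
        return np.hstack([points, noisy_copies, uniform_noise, gaussian_noise])

    rng = np.random.default_rng(0)
    clusters = rng.uniform(0.0, 1.0, size=(300, 2))           # stand-in for the clustered 2-D data
    X = embed_with_noise(clusters, rng)                       # shape (300, 10)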

In each run, the data are randomly separated into a training and a test set of the same size. The learning rate $\epsilon$ for the prototypes is chosen as a small constant and the learning rate $\epsilon_1$ for the weighting terms is chosen smaller: since the weighting factors are updated in each step, just as the prototypes are, the learning rate for the weighting terms should be smaller than the learning rate for the prototypes. Pre-training with simple LVQ until the prototypes nearly converge is mandatory for RLVQ; otherwise, the classification error is usually large and the results are not stable. It is advisable to train the prototypes with GLVQ for a few epochs before using GRLVQ, too, in order to avoid instabilities. The number of prototypes per class is chosen according to the a priori known cluster structure. The results on training and test set are comparable in all runs, i.e. the test set accuracy is not worse, or only slightly worse, than the accuracy on the training set. GRLVQ obtains about the same accuracy as RLVQ on all data sets (see Tab. 1) and clearly indicates which dimensions are less important by assigning small weighting factors to them; in these examples the irrelevant dimensions are known by construction. Typical weighting vectors found by RLVQ and GRLVQ clearly separate the important first two data dimensions from the remaining eight dimensions, of which the first four, the noisy copies of $x_1$, still contain some information; this is pointed out by a comparably large third weighting term for the second data set. The remaining four dimensions contain no information at all and receive very small weighting factors. However, GRLVQ shows a faster convergence and larger stability compared to RLVQ, in particular on noisy data sets with a large overlap of the classes such as data 4; there, the separation of the important dimensions is clearer with GRLVQ than with RLVQ. Concerning RLVQ, pre-training with LVQ and small learning rates were mandatory in order to ensure good results; these issues turn out to be less critical for GRLVQ, although it is advisable to choose the learning rate for the weighting terms an order of magnitude smaller than the learning rate for the prototype update.


           data 1    data 2    data 3    data 4
    LVQ    91 - 96   81 - 89   79 - 86   56 - 70
    RLVQ   91 - 96   90 - 96   80 - 86   79 - 86
    GRLVQ  94 - 97   93 - 97   83 - 87   83 - 86

Table 1: Percentage of correctly classified patterns (maximum 100) on the two artificial training data sets, data 1 and data 2, without and with additional noisy dimensions (data 3 and data 4), for LVQ, RLVQ, and GRLVQ, respectively.

These results indicate that GRLVQ is particularly well suited for noisy real-life data sets. Based on the above weighting factors, one can obtain a ranking of the input dimensions and drop all but the first two dimensions without increasing the classification error.

    Iris data

In a second test we applied GRLVQ to the well-known Iris data set provided in the UCI repository of machine learning [4]. The task is to predict three classes of plants based on four numerical attributes for 150 instances, i.e., we deal with data points in $\mathbb{R}^{4}$ with labels in $\{1, 2, 3\}$. Both LVQ and RLVQ obtain roughly the same accuracy on a training and a test set if trained with a few prototypes for each class. RLVQ shows a slightly cyclic behaviour in the limit, the accuracy oscillating between two nearby values. The weighting factors computed by RLVQ indicate that, based on the last dimension alone, a very good classification would be possible. If more dimensions were taken into account, a better accuracy would be possible, as reported in the literature; we could not produce such a solution with LVQ or RLVQ. Moreover, a perfect recognition of 100% would correspond to overfitting, since the data comprise a small amount of noise, as reported in the literature. GRLVQ yields a better accuracy on the training as well as the test set and obtains weighting factors indicating that the last dimension is most important, as already found by RLVQ, and that dimension 3 contributes to a better accuracy, which had not been pointed out by RLVQ. Note that the result obtained by GRLVQ coincides with results obtained, e.g., with rule extraction from feedforward networks [6].
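As a usage example, this experiment can be replayed along the following lines with the grlvq_step sketch from Section 2 and scikit-learn's load_iris; the standardization, the initialization, three prototypes per class, and the number of epochs are our choices and not necessarily the setting used by the authors.

    import numpy as np
    from sklearn.datasets import load_iris

    X, y = load_iris(return_X_y=True)
    X = (X - X.mean(axis=0)) / X.std(axis=0)              # simple standardization
    rng = np.random.default_rng(1)

    per_class = 3                                          # prototypes per class (our choice)
    classes = np.unique(y)
    labels = np.repeat(classes, per_class)
    prototypes = np.vstack([X[y == c][rng.choice(np.sum(y == c), per_class, replace=False)]
                            for c in classes]).astype(float)
    lam = np.full(X.shape[1], 1.0 / X.shape[1])

    for epoch in range(100):
        for i in rng.permutation(len(X)):
            prototypes, lam = grlvq_step(X[i], y[i], prototypes, labels, lam)

    print("relevance profile:", np.round(lam, 3))          # large entries mark important attributes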

    Satellite data

Finally, we applied the algorithm to a large real-world data set: a multi-spectral LANDSAT TM satellite image of the Colorado area (thanks to M. Augusteijn, University of Colorado, for providing this image). Satellites of the LANDSAT TM type produce pictures of the earth in seven different spectral bands.

                      LVQ      RLVQ     GLVQ     GRLVQ
    mean (train)      85.21    86.1     87.32    91.08
    variance (train)  0.59     0.18     0.17     0.11
    mean (test)       85.2     86.36    87.28    91.04
    variance (test)   0.46     0.16     0.1      0.13

Table 2: Percentage of correctly classified patterns (maximum 100) and variance of the runs on the satellite data, obtained in a 10-fold cross-validation.

The ground resolution is 30 m x 30 m for bands 1-5 and band 7. Band 6 (the thermal band) has a resolution of 120 m x 120 m only and is therefore often dropped. The spectral bands represent useful domains of the whole spectrum for detecting and discriminating vegetation, water, rock formations, and cultural features [5, 22]. Hence the spectral information, i.e. the intensities of the bands associated with each pixel of a LANDSAT scene, is represented by a vector in $\mathbb{R}^{6}$ once band 6 is dropped. Generally, the bands are highly correlated [1, 35]. Additionally, the Colorado image is completely labeled by experts; the labels describe different vegetation types and geological formations, and the label probabilities vary over a wide range [36]. The image comprises several million pixels.

We trained RLVQ and GRLVQ with a fixed number of prototypes for each class on a subset of the data set until convergence; the algorithms converged within a moderate number of cycles with $\epsilon$ and $\epsilon_1$ chosen as before. RLVQ yields an accuracy of about 86% on the training data as well as on the entire data set; however, it does not provide a ranking of the input dimensions, i.e. all weighting terms remain close to their initial value. GRLVQ leads to the better accuracy of about 91% on the training set as well as on the entire data set and provides a clear ranking of the data dimensions. See Table 2 for a comparison of the results obtained by the various algorithms. In all experiments, one dimension is consistently ranked as least important, with a weighting factor close to 0, and the weighting factors approximate the same profile in several runs; this weighting clearly separates the first two dimensions via small weighting factors. If we prune the three dimensions with the smallest weighting factors, a high accuracy can still be achieved; hence the intrinsic data dimension is at most three. Pruning one additional data dimension still allows a good accuracy, indicating that the intrinsic dimension may be even lower and that the relevant directions are not parallel to the axes or are even curved. These results are visualized in Fig. 2, where the misclassified pixels in the respective cases are colored black and the remaining pixels are colored according to their respective class.

For comparison we applied a Grassberger-Procaccia analysis and the GSOM approach. GSOM generates a lattice whose shape indicates an intrinsic dimension between two and three, and the Grassberger-Procaccia analysis yields an estimate in the same range. Both agree well with the observation that pruning further dimensions with GRLVQ leads to a drastic loss of information.
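For reference, the correlation dimension underlying a Grassberger-Procaccia analysis [12] can be estimated roughly as in the following sketch; the subsampling, the radius grid, and the use of a single global fit are generic choices and not the procedure applied by the authors.

    import numpy as np
    from scipy.spatial.distance import pdist

    def correlation_dimension(X, radii, sample=2000, seed=0):
        """Fit the slope of log C(r) vs. log r, where C(r) is the fraction of
        point pairs closer than r (the Grassberger-Procaccia correlation integral)."""
        rng = np.random.default_rng(seed)
        radii = np.asarray(radii, dtype=float)
        if len(X) > sample:                          # subsample to keep pdist tractable
            X = X[rng.choice(len(X), sample, replace=False)]
        dists = pdist(X)
        c = np.array([np.mean(dists < r) for r in radii])
        mask = c > 0                                 # avoid log(0) for very small radii
        slope, _ = np.polyfit(np.log(radii[mask]), np.log(c[mask]), 1)
        return slope

    # Example (assuming X holds one spectral vector per pixel):
    # print(correlation_dimension(X, np.logspace(-2, 0, 10)))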


Figure 2: Colorado satellite image: the pixels are colored according to the labels; above left: original labeling; above right: GRLVQ without pruning; below left: GRLVQ with the three least relevant dimensions pruned; below right: GRLVQ with one further dimension pruned. Misclassified pixels in the GRLVQ-generated images are colored black. (A colored version of the image can be obtained from the authors on request.)

    5 Conclusions

The presented clustering algorithm GRLVQ provides a new robust method for automatically adapting the Euclidean metric used for clustering to the data, for determining the relevance of the several input dimensions for the overall classifier, and for estimating the intrinsic dimension of the data. It reduces the input to the essential dimensions, which is required for obtaining optimal network structures. This is an important feature if the network is used to reduce the amount of data passed to subsequent systems in complex data analysis tasks, as found, for example, in medical applications (image analysis) or satellite remote sensing systems. Here, reducing the data to be transferred while preserving the essential information in the data is one of the most important requirements.

The GRLVQ algorithm was successfully tested on artificial as well as real-world data, including a large and noisy multi-spectral satellite image. A comparison with other approaches validates the results even in real-life applications.

It should be noted that the GRLVQ algorithm can easily be adapted to other types of neural vector quantizers such as neural gas or the SOM, to mention just a few. Furthermore, if we assume an unknown probability distribution of the labels for a given data set, the variant of GRLVQ discussed here tries to maximize the Kullback-Leibler divergence; hence, with respect to this feature, our approach shows some similarities to the work of Kaski [17, 32].

Further development of GRLVQ should incorporate information-theoretic approaches such as entropy maximization to improve the capabilities of the network.

    References

[1] M. F. Augusteijn, K. A. Shaw, and R. J. Watson. A study of neural network input data for ground cover identification in satellite images. In S. Gielen and B. Kappen, editors, Proc. ICANN'93, Int. Conf. on Artificial Neural Networks, pages 1010-1013, London, UK, 1993. Springer.

[2] H.-U. Bauer and T. Villmann. Growing a hypercubical output space in a self-organizing feature map. IEEE Transactions on Neural Networks, 8(2):218-226, 1997.

[3] T. Bojer, B. Hammer, D. Schunk, and K. Tluk von Toschanowitz. Relevance determination in learning vector quantization. In Proc. of European Symposium on Artificial Neural Networks (ESANN'01), pages 271-276, Brussels, Belgium, 2001. D-facto publications.

[4] C. L. Blake and C. J. Merz. UCI Repository of Machine Learning Databases. Irvine, CA: University of California, Department of Information and Computer Science.

[5] J. Campbell. Introduction to Remote Sensing. The Guilford Press, U.S.A., 1996.

[6] W. Duch, R. Adamczak, and K. Grabczewski. A new method of extraction, optimization and application of crisp and fuzzy logical rules. IEEE Transactions on Neural Networks, 12:277-306, 2001.

[7] B. Fritzke. Growing grid: a self-organizing network with constant neighborhood range and adaptation strength. Neural Processing Letters, 2(5):9-13, 1995.

[8] I. Gath and A. Geva. Unsupervised optimal fuzzy clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11:773-781, 1989.

[9] T. van Gestel, J. A. K. Suykens, B. de Moor, and J. Vandewalle. Automatic relevance determination for least squares support vector machine classifiers. In M. Verleysen, editor, European Symposium on Artificial Neural Networks, pages 13-18, 2001.

[10] Y. Grandvalet. Anisotropic noise injection for input variables relevance determination. IEEE Transactions on Neural Networks, 11(6):1201-1212, 2000.

[11] Y. Grandvalet. Least absolute shrinkage is equivalent to quadratic penalization. In L. Niklasson, M. Boden, and T. Ziemke, editors, ICANN'98, volume 1 of Perspectives in Neural Computing, pages 201-206. Springer, 1998.

[12] P. Grassberger and I. Procaccia. Measuring the strangeness of strange attractors. Physica D, 9:189-208, 1983.

[13] D. Gustafson and W. Kessel. Fuzzy clustering with a fuzzy covariance matrix. In Proceedings of IEEE CDC'79, pages 761-766, 1979.

[14] T. Hofmann. Learning the similarity of documents: An information geometric approach to document retrieval and categorization. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems, volume 12, pages 914-920. MIT Press, 2000.

[15] A. Hyvärinen and E. Oja. A fast fixed-point algorithm for independent component analysis. Neural Computation, 9(7):1483-1492, 1997.

[16] S. Kaski. Dimensionality reduction by random mapping: fast similarity computation for clustering. In Proceedings of IJCNN'98, pages 413-418, 1998.

[17] S. Kaski. Bankruptcy analysis with self-organizing maps in learning metrics. To appear in IEEE Transactions on Neural Networks.

[18] T. Kohonen. Learning vector quantization. In M. Arbib, editor, The Handbook of Brain Theory and Neural Networks, pages 537-540. MIT Press, 1995.

[19] T. Kohonen. Self-Organizing Maps. Springer, 1997.

[20] T. Martinetz and K. Schulten. Topology representing networks. Neural Networks, 7(3):507-522, 1993.

[21] U. Matecki. Automatische Merkmalsauswahl für Neuronale Netze mit Anwendung in der pixelbezogenen Klassifikation von Bildern. Shaker, 1999.

[22] E. Merényi. The challenges in spectral image analysis: An introduction and review of ANN approaches. In Proc. of European Symposium on Artificial Neural Networks (ESANN'99), pages 93-98, Brussels, Belgium, 1999. D-facto publications.

[23] A. Meyering and H. Ritter. Learning 3D-shape-perception with local linear maps. In Proceedings of IJCNN'92, pages 432-436, 1992.

[24] R. Neal. Bayesian Learning for Neural Networks. Springer, 1996.

[25] E. Oja. Principal component analysis. In M. Arbib, editor, The Handbook of Brain Theory and Neural Networks, pages 753-756. MIT Press, 1995.

[26] M. Pregenzer, G. Pfurtscheller, and D. Flotzinger. Automated feature selection with distinction sensitive learning vector quantization. Neurocomputing, 11:19-29, 1996.

[27] H. Ritter. Self-organizing maps in non-Euclidean spaces. In E. Oja and S. Kaski, editors, Kohonen Maps, pages 97-108. Springer, 1999.

[28] V. Roth. Sparse kernel regressors. In G. Dorffner, H. Bischof, and K. Hornik, editors, Artificial Neural Networks - ICANN 2001, pages 339-346. Springer, 2001.

[29] A. Sato and K. Yamada. Generalized learning vector quantization. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems, volume 7, pages 423-429. MIT Press, 1995.

[30] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267-288, 1996.

[31] M. Tipping. The relevance vector machine. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems, volume 12, pages 652-658. MIT Press, 2000.

[32] J. Sinkkonen and S. Kaski. Clustering based on conditional distribution in an auxiliary space. To appear in Neural Computation.

[33] P. Somervuo and T. Kohonen. Self-organizing maps and learning vector quantization for feature sequences. Neural Processing Letters, 10(2):151-159, 1999.

[34] M.-K. Tsay, K.-H. Shyu, and P.-C. Chang. Feature transformation with generalized learning vector quantization for hand-written Chinese character recognition. IEICE Transactions on Information and Systems, E82-D(3):687-692, 1999.

[35] T. Villmann and E. Merényi. Extensions and modifications of the Kohonen-SOM and applications in remote sensing image analysis. In U. Seiffert and L. C. Jain, editors, Self-Organizing Maps: Recent Advances and Applications, pages 121-145. Springer, 2001.

[36] T. Villmann. Benefits and limits of the self-organizing map and its variants in the area of satellite remote sensing processing. In Proc. of European Symposium on Artificial Neural Networks (ESANN'99), pages 111-116, Brussels, Belgium, 1999. D-facto publications.

[37] T. Villmann, R. Der, M. Herrmann, and T. Martinetz. Topology preservation in self-organizing feature maps: Exact definition and measurement. IEEE Transactions on Neural Networks, 8(2):256-266, 1997.

[38] W. Wienholt. Entwurf Neuronaler Netze. Verlag Harri Deutsch, Frankfurt/M., Germany, 1996.

    Appendix

The general error function (1) here takes the special form

$$E_{\lambda} = \sum_{i=1}^{m} \mathrm{sgd}\!\left(\frac{d^{+}_{\lambda}(x^{i}) - d^{-}_{\lambda}(x^{i})}{d^{+}_{\lambda}(x^{i}) + d^{-}_{\lambda}(x^{i})}\right),$$

$d^{+}_{\lambda}$ and $d^{-}_{\lambda}$ being the quadratic weighted distances to the closest correct and wrong prototype, $w^{+}$ and $w^{-}$, respectively. For convenience we denote $d^{r}_{\lambda}(x) = |x - w^{r}|^{2}_{\lambda} = \sum_{i=1}^{n} \lambda_i (x_i - w^{r}_i)^{2}$ for every prototype $w^{r}$. Assume data come from a distribution $P$ on the input space $\mathbb{R}^{n}$ together with a labeling function $c: \mathbb{R}^{n} \to \{1, \dots, C\}$. Then the continuous version of the error function reads

$$E_{\lambda} = \int \mathrm{sgd}\!\left(\frac{d^{+}_{\lambda}(x) - d^{-}_{\lambda}(x)}{d^{+}_{\lambda}(x) + d^{-}_{\lambda}(x)}\right) dP(x).$$

We assume that the sets $\{x \mid c(x) = c\}$ are measurable. Then we can write the error term in the following way:

$$E_{\lambda} = \sum_{c=1}^{C} \int_{c(x)=c} \sum_{r \in W_{c}} \sum_{s \in \bar W_{c}} \mathrm{sgd}\!\left(\frac{d^{r}_{\lambda}(x) - d^{s}_{\lambda}(x)}{d^{r}_{\lambda}(x) + d^{s}_{\lambda}(x)}\right) \chi^{c}_{r}(x)\,\bar\chi^{c}_{s}(x)\, dP(x), \tag{7}$$

where $W_{c}$ denotes the set of indices of prototypes labeled with $c$, $\bar W_{c}$ the set of indices of prototypes not labeled with $c$, $\chi^{c}_{r}(x)$ is the indicator function of the event that $w^{r}$ is the closest prototype to $x$ among those labeled with $c$, and $\bar\chi^{c}_{s}(x)$ is the indicator function of the event that $w^{s}$ is the closest prototype to $x$ among those not labeled with $c$. Denote by $H$ the Heaviside function and by $|W_{c}|$ and $|\bar W_{c}|$ the number of prototypes labeled with $c$ and not labeled with $c$, respectively. Then we find

$$\chi^{c}_{r}(x) = H\!\left(\sum_{k \in W_{c}} H\!\left(d^{k}_{\lambda}(x) - d^{r}_{\lambda}(x)\right) - |W_{c}|\right) \quad\text{and}\quad \bar\chi^{c}_{s}(x) = H\!\left(\sum_{k \in \bar W_{c}} H\!\left(d^{k}_{\lambda}(x) - d^{s}_{\lambda}(x)\right) - |\bar W_{c}|\right),$$

with the convention $H(0) = 1$. The derivative of the Heaviside function is the delta distribution $\delta$, which is symmetric and vanishes for all arguments different from $0$.

We are interested in the derivative of (7) with respect to every $w^{r}$ and every $\lambda_j$. Assume $c_r$ is the label of $w^{r}$, and write $\mu_{rs}(x) = (d^{r}_{\lambda} - d^{s}_{\lambda})/(d^{r}_{\lambda} + d^{s}_{\lambda})$. The derivative of (7) with respect to $w^{r}$ splits into four contributions: the two terms in which the derivative acts on the sigmoidal factor,

$$\int_{c(x)=c_r} \sum_{s \in \bar W_{c_r}} \mathrm{sgd}'\!\big(\mu_{rs}(x)\big)\, \frac{2\, d^{s}_{\lambda}}{(d^{r}_{\lambda} + d^{s}_{\lambda})^{2}}\; \frac{\partial d^{r}_{\lambda}}{\partial w^{r}}\; \chi^{c_r}_{r}(x)\,\bar\chi^{c_r}_{s}(x)\, dP(x) \tag{8}$$

and

$$-\sum_{c \neq c_r} \int_{c(x)=c} \sum_{k \in W_{c}} \mathrm{sgd}'\!\big(\mu_{kr}(x)\big)\, \frac{2\, d^{k}_{\lambda}}{(d^{k}_{\lambda} + d^{r}_{\lambda})^{2}}\; \frac{\partial d^{r}_{\lambda}}{\partial w^{r}}\; \chi^{c}_{k}(x)\,\bar\chi^{c}_{r}(x)\, dP(x), \tag{9}$$

with $\partial d^{r}_{\lambda}/\partial w^{r} = -2\,\lambda \odot (x - w^{r})$, and the two terms (10) and (11) in which the derivative acts on the indicator functions $\chi^{c_r}_{r}$ and $\bar\chi^{c}_{r}$, respectively. (8) and (9) correspond, up to a constant factor, to the prototype update (3) taken with respect to the weighted metric, performed as a stochastic gradient step for every sampled pattern. (10) and (11) vanish for the following reason: differentiating the indicator functions yields delta distributions $\delta\big(d^{k}_{\lambda}(x) - d^{r}_{\lambda}(x)\big)$ which are non-vanishing only on the borders where two weighted distances coincide; since $\delta$ is symmetric, the contributions of the prototypes $w^{k}$ and $w^{r}$ on such a border cancel, so that each integrand of (10) and (11) vanishes.

The derivative of (7) with respect to $\lambda_j$ can be computed in the same manner. Using $\partial d^{r}_{\lambda}/\partial \lambda_j = (x_j - w^{r}_j)^{2}$, the terms in which the derivative acts on the sigmoidal factor read

$$\sum_{c=1}^{C} \int_{c(x)=c} \sum_{r \in W_{c}} \sum_{s \in \bar W_{c}} \mathrm{sgd}'\!\big(\mu_{rs}(x)\big) \left( \frac{2\, d^{s}_{\lambda}}{(d^{r}_{\lambda} + d^{s}_{\lambda})^{2}}\,(x_j - w^{r}_j)^{2} - \frac{2\, d^{r}_{\lambda}}{(d^{r}_{\lambda} + d^{s}_{\lambda})^{2}}\,(x_j - w^{s}_j)^{2} \right) \chi^{c}_{r}(x)\,\bar\chi^{c}_{s}(x)\, dP(x); \tag{12, 13}$$

they correspond, up to a constant factor, to the relevance update (6). The remaining terms (14) and (15), in which the derivative acts on the indicator functions, again contain only delta distributions of differences of weighted distances; by the same symmetry argument as above, their integrands vanish. Hence the update of GRLVQ constitutes a stochastic gradient descent method with appropriate choices of the learning rates.
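As a quick numerical sanity check of this gradient property (not part of the paper), one can compare the analytic derivative of $\mathrm{sgd}(\mu_{\lambda}(x))$ with respect to $\lambda_j$, which is what (6) descends up to constants, against a central finite difference for a single pattern with the winners held fixed; the toy data below are arbitrary.

    import numpy as np

    def sgd(t):
        return 1.0 / (1.0 + np.exp(-t))

    def cost_term(x, wp, wm, lam):
        """sgd(mu_lambda(x)) for fixed closest correct/wrong prototypes wp, wm."""
        dp = np.sum(lam * (x - wp) ** 2)
        dm = np.sum(lam * (x - wm) ** 2)
        return sgd((dp - dm) / (dp + dm))

    rng = np.random.default_rng(0)
    x, wp, wm = rng.normal(size=(3, 5))
    lam = np.full(5, 0.2)

    # analytic derivative with respect to lambda_j
    dp = np.sum(lam * (x - wp) ** 2)
    dm = np.sum(lam * (x - wm) ** 2)
    s = sgd((dp - dm) / (dp + dm))
    analytic = s * (1 - s) * (2 * dm * (x - wp) ** 2 - 2 * dp * (x - wm) ** 2) / (dp + dm) ** 2

    # central finite differences
    eps = 1e-6
    numeric = np.empty(5)
    for j in range(5):
        hi, lo = lam.copy(), lam.copy()
        hi[j] += eps
        lo[j] -= eps
        numeric[j] = (cost_term(x, wp, wm, hi) - cost_term(x, wp, wm, lo)) / (2 * eps)

    print(np.max(np.abs(analytic - numeric)))    # should be close to zero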
