
    A Validity Index Based on Connectivity

    Sriparna Saha and Sanghamitra Bandyopadhyay

    Machine Intelligence Unit, Indian Statistical Institute, Kolkata, India

    Email:{sriparna r, sanghami}@isical.ac.in

    Abstract

    In this paper we have developed a connectivity based

    cluster validity index. This validity index is able to detect

    the number of clusters automatically from data sets having

    well separated clusters of any shape, size or convexity.

    The proposed cluster validity index, connect-index, uses the

    concept of relative neighborhood graph for measuring the

    amount of connectedness of a particular cluster. The

    proposed connect-index is inspired by the popular Dunn's
    index for measuring cluster validity. The single linkage
    clustering algorithm is used as the underlying partitioning
    technique. The superiority of the proposed validity measure
    over Dunn's index is shown for four artificial
    and two real-life data sets.

    1. Introduction

    Clustering [1] is a core technique in data-mining with

    innumerable applications spanning many fields. Model se-

    lection in clustering consists of two steps. In the first step

    the proper clustering method for a particular data set has

    to be decided upon. Once this choice has been made, one

    has to determine the number of clusters and also assess

    the validity of the clusters formed [1]. For this purpose

    several cluster validity indices have been proposed in the

    literature. The measure of validity of the clusters should
    be such that it imposes an ordering on the
    partitionings in terms of their goodness. In other words, if
    $U_1, U_2, \ldots, U_m$ are the $m$ partitions of $X$, and the
    corresponding values of a validity measure are $V_1, V_2, \ldots, V_m$, then
    $V_{k_1} \geq V_{k_2} \geq \ldots \geq V_{k_m}$, with $k_i \in \{1, 2, \ldots, m\}$, $i = 1, 2, \ldots, m$,
    will indicate that $U_{k_1} \succeq \ldots \succeq U_{k_m}$. Here $U_i \succeq U_j$ indicates
    that partition $U_i$ is a better clustering than $U_j$. Note that
    a validity measure may also define a decreasing sequence
    instead of an increasing sequence of $V_{k_1}, \ldots, V_{k_m}$.

    Several cluster validity indices have been proposed in

    the literature. These include the Davies-Bouldin (DB) index [2],
    Dunn's index [3], the Xie-Beni (XB) index [2], the I-index [2], the
    CS-index [4] and the Sym-index [5], [6], to name just a few. Some
    of these indices are able to detect the
    correct partitioning for a given number of clusters, while
    some can determine the appropriate number of clusters as
    well. Maulik and Bandyopadhyay [2] evaluated the performance
    of four validity indices, namely, the Davies-Bouldin
    index [2], Dunn's index [3], the Calinski-Harabasz index [2],
    and a new index I, in conjunction with three different
    algorithms, viz., the well-known K-means [1], the single-linkage
    algorithm [1] and an SA-based clustering method [2]. However,
    most of the existing cluster validity indices can detect
    the correct partitioning only when the clusters are either
    hyperspherical or symmetrical in shape. Dunn's index
    [3] is able to detect clusters having different shapes, but
    sometimes it prefers a partitioning in which some clusters
    are merged together so as to maximize the minimum separation
    between clusters. In this paper a cluster validity index is
    developed which is able to detect the appropriate partitioning
    from data sets having clusters of any shape, size or convexity
    as long as they are well-separated.

    The concept of relative neighborhood graph (RNG) [7]

    has been successfully applied for solving several pattern

    recognition problems. One unsupervised clustering tech-

    nique based on the concepts of RNG is developed in Ref. [8].

    In this article the concept of the relative neighborhood graph [7]
    is used to develop a new cluster validity index. The proposed
    index, connect-index, quantifies the degree of connectivity
    within individual clusters as well as how well-separated the clusters are.
    The index is inspired by the popular Dunn's index [3],
    but in the experiments reported here it outperforms Dunn's index [3]
    in determining the number of clusters from data sets having well-separated
    clusters. The single-linkage clustering technique [1] is used as
    the underlying partitioning method. The effectiveness of the
    proposed index in comparison with Dunn's index
    [3] is shown for four artificially generated and two real-life
    data sets.

    2. Proposed Cluster Validity Index

    In this section the concept of the relative neighborhood
    graph (RNG) [7], [8] is described first. This is followed by
    a detailed description of the proposed cluster validity index,
    which uses the relative neighborhood
    graph to measure the amount of connectedness
    within the clusters.

    2.1. Relative Neighborhood Graph

    Suppose $r$ is an integer and $p$, $q$ are two points in $r$-dimensional
    Euclidean space. Then the lune of $p$ and $q$
    (denoted $lun(p,q)$ or $lun(pq)$) is the set of points

    $$lun(p,q) = \{ z \in \mathbb{R}^r : d(p,z) < d(p,q) \text{ and } d(q,z) < d(p,q) \},$$

    where $d$ denotes the Euclidean distance. Alternatively,
    $lun(p,q)$ denotes the interior of the region formed by the
    intersection of two $r$-dimensional hyperspheres of radius
    $d(p,q)$, one of the hyperspheres being centered at $p$ and
    the other at $q$. This is illustrated in Figure 1, which shows
    the lune of two points $p$, $q$ in the plane. If $V$ is a set of
    $n$ points in $r$-space, then define the relative neighborhood
    graph of $V$ (denoted $RNG(V)$, or simply RNG when $V$
    is understood) to be the undirected graph with vertices
    $V$ such that for each pair $p, q \in V$, $pq$ is an edge of
    $RNG(V)$ iff $lun(p,q) \cap V = \emptyset$. Here the weight of
    an edge $pq$ is set equal to $d(p,q)$, the Euclidean
    distance between the points $p$ and $q$.

    2009 Seventh International Conference on Advances in Pattern Recognition
    978-0-7695-3520-3/09 $25.00 2009 IEEE
    DOI 10.1109/ICAPR.2009.53

    Figure 1. The lune of two points p and q is the region
    between the two arcs, not including the boundary.

    Figure 2. (a) A set of points in the plane (b) RNG of the points in (a)

    Figure 2(a) shows a set V of points in the plane; Figure
    2(b) shows the RNG of this set of points V. The RNG
    problem is: given a set V, find RNG(V).
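    The lune test above can be sketched directly in code. The following is a
    minimal brute-force illustration, not the authors' implementation: the
    function name relative_neighborhood_graph and the dictionary edge format
    are assumptions made for this example, and the O(n^3) loop is only
    practical for small point sets.

```python
from itertools import combinations
from math import dist  # Euclidean distance (Python 3.8+)


def relative_neighborhood_graph(points):
    """Return the RNG of `points` as a dict {(i, j): weight} with i < j.

    Hypothetical helper for illustration: brute-force check of the lune
    condition for every pair of points.
    """
    n = len(points)
    edges = {}
    for i, j in combinations(range(n), 2):
        d_ij = dist(points[i], points[j])
        # (i, j) is an edge iff no third point z lies strictly inside the
        # lune, i.e. no z with d(p, z) < d(p, q) and d(q, z) < d(p, q).
        lune_is_empty = all(
            not (dist(points[i], points[k]) < d_ij
                 and dist(points[j], points[k]) < d_ij)
            for k in range(n)
            if k != i and k != j
        )
        if lune_is_empty:
            edges[(i, j)] = d_ij  # edge weight is the Euclidean distance
    return edges


if __name__ == "__main__":
    # Three collinear points: the middle point lies in the lune of the
    # two end points, so the long edge (0, 2) is not part of the RNG.
    pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    print(relative_neighborhood_graph(pts))
```

    For the three collinear points in the usage example, only the two short
    edges survive, because the middle point falls inside the lune of the two
    end points.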

    2.2. Measuring the Connectivity Among a Set of Points

    In order to measure the connectivity among a set of points

    we have used the above discussed relative neighborhood

    graph concept. Here the distance between a pair of points is

    measured in the following way.

    1) Construct the relative neighborhood graph of the

    whole data set.

    2) The distance between any two points, $x$ and $y$, denoted
    $d_{short}(x,y)$, is measured along the relative neighborhood
    graph. Find all possible paths between these
    two points along the RNG. Suppose there are in total
    $p$ paths between $x$ and $y$, and the number of edges
    along the $i$th path is $n_i$, for $i = 1, \ldots, p$. If the edges
    along the $i$th path are denoted $ed^i_1, \ldots, ed^i_{n_i}$ and the
    corresponding edge weights are $w(ed^i_1), \ldots, w(ed^i_{n_i})$,
    then the shortest distance between $x$ and $y$ is defined
    as follows:

    $$d_{short}(x,y) = \min_{i=1}^{p} \max_{j=1}^{n_i} w(ed^i_j).$$
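    Enumerating every path between two points, as the definition does, grows
    quickly with the size of the graph. As a sketch under that observation,
    the same min-max quantity can be obtained for all pairs with a
    Floyd-Warshall style recurrence in which path length is the largest edge
    weight instead of the sum. This is a swapped-in technique for
    illustration, not something the paper prescribes; the function name
    minimax_distances and the edge dictionary format are assumptions.

```python
from math import inf


def minimax_distances(n, weighted_edges):
    """All-pairs d_short on a graph with vertices 0..n-1.

    weighted_edges maps (i, j) -> weight of the undirected edge i-j.
    d_short(x, y) is the minimum, over all x-y paths, of the largest edge
    weight on the path. Pairs with no connecting path keep the value inf.
    """
    d = [[inf] * n for _ in range(n)]
    for v in range(n):
        d[v][v] = 0.0
    for (i, j), w in weighted_edges.items():
        d[i][j] = d[j][i] = min(d[i][j], w)
    # Floyd-Warshall with (min, max) in place of (min, +): a path through
    # k costs the larger of its two halves.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = max(d[i][k], d[k][j])
                if via_k < d[i][j]:
                    d[i][j] = via_k
    return d


if __name__ == "__main__":
    # Going 0 -> 1 -> 2 has largest edge 4.0, which beats the direct
    # edge 0 -> 2 of weight 5.0.
    edges = {(0, 1): 1.0, (1, 2): 4.0, (0, 2): 5.0, (2, 3): 2.0}
    for row in minimax_distances(4, edges):
        print(row)
```

    The triple loop costs O(n^3) time and O(n^2) memory, which is adequate
    for data sets of the size used in this paper but would need a different
    approach for much larger ones.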

    2.3. Proposed Cluster Validity Index

    The proposed cluster validity index is defined as follows.
    Suppose the clusters formed are denoted by $C_i$, for
    $i = 1, \ldots, K$, where $K$ is the number of clusters. Then the
    diameter of a particular cluster, denoted $diam(C_i)$ for
    $i = 1, \ldots, K$, is defined as:

    $$diam(C_i) = \max_{x, y \in C_i} \{ d_{short}(x,y) \}.$$

    Here $d_{short}(x,y)$ is as defined in Section 2.2.
    The distance between any two clusters $C_i$ and $C_j$, where
    $i, j = 1, \ldots, K$, $i \neq j$, is defined as follows:

    $$dist(C_i, C_j) = \min_{x \in C_i,\ y \in C_j} \{ d_{short}(x,y) \}.$$

    Now the proposed connectivity-based cluster validity index,
    connect-index, is defined as follows:

    $$connect = \min_{1 \leq i \leq K} \left\{ \min_{1 \leq j \leq K,\ i \neq j} \left\{ \frac{dist(C_i, C_j)}{\max_{1 \leq k \leq K} \{ diam(C_k) \}} \right\} \right\}.$$

    Intuitively, larger values of connect correspond to good
    partitionings. Thus the appropriate number of clusters is
    determined by maximizing connect over different values
    of $K$. If $connect_i$ denotes the connect-index value for
    $K = i$ clusters, then the appropriate number of
    clusters, $K^*$, is determined as:

    $$K^* = \arg\max_{i = 1, \ldots, K_{max}} connect_i.$$

    Here $K_{max}$ is the maximum possible number of clusters. In
    general, $K_{max}$ is kept equal to $\sqrt{n}$, where $n$ is the number
    of points in the data set.

    connect-index has two components. Its denominator measures
    the maximum shortest distance between any two points

    in a particular cluster. If the cluster is completely connected,
    then the shortest distance between any two of its points is
    very small, and thus the diameter of that cluster
    is small too. As connect-index tries to minimize the
    maximum diameter amongst all clusters, it in turn tries
    to minimize the diameter of every cluster. Thus when all
    clusters are well-connected, their diameters are small and the
    denominator of the connect-index takes a small value. The
    numerator of the connect-index is the minimum separation
    between any two clusters, measured as the minimum
    shortest distance along the RNG between any two points belonging to
    two different clusters. In order to increase the
    value of connect-index, the numerator has
    to be maximized, so the minimum separation between
    any two clusters should be as large as possible. This only happens
    if the clusters are well-separated. Thus connect-index attains
    its maximum value when all the clusters are connected and
    well-separated as well.

    Table 1. Experimental Results on Several Data sets.
    Here AC denotes the actual number of clusters and OC
    denotes the obtained number of clusters.

    Name       # points  dimension  AC  OC (connect)  OC (Dunn)
    Pat1       557       2          3   3             2
    Pat2       417       2          2   2             2
    Spiral     1000      2          2   2             2
    Mixed 5 2  850       2          5   5             6
    Iris       150       4          3   2             2
    Cancer     683       9          2   2             6
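    The definitions of diam, dist and connect in Section 2.3 can be sketched
    as follows, assuming the pairwise d_short values are already available as
    a square matrix. The function name connect_index, the label-list input
    and the choice to return infinity when every cluster has zero diameter
    are assumptions made for this illustration; the paper does not discuss
    that degenerate case.

```python
def connect_index(d_short, labels):
    """connect-index of a partition (illustrative sketch).

    d_short: square list-of-lists, d_short[x][y] for every pair of points.
    labels:  labels[i] is the cluster label of point i.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("connect-index needs at least two clusters")
    groups = list(clusters.values())

    # Denominator: largest cluster diameter, where diam(C) is the maximum
    # d_short over pairs of points inside C.
    max_diam = max(
        max(d_short[x][y] for x in members for y in members)
        for members in groups
    )
    # Numerator: smallest inter-cluster distance, where dist(Ci, Cj) is the
    # minimum d_short over pairs with one point in each cluster. Because the
    # denominator does not depend on (i, j), the double minimum in the
    # definition reduces to this single minimum.
    min_sep = min(
        d_short[x][y]
        for a in range(len(groups))
        for b in range(a + 1, len(groups))
        for x in groups[a]
        for y in groups[b]
    )
    if max_diam == 0:
        # Assumption: all clusters are singletons or exact duplicates.
        return float("inf")
    return min_sep / max_diam


if __name__ == "__main__":
    d = [
        [0.0, 1.0, 4.0, 4.0],
        [1.0, 0.0, 4.0, 4.0],
        [4.0, 4.0, 0.0, 2.0],
        [4.0, 4.0, 2.0, 0.0],
    ]
    print(connect_index(d, [0, 0, 1, 1]))  # two tight, well-separated pairs
    print(connect_index(d, [0, 0, 0, 1]))  # one cluster spans the large gap
```

    In the usage example the first labelling scores higher than the second,
    which matches the intuition that merging across a large gap inflates the
    diameter and lowers the index.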

    Figure 3. (a) Pat1 (b) Pat2

    Figure 4. (a) Spiral (b) Mixed 5 2

    3. Experimental Results

    Here the popular single linkage clustering technique
    [1] is used to partition the data sets used for the experiments.
    Four artificial and two real-life data sets are used
    to show the efficacy of the proposed cluster validity index,
    connect-index. The data sets
    are described in Table 1. The Pat1 and Pat2
    data sets are used in Ref. [9], the Spiral data set is used
    in Ref. [10] and the Mixed 5 2 data set is used in Ref. [5].
    Figures 3(a), 3(b), 4(a) and 4(b) show the four artificial data
    sets, respectively. The two real-life data sets are obtained
    from (http://www.ics.uci.edu/mlearn/MLRepository.html).
    The Iris data set represents different categories of irises
    characterized by four feature values. It has three classes: Setosa,
    Versicolor and Virginica. It is known that two classes (Versicolor
    and Virginica) have a large amount of overlap while
    the class Setosa is linearly separable from the other two.
    The Wisconsin Breast Cancer data set has two categories:
    malignant and benign. The two classes are known to be
    linearly separable.

    Figure 5. Optimal partitioning of Pat1 indicated by (a) the
    proposed connect-index for K = 3 (b) Dunn's index for K = 2

    Figure 6. Optimal partitioning indicated by both
    connect-index and Dunn's index on (a) Pat2 for K = 2
    (b) the Spiral data set for K = 2

    Figure 7. Optimal partitioning of Mixed 5 2 indicated
    by (a) the proposed connect-index for K = 5 (b) Dunn's
    index for K = 6

    Single linkage clustering is used to partition all the
    above-mentioned data sets for $K = 2, \ldots, \sqrt{n}$, and the
    corresponding connect-index values are computed for all
    the partitions. The partition with the
    maximum value of connect-index is taken as the optimal
    partitioning, and the corresponding number of clusters is
    regarded as the optimal number of clusters indicated by
    connect-index. For all the data sets used in the experiments,
    the optimal number of clusters indicated by connect-index
    is reported in Table 1. For the purpose of comparison,
    the number of clusters identified by the popular Dunn's
    index [3] for all the data sets is
    also reported in Table 1. The table shows that
    the proposed connect-index identifies
    the appropriate number of clusters for five of the six data
    sets (all except Iris), while Dunn's index
    detects the appropriate number of clusters for two of
    the six. For the Iris data set, both validity indices
    provide K = 2, which is also often obtained by many
    other methods for Iris. Figures 5(a), 6(a), 6(b) and 7(a) show,
    respectively, the optimal partitionings indicated by connect-index
    for the four artificial data sets.
    Similarly, Figures 5(b), 6(a), 6(b) and 7(b) show, respectively,
    the optimal partitionings indicated by Dunn's index
    for these four artificial data sets.
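    The selection procedure described above (sweep K from 2 up to the square
    root of n, score each partition, keep the maximum) can be sketched as a
    small driver. Everything named here is an assumption for illustration:
    index_for_k is a hypothetical callable that returns the validity index of
    the K-cluster partition, the square root is rounded down because the text
    does not say how it is rounded, and ties are resolved in favour of the
    smaller K.

```python
from math import isqrt


def select_number_of_clusters(n_points, index_for_k):
    """Return (best_k, best_value) over K = 2, ..., floor(sqrt(n_points)).

    index_for_k: hypothetical callable K -> validity index value of the
    K-cluster partition (for example, the connect-index of the
    single-linkage partition with K clusters).
    """
    k_max = isqrt(n_points)  # assumption: sqrt(n) rounded down
    if k_max < 2:
        raise ValueError("need at least 4 points to try K = 2")
    scores = {k: index_for_k(k) for k in range(2, k_max + 1)}
    # max() returns the first maximal key, so ties go to the smallest K.
    best_k = max(scores, key=scores.get)
    return best_k, scores[best_k]


if __name__ == "__main__":
    # Made-up index values, used only to exercise the driver.
    toy_scores = {2: 0.5, 3: 2.0, 4: 1.0}
    print(select_number_of_clusters(16, toy_scores.get))
```

    Passing the index as a callable keeps the driver independent of both the
    clustering algorithm and the validity index, so the same loop could score
    Dunn's index for comparison.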

    For the two real-life data sets, Iris and Cancer, no
    visualization is possible as these are high-dimensional data
    sets. The Minkowski Score (MS) [11] is therefore calculated after
    applying the single linkage clustering technique to these
    two data sets. This is a measure of the quality of a
    solution given the true clustering. Let $T$ be the true solution
    and $S$ the solution we wish to measure. Denote by $n_{11}$
    the number of pairs of elements that are in the same cluster
    in both $S$ and $T$. Denote by $n_{01}$ the number of pairs that
    are in the same cluster only in $S$, and by $n_{10}$ the number of
    pairs that are in the same cluster only in $T$. The Minkowski Score
    (MS) is then defined as:

    $$MS(T, S) = \sqrt{\frac{n_{01} + n_{10}}{n_{11} + n_{10}}}.$$

    For MS, the optimum score is 0, with lower scores being better. For the
    Iris data set, the MS value corresponding to the partitioning obtained
    by single linkage clustering for K = 2 is 0.88. For the
    Cancer data set, single linkage clustering obtains
    an MS of 0.43 for K = 2 (the number of clusters indicated by the
    newly proposed connect-index), while the MS for K = 6 (the number
    of clusters indicated by Dunn's index) is 1.45.
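    The pair counts behind the Minkowski Score can be made concrete with a
    short sketch. The function name minkowski_score and the label-list inputs
    are assumptions for this example; the score itself follows the
    definition above, including the square root.

```python
from itertools import combinations
from math import sqrt


def minkowski_score(true_labels, pred_labels):
    """Minkowski Score MS(T, S); 0 is optimal, lower is better."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label lists must have the same length")
    n11 = n01 = n10 = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_t = true_labels[i] == true_labels[j]
        same_s = pred_labels[i] == pred_labels[j]
        if same_t and same_s:
            n11 += 1  # together in both T and S
        elif same_s:
            n01 += 1  # together only in S
        elif same_t:
            n10 += 1  # together only in T
    # Note: undefined (division by zero) if T has no co-clustered pair.
    return sqrt((n01 + n10) / (n11 + n10))


if __name__ == "__main__":
    truth = [0, 0, 1, 1]
    print(minkowski_score(truth, [1, 1, 0, 0]))  # same partition, relabelled
    print(minkowski_score(truth, [0, 0, 0, 0]))  # everything merged
```

    Because only co-membership of pairs is counted, the score is unaffected
    by how the cluster labels are named, and it can exceed 1 when the
    solution groups many pairs that the true clustering keeps apart.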

    4. Discussion and Conclusion

    Identifying the proper number of clusters and the proper
    partitioning of a data set are two crucial issues in unsupervised
    classification. In this paper a cluster validity
    index is developed for this purpose. The proposed index
    is able to detect the appropriate number of clusters and
    the appropriate partitioning from data sets with clusters of any
    shape, size or convexity, as long as the clusters are well-separated.
    The effectiveness of the proposed index in
    comparison with one existing cluster validity index, Dunn's
    index, is shown for four artificial and two real-life data sets.
    Future work includes developing a formal mathematical justification
    of the proposed index. Comparing the proposed validity
    index more extensively with other existing indices is another
    important direction for future research.

    References

    [1] B. S. Everitt, S. Landau, and M. Leese, Cluster Analysis. London: Arnold, 2001.

    [2] U. Maulik and S. Bandyopadhyay, "Performance evaluation of some clustering algorithms and validity indices," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 12, pp. 1650-1654, 2002.

    [3] J. C. Dunn, "A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters," Journal of Cybernetics, vol. 3, pp. 32-57, 1973.

    [4] C. H. Chou, M. C. Su, and E. Lai, "A new cluster validity measure and its application to image compression," Pattern Analysis and Applications, vol. 7, pp. 205-220, 2004.

    [5] S. Bandyopadhyay and S. Saha, "A point symmetry based clustering technique for automatic evolution of clusters," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 11, pp. 1-17, November 2008.

    [6] S. Saha and S. Bandyopadhyay, "Application of a new symmetry based cluster validity index for satellite image segmentation," IEEE Geoscience and Remote Sensing Letters, vol. 5, no. 2, pp. 166-170, 2008.

    [7] G. T. Toussaint, "The relative neighborhood graph of a finite planar set," Pattern Recognition, vol. 12, pp. 261-268, 1980.

    [8] S. Bandyopadhyay, "An automatic shape independent clustering technique," Pattern Recognition, vol. 37, pp. 33-45, 2004.

    [9] S. K. Pal, S. Bandyopadhyay, and C. A. Murthy, "Genetic algorithms for generation of class boundaries," IEEE Trans. System Man Cybernet, vol. 28, no. 6, pp. 816-828, 1998.

    [10] J. Handl and J. Knowles, "An evolutionary approach to multiobjective clustering," IEEE Transactions on Evolutionary Computation, vol. 11, no. 1, pp. 56-76, 2007.

    [11] A. Ben-Hur and I. Guyon, "Detecting Stable Clusters using Principal Component Analysis," in Methods in Molecular Biology. Humana Press, 2003.
