A 091

8/8/2019 A 091

1/4

A Validity Index Based on Connectivity

Sriparna Saha and Sanghamitra Bandyopadhyay

Machine Intelligence Unit, Indian Statistical Institute, Kolkata, India

Email: {sriparna_r, sanghami}@isical.ac.in

Abstract

In this paper we develop a connectivity-based cluster validity index. The index is able to detect the number of clusters automatically from data sets having well-separated clusters of any shape, size or convexity. The proposed cluster validity index, connect-index, uses the concept of the relative neighborhood graph to measure the amount of connectedness of a particular cluster. The connect-index is inspired by the popular Dunn's index for measuring cluster validity. The single linkage clustering algorithm is used as the underlying partitioning technique. The superiority of the proposed validity measure over Dunn's index is shown for four artificial and two real-life data sets.

1. Introduction

Clustering [1] is a core technique in data mining with innumerable applications spanning many fields. Model selection in clustering consists of two steps. In the first step, the proper clustering method for a particular data set has to be decided upon. Once this choice has been made, one has to determine the number of clusters and also assess the validity of the clusters formed [1]. For this purpose several cluster validity indices have been proposed in the literature. A measure of cluster validity should impose an ordering on the partitionings in terms of their goodness. In other words, if $U_1, U_2, \ldots, U_m$ are $m$ partitions of $X$, and the corresponding values of a validity measure are $V_1, V_2, \ldots, V_m$, then a monotone ordering of the values $V_{k_1}, V_{k_2}, \ldots, V_{k_m}$, with $k_i \in \{1, 2, \ldots, m\}$ for $i = 1, 2, \ldots, m$, indicates the ordering $U_{k_1} \succ \cdots \succ U_{k_m}$ of the partitions. Here $U_i \succ U_j$ indicates that partition $U_i$ is a better clustering than $U_j$. Note that, depending on the validity measure, the sequence $V_{k_1}, \ldots, V_{k_m}$ may be decreasing instead of increasing.

The indices proposed in the literature include the Davies-Bouldin (DB) index [2], Dunn's index [3], the Xie-Beni (XB) index [2], the I-index [2], the CS-index [4] and the Sym-index [5], [6], to name just a few. Some of these indices have been found to be able to detect the correct partitioning for a given number of clusters, while some can determine the appropriate number of clusters as well. Maulik and Bandyopadhyay [2] evaluated the performance of four validity indices, namely the Davies-Bouldin index [2], Dunn's index [3], the Calinski-Harabasz index [2] and a new index I, in conjunction with three different algorithms, viz., the well-known K-means [1], the single-linkage algorithm [1] and an SA-based clustering method [2]. However, most of the existing cluster validity indices can detect the appropriate partitioning only when the clusters are hyperspherical or symmetrical in shape. Dunn's index [3] is able to detect clusters having different shapes, but it sometimes prefers a partitioning in which some clusters are merged together so as to maximize the minimum separation between clusters. In this paper a cluster validity index is developed that is able to detect the appropriate partitioning from data sets having clusters of any shape, size or convexity, as long as they are well-separated.

The concept of the relative neighborhood graph (RNG) [7] has been successfully applied to several pattern recognition problems. An unsupervised clustering technique based on the RNG is developed in Ref. [8]. In this article the relative neighborhood graph [7] is used to develop a new cluster validity index. The proposed index, connect-index, quantifies the degree of connectivity of individual clusters while also requiring them to be well-separated. The index is inspired by the popular Dunn's index [3], but it outperforms Dunn's index [3] in determining the number of clusters from data sets having well-separated clusters. The single-linkage clustering technique [1] is used as the underlying partitioning method. The effectiveness of the proposed index in comparison with Dunn's index [3] is shown for four artificially generated and two real-life data sets.

2. Proposed Cluster Validity Index

In this section the concept of the relative neighborhood graph (RNG) [7], [8] is first described. This is followed by a detailed description of the proposed cluster validity index, which uses the relative neighborhood graph to measure the amount of connectedness among the clusters.

2.1. Relative Neighborhood Graph

Suppose $r$ is an integer and $p$, $q$ are two points in $r$-dimensional Euclidean space. Then the lune of $p$ and $q$ (denoted $lun(p,q)$ or $lun(pq)$) is the set of points

$$lun(p,q) = \{ z \in \mathbb{R}^r : d(p,z) < d(p,q) \text{ and } d(q,z) < d(p,q) \},$$

where $d$ denotes the Euclidean distance. Alternatively, $lun(p,q)$ denotes the interior of the region formed by the intersection of two $r$-dimensional hyperspheres of radius $d(p,q)$, one centered at $p$ and the other at $q$. This is illustrated in Figure 1, which shows the lune of two points $p$, $q$ in the plane. If $V$ is a set of $n$ points in $r$-space, then the relative neighborhood graph of $V$ (denoted $RNG(V)$, or simply RNG when $V$ is understood) is defined to be the undirected graph with vertex set $V$ such that, for each pair $p, q \in V$, $pq$ is an edge of $RNG(V)$ iff $lun(p,q) \cap V = \emptyset$. The weight of an edge $pq$ is set equal to $d(p,q)$, the Euclidean distance between the points $p$ and $q$.

2009 Seventh International Conference on Advances in Pattern Recognition

978-0-7695-3520-3/09 $25.00 2009 IEEE

DOI 10.1109/ICAPR.2009.53

Figure 1. The lune of two points p and q is the region between the two arcs, not including the boundary.

Figure 2. (a) A set of points in the plane (b) RNG of the points in (a)

Figure 2(a) shows a set $V$ of points in the plane; Figure 2(b) shows the RNG of this set of points $V$. The RNG problem is: given a set $V$, find $RNG(V)$.
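The lune condition can be checked directly for every pair of points. The following Python sketch is illustrative only and is not the authors' implementation: the function name `relative_neighborhood_graph` and the edge-dictionary representation are assumptions made here, and the cubic-time test is practical only for small point sets.

```python
import math
from itertools import combinations


def relative_neighborhood_graph(points):
    """Return the RNG edges as a dict {(i, j): weight} with i < j.

    Brute-force O(n^3) check of the lune condition (hypothetical helper).
    """
    n = len(points)
    d = [[math.dist(points[a], points[b]) for b in range(n)] for a in range(n)]
    edges = {}
    for i, j in combinations(range(n), 2):
        # pq is an edge iff no third point z lies strictly inside lun(p, q),
        # i.e. no z has d(p, z) < d(p, q) and d(q, z) < d(p, q).
        lune_is_empty = not any(
            d[i][z] < d[i][j] and d[j][z] < d[i][j]
            for z in range(n)
            if z != i and z != j
        )
        if lune_is_empty:
            edges[(i, j)] = d[i][j]  # edge weight = Euclidean distance
    return edges


if __name__ == "__main__":
    pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (1.0, 5.0)]
    for (i, j), w in sorted(relative_neighborhood_graph(pts).items()):
        print(i, j, round(w, 3))
```

In the sample point set, the pair (0, 2) is not joined because point 1 lies inside its lune, which is the behaviour the definition asks for.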

2.2. Measuring the Connectivity Among a Set of Points

The connectivity among a set of points is measured using the relative neighborhood graph described above. The distance between a pair of points is computed in the following way.

1) Construct the relative neighborhood graph of the

whole data set.

2) The distance between any two points, $x$ and $y$, denoted $d_{short}(x,y)$, is measured along the relative neighborhood graph. Find all possible paths between these two points along the RNG. Suppose there are $p$ paths in total between $x$ and $y$, and the number of edges along the $i$th path is $n_i$, for $i = 1, \ldots, p$. If the edges along the $i$th path are denoted $ed^i_1, \ldots, ed^i_{n_i}$ and the corresponding edge weights are $w(ed^i_1), \ldots, w(ed^i_{n_i})$, then the shortest distance between $x$ and $y$ is defined as follows:

$$d_{short}(x,y) = \min_{i=1}^{p} \; \max_{j=1}^{n_i} \; w(ed^i_j).$$
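Enumerating every path between two points quickly becomes infeasible, but the same min-max value can be obtained without listing paths. The Python sketch below is one possible way to do this, assuming the graph is given as an edge dictionary; it is not taken from the paper. It runs a Floyd-Warshall style recurrence in which the cost of a path is its largest edge weight rather than the sum of its weights. The name `minimax_distances` and its parameters are hypothetical.

```python
import math


def minimax_distances(n, edges):
    """All-pairs d_short on a weighted undirected graph.

    n     -- number of vertices, labelled 0..n-1
    edges -- dict {(i, j): weight}
    Returns an n x n list of lists; unreachable pairs stay math.inf.
    """
    d = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        d[i][i] = 0.0
    for (i, j), w in edges.items():
        d[i][j] = d[j][i] = min(d[i][j], w)
    # Floyd-Warshall with (min, max) in place of (min, +): a path costs as
    # much as its heaviest edge, and the cheapest such path is kept.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = max(d[i][k], d[k][j])
                if via_k < d[i][j]:
                    d[i][j] = via_k
    return d
```

For example, with edges 0-1 of weight 1, 1-2 of weight 4 and 0-2 of weight 10, the direct path from 0 to 2 costs 10 while the path through vertex 1 costs max(1, 4) = 4, so the value for the pair (0, 2) is 4.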

2.3. Proposed Cluster Validity Index

The proposed cluster validity index is defined as follows. Suppose the clusters formed are denoted by $C_i$, for $i = 1, \ldots, K$, where $K$ is the number of clusters. The diameter of a cluster $C_i$, denoted $diam(C_i)$, is defined as

$$diam(C_i) = \max_{x, y \in C_i} d_{short}(x,y),$$

where $d_{short}(x,y)$ is as defined in Section 2.2. The distance between any two clusters $C_i$ and $C_j$, where $i, j = 1, \ldots, K$ and $i \neq j$, is defined as

$$dist(C_i, C_j) = \min_{x \in C_i,\; y \in C_j} d_{short}(x,y).$$

The proposed connectivity-based cluster validity index, connect-index, is then defined as

$$connect = \min_{1 \le i \le K} \left\{ \min_{1 \le j \le K,\; j \neq i} \left\{ \frac{dist(C_i, C_j)}{\max_{1 \le k \le K} diam(C_k)} \right\} \right\}.$$

Intuitively, larger values of connect correspond to good partitionings. Thus the appropriate number of clusters is determined by maximizing connect over different values of $K$. If $connect_i$ denotes the connect-index value for $K = i$ clusters, then the appropriate number of clusters, $K^*$, is determined as

$$K^* = \arg\max_{i = 1, \ldots, K_{max}} connect_i.$$
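The definitions above can be sketched in a few lines of Python once a matrix of $d_{short}$ values is available. This is a minimal illustration rather than the authors' code: the names `connect_index` and `select_num_clusters`, the label-list input, and the choice to return infinity when every cluster is a single point (so that all diameters are zero) are assumptions made here.

```python
def connect_index(d_short, labels):
    """connect-index of one partition.

    d_short -- square matrix (list of lists) of d_short(x, y) values
    labels  -- labels[x] is the cluster of point x; needs >= 2 clusters
    """
    clusters = {}
    for x, c in enumerate(labels):
        clusters.setdefault(c, []).append(x)
    groups = list(clusters.values())
    if len(groups) < 2:
        raise ValueError("connect-index needs at least two clusters")

    # Denominator: the largest cluster diameter, max_k diam(C_k).
    max_diam = max(d_short[x][y] for g in groups for x in g for y in g)

    # Numerator: the smallest between-cluster distance over all i != j.
    min_sep = min(
        d_short[x][y]
        for a in range(len(groups))
        for b in range(a + 1, len(groups))
        for x in groups[a]
        for y in groups[b]
    )
    # Assumption: with only singleton clusters the ratio is taken as infinite.
    return float("inf") if max_diam == 0 else min_sep / max_diam


def select_num_clusters(d_short, partitions):
    """partitions -- dict {K: labels}; returns the K with the largest index."""
    return max(partitions, key=lambda k: connect_index(d_short, partitions[k]))
```

Because the denominator is the same for every pair of clusters, the double minimum in the definition reduces to the smallest between-cluster distance divided by the largest diameter, which is what the function computes.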

$K_{max}$ is the maximum possible number of clusters. In general, $K_{max}$ is kept equal to $\sqrt{n}$, where $n$ is the number of points in the data set.

The connect-index has two components. Its denominator measures the maximum shortest distance between any two points in a particular cluster. If a cluster is completely connected, then the shortest distance between any two of its points is very small, and thus the diameter of that cluster is small too. As the connect-index tries to minimize the maximum diameter amongst all clusters, it in turn tries to minimize the diameter of every cluster. Thus when all clusters are well-connected, their diameters are small and the denominator of the connect-index takes a small value. The numerator of the connect-index is the minimum separation between any two clusters, measured as the minimum shortest distance along the RNG between any two points belonging to two different clusters. To increase the value of the connect-index, the numerator has to be maximized, so the minimum separation between any two clusters should be as large as possible. This happens only if the clusters are well-separated. Thus the connect-index attains its maximum value when all the clusters are connected and well-separated as well.

Table 1. Experimental results on several data sets. Here AC denotes the actual number of clusters and OC denotes the obtained number of clusters.

Name        # points   dimension   AC   OC (connect)   OC (Dunn)
Pat1        557        2           3    3              2
Pat2        417        2           2    2              2
Spiral      1000       2           2    2              2
Mixed 5 2   850        2           5    5              6
Iris        150        4           3    2              2
Cancer      683        9           2    2              6

Figure 3. (a) Pat1 (b) Pat2

Figure 4. (a) Spiral (b) Mixed 5 2

3. Experimental Results

Here the popular single linkage clustering technique [1] is used to partition the data sets used in the experiments. Four artificial and two real-life data sets are used to show the efficacy of the proposed cluster validity index, connect-index. The data sets are described in Table 1. The Pat1 and Pat2 data sets are used in Ref. [9], the Spiral data set is used in Ref. [10] and the Mixed 5 2 data set is used in Ref. [5]. Figures 3(a), 3(b), 4(a) and 4(b) show the four artificial data sets, respectively. The two real-life data sets are obtained from http://www.ics.uci.edu/mlearn/MLRepository.html. The Iris data set represents different categories of irises characterized by four feature values. It has three classes: Setosa, Versicolor and Virginica. It is known that two classes (Versicolor and Virginica) have a large amount of overlap, while the class Setosa is linearly separable from the other two. The Wisconsin Breast Cancer data set has two categories: malignant and benign. The two classes are known to be linearly separable.

Figure 5. Optimal partitioning on Pat1 indicated by (a) proposed connect-index for K = 3 (b) Dunn's index for K = 2

Figure 6. Optimal partitioning indicated by both connect-index and Dunn's index on (a) Pat2 for K = 2 (b) Spiral data set for K = 2

Figure 7. Optimal partitioning for Mixed 5 2 indicated by (a) proposed connect-index for K = 5 (b) Dunn's index for K = 6

Single linkage clustering is used to partition all the above-mentioned data sets for $K = 2, \ldots, \sqrt{n}$, and the corresponding connect-index values are computed for all the partitions. The partition corresponding to the maximum value of connect-index is taken as the optimal partitioning, and the corresponding number of clusters is regarded as the optimal number of clusters indicated by connect-index. For all the data sets used in the experiments, the optimal numbers of clusters indicated by connect-index are reported in Table 1. For the purpose of comparison, the numbers of clusters identified by the popular Dunn's index [3] for all the data sets are also reported in Table 1. The table reveals that the proposed connect-index identifies the appropriate number of clusters for five of the six data sets, while Dunn's index does so for only two of them. For the Iris data set, both validity indices provide K = 2, which is also often obtained by many other methods for Iris. Figures 5(a), 6(a), 6(b) and 7(a) show, respectively, the optimal partitionings indicated by connect-index for the four artificial data sets. Similarly, Figures 5(b), 6(a), 6(b) and 7(b) show, respectively, the optimal partitionings indicated by Dunn's index for these four artificial data sets.
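The experimental protocol (partition with single linkage for each candidate K, score each partition with a validity index, keep the best K) can be sketched as follows. This is a rough illustration under stated assumptions, not the code used for the reported results: the single linkage routine is a naive agglomerative version, the upper limit is taken as the integer part of $\sqrt{n}$, and the baseline Dunn's index with Euclidean distances is used as the scoring function so that the block is self-contained. All function names are hypothetical; a connect-index implementation with the same `(points, labels)` signature could be passed in place of `dunn_index`.

```python
import math


def single_linkage(points, k):
    """Naive agglomerative single linkage; returns one label per point."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: cluster distance = closest pair of members.
                gap = min(math.dist(points[x], points[y])
                          for x in clusters[a] for y in clusters[b])
                if best is None or gap < best[0]:
                    best = (gap, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))  # b > a, so index a stays valid
    labels = [0] * len(points)
    for c, members in enumerate(clusters):
        for x in members:
            labels[x] = c
    return labels


def dunn_index(points, labels):
    """Dunn's index with Euclidean distances (the baseline index [3])."""
    n = len(points)
    sep = min(math.dist(points[x], points[y])
              for x in range(n) for y in range(n) if labels[x] != labels[y])
    diam = max(math.dist(points[x], points[y])
               for x in range(n) for y in range(n) if labels[x] == labels[y])
    return math.inf if diam == 0 else sep / diam


def select_k(points, index_fn):
    """Score K = 2..floor(sqrt(n)) and return (best K, all scores)."""
    k_max = math.isqrt(len(points))
    scores = {k: index_fn(points, single_linkage(points, k))
              for k in range(2, k_max + 1)}
    return max(scores, key=scores.get), scores
```

The naive merging loop is far slower than a minimum-spanning-tree based single linkage, so this sketch is only suitable for small examples.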

For the two real-life data sets, Iris and Cancer, no visualization is possible as these are high-dimensional data sets. The Minkowski Score (MS) [11] is calculated after applying the single linkage clustering technique to these two data sets. This is a measure of the quality of a solution given the true clustering. Let $T$ be the true solution and $S$ the solution we wish to measure. Denote by $n_{11}$ the number of pairs of elements that are in the same cluster in both $S$ and $T$. Denote by $n_{01}$ the number of pairs that are in the same cluster only in $S$, and by $n_{10}$ the number of pairs that are in the same cluster only in $T$. The Minkowski Score (MS) is then defined as

$$MS(T, S) = \sqrt{\frac{n_{01} + n_{10}}{n_{11} + n_{10}}}.$$

For MS, the optimum score is 0, with lower scores being better. For the Iris data set, the MS value corresponding to the partitioning obtained by single linkage clustering for K = 2 is 0.88. For the Cancer data set, single linkage clustering obtains an MS of 0.43 for K = 2 (the number of clusters indicated by the newly proposed connect-index), while the MS for K = 6 (the number of clusters indicated by Dunn's index) is 1.45.
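The pair counts behind the Minkowski Score are simple to compute by brute force. The sketch below assumes the score is the square root of $(n_{01} + n_{10}) / (n_{11} + n_{10})$ and that both clusterings are given as label lists of equal length; the function name `minkowski_score` is hypothetical, and the code is an illustration rather than the evaluation script behind the reported values.

```python
import math
from itertools import combinations


def minkowski_score(true_labels, pred_labels):
    """MS(T, S) = sqrt((n01 + n10) / (n11 + n10)) from pair counts.

    Raises ZeroDivisionError if no pair shares a cluster in T.
    """
    n11 = n01 = n10 = 0
    for a, b in combinations(range(len(true_labels)), 2):
        same_t = true_labels[a] == true_labels[b]
        same_s = pred_labels[a] == pred_labels[b]
        if same_t and same_s:
            n11 += 1  # together in both T and S
        elif same_s:
            n01 += 1  # together only in S
        elif same_t:
            n10 += 1  # together only in T
    return math.sqrt((n01 + n10) / (n11 + n10))
```

A solution identical to the true clustering has no disagreeing pairs and therefore scores 0, while merging all points into a single cluster or splitting them all into singletons gives a positive score.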

4. Discussion and Conclusion

Identifying the proper number of clusters and the proper partitioning of a data set are two crucial issues in unsupervised classification. In this paper a cluster validity index is developed for this purpose. The proposed index is able to detect the appropriate number of clusters and the appropriate partitioning from data sets whose clusters are well-separated, whatever their shape, size or convexity. The effectiveness of the proposed index in comparison with one existing cluster validity index, Dunn's index, is shown for four artificial and two real-life data sets. Future work includes developing a mathematical justification of the proposed index. Comparing the proposed validity index more extensively with other existing indices is another important direction for future research.

References

[1] B. S. Everitt, S. Landau, and M. Leese, Cluster Analysis. London: Arnold, 2001.

[2] U. Maulik and S. Bandyopadhyay, "Performance evaluation of some clustering algorithms and validity indices," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 12, pp. 1650-1654, 2002.

[3] J. C. Dunn, "A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters," Journal of Cybernetics, vol. 3, pp. 32-57, 1973.

[4] C. H. Chou, M. C. Su, and E. Lai, "A new cluster validity measure and its application to image compression," Pattern Analysis and Applications, vol. 7, pp. 205-220, 2004.

[5] S. Bandyopadhyay and S. Saha, "A point symmetry based clustering technique for automatic evolution of clusters," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 11, pp. 1-17, November 2008.

[6] S. Saha and S. Bandyopadhyay, "Application of a new symmetry based cluster validity index for satellite image segmentation," IEEE Geoscience and Remote Sensing Letters, vol. 5, no. 2, pp. 166-170, 2008.

[7] G. T. Toussaint, "The relative neighborhood graph of a finite planar set," Pattern Recognition, vol. 12, pp. 261-268, 1980.

[8] S. Bandyopadhyay, "An automatic shape independent clustering technique," Pattern Recognition, vol. 37, pp. 33-45, 2004.

[9] S. K. Pal, S. Bandyopadhyay, and C. A. Murthy, "Genetic algorithms for generation of class boundaries," IEEE Trans. System Man Cybernet, vol. 28, no. 6, pp. 816-828, 1998.

[10] J. Handl and J. Knowles, "An evolutionary approach to multiobjective clustering," IEEE Transactions on Evolutionary Computation, vol. 11, no. 1, pp. 56-76, 2007.

[11] A. Ben-Hur and I. Guyon, "Detecting Stable Clusters using Principal Component Analysis," in Methods in Molecular Biology. Humana press, 2003.
