A Validity Index Based on Connectivity
Sriparna Saha and Sanghamitra Bandyopadhyay
Machine Intelligence Unit, Indian Statistical Institute, Kolkata, India
Email:{sriparna r, sanghami}@isical.ac.in
Abstract
In this paper we develop a connectivity-based
cluster validity index. This validity index is able to detect
the number of clusters automatically from data sets having
well-separated clusters of any shape, size or convexity.
The proposed cluster validity index, connect-index, uses the
concept of the relative neighborhood graph to measure the
amount of connectedness of a particular cluster. The
proposed connect-index is inspired by the popular Dunn's
index for measuring cluster validity. The single linkage
clustering algorithm is used as the underlying partitioning
technique. The superiority of the proposed validity measure
over Dunn's index is shown on four artificial
and two real-life data sets.
1. Introduction
Clustering [1] is a core technique in data mining with
innumerable applications spanning many fields. Model
selection in clustering consists of two steps. In the first step,
the proper clustering method for a particular data set has
to be decided upon. Once this choice has been made, one
has to determine the number of clusters and also assess
the validity of the clusters formed [1]. For this purpose,
several cluster validity indices have been proposed in the
literature. A measure of cluster validity should
be able to impose an ordering on the
partitionings in terms of their goodness. In other words, if
$U_1, U_2, \ldots, U_m$ are m partitions of X, and the
corresponding values of a validity measure are $V_1, V_2, \ldots, V_m$, then
$V_{k_1} \geq V_{k_2} \geq \ldots \geq V_{k_m}$, $k_i \in \{1, 2, \ldots, m\}$, $i = 1, 2, \ldots, m$,
will indicate that $U_{k_1} \uparrow U_{k_2} \uparrow \ldots \uparrow U_{k_m}$. Here $U_i \uparrow U_j$ indicates
that partition $U_i$ is a better clustering than $U_j$. Note that
a validity measure may also define a decreasing sequence
instead of an increasing sequence of $V_{k_1}, \ldots, V_{k_m}$.

Several cluster validity indices have been proposed in
the literature. These include the Davies-Bouldin (DB) index [2],
Dunn's index [3], the Xie-Beni (XB) index [2], the I-index [2], the
CS-index [4] and the Sym-index [5], [6], to name just a few. Some
of these indices have been found to be able to detect the
correct partitioning for a given number of clusters, while
some can determine the appropriate number of clusters as
well. Maulik and Bandyopadhyay [2] evaluated the
performance of four validity indices, namely, the Davies-Bouldin
index [2], Dunn's index [3], the Calinski-Harabasz index [2],
and a new index I, in conjunction with three different
algorithms, viz., the well-known K-means [1], the single-linkage
algorithm [1] and an SA-based clustering method [2]. However,
most of the existing cluster validity indices are able to
detect the correct partitioning only when the clusters have
hyperspherical or symmetrical shapes. Dunn's index
[3] is able to detect clusters having different shapes, but
it sometimes prefers a partitioning in which some clusters
are merged together to maximize the minimum separation
between clusters. In this paper a cluster validity index is
developed which is able to detect the appropriate partitioning
from data sets having clusters of any shape, size or convexity
as long as they are well-separated.
The concept of the relative neighborhood graph (RNG) [7]
has been successfully applied to several pattern
recognition problems. One unsupervised clustering
technique based on the RNG is developed in Ref. [8].
In this article the concept of the relative neighborhood graph [7]
is used to develop a new cluster validity index. The proposed
index, connect-index, quantifies the degree of connectivity
of individual clusters together with the separation between them.
The index is inspired by the popular Dunn's index [3],
but it outperforms Dunn's index [3] in determining
the number of clusters from data sets having well-separated
clusters. The single-linkage clustering technique [1] is used as
the underlying partitioning method. The effectiveness of the
proposed index in comparison with the popular Dunn's index
[3] is shown on four artificially generated and two real-life
data sets.
2. Proposed Cluster Validity Index
In this section the concept of the relative neighborhood
graph (RNG) [7], [8] is first described. This is followed by
a detailed description of the proposed cluster validity index,
which uses the relative neighborhood
graph to measure the amount of connectedness
among the clusters.
2.1. Relative Neighborhood Graph
Suppose r is an integer and p, q are two points in
r-dimensional Euclidean space. Then the lune of p and q
(denoted lun(p, q) or lun(pq)) is the set of points

$$\mathrm{lun}(p, q) = \{ z \in \mathbb{R}^r : d(p, z) < d(p, q) \text{ and } d(q, z) < d(p, q) \},$$

where d denotes the Euclidean distance. Alternatively,
lun(p, q) denotes the interior of the region formed by the
intersection of two r-dimensional hyperspheres of radius
d(p, q), one of the hyperspheres being centered at p and
the other at q. This is illustrated in Figure 1, which shows
the lune of two points p, q in the plane. If V is a set of
n points in r-space, then define the relative neighborhood
graph of V (denoted RNG(V), or simply RNG when V
is understood) to be the undirected graph with vertices
V such that for each pair $p, q \in V$, pq is an edge of
RNG(V) iff $\mathrm{lun}(p, q) \cap V = \emptyset$. Here the weight of
an edge pq is kept equal to d(p, q), the Euclidean
distance between the points p and q.

Figure 1. The lune of two points p and q is the region
between the two arcs, not including the boundary.

Figure 2. (a) A set of points in the plane (b) RNG of the
points in (a)

2009 Seventh International Conference on Advances in Pattern Recognition
978-0-7695-3520-3/09 $25.00 2009 IEEE
DOI 10.1109/ICAPR.2009.53
Figure 2(a) shows a set V of points in the plane; Figure
2(b) shows the RNG of this set of points V. The RNG
problem is: Given a set V, find RNG(V).
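
The RNG definition of Section 2.1 can be illustrated with a small sketch. It is a brute-force construction that tests the lune of every pair against every other point, so it costs O(n^3) time and is meant only for illustration. The function name `relative_neighborhood_graph` and the toy point set are assumptions introduced here; they do not come from the paper.

```python
from itertools import combinations
from math import dist


def relative_neighborhood_graph(points):
    """Return the RNG edges as a dict {(i, j): weight} with i < j.

    Brute-force sketch: pq is an edge iff no third point z satisfies
    d(p, z) < d(p, q) and d(q, z) < d(p, q), i.e. iff lun(p, q) is empty.
    """
    n = len(points)
    edges = {}
    for i, j in combinations(range(n), 2):
        d_ij = dist(points[i], points[j])
        lune_is_empty = all(
            not (dist(points[i], points[k]) < d_ij
                 and dist(points[j], points[k]) < d_ij)
            for k in range(n) if k != i and k != j
        )
        if lune_is_empty:
            edges[(i, j)] = d_ij  # edge weight is the Euclidean distance
    return edges


# Hypothetical toy data: four collinear points. Only consecutive points
# are relative neighbors, because any other pair has a point in its lune.
example_points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (4.0, 0.0)]
example_edges = relative_neighborhood_graph(example_points)
print(sorted(example_edges))  # [(0, 1), (1, 2), (2, 3)]
```

The strict inequalities mirror the definition of the lune, which excludes its boundary.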
2.2. Measuring the Connectivity Among a Set of Points
In order to measure the connectivity among a set of points
we use the relative neighborhood graph discussed above.
The distance between a pair of points is
measured in the following way.
1) Construct the relative neighborhood graph of the
whole data set.
2) The distance between any two points, x and y, denoted
as $d_{short}(x, y)$, is measured along the relative
neighborhood graph. Find all possible paths between these
two points along the RNG. Suppose there are in total
p paths between x and y, and the number of edges
along the ith path is $n_i$, for $i = 1, \ldots, p$. If the edges
along the ith path are denoted as $ed^i_1, \ldots, ed^i_{n_i}$ and the
corresponding edge weights are $w(ed^i_1), \ldots, w(ed^i_{n_i})$,
then the shortest distance between x and y is defined
as follows:

$$d_{short}(x, y) = \min_{i=1}^{p} \max_{j=1}^{n_i} w(ed^i_j).$$

That is, the cost of a path is its largest edge weight, and
$d_{short}(x, y)$ is the smallest such cost over all paths.
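
Enumerating every path between two points is impractical for graphs of realistic size. The sketch below computes the same min-max quantity with a Dijkstra-style search in which the cost of a path is its largest edge weight. This is a substituted technique rather than the procedure stated in the paper, and the function name `minimax_distances` and the toy graph are assumptions made for illustration.

```python
import heapq


def minimax_distances(n, edges, source):
    """d_short from `source` to every vertex of a weighted undirected graph.

    `edges` maps (i, j) -> weight. A path costs as much as its heaviest
    edge, and d_short is the smallest such cost over all paths.
    Vertices that cannot be reached keep the value float("inf").
    """
    adjacency = [[] for _ in range(n)]
    for (i, j), w in edges.items():
        adjacency[i].append((j, w))
        adjacency[j].append((i, w))

    best = [float("inf")] * n
    best[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        cost, u = heapq.heappop(heap)
        if cost > best[u]:
            continue  # stale queue entry
        for v, w in adjacency[u]:
            candidate = max(cost, w)  # bottleneck of the extended path
            if candidate < best[v]:
                best[v] = candidate
                heapq.heappush(heap, (candidate, v))
    return best


# Hypothetical toy graph: a chain 0-1-2-3 plus a heavy shortcut 0-3.
toy_edges = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 2.0, (0, 3): 5.0}
toy_dshort = minimax_distances(4, toy_edges, source=0)
print(toy_dshort)  # [0.0, 1.0, 1.0, 2.0]
```

In the toy graph the direct edge 0-3 has weight 5, but the chain through vertices 1 and 2 never uses an edge heavier than 2, so the distance from 0 to 3 is 2.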
2.3. Proposed Cluster Validity Index
The proposed cluster validity index is defined as follows.
Suppose the clusters formed are denoted by $C_i$, for
$i = 1, \ldots, K$, where K is the number of clusters. Then the
diameter of a particular cluster $C_i$, denoted $\mathrm{diam}(C_i)$, for
$i = 1, \ldots, K$, is defined as:

$$\mathrm{diam}(C_i) = \max_{x, y \in C_i} \{ d_{short}(x, y) \}.$$

Here $d_{short}(x, y)$ is as defined in Section 2.2.
The distance between any two clusters $C_i$ and $C_j$, where
$i, j = 1, \ldots, K$, $i \neq j$, is defined as follows:

$$\mathrm{dist}(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} \{ d_{short}(x, y) \}.$$

Now the proposed connectivity based cluster validity index,
connect-index, is defined as follows:

$$\mathrm{connect} = \min_{1 \leq i \leq K} \left\{ \min_{1 \leq j \leq K,\, j \neq i} \left\{ \frac{\mathrm{dist}(C_i, C_j)}{\max_{1 \leq k \leq K} \{ \mathrm{diam}(C_k) \}} \right\} \right\}.$$

Intuitively, larger values of connect correspond to good
partitionings. Thus the appropriate number of clusters is
determined by maximizing connect over different values
of K. If $\mathrm{connect}_i$ denotes the connect-index value for the
number of clusters K = i, then the appropriate number of
clusters, $K^*$, is determined as:

$$K^* = \arg\max_{i = 1, \ldots, K_{max}} \mathrm{connect}_i.$$

Here $K_{max}$ is the maximum possible number of clusters. In
general, $K_{max}$ is kept equal to
$\sqrt{n}$, where n is the number
of points in the data set.

connect-index has two components. Its denominator
measures the maximum shortest distance between any two points
in a particular cluster. If the cluster is completely connected
then the shortest distance between any two points would be
very small and thus the diameter of that particular cluster
would be small too. As connect-index tries to minimize the
maximum diameter amongst all clusters, this in turn tries
to minimize the diameter of every cluster. Thus when all
clusters are well-connected, their diameters are small and the
denominator of the connect-index gets a smaller value. The
numerator of the connect-index is the minimum separation
between any two clusters, which is measured as the minimum
shortest distance between any two points belonging to two
different clusters along the RNG. In order to increase the
value of connect-index, the numerator of this index has
to be maximized, thus the minimum separation between
any two clusters should be maximum. This only happens
if the clusters are well-separated. Thus connect-index gets
its maximum value when all the clusters are connected and
well-separated as well.

Table 1. Experimental Results on Several Data sets.
Here AC denotes the actual number of clusters and OC
denotes the obtained number of clusters.

Name        # points   dimension   AC   OC (connect)   OC (Dunn)
Pat1        557        2           3    3              2
Pat2        417        2           2    2              2
Spiral      1000       2           2    2              2
Mixed 5 2   850        2           5    5              6
Iris        150        4           3    2              2
Cancer      683        9           2    2              6
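
The definitions of Section 2.3 can be sketched in a few lines. The denominator of connect-index does not depend on i or j, so the nested minimum reduces to the smallest inter-cluster $d_{short}$ divided by the largest intra-cluster $d_{short}$. The function name `connect_index`, the matrix input format, the handling of all-singleton partitions and the toy values are all assumptions made for illustration.

```python
def connect_index(dshort, labels):
    """connect = minimum inter-cluster distance / maximum cluster diameter.

    dshort: square matrix (list of lists) of pairwise d_short values.
    labels: cluster label of each point, in the same order as the matrix.
    """
    n = len(labels)
    if len(set(labels)) < 2:
        raise ValueError("connect-index needs at least two clusters")

    max_diameter = 0.0
    min_separation = float("inf")
    for a in range(n):
        for b in range(a + 1, n):
            if labels[a] == labels[b]:
                # Pair inside one cluster: contributes to diam(C_k).
                max_diameter = max(max_diameter, dshort[a][b])
            else:
                # Pair across clusters: contributes to dist(C_i, C_j).
                min_separation = min(min_separation, dshort[a][b])
    if max_diameter == 0.0:
        # Assumption: every cluster is a singleton, so the ratio is unbounded.
        return float("inf")
    return min_separation / max_diameter


# Hypothetical d_short matrix for four points on a chain with edge
# weights 1, 1 and 2 (so the last point is the most isolated one).
toy_dshort = [
    [0.0, 1.0, 1.0, 2.0],
    [1.0, 0.0, 1.0, 2.0],
    [1.0, 1.0, 0.0, 2.0],
    [2.0, 2.0, 2.0, 0.0],
]
print(connect_index(toy_dshort, [0, 0, 0, 1]))  # 2.0
print(connect_index(toy_dshort, [0, 0, 1, 1]))  # 0.5
```

The partition that cuts the heaviest edge scores 2.0, while the one that cuts a light edge and keeps the heavy edge inside a cluster scores 0.5. This matches the intuition that larger values indicate connected, well-separated clusters.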
Figure 3. (a) Pat1 (b) Pat2
Figure 4. (a) Spiral (b) Mixed 5 2
3. Experimental Results
Here the popular single linkage clustering technique
[1] is used to partition the data sets used for the
experiments. Four artificial and two real-life data sets are used
to show the efficacy of the proposed cluster validity
index, connect-index. The data sets are described in Table 1.
The Pat1 and Pat2 data sets are used in Ref. [9], the Spiral
data set is used in Ref. [10] and the Mixed 5 2 data set is used in Ref. [5].
Figures 3(a), 3(b), 4(a) and 4(b) show the four artificial data
sets, respectively. The two real-life data sets are obtained
from (http://www.ics.uci.edu/mlearn/MLRepository.html).
The Iris data set represents different categories of irises
characterized by four feature values. It has three classes: Setosa,
Versicolor and Virginica. It is known that two classes
(Versicolor and Virginica) have a large amount of overlap while
the class Setosa is linearly separable from the other two.
The Wisconsin Breast Cancer data set has two categories:
malignant and benign. The two classes are known to be
linearly separable.

Figure 5. Optimal Partitioning on Pat1 indicated by (a)
proposed connect-index for K = 3 (b) Dunn's index for
K = 2

Figure 6. Optimal Partitioning indicated by both
connect-index and Dunn's index on (a) Pat2 for K = 2
(b) Spiral data set for K = 2
Single linkage clustering is used to partition all the
above-mentioned data sets for $K = 2, \ldots, \sqrt{n}$, and the
corresponding connect-index values are computed for all
the partitions. Then the partition which corresponds to the
maximum value of connect-index is taken as the optimal
partitioning and the corresponding number of clusters is
regarded as the optimal number of clusters indicated by
connect-index. For all the data sets used in the experiments,
the optimal numbers of clusters indicated by connect-index
are reported in Table 1. For the purpose of comparison,
the numbers of clusters identified by the popular Dunn's
index [3] for all data sets are
also reported in Table 1. This table reveals that
the proposed connect-index identifies
the appropriate number of clusters for five of the six data
sets (all except Iris), while Dunn's index
detects the appropriate number of clusters for only two of
them (Pat2 and Spiral). For the Iris data set, both validity indices
provide K = 2, which is also often obtained by many
other methods for Iris. Figures 5(a), 6(a), 6(b) and 7(a) show,
respectively, the optimal partitionings indicated by
connect-index for the four artificial data sets.
Similarly, Figures 5(b), 6(a), 6(b) and 7(b) show, respectively,
the optimal partitionings indicated by Dunn's index
for these four artificial data sets.

Figure 7. Optimal Partitioning for Mixed 5 2 indicated
by (a) proposed connect-index for K = 5 (b) Dunn's
index for K = 6
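
The selection procedure used in this section (partition with single linkage for each K up to $\sqrt{n}$, score each partition, keep the maximum) can be sketched as below. Single linkage is implemented here as Kruskal-style merging of the closest pairs, and `toy_index` is only a stand-in Euclidean separation-over-diameter score, not connect-index. All names and the toy data are assumptions made for illustration.

```python
from itertools import combinations
from math import dist, isqrt


def single_linkage_labels(points, k):
    """Single-linkage partition into k clusters (closest-pair merging)."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    pairs = sorted(combinations(range(n), 2),
                   key=lambda ij: dist(points[ij[0]], points[ij[1]]))
    components = n
    for i, j in pairs:
        if components == k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj  # merge the two closest components
            components -= 1
    roots = [find(i) for i in range(n)]
    relabel = {r: c for c, r in enumerate(dict.fromkeys(roots))}
    return [relabel[r] for r in roots]


def select_number_of_clusters(points, validity_index, k_max=None):
    """Return (best_k, best_labels) maximizing validity_index(points, labels)."""
    if k_max is None:
        k_max = max(2, isqrt(len(points)))  # K_max = floor(sqrt(n))
    scored = []
    for k in range(2, k_max + 1):
        labels = single_linkage_labels(points, k)
        scored.append((validity_index(points, labels), k, labels))
    _, best_k, best_labels = max(scored, key=lambda t: t[0])
    return best_k, best_labels


def toy_index(points, labels):
    """Stand-in score (NOT connect-index): Euclidean separation / diameter."""
    separation, diameter = float("inf"), 0.0
    for a, b in combinations(range(len(points)), 2):
        d = dist(points[a], points[b])
        if labels[a] == labels[b]:
            diameter = max(diameter, d)
        else:
            separation = min(separation, d)
    return separation / diameter if diameter > 0 else float("inf")


# Hypothetical toy data: three tight groups of three points on a line.
toy_points = [(x, 0) for x in (0, 1, 2, 50, 51, 52, 100, 101, 102)]
best_k, best_labels = select_number_of_clusters(toy_points, toy_index)
print(best_k, best_labels)  # 3 [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

Passing the validity index as a function keeps the loop independent of the particular index, so the same sweep can be reused with any score for which larger values are better.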
For the two real-life data sets, Iris and Cancer, no
visualization is possible as these are high-dimensional data
sets. The Minkowski Score (MS) [11] is calculated after
application of the single linkage clustering technique to these
two real-life data sets. This is a measure of the quality of a
solution given the true clustering. Let T be the true
solution and S the solution we wish to measure. Denote by $n_{11}$
the number of pairs of elements that are in the same cluster
in both S and T. Denote by $n_{01}$ the number of pairs that
are in the same cluster only in S, and by $n_{10}$ the number of
pairs that are in the same cluster only in T. The Minkowski Score (MS)
is then defined as:

$$MS(T, S) = \sqrt{\frac{n_{01} + n_{10}}{n_{11} + n_{10}}}.$$

For MS, the
optimum score is 0, with lower scores being better. For the Iris
data set, the MS value corresponding to the partitioning obtained
by single linkage clustering for K = 2 is 0.88. For the
Cancer data set, the single linkage clustering technique obtains
an MS of 0.43 for K = 2 (the number of clusters indicated by the
newly proposed connect-index), while that for K = 6 (the number
of clusters indicated by Dunn's index) is 1.45.
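
The Minkowski Score can be computed directly from two label vectors by counting pairs. The sketch below follows the pair counts defined in the text and assumes the square-root form of the score. The function name `minkowski_score` and the toy labels are assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def minkowski_score(true_labels, found_labels):
    """MS(T, S) = sqrt((n01 + n10) / (n11 + n10)); 0 is the optimum.

    Raises ZeroDivisionError if no pair shares a cluster in T.
    """
    n11 = n01 = n10 = 0
    for a, b in combinations(range(len(true_labels)), 2):
        same_in_t = true_labels[a] == true_labels[b]
        same_in_s = found_labels[a] == found_labels[b]
        if same_in_t and same_in_s:
            n11 += 1  # together in both T and S
        elif same_in_s:
            n01 += 1  # together only in S
        elif same_in_t:
            n10 += 1  # together only in T
    return sqrt((n01 + n10) / (n11 + n10))


# Hypothetical toy labels: T has two clusters of two points each.
truth = [0, 0, 1, 1]
perfect = minkowski_score(truth, [5, 5, 9, 9])   # same partition, other names
imperfect = minkowski_score(truth, [0, 0, 0, 1])
print(perfect)    # 0.0
print(imperfect)  # sqrt(3 / 2), about 1.2247
```

The score depends only on which pairs are grouped together, so relabeling the clusters of S does not change it. It can exceed 1 when S groups many pairs that T separates.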
4. Discussion and Conclusion
Identifying the proper number of clusters and the proper
partitioning of a data set are two crucial issues in
unsupervised classification. In this paper a cluster validity
index is developed for this purpose. The proposed index
is able to detect the appropriate number of clusters and
the appropriate partitioning from data sets as long as the
clusters are well separated, whatever their shape, size
or convexity. The effectiveness of the proposed index in
comparison with one existing cluster validity index, Dunn's
index, is shown on four artificial and two real-life data sets.
Future work includes developing a mathematical proof
for the proposed index. Comparing the proposed validity
index with other existing indices more extensively is another
important direction for future research.
References
[1] B. S. Everitt, S. Landau, and M. Leese, Cluster Analysis. London: Arnold, 2001.
[2] U. Maulik and S. Bandyopadhyay, "Performance evaluation of some clustering algorithms and validity indices," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 12, pp. 1650-1654, 2002.
[3] J. C. Dunn, "A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters," Journal of Cybernetics, vol. 3, pp. 32-57, 1973.
[4] C. H. Chou, M. C. Su, and E. Lai, "A new cluster validity measure and its application to image compression," Pattern Analysis and Applications, vol. 7, pp. 205-220, 2004.
[5] S. Bandyopadhyay and S. Saha, "A point symmetry based clustering technique for automatic evolution of clusters," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 11, pp. 1-17, November, 2008.
[6] S. Saha and S. Bandyopadhyay, "Application of a new symmetry based cluster validity index for satellite image segmentation," IEEE Geoscience and Remote Sensing Letters, vol. 5, no. 2, pp. 166-170, 2008.
[7] G. T. Toussaint, "The relative neighborhood graph of a finite planar set," Pattern Recognition, vol. 12, pp. 261-268, 1980.
[8] S. Bandyopadhyay, "An automatic shape independent clustering technique," Pattern Recognition, vol. 37, pp. 33-45, 2004.
[9] S. K. Pal, S. Bandyopadhyay, and C. A. Murthy, "Genetic algorithms for generation of class boundaries," IEEE Trans. System Man Cybernet, vol. 28, no. 6, pp. 816-828, 1998.
[10] J. Handl and J. Knowles, "An evolutionary approach to multiobjective clustering," IEEE Transactions on Evolutionary Computation, vol. 11, no. 1, pp. 56-76, 2007.
[11] A. Ben-Hur and I. Guyon, "Detecting Stable Clusters using Principal Component Analysis," in Methods in Molecular Biology. Humana Press, 2003.