
Knowl Inf Syst (2010) 23:1–27
DOI 10.1007/s10115-009-0204-4

REGULAR PAPER

A new multiobjective clustering technique based on the concepts of stability and symmetry

Sriparna Saha · Sanghamitra Bandyopadhyay

Received: 13 January 2009 / Revised: 24 February 2009 / Accepted: 1 March 2009 / Published online: 4 April 2009
© Springer-Verlag London Limited 2009

Abstract Most clustering algorithms operate by optimizing (either implicitly or explicitly) a single measure of cluster solution quality. Such methods may perform well on some data sets but lack robustness with respect to variations in cluster shape, proximity, evenness and so forth. In this paper, we propose a multiobjective clustering technique which optimizes simultaneously two objectives, one reflecting the total cluster symmetry and the other reflecting the stability of the obtained partitions over different bootstrap samples of the data set. The proposed algorithm uses a recently developed simulated annealing-based multiobjective optimization technique, named AMOSA, as the underlying optimization strategy. Here, points are assigned to different clusters based on a newly defined point symmetry-based distance rather than the Euclidean distance. Results on several artificial and real-life data sets, in comparison with another multiobjective clustering technique, MOCK, three single objective genetic algorithm-based automatic clustering techniques, VGAPS clustering, GCUK clustering and HNGA clustering, and several hybrid methods of determining the appropriate number of clusters from data sets, show that the proposed technique is well suited to detect automatically the appropriate number of clusters as well as the appropriate partitioning from data sets having point symmetric clusters. The performance of AMOSA as the underlying optimization technique in the proposed clustering algorithm is also compared with PESA-II, another evolutionary multiobjective optimization technique.

Keywords Clustering · Multiobjective optimization (MOO) · Symmetry · Stability

1 Introduction

Clustering [2,17,31,33,35,46] is commonly defined as the task of finding a natural partitioning within a data set such that data items within the same group are more similar than those within different groups.

S. Saha (B) · S. Bandyopadhyay
Machine Intelligence Unit, Indian Statistical Institute, 203 B.T. Road, Kolkata 700108, India
e-mail: [email protected]

S. Bandyopadhyay
e-mail: [email protected]


This is the most general but rather 'loose' concept; sometimes it remains quite difficult to realize in practice. Evidently, one reason for this difficulty is that for many data sets, no unambiguous partitioning of the data exists, or can be established, even by humans. But even in some cases where there is an unambiguous partitioning of the data set, some clustering algorithms fail drastically. This is because most existing clustering algorithms are based on only one internal evaluation function, an objective function which measures intrinsic properties of a partitioning, such as the spatial separation between the clusters or the compactness of the clusters. Hence the internal evaluation function is assumed to reflect the quality of the partitioning reliably, an assumption that may be violated for certain data sets. Thus the use of multiobjective clustering algorithms seems justified [24].

In order to mathematically identify clusters in a data set, some similarity or dissimilarity measure has to be defined first. This measure should establish a rule for assigning patterns to a particular cluster center. The measure of similarity is usually data dependent. Since symmetry is a basic feature of most shapes and objects, it can be considered an important feature in their recognition and reconstruction [3]. Symmetry is a natural phenomenon. Thus it can be assumed that clusters too possess some property of symmetry. Based on this observation, a new point symmetry-based distance, dps (PS-distance), was developed in [7]. Further, a Kd-tree [1] was used to reduce the computational complexity of calculating the PS-distance. This distance was then used to develop a genetic algorithm-based clustering technique, GAPS [7].

Determining the appropriate number of clusters in a given data set is an important consideration in clustering. For this purpose, and also to validate the obtained partitioning, several cluster validity indices have been proposed in the literature. The measure of validity of the clusters should be such that it is able to impose an ordering of the clusterings in terms of their goodness. There exists a large number of cluster validation methods which employ the notion of the stability of clustering solutions. The main idea behind such approaches to cluster validation is the requirement that solutions be similar for two different data sets that have been generated by the same (probabilistic) source. Breckenridge [12] proposed a measure of stability by estimating the agreement between clustering solutions generated by a clustering algorithm and by a classifier trained using a second (clustered) data set. But that work did not lead to a specific implementation procedure, in particular not for model order selection. Later on, many methods for cluster validation (such as [18,29,42]) came out based on the idea of Breckenridge. However, existing stability-based cluster validity indices are unable to determine the appropriate model order for data sets for which the clustering algorithm provides stable solutions for several values of the number of clusters, K. Moreover, most of the validity measures usually assume a certain geometrical structure in the shapes of all the clusters. But if different clusters possess different structural properties, these indices are often found to fail. Here, a newly developed symmetry-based cluster validity index, named Sym-index, that uses the new distance dps, is combined with the stability of the clustering solutions produced by an algorithm for different bootstrap samples of a data set to develop a new measure, SSym-index, for cluster validity.

Since the global optimum of the validity functions would correspond to the most "valid" solutions with respect to these functions, clustering algorithms based on Genetic Algorithms (GAs) have been developed to optimize the validity functions so as to determine the appropriate number of clusters and the appropriate partitioning of a data set simultaneously [4,5,20]. Rather than evaluating the static clusters generated by a specific clustering algorithm, the validity functions in these approaches are used as clustering objective functions for computing the fitness, which guides the evolution to search for the "valid" solution. However, the Simple GA (SGA) [25] or its variants are used as the genetic clustering techniques in [4,5,20]. In [39], a function called the Weighted Sum Validity Function (WSVF), which is a weighted sum of several normalized validity functions, is used for optimization along with a Hybrid Niching Genetic Algorithm (HNGA) to evolve automatically the proper number of clusters from a given data set. Within this HNGA, a niching method is developed to prevent premature convergence by preserving, during the search, both the diversity of the population with respect to the number of clusters encoded in the individuals and the diversity of the subpopulation with the same number of clusters. In the above mentioned genetic clustering techniques for the automatic evolution of clusters, the assignment of points to different clusters is done along the lines of the K-means clustering technique. Consequently, all these approaches are only able to find compact, hyperspherical, equisized and convex clusters like those detected by the K-means algorithm [26]. In order to automatically detect all types of clusters having different geometric shapes present in a data set, a point symmetry-based genetic clustering technique, VGAPS clustering, is proposed in [8]. Here the newly developed point symmetry-based distance [7] is utilized for the assignment of points to different clusters.

VGAPS optimizes a single cluster validity measure, the point symmetry-based cluster validity index Sym-index, as the fitness function to reflect the goodness of an encoded partitioning. However, a single cluster validity measure like Sym-index is seldom equally applicable to different kinds of data sets with different characteristics. Hence it is necessary to optimize simultaneously several validity measures that can capture the different data characteristics. In order to achieve this, in this article the problem of clustering a data set is posed as one of multiobjective optimization (MOO) [15], where search is performed over a number of, often conflicting, objective functions. A newly developed simulated annealing-based multiobjective optimization technique, AMOSA [9], is used to determine the appropriate cluster centers and the corresponding partitioning. In [24], a multiobjective clustering technique, named MOCK, is developed where two evaluation criteria, one based on the total compactness of the partitioning and the other based on the connectedness of the clusters, are optimized simultaneously. This algorithm is thus able to detect clusters having either hyperspherical shapes or well-separated structures. But it fails to detect overlapping clusters having shapes other than hyperspheres. In the present paper, an attempt has been made to integrate, in a novel approach, the property of point symmetry with stability so as to develop a clustering technique that can detect both symmetric and stable partitionings of the data.

In this paper, a multiobjective clustering technique is developed which uses a recently developed simulated annealing-based MOO technique, AMOSA [9], as the underlying optimization strategy. Here, each string comprises the centers of a variable number of clusters; the number of clusters encoded in a string varies over a range. The algorithm is thus able to determine the appropriate number of clusters present in a data set. The assignment of points to different clusters is done based on the newly developed point symmetry-based distance rather than the Euclidean distance. The concept of the stability of the clustering solutions produced by this MOO algorithm for different bootstrap samples (produced by sampling with replacement) of a data set is utilized for determining the appropriate model order of a data set. The amount of stability is measured by calculating the mean and the difference of the newly developed point symmetry-based cluster validity index, Sym-index, over different bootstrap samples of a data set. The proper number of clusters of a data set is determined by simultaneously optimizing this average Sym-index and the difference of the Sym-index values. The effectiveness of the proposed AMOSA with stability induced symmetry-based clustering technique (SSym-AMOSA) is shown for eight artificial and six real-life data sets of varying complexities. Comparisons are also made with another newly developed multiobjective clustering technique, MOCK [24], and three state-of-the-art single objective automatic clustering techniques: VGAPS clustering [8] optimizing the newly developed symmetry-based cluster validity index Sym-index, GCUK clustering [5] optimizing the I-index [32] and the HNGA clustering technique [39] optimizing a weighted sum validity function, as well as several hybrid methods of determining the appropriate number of clusters from a data set. In a part of the experiment, results are also shown using another evolutionary multiobjective optimization technique, PESA-II [13], in place of AMOSA as the optimization strategy in the proposed clustering algorithm.

2 The SA-based MOO algorithm: AMOSA

Archived multiobjective simulated annealing (AMOSA) [9] is a generalized version of the simulated annealing (SA) algorithm for multiobjective optimization (MOO). MOO is applied when dealing with real-world problems where there are several objectives that should be optimized simultaneously. In general, a MOO algorithm admits a set of solutions that are not dominated by any solution it has encountered, i.e., non-dominated solutions [15]. In recent years, many multiobjective evolutionary algorithms (MOEAs) have been suggested to solve MOO problems [45].

Simulated annealing (SA) is a search technique for solving difficult optimization problems, based on the principles of statistical mechanics [44]. SA has become very popular because not only can it replace exhaustive search to save time and resources, but it also converges to the global optimum if annealed sufficiently slowly [23].

Although the single objective version of SA is quite popular, its utility in the multiobjective case was limited because of its search-from-a-point nature. Recently, Bandyopadhyay et al. developed an efficient multiobjective version of SA, called AMOSA [9], that overcomes this limitation. AMOSA is utilized in this work for partitioning a data set.

The AMOSA algorithm incorporates the concept of an archive where the non-dominated solutions seen so far are stored. Two limits are kept on the size of the archive: a hard or strict limit denoted by HL, and a soft limit denoted by SL. The algorithm begins with the initialization of a number (γ × SL, γ > 1) of solutions, each of which represents a state in the search space. The multiple objective functions are computed. Each solution is refined by using simple hill-climbing and the domination relation for a number of iterations. Thereafter, the non-dominated solutions are stored in the archive until the size of the archive increases to SL. If the size of the archive exceeds HL, a single-linkage clustering scheme is used to reduce the size to HL. Then, one of the points is randomly selected from the archive. This is taken as the current-pt, or the initial solution, at temperature T = Tmax. The current-pt is perturbed to generate a new solution named new-pt, and its objective functions are computed. The domination status of the new-pt is checked with respect to the current-pt and the solutions in the archive. A new quantity, called the amount of domination Δdom(a, b) between two solutions a and b, is defined as follows:

$$\Delta dom(a, b) = \prod_{i=1,\, f_i(a) \neq f_i(b)}^{M} \frac{\left| f_i(a) - f_i(b) \right|}{R_i},$$

where fi(a) and fi(b) are the ith objective values of the two solutions and Ri is the corresponding range of the objective function. Based on the domination status, different cases may arise, viz., accept (i) the new-pt, (ii) the current-pt, or (iii) a solution from the archive. Again, in case of overflow of the archive, clustering is used to reduce its size to HL. The process is repeated iter times for each temperature, which is annealed with a cooling rate of α (< 1) till the minimum temperature Tmin is attained. The process thereafter stops, and the archive contains the final non-dominated solutions.
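To make the acceptance computation concrete, the following is a minimal Python sketch of the amount of domination; the function name delta_dom and the array-based interface are ours, not part of the published AMOSA implementation:

```python
import numpy as np

def delta_dom(f_a, f_b, ranges):
    """Amount of domination between solutions a and b: the product, over
    the objectives on which they differ, of |f_i(a) - f_i(b)| / R_i."""
    f_a, f_b, ranges = np.asarray(f_a), np.asarray(f_b), np.asarray(ranges)
    mask = f_a != f_b                 # only differing objectives contribute
    return float(np.prod(np.abs(f_a[mask] - f_b[mask]) / ranges[mask]))
```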

It has been demonstrated in Ref. [9] that the performance of AMOSA is better than that of NSGA-II [16] and some other well-known MOO algorithms.

3 The point symmetry distance

In this section, the definition of the PS distance developed in [7] is first described. Then the use of a Kd-tree for point symmetry distance computation is described.

3.1 Definition

A new definition of the point symmetry-based distance (PS distance), dps(x, c), associated with a point x with respect to a center c was developed in Ref. [7]. It was also shown in [7] that dps(x, c) is able to overcome some serious limitations of an earlier PS distance [41]. Let x be a point. The symmetrical (reflected) point of x with respect to a particular centre c is 2 × c − x. Let us denote this by x*. Let the knear unique nearest neighbors of x* be at Euclidean distances d_i, i = 1, 2, ..., knear. Then

$$d_{ps}(x, c) = d_{sym}(x, c) \times d_e(x, c) \qquad (1)$$

$$d_{ps}(x, c) = \frac{\sum_{i=1}^{knear} d_i}{knear} \times d_e(x, c), \qquad (2)$$

where de(x, c) is the Euclidean distance between the point x and the center c, and dsym(x, c) is a symmetry measure of x with respect to c.

Note that knear in Eq. 2 cannot be chosen equal to 1, since if x* exists in the data set then dps(x, c) = 0, and hence there would be no impact of the Euclidean distance. On the other hand, large values of knear may not be suitable because they may underestimate the amount of symmetry of a point with respect to a particular cluster center. Here knear is chosen equal to 2. It may be noted that the proper value of knear largely depends on the distribution of the data set. A fixed value of knear may have many drawbacks. For instance, for very large clusters (with too many points), two neighbors may not be enough, as it is very likely that a few neighbors would have a distance close to zero. On the other hand, clusters with too few points are more likely to be scattered, and the distance of the two neighbors may be too large. Thus a proper choice of knear is an important issue that needs to be addressed in the future.

Note that dps(x, c), which is a non-metric, is a way of measuring the amount of point symmetry between a point and a cluster center, rather than a distance in the sense of a Minkowski metric.
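For concreteness, a brute-force sketch of Eqs. 1–2 in Python follows (numpy assumed; the helper name d_ps is illustrative). It scans the entire data set for the knear nearest neighbors of the reflected point, which the Kd-tree of the next subsection then accelerates:

```python
import numpy as np

def d_ps(x, c, data, knear=2):
    """dps(x, c) of Eq. 2: mean distance of the knear nearest neighbors
    of the reflected point x* = 2c - x, times the Euclidean distance."""
    x_star = 2.0 * c - x
    d = np.sort(np.linalg.norm(data - x_star, axis=1))[:knear]
    return d.mean() * np.linalg.norm(x - c)
```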

3.2 Kd-tree-based nearest neighbor computation

A K-dimensional tree, or Kd-tree, is a space-partitioning data structure for organizing points in a K-dimensional space. A Kd-tree uses only splitting planes that are perpendicular to one of the coordinate axes. In the nearest neighbor problem, a set of data points in d-dimensional space is given. These points are preprocessed into a data structure so that, given any query point q, the nearest (or, more generally, the k nearest) points to q can be reported efficiently. Approximate Nearest Neighbor (ANN) is a library written in C++ [34] which supports data structures and algorithms for both exact and approximate nearest neighbor searching in arbitrarily high dimensions. In this article, ANN is used to find the d_i values in Eq. 2 efficiently. The ANN library implements a number of different data structures, based on Kd-trees and box-decomposition trees, and employs a couple of different search strategies; the Kd-tree data structure has been used in this article. ANN allows the user to specify a maximum approximation error bound, thus allowing the user to control the tradeoff between accuracy and running time.

For computing dps(x, c) in Eq. 2, the d_i values need to be computed. This is a computation intensive task that can be speeded up by using the Kd-tree-based nearest neighbor search. The function performing the k-nearest neighbor search in ANN is given a query point q (here x* = 2 × c − x), a nonnegative integer k (here set equal to knear), an array of point indices, nnidx, and an array of distances, dists. Both arrays are assumed to contain at least k elements. This procedure computes the k nearest neighbors of q in the point set and stores the indices of the nearest neighbors in the array nnidx. Optionally, a real value ε ≥ 0 may be supplied. If so, then the ith reported neighbor is a (1 + ε) approximation to the true ith nearest neighbor; that is, the true distance to this point may exceed the true distance to the real ith nearest neighbor of q by a factor of (1 + ε). If ε is omitted, the nearest neighbors are computed exactly. For the purpose of this article, the exact nearest neighbors are computed, so ε is set equal to 0. After getting the knear nearest neighbors of x*, the symmetrical distance of x with respect to a centre c is calculated using Eq. 2. The Kd-tree structure can be constructed in O(n log n) time and takes O(n) space.
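Since ANN is a C++ library, a minimal sketch using scipy's cKDTree as a stand-in for the same query is given below; the eps=0 argument requests exact neighbors, mirroring ANN's ε, and the toy data and names are illustrative:

```python
import numpy as np
from scipy.spatial import cKDTree

def d_ps_kdtree(x, c, tree, knear=2):
    """dps(x, c) with the knear nearest neighbors of the reflected
    point found by a Kd-tree query instead of a linear scan."""
    x_star = 2.0 * c - x
    di, _ = tree.query(x_star, k=knear, eps=0)  # distances d_1, ..., d_knear
    return di.mean() * np.linalg.norm(x - c)

data = np.random.rand(500, 2)   # illustrative data set
tree = cKDTree(data)            # built once: O(n log n) time, O(n) space
print(d_ps_kdtree(data[0], data.mean(axis=0), tree))
```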

4 Proposed method for multiobjective clustering

In this paper a new multiobjective clustering technique, SSym-AMOSA, is proposed which uses the newly developed simulated annealing-based MOO technique AMOSA [9] as the underlying optimization strategy.

4.1 String representation and archive initialization

In AMOSA-based clustering, the strings are made up of real numbers which represent the coordinates of the centers of the partitions. AMOSA attempts to evolve an appropriate set of cluster centers that represent the associated partitioning of the data. If a particular string encodes the centers of K clusters in d-dimensional space, then its length l is taken to be d × K. For example, in four-dimensional space, the chromosome ⟨2.3 1.4 7.6 12.9 2.1 3.4 0.01 12.2 0.06 2.3 6.7 15.3 3.2 11.72 9.5 3.4⟩ encodes four cluster centers, (2.3, 1.4, 7.6, 12.9), (2.1, 3.4, 0.01, 12.2), (0.06, 2.3, 6.7, 15.3) and (3.2, 11.72, 9.5, 3.4).

Each center is considered indivisible. Each string i in the archive initially encodes the centers of a number, Ki, of clusters, such that Ki = (rand() mod (Kmax − 1)) + 2. Here, rand() is a function returning an integer, and Kmax is a soft estimate of the upper bound of the number of clusters. The number of clusters therefore ranges from two to Kmax. The Ki centers encoded in a string are randomly selected distinct points from the data set.
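A minimal initialization sketch under these rules is shown below (numpy assumed; init_string and the γ = 2 choice are ours, with SL = 200 taken from the experimental settings of Sect. 6):

```python
import numpy as np

def init_string(data, k_max, rng):
    """One archive member: K_i = (rand() mod (K_max - 1)) + 2 distinct
    data points used as centers, concatenated into a length d*K_i string."""
    k_i = int(rng.integers(0, k_max - 1)) + 2             # K_i in [2, K_max]
    idx = rng.choice(len(data), size=k_i, replace=False)  # distinct points
    return data[idx].ravel()

rng = np.random.default_rng(0)
data = np.random.rand(300, 2)                             # illustrative data
k_max = int(np.sqrt(len(data)))
archive = [init_string(data, k_max, rng) for _ in range(2 * 200)]  # gamma*SL
```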

4.2 Objective function computation

Fitness computation is composed of two steps. First, points are assigned to different clusters using the point symmetry-based distance dps [7]. Next, the mean and the difference of the cluster validity index Sym-index over different bootstrap samples of the data set are computed and used as the objectives of the string.


4.2.1 Assignment of points

Here each point xi, 1 ≤ i ≤ n, is assigned to cluster k iff dps(xi, ck) ≤ dps(xi, cj), j = 1, ..., K, j ≠ k, and dsym(xi, ck) ≤ θ. Here θ is a threshold described below. For dsym(xi, ck) > θ, point xi is assigned to cluster m if and only if de(xi, cm) ≤ de(xi, cj), j = 1, 2, ..., K, j ≠ m. The value of the threshold θ is kept equal to d_NN^max as in [7], making its computation automatic and without user intervention. Here

$$d_{NN}^{max} = \max_{i=1,\ldots,N} d_{NN}(x_i),$$

where dNN(xi) is the nearest neighbor distance of xi. After the assignments are done, the cluster centers encoded in the chromosome are replaced by the mean points of the respective clusters. This is referred to as the K-means like update center operation.
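A brute-force sketch of this assignment step is given below (Python/numpy; the helper names are ours, and empty clusters after the update are left unhandled):

```python
import numpy as np

def theta_threshold(data):
    """theta = d_NN^max: the largest nearest-neighbor distance in the data."""
    d = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return d.min(axis=1).max()

def assign_points(data, centres, theta, knear=2):
    """Assign by d_ps when the symmetry measure clears theta, else by d_e."""
    labels = np.empty(len(data), dtype=int)
    for i, x in enumerate(data):
        refl = 2.0 * centres - x                        # one reflection per centre
        d_sym = np.array([np.sort(np.linalg.norm(data - s, axis=1))[:knear].mean()
                          for s in refl])
        d_e = np.linalg.norm(x - centres, axis=1)
        k = int(np.argmin(d_sym * d_e))                 # minimise d_ps = d_sym * d_e
        labels[i] = k if d_sym[k] <= theta else int(np.argmin(d_e))
    return labels

def update_centres(data, labels, K):
    """K-means like update: each centre becomes the mean of its cluster."""
    return np.array([data[labels == j].mean(axis=0) for j in range(K)])
```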

4.2.2 Used cluster validity index, Sym-index

The newly developed PS distance is used to define a cluster validity function, Sym-index [8,38], which measures the overall average symmetry with respect to the cluster centers.

Consider a partition of the data set X = {x_j : j = 1, 2, ..., n} into K clusters, where the center of cluster i, c_i, is computed as

$$c_i = \frac{\sum_{j=1}^{n_i} x^{i}_{j}}{n_i},$$

where n_i (i = 1, 2, ..., K) is the number of points in cluster i and x^i_j denotes the jth point of the ith cluster. The new cluster validity function Sym is defined as:

$$Sym(K) = \frac{1}{K} \times \frac{1}{E_K} \times D_K. \qquad (3)$$

Here,

$$E_K = \sum_{i=1}^{K} E_i, \qquad (4)$$

such that

$$E_i = \sum_{j=1}^{n_i} d^{*}_{ps}(x^{i}_{j}, c_i) \qquad (5)$$

and

$$D_K = \max_{i,j=1}^{K} \left\| c_i - c_j \right\|. \qquad (6)$$

DK is the maximum Euclidean distance between two cluster centres among all pairs of centres. d*_ps(x^i_j, c_i) is computed by Eq. 2 with a constraint: the knear nearest neighbors of x*_j = 2 × c_i − x^i_j are searched among only those points that are in cluster i, i.e., the knear nearest neighbors of x*_j, the reflected point of x^i_j with respect to c_i, and x^i_j itself must belong to the ith cluster. The objective is to maximize this index in order to obtain the actual number of clusters.
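Putting Eqs. 3–6 together, a minimal Python sketch of the Sym-index is given below (the name sym_index is ours; labels are assumed to come from the assignment step of Sect. 4.2.1):

```python
import numpy as np

def sym_index(data, centres, labels, knear=2):
    """Sym(K) = (1/K) * (1/E_K) * D_K, with the neighbors of each reflected
    point searched within the point's own cluster only (Eq. 5)."""
    K = len(centres)
    E_K = 0.0
    for i, c in enumerate(centres):
        members = data[labels == i]
        for x in members:
            x_star = 2.0 * c - x
            nn = np.sort(np.linalg.norm(members - x_star, axis=1))[:knear]
            E_K += nn.mean() * np.linalg.norm(x - c)    # d*_ps(x_j^i, c_i)
    D_K = max(np.linalg.norm(ci - cj)                   # Eq. 6
              for a, ci in enumerate(centres) for cj in centres[a + 1:])
    return D_K / (K * E_K)                              # larger is better
```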


4.2.3 Explanation

As formulated in Eq. 3, Sym-index is a composition of three factors: 1/K, 1/EK and DK. The first factor increases as K decreases; as Sym-index needs to be maximized for optimal clustering, this factor prefers to decrease the value of K. The second factor is a measure of the total within-cluster symmetry. For clusters which have good symmetrical structures, the EK value is small. Note that as K increases, in general, the clusters tend to become more symmetric. Moreover, as de(x, c) in Eq. 2 also decreases, EK decreases, resulting in an increase in the value of the Sym-index; since Sym-index needs to be maximized, this factor prefers to increase the value of K. Finally, the third factor, DK, measuring the maximum separation between a pair of clusters, increases with the value of K. Note that the value of DK is bounded by the maximum separation between a pair of points in the data set. As these three factors are complementary in nature, they are expected to compete with and balance each other critically for determining the proper partitioning.

The use of DK as a measure of separation is further elucidated here. Instead of using the maximum separation between two clusters as DK, other approaches could have been used. Three cases are examined here thoroughly.

– DK = sum of pairwise inter-cluster distances in a K-cluster structure. In this situation, DK would increase largely with an increase in the value of K. Thus DK would take its maximum value when the number of clusters equals the number of elements in the data set.

– DK = average inter-cluster distance. Then DK would decrease at each step with K, instead of increasing. Thus it would take its maximum value with the minimum possible number of clusters.

– DK = the minimum distance between two clusters. Here also DK would decrease significantly with an increase in the number of clusters. So this would lead to a cluster structure where loosely connected sub-structures remain as they were, where in fact a separation was expected.

But if DK = maximum inter-cluster separation, it increases significantly until the maximum separation among compact clusters is achieved, and then it becomes almost constant. The maximum separation between any two clusters is attained when two extreme data elements form two single-element clusters, which eventually is the upper boundary of DK. But the terminating condition is reached well before this situation. This motivates the use of the distance between the two maximally separated cluster centers as the measure of separation, DK.

4.2.4 Calculation of objectives

The objective function values are determined in the following way.

• Sample the data set, X, B times with replacement to obtain the bth bootstrap sample X(b).

• Now, based on the centers encoded in a particular string, partition each bootstrap sample using the point symmetry-based distance [7].

• Find the Sym-index value for all the partitionings obtained for the B samples. Here Sym(X(i)) represents the Sym-index value of the partitioning on the bootstrapped sample X(i).
• Find the mean Sym-index value, denoted by SSymavg (the average value of the stability induced Sym-index):

$$SSym_{avg}(K, A_i) = E[Sym(\cdot)] = \frac{\sum_{b=1}^{B} Sym(X^{(b)})}{B} \qquad (7)$$


The difference of the Sym-index value over these B bootstrap samples of a data set, denoted by SSymdff, is also calculated as follows:

$$SSym_{dff}(K, A_i) = \frac{\sum_{i=1}^{B} \sum_{j=i+1}^{B} d\left(Sym(X^{(i)}), Sym(X^{(j)})\right)}{B(B-1)} \qquad (8)$$

Here

$$d\left(Sym(X^{(i)}), Sym(X^{(j)})\right) = \left| Sym(X^{(i)}) - Sym(X^{(j)}) \right|.$$

The optimal set of values of the number of clusters, K*, should maximize the tuple

$$\left\{ SSym_{avg}(K, A_i),\ 1/SSym_{dff}(K, A_i) \right\}$$

simultaneously, i.e., for K ∈ K*, the SSymavg(K, Ai) value corresponding to the B partitions should be maximum and the variability of the Sym-index value over all the bootstrap samples should be minimum. Note that here SSymdff provides the degree of stability of the clustering solutions over different bootstrap samples of a data set. If the clustering results on different bootstrap samples are the same, then the values of the Sym-index over these partitions will also be the same, resulting in a smaller value of SSymdff. Note that there can be other ways of defining the stability of clustering solutions. The above tuple is used as the objective functions which have to be optimized by the MOO algorithm, AMOSA. Thus, the objective functions of the proposed SSym-AMOSA technique are as follows:

$$f_1 = SSym_{avg}(K, A_i) \quad \text{and} \quad f_2 = \frac{1}{SSym_{dff}(K, A_i)}.$$

The SSym-AMOSA clustering technique simultaneously maximizes these two objective functions.
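A sketch of the two objectives for one string is shown below, reusing the theta_threshold, assign_points and sym_index helpers sketched earlier (all names ours; a zero SSymdff, arising when all bootstrap partitionings score identically, would need a guard in practice):

```python
import numpy as np

def objectives(data, centres, B, rng):
    """f1 = SSym_avg (Eq. 7) and f2 = 1/SSym_dff (Eq. 8) for one string."""
    sym_vals = np.empty(B)
    for b in range(B):
        sample = data[rng.integers(0, len(data), size=len(data))]  # bootstrap X^(b)
        theta = theta_threshold(sample)
        labels = assign_points(sample, centres, theta)             # Sect. 4.2.1
        sym_vals[b] = sym_index(sample, centres, labels)
    f1 = sym_vals.mean()                                           # SSym_avg
    pair_sum = np.abs(sym_vals[:, None] - sym_vals[None, :]).sum() / 2.0
    f2 = 1.0 / (pair_sum / (B * (B - 1)))                          # 1 / SSym_dff
    return f1, f2
```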

4.3 Mutation operation

A new string is generated from the current one by adopting one of the following three types of mutations.

1. Each cluster center encoded in a string is replaced with a random variable drawn from a Laplacian distribution, $p(\varepsilon) \propto e^{-\frac{|\varepsilon - \mu|}{\delta}}$, where the scaling factor δ sets the magnitude of the perturbation. Here μ is the value at the position which is to be perturbed. The scaling factor δ is chosen equal to 1.0. The old value at the position is replaced with the newly generated value. This type of mutation operator is applied to all dimensions independently.

2. One randomly selected cluster center is removed from the string, i.e., the total number of clusters encoded in the string is decreased by 1.

3. The total number of clusters encoded in the string is increased by 1. One randomly chosen point from the data set is encoded as the new cluster center.

Any one of the above mentioned types of mutation is applied randomly to a particular string if it is selected for mutation.
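A compact sketch of the three moves is given below (numpy Generator assumed; names are ours, and a delete or insert move that is not applicable simply returns the string unchanged, a simplification of ours):

```python
import numpy as np

def mutate(string, data, k_max, d, rng):
    """Apply one randomly chosen mutation type to a string of centres."""
    centres = string.reshape(-1, d).copy()
    move = rng.integers(3)
    if move == 0:
        centres = rng.laplace(loc=centres, scale=1.0)   # Laplacian, delta = 1.0
    elif move == 1 and len(centres) > 2:
        centres = np.delete(centres, rng.integers(len(centres)), axis=0)
    elif move == 2 and len(centres) < k_max:
        centres = np.vstack([centres, data[rng.integers(len(data))]])
    return centres.ravel()
```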


4.4 Selection of the best solution

In MOO, the algorithms produce a large number of non-dominated solutions [15] on the final Pareto optimal front. Each of these solutions provides a way of clustering the given data set. All the solutions are equally important from the algorithmic point of view. But sometimes the user may want only a single solution. Consequently, in this paper a method of selecting a single solution from the set of solutions is developed. This method is semi-supervised.

Here we assume that the class labels of 10% of the whole data set (denoted as test patterns) are known to us. The proposed SSym-AMOSA algorithm is executed on the remaining 90% of the data set, for which no class information is known beforehand. A set of Pareto optimal solutions will be generated. For each clustering associated with a solution from the final Pareto optimal set, the test patterns are also assigned cluster labels based on the nearest center criterion, and the amount of misclassification is calculated by computing the Minkowski Score values. The Minkowski Score is a measure of the quality of a solution given the true clustering [10]. Let T be the "true" solution and S the solution we wish to measure. Denote by n11 the number of pairs of elements that are in the same cluster in both S and T, by n01 the number of pairs that are in the same cluster only in S, and by n10 the number of pairs that are in the same cluster only in T. The Minkowski Score is then defined as:

$$D_M(T, S) = \sqrt{\frac{n_{01} + n_{10}}{n_{11} + n_{10}}} \qquad (9)$$

In this case the optimum score is 0, with lower scores being "better". The solution with the minimum Minkowski Score value calculated over the test patterns is selected as the best solution.
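Equation 9 can be computed directly from the pair counts; a minimal O(n²) sketch follows (the function name is ours):

```python
import numpy as np
from itertools import combinations

def minkowski_score(true_labels, pred_labels):
    """D_M(T, S) = sqrt((n01 + n10) / (n11 + n10)); 0 is optimal."""
    n11 = n10 = n01 = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_t = true_labels[i] == true_labels[j]
        same_s = pred_labels[i] == pred_labels[j]
        n11 += same_t and same_s      # same cluster in both T and S
        n10 += same_t and not same_s  # same cluster only in T
        n01 += same_s and not same_t  # same cluster only in S
    return float(np.sqrt((n01 + n10) / (n11 + n10)))
```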

5 Data sets used for experiment

Fourteen data sets are used for the experiments: eight of them are artificial (AD_5_2, AD_10_2, Mixed_3_2, Sym_3_2, Ellip_2_2, Square1, Square4, Sizes5) and six are real-life data sets (Iris, BreastCancer, Newthyroid, LungCancer, Wine and LiverDisorder).

1. AD_5_2: This data set, used in [5], consists of 250 two-dimensional data points distributed over 5 spherically shaped clusters. The clusters present in this data set are highly overlapping, each consisting of 50 data points. This data set is shown in Fig. 1a.

2. AD_10_2: This data set, used in Ref. [6], consists of 500 two-dimensional data points distributed over 10 different clusters. Some clusters are overlapping in nature. Each cluster consists of 50 data points. This data set is shown in Fig. 1b.

3. Mixed_3_2: This data set contains 600 two-dimensional data points distributed over three clusters. The clusters present here are either ellipsoidal or hyperspherical in shape. This data set is shown in Fig. 2a.

4. Sym_3_2: This data set, used in [7], is a combination of ring-shaped, spherically compact and linear clusters. The total number of points in it is 350. The dimension of this data set is two. This data set is shown in Fig. 2b.

5. Ellip_2_2: This data set, used in [8], contains 400 two-dimensional points distributed on two crossed ellipsoidal shells. This is shown in Fig. 3a.

6. Square1: This data set, used in Ref. [24], consists of 1,000 data points distributed over four squared clusters. This is shown in Fig. 3b.


Fig. 1 a AD_5_2, b AD_10_2

Fig. 2 a Mixed_3_2, b Sym_3_2

Fig. 3 a Ellip_2_2, b Square1

7. Square4: This data set, used in Ref. [24], consists of 1,000 data points distributed over four squared clusters. This is shown in Fig. 4a.

8. Sizes5: This data set, used in Ref. [24], consists of 1,000 data points distributed over four clusters. This is shown in Fig. 4b.


Fig. 4 a Square4, b Sizes5

9. Iris: The Iris data set consists of 150 four-dimensional data points distributed over 3 clusters, each consisting of 50 points. This data set represents different categories of irises characterized by four feature values [21]. It has three classes: Setosa, Versicolor and Virginica. It is known that two classes (Versicolor and Virginica) have a large amount of overlap, while the class Setosa is linearly separable from the other two.

10. BreastCancer: This Wisconsin Breast Cancer data set consists of 683 sample points. Each pattern has nine features corresponding to clump thickness, cell size uniformity, cell shape uniformity, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli and mitoses. There are two categories in the data: malignant and benign. The two classes are known to be linearly separable.

11. Newthyroid: The original database from which it has been collected is titled Thyroid gland data ('normal', 'hypo' and 'hyper' functioning). Five laboratory tests are used to predict whether a patient's thyroid belongs to the class euthyroidism, hypothyroidism or hyperthyroidism. There are a total of 215 instances and the number of attributes is five.

12. LungCancer: This data set consists of 32 instances having 56 features each. The data describes three types of pathological lung cancers.

13. Wine: This is the Wine recognition data consisting of 178 instances having 13 features resulting from a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.

14. LiverDisorder: This is the Liver Disorder data consisting of 345 instances having six features each. The data has two categories.

6 Experimental results

The above mentioned eight artificial and six real-life data sets are used to show the efficacy of the proposed automatic MOO clustering technique in determining the proper number of partitions and the appropriate partitioning of data sets. The parameters of the proposed SSym-AMOSA clustering technique are as follows: Tmax = 100, Tmin = 0.00001, α = 0.8, SL = 200 and HL = 100. Here Kmax is set equal to √n, where n is the size of the data set. The proposed SSym-AMOSA clustering technique produces a large number of non-dominated solutions on the final Pareto optimal front. The best solution is identified by following the method proposed in Sect. 4.4.


Table 1 Number of clusters (OC) and the Minkowski Score (MS) values obtained by SSym-AMOSA, another automatic MOO clustering technique, MOCK, three single objective clustering techniques, VGAPS clustering optimizing Sym-index, GCUK clustering optimizing I-index, HNGA clustering optimizing a weighted sum cluster validity function, and the PESA-II-based version of the proposed clustering algorithm, SSym-PESAII, for all the data sets used here for experiment

Data set        AC   SSym-AMOSA    MOCK          VGAPS         GCUK          HNGA          SSym-PESAII
                     OC    MS      OC    MS      OC    MS      OC    MS      OC    MS      OC    MS
AD_5_2           5    5    0.25     6    0.55     5    0.25     5    0.18     5    0.10     5    0.35
AD_10_2         10   10    0.13     6    1.01     7    0.84    10    0.15    10    0.15    13    0.57
Mixed_3_2        3    3    0.22     2    0.82     3    0.22     6    0.80    19    0.91     6    0.72
Sym_3_2          3    3    0.12     2    0.688    2    0.12     7    0.81    16    0.85     3    0.51
Ellip_2_2        2    2    0.00     4    0.74     2    0.00     3    0.86     8    0.89     7    0.72
Square1          4    4    0.19     4    0.19     4    0.19     5    0.31     4    0.199    9    0.81
Square4          4    4    0.49     4    0.60     5    0.52     4    0.51     4    0.498    7    0.73
Sizes5           4    4    0.25     2    0.64     5    0.22     2    0.48     6    0.80     6    0.66
Iris             3    3    0.54     2    0.82     3    0.62     3    0.58     2    0.85     3    0.55
Cancer           2    2    0.37     2    0.40     2    0.37     2    0.38     2    0.38     2    0.39
Newthyroid       3    2    0.75     2    0.82     3    0.58     7    0.81     5    0.84     3    0.74
Lungcancer       3    5    0.69     7    0.96     2    0.97     2    0.84     2    1.24     3    0.60
Wine             3    6    0.92     3    0.90     6    0.97     9    0.96    12    0.97     3    0.89
LiverDisorder    2    2    0.96     3    0.98     2    0.98     3    0.99     2    0.98     2    0.96

The number of clusters identified by the best solution of the proposed SSym-AMOSA clustering technique and the Minkowski Score (MS) values of the corresponding partitionings for all the data sets used here for experiment are reported in Table 1. The MS value provides a quantitative measurement of the goodness of the obtained partitioning. For the purpose of comparison, another MOO clustering technique, MOCK [24], is also executed on the above mentioned data sets with the default parameter settings. The source code for MOCK is obtained from [30]. In MOCK, the best solution from the final Pareto optimal front is selected by the GAP statistics [43]. The number of clusters automatically determined by the best solution of the MOCK clustering technique for all the data sets used here for experiment and the corresponding MS values are also reported in Table 1.

In order to show the efficacy of the proposed MOO clustering technique over existing single objective clustering techniques, three recently developed genetic algorithm-based automatic clustering techniques, genetic clustering for unknown K (GCUK clustering) [5], the variable string length genetic point symmetry-based clustering technique (VGAPS clustering) [8], and the hybrid niching genetic algorithm (HNGA) [39], are also executed on the above mentioned fourteen data sets. These single objective automatic clustering techniques provide a single solution after their execution. The GCUK clustering technique optimizes a Euclidean distance-based cluster validity index, the I-index [32], by using the search capability of genetic algorithms to automatically determine the appropriate number of partitions from data sets. The parameters of the GCUK clustering technique are as follows: population size = 100, number of generations = 40, probability of mutation = 0.2 and probability of crossover = 0.8. The VGAPS clustering technique optimizes a newly developed point symmetry-based cluster validity index, Sym-index [38], by using the search capability of genetic algorithms. The parameters of the VGAPS clustering technique are as follows: population size = 100, number of generations = 40; the probability of mutation and probability of crossover are calculated adaptively as in Ref. [40]. In the Hybrid Niching Genetic Algorithm (HNGA) [39], a weighted sum validity function (WSVF), which is a weighted sum of several normalized cluster validity functions, is used for optimization to automatically evolve the proper number of clusters and the appropriate partitioning of the data set. Within the HNGA, a niching method is developed to prevent premature convergence during the search. Additionally, in order to improve the computational efficiency, the niching method is hybridized with the computationally attractive K-means. Here the default parameter settings as used in [39] are kept. The number of clusters automatically determined by these three clustering techniques for the fourteen data sets are also reported in Table 1. The MS values are also calculated for the partitionings obtained by these three single objective clustering techniques for all data sets. These are also reported in Table 1.

Table 1 reveals that in most of the data sets (eight artificial data sets and three real-life data sets) SSym-AMOSA is able to detect the appropriate number of clusters, and the corresponding MS values are also small. Figures 5a, 7a, 10a, 12a, 14a, 16a, 19a and 22a show, respectively, the optimal partitionings identified by the best solution of SSym-AMOSA for the eight artificial data sets. Similarly, Figs. 5b, 7b, 10b, 12b, 14b, 16b, 19b and 22b show, respectively, the optimal partitionings identified by the best solution of the MOCK [24] clustering technique for the eight artificial data sets. MOCK [24] is able to detect the appropriate number of clusters for four out of these fourteen data sets. MOCK usually fails for data sets having symmetrical overlapping clusters, e.g., AD_5_2, AD_10_2, Ellip_2_2 and Sym_3_2.

The VGAPS clustering technique is able to detect the same partitioning as the best solution of the proposed SSym-AMOSA clustering technique for the AD_5_2, Mixed_3_2, Sym_3_2 and Ellip_2_2 data sets. But it fails for three artificial data sets, AD_10_2, Square1 and Sizes5. The partitionings obtained by VGAPS clustering for the eight artificial data sets are shown in Figs. 5a, 8a, 10a, 12a, 14a, 17a, 20a and 23a, respectively. For the real-life data sets, VGAPS clustering is able to detect the appropriate number of clusters for four of these six data sets. But except for the Newthyroid data, the MS scores obtained by the final solutions of the VGAPS clustering technique are poorer than those obtained by the best solutions of SSym-AMOSA clustering for all real-life data sets.

The GCUK clustering technique optimizing the I-index provides the appropriate number of clusters for only three out of the eight artificial data sets used here for experiment. This again establishes the fact that GCUK clustering is only capable of determining the appropriate number of clusters from data sets having hyperspherical clusters. The partitionings obtained by GCUK clustering for the eight artificial data sets are shown in Figs. 5c, 8b, 10c, 12c, 14c, 17b, 20b and 23b, respectively. For the AD_5_2 data set it provides the second best partitioning compared to the other clustering techniques. Among the real-life data sets, it is capable of determining the appropriate number of clusters for the Iris and Cancer data sets (refer to Table 1).

Fig. 5 Automatically clustered AD_5_2 after application of a SSym-AMOSA/VGAPS clustering technique for K = 5, b MOCK clustering technique for K = 6, c GCUK clustering technique for K = 5


Fig. 6 Clustered AD_5_2 after application of a K-means clustering technique for K = 5, b HNGA clustering technique for K = 5, c PESA-II-based multiobjective clustering technique for K = 5

Fig. 7 Automatically clustered AD_10_2 after application of a SSym-AMOSA clustering technique for K = 10, b MOCK clustering technique for K = 6

Fig. 8 Automatically clustered AD_10_2 after application of a VGAPS clustering technique for K = 7, b GCUK clustering technique for K = 10

The corresponding MS scores are also comparable with those obtained by the SSym-AMOSA clustering technique. But GCUK clustering fails to provide the appropriate number of clusters for the other four real-life data sets.

The HNGA clustering technique provides the appropriate number of clusters for only three out of the eight artificial data sets used here for experiment. This is because it can only determine the appropriate number of partitions and the appropriate partitioning from data sets having hyperspherical shaped clusters. Thus it is able to detect the appropriate partitionings for the AD_5_2, AD_10_2 and Square1 data sets.


Fig. 9 Clustered AD_10_2 after application of a K-means clustering technique for K = 10, b HNGA clustering technique for K = 10, c PESA-II-based multiobjective clustering technique for K = 10

Fig. 10 Automatically clustered Mixed_3_2 after application of a SSym-AMOSA/VGAPS clustering technique for K = 3, b MOCK clustering technique for K = 2, c GCUK clustering technique for K = 6

Fig. 11 Clustered Mixed_3_2 after application of a K-means clustering technique for K = 3, b HNGA clustering technique for K = 19, c PESA-II-based multiobjective clustering technique for K = 6

Fig. 12 Automatically clustered Sym_3_2 after application of a SSym-AMOSA/VGAPS clustering technique for K = 3, b MOCK clustering technique for K = 2, c GCUK clustering technique for K = 7

The partitionings obtained by HNGA clustering for the eight artificial data sets are shown in Figs. 6b, 9b, 11b, 13b, 15b, 18b, 21b and 24b, respectively. For the AD_5_2 data set it provides the best partitioning compared to the other clustering techniques.


Fig. 13 Clustered Sym_3_2 after application of a K-means clustering technique for K = 3, b HNGA clustering technique for K = 16, c PESA-II-based multiobjective clustering technique for K = 3

Fig. 14 Automatically clustered Ellip_2_2 after application of a SSym-AMOSA/VGAPS clustering technique for K = 2, b MOCK clustering technique for K = 4, c GCUK clustering technique for K = 3

Fig. 15 Clustered Ellip_2_2 after application of a K-means clustering technique for K = 2, b HNGA clustering technique for K = 8, c PESA-II-based multiobjective clustering technique for K = 7

Fig. 16 Automatically clustered Square1 after application of a SSym-AMOSA clustering technique for K = 4, b MOCK clustering technique for K = 4


Fig. 17 Automatically clustered Square1 after application of a VGAPS clustering technique for K = 4, b GCUK clustering technique for K = 5

Fig. 18 Clustered Square1 after application of a K-means clustering technique for K = 4, b HNGA clustering technique for K = 4, c PESA-II-based multiobjective clustering technique for K = 9

Fig. 19 Automatically clustered Square4 after application of a SSym-AMOSA clustering technique for K = 4, b MOCK clustering technique for K = 4

Among the real-life data sets, it is able to determine the appropriate number of clusters only for the Cancer and LiverDisorder data sets (refer to Table 1). But it fails to provide the appropriate number of clusters for the other four real-life data sets.

We have also shown the effectiveness of the optimization technique used, AMOSA, in comparison with another existing evolutionary multiobjective optimization technique, PESA-II (Pareto Envelope-Based Selection Algorithm-II) [13], for the proposed clustering algorithm. Here, PESA-II is used instead of AMOSA to optimize the tuple described in Sect. 4.2.4. The number of partitions and the corresponding Minkowski Scores obtained by the PESA-II-based multiobjective clustering technique, SSym-PESAII, for all the data sets are shown in Table 1.


Fig. 20 Automatically clustered Square4 after application of a VGAPS clustering technique for K = 5, b GCUK clustering technique for K = 4

Fig. 21 Clustered Square4 after application of a K-means clustering technique for K = 4, b HNGA clustering technique for K = 4, c PESA-II-based multiobjective clustering technique for K = 7

Fig. 22 Automatically clustered Sizes5 after application of a SSym-AMOSA clustering technique for K = 4, b MOCK clustering technique for K = 2

Results show that the PESA-II-based clustering technique is able to detect the proper partitionings for only one artificial data set. The partitionings obtained by the PESA-II-based clustering technique, SSym-PESAII, are shown in Figs. 6c, 9c, 11c, 13c, 15c, 18c, 21c and 24c, respectively, for all the artificial data sets. SSym-PESAII is able to detect the appropriate number of clusters for all six real-life data sets. Thus, it can be concluded that AMOSA performs better than PESA-II as the underlying optimization technique in the proposed clustering algorithm.



Fig. 23 Automatically clustered Sizes5 after application of a VGAPS clustering technique for K = 5, b GCUK clustering technique for K = 2


Fig. 24 Clustered Sizes5 after application of a K-means clustering technique for K = 4, b HNGA clustering technique for K = 6, c PESA-II-based multiobjective clustering technique for K = 6

In order to show that the proposed SSym-AMOSA performs better than some hybrid methods, the number of clusters and the Minkowski Score values obtained by the K-means clustering technique together with several existing cluster validity indices are also reported in Table 2. Here, K-means is executed for K = 2, ..., √n for all data sets. Thereafter, the values of eight cluster validity indices are calculated for all the partitionings obtained for a particular data set. The eight cluster validity indices are the DB-index [14], Dunn-index [19], Generalized Dunn's index [11], XB-index [47], PBM-index [37], K-index [28], FS-index [22] and SV-index [27]. Suppose that, after execution of the K-means clustering technique on a particular data set, a total of (K_max − K_min + 1) partitions are generated, U*_{K_min}, U*_{K_min+1}, ..., U*_{K_max}, with the corresponding validity index (V) values computed as V_{K_min}, V_{K_min+1}, ..., V_{K_max}. Let K* = argopt_{i=K_min,...,K_max}[V_i]. Then, according to index V, K* is the correct number of clusters present in the data. The corresponding U*_{K*} may be obtained by using a suitable clustering technique with the number of clusters set to K*. The tuple ⟨U*_{K*}, K*⟩ is presented as the solution to the clustering problem. The number of clusters identified by these eight cluster validity indices along with the K-means clustering technique, and the corresponding Minkowski Score values of the obtained partitionings, are reported in Table 2. The K-means clustering technique is able to detect the proper partitionings from only the AD_5_2, AD_10_2, Square1, Square4 and Sizes5 artificial data sets, i.e., the data sets with hyperspherically shaped clusters. The partitionings obtained by the K-means clustering technique for all artificial data sets, for the actual number of clusters present in each data set, are shown in Figs. 6a, 9a, 11a, 13a, 15a, 18a, 21a and 24a, respectively. Results reported in Table 2 show that the DB-index along with the K-means clustering technique is able to detect the proper partitioning and the proper number of partitions for only four artificial data sets and two real-life data sets.


Table 2 Number of clusters (OC) and the Minkowski Score (MS) values obtained by eight different cluster validity indices, where K-means is used as the underlying partitioning technique and the number of clusters is varied from 2 to √n. Here n and AC denote the number of data points and the actual number of clusters, respectively

Data set AC DB Dunn GDunn XB PBM K FS SV

OC MS OC MS OC MS OC MS OC MS OC MS OC MS OC MS

AD_5_2 5 5 0.47 4 0.74 8 0.73 4 0.74 5 0.47 4 0.74 10 0.77 3 1.08

AD_10_2 10 8 0.66 3 1.64 12 0.40 10 0.13 10 0.13 9 0.47 15 0.57 5 1.11

Mixed_3_2 3 2 0.82 4 0.73 2 0.82 2 0.82 6 0.77 2 0.82 8 0.80 2 0.82

Sym_3_2 3 3 0.83 9 0.85 2 0.93 4 0.70 5 0.83 4 0.70 10 0.86 2 0.93

Ellip_2_2 2 3 0.85 2 0.81 2 0.81 3 0.85 5 0.86 3 0.85 10 0.90 2 0.81

Square1 4 4 0.199 6 0.54 5 0.40 4 0.199 4 0.199 4 0.199 10 0.77 4 0.199

Square4 4 4 0.50 4 0.50 6 0.69 4 0.50 4 0.50 4 0.50 10 0.83 4 0.50

Sizes5 4 4 0.25 7 0.85 4 0.25 4 0.25 5 0.70 4 0.25 4 0.25 4 0.25

Iris 3 2 0.85 5 0.69 2 0.85 2 0.85 3 0.60 2 0.85 5 0.69 2 0.85

Cancer 2 2 0.35 8 0.77 9 0.84 2 0.35 2 0.35 2 0.35 10 0.85 2 0.35

Newthyroid 3 5 0.83 10 0.90 10 0.90 3 0.81 4 0.83 3 0.81 9 0.90 7 0.89

Lungcancer 3 5 0.77 5 0.77 6 0.80 5 0.77 3 0.93 5 0.77 6 0.80 2 0.83

Wine 3 2 0.96 2 0.96 5 0.91 2 0.96 2 0.96 2 0.96 7 0.95 4 0.94

LiverDisorder 2 2 0.99 8 0.99 8 0.99 2 0.99 4 0.99 2 0.99 8 0.99 10 1.00

Similarly, the Dunn, GDunn, XB, PBM, K, FS and SV indices succeed in determining the appropriate partitioning and the appropriate number of partitions for one, one, seven, seven, six, one and four out of the fourteen data sets, respectively, along with the K-means clustering technique. Thus it can be concluded that none of these hybrid methods performs as well as the proposed SSym-AMOSA clustering technique.
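To make the hybrid procedure concrete, the following is a minimal sketch of this select-by-validity-index loop, not the authors' code. It uses scikit-learn's KMeans and its built-in Davies-Bouldin index as the validity measure V; the other indices of Table 2 would plug into the same loop, with the argmin replaced by an argmax for indices that are to be maximized.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

def hybrid_model_selection(X):
    """Sketch of the hybrid procedure: K-means for K = 2..sqrt(n),
    then pick K* by a cluster validity index (Davies-Bouldin here,
    which is minimized for good partitionings)."""
    n = X.shape[0]
    best_k, best_labels, best_v = None, None, np.inf
    for k in range(2, int(np.sqrt(n)) + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        v = davies_bouldin_score(X, labels)   # V_K for the partition U*_K
        if v < best_v:                        # argopt is argmin for the DB-index
            best_k, best_labels, best_v = k, labels, v
    return best_k, best_labels                # the tuple <U*_{K*}, K*>
```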

Thus, the results show that, in general, the proposed SSym-AMOSA clustering technique performs the best compared with the four other state-of-the-art automatic clustering techniques and the several hybrid methods of determining the appropriate number of clusters from different data sets. For the purpose of illustration, the boxplots of the Minkowski Score values of the solutions on the final Pareto optimal front provided by both the SSym-AMOSA and MOCK clustering techniques are given here for four real-life data sets: Iris, Newthyroid, LiverDisorder and Wine. These are shown in Figs. 25 and 26, respectively.
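Since the Minkowski Score is the figure of merit throughout these comparisons, a small self-contained sketch of the score is given below, assuming its standard pair-counting definition (lower is better, with 0 indicating a perfect match with the reference partitioning); this is an illustration, not the authors' implementation.

```python
from itertools import combinations
from math import sqrt

def minkowski_score(true_labels, pred_labels):
    """Minkowski Score via pair counting (standard definition assumed).

    n11: pairs co-clustered in both solutions,
    n10: pairs co-clustered only in the reference solution,
    n01: pairs co-clustered only in the computed solution.
    MS = sqrt((n01 + n10) / (n11 + n10)); 0 means a perfect match."""
    n11 = n10 = n01 = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            n11 += 1
        elif same_true:
            n10 += 1
        elif same_pred:
            n01 += 1
    return sqrt((n01 + n10) / (n11 + n10))

# e.g. minkowski_score([0, 0, 1, 1], [0, 0, 1, 2]) -> sqrt(1/2) ~ 0.71
```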

The final Pareto optimal fronts of the proposed SSym-AMOSA clustering technique for four of the real-life data sets, Cancer, Newthyroid, Lungcancer and LiverDisorder, are also shown in Figs. 27 and 28, respectively, for the purpose of illustration.

6.1 Results using different space-partitioning methods

We have also executed the proposed algorithm with another space-partitioning technique in place of the Kd-tree. Here, the box-decomposition tree (bd-tree) [36] is used as the space-partitioning data structure in which the points are stored. The ANN library [34] used in this work supports this data structure as well.

SSym-AMOSA is executed on all the data sets used in the experiments with the bd-tree as the underlying data structure. The results show that the performance of this variant is quite similar to that of SSym-AMOSA using the Kd-tree.



Fig. 25 Boxplots of the Minkowski Scores of the Pareto optimal solutions obtained by the SSym-AMOSA clustering technique and the MOCK clustering technique for a Iris data set, b Newthyroid data set. Here column '1' denotes the SSym-AMOSA clustering technique and column '2' denotes the MOCK clustering technique


Fig. 26 Boxplots of the Minkowski Scores of the Pareto optimal solutions obtained by the SSym-AMOSA clustering technique and the MOCK clustering technique for a LiverDisorder data set, b Wine data set. Here column '1' denotes the SSym-AMOSA clustering technique and column '2' denotes the MOCK clustering technique


Fig. 27 Pareto optimal front (SSymdff plotted against 1/SSymavg) obtained by the proposed SSym-AMOSA clustering technique for a Cancer data set, b Newthyroid data set

Results are reported here for six data sets. For the AD_5_2, AD_10_2, Mixed_3_2, Square1, Square4 and Sizes5 data sets, SSym-AMOSA using the bd-tree attains MS scores of 0.44 (optimal number of clusters = 5), 0.15 (optimal number of clusters = 10), 0.23 (optimal number of clusters = 3), 0.199 (optimal number of clusters = 4), 0.55 (optimal number of clusters = 4) and 0.55 (optimal number of clusters = 5), respectively.
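The observation underlying this experiment is that the clustering algorithm interacts with the stored points only through k-nearest-neighbour queries, so the space-partitioning structure can be swapped freely. A minimal sketch of that abstraction is given below, using scikit-learn's KDTree and BallTree as stand-ins for the kd-tree and bd-tree of the ANN library (the experiments themselves use ANN's C++ structures [34, 36]).

```python
import numpy as np
from sklearn.neighbors import BallTree, KDTree

def build_index(points, structure="kd"):
    """Build the nearest-neighbour index; the rest of the algorithm
    only ever calls .query(), so the structure is interchangeable."""
    return KDTree(points) if structure == "kd" else BallTree(points)

rng = np.random.default_rng(0)
points = rng.random((100, 2))
for structure in ("kd", "ball"):
    tree = build_index(points, structure)
    dists, idx = tree.query(points[:1], k=2)  # 2 nearest neighbours of one query
    print(structure, dists[0])                # near-identical answers, different trees
```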



Fig. 28 Pareto optimal front (SSymdff plotted against 1/SSymavg) obtained by the proposed SSym-AMOSA clustering technique for a Lungcancer data set, b LiverDisorder data set

Table 3 Number of clusters and the Minkowski Score (MS) values obtained by the SSym-AMOSA clustering technique for two data sets, where knear is varied in the range 2-6

knear Mixed_3_2 Iris

OC MS OC MS

2 3 0.22 3 0.54

3 3 0.25 3 0.63

4 4 0.45 3 0.58

5 7 0.77 3 0.60

6 7 0.77 3 0.60


6.2 Results using different values of knear

In order to show how the selection of the value of knear in Eq. 2 affects the performance of the proposed clustering technique, results on the Mixed_3_2 and Iris data sets are shown for five different values of knear: knear = 2, 3, 4, 5, 6. Note that a short discussion on the choice of knear is presented in Sect. 3.1 of the paper. The optimal number of clusters and the corresponding MS score values obtained by the SSym-AMOSA clustering technique for these data sets for the five different values of knear are shown in Table 3. The results show that, for both the Mixed_3_2 and Iris data sets, SSym-AMOSA provides the best partitionings with knear = 2. Thus, it can be concluded that, although the proper value of knear depends on the data set itself, knear = 2 is suitable for most data sets.
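To see where knear enters, the sketch below computes the point-symmetry-based distance in the form defined in [7]: a point x is reflected through a candidate cluster centre c to x* = 2c − x, the distances from x* to its knear nearest data points are averaged, and the result is scaled by the Euclidean distance d_e(x, c). This is a sketch under that assumed form, not the paper's code.

```python
import numpy as np
from sklearn.neighbors import KDTree

def ps_distance(x, centre, tree, knear=2):
    """Point-symmetry-based distance d_ps(x, c), assuming the form of [7]:
    d_sym = mean distance from the reflected point 2c - x to its knear
    nearest data points; d_ps = d_sym * d_e(x, c)."""
    reflected = 2.0 * centre - x                  # symmetric point of x w.r.t. c
    dists, _ = tree.query(reflected.reshape(1, -1), k=knear)
    d_sym = dists[0].mean()                       # average over the knear neighbours
    return d_sym * np.linalg.norm(x - centre)     # scale by the Euclidean distance

data = np.random.default_rng(1).random((200, 2))
tree = KDTree(data)                               # the structure discussed in Sect. 6.1
print(ps_distance(data[0], data.mean(axis=0), tree, knear=2))
```

A small knear keeps the symmetry estimate local, which is consistent with the degradation seen in Table 3 as knear grows.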

6.3 Results using different parameter settings

The annealing schedule of an SA algorithm consists of (1) the initial value of the temperature (Tmax), (2) the cooling schedule, (3) the number of iterations to be performed at each temperature and (4) the stopping criterion used to terminate the algorithm. A very short discussion regarding the proper choice of these parameters is given in Section IV.D of the paper on AMOSA [9]. The choice indeed depends on the data sets used; here, these parameters are selected manually.
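For illustration, a generic simulated-annealing skeleton showing how these four schedule components interact is given below; this is not AMOSA itself, whose archive-based acceptance rules are described in [9], and the toy objective is purely hypothetical.

```python
import math
import random

def simulated_annealing(objective, x0, t_max=100.0, t_min=1e-5,
                        alpha=0.9, iters_per_temp=50, step=0.5):
    """Generic SA skeleton exposing the four schedule components:
    (1) initial temperature t_max, (2) geometric cooling T <- alpha * T,
    (3) iters_per_temp moves at each temperature, (4) stop when T < t_min.
    The number of temperature levels is log(t_min/t_max)/log(alpha), so
    widening the [t_min, t_max] range lengthens the search."""
    x, fx = x0, objective(x0)
    temperature = t_max
    while temperature > t_min:                      # (4) stopping criterion
        for _ in range(iters_per_temp):             # (3) moves per temperature
            y = x + random.uniform(-step, step)     # perturb the current solution
            fy = objective(y)
            # accept improvements always; worse moves with Boltzmann probability
            if fy < fx or random.random() < math.exp((fx - fy) / temperature):
                x, fx = y, fy
        temperature *= alpha                        # (2) geometric cooling
    return x, fx

best_x, best_f = simulated_annealing(lambda x: (x - 3.0) ** 2, x0=0.0)
```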


Table 4 Number of clusters and the Minkowski Score (MS) values obtained by the SSym-AMOSA clustering technique for two data sets under seven different parameter settings

Parameter setting AD_10_2 Iris

OC MS OC MS

Setting 1 10 0.13 3 0.54

Setting 2 11 0.47 3 0.54

Setting 3 11 0.24 3 0.54

Setting 4 10 0.13 3 0.55

Setting 5 10 0.13 3 0.55

Setting 6 10 0.13 3 0.54

Setting 7 10 0.15 3 0.54

In this paper we have shown the sensitivity of the results for two data sets, AD_10_2 and Iris, for different values of Tmax, Tmin, SL and HL. Here, seven different parameter settings are used:

1. Setting 1: Tmax = 100, Tmin = 0.00001, SL = 200, HL = 100.
2. Setting 2: Tmax = 10, Tmin = 0.01, SL = 30, HL = 20.
3. Setting 3: Tmax = 10, Tmin = 0.1, SL = 50, HL = 40.
4. Setting 4: Tmax = 10, Tmin = 0.01, SL = 200, HL = 100.
5. Setting 5: Tmax = 10, Tmin = 0.1, SL = 200, HL = 100.
6. Setting 6: Tmax = 100, Tmin = 0.00001, SL = 50, HL = 40.
7. Setting 7: Tmax = 100, Tmin = 0.00001, SL = 30, HL = 20.

The number of clusters and the Minkowski Score values obtained by SSym-AMOSA with these seven different parameter settings for these two data sets are shown in Table 4. The results show that, for reasonably good performance of SSym-AMOSA, either the SL/HL values should be high or there should be a sufficient number of iterations (i.e., the difference between Tmax and Tmin should be large enough). It can also be concluded that the parameter settings used in this paper are applicable for any data set.

7 Conclusion

In this paper, a multiobjective clustering technique is proposed which simultaneously optimizes two objectives, one reflecting the total symmetry present in the partitions of the data set and the other reflecting the stability of the obtained partitions over different bootstrap samples of the data set. The proposed algorithm assigns points to different clusters based on the point symmetry-based distance rather than the Euclidean distance. Results on several artificial and real-life data sets show that the proposed technique is well suited to detecting the number of clusters from data sets having point symmetric clusters. Much further work is needed to investigate the use of different and additional objectives, and to test the approach still more extensively. Selecting the best solution(s) from the Pareto optimal front is an important problem in multiobjective clustering. One method of selecting a single solution from the final Pareto optimal front is proposed here, but this method assumes a priori that the labels of 10% of the points are known beforehand. Thus, new unsupervised methods for choosing the best solution from the final Pareto optimal front have to be developed. Future work also includes detailed sensitivity studies of how the parameters of the proposed SSym-AMOSA clustering technique affect its performance.
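For illustration, the semi-supervised selection step described above can be sketched as follows. The sketch assumes a hypothetical list pareto_solutions of numpy label vectors from the final archive and labels for 10% of the points, and it uses scikit-learn's adjusted Rand index as the agreement measure in place of the Minkowski Score.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def select_solution(pareto_solutions, known_idx, known_labels):
    """Pick the Pareto-optimal solution agreeing best with the labelled 10%.

    pareto_solutions: hypothetical list of numpy label vectors, one per
    solution in the final archive; known_idx/known_labels describe the
    labelled subset."""
    scores = [adjusted_rand_score(known_labels, sol[known_idx])
              for sol in pareto_solutions]
    return pareto_solutions[int(np.argmax(scores))]  # highest agreement wins

# e.g., with n points and 10% of the labels known:
# known_idx = np.random.default_rng(2).choice(n, n // 10, replace=False)
# best = select_solution(archive_labels, known_idx, true_labels[known_idx])
```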


References

1. Anderberg MR (2000) Computational geometry: algorithms and applications. Springer, Berlin
2. Assent I, Krieger R, Glavic B, Seidl T (2008) Clustering multidimensional sequences in spatial and temporal databases. Knowl Inf Syst 16(1):1–27
3. Attneave F (1955) Symmetry information and memory for pattern. Am J Psychol 68:209–222
4. Bandyopadhyay S, Maulik U (2001) Nonparametric genetic clustering: comparison of validity indices. IEEE Trans Syst Man Cybern C 31(1):120–125
5. Bandyopadhyay S, Maulik U (2002) Genetic clustering for automatic evolution of clusters and application to image classification. Pattern Recognit 35(6):1197–1208
6. Bandyopadhyay S, Pal SK (2007) Classification and learning using genetic algorithms: applications in bioinformatics and web intelligence. Springer, Heidelberg
7. Bandyopadhyay S, Saha S (2007) GAPS: a clustering method using a new point symmetry based distance measure. Pattern Recognit 40:3430–3451
8. Bandyopadhyay S, Saha S (2008) A point symmetry based clustering technique for automatic evolution of clusters. IEEE Trans Knowl Data Eng 20(11):1–17 (accepted)
9. Bandyopadhyay S, Saha S, Maulik U, Deb K (2008) A simulated annealing based multi-objective optimization algorithm: AMOSA. IEEE Trans Evol Comput 12(3):269–283
10. Ben-Hur A, Guyon I (2003) Detecting stable clusters using principal component analysis. In: Methods in molecular biology. Humana Press, Totowa, NJ
11. Bezdek JC, Pal NR (1998) Some new indexes of cluster validity. IEEE Trans Syst Man Cybern 28:301–315
12. Breckenridge J (1989) Replicating cluster analysis: method, consistency and validity. Multivar Behav Res 24:147–161
13. Corne DW, Jerram NR, Knowles JD, Oates MJ (2001) PESA-II: region-based selection in evolutionary multiobjective optimization. In: Spector L et al (eds) Proceedings of the genetic and evolutionary computation conference (GECCO-2001). Morgan Kaufmann, San Francisco, pp 283–290. http://citeseer.ist.psu.edu/corne01pesaii.html
14. Davies DL, Bouldin DW (1979) A cluster separation measure. IEEE Trans Pattern Anal Mach Intell 1:224–227
15. Deb K (2001) Multi-objective optimization using evolutionary algorithms. John Wiley & Sons, England
16. Deb K, Pratap A, Agarwal S, Meyarivan T (2002) A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans Evol Comput 6(2):182–197
17. Denton AM, Besemann CA, Dorr DH (2009) Pattern-based time-series subsequence clustering using radial distribution functions. Knowl Inf Syst 18(1):1–27
18. Dudoit S, Fridlyand J (2002) A prediction-based resampling method for estimating the number of clusters in a data set. Genome Biol 3(7):1299–1323
19. Dunn JC (1974) Well separated clusters and optimal fuzzy partitions. J Cybern 4:95–104
20. Eduardo RH, Nelson FFE (2003) A genetic algorithm for cluster analysis. Intell Data Anal 7:15–25
21. Fisher RA (1936) The use of multiple measurements in taxonomic problems. Ann Eugenics 7:179–188
22. Fukuyama Y, Sugeno M (1989) A new method of choosing the number of clusters for the fuzzy c-means method. In: Proceedings of the fifth fuzzy systems symposium, pp 247–250
23. Geman S, Geman D (1984) Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans Pattern Anal Mach Intell 6(6):721–741
24. Handl J, Knowles J (2007) An evolutionary approach to multiobjective clustering. IEEE Trans Evol Comput 11(1):56–76
25. Holland JH (1975) Adaptation in natural and artificial systems. The University of Michigan Press, Ann Arbor
26. Jain AK, Duin RPW, Mao J (2000) Statistical pattern recognition: a review. IEEE Trans Pattern Anal Mach Intell 22(1):4–37
27. Kim DJ, Park YW, Park DJ (2001) A novel validity index for determination of the optimal number of clusters. IEICE Trans Inf Syst E84-D(2):281–285
28. Kwon SH (1998) Cluster validity index for fuzzy clustering. Electron Lett 34(22):2176–2177
29. Lange T, Roth V, Braun ML, Buhmann JM (2004) Stability-based validation of clustering solutions. Neural Comput 16:1299–1323
30. Le T (2007) Multiobjective clustering with automatic determination of the number of clusters. http://dbkgroup.org/handl/mock/
31. Li T (2008) Clustering based on matrix approximation: a unifying view. Knowl Inf Syst 17(1):1–15
32. Maulik U, Bandyopadhyay S (2002) Performance evaluation of some clustering algorithms and validity indices. IEEE Trans Pattern Anal Mach Intell 24(12):1650–1654
33. Moise G, Sander J, Ester M (2008) Robust projected clustering. Knowl Inf Syst 14(3):273–298
34. Mount DM, Arya S (2005) ANN: a library for approximate nearest neighbor searching. http://www.cs.umd.edu/~mount/ANN
35. Nayak R (2008) Fast and effective clustering of XML data using structural information. Knowl Inf Syst 14(2):197–215
36. Ohsawa Y, Sakauchi M (1983) BD-tree: a new n-dimensional data structure with efficient dynamic characteristics. In: Proceedings of the 9th world computer congress, IFIP'83, pp 539–544
37. Pakhira MK, Maulik U, Bandyopadhyay S (2004) Validity index for crisp and fuzzy clusters. Pattern Recognit 37(3):487–501
38. Saha S, Bandyopadhyay S (2008) Application of a new symmetry based cluster validity index for satellite image segmentation. IEEE Geosci Remote Sens Lett 5(2):166–170
39. Sheng W, Swift S, Zhang L, Liu X (2005) A weighted sum validity function for clustering with a hybrid niching genetic algorithm. IEEE Trans Syst Man Cybern B 35(6):1156–1167
40. Srinivas M, Patnaik L (1994) Adaptive probabilities of crossover and mutation in genetic algorithms. IEEE Trans Syst Man Cybern 24(4):656–667
41. Su M-C, Chou C-H (2001) A modified version of the k-means algorithm with a distance based on cluster symmetry. IEEE Trans Pattern Anal Mach Intell 23(6):674–680
42. Tibshirani R, Walther G, Botstein D, Brown P (2001) Cluster validation by prediction strength. Technical report, Statistics Department, Stanford University, Stanford, CA
43. Tibshirani R, Walther G, Hastie T (2000) Estimating the number of clusters in a dataset via the gap statistic. Technical report
44. van Laarhoven PJM, Aarts EHL (1987) Simulated annealing: theory and applications. Kluwer Academic Publishers, Dordrecht
45. Van Veldhuizen DA, Lamont GB (2000) Multiobjective evolutionary algorithms: analyzing the state-of-the-art. Evol Comput 8(2):125–147
46. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Yu PS, Zhou Z-H, Steinbach M, Hand DJ, Steinberg D (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37
47. Xie XL, Beni G (1991) A validity measure for fuzzy clustering. IEEE Trans Pattern Anal Mach Intell 13:841–847

Author Biographies

Sriparna Saha received her B.Tech degree in Computer Science and Engineering from Kalyani Govt. Engineering College, University of Kalyani, India, in 2003. She received her M.Tech in Computer Science from the Indian Statistical Institute, Kolkata, in 2005, and is the recipient of the Lt Rashi Roy Memorial Gold Medal from the Indian Statistical Institute for outstanding performance in the M.Tech (Computer Science) programme. At present she is pursuing her Ph.D. at the Indian Statistical Institute, Kolkata, India. She has co-authored more than thirty articles in international journals and conference/workshop proceedings, and has worked at the University of Houston, USA. She is the recipient of the Google India Women In Engineering Award, 2008, and her name is included in the 2009 edition of Who's Who in the World. She is a reviewer for many international journals and conferences and has served on the advisory committees of several international conferences. Her research interests include multiobjective optimization, pattern recognition, evolutionary algorithms, and data mining.


Dr. Sanghamitra Bandyopadhyay did her B.Tech, M.Tech and Ph.D. in Computer Science from Calcutta University, IIT Kharagpur and ISI, respectively. She is currently an associate professor at the Indian Statistical Institute, Kolkata, India. She has worked at the Los Alamos National Laboratory, Los Alamos, New Mexico; the University of New South Wales, Sydney, Australia; the University of Texas at Arlington; the University of Maryland at Baltimore; the Fraunhofer Institute, Germany; and Tsinghua University, China. She is the first recipient of the Dr. Shanker Dayal Sharma Gold Medal and also the Institute Silver Medal for being adjudged the best all-around postgraduate performer at IIT Kharagpur, India, in 1994. She has also received the Young Scientist Awards of the Indian National Science Academy (INSA) and the Indian Science Congress Association (ISCA) in 2000, the Young Engineer Award of the Indian National Academy of Engineers (INAE) in 2002, and the Swarnajayanti fellowship from the Department of Science and Technology (DST) in 2007. She has authored/co-authored more than 150 technical articles in international journals, book chapters, and conference/workshop proceedings. She has delivered many invited talks and tutorials around the world, and has been the chair and a member of several conference committees. She has published an authored book titled "Classification and Learning Using Genetic Algorithms: Applications in Bioinformatics and Web Intelligence" (Springer) and two edited books titled "Advanced Methods for Knowledge Discovery from Complex Data" (Springer, United Kingdom, 2005) and "Analysis of Biological Data: A Soft Computing Approach" (World Scientific, 2007). She is presently editing another book titled "Computational Intelligence and Pattern Analysis in Biological Informatics" to be published by Wiley. She has also edited journal special issues in the areas of soft computing, data mining, and bioinformatics. Her research interests include computational biology and bioinformatics, soft and evolutionary computation, pattern recognition and data mining. She is a senior member of the IEEE.
