Selection of K in K-means Clustering


D T Pham, S S Dimov, and C D Nguyen

    Manufacturing Engineering Centre, Cardiff University, Cardiff, UK

    The manuscript was received on 26 May 2004 and was accepted after revision for publication on 27 September 2004.

    DOI: 10.1243/095440605X8298

Abstract: The K-means algorithm is a popular data-clustering algorithm. However, one of its drawbacks is the requirement for the number of clusters, K, to be specified before the algorithm is applied. This paper first reviews existing methods for selecting the number of clusters for the algorithm. Factors that affect this selection are then discussed and a new measure to assist the selection is proposed. The paper concludes with an analysis of the results of using the proposed measure to determine the number of clusters for the K-means algorithm for different data sets.

    Keywords: clustering, K-means algorithm, cluster number selection

    1 INTRODUCTION

Data clustering is a data exploration technique that allows objects with similar characteristics to be grouped together in order to facilitate their further processing. Data clustering has many engineering applications, including the identification of part families for cellular manufacture.

The K-means algorithm is a popular data-clustering algorithm. To use it requires the number of clusters in the data to be pre-specified. Finding the appropriate number of clusters for a given data set is generally a trial-and-error process, made more difficult by the subjective nature of deciding what constitutes correct clustering [1].
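For readers unfamiliar with the algorithm, the following minimal NumPy sketch of Lloyd's K-means procedure illustrates the point: the number of clusters K is a mandatory input, fixed before any assignment takes place. The function name and initialization scheme are illustrative assumptions, not from the paper; the later sketches in this transcript reuse this kmeans() helper.

    import numpy as np

    def kmeans(X, K, n_iter=100, seed=0):
        # Minimal Lloyd's K-means. X: (N, Nd) array; K must be chosen in advance.
        rng = np.random.default_rng(seed)
        centres = X[rng.choice(len(X), size=K, replace=False)]  # K sampled objects
        for _ in range(n_iter):
            # Assignment step: nearest centre by squared Euclidean distance.
            d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # Update step: move each centre to the mean of its members.
            new_centres = np.array([X[labels == j].mean(axis=0)
                                    if np.any(labels == j) else centres[j]
                                    for j in range(K)])
            if np.allclose(new_centres, centres):
                break
            centres = new_centres
        return centres, labels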

This paper proposes a method based on information obtained during the K-means clustering operation itself to select the number of clusters, K. The method employs an objective evaluation measure to suggest suitable values for K, thus avoiding the need for trial and error.

The remainder of the paper consists of five sections. Section 2 reviews the main known methods for selecting K. Section 3 analyses the factors influencing the selection of K. Section 4 describes the proposed evaluation measure. Section 5 presents the results of applying the proposed measure to select K for different data sets. Section 6 concludes the paper.

2 SELECTION OF THE NUMBER OF CLUSTERS AND CLUSTERING VALIDITY ASSESSMENT

This section reviews existing methods for selecting K for the K-means algorithm and the corresponding clustering validation techniques.

2.1 Values of K specified within a range or set

The performance of a clustering algorithm may be affected by the chosen value of K. Therefore, instead of using a single predefined K, a set of values might be adopted. It is important for the number of values considered to be reasonably large, to reflect the specific characteristics of the data sets. At the same time, the selected values have to be significantly smaller than the number of objects in the data sets, which is the main motivation for performing data clustering.

Reported studies [2-18] on K-means clustering and its applications usually do not contain any explanation or justification for selecting particular values for K. Table 1 lists the numbers of clusters and objects and the corresponding data sets used in those studies. Two observations could be made when analysing the data in the table. First, a number of researchers [5-7, 9] used only one or two values for K. Second, several other researchers [1, 3, 11, 13, 16] utilized relatively large K values compared with the number of objects. These two actions contravene the above-mentioned guidelines for selecting K. Therefore, the clustering results do not always correctly represent the performance of the tested algorithms.

Corresponding author: Manufacturing Engineering Centre, Cardiff University, Cardiff CF24 0YF, UK.


In general, the performance of any new version of the K-means algorithm could be verified by comparing it with its predecessors on the same criteria. In particular, the sum of cluster distortions is usually employed as such a performance indicator [3, 6, 13, 16, 18]. Thus, the comparison is considered fair because the same model and criterion are used for the performance analysis.

2.2 Values of K specified by the user

The K-means algorithm implementation in many data-mining or data analysis software packages [19-22] requires the number of clusters to be specified by the user. To find a satisfactory clustering result, usually, a number of iterations are needed where the user executes the algorithm with different values of K. The validity of the clustering result is assessed only visually, without applying any formal performance measures. With this approach, it is difficult for users to evaluate the clustering result for multi-dimensional data sets.

2.3 Values of K determined in a later processing step

When K-means clustering is used as a pre-processing tool, the number of clusters is determined by the specific requirements of the main processing algorithm [13]. No attention is paid to the effect of the clustering results on the performance of this algorithm. In such applications, the K-means algorithm is employed just as a black box, without validation of the clustering result.

Table 1 The number of clusters used in different studies of the K-means algorithm

Reference   Numbers of clusters K                       Number of objects N   Maximum K/N ratio (%)
[2]         32, 64, 128, 256, 512, 1024                 8 192                 12.50
            32, 64, 128, 256, 512, 1024                 29 000
            256                                         2 048
[3]         600, 700, 800, 900, 1000                    10 000                10.00
            600, 700, 800, 900, 1000                    50 000
[4]         4, 16, 64, 100, 128                         100 000               0.13
            4, 16, 64, 100, 128                         120 000
            4, 16, 64, 100, 128                         256 000
[5]         4                                           564                   0.70
            4                                           720
            4                                           1 000
            4                                           1 008
            4                                           1 010
            4                                           1 202
            4                                           2 000
            4                                           2 324
            4                                           3 005
            4                                           4 000
            4                                           6 272
            4                                           7 561
[6]         6                                           150                   4.00
[7]         10                                          2 310                 0.43
            25                                          12 902
[8]         2, 4, 8                                     Not reported          Not reported
[9]         2, 4                                        500                   3.33
            2, 4                                        50 000
            2, 4                                        100 000
            10                                          300
[10]        1, 2, 3, 4                                  10 000                0.04
[11]        10, 20, 30, 40, 50, 60, 70, 80, 90, 100     500                   20.00
[12]        100                                         10 000                2.00
            50                                          2 500
[13]        7                                           42                    16.66
            1, 2, 3, 4, 5, 6, 7                         120
[14]        2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14  250                   5.60
[15]        8, 20, 50, 64, 256                          10 000                2.56
[16]        5000                                        50 000                50.00
            5000                                        100 000
            5000                                        200 000
            5000                                        300 000
            5000                                        433 208
            100                                         100 000
            250                                         200 000
            1000                                        100 000
            1000                                        200 000
            1000                                        300 000
            1000                                        433 208
            40                                          20 000
            10, 20, 30, 40, 50, 60, 70, 80              30 000
            50, 500, 5000                               10 000
            50, 500, 5000                               50 000
            50, 500, 5000                               100 000
            50, 500, 5000                               200 000
            50, 500, 5000                               300 000
            50, 500, 5000                               433 208
[17]        250                                         80 000                10.00
            250                                         90 000
            250                                         100 000
            250                                         110 000
            250                                         120 000
            50, 100, 400                                4 000
            50, 100, 400                                36 000
            250                                         80 000
            250                                         90 000
            250                                         100 000
            250                                         110 000
            250                                         120 000
            50, 100, 150                                4 000
            50, 100, 150                                36 000
            50                                          800 000
            500                                         800 000
[18]        3, 4                                        150                   6.67
            4, 5                                        75
            2, 7, 10                                    214


2.4 Values of K equated to the number of generators

Synthetic data sets, which are used for testing algorithms, are often created by a set of normal or uniform distribution generators. Then, clustering algorithms are applied to those data sets with the number of clusters equated to the number of generators. It is assumed that any resultant cluster will cover all objects created by a particular generator. Thus, the clustering performance is judged on the basis of the difference between objects covered by a cluster and those created by the corresponding generator. Such a difference can be measured by simply counting objects or calculating the information gain [7].
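As an illustration of this counting approach (a sketch under assumed parameters, not an experiment from the paper), two normal-distribution generators can be compared against a K = 2 clustering obtained with the kmeans() sketch from section 1:

    import numpy as np

    rng = np.random.default_rng(1)
    # Two generators, GA and GB, whose objects overlap in the middle.
    XA = rng.normal(loc=[0.0, 0.0], scale=0.8, size=(200, 2))  # objects from GA
    XB = rng.normal(loc=[2.0, 0.0], scale=0.8, size=(200, 2))  # objects from GB
    X = np.vstack([XA, XB])
    gen = np.array([0] * 200 + [1] * 200)  # generator label of each object

    _, labels = kmeans(X, K=2)

    # Clusters carry no inherent generator identity, so count mismatches
    # under both cluster-to-generator assignments and keep the smaller.
    mismatch = min(np.sum(labels != gen), np.sum(labels != 1 - gen))
    print(f"objects covered by the 'wrong' cluster: {mismatch} of {len(X)}")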

There are drawbacks with this method. The first drawback concerns the stability of the clustering results when there are areas in the object space that contain objects created by different generators. Figure 1a illustrates such a case. The data set shown in this figure has two clusters, A and B, which cover objects generated by generators G_A and G_B respectively. Object X is in an overlapping area between clusters A and B. X has probabilities P_GA and P_GB of being created by G_A and G_B respectively, and probabilities P_CA and P_CB of being included in clusters A and B respectively. All four probabilities are larger than 0. Thus, there is a chance for X to be created by generator G_A but covered by cluster B, and vice versa. In such cases, the clustering results will not be perfect. The stability of the clustering results depends on these four probabilities. With an increase in the overlapping areas in the object space, the stability of the clustering results decreases.

The difference between the characteristics of the generators also has an effect on the clustering results. In Fig. 1b, where the number of objects of cluster A is five times larger than that of cluster B, the smaller cluster B might be regarded as noise and all objects might be grouped into one cluster. Such a clustering outcome would differ from that obtained by visual inspection.

Unfortunately, this method of selecting K cannot be applied to practical problems. The data distribution in practical problems is unknown and also the number of generators cannot be specified.

2.5 Values of K determined by statistical measures

There are several statistical measures available for selecting K. These measures are often applied in combination with probabilistic clustering approaches. They are calculated with certain assumptions about the underlying distribution of the data. The Bayesian information criterion or Akaike's information criterion [14, 17] is calculated on data sets which are constructed by a set of Gaussian distributions. The measures applied by Hardy [23] are based on the assumption that the data set fits the Poisson distribution. Monte Carlo techniques, which are associated with the null hypothesis, are used for assessing the clustering results and also for determining the number of clusters [24, 25].

There have been comparisons between probabilistic and partitioning clustering [7]. Expectation-maximization (EM) is often recognized as a typical method for probabilistic clustering. Similarly, K-means clustering is considered a typical method for partitioning clustering. Although EM and K-means clustering share some common ideas, they are based on different hypotheses, models, and criteria. Probabilistic clustering methods do not take into account the distortion inside a cluster, so that a cluster created by applying such methods may not correspond to a cluster in partitioning clustering, and vice versa. Therefore, statistical measures used in probabilistic methods are not applicable in the K-means algorithm. In addition, the assumptions about the underlying distribution cannot be verified on real data sets and therefore cannot be used to obtain statistical measures.

Fig. 1 Effect of the relationship between clusters on the clustering for two object spaces in which (a) an area exists that contains objects created by two different generators and (b) there are no overlapping areas: A, objects generated by G_A; D, objects generated by G_B

2.6 Values of K equated to the number of classes

With this method, the number of clusters is equated to the number of classes in the data sets. A data-clustering algorithm can be used as a classifier by applying it to data sets from which the class attribute is omitted and then assessing the clustering results using the omitted class information [26, 27]. The outcome of the assessment is fed back to the clustering algorithm to improve its performance. In this way, the clustering can be considered to be supervised.

With this method of determining the number of clusters, the assumption is made that the data-clustering method could form clusters, each of which would consist only of objects belonging to one class. Unfortunately, most real problems do not satisfy this assumption.
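A minimal sketch of this supervised assessment (illustrative; mapping each cluster to its majority class is an assumption, as the paper does not fix a particular scoring rule):

    import numpy as np

    def supervised_assessment(X, y, K):
        # Cluster with the class attribute y (an (N,) label array) omitted,
        # then let each cluster 'predict' the majority class of its members.
        _, labels = kmeans(X, K)  # kmeans() sketch from section 1
        correct = 0
        for j in range(K):
            members = y[labels == j]
            if len(members):
                _, counts = np.unique(members, return_counts=True)
                correct += counts.max()
        return correct / len(y)  # 1.0 only if every cluster is single-class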

2.7 Values of K determined through visualization

Visual verification is applied widely because of its simplicity and explanation possibilities. Visual examples are often used to illustrate the drawbacks of an algorithm or to present the expected clustering results [5, 27].

The assessment of a clustering result using visualization techniques depends heavily on their implicit nature. The clustering models utilized by some clustering methods may not be appropriate for particular data sets. The data sets in Fig. 2 are illustrations of such cases. The application of visualization techniques implies a data distribution continuity in the expected clusters. If the K-means approach is applied to such data sets, there is not any cluster that satisfies the K-means clustering model and at the same time corresponds to a particular object grouping in the illustrated data sets. Therefore, the K-means algorithm cannot produce the expected clustering results. This suggests that the K-means approach is unsuitable for such data sets.

Fig. 2 Data sets inappropriate for the K-means approach: (a) data sets with four clusters [5]; (b) data sets with three clusters [23]; (c) data sets with eight clusters [27]. Note that the number of clusters in each data set was specified by the respective researchers

The characteristics of the data sets in Fig. 2 (position, shape, size, and object distribution) are implicitly defined. This makes the validation of the clustering results difficult. Any slight changes in the data characteristics may lead to different outcomes. The data set in Fig. 2b is an illustration of such a case. Another example is the series of data sets in Fig. 3. Although two clusters are easily identifiable in the data set in Fig. 3a, the numbers of clusters in the data sets in Figs 3b and c depend on the distance between the rings and the object density of each ring. Usually such parameters are not explicitly defined when a visual check is carried out.

In spite of the above-mentioned deficiencies, visualization of the results is still a useful method of selecting K and validating the clustering results when the data sets do not violate the assumptions of the clustering model. In addition, this method is recommended in cases where the expected results could be identified explicitly.

2.8 Values of K determined using a neighbourhood measure

A neighbourhood measure could be added to the cost function of the K-means algorithm to determine K [26]. Although this technique has shown promising results for a few data sets, it has yet to prove its potential in practical applications. Because the cost function has to be modified, this technique cannot be applied to the original K-means algorithm.

    3 FACTORS AFFECTING THE SELECTION OF K

A function f(K) for evaluating the clustering result could be used to select the number of clusters. Factors that such a function should take into account are discussed in this section.

    3.1 Approach bias

The evaluation function should be related closely to the clustering criteria. As mentioned previously, such a relation could prevent adverse effects on the validation process. In particular, in the K-means algorithm, the criterion is the minimization of the distortion of clusters, so the evaluation function should take this parameter into account.

    3.2 Level of detail

In general, observers that could see relatively low levels of detail would obtain only an overview of an object. By increasing the level of detail, they could gain more information about the observed object but, at the same time, the amount of data that they have to process increases. Because of resource limitations, a high level of detail is normally used only to examine parts of the object [28].

Such an approach could be applied in clustering. A data set with n objects could be grouped into any number of clusters between 1 and n, which would correspond to the lowest and the highest levels of detail respectively. By specifying different K values, it is possible to assess the results of grouping objects into various numbers of clusters. From this evaluation, more than one K value could be recommended to users, but the final selection is made by them.

    3.3 Internal distribution versus global impact

Clustering is used to find irregularities in the data distribution and to identify regions in which objects are concentrated. However, not every region with a high concentration of objects is considered a cluster. For a region to be identified as a cluster, it is important to analyse not only its internal distribution but also its interdependence with other object groupings in the data set.

In K-means clustering, the distortion of a cluster is a function of the data population and the distance between objects and the cluster centre, according to

I_j = Σ_{t=1..N_j} d(x_jt, w_j)^2    (1a)

where I_j is the distortion of cluster j, w_j is the centre of cluster j, N_j is the number of objects belonging to cluster j, x_jt is the tth object belonging to cluster j, and d(x_jt, w_j) is the distance between object x_jt and the centre w_j of cluster j.

Fig. 3 Variations in the two-ring data set

Each cluster is represented by its distortion and its impact on the entire data set is assessed by its contribution to the sum of all distortions, S_K, given by

S_K = Σ_{j=1..K} I_j    (1b)

where K is the specified number of clusters. Thus, such information is important in assessing whether a particular region in the object space could be considered a cluster.
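Equations (1a) and (1b) translate directly into code. The sketch below (illustrative helper names) assumes the squared Euclidean distance used by the K-means criterion:

    import numpy as np

    def cluster_distortions(X, centres, labels):
        # Equation (1a): I_j = sum over cluster j of d(x_jt, w_j)^2.
        return np.array([((X[labels == j] - centres[j]) ** 2).sum()
                         for j in range(len(centres))])

    def total_distortion(X, centres, labels):
        # Equation (1b): S_K = sum of the K cluster distortions I_j.
        return cluster_distortions(X, centres, labels).sum()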

    3.4 Constraints on f(K)

The robustness of f(K) is very important. Because this function is based on the result of the clustering algorithm, it is important for this result to vary as little as possible when K remains unchanged. However, one of the main deficiencies of the K-means approach is its dependence on randomness. Thus, the algorithm should yield consistent results so that its performance can be used as a variable in the evaluation function. A new version of the K-means algorithm, namely the incremental K-means algorithm [29], satisfies this requirement and can be adopted for this purpose.

The role of f(K) is to reveal trends in the data distribution and therefore it is important to keep it independent of the number of objects. The number of clusters, K, is assumed to be much smaller than the number of objects, N. When K increases, f(K) should converge to some constant value. Then, if, for any intermediate K, f(K) exhibits a special behaviour, such as a minimum or maximum point, that value of K could be taken as the desired number of clusters.

4 NUMBER OF CLUSTERS FOR K-MEANS CLUSTERING

As mentioned in section 3.3, cluster analysis is used to find irregularities in the data distribution. When the data distribution is uniform, there is not any irregularity. Therefore, data sets with a uniform distribution could be used to calibrate and verify the clustering result. This approach was applied by Tibshirani et al. [30]. A data set of the same dimension as the actual data set and with a uniform distribution was generated. The clustering performance on this artificial data set was then compared with the result obtained for the actual data set. A measure known as the gap statistic [30] was employed to assess performance. In this work, instead of generating an artificial data set, the clustering performance for the artificial data set was estimated. Also, instead of the gap statistic, a new and more discriminatory measure was employed for evaluating the clustering result.
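For comparison, the reference-distribution idea of Tibshirani et al. [30] can be sketched as follows. This is a simplified illustration only (the published method works with log W_K and adds a standard-error correction), reusing the kmeans() and total_distortion() sketches given earlier:

    import numpy as np

    def gap_statistic(X, K, B=10, seed=0):
        # Compare log(S_K) on the data with its average over B uniform
        # reference sets drawn from the data's bounding box (simplified [30]).
        rng = np.random.default_rng(seed)
        lo, hi = X.min(axis=0), X.max(axis=0)

        def log_sk(data):
            centres, labels = kmeans(data, K)
            return np.log(total_distortion(data, centres, labels))

        ref = [log_sk(rng.uniform(lo, hi, size=X.shape)) for _ in range(B)]
        return np.mean(ref) - log_sk(X)  # larger gap: stronger clustering at K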

When the K-means algorithm is applied to data with a uniform distribution and K is increased by 1, the clusters are likely to change and, in the new positions, the partitions will again be approximately equal in size and their distortions similar to one another. The evaluations carried out in reference [29] showed that, when a new cluster is inserted into a cluster (K = 1) with a hypercuboid shape and a uniform distribution, the decrease in the sum of distortions is proportional to the original sum of distortions. This conclusion was found to be correct for clustering results obtained with relatively small values of K. In such cases, the sum of distortions after the increase in the number of clusters could be estimated from the current value.

The evaluation function f(K) is defined using the equations

f(K) = 1                           if K = 1
f(K) = S_K / (a_K S_{K-1})         if S_{K-1} ≠ 0, for all K > 1     (2)
f(K) = 1                           if S_{K-1} = 0, for all K > 1

a_K = 1 - 3/(4 N_d)                if K = 2 and N_d > 1              (3a)
a_K = a_{K-1} + (1 - a_{K-1})/6    if K > 2 and N_d > 1              (3b)

where S_K is the sum of the cluster distortions when the number of clusters is K, N_d is the number of data set attributes (i.e. the number of dimensions) and a_K is a weight factor. The term a_K S_{K-1} in equation (2) is an estimate of S_K based on S_{K-1}, made with the assumption that the data have a uniform distribution. The value of f(K) is the ratio of the real distortion to the estimated distortion and is close to 1 when the data distribution is uniform.

When there are areas of concentration in the data distribution, S_K will be less than the estimated value, so that f(K) decreases. The smaller that f(K) is, the more concentrated is the data distribution. Thus, values of K that yield small f(K) can be regarded as giving well-defined clusters.

The weight factor a_K, defined in equation (3), is a positive number less than or equal to 1 and is applied to reduce the effect of dimensions. With K = 2, a_K is computed using equation (3a). This equation is derived from equation (7) in reference [29], which shows that the decrease in distortion is inversely proportional to the number of dimensions, N_d.
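Equations (2), (3a), and (3b) are straightforward to mechanize. The sketch below is illustrative: it reuses the kmeans() and total_distortion() sketches given earlier, whereas the paper obtains S_K with the incremental K-means algorithm [29] to avoid the randomness discussed in section 3.4:

    import numpy as np

    def alpha(K, Nd):
        # Weight factor a_K of equations (3a) and (3b); assumes Nd > 1.
        a = 1.0 - 3.0 / (4.0 * Nd)   # equation (3a): the K = 2 case
        for _ in range(K - 2):       # equation (3b): recursion for K > 2
            a += (1.0 - a) / 6.0
        return a

    def f_of_K(X, K_max=19):
        # Evaluation function f(K) of equation (2) for K = 1..K_max.
        Nd = X.shape[1]
        f, S_prev = {1: 1.0}, None
        for K in range(1, K_max + 1):
            centres, labels = kmeans(X, K)
            S = total_distortion(X, centres, labels)   # equation (1b)
            if K > 1:
                f[K] = 1.0 if S_prev == 0 else S / (alpha(K, Nd) * S_prev)
            S_prev = S
        return f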

As K increases above 2, the decrease in the sum of distortions reduces (the ratio S_K/S_{K-1} approaches 1), as can be seen in Fig. 4. This figure shows the values of S_K/S_{K-1} computed for different K when the clustering algorithm is applied to data sets of different dimensions and with uniform distributions. With such data sets, f(K) is expected to be equal to 1 and a_K should be chosen to equate f(K) to 1. From equation (2), a_K should therefore be S_K/S_{K-1} and thus obtainable from Fig. 4. However, for computational simplicity, the recursion equation (3b) has been derived from the data represented in Fig. 4 to calculate a_K. Figure 5 shows that the values of a_K obtained from equation (3b) fit the plots in Fig. 4 closely.
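This behaviour can be checked empirically with the earlier sketches: clustering a uniform two-dimensional square and printing S_K/S_{K-1} next to a_K from equation (3b) should, under these assumptions, give two roughly agreeing columns (cf. Fig. 5):

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.uniform(size=(2000, 2))   # uniform two-dimensional square

    S = {}
    for K in range(1, 10):
        centres, labels = kmeans(X, K)
        S[K] = total_distortion(X, centres, labels)

    for K in range(2, 10):
        print(K, round(S[K] / S[K - 1], 3), round(alpha(K, 2), 3))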

The proposed function f(K) satisfies the constraints mentioned in the previous section. The robustness of f(K) will be verified experimentally in the next section. When the number of objects is doubled or tripled but their distributions are unchanged, the resultant clusters remain in the same position. S_K and S_{K-1} are doubled or tripled correspondingly, so that f(K) stays constant. Therefore, generally, f(K) is independent of the number of objects in the data set.

To reduce the effect of differences in the ranges of the attributes, data are normalized before the clustering starts. However, it should be noted that, when the data have well-separated groups of objects, the shape of such regions in the problem space has an effect on the evaluation function. In these cases, the normalization does not influence the local object distribution, because it is a scaling technique that applies to the whole data set.
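The paper does not state which normalization is used; one common global scaling consistent with the description is per-attribute min-max scaling, sketched here as an assumption:

    import numpy as np

    def normalize(X):
        # Min-max scaling applied to the whole data set, attribute by attribute.
        # A global rescaling; it does not alter the local object distribution.
        lo, hi = X.min(axis=0), X.max(axis=0)
        span = np.where(hi > lo, hi - lo, 1.0)   # guard constant attributes
        return (X - lo) / span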

    5 PERFORMANCE

The evaluation function f(K) is tested in a series of experiments on the artificially generated data sets shown in Fig. 6. All data are normalized before the incremental K-means algorithm is applied with K ranging from 1 to 19. f(K) is calculated on the basis of the total distortion of the clusters.

In Figs 6a-c, all objects belong to a single region with a uniform distribution. The graph in Fig. 6a shows that f(K) reflects well the clustering result on this data set with a uniform distribution because f(K) is approximately constant and equal to 1 for all K. When K = 4 and K = 3 in Figs 6a and b respectively, f(K) reaches minimum values. This could be attributed to the shape of the areas defined by the objects belonging to these data sets. However, the minimum values of f(K) do not differ significantly from the average value for any strong recommendations to be made to the user. By comparing the values of f(K) in Figs 6a and c, it can be seen that a_K reduces the effect of the data set dimensions on the evaluation function.

Fig. 4 The ratio S_K/S_{K-1} for data sets having uniform distributions: (a) two-dimensional square and circle; (b) four-dimensional cube and sphere

Fig. 5 Comparison of the values of a_K calculated using equation (3b) and the ratio S_K/S_{K-1}


    Fig. 6 Data sets and their corresponding f(K)



For the data set in Fig. 6d, again, all objects are concentrated in a single region with a normal distribution. The f(K) plot for this data set suggests correctly that, when K = 1, the clustering result is the most suitable for this data set.

The data sets in Figs 6e and f are created by two generators that have normal distributions. In Fig. 6e, the two generators have an overlapping region but, in Fig. 6f, they are well separated. Note that the value for f(2) in the latter figure is much smaller than in the former.

    Fig. 7 f(K) for the 12 benchmark data sets



The data sets in Figs 6g and h have three recognizable regions. From the corresponding graphs, f(K) suggests correct values of K for clustering these data sets.

Three different generators that create object groupings with a normal distribution are used to form the data set in Fig. 6i. In this case, f(K) suggests the value 2 or 3 for K. Because two of these three generators create object groupings that overlap, f(2) is smaller than f(3). This means that the data have only two clearly defined regions, but K = 3 could also be used to cluster the objects.



Figures 6j and k illustrate how the level of detail could affect the selection of K. f(K) reaches minimum values at K = 2 and 4 respectively. In such cases, users could select the most appropriate value of K based on their specific requirements. A more complex case is shown in Fig. 6l, where there is a possible K value of 4 or 8. The selection of a particular K will depend on the requirements of the specific application for which the clustering is carried out.

The data sets in Figs 6m-o have well-defined regions in the object space, each of which has a different distribution, location, and number of objects. If the minimum value of f(K) is used to cluster the objects, K will be different from the number of generators utilized to create them (as in the case of the clusters in Fig. 6o) or the number of object groupings that could be identified visually (as in the case of the clusters in Figs 6m and n). The reason for the difference varies with different cases. For example, it could be considered that there are five clusters in Fig. 6m because the cluster distances are smaller for the two leftmost pairs of clusters than for others and the clusters in those pairs could be merged together. However, no simple explanation could be given for the cases shown in Figs 6n and o. This highlights the fact that f(K) should only be used to suggest a guide value for the number of clusters and the final decision as to which value to adopt has to be left to the discretion of the user.

From the graphs in Fig. 6, a conclusion could be made that any K with a corresponding f(K) < 0.85 could be recommended for clustering. If there is no value with a corresponding f(K) < 0.85, K = 1 is selected.
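This decision rule is easily mechanized; an illustrative helper (hypothetical name) over the values returned by the f_of_K() sketch given earlier:

    def recommend_K(f, threshold=0.85):
        # Guide values for K: every K > 1 with f(K) below the threshold,
        # falling back to K = 1 when no value qualifies.
        ks = [K for K, v in f.items() if K > 1 and v < threshold]
        return sorted(ks) if ks else [1]

    # Example: f = f_of_K(normalize(X)); print(recommend_K(f))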

The proposed function f(K) is also applied to 12 benchmark data sets from the UCI Repository of Machine Learning Databases [31]. Figure 7 shows how the value of f(K) varies with K. If a threshold of 0.85 is selected for f(K) (from the study on the artificial data sets), the numbers of clusters recommended for each of these data sets are as given in Table 2. K = 1 means that the data distribution is very close to the standard uniform distribution. The values recommended using f(K) are very small because of the high correlation between the attributes of these data sets, very similar to that shown in Fig. 6e. This can be verified by examining two attributes at a time and plotting the data sets in two dimensions.

The above experimental study on 15 artificial and 12 benchmark data sets has demonstrated the robustness of f(K). The evaluation function converges in most cases to 1 when K increases above 9.

    6 CONCLUSION

Existing methods of selecting the number of clusters for K-means clustering have a number of drawbacks. Also, current methods for assessing the clustering results do not provide much information on the performance of the clustering algorithm.

A new method to select the number of clusters for the K-means algorithm has been proposed in the paper. The new method is closely related to the approach of K-means clustering because it takes into account information reflecting the performance of the algorithm. The proposed method can suggest multiple values of K to users for cases when different clustering results could be obtained with various required levels of detail. The method could be computationally expensive if used with large data sets because it requires several applications of the K-means algorithm before it can suggest a guide value for K. The method has been validated on 15 artificial and 12 benchmark data sets. Further research is required to verify the capability of this method when applied to data sets with more complex object distributions.

    ACKNOWLEDGEMENTS

This work was carried out as part of the Cardiff Innovative Manufacturing Research Centre Project supported by the Engineering and Physical Sciences Research Council and the SUPERMAN Project supported by the European Commission and the Welsh Assembly Government under the European Regional Development Fund programme. The authors are members of the IPROMS Network of Excellence funded by the European Commission.

Table 2 The recommended number of clusters based on f(K)

Data set         Proposed number of clusters
Australian       1
Balance-scale    1
Car evaluation   2, 3, 4
Cmc              1
Ionosphere       2
Iris             2, 3
Page blocks      2
Pima             1
Wdbc             2
Wine             3
Yeast            1
Zoo              2


    REFERENCES

1 Han, J. and Kamber, M. Data Mining: Concepts and Techniques, 2000 (Morgan Kaufmann, San Francisco, California).

2 Al-Daoud, M. B., Venkateswarlu, N. B., and Roberts, S. A. Fast K-means clustering algorithms. Report 95.18, School of Computer Studies, University of Leeds, June 1995.

3 Al-Daoud, M. B., Venkateswarlu, N. B., and Roberts, S. A. New methods for the initialisation of clusters. Pattern Recognition Lett., 1996, 17, 451-455.

4 Alsabti, K., Ranka, S., and Singh, V. An efficient K-means clustering algorithm. In Proceedings of the First Workshop on High-Performance Data Mining, Orlando, Florida, 1998; ftp://ftp.cise.ufl.edu/pub/faculty/ranka/Proceedings.

5 Bilmes, J., Vahdat, A., Hsu, W., and Im, E. J. Empirical observations of probabilistic heuristics for the clustering problem. Technical Report TR-97-018, International Computer Science Institute, Berkeley, California.

6 Bottou, L. and Bengio, Y. Convergence properties of the K-means algorithm. Adv. Neural Infn Processing Systems, 1995, 7, 585-592.

7 Bradley, S. and Fayyad, U. M. Refining initial points for K-means clustering. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML 98) (Ed. J. Shavlik), Madison, Wisconsin, 1998, pp. 91-99 (Morgan Kaufmann, San Francisco, California).

8 Du, Q. and Wong, T-W. Numerical studies of MacQueen's K-means algorithm for computing the centroidal Voronoi tessellations. Int. J. Computers Math. Applics, 2002, 44, 511-523.

9 Castro, V. E. and Yang, J. A fast and robust general purpose clustering algorithm. In Proceedings of the Fourth European Workshop on Principles of Knowledge Discovery in Databases and Data Mining (PKDD 00), Lyon, France, 2000, pp. 208-218.

10 Castro, V. E. Why so many clustering algorithms? SIGKDD Explorations, Newsletter of the ACM Special Interest Group on Knowledge Discovery and Data Mining, 2002, 4(1), 65-75.

11 Fritzke, B. The LBG-U method for vector quantization: an improvement over LBG inspired from neural networks. Neural Processing Lett., 1997, 5(1), 35-45.

12 Hamerly, G. and Elkan, C. Alternatives to the K-means algorithm that find better clusterings. In Proceedings of the 11th International Conference on Information and Knowledge Management (CIKM 02), McLean, Virginia, 2002, pp. 600-607.

13 Hansen, L. K. and Larsen, J. Unsupervised learning and generalisation. In Proceedings of the IEEE International Conference on Neural Networks, Washington, DC, June 1996, pp. 25-30 (IEEE, New York).

14 Ishioka, T. Extended K-means with an efficient estimation of the number of clusters. In Proceedings of the Second International Conference on Intelligent Data Engineering and Automated Learning (IDEAL 2000), Hong Kong, PR China, December 2000, pp. 17-22.

15 Kanungo, T., Mount, D. M., Netanyahu, N., Piatko, C., Silverman, R., and Wu, A. An efficient K-means clustering algorithm: analysis and implementation. IEEE Trans. Pattern Analysis Mach. Intell., 2002, 24(7), 881-892.

16 Pelleg, D. and Moore, A. Accelerating exact K-means algorithms with geometric reasoning. In Proceedings of the Conference on Knowledge Discovery in Databases (KDD 99), San Diego, California, 1999, pp. 277-281.

17 Pelleg, D. and Moore, A. X-means: extending K-means with efficient estimation of the number of clusters. In Proceedings of the 17th International Conference on Machine Learning (ICML 2000), Stanford, California, 2000, pp. 727-734.

18 Pena, J. M., Lozano, J. A., and Larranaga, P. An empirical comparison of four initialisation methods for the K-means algorithm. Pattern Recognition Lett., 1999, 20, 1027-1040.

19 SPSS Clementine Data Mining System. User Guide Version 5, 1998 (Integral Solutions Limited, Basingstoke, Hampshire).

20 DataEngine 3.0: Intelligent Data Analysis, an Easy Job, Management Intelligenter Technologien GmbH, Germany, 1998; http://www.mitgmbh.de.

21 Kerr, A., Hall, H. K., and Kozub, S. Doing Statistics with SPSS, 2002 (Sage, London).

22 S-PLUS 6 for Windows Guide to Statistics, Vol. 2, Insightful Corporation, Seattle, Washington, 2001; http://www.insightful.com/DocumentsLive/23/44/statman2.pdf.

23 Hardy, A. On the number of clusters. Comput. Statist. Data Analysis, 1996, 23, 83-96.

24 Theodoridis, S. and Koutroubas, K. Pattern Recognition, 1998 (Academic Press, London).

25 Halkidi, M., Batistakis, Y., and Vazirgiannis, M. Cluster validity methods. Part I. SIGMOD Record, 2002, 31(2); available online http://www.acm.org/sigmod/record/.

26 Kothari, R. and Pitts, D. On finding the number of clusters. Pattern Recognition Lett., 1999, 20, 405-416.

27 Cai, Z. Technical aspects of data mining. PhD thesis, Cardiff University, Cardiff, 2001.

28 Lindeberg, T. Scale-space Theory in Computer Vision, 1994 (Kluwer Academic, Boston, Massachusetts).

29 Pham, D. T., Dimov, S. S., and Nguyen, C. D. Incremental K-means algorithm. Proc. Instn Mech. Engrs, Part C: J. Mechanical Engineering Science, 2003, 218, 783-795.

30 Tibshirani, R., Walther, G., and Hastie, T. Estimating the number of clusters in a dataset via the gap statistic. Technical Report 208, Department of Statistics, Stanford University, California, 2000.

31 Blake, C., Keogh, E., and Merz, C. J. UCI Repository of Machine Learning Databases, Department of Information and Computer Science, University of California, Irvine, California, 1998.


    APPENDIX

    Notation

A, B            clusters
d(x_jt, w_j)    distance between object x_jt and the centre w_j of cluster j
f(K)            evaluation function
G_A, G_B        generators
I_j             distortion of cluster j
K               number of clusters
N               number of objects in the data set
N_d             number of data set attributes (the dimension of the data set)
N_j             number of objects belonging to cluster j
P_GA, P_GB      probabilities that X is created by G_A or G_B respectively
P_CA, P_CB      probabilities that X is clustered into A or B respectively
S_K             sum of all distortions with K being the specified number of clusters
X               object
x_jt            object belonging to cluster j
w_j             centre of cluster j
a_K             weight factor
