
    A Hierarchical Clustering Algorithm based on K-means with Constraints*

    GuoYan Hang, DongMei Zhang, JiaDong Ren

    College of Information Science and Engineering,

    Yanshan University,

    Qinhuangdao, China

    [email protected]

    JiaDong Ren, ChangZhen Hu

    School of Computer Science and Technology,

Beijing Institute of Technology, Beijing, China

    [email protected]

Abstract—Hierarchical clustering is one of the most important tasks in data mining. However, the existing hierarchical clustering algorithms are time-consuming and achieve low clustering quality because they ignore constraints. In this paper, a Hierarchical Clustering Algorithm based on K-means with Constraints (HCAKC) is proposed. In HCAKC, in order to improve the clustering efficiency, the Improved Silhouette is defined to determine the optimal number of clusters. In addition, to improve the hierarchical clustering quality, the existing pairwise must-link and cannot-link constraints are adopted to update the cohesion matrix between clusters. A penalty factor is introduced to modify the similarity metric to address constraint violation. The experimental results show that HCAKC has lower computational complexity and better clustering quality than the existing algorithm CSM.

Keywords—hierarchical clustering; Improved Silhouette; K-means; constraints

I. INTRODUCTION

Clustering is an important analysis tool in many fields, such as pattern recognition, image classification, biological sciences, marketing, city planning, document retrieval, etc. Hierarchical clustering is one of the most widely used clustering methods.

At present, several existing clustering algorithms focus on combining the advantages of hierarchical and partitioning clustering algorithms [1-2]. K-means, which is one of the representative partitioning methods, forms the clusters by minimizing an objective function. K-means has higher efficiency compared with the hierarchical methods. However, the number of clusters K needs to be fixed iteratively. Thus, K-means often has to be run many times and is computationally expensive. How to determine the number of clusters becomes an increasingly important problem. The common trial-and-error method [3] generally depends on the particular clustering algorithm and is inefficient when the dataset is large.

*This work is supported by the National High Technology Research and Development Program ("863" Program) of China (No. 2009AA01Z433) and the Natural Science Foundation of Hebei Province, P.R. China (No. F2008000888).

Besides, the existing algorithms that combine hierarchical clustering and K-means [4] ignore the available constraints [5]. Clustering is traditionally considered an unsupervised method for data analysis. However, in some cases background knowledge is known in addition to the data instances. Typically, the background knowledge takes the form of pairwise constraints (must-link and cannot-link). Clustering quality can be improved by utilizing these constraints. Carlos Ruiz enhanced the density-based algorithm DBSCAN with constraints upon data points to obtain the new algorithm C-DBSCAN [6]. C-DBSCAN has superior performance to DBSCAN even with a small number of constraints. However, the efficiency of C-DBSCAN is not good. How the K-means clustering algorithm can be profitably modified to make use of constraints is demonstrated in cop-kmeans [7]. Although the clustering accuracy of cop-kmeans is improved, the constraint violation problem has not been well addressed. I. Davidson [8] incorporated pairwise constraints into agglomerative hierarchical clustering to improve the clustering quality. However, the problem of constraint violation is still not solved.

In this paper, we propose HCAKC, a new method for hierarchical clustering based on K-means with existing pairwise constraints. In HCAKC, the Improved Silhouette and CUCMC (Constraints-based Update of the Cohesion Matrix between Clusters) are defined. The optimal number of clusters is determined by computing the average Improved Silhouette of the dataset so that the time complexity can be reduced. The initial clusters of HCAKC are obtained by running K-means. In order to improve the quality of the hierarchical clustering, the existing pairwise must-link and cannot-link constraints are incorporated into the agglomerative hierarchical clustering. CUCMC is carried out based on the existing constraints. The penalty factor [9] is introduced into the cohesion [2] similarity metric to address constraint violation.

This paper is organized as follows: In Section II, we give the basic concepts and definitions. In Section III, we present our hierarchical clustering algorithm, HCAKC. Section IV shows the experimental results. Finally, we conclude the paper in Section V.



II. BASIC CONCEPTS AND DEFINITIONS

A silhouette [4] is a function that measures the similarity of an object to the objects of its own cluster compared with the objects of other clusters.

For a cluster C consisting of data points p_1, p_2, ..., p_n, the radius r of C is defined as formula (1), where c is the centroid of C and d(p_i, c) is the Euclidean distance between p_i and c.

r = \sqrt{\frac{1}{n} \sum_{i=1}^{n} d(p_i, c)^2}    (1)

join(p_i, c_j) = \exp\left(\frac{d(p_i, c_i) - d(p_i, c_j)}{r_i}\right)    (2)

where join(p_i, c_j) is the intention of p_i to be joined into C_j, c_i is the centroid of the cluster containing p_i, and r_i is that cluster's radius. The cohesion [2] of C_i and C_j is calculated as formula (3).

Chs(C_i, C_j) = \frac{\sum_{p \in C_i} join(p, C_j) + \sum_{p \in C_j} join(p, C_i)}{|C_i| + |C_j|}    (3)
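As an illustration only, the following Python sketch computes the quantities in formulas (1)-(3) as reconstructed above. The function names and the sign convention inside join are our own reading of the garbled original, not the authors' code.

import math

def centroid(points):
    # Component-wise mean of the points in a cluster.
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))

def dist(p, q):
    # Euclidean distance d(p, q).
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def radius(points):
    # Formula (1): root-mean-square distance of the cluster's points to its centroid.
    c = centroid(points)
    return math.sqrt(sum(dist(p, c) ** 2 for p in points) / len(points))

def join(p, own_cluster, other_cluster):
    # Formula (2), as reconstructed: the intention of p (a member of own_cluster)
    # to be joined into other_cluster.
    c_own, c_other = centroid(own_cluster), centroid(other_cluster)
    return math.exp((dist(p, c_own) - dist(p, c_other)) / radius(own_cluster))

def cohesion(ci, cj):
    # Formula (3): average join intention over the members of both clusters.
    total = sum(join(p, ci, cj) for p in ci) + sum(join(p, cj, ci) for p in cj)
    return total / (len(ci) + len(cj))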

Definition 1 (IS, Improved Silhouette). Let S be a dataset consisting of clusters C_1, C_2, ..., C_t. The distance between each object o_i (o_i ∈ C_j, j ∈ [1, t]) and the centroid of its own cluster is denoted a_i, and b_i is the minimum distance between o_i and the centroids of the other t-1 clusters. IS(o_i) is defined as formula (4).

IS(o_i) = \frac{b_i - a_i}{\max(a_i, b_i)}    (4)

In formula (4), the meanings of a_i and b_i differ from the traditional silhouette computation: both a_i and b_i denote distances to cluster centroids rather than average distances to the members of a cluster.

The average IS of the dataset is calculated for each candidate partition. The maximal average IS of the dataset corresponds to the optimal partition of the dataset.

We take point A in Fig. 1 as an example to show the IS computation of a data point.

(1) Obtain the centroids of clusters C_1, C_2, C_3 respectively: Centroid_1 = (1.4, 1.2), Centroid_2 = (4.4, 4.8), Centroid_3 = (6.5, 0.8333).

(2) Calculate a_A, the distance between A = (1, 0) and the centroid of its own cluster: a_A = \sqrt{(1-1.4)^2 + (0-1.2)^2} = 1.2649. The distances between A and the centroids of C_2 and C_3 are obtained similarly; they are 5.8822 and 5.5628 respectively. Since b_A denotes the minimum of these distances according to the definition of IS, b_A = 5.5628.

(3) The IS of A is obtained from formula (4): IS(A) = (b_A - a_A) / max(a_A, b_A) = 0.7726.
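A minimal sketch of the IS computation of formula (4), reproducing the example above; the helper name is ours, and Python 3.8+ is assumed for math.dist.

import math

def improved_silhouette(o, own_centroid, other_centroids):
    # Formula (4): a is the distance to the object's own centroid, b the minimum
    # distance to the centroids of the other clusters.
    a = math.dist(o, own_centroid)
    b = min(math.dist(o, c) for c in other_centroids)
    return (b - a) / max(a, b)

# Point A of Fig. 1 and the three centroids from step (1).
A = (1.0, 0.0)
c1, c2, c3 = (1.4, 1.2), (4.4, 4.8), (6.5, 0.8333)
print(improved_silhouette(A, c1, [c2, c3]))   # about 0.7726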

Definition 2 (CUCMC, Constraints-based Update of the Cohesion Matrix between Clusters). Suppose that \{C_k\}_{k=1}^{n} is the set of given clusters and X = [Chs(C_s, C_t)]_{n \times n} is the existing cohesion matrix between any two clusters. Let M = \{(C_i, C_j)\} be the set of must-link constraints, indicating that clusters C_i and C_j should be in the same class, and let C = \{(C_i, C_j)\} be the set of cannot-link constraints, indicating that C_i and C_j should be in different classes. For clusters C_p, C_q, C_r (p, q, r ∈ [1, n]), in order to satisfy M, Chs(C_p, C_q) in X is updated to 1, and Chs(C_p, C_r) and Chs(C_q, C_r) are updated to max(Chs(C_p, C_r), Chs(C_q, C_r)). Chs(C_p, C_q) in X is updated to 0 in order to satisfy C.

In Fig. 2, we give an example to show the process of CUCMC, where the must-link constraint M(C_1, C_2) is known. From Fig. 2, without considering the constraint, we have Chs(C_1, C_2) = 0.4, Chs(C_1, C_3) = 0.2, and Chs(C_2, C_3) = 0.1. Since C_1 and C_2 need to satisfy M(C_1, C_2), Chs(C_1, C_2) in the cohesion matrix X is updated to 1, and Chs(C_1, C_3) and Chs(C_2, C_3) are both updated to max(0.2, 0.1) = 0.2.

The CUCMC of the example is as follows (rows and columns ordered C_1, C_2, C_3):

X = \begin{bmatrix} 1 & & \\ 0.4 & 1 & \\ 0.2 & 0.1 & 1 \end{bmatrix}
\quad \Longrightarrow \quad
X = \begin{bmatrix} 1 & & \\ 1 & 1 & \\ 0.2 & 0.2 & 1 \end{bmatrix}
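For illustration, the update can be sketched in Python as below (our own naming, not the authors' code); it reproduces the Fig. 2 example, where the must-link constraint on C_1 and C_2 raises Chs(C_1, C_2) to 1 and lifts both cohesions to C_3 to max(0.2, 0.1) = 0.2.

def apply_must_link(X, p, q):
    # CUCMC for a must-link pair (Cp, Cq): their mutual cohesion becomes 1 and,
    # for every other cluster Cr, Chs(Cp, Cr) and Chs(Cq, Cr) both become the
    # maximum of the two previous values. X is a symmetric matrix (list of lists).
    X[p][q] = X[q][p] = 1.0
    for r in range(len(X)):
        if r not in (p, q):
            m = max(X[p][r], X[q][r])
            X[p][r] = X[r][p] = m
            X[q][r] = X[r][q] = m

def apply_cannot_link(X, p, q):
    # CUCMC for a cannot-link pair: the mutual cohesion is forced to 0.
    X[p][q] = X[q][p] = 0.0

# Fig. 2 example with clusters C1, C2, C3 (indices 0, 1, 2).
X = [[1.0, 0.4, 0.2],
     [0.4, 1.0, 0.1],
     [0.2, 0.1, 1.0]]
apply_must_link(X, 0, 1)
# X is now [[1.0, 1.0, 0.2], [1.0, 1.0, 0.2], [0.2, 0.2, 1.0]]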

In general, the available constraints are given in the form m = \{(x_i, x_j)\} and c = \{(x_i, x_j)\}, where m indicates that points x_i and x_j should be in the same cluster, and c indicates that points x_i and x_j should be in different clusters. The cluster-level sets M = \{(C_i, C_j)\} and C = \{(C_i, C_j)\} can be obtained by propagating these point-level constraints.

Penalty factors w_{M(C_i, C_j)} and w_{C(C_i, C_j)} are introduced in order to address constraint violation.

Sim(C_i, C_j) = Chs(C_i, C_j) \cdot w_{M(C_i, C_j)} \cdot w_{C(C_i, C_j)}    (5)

where w_{M(C_i, C_j)} takes effect when the must-link constraint on (C_i, C_j) is violated, and w_{C(C_i, C_j)} takes effect when the cannot-link constraint on (C_i, C_j) is violated.
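Because the operators in formula (5) are lost in this transcript, the multiplicative reading above is an assumption; under that assumption, a minimal sketch of the penalized similarity is:

def similarity(chs, must_link_violated, cannot_link_violated, w_m=0.5, w_c=0.5):
    # Formula (5) under our multiplicative reading: the cohesion Chs(Ci, Cj) is
    # scaled by the penalty factor w_M and/or w_C whenever the corresponding
    # constraint on (Ci, Cj) is violated. The default penalty values are
    # illustrative only; the paper does not state how they are chosen.
    sim = chs
    if must_link_violated:
        sim *= w_m
    if cannot_link_violated:
        sim *= w_c
    return sim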

Figure 1. The IS computation of a data point (clusters C_1, C_2, C_3 in the x-y plane, with the example point A = (1, 0)).

Figure 2. The update of the cohesion matrix of clusters with constraints (cohesions 0.4, 0.2, 0.1 between C_1, C_2, C_3 before the update; 1, 0.2, 0.2 after).



III. HIERARCHICAL CLUSTERING ALGORITHM BASED ON K-MEANS WITH CONSTRAINTS

CSM [2] needs K to be specified, and different values of K lead to different clustering results; thus, how to determine an appropriate K becomes especially important. Besides, the existing constraints are not considered in CSM, so the accuracy of its clustering results is limited.

In HCAKC, we plot the curve of the average IS of the dataset to be clustered against the number of partitions. The optimal number of clusters is determined by the maximum of this curve, since the average IS of a dataset reflects not only the density of the clusters but also the dissimilarity between clusters. The cohesion matrix X is constructed according to the cohesion between any two clusters. The existing pairwise constraints are incorporated into the hierarchical clustering, and CUCMC is carried out based on these constraints. Thus, the clustering results are greatly improved.

In our algorithms, S is the dataset to be clustered; K is the optimal number of clusters; n is the size of S; m is the number of sub-clusters; M = \{(C_i, C_j)\} (i, j ∈ [1, t]) is the set of existing must-link constraints; and C = \{(C_i, C_j)\} (i, j ∈ [1, t]) is the set of existing cannot-link constraints.

Algorithm Find-K
Input: S
Output: K

begin
1: partition S into t clusters C_1, C_2, ..., C_t according to the geometric distribution of S;
2: repeat {
3:   compute the average IS of S under the current partition;
4:   adjust the number of clusters t and re-partition S;
5: } until all candidate cluster numbers have been examined;
6: K := the number of clusters whose partition yields the maximal average IS;
end
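A possible Python rendering of Find-K, based on the description above (evaluate each candidate number of clusters and keep the one whose partition maximises the average IS); the k_means helper, the candidate range, and all names here are assumptions of ours.

import math

def average_is(points, labels, centroids):
    # Average Improved Silhouette of the dataset under one partition.
    total = 0.0
    for p, l in zip(points, labels):
        a = math.dist(p, centroids[l])
        b = min(math.dist(p, c) for i, c in enumerate(centroids) if i != l)
        total += (b - a) / max(a, b)
    return total / len(points)

def find_k(points, k_means, candidates=range(2, 11)):
    # Return the candidate cluster number whose partition maximises the average
    # IS, i.e. the maximum of the curve described in Section III.
    best_k, best_score = None, float("-inf")
    for k in candidates:
        centroids, labels = k_means(points, k)   # assumed to return (centroids, labels)
        score = average_is(points, labels, centroids)
        if score > best_score:
            best_k, best_score = k, score
    return best_k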

Algorithm HCAKC
Input: S, M, C
Output: K clusters of S
begin
1: K := Find-K(S);
2: partition S into t sub-clusters, where t > K;
3: repeat
4: { for (each point x in S)
5:     assign x to the closest sub-cluster based on the distance to its centroid;
6:   update the centroid of each sub-cluster;
7: } until (no points change between the t sub-clusters); // run K-means on S with K equal to t
8: compute the cohesion matrix X between the t sub-clusters;
9: if ((C_i, C_j) ∈ M or (C_i, C_j) ∈ C)
10:   implement CUCMC;
11: if (C_i, C_j) violates M (or C)
12:   enforce w_M (w_C) on the cohesion matrix;
13: do { extract the maximal Chs(C_i, C_j);
14:   if (C_i and C_j do not belong to the same sub-cluster)
15:     merge the two sub-clusters to which they belong into a new sub-cluster;
16:   t := t - 1;
    } while (t > K);
end

In HCAKC, Find-K is first run in order to determine the optimal number of clusters K for the dataset to be clustered. Then, K-means is adopted to form t clusters initially, where t is larger than K. The cohesion matrix X between the t clusters is obtained based on formula (3). Afterwards, the existing constraint sets M = \{(C_i, C_j)\} and C = \{(C_i, C_j)\} are used to carry out the CUCMC. The penalty factor is introduced to address constraint violation: when a must-link (C_i, C_j) is violated, w_{M(C_i, C_j)} is enforced on the similarity metric according to formula (5), and the entry in row i, column j of X is set to Sim(C_i, C_j).
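To make the agglomerative phase concrete, the sketch below (our own structuring, not the authors' code) merges the pair of sub-clusters with the maximal cohesion until only K remain, assuming the cohesion matrix X has already been through CUCMC and the formula (5) penalty; recomputing the merged cluster's cohesions by taking the maximum of the two old rows is a simplification on our part.

def agglomerate(clusters, X, K):
    # clusters: list of sub-clusters (each a list of points) produced by K-means;
    # X: symmetric cohesion matrix over those sub-clusters; K: target from Find-K.
    clusters = [list(c) for c in clusters]
    X = [row[:] for row in X]
    while len(clusters) > K:
        # Step 13: extract the pair (i, j), i < j, with the maximal cohesion.
        i, j = max(((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
                   key=lambda ab: X[ab[0]][ab[1]])
        # Steps 14-15: merge Cj into Ci.
        clusters[i].extend(clusters[j])
        del clusters[j]
        # Keep the matrix consistent: cohesion of the merged cluster to each other
        # cluster is approximated by the larger of the two previous values.
        for r in range(len(X)):
            if r not in (i, j):
                X[i][r] = X[r][i] = max(X[i][r], X[j][r])
        del X[j]
        for row in X:
            del row[j]
        # Step 16: t := t - 1 happens implicitly via len(clusters).
    return clusters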

IV. EXPERIMENTAL RESULTS

All of our experiments were conducted on a computer with a 2.4 GHz Intel CPU and 512 MB of main memory, running Microsoft Windows XP. HCAKC is compared with CSM to evaluate its clustering quality and time performance. Both algorithms are implemented in Microsoft Visual C++ 6.0.

We performed our experiments on the UCI datasets Ionosphere, Iris, breast-cancer, credit-g, and page-blocks. The must-link and cannot-link constraints are generated artificially using the same method as [7]. The details of the datasets are shown in Table 1; for instance, D1 is the Ionosphere dataset, consisting of 355 instances from two clusters. Accuracy [7], one of the clustering quality measures, is computed to compare the clustering results of HCAKC and CSM. We averaged the measures over 100 trials on each dataset. Fig. 3 and Fig. 4 show the experimental results comparing HCAKC with CSM.

HCAKC and CSM are both run on D1 (the Ionosphere dataset) with constraints. Fig. 3 shows the accuracy results of HCAKC and CSM on the Ionosphere dataset. From Fig. 3, we can conclude that CSM is lower in accuracy than HCAKC across varying numbers of constraints.

In CSM, the constraints are not considered. In HCAKC, the constraints are incorporated into the hierarchical clustering to update the cohesion matrix, and constraint violation is addressed as well. Thus, HCAKC is better in terms of clustering accuracy.

Experiments were also conducted on the Iris, breast-cancer, credit-g, and page-blocks datasets to compare the time efficiency of HCAKC with that of CSM. From Fig. 4, we can conclude that HCAKC outperforms CSM in CPU running time on the different datasets.

The cluster number K needs to be specified as a parameter before the CSM algorithm runs. The time cost of this parameter setting is expensive, since K-means needs to be run iteratively. HCAKC finds the optimal K by computing the average IS of the points in the dataset, and the time cost of this process is insignificant. The time advantage of HCAKC is evident even when the dataset is large.

TABLE 1. PARAMETERS OF THE TESTING DATASETS

Dataset   Name            Size   Clusters
D1        Ionosphere       355          2
D2        Iris             150          3
D3        breast-cancer    277          2
D4        credit-g        1000          2
D5        page-blocks     5473          5

Figure 3. HCAKC and CSM comparison in terms of accuracy (accuracy (%) versus constraint ratio/size (%) on the Ionosphere dataset).

Figure 4. HCAKC and CSM comparison in terms of running time on datasets D2-D5.

V. CONCLUSION

In order to improve the time efficiency and clustering quality of CSM, a new method named HCAKC is proposed in this paper. In the proposed algorithm, the curve of the average IS of the dataset against the number of partitions is plotted, and the optimal number of clusters is determined by locating the maximum of this curve. As a result, the complexity of determining the number of clusters is reduced. Thereafter, the existing constraints are incorporated to complete the CUCMC during the hierarchical clustering process, and the penalty factor is introduced to address constraint violation. Hence, the clustering quality is improved. The experimental results demonstrate that HCAKC is effective in reducing the time complexity and increasing the clustering quality.

REFERENCES

[1] L. Sun, T. C. Lin, H. C. Huang, B. Y. Liao, and J. S. Pan, "An optimized approach on applying genetic algorithm to adaptive cluster validity index," 3rd International Conference on Intelligent Information Hiding and Multimedia Signal Processing, Kaohsiung, Taiwan, Nov. 2007, vol. 2, pp. 582-585.

[2] C. R. Lin and M. S. Chen, "Combining partitional and hierarchical algorithms for robust and efficient data clustering with cohesion self-merging," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 2, pp. 145-159, 2005.

[3] H. J. Sun, S. R. Wang, and Q. S. Jiang, "FCM-based model selection algorithms for determining the number of clusters," Pattern Recognition, vol. 37, no. 10, pp. 2027-2037, 2004.

[4] S. Lamrous and M. Taileb, "Divisive hierarchical K-means," CIMCA 2006: International Conference on Computational Intelligence for Modelling, Control and Automation, jointly with IAWTIC 2006: International Conference on Intelligent Agents, Web Technologies and Internet Commerce, Sydney, NSW, Australia, 2006, pp. 18-23.

[5] S. C. Chu, J. F. Roddick, C. J. Su, and J. S. Pan, "Constrained ant colony optimization for data clustering," 8th Pacific Rim International Conference on Artificial Intelligence, PRICAI 2004: Trends in Artificial Intelligence, Auckland, New Zealand, 2004, vol. 3157, pp. 534-543.

[6] C. Ruiz, M. Spiliopoulou, and E. Menasalvas, "C-DBSCAN: Density-based clustering with constraints," 11th International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing, Toronto, Canada, 2007, pp. 216-223.

[7] K. Wagstaff, C. Cardie, S. Rogers, and S. Schroedl, "Constrained K-means clustering with background knowledge," Proceedings of the 18th International Conference on Machine Learning, 2001, pp. 577-584.

[8] I. Davidson and S. S. Ravi, "Agglomerative hierarchical clustering with constraints: Theoretical and empirical results," 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, Porto, Portugal, 2005, pp. 59-70.

[9] M. Bilenko, S. Basu, and R. J. Mooney, "Integrating constraints and metric learning in semi-supervised clustering," Proceedings of the 21st International Conference on Machine Learning, New York: ACM Press, 2004, pp. 81-88.

