
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. XX, NO. YY, 2011 1

Clustering with Multi-Viewpoint based Similarity Measure

Duc Thang Nguyen, Lihui Chen, Senior Member, IEEE, and Chee Keong Chan

Abstract—All clustering methods have to assume some cluster relationship among the data objects that they are applied on. Similarity between a pair of objects can be defined either explicitly or implicitly. In this paper, we introduce a novel multi-viewpoint based similarity measure and two related clustering methods. The major difference between a traditional dissimilarity/similarity measure and ours is that the former uses only a single viewpoint, which is the origin, while the latter utilizes many different viewpoints, which are objects assumed not to be in the same cluster with the two objects being measured. Using multiple viewpoints, a more informative assessment of similarity can be achieved. Theoretical analysis and empirical study are conducted to support this claim. Two criterion functions for document clustering are proposed based on this new measure. We compare them with several well-known clustering algorithms that use other popular similarity measures on various document collections to verify the advantages of our proposal.

Index Terms—Document clustering, text mining, similarity measure.

    1 INTRODUCTION

Clustering is one of the most interesting and important topics in data mining. The aim of clustering is to find intrinsic structures in data, and organize them into meaningful subgroups for further study and analysis. Many clustering algorithms are published every year. They have been proposed for very distinct research fields, and developed using totally different techniques and approaches. Nevertheless, according to a recent study [1], more than half a century after it was introduced, the simple algorithm k-means still remains one of the top 10 data mining algorithms today.

It is the most frequently used partitional clustering algorithm in practice. Another recent scientific discussion [2] states that k-means is the favourite algorithm that practitioners in the related fields choose to use. Needless to say, k-means has more than a few basic drawbacks, such as sensitivity to initialization and to cluster size, and its performance can be worse than other state-of-the-art algorithms in many domains. In spite of that, its simplicity, understandability and scalability are the reasons for its tremendous popularity. An algorithm with adequate performance and usability in most application scenarios can be preferable to one with better performance in some cases but limited usage due to high complexity. While offering reasonable results, k-means is fast and easy to combine with other methods in larger systems.

A common approach to the clustering problem is to treat it as an optimization process. An optimal partition is found by optimizing a particular function of similarity (or distance) among data. Basically, there is an implicit assumption that the true intrinsic structure of the data can be correctly described by the similarity formula defined and embedded in the clustering criterion function. Hence, the effectiveness of clustering algorithms under this approach depends on the appropriateness of the similarity measure to the data at hand. For instance, the original k-means has a sum-of-squared-error objective function that uses Euclidean distance. In a very sparse and high-dimensional domain like text documents, spherical k-means, which uses cosine similarity instead of Euclidean distance as the measure, is deemed to be more suitable [3], [4].

In [5], Banerjee et al. showed that Euclidean distance is in fact one particular member of a class of distance measures called Bregman divergences. They proposed the Bregman hard-clustering algorithm, in which any of the Bregman divergences can be applied. Kullback-Leibler divergence, a special case of Bregman divergence, was reported to give good clustering results on document datasets; it is also a good example of a non-symmetric measure. Also on the topic of capturing dissimilarity in data, Pekalska et al. [6] found that the discriminative power of some distance measures could increase when their non-Euclidean and non-metric attributes were increased. They concluded that non-Euclidean and non-metric measures can be informative for statistical learning of data. In [7], Pelillo even argued that the symmetry and non-negativity assumptions of similarity measures are actually a limitation of current state-of-the-art clustering approaches. Meanwhile, clustering still requires more robust dissimilarity or similarity measures; recent works such as [8] illustrate this need.

The authors are with the Division of Information Engineering, School of Electrical and Electronic Engineering, Nanyang Technological University, Block S1, Nanyang Ave., Republic of Singapore, 639798. Email: [email protected], {elhchen, eckchan}@ntu.edu.sg.

The work in this paper is motivated by investigations from the above and similar research findings. It appears to us that the nature of the similarity measure plays a very important role in the success or failure of a clustering method. Our first objective is to derive a novel method for measuring similarity between data objects in a sparse and high-dimensional domain, particularly text documents. From the proposed similarity measure, we then formulate new clustering criterion functions and introduce their respective clustering algorithms, which are fast and scalable like k-means, but are also capable of providing high-quality and consistent performance.

Digital Object Identifier 10.1109/TKDE.2011.86 1041-4347/11/$26.00 2011 IEEE

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

TABLE 1
Notations

Notation                      Description
n                             number of documents
m                             number of terms
c                             number of classes
k                             number of clusters
d                             document vector, ||d|| = 1
S = {d_1, ..., d_n}           set of all the documents
S_r                           set of documents in cluster r
D = sum_{d_i in S} d_i        composite vector of all the documents
D_r = sum_{d_i in S_r} d_i    composite vector of cluster r
C = D/n                       centroid vector of all the documents
C_r = D_r/n_r                 centroid vector of cluster r, n_r = |S_r|

The remainder of this paper is organized as follows. In Section 2, we review related literature on similarity and clustering of documents. We then present our proposal for a document similarity measure in Section 3. It is followed by two criterion functions for document clustering and their optimization algorithms in Section 4. Extensive experiments on real-world benchmark datasets are presented and discussed in Sections 5 and 6. Finally, conclusions and potential future work are given in Section 7.

2 RELATED WORK

First of all, Table 1 summarizes the basic notations that will be used extensively throughout this paper to represent documents and related concepts. Each document in a corpus corresponds to an m-dimensional vector d, where m is the total number of terms that the document corpus has. Document vectors are often subjected to some weighting scheme, such as the standard Term Frequency-Inverse Document Frequency (TF-IDF), and normalized to have unit length.
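As a concrete illustration of the preprocessing just described, the following is a minimal sketch (ours, not from the paper) of TF-IDF weighting followed by unit-length normalization; the function name and the dict-based sparse representation are our assumptions:

```python
import math

def tfidf_unit_vectors(corpus):
    """Weight term counts by TF-IDF and normalize each document to unit length.
    `corpus` is a list of token lists; returns a list of {term: weight} dicts."""
    n = len(corpus)
    # document frequency of each term
    df = {}
    for doc in corpus:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    vectors = []
    for doc in corpus:
        tf = {}
        for term in doc:
            tf[term] = tf.get(term, 0) + 1
        # tf-idf weight: raw term frequency times log inverse document frequency
        vec = {t: tf[t] * math.log(n / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm > 0:
            vec = {t: w / norm for t, w in vec.items()}
        vectors.append(vec)
    return vectors
```

After this step every document vector d satisfies ||d|| = 1, which is assumed by the notation in Table 1.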

The principal definition of clustering is to arrange data objects into separate clusters such that the intra-cluster similarity as well as the inter-cluster dissimilarity is maximized. The problem formulation itself implies that some form of measurement is needed to determine such similarity or dissimilarity. There are many state-of-the-art clustering approaches that do not employ any specific form of measurement, for instance, probabilistic model-based methods [9], non-negative matrix factorization [10], information-theoretic co-clustering [11] and so on. In this paper, though, we primarily focus on methods that do utilize a specific measure. In the literature, Euclidean distance is one of the most popular measures:

Dist(d_i, d_j) = ||d_i - d_j||    (1)

It is used in the traditional k-means algorithm. The objective of k-means is to minimize the Euclidean distance between objects of a cluster and that cluster's centroid:

min sum_{r=1}^{k} sum_{d_i in S_r} ||d_i - C_r||^2    (2)

However, for data in a sparse and high-dimensional space, such as that in document clustering, cosine similarity is more widely used. It is also a popular similarity score in text mining and information retrieval [12]. Particularly, the similarity of two document vectors d_i and d_j, Sim(d_i, d_j), is defined as the cosine of the angle between them. For unit vectors, this equals their inner product:

Sim(d_i, d_j) = cos(d_i, d_j) = d_i^t d_j    (3)

The cosine measure is used in a variant of k-means called spherical k-means [3]. While k-means aims to minimize Euclidean distance, spherical k-means intends to maximize the cosine similarity between documents in a cluster and that cluster's centroid:

max sum_{r=1}^{k} sum_{d_i in S_r} d_i^t C_r / ||C_r||    (4)
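To make the contrast between Eq. (2) and Eq. (4) concrete, here is a small Python sketch (ours; helper names are assumptions) that evaluates both objectives for a given hard partition represented as a list of clusters of dense vectors:

```python
import math

def kmeans_objective(clusters):
    """Eq. (2): sum of squared Euclidean distances to each cluster's centroid."""
    total = 0.0
    for docs in clusters:
        dim = len(docs[0])
        centroid = [sum(d[t] for d in docs) / len(docs) for t in range(dim)]
        total += sum(sum((d[t] - centroid[t]) ** 2 for t in range(dim)) for d in docs)
    return total

def spherical_kmeans_objective(clusters):
    """Eq. (4): sum over clusters of d_i^t C_r / ||C_r||, for unit vectors d_i."""
    total = 0.0
    for docs in clusters:
        dim = len(docs[0])
        composite = [sum(d[t] for d in docs) for t in range(dim)]   # D_r
        norm = math.sqrt(sum(x * x for x in composite))             # ||D_r||
        # sum_i d_i^t (D_r / ||D_r||); algebraically this equals ||D_r||
        total += sum(sum(d[t] * composite[t] for t in range(dim)) for d in docs) / norm
    return total
```

Note that the spherical objective for a cluster collapses to ||D_r||, which is exactly the first term that will reappear in the criterion function IV of Section 4.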

The major difference between Euclidean distance and cosine similarity, and therefore between k-means and spherical k-means, is that the former focuses on vector magnitudes, while the latter emphasizes vector directions. Besides its direct application in spherical k-means, the cosine of document vectors is also widely used in many other document clustering methods as a core similarity measurement. The min-max cut graph-based spectral method is an example [13]. In the graph partitioning approach, the document corpus is considered as a graph G = (V, E), where each document is a vertex in V and each edge in E has a weight equal to the similarity between a pair of vertices. The min-max cut algorithm tries to minimize the criterion function:

min sum_{r=1}^{k} Sim(S_r, S\S_r) / Sim(S_r, S_r)    (5)

where Sim(S_q, S_r) = sum_{d_i in S_q, d_j in S_r} Sim(d_i, d_j), 1 <= q, r <= k, and when the cosine as in Eq. (3) is used, minimizing the criterion in Eq. (5) is equivalent to:

min sum_{r=1}^{k} D_r^t D / ||D_r||^2    (6)

There are many other graph partitioning methods with different cutting strategies and criterion functions, such as Average Weight [14] and Normalized Cut [15], all of which have been successfully applied to document clustering using cosine as the pairwise similarity score [16], [17]. In [18], an empirical study was conducted to compare a variety of criterion functions for document clustering.
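The equivalence between Eq. (5) and Eq. (6) can be checked numerically: with Sim(d_i, d_j) = d_i^t d_j on unit vectors, Sim(S_r, S\S_r) = D_r^t D - ||D_r||^2 and Sim(S_r, S_r) = ||D_r||^2, so the two criteria differ only by the constant k. A quick check (ours; helper names assumed):

```python
import math, random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def minmax_cut_pairwise(docs, labels, k):
    """Eq. (5): sum_r Sim(S_r, S\\S_r) / Sim(S_r, S_r) with pairwise dot products."""
    total = 0.0
    for r in range(k):
        inside = [d for d, l in zip(docs, labels) if l == r]
        outside = [d for d, l in zip(docs, labels) if l != r]
        intra = sum(dot(a, b) for a in inside for b in inside)
        inter = sum(dot(a, b) for a in inside for b in outside)
        total += inter / intra
    return total

def minmax_cut_composite(docs, labels, k):
    """Eq. (6): sum_r D_r^t D / ||D_r||^2; equals Eq. (5) plus the constant k."""
    dim = len(docs[0])
    D = [sum(d[t] for d in docs) for t in range(dim)]
    total = 0.0
    for r in range(k):
        Dr = [sum(d[t] for d, l in zip(docs, labels) if l == r) for t in range(dim)]
        total += dot(Dr, D) / dot(Dr, Dr)
    return total
```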


Another popular graph-based clustering technique is implemented in a software package called CLUTO [19]. This method first models the documents with a nearest-neighbor graph, and then splits the graph into clusters using a min-cut algorithm. Besides the cosine measure, the extended Jaccard coefficient can also be used in this method to represent similarity between nearest documents. Given non-unit document vectors u_i, u_j (d_i = u_i/||u_i||, d_j = u_j/||u_j||), their extended Jaccard coefficient is:

Sim_eJacc(u_i, u_j) = u_i^t u_j / ( ||u_i||^2 + ||u_j||^2 - u_i^t u_j )    (7)

Compared with Euclidean distance and cosine similarity, the extended Jaccard coefficient takes into account both the magnitude and the direction of the document vectors. If the documents are instead represented by their corresponding unit vectors, this measure has the same effect as cosine similarity. In [20], Strehl et al. compared four measures: Euclidean, cosine, Pearson correlation and extended Jaccard, and concluded that cosine and extended Jaccard are the best ones on web documents.
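Eq. (7) translates directly into code; a minimal sketch (ours) is:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def extended_jaccard(u, v):
    """Eq. (7): u^t v / (||u||^2 + ||v||^2 - u^t v) for non-unit vectors u, v."""
    uv = dot(u, v)
    return uv / (dot(u, u) + dot(v, v) - uv)
```

For unit vectors the formula reduces to cos/(2 - cos), a monotone function of the cosine, which is why the two measures then have the same effect on rankings.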

In nearest-neighbor graph clustering methods, such as CLUTO's graph method above, the concept of similarity is somewhat different from the previously discussed methods. Two documents may have a certain value of cosine similarity, but if neither of them is in the other one's neighborhood, they have no connection between them. In such a case, some context-based knowledge or relativeness property is already taken into account when considering similarity. Recently, Ahmad and Dey [21] proposed a method to compute the distance between two categorical values of an attribute based on their relationship with all other attributes. Subsequently, Ienco et al. [22] introduced a similar context-based distance learning method for categorical data. However, for a given attribute, they only selected a relevant subset of attributes from the whole attribute set to use as the context for calculating the distance between its two values.

More related to text data, there are phrase-based and concept-based document similarities. Lakkaraju et al. [23] employed a conceptual tree-similarity measure to identify similar documents. This method requires representing documents as concept trees with the help of a classifier. For clustering, Chim and Deng [24] proposed a phrase-based document similarity by combining the suffix tree model and the vector space model. They then used a Hierarchical Agglomerative Clustering algorithm to perform the clustering task. However, a drawback of this approach is its high computational complexity, due to the need to build the suffix tree and to calculate pairwise similarities explicitly before clustering. There are also measures designed specifically for capturing structural similarity among XML documents [25]. They are essentially different from the document-content measures that are discussed in this paper.

In general, cosine similarity still remains the most popular measure because of its simple interpretation and easy computation, though its effectiveness is still fairly limited. In the following sections, we propose a novel way to evaluate similarity between documents, and consequently formulate new criterion functions for document clustering.

3 MULTI-VIEWPOINT BASED SIMILARITY

3.1 Our novel similarity measure

The cosine similarity in Eq. (3) can be expressed in the following form without changing its meaning:

Sim(d_i, d_j) = cos(d_i - 0, d_j - 0) = (d_i - 0)^t (d_j - 0)    (8)

where 0 is the vector that represents the origin point. According to this formula, the measure takes 0 as the one and only reference point. The similarity between two documents d_i and d_j is determined w.r.t. the angle between the two points when looking from the origin.

To construct a new concept of similarity, it is possible to use more than just one point of reference. We may have a more accurate assessment of how close or distant a pair of points is, if we look at them from many different viewpoints. From a third point d_h, the directions and distances to d_i and d_j are indicated respectively by the difference vectors (d_i - d_h) and (d_j - d_h). By standing at various reference points d_h to view d_i, d_j and working on their difference vectors, we define the similarity between the two documents as:

Sim(d_i, d_j | d_i, d_j in S_r) = (1/(n - n_r)) sum_{d_h in S\S_r} Sim(d_i - d_h, d_j - d_h)    (9)

As described by the above equation, the similarity of two documents d_i and d_j - given that they are in the same cluster - is defined as the average of the similarities measured relatively from the views of all other documents outside that cluster. What is interesting is that the similarity here is defined in a close relation to the clustering problem. A presumption of cluster memberships has been made prior to the measure. The two objects to be measured must be in the same cluster, while the points from which to establish this measurement must be outside of the cluster. We call this proposal the Multi-Viewpoint based Similarity, or MVS. From this point onwards, we will denote the proposed similarity measure between two document vectors d_i and d_j by MVS(d_i, d_j | d_i, d_j in S_r), or occasionally MVS(d_i, d_j) for short.

The final form of MVS in Eq. (9) depends on the particular formulation of the individual similarities within the sum. If the relative similarity is defined by the dot product of the difference vectors, we have:

MVS(d_i, d_j | d_i, d_j in S_r)
  = (1/(n - n_r)) sum_{d_h in S\S_r} (d_i - d_h)^t (d_j - d_h)
  = (1/(n - n_r)) sum_{d_h} cos(d_i - d_h, d_j - d_h) ||d_i - d_h|| ||d_j - d_h||    (10)


The similarity between two points d_i and d_j inside cluster S_r, viewed from a point d_h outside this cluster, is equal to the product of the cosine of the angle between d_i and d_j looking from d_h and the Euclidean distances from d_h to these two points. This definition is based on the assumption that d_h is not in the same cluster as d_i and d_j. The smaller the distances ||d_i - d_h|| and ||d_j - d_h|| are, the higher the chance that d_h is in fact in the same cluster as d_i and d_j, and the similarity based on d_h should also be small to reflect this potential. Therefore, through these distances, Eq. (10) also provides a measure of inter-cluster dissimilarity, given that points d_i and d_j belong to cluster S_r, whereas d_h belongs to another cluster. The overall similarity between d_i and d_j is determined by taking the average over all the viewpoints not belonging to cluster S_r. It is possible to argue that while most of these viewpoints are useful, some of them may give misleading information, just as may happen with the origin point. However, given a large enough number of viewpoints and their variety, it is reasonable to assume that the majority of them will be useful. Hence, the effect of misleading viewpoints is constrained and reduced by the averaging step. It can be seen that this method offers a more informative assessment of similarity than the single origin point based similarity measure.

3.2 Analysis and practical examples of MVS

In this section, we present an analytical study to show that the proposed MVS could be a very effective similarity measure for data clustering. In order to demonstrate its advantages, MVS is compared with cosine similarity (CS) on how well they reflect the true group structure in document collections. Firstly, exploring Eq. (10), we have:

MVS(d_i, d_j | d_i, d_j in S_r)
  = (1/(n - n_r)) sum_{d_h in S\S_r} ( d_i^t d_j - d_i^t d_h - d_j^t d_h + d_h^t d_h )
  = d_i^t d_j - (1/(n - n_r)) sum_{d_h} d_i^t d_h - (1/(n - n_r)) sum_{d_h} d_j^t d_h + 1,   since ||d_h|| = 1
  = d_i^t d_j - (1/(n - n_r)) d_i^t D_{S\S_r} - (1/(n - n_r)) d_j^t D_{S\S_r} + 1
  = d_i^t d_j - d_i^t C_{S\S_r} - d_j^t C_{S\S_r} + 1    (11)

where D_{S\S_r} = sum_{d_h in S\S_r} d_h is the composite vector of all the documents outside cluster r, called the outer composite w.r.t. cluster r, and C_{S\S_r} = D_{S\S_r}/(n - n_r) is the outer centroid w.r.t. cluster r, r = 1, ..., k. From Eq. (11), when comparing two pairwise similarities MVS(d_i, d_j) and MVS(d_i, d_l), document d_j is more similar to document d_i than the other document d_l is, if and only if:

d_i^t d_j - d_j^t C_{S\S_r} > d_i^t d_l - d_l^t C_{S\S_r}
  <=>  cos(d_i, d_j) - cos(d_j, C_{S\S_r}) ||C_{S\S_r}|| > cos(d_i, d_l) - cos(d_l, C_{S\S_r}) ||C_{S\S_r}||    (12)
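The closed form of Eq. (11) can be verified numerically against the definition of Eq. (10); the check below (ours; helper names assumed) relies only on the viewpoints having unit length:

```python
import math, random

def sub(u, v): return [a - b for a, b in zip(u, v)]
def dot(u, v): return sum(a * b for a, b in zip(u, v))

def mvs_by_definition(di, dj, viewpoints):
    """Eq. (10): average of (d_i - d_h)^t (d_j - d_h) over outside viewpoints."""
    return sum(dot(sub(di, dh), sub(dj, dh)) for dh in viewpoints) / len(viewpoints)

def mvs_closed_form(di, dj, viewpoints):
    """Eq. (11): d_i^t d_j - d_i^t C_out - d_j^t C_out + 1, where C_out is the
    outer centroid; the +1 comes from d_h^t d_h = 1 for unit viewpoints."""
    m = len(viewpoints)
    c_out = [sum(dh[t] for dh in viewpoints) / m for t in range(len(di))]
    return dot(di, dj) - dot(di, c_out) - dot(dj, c_out) + 1.0

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]
```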

1: procedure BUILDMVSMATRIX(A)
2:   for r = 1 : c do
3:     D_{S\S_r} <- sum_{d_i not in S_r} d_i
4:     n_{S\S_r} <- |S\S_r|
5:   end for
6:   for i = 1 : n do
7:     r <- class of d_i
8:     for j = 1 : n do
9:       if d_j in S_r then
10:        a_ij <- d_i^t d_j - d_i^t D_{S\S_r}/n_{S\S_r} - d_j^t D_{S\S_r}/n_{S\S_r} + 1
11:      else
12:        a_ij <- d_i^t d_j - d_i^t (D_{S\S_r} - d_j)/(n_{S\S_r} - 1) - d_j^t (D_{S\S_r} - d_j)/(n_{S\S_r} - 1) + 1
13:      end if
14:    end for
15:  end for
16:  return A = {a_ij}_{n x n}
17: end procedure

Fig. 1. Procedure: Build MVS similarity matrix.

From this condition, it is seen that even when d_l is considered closer to d_i in terms of CS, i.e. cos(d_i, d_j) <= cos(d_i, d_l), d_l can still be regarded as less similar to d_i based on MVS if, on the contrary, it is closer to the outer centroid C_{S\S_r} than d_j is. This is intuitively reasonable, since the closer d_l is to C_{S\S_r}, the greater the chance that it actually belongs to another cluster rather than S_r and is, therefore, less similar to d_i. For this reason, MVS brings to the table an additional useful measure compared with CS.

To further justify the above proposal and analysis, we carried out a validity test for MVS and CS. The purpose of this test is to check how well a similarity measure coincides with the true class labels. It is based on one principle: if a similarity measure is appropriate for the clustering problem, then for any document in the corpus, the documents that are closest to it based on this measure should be in the same cluster as it.

The validity test is designed as follows. For each type of similarity measure, a similarity matrix A = {a_ij}_{n x n} is created. For CS, this is simple, as a_ij = d_i^t d_j. The procedure for building the MVS matrix is described in Fig. 1. Firstly, the outer composite w.r.t. each class is determined. Then, for each row a_i of A, i = 1, ..., n, if the pair of documents d_i and d_j, j = 1, ..., n, are in the same class, a_ij is calculated as in line 10, Fig. 1. Otherwise, d_j is assumed to be in d_i's class, and a_ij is calculated as in line 12, Fig. 1. After matrix A is formed, the procedure in Fig. 2 is used to get its validity score. For each document d_i corresponding to row a_i of A, we select the q_r documents closest to d_i. The value of q_r is chosen relatively as a percentage of the size of the class r that contains d_i, where percentage in (0, 1]. Then, validity w.r.t. d_i is calculated by the fraction of these q_r documents having the same class label as d_i, as in line 12, Fig. 2. The final validity is determined by averaging over all the rows of A, as in line 14, Fig. 2. It is clear that the validity score is bounded within 0 and 1. The higher the validity score a similarity measure has, the more suitable it should be for the clustering task.

Require: 0 < percentage <= 1
1: procedure GETVALIDITY(validity, A, percentage)
2:   for r = 1 : c do
3:     q_r <- percentage x n_r
4:     if q_r = 0 then    // percentage too small
5:       q_r <- 1
6:     end if
7:   end for
8:   for i = 1 : n do
9:     {a_{iv[1]}, ..., a_{iv[n]}} <- Sort {a_{i1}, ..., a_{in}}
10:      s.t. a_{iv[1]} >= a_{iv[2]} >= ... >= a_{iv[n]},
         {v[1], ..., v[n]} <- permutation of {1, ..., n}
11:   r <- class of d_i
12:   validity(d_i) <- |{d_{v[1]}, ..., d_{v[q_r]}} ∩ S_r| / q_r
13: end for
14: validity <- (1/n) sum_{i=1}^{n} validity(d_i)
15: return validity
16: end procedure

Fig. 2. Procedure: Get validity score.
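A simplified re-implementation of the Fig. 2 procedure (ours) is shown below. One deliberate deviation, flagged here: we exclude each document from its own neighbour list, whereas Fig. 2 sorts the entire row including the self-similarity entry.

```python
def get_validity(A, labels, percentage):
    """Average, over all documents, of the fraction of each document's q_r
    most similar documents (by similarity matrix A) sharing its class label.
    q_r = max(1, int(percentage * class size)), as in Fig. 2.
    Deviation from Fig. 2: the document itself is excluded from its neighbours."""
    n = len(A)
    sizes = {}
    for l in labels:
        sizes[l] = sizes.get(l, 0) + 1
    total = 0.0
    for i in range(n):
        q = max(1, int(percentage * sizes[labels[i]]))
        # rank all other documents by similarity to document i, highest first
        order = sorted((j for j in range(n) if j != i), key=lambda j: -A[i][j])
        top = order[:q]
        total += sum(1 for j in top if labels[j] == labels[i]) / q
    return total / n
```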

Two real-world document datasets are used as examples in this validity test. The first is reuters7, a subset of the famous collection, Reuters-21578 Distribution 1.0, of Reuters newswire articles¹. Reuters-21578 is one of the most widely used test collections for text categorization. In our validity test, we selected 2,500 documents from the largest 7 categories: acq, crude, interest, earn, money-fx, ship and trade to form reuters7. Some of the documents may appear in more than one category. The second dataset is k1b, a collection of 2,340 web pages from the Yahoo! subject hierarchy, including 6 topics: health, entertainment, sport, politics, tech and business. It was created from a past study in information retrieval called WebAce [26], and is now available with the CLUTO toolkit [19].

The two datasets were preprocessed by stop-word removal and stemming. Moreover, we removed words that appear in fewer than two documents or in more than 99.5% of the total number of documents. Finally, the documents were weighted by TF-IDF and normalized to unit vectors. The full characteristics of reuters7 and k1b are presented in Fig. 3.

Fig. 4 shows the validity scores of CS and MVS on the two datasets relative to the parameter percentage. The value of percentage is set at 0.001, 0.01, 0.05, 0.1, 0.2, ..., 1.0. According to Fig. 4, MVS is clearly better than CS for both datasets in this validity test. For example, on the k1b dataset at percentage = 1.0, the MVS validity score is 0.80, while that of CS is only 0.67. This indicates that, on average, when we pick any document and consider its neighborhood of a size equal to its true class size, only 67% of that document's neighbors based on CS actually belong to its class. If based on MVS, the number of valid neighbors increases to 80%. The validity test has illustrated the potential advantage of the new multi-viewpoint based similarity measure compared to the cosine measure.

1. http://www.daviddlewis.com/resources/testcollections/reuters21578/

Fig. 3. Characteristics of the reuters7 and k1b datasets.
reuters7 - classes: 7; documents: 2,500; words: 4,977; class distribution: earn 43%, acq 29%, crude 8%, money-fx 7%, interest 5%, trade 4%, ship 4%.
k1b - classes: 6; documents: 2,340; words: 13,859; class distribution: entertainment 59%, health 21%, business 6%, sports 6%, politics 5%, tech 3%.

Fig. 4. CS and MVS validity test (validity vs. percentage for k1b-CS, k1b-MVS, reuters7-CS and reuters7-MVS).

4 MULTI-VIEWPOINT BASED CLUSTERING

4.1 Two clustering criterion functions IR and IV

Having defined our similarity measure, we now formulate our clustering criterion functions. The first function, called IR, is the cluster-size-weighted sum of average pairwise similarities of documents in the same cluster. Firstly, let us express this sum in a general form by function F:

F = sum_{r=1}^{k} n_r [ (1/n_r^2) sum_{d_i, d_j in S_r} Sim(d_i, d_j) ]    (13)

We would like to transform this objective function into some suitable form such that it could facilitate the optimization procedure to be performed in a simple, fast and effective way. According to Eq. (10):

sum_{d_i, d_j in S_r} Sim(d_i, d_j)
  = sum_{d_i, d_j in S_r} (1/(n - n_r)) sum_{d_h in S\S_r} (d_i - d_h)^t (d_j - d_h)
  = (1/(n - n_r)) sum_{d_i, d_j} sum_{d_h} ( d_i^t d_j - d_i^t d_h - d_j^t d_h + d_h^t d_h )

Since sum_{d_i in S_r} d_i = sum_{d_j in S_r} d_j = D_r, sum_{d_h in S\S_r} d_h = D - D_r and ||d_h|| = 1, we have:

sum_{d_i, d_j in S_r} Sim(d_i, d_j)
  = sum_{d_i, d_j in S_r} d_i^t d_j - (2 n_r/(n - n_r)) sum_{d_i in S_r} d_i^t sum_{d_h in S\S_r} d_h + n_r^2
  = D_r^t D_r - (2 n_r/(n - n_r)) D_r^t (D - D_r) + n_r^2
  = ((n + n_r)/(n - n_r)) ||D_r||^2 - (2 n_r/(n - n_r)) D_r^t D + n_r^2

Substituting into Eq. (13) gives:

F = sum_{r=1}^{k} (1/n_r) [ ((n + n_r)/(n - n_r)) ||D_r||^2 - ( (n + n_r)/(n - n_r) - 1 ) D_r^t D ] + n

Because n is constant, maximizing F is equivalent to maximizing F':

F' = sum_{r=1}^{k} (1/n_r) [ ((n + n_r)/(n - n_r)) ||D_r||^2 - ( (n + n_r)/(n - n_r) - 1 ) D_r^t D ]    (14)

If we compare F' with the min-max cut in Eq. (5), both functions contain the two terms ||D_r||^2 (an intra-cluster similarity measure) and D_r^t D (an inter-cluster similarity measure). Nonetheless, while the objective of min-max cut is to minimize the inverse ratio between these two terms, our aim here is to maximize their weighted difference. In F', this difference term is determined for each cluster. The terms are weighted by the inverse of the cluster's size, before being summed up over all the clusters. One problem is that this formulation is expected to be quite sensitive to cluster size. From the formulation of COSA [27] - a widely known subspace clustering algorithm - we have learned that it is desirable to have a set of weight factors λ = {λ_r}_{r=1}^{k} to regulate the distribution of these cluster sizes in clustering solutions. Hence, we integrate λ into the expression of F' to have it become:

F_λ = sum_{r=1}^{k} (λ_r/n_r) [ ((n + n_r)/(n - n_r)) ||D_r||^2 - ( (n + n_r)/(n - n_r) - 1 ) D_r^t D ]    (15)

In common practice, {λ_r}_{r=1}^{k} are often taken to be simple functions of the respective cluster sizes {n_r}_{r=1}^{k} [28]. Let us use a parameter α, called the regulating factor, which has some constant value (α in [0, 1]), and let λ_r = n_r^α in Eq. (15); the final form of our criterion function IR is:

IR = sum_{r=1}^{k} (1/n_r^{1-α}) [ ((n + n_r)/(n - n_r)) ||D_r||^2 - ( (n + n_r)/(n - n_r) - 1 ) D_r^t D ]    (16)

In the empirical study of Section 5.4, it appears that IR's performance is not very sensitive to the value of α. The criterion function yields relatively good clustering results for α in (0, 1).
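Eq. (16) depends on each cluster only through n_r and D_r, so it is cheap to evaluate for a candidate partition. A sketch (ours; assumes unit-length document vectors and a label array encoding the partition):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def criterion_IR(docs, labels, k, alpha=0.5):
    """Eq. (16): sum_r (1/n_r^(1-alpha)) * [ ((n+n_r)/(n-n_r)) ||D_r||^2
    - ( (n+n_r)/(n-n_r) - 1 ) D_r^t D ].  The default alpha is our choice."""
    n = len(docs)
    dim = len(docs[0])
    D = [sum(d[t] for d in docs) for t in range(dim)]
    total = 0.0
    for r in range(k):
        members = [d for d, l in zip(docs, labels) if l == r]
        nr = len(members)
        Dr = [sum(d[t] for d in members) for t in range(dim)]
        w = (n + nr) / (n - nr)
        total += (w * dot(Dr, Dr) - (w - 1.0) * dot(Dr, D)) / nr ** (1.0 - alpha)
    return total
```

As a sanity check, a partition that separates two orthogonal groups of unit vectors should score higher than a mixed partition of the same data.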

In the formulation of IR, a cluster's quality is measured by the average pairwise similarity between documents within that cluster. However, such an approach can lead to sensitivity to the size and tightness of the clusters. With CS, for example, the pairwise similarity of documents in a sparse cluster is usually smaller than that in a dense cluster. Though not as clearly as with CS, it is still possible that the same effect may hinder MVS-based clustering if pairwise similarity is used. To prevent this, an alternative approach is to consider the similarity between each document vector and its cluster's centroid instead. This is expressed in objective function G:

G = sum_{r=1}^{k} sum_{d_i in S_r} (1/(n - n_r)) sum_{d_h in S\S_r} Sim( d_i - d_h , C_r/||C_r|| - d_h )

G = sum_{r=1}^{k} (1/(n - n_r)) sum_{d_i in S_r} sum_{d_h in S\S_r} (d_i - d_h)^t ( C_r/||C_r|| - d_h )    (17)

Similar to the formulation of IR, we would like to express this objective in a simple form that we could optimize more easily. Expanding the vector dot product, we get:

sum_{d_i in S_r} sum_{d_h in S\S_r} (d_i - d_h)^t ( C_r/||C_r|| - d_h )
  = sum_{d_i} sum_{d_h} [ d_i^t C_r/||C_r|| - d_i^t d_h - d_h^t C_r/||C_r|| + 1 ]
  = (n - n_r) D_r^t D_r/||D_r|| - D_r^t (D - D_r) - n_r (D - D_r)^t D_r/||D_r|| + n_r (n - n_r),   since C_r/||C_r|| = D_r/||D_r||
  = (n + ||D_r||) ||D_r|| - (n_r + ||D_r||) D_r^t D/||D_r|| + n_r (n - n_r)

Substituting the above into Eq. (17), we have:

G = sum_{r=1}^{k} [ ((n + ||D_r||)/(n - n_r)) ||D_r|| - ( (n + ||D_r||)/(n - n_r) - 1 ) D_r^t D/||D_r|| ] + n

Again, we can eliminate n because it is a constant. Maximizing G is equivalent to maximizing IV below:

IV = sum_{r=1}^{k} [ ((n + ||D_r||)/(n - n_r)) ||D_r|| - ( (n + ||D_r||)/(n - n_r) - 1 ) D_r^t D/||D_r|| ]    (18)

    IV calculates the weighted difference between the twoterms: Dr and DtrD/Dr, which again represent an

    This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

  • 8/14/2019 Base Paper -Clustering with Multi-Viewpoint based Similarity Measure.pdf

    7/15

    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. XX, NO. YY, 2011 7

1:  procedure INITIALIZATION
2:      Select k seeds s_1, . . . , s_k randomly
3:      cluster[d_i] ← p = argmax_r {s_r^t d_i}, ∀i = 1, . . . , n
4:      D_r ← Σ_{d_i∈S_r} d_i, n_r ← |S_r|, ∀r = 1, . . . , k
5:  end procedure
6:  procedure REFINEMENT
7:      repeat
8:          {v[1 : n]} ← random permutation of {1, . . . , n}
9:          for j ← 1 : n do
10:             i ← v[j]
11:             p ← cluster[d_i]
12:             ΔI_p ← I(n_p − 1, D_p − d_i) − I(n_p, D_p)
13:             q ← argmax_{r, r≠p} {I(n_r + 1, D_r + d_i) − I(n_r, D_r)}
14:             ΔI_q ← I(n_q + 1, D_q + d_i) − I(n_q, D_q)
15:             if ΔI_p + ΔI_q > 0 then
16:                 Move d_i to cluster q: cluster[d_i] ← q
17:                 Update D_p, n_p, D_q, n_q
18:             end if
19:         end for
20:     until No move for all n documents
21: end procedure

Fig. 5. Algorithm: Incremental clustering.

intra-cluster similarity measure and an inter-cluster similarity measure, respectively. The first term is actually equivalent to an element of the sum in the spherical k-means objective function in Eq. (4); the second one is similar to an element of the sum in the min-max cut criterion in Eq. (6), but with \|D_r\| as the scaling factor instead of \|D_r\|^2. We have presented our clustering criterion functions IR and IV in their simple forms. Next, we show how to perform clustering by using a greedy algorithm to optimize these functions.

    4.2 Optimization algorithm and complexity

We denote our clustering framework by MVSC, meaning Clustering with Multi-Viewpoint based Similarity. Subsequently, we have MVSC-IR and MVSC-IV, which are MVSC with criterion functions IR and IV respectively. The main goal is to perform document clustering by optimizing IR in Eq. (16) and IV in Eq. (18). For this purpose, the incremental k-way algorithm [18], [29] - a sequential version of k-means - is employed. Considering that the expression of IV in Eq. (18) depends only on n_r and D_r, r = 1, . . . , k, IV can be written in a general form:

I_V = \sum_{r=1}^{k} I_r(n_r, D_r)    (19)

where I_r(n_r, D_r) corresponds to the objective value of cluster r. The same applies to IR. With this general form, the incremental optimization algorithm, which has two major steps, Initialization and Refinement, is described in Fig. 5. At Initialization, k arbitrary documents are selected to be the seeds from which initial partitions are formed. Refinement is a procedure that

consists of a number of iterations. During each iteration, the n documents are visited one by one in a totally random order. Each document is checked to see whether moving it to another cluster improves the objective function. If yes, the document is moved to the cluster that leads to the highest improvement. If no cluster is better than the current cluster, the document is not moved. The clustering process terminates when an iteration completes without any documents being moved to new clusters. Unlike the traditional k-means, this algorithm is a stepwise optimal procedure. While k-means only updates after all n documents have been reassigned, the incremental clustering algorithm updates immediately whenever a document is moved to a new cluster. Since every move, when it happens, increases the objective function value, convergence to a local optimum is guaranteed.
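The procedure in Fig. 5 can be sketched in code. The following is a simplified illustration (not the authors' implementation) that plugs Eq. (18) into the general form I_r(n_r, D_r) and runs the incremental refinement on dense NumPy arrays; a production version would operate on sparse vectors instead.

```python
import numpy as np

def cluster_obj(n_r, D_r, D, n):
    """I_r(n_r, D_r): one cluster's term of the I_V criterion in Eq. (18)."""
    if n_r == 0:
        return 0.0                       # an empty cluster contributes nothing
    if n_r >= n:
        return -np.inf                   # disallow a cluster holding every document
    nDr = np.linalg.norm(D_r)
    if nDr == 0.0:
        return 0.0
    w = (n + nDr) / (n - n_r)
    return w * nDr - (w - 1.0) * float(D_r @ D) / nDr

def incremental_mvsc(docs, k, max_iter=50, seed=0):
    """Incremental k-way optimization of I_V (Fig. 5) on unit document vectors."""
    rng = np.random.default_rng(seed)
    n = len(docs)
    D = docs.sum(axis=0)
    seeds = docs[rng.choice(n, size=k, replace=False)]
    label = np.argmax(docs @ seeds.T, axis=1)              # initial partition
    Dr = np.stack([docs[label == r].sum(axis=0) for r in range(k)])
    nr = np.bincount(label, minlength=k)
    for _ in range(max_iter):
        moved = False
        for i in rng.permutation(n):                       # visit in random order
            p, d = label[i], docs[i]
            dIp = (cluster_obj(nr[p] - 1, Dr[p] - d, D, n)
                   - cluster_obj(nr[p], Dr[p], D, n))
            gains = [cluster_obj(nr[r] + 1, Dr[r] + d, D, n)
                     - cluster_obj(nr[r], Dr[r], D, n) if r != p else -np.inf
                     for r in range(k)]
            q = int(np.argmax(gains))
            if dIp + gains[q] > 0:                         # move only if I_V increases
                label[i] = q
                Dr[p] -= d; nr[p] -= 1
                Dr[q] += d; nr[q] += 1
                moved = True
        if not moved:                                      # full pass with no move
            break
    return label
```

On normalized TF-IDF vectors, `incremental_mvsc(X, k)` returns one label per document; since each accepted move strictly increases I_V, the loop terminates at a local optimum.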

During the optimization procedure, in each iteration, the main sources of computational cost are:

- Searching for the optimum cluster to move each individual document to: O(nz · k).
- Updating composite vectors as a result of such moves: O(m · k).

where nz is the total number of non-zero entries in all document vectors. Our clustering approach is partitional and incremental; therefore, computing a similarity matrix is not needed at all. If τ denotes the number of iterations the algorithm takes, since nz is often several tens of times larger than m for the document domain, the computational complexity required for clustering with IR and IV is O(nz · k · τ).

5 PERFORMANCE EVALUATION OF MVSC

To verify the advantages of our proposed methods, we evaluate their performance in experiments on document data. The objective of this section is to compare MVSC-IR and MVSC-IV with the existing algorithms that also use specific similarity measures and criterion functions for document clustering. The similarity measures to be compared include Euclidean distance, cosine similarity and the extended Jaccard coefficient.

    5.1 Document collections

The data corpora that we used for experiments consist of twenty benchmark document datasets. Besides reuters7 and k1b, which have been described in detail earlier, we included another eighteen text collections so that the examination of the clustering methods is more thorough and exhaustive. Similar to k1b, these datasets are provided together with CLUTO by the toolkit's authors [19]. They had been used for experimental testing in previous papers, and their source and origin had also been described in detail [30], [31]. Table 2 summarizes their characteristics. The corpora present a diversity of size, number of classes and class balance. They were all preprocessed by standard procedures, including stop-word removal, stemming, removal of too rare as well as


TABLE 2
Document datasets

Data      Source              c    n      m       Balance
fbis      TREC                17   2,463  2,000   0.075
hitech    TREC                6    2,301  13,170  0.192
k1a       WebACE              20   2,340  13,859  0.018
k1b       WebACE              6    2,340  13,859  0.043
la1       TREC                6    3,204  17,273  0.290
la2       TREC                6    3,075  15,211  0.274
re0       Reuters             13   1,504  2,886   0.018
re1       Reuters             25   1,657  3,758   0.027
tr31      TREC                7    927    10,127  0.006
reviews   TREC                5    4,069  23,220  0.099
wap       WebACE              20   1,560  8,440   0.015
classic   CACM/CISI/CRAN/MED  4    7,089  12,009  0.323
la12      TREC                6    6,279  21,604  0.282
new3      TREC                44   9,558  36,306  0.149
sports    TREC                7    8,580  18,324  0.036
tr11      TREC                9    414    6,424   0.045
tr12      TREC                8    313    5,799   0.097
tr23      TREC                6    204    5,831   0.066
tr45      TREC                10   690    8,260   0.088
reuters7  Reuters             7    2,500  4,977   0.082

c: # of classes, n: # of documents, m: # of words
Balance = (smallest class size)/(largest class size)

too frequent words, TF-IDF weighting and normalization.
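For illustration, this preprocessing chain can be reproduced in a few lines. The sketch below is a toy version (a tiny illustrative stop-word list, no stemming) that builds unit-length TF-IDF vectors from raw strings:

```python
import math
from collections import Counter

def tfidf_vectors(docs, stopwords=frozenset({"the", "a", "of", "and"}),
                  min_df=1, max_df_ratio=1.0):
    """Stop-word removal, document-frequency filtering, TF-IDF weighting,
    and unit-length normalization (stemming omitted for brevity)."""
    tokenized = [[w for w in d.lower().split() if w not in stopwords] for d in docs]
    n = len(docs)
    df = Counter(w for toks in tokenized for w in set(toks))      # document frequency
    vocab = sorted(w for w, cnt in df.items() if min_df <= cnt <= max_df_ratio * n)
    index = {w: j for j, w in enumerate(vocab)}
    vectors = []
    for toks in tokenized:
        tf = Counter(t for t in toks if t in index)               # term frequency
        vec = [0.0] * len(vocab)
        for w, f in tf.items():
            vec[index[w]] = f * math.log(n / df[w])               # tf * idf
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])                   # ||d_i|| = 1
    return vocab, vectors
```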

    5.2 Experimental setup and evaluation

To demonstrate how well MVSCs can perform, we compare them with five other clustering methods on the twenty datasets in Table 2. In summary, the seven clustering algorithms are:

- MVSC-IR: MVSC using criterion function IR
- MVSC-IV: MVSC using criterion function IV
- k-means: standard k-means with Euclidean distance
- Spkmeans: spherical k-means with cosine similarity
- graphCS: CLUTO's graph method with cosine similarity
- graphEJ: CLUTO's graph method with the extended Jaccard coefficient
- MMC: Spectral Min-Max Cut algorithm [13]

Our MVSC-IR and MVSC-IV programs are implemented in Java. The regulating factor α in IR is always set at 0.3 during the experiments. We observed that this is one of the most appropriate values. A study of MVSC-IR's performance relative to different α values is presented in a later section. The other algorithms are provided by the C library interface which is available freely with the CLUTO toolkit [19]. For each dataset, the cluster number is predefined to equal the number of true classes, i.e. k = c.

None of the above algorithms is guaranteed to find the global optimum, and all of them are initialization-dependent. Hence, for each method, we performed clustering a few times with randomly initialized values, and chose the best trial in terms of the corresponding objective function value. In all the experiments, each test run consisted of 10 trials. Moreover, the result reported here on each dataset by a particular clustering method is the average of 10 test runs.

After a test run, the clustering solution is evaluated by comparing the documents' assigned labels with their true labels provided by the corpus. Three types of external evaluation metric are used to assess clustering performance. They are FScore, Normalized Mutual Information (NMI) and Accuracy. FScore is an equally weighted combination of the precision (P) and recall (R) values used in information retrieval. Given a clustering solution, FScore is determined as:

\mathrm{FScore} = \sum_{i=1}^{k} \frac{n_i}{n} \max_j (F_{i,j}) \quad \text{where } F_{i,j} = \frac{2 P_{i,j} R_{i,j}}{P_{i,j} + R_{i,j}};\ P_{i,j} = \frac{n_{i,j}}{n_j},\ R_{i,j} = \frac{n_{i,j}}{n_i}

where n_i denotes the number of documents in class i, n_j the number of documents assigned to cluster j, and n_{i,j} the number of documents shared by class i and cluster

j. From another aspect, NMI measures the information the true class partition and the cluster assignment share. It measures how much knowing about the clusters helps us know about the classes:

\mathrm{NMI} = \frac{\sum_{i=1}^{k} \sum_{j=1}^{k} n_{i,j} \log \frac{n\, n_{i,j}}{n_i n_j}}{\sqrt{\left( \sum_{i=1}^{k} n_i \log \frac{n_i}{n} \right) \left( \sum_{j=1}^{k} n_j \log \frac{n_j}{n} \right)}}

Finally, Accuracy measures the fraction of documents that are correctly labeled, assuming a one-to-one correspondence between true classes and assigned clusters. Let q denote any possible permutation of the index set {1, . . . , k}; Accuracy is calculated by:

\mathrm{Accuracy} = \frac{1}{n} \max_q \sum_{i=1}^{k} n_{i,q(i)}

The best mapping q to determine Accuracy can be found by the Hungarian algorithm². For all three metrics, the range is from 0 to 1, and a greater value indicates a better clustering solution.
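The three metrics can all be computed from the class-cluster contingency matrix n_{i,j}. A minimal sketch follows, with a brute-force search over permutations standing in for the Hungarian algorithm (adequate only for small k):

```python
import math
import numpy as np
from itertools import permutations

def contingency(true, pred, k):
    """M[i, j] = n_ij: documents shared by class i and cluster j."""
    M = np.zeros((k, k), dtype=int)
    for t, p in zip(true, pred):
        M[t, p] += 1
    return M

def fscore(M):
    n, ni, nj = M.sum(), M.sum(axis=1), M.sum(axis=0)
    total = 0.0
    for i in range(len(M)):
        best = 0.0
        for j in range(len(M)):
            if M[i, j]:
                P, R = M[i, j] / nj[j], M[i, j] / ni[i]
                best = max(best, 2 * P * R / (P + R))
        total += ni[i] / n * best
    return total

def nmi(M):
    n, ni, nj = M.sum(), M.sum(axis=1), M.sum(axis=0)
    num = sum(M[i, j] * math.log(n * M[i, j] / (ni[i] * nj[j]))
              for i in range(len(M)) for j in range(len(M)) if M[i, j])
    den = math.sqrt(sum(a * math.log(a / n) for a in ni if a)
                    * sum(b * math.log(b / n) for b in nj if b))
    return num / den if den else 0.0

def accuracy(M):
    # brute-force stand-in for the Hungarian algorithm; fine for small k
    n = M.sum()
    return max(sum(M[i, q[i]] for i in range(len(M)))
               for q in permutations(range(len(M)))) / n
```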

    5.3 Results

Fig. 6 shows the Accuracy of the seven clustering algorithms on the twenty text collections. Presented in a different way, clustering results based on FScore and NMI are reported in Table 3 and Table 4 respectively. For each

dataset in a row, the value in bold and underlined is the best result, while the value in bold only is the second best.

It can be observed that MVSC-IR and MVSC-IV perform consistently well. In Fig. 6, on 19 out of 20 datasets (the only exception being reviews), either both or one of the MVSC approaches are among the top two algorithms. The next most consistent performer is Spkmeans. The other algorithms may work well on certain datasets. For example, graphEJ yields an outstanding result on classic; graphCS and MMC are

2. http://en.wikipedia.org/wiki/Hungarian_algorithm


[Figure 6: two bar plots of Accuracy for the seven algorithms (MVSC-IR, MVSC-IV, kmeans, Spkmeans, graphCS, graphEJ, MMC) over the twenty datasets; accuracies range roughly from 0.2 to 1.]

Fig. 6. Clustering results in Accuracy. Left-to-right in legend corresponds to left-to-right in the plot.

good on reviews. However, they do not fare very well on the rest of the collections.

To have a statistical justification of the clustering performance comparisons, we also carried out statistical significance tests. Each of MVSC-IR and MVSC-IV was paired up with one of the remaining algorithms for a paired t-test [32]. Given two paired sets X and Y of N measured values, the null hypothesis of the test is that the differences between X and Y come from a population with mean 0. The alternative hypothesis is that the paired sets differ from each other in a significant way. In our experiments, these tests were done based on the evaluation values obtained on the twenty datasets. The typical 5% significance level was used. For example, considering the pair (MVSC-IR, k-means), from Table 3 it is seen that MVSC-IR dominates k-means w.r.t. FScore. If the paired t-test returns a p-value smaller than 0.05, we reject the null hypothesis and say that the dominance is significant. Otherwise, we fail to reject the null hypothesis and the comparison is considered insignificant.
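As an illustration of this testing procedure, the sketch below (assuming SciPy is available) runs the paired t-test on the FScore values of MVSC-IR and k-means from the first ten rows of Table 3:

```python
from scipy.stats import ttest_rel

# FScore values from Table 3 on the first ten datasets (illustration only)
mvsc_ir = [.645, .512, .620, .873, .719, .721, .460, .514, .728, .734]
kmeans  = [.578, .467, .502, .825, .565, .538, .421, .456, .585, .644]

t_stat, p_value = ttest_rel(mvsc_ir, kmeans)
if p_value < 0.05:                 # 5% significance level, as in the paper
    print("dominance is significant")
```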

The outcomes of the paired t-tests are presented in Table 5. As the paired t-tests show, the advantage of MVSC-IR and MVSC-IV over the other methods is statistically significant. A special case is the graphEJ algorithm. On the one hand, MVSC-IR is not significantly better than graphEJ based on FScore or NMI. On the other hand, even where MVSC-IR and MVSC-IV are tested to be clearly better than graphEJ, the p-values can still be considered relatively large, although they are smaller than 0.05. The reason is that, as observed before, graphEJ's results on the classic dataset are very different from those of the other algorithms. While interesting, these values can be considered outliers, and including them in the statistical tests would affect the outcomes greatly. Hence, we also report in Table 5 the tests where classic was excluded and only results on the other 19 datasets were used.


    TABLE 3

    Clustering results in FScore

    Data MVSC-IR MVSC-IV k-means Spkmeans graphCS graphEJ MMC

    fbis .645 .613 .578 .584 .482 .503 .506

    hitech .512 .528 .467 .494 .492 .497 .468

    k1a .620 .592 .502 .545 .492 .517 .524

    k1b .873 .775 .825 .729 .740 .743 .707

la1 .719 .723 .565 .719 .689 .679 .693
la2 .721 .749 .538 .703 .689 .633 .698

    re0 .460 .458 .421 .421 .468 .454 .390

    re1 .514 .492 .456 .499 .487 .457 .443

    tr31 .728 .780 .585 .679 .689 .698 .607

    reviews .734 .748 .644 .730 .759 .690 .749

    wap .610 .571 .516 .545 .513 .497 .513

    classic .658 .734 .713 .687 .708 .983 .657

    la12 .719 .735 .559 .722 .706 .671 .693

    new3 .548 .547 .500 .558 .510 .496 .482

    sports .803 .804 .499 .702 .689 .696 .650

    tr11 .749 .728 .705 .719 .665 .658 .695

    tr12 .743 .758 .699 .715 .642 .722 .700

    tr23 .560 .553 .486 .523 .522 .531 .485

    tr45 .787 .788 .692 .799 .778 .798 .720

    reuters7 .774 .775 .658 .718 .651 .670 .687

    TABLE 4

Clustering results in NMI

    Data MVSC-IR MVSC-IV k-means Spkmeans graphCS graphEJ MMC

    fbis .606 .595 .584 .593 .527 .524 .556

    hitech .323 .329 .270 .298 .279 .292 .283

    k1a .612 .594 .563 .596 .537 .571 .588

    k1b .739 .652 .629 .649 .635 .650 .645

    la1 .569 .571 .397 .565 .490 .485 .553

    la2 .568 .590 .381 .563 .496 .478 .566

    re0 .399 .402 .388 .399 .367 .342 .414

re1 .591 .583 .532 .593 .581 .566 .515
tr31 .613 .658 .488 .594 .577 .580 .548

    reviews .584 .603 .460 .607 .570 .528 .639

    wap .611 .585 .568 .596 .557 .555 .575

    classic .574 .644 .579 .577 .558 .928 .543

    la12 .574 .584 .378 .568 .496 .482 .558

    new3 .621 .622 .578 .626 .580 .580 .577

    sports .669 .701 .445 .633 .578 .581 .591

    tr11 .712 .674 .660 .671 .634 .594 .666

    tr12 .686 .686 .647 .654 .578 .626 .640

    tr23 .432 .434 .363 .413 .344 .380 .369

    tr45 .734 .733 .640 .748 .726 .713 .667

    reuters7 .633 .632 .512 .612 .503 .520 .591

Under this circumstance, both MVSC-IR and MVSC-IV outperform graphEJ significantly with good p-values.

5.4 Effect of α on MVSC-IR's performance

It has been known that criterion-function-based partitional clustering methods can be sensitive to cluster size and balance. In the formulation of IR in Eq. (16), there exists a parameter α, called the regulating factor, with α ∈ [0, 1]. To examine how the determination of α could affect MVSC-IR's performance, we evaluated MVSC-IR with different values of α from 0 to 1, with a 0.1 incremental interval. The assessment was done based on the clustering results in NMI, FScore and Accuracy, each averaged over all the twenty given datasets. Since the evaluation metrics for different datasets could be very different from each other, simply taking the average over all the datasets would not be very meaningful. Hence, we employed the method used in [18] to transform the metrics into relative metrics before averaging. On a particular document collection S, the relative FScore


TABLE 5
Statistical significance of comparisons based on paired t-tests with 5% significance level

                      k-means   Spkmeans  graphCS   graphEJ*          MMC
FScore    MVSC-IR     1.77E-5   1.60E-3   4.61E-4   .056 (7.68E-6)    3.27E-6
          MVSC-IV     7.52E-5   1.42E-4   3.27E-5   .022 (1.50E-6)    2.16E-7
NMI       MVSC-IR     7.42E-6   .013      2.39E-7   .060 (1.65E-8)    8.72E-5
          MVSC-IV     4.27E-5   .013      4.07E-7   .029 (4.36E-7)    2.52E-4
Accuracy  MVSC-IR     1.45E-6   1.50E-4   1.33E-4   .028 (3.29E-5)    8.33E-7
          MVSC-IV     1.74E-5   1.82E-4   4.19E-5   .014 (8.61E-6)    9.80E-7

Each cell reports the p-value of the paired t-test of the algorithm in the row against the one in the column; a p-value below 0.05 indicates the row algorithm performs significantly better.
* Values in parentheses are from the tests with the classic dataset excluded.


TABLE 6
TDT2 and Reuters-21578 document corpora

                           TDT2     Reuters-21578
Total number of documents  10,021   8,213
Total number of classes    56       41
Largest class size         1,844    3,713
Smallest class size        10       10

sub-collection having 8,213 documents from the largest 41 topics. The same two document collections had been used in the paper on the NMF methods [10]. Documents that appear in two or more topics were removed, and the remaining documents were preprocessed in the same way as in Section 5.1.

    6.2 Experiments and results

The following clustering methods:

- Spkmeans: spherical k-means
- rMVSC-IR: refinement of Spkmeans by MVSC-IR
- rMVSC-IV: refinement of Spkmeans by MVSC-IV
- MVSC-IR: normal MVSC using criterion IR
- MVSC-IV: normal MVSC using criterion IV

and two document clustering approaches that do not use any particular form of similarity measure:

- NMF: Non-negative Matrix Factorization method
- NMF-NCW: Normalized Cut Weighted NMF

were involved in the performance comparison. When used as a refinement for Spkmeans, the algorithms rMVSC-IR and rMVSC-IV worked directly on the output solution of Spkmeans. The cluster assignment produced by Spkmeans was used as the initialization for both rMVSC-IR and rMVSC-IV. We also investigated the performance of the original MVSC-IR and MVSC-IV further on the new datasets. Besides, it would be interesting to see how they and their Spkmeans-initialized versions fare against each other. What is more, two well-known document clustering approaches based on non-negative matrix factorization, NMF and NMF-NCW [10], are also included for a comparison with our algorithms, which use the explicit MVS measure.

During the experiments, each of the two corpora in Table 6 was used to create 6 different test cases, each of which corresponded to a distinct number of topics used (c = 5, . . . , 10). For each test case, c topics were randomly selected from the corpus and their documents were mixed together to form a test set. This selection was repeated 50 times so that each test case had 50 different test sets. The average performance of the clustering algorithms with k = c was calculated over these 50 test sets. This experimental set-up is inspired by the similar experiments conducted in the NMF paper [10]. Furthermore, similar to the previous experimental setup in Section 5.2, each algorithm (including NMF and NMF-NCW) ran 10 trials on any test set before the solution with the best obtainable objective function value was used as its final output.
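The test-set construction can be sketched as follows. This is a simplified illustration: the `corpus` mapping from topic names to document lists is an assumed structure, not the actual TDT2/Reuters loader.

```python
import random

def make_test_sets(corpus, c, repeats=50, seed=0):
    """Pick c topics at random, pool their documents with true labels kept,
    and repeat `repeats` times to form that test case's test sets."""
    rng = random.Random(seed)
    topics = sorted(corpus)
    test_sets = []
    for _ in range(repeats):
        chosen = rng.sample(topics, c)                       # c distinct topics
        docs = [(d, t) for t in chosen for d in corpus[t]]   # (document, label)
        rng.shuffle(docs)                                    # mix the documents
        test_sets.append(docs)
    return test_sets
```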

    The clustering results on TDT2 and Reuters-21578 areshown in Table 7 and 8 respectively. For each test case in

    TABLE 7

    Clustering results on TDT2

    NMI Accuracy

    Algorithms k=5 k=6 k=7 k=8 k=9 k=10 k=5 k=6 k=7 k=8 k=9 k=10

    Spkmeans .690 .704 .700 .677 .681 .656 .708 .689 .668 .620 .605 .578

    rMVSC-IR .753 .777 .766 .749 .738 .699 .855 .846 .822 .802 .760 .722

    rMVSC-IV .740 .764 .742 .729 .718 .676 .839 .837 .801 .785 .736 .701

    MVSC-IR .749 .790 .797 .760 .764 .722 .884 .867 .875 .840 .832 .780

    MVSC-IV .775 .785 .779 .745 .755 .714 .886 .871 .870 .825 .818 .777

    NMF .621 .630 .607 .581 .593 .555 .697 .686 .642 .604 .578 .555

    NMF-NCW .713 .746 .723 .707 .702 .659 .788 .821 .764 .749 .725 .675

    TABLE 8

Clustering results on Reuters-21578

    NMI Accuracy

    Algorithms k=5 k=6 k=7 k=8 k=9 k=10 k=5 k=6 k=7 k=8 k=9 k=10

    Spkmeans .370 .435 .389 .336 .348 .428 .512 .508 .454 .390 .380 .429

rMVSC-IR .386 .448 .406 .347 .359 .433 .591 .592 .522 .445 .437 .485

    rMVSC-IV .395 .438 .408 .351 .361 .434 .591 .573 .529 .453 .448 .477

    MVSC-IR .377 .442 .418 .354 .356 .441 .582 .588 .538 .473 .477 .505

    MVSC-IV .375 .444 .416 .357 .369 .438 .589 .588 .552 .475 .482 .512

    NMF .321 .369 .341 .289 .278 .359 .553 .534 .479 .423 .388 .430

    NMF-NCW .355 .413 .387 .341 .344 .413 .608 .580 .535 .466 .432 .493


[Figure 8: two radar charts, (a) TDT2 and (b) Reuters-21578, plotting the Accuracy of Spkmeans, rMVSC-IR and rMVSC-IV on each of the 50 test sets; values range roughly from 0.2 to 1.]

Fig. 8. Accuracies on the 50 test sets (in sorted order of Spkmeans) in the test case k = 5.

a column, the value in bold and underlined is the best among the results returned by the algorithms, while the value in bold only is the second best. From the tables, several observations can be made. Firstly, MVSC-IR and MVSC-IV continue to show they are good clustering algorithms by outperforming other methods frequently. They are always the best in every test case of TDT2. Compared with NMF-NCW, they are better in almost all the cases, the only exception being Reuters-21578 with k = 5, where NMF-NCW is the best based on Accuracy.

The second observation, which is also the main objective of this empirical study, is that by applying MVSC to refine the output of spherical k-means, clustering solutions are improved significantly. Both rMVSC-IR and rMVSC-IV lead to higher NMIs and Accuracies than Spkmeans in all the cases. Interestingly, there are many circumstances where Spkmeans' result is worse than that of the NMF clustering methods, but after being refined by MVSCs, it becomes better. To have a more descriptive picture of the improvements, we can refer to the radar charts in Fig. 8. The figure shows details of a particular test case where k = 5. Remember that a test case consists of 50 different test sets. The charts display the result on each test set, including the accuracy obtained by Spkmeans, and the results after refinement by MVSC, namely rMVSC-IR and rMVSC-IV. For effective visualization, they are sorted in ascending order of the accuracies by Spkmeans (clockwise). As the patterns in both Fig. 8(a) and Fig. 8(b) reveal, improvement in accuracy is most likely attainable by rMVSC-IR and rMVSC-IV. Many of the improvements come with a considerably large margin, especially when the original accuracy obtained by Spkmeans is low. There are only a few exceptions where, after refinement, accuracy becomes worse. Nevertheless, the decreases in such cases are small.

Finally, it is also interesting to notice from Table 7 and Table 8 that MVSC preceded by spherical k-means does not necessarily yield better clustering results than MVSC with random initialization. There are only a small number of cases in the two tables where rMVSC is found to be better than MVSC. This phenomenon, however, is understandable. Given a local optimal solution returned by spherical k-means, the rMVSC algorithms as a refinement method would be constrained by this local optimum itself and, hence, their search space might be restricted. The original MVSC algorithms, on the other hand, are not subject to this constraint, and are able to follow the search trajectory of their objective function from the beginning. Hence, while the performance improvement after refining the spherical k-means result by MVSC proves the appropriateness of MVS and its criterion functions for document clustering, this observation in fact only reaffirms its potential.

7 CONCLUSIONS AND FUTURE WORK

In this paper, we propose a Multi-Viewpoint based Similarity measuring method, named MVS. Theoretical analysis and empirical examples show that MVS is potentially more suitable for text documents than the popular cosine similarity. Based on MVS, two criterion functions, IR and IV, and their respective clustering algorithms, MVSC-IR and MVSC-IV, have been introduced. Compared with other state-of-the-art clustering methods that use different types of similarity measure,


on a large number of document datasets and under different evaluation metrics, the proposed algorithms show that they could provide significantly improved clustering performance.

    The key contribution of this paper is the fundamentalconcept of similarity measure from multiple viewpoints.Future methods could make use of the same principle,

but define alternative forms for the relative similarity in Eq. (10), or, instead of using the average, have other methods to combine the relative similarities according to the different viewpoints. Besides, this paper focuses on partitional clustering of documents. In the future, it would also be possible to apply the proposed criterion functions to hierarchical clustering algorithms. Finally, we have shown the application of MVS and its clustering algorithms to text data. It would be interesting to explore how they work on other types of sparse and high-dimensional data.

    REFERENCES

[1] X. Wu, V. Kumar, J. Ross Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. Ng, B. Liu, P. S. Yu, Z.-H. Zhou, M. Steinbach, D. J. Hand, and D. Steinberg, "Top 10 algorithms in data mining," Knowl. Inf. Syst., vol. 14, no. 1, pp. 1-37, 2007.
[2] I. Guyon, U. von Luxburg, and R. C. Williamson, "Clustering: Science or Art?" NIPS'09 Workshop on Clustering Theory, 2009.
[3] I. Dhillon and D. Modha, "Concept decompositions for large sparse text data using clustering," Mach. Learn., vol. 42, no. 1-2, pp. 143-175, Jan 2001.
[4] S. Zhong, "Efficient online spherical K-means clustering," in IEEE IJCNN, 2005, pp. 3180-3185.
[5] A. Banerjee, S. Merugu, I. Dhillon, and J. Ghosh, "Clustering with Bregman divergences," J. Mach. Learn. Res., vol. 6, pp. 1705-1749, Oct 2005.
[6] E. Pekalska, A. Harol, R. P. W. Duin, B. Spillmann, and H. Bunke, "Non-Euclidean or non-metric measures can be informative," in Structural, Syntactic, and Statistical Pattern Recognition, ser. LNCS, vol. 4109, 2006, pp. 871-880.
[7] M. Pelillo, "What is a cluster? Perspectives from game theory," in Proc. of the NIPS Workshop on Clustering Theory, 2009.
[8] D. Lee and J. Lee, "Dynamic dissimilarity measure for support-based clustering," IEEE Trans. on Knowl. and Data Eng., vol. 22, no. 6, pp. 900-905, 2010.
[9] A. Banerjee, I. Dhillon, J. Ghosh, and S. Sra, "Clustering on the unit hypersphere using von Mises-Fisher distributions," J. Mach. Learn. Res., vol. 6, pp. 1345-1382, Sep 2005.
[10] W. Xu, X. Liu, and Y. Gong, "Document clustering based on non-negative matrix factorization," in SIGIR, 2003, pp. 267-273.
[11] I. S. Dhillon, S. Mallela, and D. S. Modha, "Information-theoretic co-clustering," in KDD, 2003, pp. 89-98.
[12] C. D. Manning, P. Raghavan, and H. Schutze, An Introduction to Information Retrieval. Cambridge University Press, 2009.
[13] C. Ding, X. He, H. Zha, M. Gu, and H. Simon, "A min-max cut algorithm for graph partitioning and data clustering," in IEEE ICDM, 2001, pp. 107-114.
[14] H. Zha, X. He, C. H. Q. Ding, M. Gu, and H. D. Simon, "Spectral relaxation for k-means clustering," in NIPS, 2001, pp. 1057-1064.
[15] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, pp. 888-905, 2000.
[16] I. S. Dhillon, "Co-clustering documents and words using bipartite spectral graph partitioning," in KDD, 2001, pp. 269-274.
[17] Y. Gong and W. Xu, Machine Learning for Multimedia Content Analysis. Springer-Verlag New York, Inc., 2007.
[18] Y. Zhao and G. Karypis, "Empirical and theoretical comparisons of selected criterion functions for document clustering," Mach. Learn., vol. 55, no. 3, pp. 311-331, Jun 2004.
[19] G. Karypis, "CLUTO - a clustering toolkit," Dept. of Computer Science, Uni. of Minnesota, Tech. Rep., 2003, http://glaros.dtc.umn.edu/gkhome/views/cluto.
[20] A. Strehl, J. Ghosh, and R. Mooney, "Impact of similarity measures on web-page clustering," in Proc. of the 17th National Conf. on Artif. Intell.: Workshop of Artif. Intell. for Web Search. AAAI, Jul. 2000, pp. 58-64.
[21] A. Ahmad and L. Dey, "A method to compute distance between two categorical values of same attribute in unsupervised learning for categorical data set," Pattern Recognit. Lett., vol. 28, no. 1, pp. 110-118, 2007.
[22] D. Ienco, R. G. Pensa, and R. Meo, "Context-based distance learning for categorical data clustering," in Proc. of the 8th Int. Symp. IDA, 2009, pp. 83-94.
[23] P. Lakkaraju, S. Gauch, and M. Speretta, "Document similarity based on concept tree distance," in Proc. of the 19th ACM Conf. on Hypertext and Hypermedia, 2008, pp. 127-132.
[24] H. Chim and X. Deng, "Efficient phrase-based document similarity for clustering," IEEE Trans. on Knowl. and Data Eng., vol. 20, no. 9, pp. 1217-1229, 2008.
[25] S. Flesca, G. Manco, E. Masciari, L. Pontieri, and A. Pugliese, "Fast detection of XML structural similarity," IEEE Trans. on Knowl. and Data Eng., vol. 17, no. 2, pp. 160-175, 2005.
[26] E.-H. Han, D. Boley, M. Gini, R. Gross, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore, "WebACE: a web agent for document categorization and exploration," in AGENTS '98: Proc. of the 2nd ICAA, 1998, pp. 408-415.
[27] J. Friedman and J. Meulman, "Clustering objects on subsets of attributes," J. R. Stat. Soc. Series B Stat. Methodol., vol. 66, no. 4, pp. 815-839, 2004.
[28] L. Hubert, P. Arabie, and J. Meulman, Combinatorial Data Analysis: Optimization by Dynamic Programming. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 2001.
[29] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. New York: John Wiley & Sons, 2001.
[30] S. Zhong and J. Ghosh, "A comparative study of generative models for document clustering," in SIAM Int. Conf. Data Mining Workshop on Clustering High Dimensional Data and Its Applications, 2003.
[31] Y. Zhao and G. Karypis, "Criterion functions for document clustering: Experiments and analysis," Dept. of Computer Science, Uni. of Minnesota, Tech. Rep., 2002.
[32] T. M. Mitchell, Machine Learning. McGraw-Hill, 1997.

    Duc Thang Nguyen received the B.Eng. de-gree in Electrical & Electronic Engineering atNanyang Technological University, Singapore,where he is also pursuing the Ph.D. degreein the Division of Information Engineering. Cur-rently, he is an Operations Planning Analystat PSA Corporation Ltd, Singapore. His re-search interests include algorithms, informationretrieval, data mining, optimizations and opera-tions research.

Lihui Chen received the B.Eng. in Computer Science & Engineering at Zhejiang University, China and the PhD in Computational Science at University of St. Andrews, UK. Currently she is an Associate Professor in the Division of Information Engineering at Nanyang Technological University in Singapore. Her research interests include machine learning algorithms and applications, data mining and web intelligence. She has published more than seventy refereed papers in international journals and conferences in

    these areas. She is a senior member of the IEEE, and a member of theIEEE Computational Intelligence Society.


Chee Keong Chan received his B.Eng. from the National University of Singapore in Electrical and Electronic Engineering, his MSc and DIC in Computing from Imperial College, University of London, and his PhD from Nanyang Technological University. Upon graduation from NUS, he worked as an R&D Engineer in Philips Singapore for several years. Currently, he is an Associate Professor in the Information Engineering Division, lecturing in subjects related to computer systems, artificial intelligence, software engineering and cyber security. Through his years in NTU, he has published numerous research papers in conferences, journals, books and book chapters. He has also provided numerous consultations to the industries. His current research interest areas include data mining (text and solar radiation data mining), evolutionary algorithms (scheduling and games) and renewable energy.
