
Active Clustering: Robust and Efficient Hierarchical Clustering using Adaptively Selected Similarities

Brian Eriksson (Boston University), Gautam Dasarathy (University of Wisconsin), Aarti Singh (Carnegie Mellon University), Robert Nowak (University of Wisconsin)

    Abstract

Hierarchical clustering based on pairwise similarities is a common tool used in a broad range of scientific applications. However, in many problems it may be expensive to obtain or compute similarities between the items to be clustered. This paper investigates the hierarchical clustering of N items based on a small subset of pairwise similarities, significantly less than the complete set of N(N-1)/2 similarities. First, we show that if the intracluster similarities exceed intercluster similarities, then it is possible to correctly determine the hierarchical clustering from as few as 3N log N similarities. We demonstrate that this order of magnitude savings in the number of pairwise similarities necessitates sequentially selecting which similarities to obtain in an adaptive fashion, rather than picking them at random. We then propose an active clustering method that is robust to a limited fraction of anomalous similarities, and show how even in the presence of these noisy similarity values we can resolve the hierarchical clustering using only O(N log² N) pairwise similarities.

    1 Introduction

Hierarchical clustering based on pairwise similarities arises routinely in a wide variety of engineering and scientific problems. These problems include inferring gene behavior from microarray data [1], Internet topology discovery [2], detecting community structure in social networks [3], advertising [4], and database management [5, 6]. It is often the case that there is a significant cost associated with obtaining each similarity value. For example, in the case of Internet topology inference, the determination of similarity values requires many probe packets to be sent through the network, which can place a significant burden on network resources. In other situations, the similarities may be the result of expensive experiments or require an expert human to perform the comparisons, again placing a significant cost on their collection.

Appearing in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS) 2011, Fort Lauderdale, FL, USA. Volume 15 of JMLR: W&CP 15. Copyright 2011 by the authors.

The potential cost of obtaining similarities motivates a natural question: Is it possible to reliably cluster items using less than the complete, exhaustive set of all pairwise similarities? We will show that the answer is yes, particularly under the condition that intracluster similarity values are greater than intercluster similarity values, which we will define as the Tight Clustering (TC) condition. We also consider extensions of the proposed approach to more challenging situations in which a significant fraction of intracluster similarity values may be smaller than intercluster similarity values. This allows for robust, provably correct clustering even when the TC condition does not hold uniformly.

The TC condition is satisfied in many situations. For example, the TC condition holds if the similarities are generated by a branching process (or tree structure) in which the similarity between items is a monotonically increasing function of the distance from the root to their nearest common branch point (ancestor). This sort of process arises naturally in clustering nodes in the Internet [7]. Also note that, for suitably chosen similarity metrics, the data can satisfy the TC condition even when the clusters have complex structures. For example, if the similarity between two points is defined in terms of the length of the longest edge on the shortest path between them on a nearest-neighbor graph, then the similarities satisfy the TC condition, provided the clusters do not overlap. Additionally, density-based similarity metrics [8] also allow for arbitrary cluster shapes while satisfying the TC condition.
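For concreteness, the following sketch (ours, not from the paper) computes the closely related minimax-path metric on a k-nearest-neighbor graph: for each pair of points it returns the smallest achievable "longest edge" over all connecting paths, which can then be negated or inverted to serve as a similarity. The function names, the k-NN construction, and the use of the minimax variant (rather than the exact "longest edge on the shortest path" phrasing above) are our assumptions for illustration.

```python
def minimax_edge_distances(points, k, dist):
    """Minimax 'longest edge' distances over a k-nearest-neighbor graph.

    d[i][j] is the smallest possible longest-edge length over all paths
    from i to j in the k-NN graph; invert or negate it for a similarity.
    """
    n = len(points)
    INF = float("inf")
    d = [[INF] * n for _ in range(n)]
    for i in range(n):
        d[i][i] = 0.0
        # k nearest neighbors of point i (index 0 is i itself, so skip it).
        nbrs = sorted(range(n), key=lambda j: dist(points[i], points[j]))[1:k + 1]
        for j in nbrs:                      # undirected k-NN edges
            w = dist(points[i], points[j])
            d[i][j] = min(d[i][j], w)
            d[j][i] = min(d[j][i], w)
    for mid in range(n):                    # Floyd-Warshall with max along a path
        for i in range(n):
            for j in range(n):
                via = max(d[i][mid], d[mid][j])
                if via < d[i][j]:
                    d[i][j] = via
    return d
```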

One natural approach is to attempt clustering using a small subset of randomly chosen pairwise similarities. However, we show that this is quite ineffective in general. We instead propose an active approach that sequentially selects similarities in an adaptive fashion, and thus we call the procedure active clustering. We show that under the TC condition, it is possible to reliably determine the unambiguous hierarchical clustering of N items using at most 3N log N of the total of N(N-1)/2 possible pairwise similarities. Since it is clear that we must obtain at least one similarity for each of the N items, this is about as good as one could hope to do. Then, to broaden the applicability of the proposed theory and method, we propose a robust active clustering methodology for situations where a random subset of the pairwise similarities are unreliable and therefore fail to meet the TC condition. In this case, we show how, using only O(N log² N) actively chosen pairwise similarities, we can still recover the underlying hierarchical clustering with high probability.

While there have been prior attempts at developing robust procedures for hierarchical clustering [9, 10, 11], these works do not try to optimize the number of similarity values needed to robustly identify the true clustering, and mostly require all O(N²) similarities. Other prior work has attempted to develop efficient active clustering methods [12, 13, 14], but the proposed techniques are ad hoc and do not provide any theoretical guarantees. Outside of the clustering literature, some interesting connections emerge between this problem and prior work on graphical model inference [15], which we exploit here.

2 The Hierarchical Clustering Problem

Let X = {x_1, x_2, ..., x_N} be a collection of N items. Our goal will be to resolve a hierarchical clustering of these items.

Definition 1. A cluster C is defined as any subset of X. A collection of clusters T is called a hierarchical clustering if ∪_{C_i ∈ T} C_i = X and, for any C_i, C_j ∈ T, only one of the following is true: (i) C_i ⊂ C_j, (ii) C_j ⊂ C_i, (iii) C_i ∩ C_j = ∅.
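As a concrete reading of Definition 1, the following small sketch (ours, purely illustrative) checks that a collection of clusters covers X and is laminar, i.e., every pair of clusters is either nested or disjoint.

```python
def is_hierarchical(clusters, X):
    """Check Definition 1: clusters cover X and are pairwise nested or disjoint."""
    clusters = [frozenset(c) for c in clusters]
    X = frozenset(X)
    covers = (frozenset().union(*clusters) if clusters else frozenset()) == X
    laminar = all(
        a <= b or b <= a or not (a & b)
        for i, a in enumerate(clusters)
        for b in clusters[i + 1:]
    )
    return covers and laminar
```

For example, is_hierarchical([{1, 2, 3, 4}, {1, 2}, {3, 4}, {1}, {2}, {3}, {4}], {1, 2, 3, 4}) returns True, while adding the overlapping cluster {2, 3} makes it return False.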

The hierarchical clustering T has the form of a tree, where each node corresponds to a particular cluster. The tree is binary if for every C_k ∈ T that is not a leaf of the tree, there exist proper subsets C_i and C_j of C_k such that C_i ∩ C_j = ∅ and C_i ∪ C_j = C_k. The binary tree is said to be complete if it has N leaf nodes, each corresponding to one of the individual items. Without loss of generality, we will assume that T is a complete (possibly unbalanced) binary tree, since any non-binary tree can be represented by an equivalent binary tree.

Let S = {s_{i,j}} denote the collection of all pairwise similarities between the items in X, with s_{i,j} denoting the similarity between x_i and x_j and assuming s_{i,j} = s_{j,i}. The traditional hierarchical clustering problem uses the complete set of pairwise similarities to infer T. In order to guarantee that T can be correctly identified from S, the similarities must conform to the hierarchy of T. We consider the following sufficient condition.

Definition 2. The triple (X, T, S) satisfies the Tight Clustering (TC) Condition if for every set of three items {x_i, x_j, x_k} such that x_i, x_j ∈ C and x_k ∉ C, for some C ∈ T, the pairwise similarities satisfy s_{i,j} > max(s_{i,k}, s_{j,k}).
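To make Definition 2 operational on small examples, here is a brute-force check (our sketch, not the authors' code) that every within-cluster pair is more similar than either member is to any item outside the cluster. The helper name and the similarity-oracle interface sim(a, b) are assumptions.

```python
from itertools import combinations

def satisfies_tc(clusters, sim):
    """Brute-force check of the TC condition (Definition 2) on a small instance."""
    ground_set = set().union(*map(set, clusters))
    for C in map(set, clusters):
        outside = ground_set - C
        for xi, xj in combinations(C, 2):
            for xk in outside:
                # TC requires the within-cluster pair to dominate both cross pairs.
                if sim(xi, xj) <= max(sim(xi, xk), sim(xj, xk)):
                    return False
    return True
```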

In words, the TC condition implies that the similarity between all pairs within a cluster is greater than the similarity with respect to any item outside the cluster. We can consider using off-the-shelf hierarchical clustering methodologies, such as bottom-up agglomerative clustering [16], on a set of pairwise similarities that satisfies the TC condition. Bottom-up agglomerative clustering is a recursive process that begins with singleton clusters (i.e., the N individual items to be clustered). At each step of the algorithm, the pair of most similar clusters is merged. The process is repeated until all items are merged into a single cluster. It is easy to see that if the TC condition is satisfied, then standard bottom-up agglomerative clustering algorithms such as single linkage, average linkage, and complete linkage will all produce T given the complete similarity matrix S. Various agglomerative clustering algorithms differ in how the similarity between two clusters is defined, but every technique requires all N(N-1)/2 pairwise similarity values, since all similarities must be compared at the very first step.
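As a point of reference, here is a minimal sketch (ours, not the authors' code) of single-linkage bottom-up agglomerative clustering on hashable items; note that the very first merge already forces an examination of all N(N-1)/2 pairwise similarities, which is exactly the cost the active approach below avoids.

```python
import itertools

def single_linkage(items, sim):
    """Bottom-up single-linkage clustering; returns a nested-tuple cluster tree."""
    clusters = [(x,) for x in items]          # start from singleton clusters
    trees = {c: c[0] for c in clusters}       # cluster (tuple of items) -> subtree
    while len(clusters) > 1:
        # Merge the pair of clusters with the largest cross-pair similarity.
        c1, c2 = max(
            itertools.combinations(clusters, 2),
            key=lambda pair: max(sim(a, b) for a in pair[0] for b in pair[1]),
        )
        clusters.remove(c1)
        clusters.remove(c2)
        merged = c1 + c2
        trees[merged] = (trees.pop(c1), trees.pop(c2))
        clusters.append(merged)
    return trees[clusters[0]]
```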

To properly cluster the items using fewer similarities requires a more sophisticated adaptive approach, where similarities are carefully selected in a sequential manner. Before contemplating such approaches, we first demonstrate that adaptivity is necessary, and that simply picking similarities at random will not suffice.

Proposition 1. Let T be a hierarchical clustering of N items and consider a cluster of size m in T, for some m ≤ N. If n pairwise similarities, with n …

… the TC condition guarantees that, for a triple {x_i, x_j, x_k} with x_i, x_j ∈ C and x_k ∉ C for some C ∈ T, s_{i,j} > max(s_{i,k}, s_{j,k}); therefore x_k is the outlier of the triple.
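The inequality above suggests a simple three-similarity test. The sketch below (ours; the function name and the sim(a, b) oracle are assumptions) declares as the outlier the item left out of the most similar pair; under the TC condition, when exactly two of the three items share a cluster, the declared outlier is the item outside that cluster.

```python
def outlier(i, j, k, sim):
    """Return the declared outlier of the triple {i, j, k} using 3 similarities."""
    candidates = [((i, j), k), ((i, k), j), ((j, k), i)]
    # The outlier is the item excluded from the most similar pair.
    _, out = max(candidates, key=lambda c: sim(*c[0]))
    return out
```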

In Theorem 3.1, we find that by combining our outlier test approach with the tree reconstruction algorithm of [15], we obtain an adaptive methodology (which we will refer to as OUTLIERcluster) that only requires on the order of N log N pairwise similarities to exactly reconstruct the hierarchical clustering T.

Theorem 3.1. Assume that the triple (X, T, S) satisfies the Tight Clustering (TC) condition, where T is a complete (possibly unbalanced) binary tree that is unknown. Then OUTLIERcluster recovers T exactly using at most 3N log_{3/2} N adaptively selected pairwise similarity values.

Proof. From Appendix II of [15], we find a methodology that requires at most N log_{3/2} N leadership tests to exactly reconstruct the unique binary tree structure of N items. Lemma 1 shows that under the TC condition, each leadership test can be performed using only 3 adaptively selected pairwise similarities. Therefore, we can reconstruct the hierarchical clustering T from a set of items X using at most 3N log_{3/2} N adaptively selected pairwise similarity values.
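The paper's OUTLIERcluster attains the 3N log_{3/2} N bound by pairing the outlier test with the balanced tree-reconstruction strategy of [15]. As a simpler illustration of how adaptively chosen outlier tests alone suffice under the TC condition, the sketch below (ours, not the authors' algorithm) inserts items one at a time into a nested-tuple tree, spending one test per level descended; it reuses the outlier() helper above and assumes items are non-tuple labels such as ints or strings.

```python
def first_leaf(tree):
    """Return one leaf item of a nested-tuple tree."""
    while isinstance(tree, tuple):
        tree = tree[0]
    return tree

def insert(tree, x, sim):
    """Insert item x into the tree using one adaptively chosen outlier test per level."""
    if not isinstance(tree, tuple):           # reached a leaf: pair x with it
        return (tree, x)
    left, right = tree
    rep_l, rep_r = first_leaf(left), first_leaf(right)
    out = outlier(rep_l, rep_r, x, sim)
    if out == x:                              # x splits off above this subtree
        return (tree, x)
    if out == rep_r:                          # x clusters within the left branch
        return (insert(left, x, sim), right)
    return (left, insert(right, x, sim))      # x clusters within the right branch

def insertion_cluster(items, sim):
    """Build a binary cluster tree by sequential insertion (illustrative only)."""
    tree = items[0]
    for x in items[1:]:
        tree = insert(tree, x, sim)
    return tree
```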

    3.1 Tight Clustering Experiments

In Table 1 we see the results of both clustering techniques (OUTLIERcluster and bottom-up agglomerative clustering) on various synthetic tree topologies satisfying the Tight Clustering (TC) condition. The performance is reported in terms of the number of pairwise similarities required by the agglomerative clustering methodology, denoted by n_agg, and the number of similarities required by our OUTLIERcluster method, n_outlier. The methodologies are performed on both a balanced binary tree of varying size (N = 128, 256, 512) and a synthetic Internet tree topology generated using the technique from [17]. As seen in the table, our technique resolves the underlying tree structure using at most 11% of the pairwise similarities required by the bottom-up agglomerative clustering approach. As the number of items in the topology increases, further improvements are seen using OUTLIERcluster. Because the pairwise similarities satisfy the TC condition, both methodologies resolve a binary representation of the underlying tree structure exactly.

Table 1: Comparison of OUTLIERcluster and Agglomerative Clustering on various topologies satisfying the Tight Clustering condition.

    Topology           Size       n_agg      n_outlier   n_outlier/n_agg
    Balanced Binary    N = 128      8,128        876         10.78%
                       N = 256     32,640      2,206          6.21%
                       N = 512    130,816      4,561          3.49%
    Internet           N = 768    294,528      8,490          2.88%

While OUTLIERcluster determines the correct clustering hierarchy when all the pairwise similarities are consistent with the hierarchy T, it can fail if one or more of the pairwise similarities are inconsistent. We find that with only two outlier tests erroneous at random, the clustering reconstruction produced by OUTLIERcluster is significantly corrupted. This can be attributed to the greedy construction of the clustering hierarchy used by this methodology: if one of the initial items is incorrectly placed in the hierarchy, the result is a cascading effect that drastically reduces the accuracy of the clustering.

    4 Robust Active Clustering

Suppose that most, but not all, of the outlier tests agree with T. This may occur if a subset of the similarities are in some sense inconsistent, erroneous, or anomalous. We will assume that a certain subset of the similarities produce correct outlier tests and the rest may not. These similarities that produce correct tests are said to be consistent with the hierarchy T. Our goal is to recover the clusters of T despite the fact that the similarities are not always consistent with it.

Definition 3. The subset of consistent similarities is denoted S_C ⊆ S. These similarities satisfy the following property: if s_{i,j}, s_{j,k}, s_{i,k} ∈ S_C, then outlier(x_i, x_j, x_k) returns the leader of the triple (x_i, x_j, x_k) in T (i.e., the outlier test is consistent with respect to T).

We adopt the following probabilistic model for S_C. Each similarity in S fails to be consistent independently with probability at most q < 1/2 (i.e., membership in S_C is determined by repeatedly tossing a biased coin). The expected cardinality of S_C is E[|S_C|] ≥ (1-q)|S|. Under this model, there is a large probability that one or more outlier tests will yield an incorrect leader with respect to T. Thus, our tree reconstruction algorithm in Section 3 will fail to recover the tree with large probability. We therefore pursue a different approach based on a top-down recursive clustering procedure that uses voting to overcome the effects of incorrect tests.

The key element of the top-down procedure is a robust algorithm for correctly splitting a given cluster in T into its two subclusters, presented in Algorithm 1. Roughly speaking, the procedure quantifies how frequently two items tend to agree on outlier tests drawn from a small random subsample of other items. If they tend to agree frequently, then they are clustered together; otherwise they are not. We show that this algorithm can determine the correct split of the input cluster C with high probability. The degree to which the split is balanced affects performance, and we need the following definition.

Definition 4. Let C be any non-leaf cluster in T and denote its subclusters by C_L and C_R; i.e., C_L ∩ C_R = ∅ and C_L ∪ C_R = C. The balance factor of C is β_C := min{|C_L|, |C_R|} / |C|.

Theorem 4.1. Let 0 < δ < 1 and threshold η ∈ (0, 1/2). Consider a cluster C ∈ T containing n items (|C| = n) with balance factor β_C and disjoint subclusters C_R and C_L, and assume the following conditions hold:

A1 - The pairwise similarities are consistent with probability at least 1 - q, for some q ≤ 1 - 1/√(2(1 - δ)).

A2 - q and η satisfy 1 - (1-q)² < η < (1-q)².

If m ≥ c_0 log(4n/δ) for a constant c_0 > 0 and n > 2m, then with probability at least 1 - δ the output of split(C, m, η) is the correct subclusters, C_R and C_L.
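As a quick numerical sanity check of conditions A1 and A2 as stated above (a worked example of ours, pairing the threshold η = 0.30 used in the experiments of Section 5 with illustrative values q = 0.1 and δ = 0.1):

```latex
% A1: q must satisfy q <= 1 - 1/sqrt(2(1-\delta)).
% With \delta = 0.1: 1 - 1/\sqrt{1.8} \approx 1 - 0.745 = 0.255, so q = 0.1 is admissible.
\[
  q = 0.1 \;\le\; 1 - \frac{1}{\sqrt{2(1-0.1)}} \;\approx\; 0.255 .
\]
% A2: \eta must lie strictly between 1-(1-q)^2 and (1-q)^2.
% With q = 0.1: (1-q)^2 = 0.81, so the admissible range is (0.19, 0.81),
% and the experimental threshold \eta = 0.30 qualifies.
\[
  1-(1-q)^2 = 0.19 \;<\; \eta = 0.30 \;<\; (1-q)^2 = 0.81 .
\]
```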

The proof of the theorem is given in the Appendix. The theorem above shows that the algorithm is guaranteed (with high probability) to correctly split clusters that are sufficiently large, for a certain range of q and η as specified by A2. A bound on the constant c_0 is given in Equation 3 in the proof, but the important fact is that it does not depend on n, the number of items in C. Thus all but the very smallest clusters can be reliably split. Note that the total number of similarities required by split is at most 3mn. So if we take m = c_0 log(4n/δ), the total is at most 3 c_0 n log(4n/δ). The key point is this: instead of using all O(n²) similarities, split only requires O(n log n).
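To give a sense of scale (our worked example, using natural logarithms and leaving the constant c_0 symbolic), take n = 512 and δ = 0.05:

```latex
\[
  3\,c_0\, n \log\!\left(\frac{4n}{\delta}\right)
  \;=\; 3\,c_0 \cdot 512 \cdot \log\!\left(\frac{2048}{0.05}\right)
  \;\approx\; 3\,c_0 \cdot 512 \cdot 10.6
  \;\approx\; 1.6\times 10^{4}\, c_0 ,
\]
% versus the full set of n(n-1)/2 = 130{,}816 pairwise similarities.
```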


Algorithm 1: split(C, m, η)

Input:
1. A single cluster C consisting of n items.
2. Parameters m < n/2 and η ∈ (0, 1/2).

Initialize:
1. Select two subsets S_V, S_A ⊆ C uniformly at random (with replacement), each containing m items.
2. Select a seed item x_j ∈ C uniformly at random and let C_j ∈ {C_R, C_L} denote the subcluster it belongs to.

Split:
For each x_i ∈ C and x_k ∈ S_A \ {x_i}, compute the outlier fraction of S_V:

    c_{i,k} := (1 / |S_V \ {x_i, x_k}|) · Σ_{x ∈ S_V \ {x_i, x_k}} 1{outlier(x_i, x_k, x) = x},

where 1{·} denotes the indicator function.

Compute the outlier agreement on S_A:

    a_{i,j} := (1 / |S_A \ {x_i, x_j}|) · Σ_{x_k ∈ S_A \ {x_i, x_j}} [ 1{c_{i,k} > η and c_{j,k} > η} + 1{c_{i,k} ≤ η and c_{j,k} ≤ η} ].

Assign x_i to the estimate of C_j if a_{i,j} ≥ 1/2, and to the estimate of the complementary subcluster otherwise.

Output: the two estimated subclusters of C (defined for clusters with |C| > 2m).
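A direct reading of Algorithm 1 in code may help. The sketch below (ours, not the authors' implementation) reuses the outlier() helper from the Section 3 sketch; the assignment threshold of 1/2 follows Lemma 3 in the Appendix, and the function signature is an assumption.

```python
import random

def split(C, m, eta, sim):
    """Sketch of Algorithm 1 (split): partition cluster C into two subclusters.

    C is a list of items, m the voting/agreement set size, eta the
    outlier-fraction threshold, and sim(a, b) a pairwise-similarity oracle.
    """
    S_V = [random.choice(C) for _ in range(m)]    # voting items (with replacement)
    S_A = [random.choice(C) for _ in range(m)]    # agreement items (with replacement)
    x_j = random.choice(C)                        # seed item

    def outlier_fraction(x_i, x_k):
        """c_{i,k}: fraction of voting items declared the outlier of (x_i, x_k, x)."""
        voters = [x for x in S_V if x not in (x_i, x_k)]
        if not voters:
            return 0.0
        return sum(outlier(x_i, x_k, x, sim) == x for x in voters) / len(voters)

    c_seed = {x_k: outlier_fraction(x_j, x_k) for x_k in set(S_A) if x_k != x_j}

    with_seed, without_seed = [x_j], []
    for x_i in C:
        if x_i == x_j:
            continue
        judges = [x_k for x_k in S_A if x_k not in (x_i, x_j)]
        # a_{i,j}: fraction of agreement items on which x_i and the seed x_j
        # fall on the same side of the threshold eta.
        agree = sum(
            (outlier_fraction(x_i, x_k) > eta) == (c_seed[x_k] > eta)
            for x_k in judges
        )
        a_ij = agree / len(judges) if judges else 1.0
        (with_seed if a_ij >= 0.5 else without_seed).append(x_i)
    return with_seed, without_seed
```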

The top-down procedure, RAcluster, recursively applies split: starting from the full item set, each recovered subcluster with more than 2m items is split again. Each call of split requires at most 3mn similarities, where n is the size of the cluster in that call. Now, if the splits are balanced, the depth of the complete cluster tree will be O(log N), with O(2^ℓ) calls to split at level ℓ involving clusters of size n = O(N/2^ℓ). An easy calculation then shows that the total number of similarities required by RAcluster is O(N log² N), compared to the total number of pairwise similarities, which is O(N²). The performance guarantee for the robust active clustering algorithm is summarized in the following main theorem.

Theorem 4.2. Let X be a collection of N items with underlying hierarchical clustering structure T. Let 0 < δ < 1 and ε = δ / (2 N^{1/log(1/(1-β))}). If m ≥ k_0 log(8N/δ) for a constant k_0 > 0 and conditions A1 and A2 of Theorem 4.1 hold, then with probability at least 1 - δ, RAcluster(X, m, η) uses O(N log² N) similarities and recovers all clusters C ∈ T that satisfy the following conditions:

• The cluster size satisfies |C| > 2m.
• All clusters in T that contain C have a balance factor of at least β.

The proof of the theorem is given in the Appendix. The constant k_0 is specified in Equation 4. Roughly speaking, the theorem implies that under the conditions of Theorem 4.1 we can robustly recover all clusters of size O(log N) or larger using only O(N log² N) similarities. Comparing this result to Theorem 3.1, we note three costs associated with being robust to inconsistent similarities: 1) we require O(N log² N) rather than O(N log N) similarity values; 2) the degree to which the clusters are balanced now plays a role (in the constant β); 3) we cannot guarantee the recovery of clusters smaller than O(log N), due to voting.
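Tying the pieces together, here is a minimal sketch (ours) of the top-down recursion described above, in the spirit of the RAcluster procedure referenced in Theorem 4.2: it calls the split() sketch recursively and stops on clusters of at most 2m items, mirroring the |C| > 2m condition. The handling of degenerate splits is our assumption.

```python
def ra_cluster(C, m, eta, sim):
    """Top-down recursive clustering built on split(); returns nested lists."""
    if len(C) <= 2 * m:
        return list(C)                    # too small to split reliably
    left, right = split(C, m, eta, sim)
    if not left or not right:             # degenerate split: keep the cluster intact
        return list(C)
    return [ra_cluster(left, m, eta, sim), ra_cluster(right, m, eta, sim)]
```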

    5 Robust Clustering Experiments

To test our robust clustering methodology, we focus on experimental results from a balanced binary tree using synthesized similarities and on real-world data sets: genetic microarray data ([18], with 7 expressions per gene and using Pearson correlation for pairwise similarity), breast tumor characteristics (via [19], with 10 features per tumor and using Pearson correlation for pairwise similarity), and phylogenetic sequences ([20], using the Needleman-Wunsch algorithm [21] for pairwise similarity between amino acid sequences). The synthetic binary tree experiments allow us to observe the characteristics of our algorithm while controlling the amount of inconsistency with respect to the Tight Clustering (TC) condition, while the real-world data gives us perspective on problems where the tree structure and TC condition are assumed, but not known.

In order to quantify the performance of the tree reconstruction algorithms, consider the (non-unique) partial ordering π : {1, 2, ..., N} → {1, 2, ..., N} resulting from the ordering of items in the reconstructed tree. For a set of observed similarities, given the original ordering of the items from the true tree structure, we would expect to find the largest similarity values clustered around the diagonal of the similarity matrix. Meanwhile, a random ordering of the items would have the large similarity values potentially scattered away from the diagonal. To assess the performance of our reconstructed tree structures, we consider the rate of decay of similarity values off the diagonal of the reordered items,

    s_d = (1/(N-d)) Σ_{i=1}^{N-d} s_{π(i), π(i+d)}.

Using s_d, we define a distribution over the average off-diagonal similarity values and compute the entropy of this distribution as follows:

    E(π) = - Σ_{i=1}^{N-1} p_i log p_i,    (2)

where p_i = s_i / Σ_{d=1}^{N-1} s_d. This entropy value provides a measure of the quality of the partial ordering induced by the tree reconstruction algorithm. For a balanced binary tree with N = 512, we find that for the original ordering, E(π_original) = 2.2323, and for a random ordering, E(π_random) = 2.702. This motivates examining the estimated Δ-entropy of our clustering reconstruction-based orderings, ΔE(π) = E(π_random) - E(π), where we normalize the reconstructed clustering entropy value with respect to a random permutation of the items. The quality of our clustering methodologies will be examined in these terms: the larger the estimated Δ-entropy, the higher the quality of the estimated clustering.
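For reference, the following sketch (ours) computes the off-diagonal decay averages s_d, the entropy E(π) of Equation (2), and the Δ-entropy for a given reordering; the log base and the matrix representation are our assumptions.

```python
import math

def ordering_entropy(S, order):
    """Entropy of the off-diagonal similarity decay for a reordered similarity matrix.

    S is an N x N symmetric similarity matrix (list of lists) and `order`
    a permutation of range(N) induced by a reconstructed tree.
    """
    N = len(order)
    # Average similarity at each off-diagonal distance d under the ordering (s_d).
    s = [
        sum(S[order[i]][order[i + d]] for i in range(N - d)) / (N - d)
        for d in range(1, N)
    ]
    total = sum(s)
    p = [v / total for v in s]
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def delta_entropy(S, order, random_order):
    """Delta-entropy: entropy of a random ordering minus entropy of `order`."""
    return ordering_entropy(S, random_order) - ordering_entropy(S, order)
```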

For the synthetic binary tree experiments, we created a balanced binary tree with 512 items. We generated similarities between each pair of items such that 100·(1-q)% of the pairwise similarities, chosen at random, are consistent with the TC condition (∈ S_C). The remaining 100·q% of the pairwise similarities were inconsistent with the TC condition. We examined the performance of both standard bottom-up agglomerative clustering and our robust clustering algorithm, RAcluster, for pairwise similarities with q = 0.05, 0.15, 0.25. The results presented here are averaged over 10 random realizations of noisy synthetic data, with the threshold set to η = 0.30. We used the similarity voting budget m = 80, which requires 65% of the complete set of similarities. Performance gains using our robust clustering approach are shown in Table 2, in terms of both the estimated Δ-entropy and r_min, the size of the smallest correctly resolved cluster (where all clusters of size r_min or larger are reconstructed correctly). Comparisons between Δ-entropy and r_min show a clear correlation between high Δ-entropy and high clustering reconstruction resolution.

Table 2: Clustering Δ-entropy results for a synthetic binary tree with N = 512, for Agglomerative Clustering and RAcluster.

              Agglo. Clustering        RAcluster (m = 80)
    q         Δ-Entropy   r_min        Δ-Entropy   r_min
    0.05      0.37        460.8        1.02          7.2
    0.15      0.09        512          1.02         15.2
    0.25      0.01        512          1.01         57.6

Our robust clustering methodology was then applied to the real-world data sets using the threshold η = 0.30 and similarity voting budgets m = 20 and m = 40. In addition to the quality of our cluster reconstruction (in terms of estimated Δ-entropy), the performance is also stated in terms of the number of pairwise similarities required by the agglomerative clustering methodology, denoted by n_agg, against the number of similarities required by our RAcluster method, n_robust. The results in Table 3 (averaged over 20 random permutations of the datasets) again show significant performance gains in terms of both the estimated Δ-entropy and the number of pairwise similarities required. Finally, in Figure 1 we show the reordered similarity matrices given both agglomerative clustering and our robust clustering methodology, RAcluster.


Table 3: Δ-entropy results for real-world datasets (gene microarray, breast tumor comparison, and phylogenetics) using both Agglomerative Clustering and the RAcluster algorithm.

                        Agglo.       RAcluster (m = 20)           RAcluster (m = 40)
    Dataset             Δ-Entropy    Δ-Entropy  n_robust/n_agg    Δ-Entropy  n_robust/n_agg
    Gene (N=500)        0.1561       0.1768     27%               0.1796     51%
    Gene (N=1000)       0.1484       0.1674     18%               0.1788     37%
    Tumor (N=400)       0.0574       0.0611     30%               0.0618     57%
    Tumor (N=600)       0.0578       0.0587     24%               0.0594     47%
    Phylo. (N=750)      0.0126       0.0141     21%               0.0143     41%
    Phylo. (N=1000)     0.0149       0.0103     16%               0.0151     35%

Figure 1: Reordered pairwise similarity matrices for the gene microarray data with N = 1000, using (A) Agglomerative Clustering and (B) Robust Clustering with m = 20 (requiring only 18.1% of the similarities). An ideal clustering would organize items so that the similarity matrix has dark blue (high similarity) clusters/blocks on the diagonal and light blue (low similarity) values off the diagonal blocks. The robust clustering (B) is clearly closer to this ideal than (A).

    6 Appendix

    6.1 Proof of Theorem 4.1

Since the outlier tests can be erroneous, we instead use a two-round voting procedure to correctly determine whether two items x_i and x_j are in the same subcluster or not. Please refer to Algorithm 1 for definitions of the relevant quantities. The following lemma establishes that the outlier fraction values c_{i,k} can reveal whether two items x_i, x_k are in the same subcluster or not, provided that the number of voting items m = |S_V| is large enough and the similarity s_{i,k} is consistent.

Lemma 2. Consider two items x_i and x_k. Under assumptions A1 and A2, and assuming s_{i,k} ∈ S_C, comparing the outlier fraction values c_{i,k} to a threshold η will correctly indicate whether x_i, x_k are in the same subcluster with probability at least 1 - ε_C/2, provided

    m ≥ log(4/ε_C) / ( 2 min{ (η - 1 + (1-q)²)², ((1-q)² - η)² } ).

Proof. Let ξ_{i,k} := 1{s_{i,k} ∈ S_C} be the event that the similarity between items x_i, x_k is in the consistent subset (see Definition 3). Under A1, the expected outlier fraction c_{i,k}, conditioned on x_i, x_k and ξ_{i,k}, can be bounded in two cases: when the two items belong to the same subcluster and when they do not:

    E[c_{i,k} | x_i, x_k ∈ C_L or x_i, x_k ∈ C_R, ξ_{i,k}] ≥ (1-q)²,
    E[c_{i,k} | x_i ∈ C_R, x_k ∈ C_L or x_i ∈ C_L, x_k ∈ C_R, ξ_{i,k}] ≤ 1 - (1-q)².

A2 stipulates a gap between the two bounds. Hoeffding's inequality ensures that, with high probability, c_{i,k} will not significantly deviate below/above the lower/upper bound. Thresholding c_{i,k} at a level η between the bounds will therefore correctly determine whether x_i and x_k are in the same subcluster or not. More precisely, if

    m ≥ log(4/ε_C) / ( 2 min{ (η - 1 + (1-q)²)², ((1-q)² - η)² } ),

then with probability at least 1 - ε_C/2 the threshold test correctly determines if the items are in the same subcluster.

Next, note that we cannot use the outlier fraction c_{i,j} directly to decide the placement of x_i, since the condition s_{i,j} ∈ S_C may not hold. In order to be robust to errors in s_{i,j}, we employ a second round of voting based on an independent set of m randomly selected agreement items, S_A. The agreement fraction a_{i,j} is the average number of times the item x_i agrees with the clustering decision of x_j on S_A.

Lemma 3. Consider the following procedure:

    x_i → C_j     if a_{i,j} ≥ 1/2,
    x_i → C_j^c   if a_{i,j} < 1/2.

Under assumptions A1 and A2, with probability at least 1 - ε_C/2, the above procedure based on m = |S_A| agreement items will correctly determine if the items x_i, x_j are in the same subcluster, provided

    m ≥ log(4/ε_C) / ( 2 ((1 - ε_C)(1-q)² - 1/2)² ).

Proof. Define ξ_{i,j} as the event that the similarities s_{i,k}, s_{j,k} are both consistent (i.e., s_{i,k}, s_{j,k} ∈ S_C) and that thresholding the outlier fractions c_{i,k}, c_{j,k} at level η correctly indicates whether the underlying items belong to the same subcluster or not. Using Lemma 2 and the union bound, the conditional expectations of the agreement fraction a_{i,j} can be bounded as

    E[a_{i,j} | x_i ∉ C_j] ≤ P(ξ_{i,j}^c) ≤ 1 - (1-q)²(1 - ε_C),
    E[a_{i,j} | x_i ∈ C_j] ≥ P(ξ_{i,j}) ≥ (1-q)²(1 - ε_C).

Since q ≤ 1 - 1/√(2(1 - δ)) and ε_C = δ/n (as defined below), there is a gap between these two bounds that includes the value 1/2. Hoeffding's inequality ensures that with high probability a_{i,j} will not significantly deviate above/below the upper/lower bound. Thus, thresholding a_{i,j} at 1/2 will resolve whether the two items x_i, x_j are in the same or different subclusters with probability at least 1 - ε_C/2, provided

    m ≥ log(4/ε_C) / ( 2 ((1 - ε_C)(1-q)² - 1/2)² ).

By combining Lemmas 2 and 3, we can state the following. The split methodology of Algorithm 1 will successfully determine whether two items x_i, x_j are in the same subcluster with probability at least 1 - ε_C under assumptions A1 and A2, provided

    m ≥ max{ log(4/ε_C) / ( 2 ((1 - ε_C)(1-q)² - 1/2)² ),
             log(4/ε_C) / ( 2 min{ (η - 1 + (1-q)²)², ((1-q)² - η)² } ) }

and the cluster under consideration has at least 2m items.

In order to successfully determine the subcluster assignments for all n items of the cluster C with probability at least 1 - δ, we set ε_C = δ/n (i.e., taking the union bound over all n items). Thus we have the requirement m ≥ c_0(β, η, q, δ) log(4n/δ), where the constant obeys

    c_0(β, η, q, δ) ≥ max{ 1 / ( 2 ((1 - δ)(1-q)² - 1/2)² ),
                           1 / ( 2 min{ (η - 1 + (1-q)²)², ((1-q)² - η)² } ) }.    (3)

Finally, this result and assumptions A1-A2 imply that the algorithm split(C, m, η) correctly determines the two subclusters of C with probability at least 1 - δ.

    6.2 Proof of Theorem 4.2

Lemma 4. A binary tree with N leaves and balance factor β has depth at most L ≤ log N / log(1/(1-β)).

Proof. Consider a binary tree structure with N leaves (items) and balance factor β ≤ 1/2. After depth ℓ, the number of items in the largest cluster is bounded by (1-β)^ℓ N. If L denotes the maximum depth level, then since there can be only 1 item in the largest cluster after depth L, we have 1 ≤ (1-β)^L N.
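Spelling out the one-line argument (our expansion), and specializing to a perfectly balanced tree:

```latex
% The largest cluster at depth L still contains at least one item, so
\[
  (1-\beta)^{L} N \;\ge\; 1
  \;\Longrightarrow\;
  L \;\le\; \frac{\log N}{\log\!\left(\tfrac{1}{1-\beta}\right)} .
\]
% For a perfectly balanced tree, \beta = 1/2 and the bound reduces to
\[
  L \;\le\; \frac{\log N}{\log 2} \;=\; \log_{2} N .
\]
```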

The entire hierarchical clustering can be resolved if all the clusters are resolved correctly. With a maximum depth of L, the total number of clusters M in the hierarchy is bounded by

    M ≤ Σ_{ℓ=0}^{L} 2^ℓ ≤ 2^{L+1} ≤ 2 N^{1/log(1/(1-β))},

using the result of Lemma 4. Therefore, the probability that some cluster in the hierarchy is not resolved is at most M ε ≤ 2 N^{1/log(1/(1-β))} ε (where split succeeds with probability > 1 - ε). Therefore, all clusters that satisfy conditions A1 and A2 of Theorem 4.1 and have size > 2m can be resolved with probability ≥ 1 - δ by setting ε = δ / (2 N^{1/log(1/(1-β))}); from the proof of Theorem 4.1 we define

    m = k_0(β, η, q, δ) log(8N/δ),   where

    k_0(β, η, q, δ) ≥ c_0(β, η, q, δ) · ( 1 + 1/log(1/(1-β)) ).    (4)

Given this choice of m, we find that the RAcluster methodology in Algorithm 2, applied to a set of N items, will resolve all clusters that satisfy A1 and A2 of Theorem 4.1 and have size > 2m, with probability at least 1 - δ.

Furthermore, the algorithm only requires O(N log² N) total pairwise similarities. By running the RAcluster methodology, each item will have the split methodology performed on it at most log N / log(1/(1-β)) times (i.e., once for each depth level of the hierarchy). If m = k_0(β, η, q, δ) log(8N/δ) for RAcluster, each call to split will require only 3 k_0(β, η, q, δ) log(8N/δ) pairwise similarities per item. Given N total items, we find that the RAcluster methodology requires at most

    3 k_0(β, η, q, δ) · N · (log N / log(1/(1-β))) · log(8N/δ)

pairwise similarities.

    References

[1] H. Yu and M. Gerstein, "Genomic Analysis of the Hierarchical Structure of Regulatory Networks," in Proceedings of the National Academy of Sciences, vol. 103, 2006, pp. 14,724-14,731.

[2] J. Ni, H. Xie, S. Tatikonda, and Y. R. Yang, "Efficient and Dynamic Routing Topology Inference from End-to-End Measurements," in IEEE/ACM Transactions on Networking, vol. 18, February 2010, pp. 123-135.

[3] M. Girvan and M. Newman, "Community Structure in Social and Biological Networks," in Proceedings of the National Academy of Sciences, vol. 99, pp. 7821-7826.

[4] R. K. Srivastava, R. P. Leone, and A. D. Shocker, "Market Structure Analysis: Hierarchical Clustering of Products Based on Substitution-in-Use," in The Journal of Marketing, vol. 45, pp. 38-48.

[5] S. Chaudhuri, A. Sarma, V. Ganti, and R. Kaushik, "Leveraging Aggregate Constraints for Deduplication," in Proceedings of SIGMOD Conference 2007, pp. 437-448.

[6] A. Arasu, C. Re, and D. Suciu, "Large-Scale Deduplication with Constraints Using Dedupalog," in Proceedings of ICDE 2009, pp. 952-963.

[7] R. Ramasubramanian, D. Malkhi, F. Kuhn, M. Balakrishnan, and A. Akella, "On The Treeness of Internet Latency and Bandwidth," in Proceedings of ACM SIGMETRICS Conference, Seattle, WA, 2009.

[8] Sajama and A. Orlitsky, "Estimating and Computing Density-Based Distance Metrics," in Proceedings of the 22nd International Conference on Machine Learning (ICML), 2005, pp. 760-767.

[9] M. Balcan and P. Gupta, "Robust Hierarchical Clustering," in Proceedings of the Conference on Learning Theory (COLT), July 2010.

[10] G. Karypis, E. Han, and V. Kumar, "CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling," in IEEE Computer, vol. 32, 1999, pp. 68-75.

[11] S. Guha, R. Rastogi, and K. Shim, "ROCK: A Robust Clustering Algorithm for Categorical Attributes," in Information Systems, vol. 25, July 2000, pp. 345-366.

[12] T. Hofmann and J. M. Buhmann, "Active Data Clustering," in Advances in Neural Information Processing Systems (NIPS), 1998, pp. 528-534.

[13] T. Zoller and J. Buhmann, "Active Learning for Hierarchical Pairwise Data Clustering," in Proceedings of the 15th International Conference on Pattern Recognition, vol. 2, 2000, pp. 186-189.

[14] N. Grira, M. Crucianu, and N. Boujemaa, "Active Semi-Supervised Fuzzy Clustering," in Pattern Recognition, vol. 41, May 2008, pp. 1851-1861.

[15] J. Pearl and M. Tarsi, "Structuring Causal Trees," in Journal of Complexity, vol. 2, 1986, pp. 60-77.

[16] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning. Springer, 2001.

[17] L. Li, D. Alderson, W. Willinger, and J. Doyle, "A First-Principles Approach to Understanding the Internet's Router-Level Topology," in Proceedings of ACM SIGCOMM Conference, 2004, pp. 3-14.

[18] J. DeRisi, V. Iyer, and P. Brown, "Exploring the metabolic and genetic control of gene expression on a genomic scale," in Science, vol. 278, October 1997, pp. 680-686.

[19] A. Frank and A. Asuncion, "UCI Machine Learning Repository," 2010. [Online]. Available: http://archive.ics.uci.edu/ml

[20] R. Finn, J. Mistry, et al., "The Pfam Protein Families Database," in Nucleic Acids Research, vol. 38, 2010, pp. 211-222.

[21] S. Needleman and C. Wunsch, "A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins," in Journal of Molecular Biology, vol. 48, 1970, pp. 443-453.