

SDM’18 - Graph sketching-based Massive Data Clustering

Anne Morvan 1,2, Krzysztof Choromanski 3, Cédric Gouy-Pailler 1 and Jamal Atif 2
1 CEA, LIST; 2 Université Paris-Dauphine, PSL Research University, CNRS, UMR 7243, LAMSADE; 3 Google Brain Robotics

Objectives

We present a new clustering algorithm DBMSTClu providing a solution to the following issues: 1) detecting arbitrary-shaped data clusters, 2) with no parameter, 3) in a space-efficient manner by working on a limited number of linear measurements, a sketched version of the dissimilarity data graph G.

Steps of the method

1 The dissimilarity data graph G with N nodes is handled as a stream of edge weight updates and sketched in one pass into a compact structure with space cost O(N polylog(N)), cf. the method from [1] relying on the ℓ0-sampling principle [2].

2 From the graph sketch, an Approximate Minimum Spanning Tree (AMST) T is recovered containing N − 1 weighted edges s.t. for all i ∈ [N − 1], the weights w_i ∈ (0, 1]. An MST is good for expressing the underlying structure of a graph.

3 Without any parameter, DBMSTClu performs successive edge cuts in T which create new connected components that can be seen as clusters. At each iteration, a cut is chosen as the one maximizing a criterion named Density-Based Validity Index of a Clustering partition (DBCVI), based on Dispersion and Separation defined on each connected component.
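The three steps above can be sketched end-to-end in miniature. The snippet below substitutes an exact MST (Kruskal with union-find) for the AMST actually recovered from the graph sketch of [1], so it illustrates only the data flow, not the space guarantee; the toy edge list is an assumption.

```python
# Stand-in for step 2: exact Kruskal MST instead of the sketch-based AMST.
def kruskal_mst(n, edges):
    """edges: list of (weight, u, v) with weights in (0, 1]; returns MST edges."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    mst = []
    for w, u, v in sorted(edges):          # lightest edges first
        ru, rv = find(u), find(v)
        if ru != rv:                       # keep edge only if it joins components
            parent[ru] = rv
            mst.append((w, u, v))
            if len(mst) == n - 1:
                break
    return mst

# Toy dissimilarity graph on 4 nodes (hypothetical weights).
edges = [(0.1, 0, 1), (0.9, 1, 2), (0.2, 2, 3), (0.8, 0, 3), (0.5, 1, 3)]
T = kruskal_mst(4, edges)
print(T)  # three edges spanning the four nodes
```

The tree T produced here plays the role of the AMST that step 3 then cuts.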

Cluster Dispersion & Separation

• The Dispersion of cluster C_i (DISP) is defined as the maximum edge weight of C_i:

∀i ∈ [K], DISP(C_i) = max_{e_j ∈ E(C_i)} w(e_j) if |E(C_i)| ≠ 0, and 0 otherwise.

• The Separation of cluster C_i (SEP) is defined as the minimum distance between nodes of C_i and nodes of other clusters C_j, i ≠ j, i, j ∈ [K]. Cuts(C_i) is the set of edges incident to C_i:

∀i ∈ [K], SEP(C_i) = min_{e_j ∈ Cuts(C_i)} w(e_j) if K ≠ 1, and 1 otherwise.

bb

b

b

bb

b

b bb

b

b

SEP(C1)

DISP(C1)

C1

Figure 1: SEP and DISP for cluster C1 in T (N = 12, K = 3).

Validity Index of Cluster & Clustering Partition

• The Validity Index of cluster C_i, i ∈ [K] is defined as:

V_C(C_i) = (SEP(C_i) − DISP(C_i)) / max(SEP(C_i), DISP(C_i))

• The Density-Based Validity Index of a Clustering partition Π = {C_1, . . . , C_K}, DBCVI(Π), is defined as the weighted average of the Validity Indices of all clusters in the partition.

DBCVI(Π) = Σ_{i=1}^{K} (|C_i| / N) · V_C(C_i)
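The two formulas above transcribe directly; the cluster sizes and SEP/DISP values below are illustrative assumptions, not poster data.

```python
def validity_index(sep, disp):
    """V_C of one cluster: in [-1, 1], positive when SEP exceeds DISP."""
    return (sep - disp) / max(sep, disp)

def dbcvi(cluster_sizes, seps, disps):
    """Weighted average of per-cluster validity indices."""
    n = sum(cluster_sizes)
    return sum(s / n * validity_index(sep, disp)
               for s, sep, disp in zip(cluster_sizes, seps, disps))

# Two clusters of 6 nodes each, well separated (SEP >> DISP) -> DBCVI near 1.
print(dbcvi([6, 6], seps=[0.9, 0.9], disps=[0.2, 0.3]))  # ≈ 0.72
```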

[Flowchart: Input T → compute the DBCVI for each candidate cut → does the best cut improve the DBCVI? If yes, apply the best cut on T and iterate; if no, return the clustering partition.]

Figure 2: The DBMSTClu algorithm.

To whom correspondence should be addressed: [email protected]. Anne is partly supported by the Direction Générale de l’Armement (French Ministry of Defense).
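Assuming the tree is given as an explicit edge list, the greedy loop of Figure 2 can be sketched as follows. This is a naive reconstruction that re-evaluates every candidate cut at each iteration (the linear-time tricks below are omitted), and the toy path graph is an assumption.

```python
def components(n, edges, cut):
    """Connected components of the tree after removing the edges in `cut`."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v, w in edges:
        if (u, v) not in cut:
            parent[find(u)] = find(v)
    label, ids = {}, {}
    for x in range(n):
        label[x] = ids.setdefault(find(x), len(ids))
    return label, len(ids)

def dbcvi(n, edges, cut):
    """DBCVI of the partition induced by cutting `cut` out of the tree."""
    label, K = components(n, edges, cut)
    disp, sep = [0.0] * K, [1.0] * K
    for u, v, w in edges:
        i, j = label[u], label[v]
        if (u, v) in cut:                    # cut edge: contributes to SEP
            sep[i] = min(sep[i], w)
            sep[j] = min(sep[j], w)
        else:                                # internal edge: contributes to DISP
            disp[i] = max(disp[i], w)
    size = [0] * K
    for x in range(n):
        size[label[x]] += 1
    vc = [(s - d) / max(s, d) for s, d in zip(sep, disp)]
    return sum(size[i] / n * vc[i] for i in range(K))

def dbmstclu(n, edges):
    """Greedy loop of Fig. 2: keep cutting while the best cut improves the DBCVI."""
    cut, score = set(), dbcvi(n, edges, set())
    while True:
        cands = [(dbcvi(n, edges, cut | {(u, v)}), (u, v))
                 for u, v, w in edges if (u, v) not in cut]
        if not cands:
            return components(n, edges, cut)[0], score
        best, e = max(cands)
        if best <= score:
            return components(n, edges, cut)[0], score
        cut.add(e)
        score = best

# Path 0-1-2-3-4-5 with one heavy edge: the cut lands on the 0.9 edge.
T = [(0, 1, 0.1), (1, 2, 0.2), (2, 3, 0.9), (3, 4, 0.2), (4, 5, 0.1)]
labels, score = dbmstclu(6, T)
print(labels)  # nodes 0-2 and 3-5 fall in two clusters
```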

Tricks for a linear time implementation

1 For a performed cut in cluster C_i, the V_C(C_j) for any j ≠ i remain unchanged.

2 SEP and DISP exhibit a directional recurrence relationship in T: knowing these values for a given cut, we can deduce them for a neighboring cut, left and right (cf. Fig. 3), by a Double Depth-First Search.
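Trick 2 can be illustrated with two tree traversals: a bottom-up pass computes the dispersion below every edge, and a top-down pass the dispersion above it, so the DISP of both sides of all N − 1 possible cuts comes out of two sweeps. This is an assumed reconstruction, not the authors' code; for brevity the sibling scan below is quadratic in node degree, whereas the poster's Double DFS is strictly linear.

```python
from collections import defaultdict

def all_cut_dispersions(edges):
    """edges: list of (u, v, w) forming a tree that contains node 0."""
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))

    # One DFS from the root fixes parents and an order with parents first.
    parent, pw, order, stack = {0: None}, {}, [], [0]
    while stack:
        u = stack.pop()
        order.append(u)
        for v, w in adj[u]:
            if v not in parent:
                parent[v], pw[v] = u, w
                stack.append(v)

    # Bottom-up: down[v] = max edge weight strictly inside subtree(v).
    down = {u: 0.0 for u in parent}
    for u in reversed(order):
        if parent[u] is not None:
            p = parent[u]
            down[p] = max(down[p], pw[u], down[u])

    # Top-down: rest[v] = max edge weight outside subtree(v), parent edge excluded.
    rest = {0: 0.0}
    for u in order:
        for v, w in adj[u]:
            if parent.get(v) == u:
                others = [rest[u]]
                if parent[u] is not None:
                    others.append(pw[u])
                others += [max(w2, down[c]) for c, w2 in adj[u]
                           if parent.get(c) == u and c != v]
                rest[v] = max(others)

    # Cutting the parent edge of v leaves sides with DISP rest[v] and down[v].
    return {(parent[v], v): (rest[v], down[v])
            for v in parent if parent[v] is not None}

T = [(0, 1, 0.2), (1, 2, 0.9), (2, 3, 0.3)]   # toy path 0-1-2-3
print(all_cut_dispersions(T))
# {(0, 1): (0.0, 0.9), (1, 2): (0.2, 0.3), (2, 3): (0.9, 0.0)}
```

Separation admits an analogous pair of recurrences, as the poster notes.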


Figure 3: Recursive relationship for left and right Dispersions resulting from the cut of edge e: DISP_left(e) = max(w(S_1)), DISP_right(e) = max(w(S_2), w(S_3)), where w(.) returns the edge weights of the subtree in parameter. Separation works analogously.

Experimental results

1 Safety of the sketching:


Figure 4: Noisy moons clustered by SEMST, DBSCAN (ε = 0.15, minPts = 5), and DBMSTClu with AMST.

2 Scalability: experiments within the Stochastic Block Model

K\N        1000    10000    50000   100000   250000   500000   750000   1000000
5          0.34     2.96    14.37    28.91    73.04   148.85   218.11    292.25
20         0.95     8.73    43.71    88.51   223.18   449.37   669.29    889.88
100        4.36    40.25   201.76   398.41   995.42  2011.79  3015.61   4016.13
“100/5"   12.82    13.60    14.04    13.78    13.63    13.52    13.83     13.74

Table 1: DBMSTClu’s execution time (in s) varying N and K (avg. on 5 runs).
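The last row of Table 1 (“100/5") is the ratio of the K = 100 to the K = 5 timings; its near-constant value (≈13–14) shows the runtime scales the same way in N for both K. The row can be recomputed from the table:

```python
# K = 5 and K = 100 rows of Table 1, in seconds, for N = 1000 ... 1000000.
t5   = [0.34, 2.96, 14.37, 28.91, 73.04, 148.85, 218.11, 292.25]
t100 = [4.36, 40.25, 201.76, 398.41, 995.42, 2011.79, 3015.61, 4016.13]
ratios = [round(a / b, 2) for a, b in zip(t100, t5)]
print(ratios)
# [12.82, 13.6, 14.04, 13.78, 13.63, 13.52, 13.83, 13.74]
```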


Figure 5: Visualization of Table 1, exhibiting the linear growth of execution time with N.

Figure 6: DBMSTClu applied on the real dataset mushroom (N = 8124) detects 23 clusters in 3.36 s, while DBSCAN requires 9 s.

Conclusion and perspectives

• We introduced a novel space-efficient density-based clustering algorithm working solely on an MST, without any parameter. Its robustness has been assessed by using as input an approximate MST, retrieved from the dissimilarity graph sketch, rather than an exact one.

• Further work would be to 1) apply DBMSTClu to privacy-preserving settings, 2) adapt both the MST recovery and DBMSTClu to the fully online setting by updating the current MST and clustering partition as new edge weight updates arrive.

[1] Kook Jin Ahn, Sudipto Guha, and Andrew McGregor. Analyzing graph structure via linear measurements. In Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’12, pages 459–467, Philadelphia, PA, USA, 2012. Society for Industrial and Applied Mathematics.

[2] Graham Cormode and Donatella Firmani. A unifying framework for ℓ0-sampling algorithms. Distributed and Parallel Databases, 32(3):315–335, 2014. Special issue on Data Summarization on Big Data.