Constraint-Driven Clustering


Transcript of Constraint-Driven Clustering

1

Constraint-Driven Clustering

Rong Ge1, Martin Ester1, Wen Jin1, Ian Davidson2

Presenter: Rong Ge
1 Simon Fraser University
2 University of California - Davis

2

Introduction

Clustering methods aim at grouping data objects into clusters based on some criteria; they can be either data-driven or need-driven [Banerjee'06].

Data-driven methods discover the true structure of the underlying data by grouping similar data objects together.

Need-driven methods group data objects based on not only similarity but also application needs, and discover more actionable clusters.

3

Capturing Application Needs

Two methodologies:

Design sophisticated objective functions based on business needs. E.g., in catalog segmentation, clustering results are evaluated by their utility in decision making [Kleinberg et al.'99].

Capture application needs by constraints. E.g., discovering balanced customer groups in market segmentation [Ghosh et al.'02].

Yet, existing models often require users to provide the number of clusters, which is often unknown or not suited to application needs.

4

Constraint-Driven Clustering

Constraint-Driven Clustering:

Utilizes constraints to control cluster formation. Discovers an arbitrary number of clusters.

Goals: discover compact clusters; satisfy all constraints.

Two constraint types (cluster-level constraints):

Minimum significance constraint: specifies the minimum number of objects in a cluster.

Minimum variance constraint: specifies the minimum variance of a cluster.
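The two cluster-level constraints can be stated concretely. The sketch below (function name and the 1-d simplification are ours, not from the slides) checks whether a single cluster satisfies a minimum significance and/or a minimum variance constraint:

```python
import statistics

def satisfies_constraints(cluster, min_sig=None, min_var=None):
    """Check the two cluster-level constraints on a list of 1-d points."""
    if min_sig is not None and len(cluster) < min_sig:
        return False  # fewer objects than the minimum significance
    if min_var is not None and statistics.pvariance(cluster) < min_var:
        return False  # population variance below the minimum variance
    return True

# Four tightly packed points pass Sig = 3 but fail a variance threshold of 0.5.
print(satisfies_constraints([1.0, 1.1, 0.9, 1.0], min_sig=3))                # True
print(satisfies_constraints([1.0, 1.1, 0.9, 1.0], min_sig=3, min_var=0.5))  # False
```

A cluster is feasible only when every constraint in the given set holds, which is why the CDC model later requires each cluster in the partition to pass such a check.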

5

Motivation - Energy Aware Sensor Networks

Goal: minimize energy consumption.

Solution: group sensors into clusters. A master node is selected from the sensors in a cluster, or deployed. Other sensors communicate with the outside through the master nodes.

Constraint-Driven Clustering:

Minimum significance constraint: balances the workload of master nodes.

Minimum variance constraint: allows sensor clusters to be balanced in terms of energy consumption.

[Figure: a sensor network with a command node, sensors, master nodes, and communication channels]

6

Motivation - Privacy Preservation

Goal: publish personal records without a privacy breach.

Solution: group records into clusters and release the summary of each cluster to the public.

Constraint-Driven Clustering:

Minimum significance constraint: similar to k-anonymity in preserving individual privacy.

Minimum variance constraint: the variance translates into the width of the confidence interval of the adversary's estimate, and prevents similar, even identical, records from being released.
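The "release the summary of each cluster" step can be sketched in the style of condensation-based privacy preservation: only aggregate statistics of each cluster are published, never the individual records. The function name and the single-attribute example are ours:

```python
import statistics

def summarize_cluster(records):
    """Release only aggregate statistics of a cluster, not the raw records."""
    return {
        "count": len(records),                      # minimum significance bounds this
        "mean": statistics.fmean(records),
        "variance": statistics.pvariance(records),  # minimum variance bounds this
    }

ages = [34, 41, 38, 45, 52]  # hypothetical sensitive attribute values
print(summarize_cluster(ages))
```

A larger "count" gives k-anonymity-style protection, while a larger "variance" widens the confidence interval of any estimate an adversary can form about an individual in the cluster.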

7

Related Work

Clustering with cluster-level constraints:

Constrained k-means algorithm [Bradley et al.'00].

The existential constraint [Tung et al.'01]: specifies the minimum number of objects in a subset of the input data; a general form of the minimum significance constraint. Different from our model: k is specified.

K-Anonymity [Samarati et al.'98][Sweeney et al.'02]: each record is indistinguishable from k-1 other records; on categorical data.

PPMicroCluster [Jin et al.'06]: minimum significance and minimum radius constraints; the constraint is posed on the radius of a cluster; did not analyze the complexity of the clustering model.

8

Constraint-Driven Clustering (CDC)

Given a set of points P and a set of constraints C, partition P into disjoint clusters {P1, ..., Pm} such that:

Each cluster satisfies all constraints.

The sum of squared distances of data points to their corresponding cluster representatives is minimized.

Constraints: for each cluster Pi, 1 ≤ i ≤ m, |Pi| ≥ Sig and/or Var(Pi) ≥ Var.

Our model searches for clusters which are balanced in terms of cardinality and/or variance.
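The objective above can be made concrete. A minimal sketch of the SSE objective for 2-d points, using the mean vector as the cluster representative (one of the two representative choices in the model; function names are ours):

```python
def mean_vector(cluster):
    """Mean vector of a cluster of 2-d points."""
    n = len(cluster)
    return (sum(x for x, _ in cluster) / n, sum(y for _, y in cluster) / n)

def sse(partition, representatives):
    """Sum of squared distances of points to their cluster representatives."""
    total = 0.0
    for cluster, rep in zip(partition, representatives):
        for x, y in cluster:
            total += (x - rep[0]) ** 2 + (y - rep[1]) ** 2
    return total

clusters = [[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0), (5.0, 7.0)]]
reps = [mean_vector(c) for c in clusters]
print(sse(clusters, reps))  # each point is distance 1 from its mean -> 4.0
```

CDC minimizes this quantity over all partitions whose clusters each satisfy the given constraints; with the medoid as representative, `mean_vector` would be replaced by the cluster member minimizing the same sum.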

9

Theoretical Results

Note that the CDC problem has feasible solutions as long as the whole data set satisfies given constraints

Variant      Constraints         Cluster representative
Sig-CDC      Sig > 1, Var = 0    Medoid
-Sig-CDC     Sig > 1, Var = 0    Mean vector
Var-CDC      Sig = 1, Var > 0    Medoid
-Var-CDC     Sig = 1, Var > 0    Mean vector

Complexity: all variants are NP-hard (by a reduction from PLANAR X3C).

10

Heuristic Algorithm

Intuition: the generated clusters must be balanced, and the membership assignment of each point depends on its close neighbors.

Data structure: CD-Tree. It helps to retrieve close neighbors easily; a solution to the CDC problem is obtained by post-processing the leaf nodes.

Two parameters:

Significance parameter S (S = Sig)

Variance parameter V (V = Var)

11

CD-Tree

Leaf nodes: each entry contains an individual data point; capacity and variance are upper-bounded.

Max capacity: 2S - 1 (in an optimal solution, no cluster consists of more than 2S - 1 data objects).

Max variance: 2V (keeps leaf nodes compact so that the SSE is minimized).

Non-leaf nodes: each entry contains pointers to child nodes and summaries of the points in the child nodes, and corresponds to the subtree rooted at the child node. Max capacity Z (a constant, can be set arbitrarily).
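The leaf-node bounds above can be sketched as follows. This is our own minimal illustration (the class and method names are not from the slides): a leaf stores individual points, holds at most 2S - 1 of them, and only absorbs a point if the resulting variance stays within 2V:

```python
import statistics

class CDLeaf:
    """Sketch of a CD-Tree leaf node with the two upper bounds
    from the slides: max capacity 2S - 1 and max variance 2V."""

    def __init__(self, sig, var):
        self.max_capacity = 2 * sig - 1
        self.max_variance = 2 * var
        self.points = []  # individual 1-d data points for simplicity

    def can_absorb(self, p):
        """Would inserting p keep this leaf within both bounds?"""
        candidate = self.points + [p]
        if len(candidate) > self.max_capacity:
            return False  # would exceed 2S - 1 entries
        return statistics.pvariance(candidate) <= self.max_variance

leaf = CDLeaf(sig=3, var=1.0)   # at most 5 points, variance at most 2.0
leaf.points = [1.0, 1.2, 0.8]
print(leaf.can_absorb(1.1))   # a nearby point fits
print(leaf.can_absorb(9.0))   # a far point would blow the variance bound
```

When a point cannot be absorbed by any suitable leaf, the tree either starts a new leaf or triggers the split procedure described later.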

12

CD-Tree vs. CF-Tree and R*-Tree

CF-Tree: does not save individual data points; no max capacity is specified for leaf nodes.

R*-Tree: no max variance is specified for leaf nodes.

Neither the CF-Tree nor the R*-Tree is designed for generating clusters that satisfy constraints.

CD-Tree: one CD-Tree is built for a set of constraints. When a constraint value is changed slightly, a solution can be obtained by post-processing the leaf nodes.

13

Algorithm

Two steps:

Build the CD-Tree (insertion and split).

Post-process the leaf nodes to solve the CDC problem.

[Figure: CD-Tree insertion example with S = 5, showing the root, non-leaf entries nll and nlr, and leaf nodes l1, l2, l3]

14

Experimental Results

Comparison partner: the PPMicroCluster algorithm. It has a similar problem definition, can be adapted to handle the minimum variance constraint, and is a static algorithm.

Data sets:

Synthetic data set (DS1): 5000 2-d data points simulating uniformly deployed sensors.

Two real UCI data sets (Abalone and Letter).

15

Results on Synthetic data set

Results for the DS1 dataset (Only Significance Constraints are Specified)

16

Results on Letter data set

Results for the Letter dataset (Both Significance and Variance Constraints are Specified)

17

Conclusion & Future Work

A new Constraint-Driven Clustering (CDC) model: need-driven, focused on two cluster-level constraints.

Proved NP-hardness of the CDC problem.

Proposed a new data structure (CD-Tree).

Developed a heuristic algorithm based on the CD-Tree.

Future work:

Allow constraints to be ranges instead of exact values.

Design other types of constraints to capture different application needs.

Generalize the heuristic algorithm to handle other constraints, such as the minimum separation constraint [Davidson et al.'05].

18

References

[Ghosh'02] J. Ghosh and A. Strehl. Clustering and visualization of retail market baskets. In N. R. Pal and L. Jain, editors, Knowledge Discovery in Advanced Information Systems. Springer, 2002.

[Kleinberg'99] J. Kleinberg, C. Papadimitriou, and P. Raghavan. A microeconomic view of data mining. J. Data Mining and Knowledge Discovery, 1999.

[Bradley'00] P. Bradley, K. P. Bennett, and A. Demiriz. Constrained k-means clustering. Technical report, MSR-TR-2000-65, Microsoft Research, 2000.

[Wagstaff'00] K. Wagstaff and C. Cardie. Clustering with instance-level constraints. In ICML, 2000.

[Davidson'05] I. Davidson and S. S. Ravi. Clustering with constraints: Feasibility issues and the k-means algorithm. In SDM, 2005.

[Samarati'98] P. Samarati and L. Sweeney. Generalizing data to provide anonymity when disclosing information (abstract). In PODS, 1998.

19

References (continued)

[Sweeney'02] L. Sweeney. k-anonymity: A model for protecting privacy. In IJUFKS, 2002.

[Jin'06] W. Jin, R. Ge, and W. Qian. On robust and effective k-anonymity in large databases. In PAKDD, 2006.

[Aggarwal'04] C. C. Aggarwal and P. S. Yu. A condensation approach to privacy preserving data mining. In EDBT, 2004.

[Tung'01] A. K. H. Tung, J. Han, R. T. Ng, and L. V. S. Lakshmanan. Constraint-based clustering in large databases. In ICDT, 2001.

[Banerjee'06] A. Banerjee and J. Ghosh. Scalable clustering algorithms with balancing constraints. Data Mining and Knowledge Discovery, 13(3), 2006.

20

Thanks!

Poster: this evening (Tuesday), board #1

21

Split

1. Create a new leaf node.
2. Move the point furthest from the mean of the old leaf node to the new node.
3. Calculate the new objective value.
4. If the objective value drops, go back to step 2; otherwise, link the new node appropriately.

22

Runtime

O(n² + n·Sig²)

The runtime of inserting one point is O(n), since the height of a CD-Tree can be O(n).

The total time for splits is O(Sig²).

The total time for building a tree is O(n² + n·Sig²).

23

Outline

Introduction: two classes of clustering methods; motivation for constraint-driven clustering

Related Work

Constraint-Driven Clustering model

Theoretical Results

Heuristic Algorithm

Experimental Results

Conclusion & Future Work

24

Related Work

Actionable clustering [Kleinberg'99]: the objective function measures the utility of a clustering in decision making.

Cluster-level constraints: constrained k-means algorithm [Bradley'00]. Different from our model: k is specified.

Instance-level constraints: must-link and cannot-link constraints [Wagstaff'00]; feasibility issues with instance-level constraints [Davidson'05]. Modeling a cluster-level constraint with instance-level constraints requires a large number of them, and specifying too many constraints is problematic.

25

Related Work (Contd.)

K-Anonymity [Samarati'98][Sweeney'02]: each record is indistinguishable from k-1 other records; on categorical data. The condensation approach extends k-anonymity to numerical data [Aggarwal'04].

PPMicroCluster [Jin'06]: minimum significance constraint and minimum radius constraint. Different from our model, which uses a minimum variance constraint. It does not analyze the complexity of the clustering model and proposes a static algorithm.