Constraint-Driven Clustering


Transcript of Constraint-Driven Clustering

1

Constraint-Driven Clustering

Rong Ge1, Martin Ester1, Wen Jin1, Ian Davidson2

Presenter: Rong Ge
1 Simon Fraser University
2 University of California - Davis

2

Introduction

Clustering methods aim at grouping data objects into clusters based on some criteria; they can be either data-driven or need-driven [Banerjee'06].

Data-driven methods discover the true structure of the underlying data by grouping similar data objects together.

Need-driven methods group data objects based on not only similarity but also application needs, and discover more actionable clusters.

3

Capturing Application Needs

Two methodologies:

Design sophisticated objective functions based on business needs. E.g., in catalog segmentation, clustering results are evaluated by their utility in decision making [Kleinberg et al.'99].

Capture application needs by constraints. E.g., discovering balanced customer groups in market segmentation [Ghosh et al.'02].

Yet, existing models often require users to provide the number of clusters, which is often unknown or not suited to application needs.

4

Constraint-Driven Clustering

Constraint-Driven Clustering:

Utilizes constraints to control cluster formation. Discovers an arbitrary number of clusters.

Goals: discover compact clusters; satisfy all constraints.

Two constraint types (cluster-level constraints):

Minimum significance constraint: specifies the minimum number of objects in a cluster.

Minimum variance constraint: specifies the minimum variance of a cluster.
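The two cluster-level constraints can be stated concretely. The sketch below (function name and the 1-d simplification are ours, not from the slides) checks whether a single cluster satisfies a minimum significance and/or a minimum variance constraint:

```python
import statistics

def satisfies_constraints(cluster, min_sig=None, min_var=None):
    """Check the two cluster-level constraints on a list of 1-d points."""
    if min_sig is not None and len(cluster) < min_sig:
        return False  # fewer objects than the minimum significance
    if min_var is not None and statistics.pvariance(cluster) < min_var:
        return False  # population variance below the minimum variance
    return True

# Four tightly packed points pass Sig = 3 but fail a variance threshold of 0.5.
print(satisfies_constraints([1.0, 1.1, 0.9, 1.0], min_sig=3))                # True
print(satisfies_constraints([1.0, 1.1, 0.9, 1.0], min_sig=3, min_var=0.5))  # False
```

A cluster is feasible only when every constraint in the given set holds, which is why the CDC model later requires each cluster in the partition to pass such a check.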

5

Motivation - Energy Aware Sensor Networks

Goal: minimize energy consumption.

Solution: group sensors into clusters. A master node is selected from the sensors in a cluster, or deployed. Other sensors communicate with the outside through the master nodes.

Constraint-Driven Clustering:

Minimum significance constraint: balances the workload of master nodes.

Minimum variance constraint: allows sensor clusters to be balanced in terms of energy consumption.

[Figure: a sensor network with a command node, sensors, master nodes, and communication channels]

6

Motivation - Privacy Preservation

Goal: publish personal records without a privacy breach.

Solution: group records into clusters and release the summary of each cluster to the public.

Constraint-Driven Clustering:

Minimum significance constraint: similar to k-anonymity in preserving individual privacy.

Minimum variance constraint: the variance translates into the width of the confidence interval of the adversary's estimate, and prevents similar, even identical, records from being released.
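The "release the summary of each cluster" step can be sketched in the style of condensation-based privacy preservation: only aggregate statistics of each cluster are published, never the individual records. The function name and the single-attribute example are ours:

```python
import statistics

def summarize_cluster(records):
    """Release only aggregate statistics of a cluster, not the raw records."""
    return {
        "count": len(records),                      # minimum significance bounds this
        "mean": statistics.fmean(records),
        "variance": statistics.pvariance(records),  # minimum variance bounds this
    }

ages = [34, 41, 38, 45, 52]  # hypothetical sensitive attribute values
print(summarize_cluster(ages))
```

A larger "count" gives k-anonymity-style protection, while a larger "variance" widens the confidence interval of any estimate an adversary can form about an individual in the cluster.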

7

Related Work

Clustering with cluster-level constraints:

Constrained k-means algorithm [Bradley et al.'00].

The existential constraint [Tung et al.'01]: specifies the minimum number of objects in a subset of the input data; a general form of the minimum significance constraint. Different from our model: k is specified.

K-Anonymity [Samarati et al.'98][Sweeney et al.'02]: each record is indistinguishable from k-1 other records; on categorical data.

PPMicroCluster [Jin et al.'06]: minimum significance and minimum radius constraints; the constraint is posed on the radius of a cluster; did not analyze the complexity of the clustering model.

8

Constraint-Driven Clustering (CDC)

Given a set of points P and a set of constraints C, partition P into disjoint clusters {P1, ..., Pm} such that:

Each cluster satisfies all constraints.

The sum of squared distances of data points to their corresponding cluster representatives is minimized.

Constraints: for each cluster Pi, 1 ≤ i ≤ m, |Pi| ≥ Sig and/or Var(Pi) ≥ Var.

Our model searches for clusters which are balanced in terms of cardinality and/or variance.
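The objective above can be made concrete. A minimal sketch of the SSE objective for 2-d points, using the mean vector as the cluster representative (one of the two representative choices in the model; function names are ours):

```python
def mean_vector(cluster):
    """Mean vector of a cluster of 2-d points."""
    n = len(cluster)
    return (sum(x for x, _ in cluster) / n, sum(y for _, y in cluster) / n)

def sse(partition, representatives):
    """Sum of squared distances of points to their cluster representatives."""
    total = 0.0
    for cluster, rep in zip(partition, representatives):
        for x, y in cluster:
            total += (x - rep[0]) ** 2 + (y - rep[1]) ** 2
    return total

clusters = [[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0), (5.0, 7.0)]]
reps = [mean_vector(c) for c in clusters]
print(sse(clusters, reps))  # each point is distance 1 from its mean -> 4.0
```

CDC minimizes this quantity over all partitions whose clusters each satisfy the given constraints; with the medoid as representative, `mean_vector` would be replaced by the cluster member minimizing the same sum.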

9

Theoretical Results

Note that the CDC problem has feasible solutions as long as the whole data set satisfies given constraints

Variant      Constraints         Cluster representative
Sig-CDC      Sig > 1, Var = 0    Medoid
-Sig-CDC     Sig > 1, Var = 0    Mean vector
Var-CDC      Sig = 1, Var > 0    Medoid
-Var-CDC     Sig = 1, Var > 0    Mean vector

Complexity: all variants are NP-hard (by a reduction from PLANAR X3C).

10

Heuristic Algorithm

Intuition: the generated clusters must be balanced, and the membership assignment of each point depends on its close neighbors.

Data structure: CD-Tree. It helps to retrieve close neighbors easily; a solution to the CDC problem is obtained by post-processing the leaf nodes.

Two parameters:

Significance parameter S (S = Sig)

Variance parameter V (V = Var)

11

CD-Tree

Leaf nodes: each entry contains an individual data point; capacity and variance are upper-bounded.

Max capacity: 2S - 1 (in an optimal solution, no cluster consists of more than 2S - 1 data objects).

Max variance: 2V (keeps leaf nodes compact so that the SSE is minimized).

Non-leaf nodes: each entry contains pointers to child nodes and summaries of the points in the child nodes, and corresponds to the subtree rooted at the child node. Max capacity Z (a constant, can be set arbitrarily).
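The leaf-node bounds above can be sketched as follows. This is our own minimal illustration (the class and method names are not from the slides): a leaf stores individual points, holds at most 2S - 1 of them, and only absorbs a point if the resulting variance stays within 2V:

```python
import statistics

class CDLeaf:
    """Sketch of a CD-Tree leaf node with the two upper bounds
    from the slides: max capacity 2S - 1 and max variance 2V."""

    def __init__(self, sig, var):
        self.max_capacity = 2 * sig - 1
        self.max_variance = 2 * var
        self.points = []  # individual 1-d data points for simplicity

    def can_absorb(self, p):
        """Would inserting p keep this leaf within both bounds?"""
        candidate = self.points + [p]
        if len(candidate) > self.max_capacity:
            return False  # would exceed 2S - 1 entries
        return statistics.pvariance(candidate) <= self.max_variance

leaf = CDLeaf(sig=3, var=1.0)   # at most 5 points, variance at most 2.0
leaf.points = [1.0, 1.2, 0.8]
print(leaf.can_absorb(1.1))   # a nearby point fits
print(leaf.can_absorb(9.0))   # a far point would blow the variance bound
```

When a point cannot be absorbed by any suitable leaf, the tree either starts a new leaf or triggers the split procedure described later.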

12

CD-Tree vs. CF-Tree and R*-Tree

CF-Tree: does not save individual data points; no max capacity is specified for leaf nodes.

R*-Tree: no max variance is specified for leaf nodes.

Neither the CF-Tree nor the R*-Tree is designed for generating clusters that satisfy constraints.

CD-Tree: one CD-Tree is built for a set of constraints. When a constraint value is changed slightly, a solution can be obtained by post-processing the leaf nodes.

13

Algorithm

Two steps:

Build the CD-Tree (insertion and split).

Post-process the leaf nodes to solve the CDC problem.

[Figure: CD-Tree insertion example with S = 5, showing the root, non-leaf entries nll and nlr, and leaf nodes l1, l2, l3]

14

Experimental Results

Comparison partner: the PPMicroCluster algorithm. It has a similar problem definition, can be adapted to handle the minimum variance constraint, and is a static algorithm.

Data sets:

Synthetic data set (DS1): 5000 2-d data points simulating uniformly deployed sensors.

Two real UCI data sets (Abalone and Letter).

15

Results on Synthetic data set

Results for the DS1 dataset (Only Significance Constraints are Specified)

16

Results on Letter data set

Results for the Letter dataset (Both Significance and Variance Constraints are Specified)

17

Conclusion & Future Work

A new Constraint-Driven Clustering (CDC) model: need-driven, focused on two cluster-level constraints.

Proved NP-hardness of the CDC problem.

Proposed a new data structure (CD-Tree).

Developed a heuristic algorithm based on the CD-Tree.

Future work:

Allow constraints to be ranges instead of exact values.

Design other types of constraints to capture different application needs.

Generalize the heuristic algorithm to handle other constraints, such as the minimum separation constraint [Davidson et al.'05].

18

References

[Ghosh'02] J. Ghosh and A. Strehl. Clustering and visualization of retail market baskets. In N. R. Pal and L. Jain, editors, Knowledge Discovery in Advanced Information Systems. Springer, 2002.

[Kleinberg'99] J. Kleinberg, C. Papadimitriou, and P. Raghavan. A microeconomic view of data mining. J. Data Mining and Knowledge Discovery, 1999.

[Bradley'00] P. Bradley, K. P. Bennett, and A. Demiriz. Constrained k-means clustering. Technical report, MSR-TR-2000-65, Microsoft Research, 2000.

[Wagstaff'00] K. Wagstaff and C. Cardie. Clustering with instance-level constraints. In ICML, 2000.

[Davidson'05] I. Davidson and S. S. Ravi. Clustering with constraints: Feasibility issues and the k-means algorithm. In SDM, 2005.

[Samarati'98] P. Samarati and L. Sweeney. Generalizing data to provide anonymity when disclosing information (abstract). In PODS, 1998.

19

References (continued)

[Sweeney'02] L. Sweeney. k-anonymity: A model for protecting privacy. In IJUFKS, 2002.

[Jin'06] W. Jin, R. Ge, and W. Qian. On robust and effective k-anonymity in large databases. In PAKDD, 2006.

[Aggarwal'04] C. C. Aggarwal and P. S. Yu. A condensation approach to privacy preserving data mining. In EDBT, 2004.

[Tung'01] A. K. H. Tung, J. Han, R. T. Ng, and L. V. S. Lakshmanan. Constraint-based clustering in large databases. In ICDT, 2001.

[Banerjee'06] A. Banerjee and J. Ghosh. Scalable clustering algorithms with balancing constraints. Data Mining and Knowledge Discovery, 13(3), 2006.

20

Thanks!

Poster: this evening (Tuesday), board #1

21

Split

1. Create a new leaf node.
2. Move the point furthest from the mean of the old leaf node to the new node.
3. Calculate the new objective value.
4. If the objective value drops, go back to step 2; otherwise, link the new node appropriately.

22

Runtime

O(n² + n·Sig²)

The runtime of inserting one point is O(n), since the height of a CD-Tree can be O(n).

The total time for splits is O(Sig²).

The total time for building a tree is O(n² + n·Sig²).

23

Outline

Introduction: two classes of clustering methods; motivation for constraint-driven clustering

Related Work

Constraint-Driven Clustering model

Theoretical Results

Heuristic Algorithm

Experimental Results

Conclusion & Future Work

24

Related Work

Actionable clustering [Kleinberg'99]: the objective function measures the utility of a clustering in decision making.

Cluster-level constraints: constrained k-means algorithm [Bradley'00]. Different from our model: k is specified.

Instance-level constraints: must-link and cannot-link constraints [Wagstaff'00]; feasibility issues with instance-level constraints [Davidson'05]. Modeling a cluster-level constraint with instance-level constraints requires a large number of them, and specifying too many constraints is problematic.

25

Related Work (Contd.)

K-Anonymity [Samarati'98][Sweeney'02]: each record is indistinguishable from k-1 other records; on categorical data. The condensation approach extends k-anonymity to numerical data [Aggarwal'04].

PPMicroCluster [Jin'06]: minimum significance constraint and minimum radius constraint. Different from our model, which uses a minimum variance constraint. It does not analyze the complexity of the clustering model and proposes a static algorithm.