Hierarchical Stability Based Model Selection for Data Clustering
**Hierarchical Stability Based Model Selection for Data Clustering
Bing Yin
Advisor: Greg Hamerly
What is clustering?
What is model selection for clustering algorithms?
Stability Based Model Selection: Proposals and Problems
Hierarchical Stability Based Model Selection Algorithm
Unimodality Test
Experiments
Main Contribution
  Extended the concept of stability to hierarchical stability.
  Solved the symmetric data sets problem.
  Made stability a competitive tool for model selection.
**What is clustering?
Given:
  a data set of objects
  some relations between those objects: similarities, distances, neighborhoods, connections, ...
Goal: find meaningful groups of objects such that
  objects in the same group are similar
  objects in different groups are dissimilar
Clustering is:
  a form of unsupervised learning
  a method of data exploration
**What is clustering? An Example
Image Segmentation
Microarray Analysis:
  Serum Stimulation of Human Fibroblasts (Eisen, Spellman, PNAS, 1998)
  9800 spots representing 8600 genes
  12 samples taken over a 24-hour period
  Clusters can be roughly categorized as genes involved in
    A: cholesterol biosynthesis
    B: the cell cycle
    C: the immediate-early response
    D: signaling and angiogenesis
    E: wound healing and tissue remodeling
Document Clustering
Post-search Grouping
Data Mining
Social Network Analysis
Gene Family Grouping
**What is clustering? An Algorithm
K-Means algorithm (Lloyd, 1957)
Given: data points $X_1, \dots, X_n \in \mathbb{R}^d$ and the number $K$ of clusters to find.
1. Randomly initialize the centers $m_1^0, \dots, m_K^0$.
2. Iterate until convergence:
  2.1 Assign each point to the closest center according to Euclidean distance, i.e., define clusters $C_1^{i+1}, \dots, C_K^{i+1}$ by
      $X_s \in C_k^{i+1}$ where $\|X_s - m_k^i\|^2 \le \|X_s - m_l^i\|^2$ for $l = 1, \dots, K$
  2.2 Compute the new cluster centers by
      $m_k^{i+1} = \frac{1}{|C_k^{i+1}|} \sum_{X_s \in C_k^{i+1}} X_s$
What is optimized?
Minimizing within-cluster distances: $\sum_{k=1}^{K} \sum_{X_s \in C_k} \|X_s - m_k\|^2$
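The K-Means iteration above can be sketched in plain Python. This is a minimal illustration, not the presenter's implementation. The function names, the choice of K distinct data points as initial centers, the handling of empty clusters, and the toy data are all assumptions made for the example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed=0, max_iter=100):
    """Lloyd's iteration. Returns (centers, labels)."""
    rng = random.Random(seed)
    # Step 1: pick K distinct data points as initial centers (an assumption).
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Step 2.1: assign every point to its closest center.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        # Step 2.2: move each center to the mean of its cluster.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return centers, labels

def within_cluster_ss(points, centers, labels):
    """The quantity K-Means tries to minimize."""
    return sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))

# Toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(points, k=2)
print(labels, within_cluster_ss(points, centers, labels))
```

Lloyd's iteration only reaches a local minimum of the within-cluster sum of squares, so in practice it is usually restarted from several random initializations.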
**What is model selection?
Clustering algorithms need to know K before running.
The correct K for a given data set is unknown.
So we need a way to find this K and also the positions of the K centers.
This can intuitively be called model selection for clustering algorithms.
Existing model selection methods:
  Bayesian Information Criterion
  Gap statistic
  Projection test
  Stability-based approach
**Stability Based Model Selection
The basic idea: scientific truth should be reproducible in experiments.
Repeatedly run a clustering algorithm on the same data with parameter K and get a collection of clusterings:
  If K is the correct model, the clusterings should be similar to each other.
  If K is a wrong model, the clusterings may be quite different from each other.
This property is referred to as the stability of K (Ulrike von Luxburg, 2007).
**Stability Based Model Selection(2)
Example on the toy data:
If we can mathematically define this stability score for K, then stability can be used to find the correct model for the given data.
**Define the Stability
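Variation of Information (VI)
Clustering $C_1$: $X_1, \dots, X_k$ and clustering $C_2$: $X'_1, \dots, X'_k$ on data $X$ with $n$ points.
The probability that a point $p$ belongs to $X_i$ is: $P(i) = |X_i| / n$
The entropy of $C_1$: $H(C_1) = -\sum_{i=1}^{k} P(i) \log P(i)$
The joint probability that $p$ is in $X_i$ and $X'_j$ is $P(i,j) = |X_i \cap X'_j| / n$, with entropy: $H(C_1, C_2) = -\sum_{i,j} P(i,j) \log P(i,j)$
The VI is defined as: $VI(C_1, C_2) = 2 H(C_1, C_2) - H(C_1) - H(C_2)$
VI indicates a distance between two clusterings.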
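As a small worked example, the VI between two label vectors can be computed from the cluster sizes and the sizes of the pairwise intersections. The sketch below uses the standard form $VI = 2H(C_1, C_2) - H(C_1) - H(C_2)$ with natural logarithms. The function names are illustrative, not taken from the slides.

```python
from collections import Counter
from math import log

def entropy(counts, n):
    """Entropy of a distribution given as counts that sum to n."""
    return -sum(c / n * log(c / n) for c in counts if c > 0)

def variation_of_information(labels_a, labels_b):
    """VI between two clusterings of the same n points, given as label lists."""
    n = len(labels_a)
    h_a = entropy(Counter(labels_a).values(), n)
    h_b = entropy(Counter(labels_b).values(), n)
    # Joint distribution: how many points fall in cluster i of A and j of B.
    h_ab = entropy(Counter(zip(labels_a, labels_b)).values(), n)
    return 2 * h_ab - h_a - h_b

same = variation_of_information([0, 0, 1, 1], [1, 1, 0, 0])   # relabeled copy
cross = variation_of_information([0, 0, 1, 1], [0, 1, 0, 1])  # independent split
print(same, cross)
```

The first pair differs only in cluster names, so its VI is zero. The second pair shares no information, and its VI equals $2 \log 2$.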
**Define the stability (2)
Calculate the VI score for a single K:
  Cluster the data using K-Means with K clusters, and run it M times.
  Calculate the pairwise VI of these M clusterings.
  Average the VI values and use the average as the VI score for K.
The calculated VI score for K indicates the instability of K.
Try this over different values of K.
The K with the lowest VI score (instability) is chosen as the correct model.
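The procedure above can be sketched end to end as follows. This is a hedged illustration under stated assumptions: the M runs differ only in their random initialization, K-Means is a bare-bones Lloyd iteration, and the names `instability` and `select_k` are hypothetical. K = 1 is left out of the candidates because a single cluster is trivially stable.

```python
import random
from collections import Counter
from itertools import combinations
from math import log

def kmeans_labels(points, k, rng, max_iter=100):
    """Bare-bones Lloyd iteration that returns only the cluster labels."""
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        new = [min(range(k),
                   key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
               for p in points]
        if new == labels:
            break
        labels = new
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels

def vi(a, b):
    """Variation of information between two label lists."""
    n = len(a)
    def h(counts):
        return -sum(c / n * log(c / n) for c in counts)
    value = 2 * h(Counter(zip(a, b)).values()) - h(Counter(a).values()) - h(Counter(b).values())
    return max(0.0, value)  # clamp tiny negative rounding errors

def instability(points, k, m=10, seed=0):
    """Average pairwise VI over m K-Means runs (m must be at least 2)."""
    rng = random.Random(seed)
    runs = [kmeans_labels(points, k, rng) for _ in range(m)]
    pairs = list(combinations(runs, 2))
    return sum(vi(a, b) for a, b in pairs) / len(pairs)

def select_k(points, k_values, m=10, seed=0):
    """Return the K with the lowest instability, plus all scores."""
    scores = {k: instability(points, k, m, seed) for k in k_values}
    return min(scores, key=scores.get), scores

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
best_k, scores = select_k(points, [2, 3], m=8)
print(best_k, scores)
```

On this toy data every run with K = 2 recovers the same two groups, so the instability for K = 2 is zero.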
**Define the Stability (3)
A good example of stability:
A bad example of stability: symmetric data
Why? Because clustering the data into 9 clusters apparently offers more grouping choices than clustering it into 3.
Problems with the concept of stability introduced above:
  Symmetric data sets
  Only local optimization
  The smaller K
Proposed solution:
  Analyze the stability in a hierarchical manner.
  Perform a unimodality test to detect the termination of the recursion.
Given: data set X
HS-means:
1. Test if X is not a unimodal cluster
2. If yes, find the optimal K for X by analyzing stability; otherwise, X is a single cluster, return.
3. Partition X into K subsets
4. For each subset, recursively perform this algorithm from step 1
5. Merge answers from each subset as answer for current data
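The recursion in steps 1 to 5 can be sketched as a small generic function. The three callables `is_unimodal`, `choose_k`, and `partition` are hypothetical placeholders for the unimodality test, the stability analysis, and the clustering step. The toy stand-ins below, written for 1-D data, exist only to make the example runnable and are not the tests used in the talk.

```python
def hs_means(data, is_unimodal, choose_k, partition):
    """Recursive skeleton of the steps above. Returns a list of clusters."""
    # Steps 1-2: stop when the data look like a single cluster.
    if len(data) < 2 or is_unimodal(data):
        return [data]
    k = choose_k(data)                                # step 2: pick K by stability
    subsets = [s for s in partition(data, k) if s]    # step 3: split into K subsets
    if len(subsets) < 2:   # guard: no real split happened, avoid endless recursion
        return [data]
    clusters = []
    for subset in subsets:                            # step 4: recurse on each subset
        clusters.extend(hs_means(subset, is_unimodal, choose_k, partition))
    return clusters                                   # step 5: merged answer

# Toy stand-ins for 1-D data (assumptions for illustration only).
def toy_is_unimodal(xs):
    return max(xs) - min(xs) <= 1.0

def toy_choose_k(xs):
    return 2

def toy_partition(xs, k):
    xs = sorted(xs)
    # Split at the largest gap between neighboring values.
    cut = max(range(1, len(xs)), key=lambda i: xs[i] - xs[i - 1])
    return [xs[:cut], xs[cut:]]

data = [0.0, 0.2, 0.4, 5.0, 5.3, 9.8, 10.0]
clusters = hs_means(data, toy_is_unimodal, toy_choose_k, toy_partition)
print(clusters)
```

Passing the tests as callables keeps the recursion independent of which unimodality test or stability score is plugged in.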
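**Unimodality Test - $\chi^2$ Unimodality test
Fact: the sum of squared Gaussians follows a $\chi^2$ distribution. If $x_1, \dots, x_d$ are $d$ independent standard Gaussian variables, then $S = x_1^2 + \dots + x_d^2$ follows a $\chi^2$ distribution with $d$ degrees of freedom.
For a given data set $X$, calculate $S_i = X_{i1}^2 + \dots + X_{id}^2$.
  If $X$ is a single Gaussian, then $S$ follows a $\chi^2$ distribution with $d$ degrees of freedom.
  Otherwise, $S$ does not follow a $\chi^2$ distribution.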
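One way to turn this fact into a test is sketched below. The slides do not say how the comparison with the $\chi^2$ distribution is carried out, so every step here is an assumption. The data are standardized per coordinate, which presumes independent coordinates. The dimension is restricted to even $d$, where the $\chi^2$ CDF has a simple closed form. The fit is measured with a Kolmogorov-Smirnov statistic against the approximate 5% critical value $1.36 / \sqrt{n}$.

```python
import random
from math import exp, sqrt

def chi2_cdf_even(x, d):
    """Closed-form chi-squared CDF, valid for even degrees of freedom d = 2m."""
    term, total = 1.0, 0.0
    for j in range(d // 2):
        total += term                 # adds (x/2)^j / j!
        term *= (x / 2) / (j + 1)
    return 1.0 - exp(-x / 2) * total

def standardize(points):
    """Shift and scale each coordinate to zero mean and unit variance."""
    n = len(points)
    cols = list(zip(*points))
    means = [sum(c) / n for c in cols]
    stds = [sqrt(sum((v - mu) ** 2 for v in c) / (n - 1))
            for c, mu in zip(cols, means)]
    return [[(v - mu) / s for v, mu, s in zip(p, means, stds)] for p in points]

def ks_statistic_chi2(points):
    """Largest gap between the empirical CDF of S and the chi-squared CDF."""
    d = len(points[0])
    if d % 2:
        raise ValueError("this sketch only handles even dimension d")
    s = sorted(sum(z * z for z in row) for row in standardize(points))
    n = len(s)
    worst = 0.0
    for i, x in enumerate(s):
        f = chi2_cdf_even(x, d)
        worst = max(worst, (i + 1) / n - f, f - i / n)
    return worst

def looks_unimodal(points, c_alpha=1.36):
    """Accept a single Gaussian if the KS statistic is below the critical value."""
    return ks_statistic_chi2(points) <= c_alpha / sqrt(len(points))

rng = random.Random(0)
one_blob = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(2000)]
two_blobs = [(rng.gauss(c, 1), rng.gauss(c, 1)) for c in (0, 10) for _ in range(1000)]
print(ks_statistic_chi2(one_blob), ks_statistic_chi2(two_blobs))
```

Because the mean and variance are estimated from the same data, the critical value is only approximate. Correlated coordinates would need a full whitening step instead of per-coordinate scaling.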
**Unimodality Test - Gap Test
Fact: the within-cluster dispersion drops most sharply at the correct K (Tibshirani, 2000)
Given: data set X, candidate k
  Cluster X into k clusters and get the within-cluster dispersion $W_k$.
  Generate uniform reference data sets, cluster each into k clusters, and calculate $W^*_k$ (averaged).
  $gap(k) = W^*_k - W_k$
  Select the smallest k such that $gap(k) > gap(k+1)$.
We use it in another way: just ask whether k = 1.
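The "is k = 1?" question can be sketched by comparing the gap at k = 1 with the gap at k = 2. Several details are assumptions rather than the presenter's method. The dispersion is the within-cluster sum of squares and is compared on a log scale, as in the published gap statistic. Reference data are drawn uniformly from the bounding box of X. K-Means keeps the best of a few random restarts. The standard-error correction of the published method is omitted.

```python
import random
from math import log

def kmeans_wss(points, k, rng, restarts=5, max_iter=100):
    """Best within-cluster sum of squares over a few random restarts."""
    def sq(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    best = float("inf")
    for _restart in range(restarts):
        centers = [list(p) for p in rng.sample(points, k)]
        labels = None
        for _step in range(max_iter):
            new = [min(range(k), key=lambda j: sq(p, centers[j])) for p in points]
            if new == labels:
                break
            labels = new
            for j in range(k):
                members = [p for p, lab in zip(points, labels) if lab == j]
                if members:
                    centers[j] = [sum(c) / len(members) for c in zip(*members)]
        best = min(best, sum(sq(p, centers[lab]) for p, lab in zip(points, labels)))
    return best

def gap(points, k, rng, n_refs=10):
    """Average log dispersion of uniform reference data minus that of the data."""
    lows = [min(c) for c in zip(*points)]
    highs = [max(c) for c in zip(*points)]
    ref_logs = []
    for _ref in range(n_refs):
        ref = [tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
               for _point in points]
        ref_logs.append(log(kmeans_wss(ref, k, rng)))
    return sum(ref_logs) / n_refs - log(kmeans_wss(points, k, rng))

def single_cluster_by_gap(points, seed=0):
    """Simplified rule: call the data one cluster if gap(1) >= gap(2)."""
    rng = random.Random(seed)
    return gap(points, 1, rng) >= gap(points, 2, rng)

data_rng = random.Random(1)
two_blobs = [(data_rng.gauss(c, 1), data_rng.gauss(c, 1))
             for c in (0, 10) for _ in range(100)]
print(single_cluster_by_gap(two_blobs))
```

For two well-separated groups the dispersion drops far more on the data than on the uniform reference when going from one to two clusters, so the rule rejects k = 1.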
**Experiments
Synthetic data
  Both Gaussian and uniform distributions
  In dimensions from 2 up to 20
  c-separation between each cluster center and its nearest neighbor is 4
  200 points in each cluster, 10 clusters in total
Handwritten Digits
  U.S. Postal Service handwritten digits
  9298 instances in 256 dimensions
  10 true clusters (maybe!)
KDDD Control Curves
  600 instances in 60 dimensions
  6 true clusters, each with 100 instances
Method | Synthetic Gaussian (10 true clusters) | Synthetic Uniform (10 true clusters) | Handwritten Digits (10 true clusters) | KDDD Control Curves (6 true clusters)
HS-means | 10 ± 1 | 10 ± 1 | 6 ± 0 | 6.5 ± 0.5
Lange Stability | 6.5 ± 1.5 | 7 ± 1 | 2 ± 0 | 3 ± 0
PG-means | 10 ± 1 | 19.5 ± 1.5 | 20 ± 1 | 17 ± 1
**Experiments - symmetric data
HS-means
Lange Stability
A better unimodality testing approach.
A more detailed comparison of performance with existing methods, such as within-cluster distance, the VI metric, and so on.
Improve the speed of the algorithm.
**Questions and Comments