RIC: Parameter-Free Noise-Robust Clustering
Transcript of RIC: Parameter-Free Noise-Robust Clustering
Intelligent Database Systems Lab
National Yunlin University of Science and Technology (國立雲林科技大學)
Presenter: Shu-Ya Li
Authors: Christian Böhm, Christos Faloutsos, Jia-Yu Pan, Claudia Plant
TKDD, 2007
Outline
Motivation
Objective
Methodology
Experiments and Results
Conclusion
Personal Comments
Motivation
How can we find a natural clustering of a real-world point set that contains an unknown number of clusters of different shapes, where the clusters may be contaminated by noise?
Objectives
Find a natural clustering in a dataset.
Goodness of a clustering: use Volume after Compression (VAC) to quantify the "goodness" of a grouping.
Efficient algorithm for finding a good clustering: Robust Fitting and Cluster Merging.
Analogy: as MDL is used for classification, VAC is used for clustering.
VAC (Volume after Compression)
VAC tells which grouping is better: a lower VAC means a better grouping.
The formula is based on a decorrelation matrix.
Computing the VAC of a cluster C:
1. Compute the covariance matrix of C.
2. Compute PCA and obtain the decorrelation matrix.
3. Compute the VAC from that matrix.
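The three steps above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation; the function name and the synthetic data are assumptions made here for demonstration.

```python
import numpy as np

def decorrelation_matrix(C):
    """Sketch of the VAC preprocessing steps for a cluster C (n x d array):
    1. covariance matrix of C,
    2. PCA (eigendecomposition) of that matrix,
    3. the decorrelation matrix, i.e. the eigenvector basis that
       rotates the cluster into uncorrelated coordinates."""
    sigma = np.cov(C, rowvar=False)   # step 1: covariance matrix
    lam, V = np.linalg.eigh(sigma)    # step 2: PCA, Sigma = V Lambda V^T
    return V.T, lam                   # step 3: decorrelation matrix and per-axis spreads

rng = np.random.default_rng(1)
C = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.0], [1.0, 0.5]])  # correlated cloud
VT, lam = decorrelation_matrix(C)
Y = (C - C.mean(axis=0)) @ VT.T       # rotate into decorrelated coordinates
offdiag = np.cov(Y, rowvar=False)[0, 1]
print(abs(offdiag) < 1e-6)            # cross-covariance vanishes after the rotation
```

After the rotation the coordinates are uncorrelated, so each axis can be encoded independently, which is what makes the compressed volume easy to compute.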
Computing VAC
VAC (volume after compression) counts:
the bytes to record each cluster's distribution type (Gaussian, uniform, ...);
the bytes to record the number of clusters k;
the bytes to describe the parameters of each distribution (e.g., mean, variance, covariance, slope, intercept), and then the location of each point.
Cluster model: stat = (μi, σi, lbi, ubi, ...)
Example: 2.3 + 4.3 = 6.6 bits
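The per-point part of this cost can be illustrated with a small sketch: a point falling in a high-probability cell of its model costs few bits (−log₂ of the cell's probability). The 1-D models, the grid width `gamma`, and the function names are assumptions made here for illustration, not the paper's formulas.

```python
import math

def point_cost_bits(x, mu, sigma, gamma=0.01):
    """Bits to encode point x under a 1-D Gaussian model, quantized to a
    grid of width gamma: cost = -log2(Pr[cell of x]) ~ -log2(pdf(x) * gamma)."""
    pdf = math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    return -math.log2(pdf * gamma)

def uniform_cost_bits(lb, ub, gamma=0.01):
    """Bits per point under a uniform model on [lb, ub]: all grid cells
    are equally likely, so cost = log2((ub - lb) / gamma)."""
    return math.log2((ub - lb) / gamma)

# A point near the Gaussian mean is cheaper to encode than under a wide
# uniform model, so the Gaussian model yields the lower VAC for this point.
print(point_cost_bits(0.1, mu=0.0, sigma=1.0))  # ~8.0 bits
print(uniform_cost_bits(-5.0, 5.0))             # log2(1000) ~ 9.97 bits
```

Summing these per-point costs with the model-description bytes gives the total VAC of a grouping.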
Methodology – RIC framework
Robust Fitting: the Mahalanobis distance is defined by Λ and V, obtained from PCA (Σ = VΛVᵀ).
Conventional estimation: the covariance matrix is computed around the mean μ.
Robust estimation: the covariance matrix is computed around the coordinate-wise median μR.
The median is less affected by outliers than the mean.
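The median-based fitting above can be sketched as follows. This is a simplified illustration of the idea (median center, covariance around it, Mahalanobis distances via PCA), not the paper's RF algorithm; the function name and test data are assumptions.

```python
import numpy as np

def robust_mahalanobis(X):
    """Robust fitting sketch: use the coordinate-wise median as center
    (less outlier-sensitive than the mean), estimate the covariance
    around it, and return each point's Mahalanobis distance."""
    mu_r = np.median(X, axis=0)           # robust center
    D = X - mu_r
    sigma = D.T @ D / len(X)              # covariance around the robust center
    lam, V = np.linalg.eigh(sigma)        # PCA: Sigma = V Lambda V^T
    Y = D @ V / np.sqrt(lam)              # decorrelate and whiten
    return np.sqrt((Y ** 2).sum(axis=1))  # Mahalanobis distance per point

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X[:5] += 50.0                             # inject a few gross outliers
d = robust_mahalanobis(X)
print(d[:5].min(), d[5:].mean())          # outliers get much larger distances
```

Because the center is a median, the few injected outliers barely move it, so they stand out with large Mahalanobis distances and can be excluded from the fit.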
Methodology – RIC framework
Cluster Merging: merge Ci and Cj only if the combined VAC decreases.
savedCost is the decrease in VAC obtained by merging; if savedCost > 0, then merge Ci and Cj.
Greedy search maximizes savedCost and hence minimizes the total VAC.
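The greedy merging loop can be sketched as below. The `vac` stand-in (a fixed model cost plus a Gaussian per-point coding cost) and all names are assumptions for illustration, not the paper's cost function.

```python
import math

def vac(points):
    """Hypothetical stand-in VAC for 1-D points under a Gaussian model:
    a constant model-parameter cost plus a per-point coding cost."""
    n = len(points)
    mu = sum(points) / n
    var = sum((p - mu) ** 2 for p in points) / n + 1e-9
    return 32.0 + 0.5 * n * math.log2(2 * math.pi * math.e * var)

def greedy_merge(clusters):
    """Repeatedly merge the pair whose merge saves the most bits
    (savedCost = VAC(Ci) + VAC(Cj) - VAC(Ci u Cj)), until no merge
    with savedCost > 0 remains."""
    clusters = [list(c) for c in clusters]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                saved = vac(clusters[i]) + vac(clusters[j]) - vac(clusters[i] + clusters[j])
                if saved > 0 and (best is None or saved > best[0]):
                    best = (saved, i, j)
        if best is None:
            return clusters
        _, i, j = best
        clusters[i] += clusters.pop(j)   # commit the best merge, then rescan

# Two fragments of the same tight cluster merge; a distant cluster stays separate.
parts = [[0.1, 0.2, 0.3, 0.15], [0.25, 0.05, 0.35, 0.12], [100.0, 100.2, 99.9, 100.1]]
merged = greedy_merge(parts)
print(len(merged))  # -> 2
```

Merging the two nearby fragments saves the 32-bit model cost of one cluster without inflating the per-point cost, while merging with the distant cluster would blow up the variance, so the greedy loop stops at two clusters.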
Experiments
Results on Synthetic Data
Experiments
Performance on Real Data
Experiments
Comparison of the result of filterOpt with the result of filterDist.
Conclusion
The contributions of this work are the answers to the two questions, organized in the RIC framework.
(Q1) Goodness measure: the VAC criterion, based on information-theory concepts, specifically the volume after compression.
(Q2) Efficiency: the Robust Fitting (RF) algorithm, which carefully avoids outliers, and the Cluster Merging (CM) algorithm, which stitches clusters together if the stitching gives a better VAC score.
Personal Comments
Advantages: detailed descriptions; many pictures and examples.
Drawback: the black-and-white figures are difficult to read.
Application: clustering.