CLUSTERING
TUTORIAL
GABOR VERESS
2013.10.16
CONTENTS
What is clustering?
Distance: Similarity and dissimilarity
Data types in cluster analysis
Clustering methods
Evaluation of clustering
Summary
WHAT IS CLUSTERING?
Grouping of objects
CLUSTERING I. (BY TYPE)
Fruit Veggie
CLUSTERING II. (BY COLOR)
Yellow Green
CLUSTERING III. (BY SHAPE)
Ball
Chili shape
Longish Bushy
ANOTHER CLUSTERING EXAMPLE
IMAGE PROCESSING EXAMPLE
Figure from “Image and video segmentation: the normalised cut framework” by Shi and Malik, copyright IEEE, 1998
YET ANOTHER EXAMPLE
Original Clustering 1 Clustering 2
Item Cyan Magenta Yellow Black
Chili 72 0 51 57
Cucumber 11 0 45 19
Broccoli 15 0 23 31
Apple 25 0 74 20
Paprika 0 52 100 11
Lemon 0 20 93 5
Orange 0 18 65 3
Banana 0 1 100 1
CLUSTERING BY COLOR EXAMPLE
Item Cyan Magenta Yellow Black Cluster
Chili 72 0 51 57 Cluster 1
Cucumber 11 0 45 19 Cluster 1
Broccoli 15 0 23 31 Cluster 1
Apple 25 0 74 20 Cluster 1
Paprika 0 52 100 11 Cluster 2
Lemon 0 20 93 5 Cluster 2
Orange 0 18 65 3 Cluster 2
Banana 0 1 100 1 Cluster 2
CLUSTERING BY COLOR EXAMPLE
WHAT IS CLUSTERING?
Grouping of objects into classes in such a way that
• Objects in the same cluster are similar
• Objects in different clusters are dissimilar
Segmentation vs. clustering
• Clustering finds the borders between groups
• Segmentation uses those borders to form groups
Clustering is thus a method for creating segments
SUPERVISED VS. UNSUPERVISED
CLASSIFICATION VS. CLUSTERING
Classification – Supervised
Classes are predetermined
we know the labels in advance
For example, patients already diagnosed with a disease
Or customers known to have churned
Clustering – Unsupervised
Classes are not known in advance
we do not know the labels in advance
For example, market behaviour segmentation
Or gene analysis
APPLICATIONS OF CLUSTERING
Marketing: segmentation of customers based on behavior
Banking: ATM Fraud detection (outlier detection)
ATM classification: segmentation based on time series
Gene analysis: Identifying gene responsible for a disease
Chemistry: Periodic table of the elements
Image processing: identifying objects on an image (face detection)
Insurance: identifying groups of car insurance policy holders with a
high average claim cost
Houses: identifying groups of houses according to their house type,
value, and geographical location
TYPICAL DATABASE
id age sex region income married children car save_act current_act mortgage pep
ID12101 48 FEMALE INNER_CITY 17,546 NO 1 NO NO NO NO YES
ID12102 40 MALE TOWN 30,085 YES 3 YES NO YES YES NO
ID12103 51 FEMALE INNER_CITY 16,575 YES 0 YES YES YES NO NO
ID12104 23 FEMALE TOWN 20,375 YES 3 NO NO YES NO NO
ID12105 57 FEMALE RURAL 50,576 YES 0 NO YES NO NO NO
ID12106 57 FEMALE TOWN 37,870 YES 2 NO YES YES NO YES
ID12107 22 MALE RURAL 8,877 NO 0 NO NO YES NO YES
ID12108 58 MALE TOWN 24,947 YES 0 YES YES YES NO NO
ID12109 37 FEMALE SUBURBAN 25,304 YES 2 YES NO NO NO NO
ID12110 54 MALE TOWN 24,212 YES 2 YES YES YES NO NO
ID12111 66 FEMALE TOWN 59,804 YES 0 NO YES YES NO NO
How do we define similarity or dissimilarity?
Especially for categorical variables?
WHAT TO DERIVE FROM THE DATABASE?
id age sex region income married children car save_act current_act mortgage pep
ID12101 48 FEMALE INNER_CITY 17,546 NO 1 NO NO NO NO YES
ID12102 40 MALE TOWN 30,085 YES 3 YES NO YES YES NO
ID12103 51 FEMALE INNER_CITY 16,575 YES 0 YES YES YES NO NO
ID12104 23 FEMALE TOWN 20,375 YES 3 NO NO YES NO NO
ID12105 57 FEMALE RURAL 50,576 YES 0 NO YES NO NO NO
ID12106 57 FEMALE TOWN 37,870 YES 2 NO YES YES NO YES
ID12107 22 MALE RURAL 8,877 NO 0 NO NO YES NO YES
ID12108 58 MALE TOWN 24,947 YES 0 YES YES YES NO NO
ID12109 37 FEMALE SUBURBAN 25,304 YES 2 YES NO NO NO NO
ID12110 54 MALE TOWN 24,212 YES 2 YES YES YES NO NO
ID12111 66 FEMALE TOWN 59,804 YES 0 NO YES YES NO NO
Above: original database of the objects (customers)
Below: similarity or dissimilarity matrix of the objects (similarity of customers)
id ID12101 ID12102 ID12103 ID12104 ID12105
ID12101 0 12 23 19 13
ID12102 12 0 25 13 17
ID12103 23 25 0 9 21
ID12104 19 13 9 0 12
ID12105 13 17 21 12 0
REQUIREMENTS OF CLUSTERING
• Ability to deal with different types of attributes
• Discovery of clusters with arbitrary shape
• Ability to deal with noise and outliers
• Insensitive to order of input records
• High dimensionality
• Scalability
• Minimal requirements for domain knowledge to
determine input parameters
• Incorporation of user-specified constraints
• Interpretability and usability
DISTANCE:
SIMILARITY AND
DISSIMILARITY
SIMILARITY AND DISSIMILARITY
There is no single definition of similarity or
dissimilarity between data objects
The definition of similarity or dissimilarity between
objects depends on
• the type of the data considered
• what kind of similarity we are looking for
DISTANCE MEASURE
Similarity/dissimilarity between objects is often
expressed in terms of a distance measure d(x,y)
Ideally, every distance measure should be a metric, i.e.,
it should satisfy the following conditions:
1. d(x,y) ≥ 0
2. d(x,y) = 0 if and only if x = y
3. d(x,y) = d(y,x)
4. d(x,z) ≤ d(x,y) + d(y,z)
TYPE OF VARIABLES
Interval-scaled variables: Age
Binary variables: Car, Mortgage
Nominal, Ordinal, and Ratio variables
Variables of mixed types
Complex data types: Documents, GPS coordinates
id age sex region income married children car save_act current_act mortgage pep
ID12101 48 FEMALE INNER_CITY 17,546 NO 1 NO NO NO NO YES
ID12102 40 MALE TOWN 30,085 YES 3 YES NO YES YES NO
ID12103 51 FEMALE INNER_CITY 16,575 YES 0 YES YES YES NO NO
ID12104 23 FEMALE TOWN 20,375 YES 3 NO NO YES NO NO
ID12105 57 FEMALE RURAL 50,576 YES 0 NO YES NO NO NO
ID12106 57 FEMALE TOWN 37,870 YES 2 NO YES YES NO YES
ID12110 54 MALE TOWN 24,212 YES 2 YES YES YES NO NO
INTERVAL-SCALED VARIABLES
Continuous measurements on a roughly linear scale,
for example age, weight, and height
The measurement unit can affect the cluster analysis
To avoid dependence on the measurement unit, we should
standardize the data
STANDARDIZATION
To standardize the measurements:
• calculate the mean absolute deviation

s_f = (1/n) (|x_1f − m_f| + |x_2f − m_f| + … + |x_nf − m_f|)

where m_f = (1/n) (x_1f + x_2f + … + x_nf)

• calculate the standardized measurement (z-score)

z_if = (x_if − m_f) / s_f
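The two steps above can be sketched in a few lines of Python (a minimal illustration with my own function and variable names; the sample ages are taken from the customer table earlier):

```python
def standardize(values):
    """Standardize one variable using the mean absolute deviation.

    z_if = (x_if - m_f) / s_f, where s_f is the mean absolute
    deviation -- less sensitive to outliers than the standard deviation.
    """
    n = len(values)
    m_f = sum(values) / n                          # mean of the variable
    s_f = sum(abs(x - m_f) for x in values) / n    # mean absolute deviation
    return [(x - m_f) / s_f for x in values]

ages = [48, 40, 51, 23, 57]
z = standardize(ages)   # z-scores are centered: they sum to zero
```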
DISTANCE MEASURE I.
One group of popular distance measures for interval-
scaled variables are Minkowski distances
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are
two p-dimensional data objects, and q is a positive integer
24
pp
jx
ix
jx
ix
jx
ixjid )||...|||(|),(
2211
DISTANCE MEASURES II.
If q = 1, the distance measure is Manhattan (or city
block) distance
If q = 2, the distance measure is Euclidean distance
Manhattan: d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + … + |x_ip − x_jp|

Euclidean: d(i, j) = (|x_i1 − x_j1|^2 + |x_i2 − x_j2|^2 + … + |x_ip − x_jp|^2)^(1/2)
EXAMPLE: DISTANCE MEASURES
Distance Matrix
point x y
p1 0 2
p2 2 0
p3 3 1
p4 5 1
Manhattan
Distance
p1 p2 p3 p4
p1 0 4 4 6
p2 4 0 2 4
p3 4 2 0 2
p4 6 4 2 0
Euclidean
Distance
p1 p2 p3 p4
p1 0 2.828 3.162 5.099
p2 2.828 0 1.414 3.162
p3 3.162 1.414 0 2
p4 5.099 3.162 2 0
[Scatter plot of the four points: p1(0, 2), p2(2, 0), p3(3, 1), p4(5, 1)]
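A small script can reproduce both distance matrices above (a sketch; the point names and coordinates are taken from the table):

```python
def minkowski(x, y, q):
    """Minkowski distance between two points: q=1 Manhattan, q=2 Euclidean."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

# Full pairwise matrices, matching the tables above
manhattan = {(i, j): minkowski(points[i], points[j], 1)
             for i in points for j in points}
euclidean = {(i, j): minkowski(points[i], points[j], 2)
             for i in points for j in points}
```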
WHY STANDARDIZATION?
Age and Income
No standardization
Income >> Age
No separation on age
With standardization
Separation based on both
age and income
RATIO-SCALED VARIABLES
A positive measurement on a nonlinear scale, approximately on an exponential scale, e.g., Ae^(Bt) or Ae^(−Bt)
Methods:
1. treat them like interval-scaled variables (not a good choice!)
2. apply logarithmic transformation yif = log(xif)
3. treat them as continuous ordinal data and treat their rank as interval-scaled
4. create a better variable
Object On-net Off-net Ratio Log-Ratio On-net/Total
1 95 6 0.06 -1.20 94%
2 56 15 0.27 -0.57 79%
Dist 1-2 0.04 0.39 0.02
3 12 23 1.92 0.28 34%
4 12 29 2.42 0.38 29%
Dist 3-4 0.25 0.01 0.00
ORDINAL VARIABLES
An ordinal variable can be discrete or continuous
Order of values is important, e.g., rank
Can be treated like interval-scaled
• replacing xif by their rank
• map the range of each variable onto [0, 1] by replacing i-th object in the f-th variable by
• compute the dissimilarity using methods for interval-scaled variables
z_if = (r_if − 1) / (M_f − 1), where r_if ∈ {1, …, M_f} is the rank of the i-th object on the f-th variable
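The rank-to-[0, 1] mapping is a one-liner (minimal sketch; the three-level satisfaction scale used as an example is my own):

```python
def ordinal_to_interval(ranks, M_f):
    """Map ranks 1..M_f onto [0, 1]: z_if = (r_if - 1) / (M_f - 1)."""
    return [(r - 1) / (M_f - 1) for r in ranks]

# e.g. a satisfaction scale low < medium < high (M_f = 3)
z = ordinal_to_interval([1, 2, 3], 3)   # -> [0.0, 0.5, 1.0]
```

After this mapping, the dissimilarities can be computed with the interval-scaled distance measures above.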
BINARY VARIABLES I.
Binary variables
has 2 outcomes 0/1, Y/N, …
Symmetric binary variable:
No preference on which outcome
should be coded 0 or 1
like gender
Asymmetric binary variable:
Outcomes are not equally important,
or based on one outcome the objects
are similar but based on the other
outcome we can’t tell
Like Has Mortgage or HIV positive
FEMALE MALE
FEMALE 0 1
MALE 1 0
Mortgage No Mortgage
Mortgage 0 1
No Mortgage 1 undef
Distances
BINARY VARIABLES II.
If we have several binary variables in the database, we can calculate the distance based on the contingency table
A contingency table for binary data:
Object j
1 0 SUM
Object i
1 a b a+b
0 c d c+d
SUM a+c b+d t
BINARY VARIABLES III.
Simple matching coefficient (invariant similarity, if the binary variable is symmetric):

d(i, j) = (b + c) / (a + b + c + d),  sim(i, j) = (a + d) / (a + b + c + d)

Jaccard coefficient (non-invariant similarity, if the binary variable is asymmetric):

d(i, j) = (b + c) / (a + b + c),  sim(i, j) = a / (a + b + c)
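Both coefficients follow directly from the contingency counts of two 0/1 vectors (a minimal sketch; the example vectors are made up):

```python
def binary_counts(x, y):
    """Contingency counts a, b, c, d for two 0/1 vectors."""
    a = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)
    b = sum(1 for p, q in zip(x, y) if p == 1 and q == 0)
    c = sum(1 for p, q in zip(x, y) if p == 0 and q == 1)
    d = sum(1 for p, q in zip(x, y) if p == 0 and q == 0)
    return a, b, c, d

def simple_matching_dist(x, y):
    """(b + c) / (a + b + c + d) -- for symmetric binary variables."""
    a, b, c, d = binary_counts(x, y)
    return (b + c) / (a + b + c + d)

def jaccard_dist(x, y):
    """(b + c) / (a + b + c) -- 0/0 matches (d) are ignored."""
    a, b, c, d = binary_counts(x, y)
    return (b + c) / (a + b + c)

i = [1, 0, 1, 1, 0]   # e.g. five binary attributes of customer i
j = [1, 1, 0, 1, 0]
```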
NOMINAL VARIABLES
Generalization of the binary variable in that it can take
more than 2 states, e.g., red, yellow, blue
Distance matrix
More variables
Method 1: simple matching
• m: # of matches, p: total # of variables
Method 2: use a large number of binary variables
• create new binary variable for each of the k nominal states
d(i, j) = (p − m) / p,  sim(i, j) = m / p
Distance Red Yellow Blue
Red 0 1 1
Yellow 1 0 1
Blue 1 1 0
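Both methods can be sketched in a few lines (the state list and helper names are my own):

```python
def simple_matching_sim(x, y):
    """Method 1: sim(i, j) = m / p, with m matches out of p variables."""
    m = sum(1 for a, b in zip(x, y) if a == b)
    return m / len(x)

def one_hot(value, states):
    """Method 2: encode one nominal value as k asymmetric binary variables."""
    return [1 if value == s else 0 for s in states]

STATES = ["red", "yellow", "blue"]
encoded = one_hot("yellow", STATES)   # -> [0, 1, 0]
```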
VARIABLES OF MIXED TYPES
Databases usually contain different types of variables
• symmetric binary, asymmetric binary, nominal, ordinal, interval
Approaches
1. Group each type of variable together, performing a separate cluster analysis for each type.
2. Bring different variables onto a common scale of the interval [0.0, 1.0], performing a single cluster analysis
WEIGHTED FORMULA
The weight δ_ij^(f) = 0
• if x_if or x_jf is missing
• or if x_if = x_jf = 0 and variable f is asymmetric binary
otherwise the weight δ_ij^(f) = 1
Another option is to choose the weights based on business aspects
d(i, j) = Σ_f δ_ij^(f) d_ij^(f) / Σ_f δ_ij^(f)  (sums over f = 1, …, p)
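The weighted formula can be sketched in the spirit of Gower's coefficient (the type encoding, helper names, and example values are my own assumptions, not from the slides):

```python
def mixed_distance(x, y, types):
    """Weighted mixed-type dissimilarity: sum(delta * d) / sum(delta).

    types[f] is 'nominal', 'asym_binary', or ('interval', range_f);
    per-variable distances follow the earlier slides, with interval
    variables normalised to [0, 1]. None marks a missing value.
    """
    num, den = 0.0, 0.0
    for xf, yf, t in zip(x, y, types):
        if xf is None or yf is None:                  # delta = 0: missing
            continue
        if t == "asym_binary" and xf == 0 and yf == 0:
            continue                                  # delta = 0: 0/0 match
        if t in ("nominal", "asym_binary"):
            d = 0.0 if xf == yf else 1.0
        else:                                         # ('interval', range_f)
            d = abs(xf - yf) / t[1]
        num += d
        den += 1.0
    return num / den

types = [("interval", 43.0), "nominal", "asym_binary"]  # age range 23..66
d = mixed_distance((48, "FEMALE", 0), (40, "MALE", 1), types)
```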
VECTOR OBJECTS:
COSINE SIMILARITY
Vector objects: keywords in documents, gene features in micro-arrays, …
Applications: information retrieval, biological taxonomy, ...
Cosine measure: if d1 and d2 are two vectors, then
cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
where · indicates the vector dot product and ||d|| is the length of vector d
Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1 · d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = 42^0.5 = 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = 6^0.5 = 2.449
cos(d1, d2) = 5 / (6.481 * 2.449) = 0.3150
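The worked example above can be checked in a few lines (a minimal sketch):

```python
import math

def cosine_similarity(d1, d2):
    """cos(d1, d2) = (d1 . d2) / (||d1|| ||d2||)."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
# dot = 5, ||d1|| = sqrt(42) ~ 6.481, ||d2|| = sqrt(6) ~ 2.449
sim = cosine_similarity(d1, d2)
```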
COMPLEX DATA TYPES
All non-relational objects => complex types of data
• examples: spatial data, location data, multimedia data,
genetic data, time-series data, text data and data
collected from the Web
We can define our own similarity or dissimilarity
measures beyond the previous ones
• this can, for example, mean using string and/or
sequence matching, or methods of information retrieval
CLUSTERING
METHODS
MAJOR CLUSTERING APPROACHES
Partitioning algorithms: Construct various partitions and then
evaluate them by some criterion
Hierarchy algorithms: Create a hierarchical decomposition of
the set of data (or objects) using some criterion
Density-based: based on connectivity and density functions
Grid-based: based on a multiple-level granularity structure
Model-based: A model is hypothesized for each of the clusters
and the idea is to find the best fit of the data to the given model
PARTITIONING
BASIC CONCEPT
Partitioning method: Construct a partition of a database D of n
objects into k clusters
• each cluster contains at least one object
• each object belongs to exactly one cluster
Given a k, find a partition of k clusters that optimizes the chosen
partitioning criterion (min distance from cluster centers)
• Global optimum: exhaustively enumerate all partitions – Stirling(n,k) of them
(S(10,3) = 9,330; S(20,3) = 580,606,446, …)
• Heuristic methods: k-means and k-medoids algorithms
• k-means: Each cluster is represented by the center of the cluster
• k-medoids or PAM (Partition around medoids): Each cluster is
represented by one of the objects in the cluster
PARTITIONING
K-MEANS ALGORITHM
Input: k clusters, n objects of database D.
Output: A set of k clusters which minimizes the squared-error function
Algorithm:
1. Choose k objects as the initial cluster centers
2. Assign each object to the cluster which has the closest mean
point (centroid) under squared Euclidean distance metric
3. When all objects have been assigned, recalculate the positions of
the k mean points (centroids)
4. Repeat Steps 2. and 3. until the centroids do not change any
more
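The four steps translate almost directly into Python. A minimal 2-D sketch, not a production implementation (the toy data and seed are my own; real tools add smarter initialisation such as k-means++):

```python
import random

def kmeans(objects, k, max_iter=100, seed=0):
    """Minimal k-means on 2-D points, following the four steps above."""
    rng = random.Random(seed)
    centroids = rng.sample(objects, k)               # 1. initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for x in objects:                            # 2. assign to closest centroid
            d2 = [(x[0] - c[0]) ** 2 + (x[1] - c[1]) ** 2 for c in centroids]
            clusters[d2.index(min(d2))].append(x)
        new = [(sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
               if cl else centroids[i]
               for i, cl in enumerate(clusters)]     # 3. recompute centroids
        if new == centroids:                         # 4. stop when unchanged
            break
        centroids = new
    return centroids, clusters

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(data, 2)
```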
PARTITIONING
K-MEANS ALGORITHM
Source: Clustering: A survey 2008, R. Capaldo F. Collovà
PARTITIONING
K-MEANS
+ Easy to implement
+ The k-means method is relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations. Normally, k, t << n.
- Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms
- Not applicable to categorical data
- Need to specify k, the number of clusters, in advance
- Unable to handle noisy data and outliers
- Not suitable for discovering clusters with non-convex shapes
To overcome some of these problems, the k-medoids method (PAM) was introduced
PARTITIONING
K-MEDOID ALGORITHM
The k-medoid or PAM (Partitioning Around Medoids) method is the
same as k-means, but instead of the mean it uses a medoid
m_q (q = 1, 2, …, k), an actual object that represents the cluster
the medoid is the most centrally located object in a cluster
Source: Clustering: A survey 2008, R. Capaldo F. Collovà
PARTITIONING
K-MEDOID OR PAM
+ PAM is more robust than K-means in the presence of noise and
outliers because a medoid is less influenced by outliers or other
extreme values than a mean
- PAM works efficiently for small data sets but does not scale well
to large data sets: O(k(n−k)²) per iteration, where n is the number
of objects and k the number of clusters
To overcome these problems, two extensions were introduced:
CLARA (Clustering LARge Applications) -> sampling-based method
CLARANS -> a clustering algorithm based on randomized search
PARTITIONING
CLARA
CLARA (Clustering LARge Applications) (Kaufmann and Rousseeuw,
1990) draws multiple samples of the dataset and applies PAM on each
sample in order to find the medoids.
+ Deals with larger data sets than PAM
+ Experiments show that 5 samples of size 40+2k give satisfactory
results
- Efficiency depends on the sample size, which is one more
parameter to determine
- A good clustering based on samples will not necessarily
represent a good clustering of the whole data set if the sample is
biased; drawing multiple samples mitigates this
PARTITIONING
CLARANS
CLARANS (CLustering Algorithm based on RANdomized Search) (Ng and Han’94)
A clustering method that draws a sample of neighbours dynamically
There are 2 parameters: maxneighbour, the maximum number of neighbours examined, and numlocal, the number of local minima obtained
The algorithm keeps searching for new neighbours and replaces the current setup with a lower-cost setup, until the number of examined neighbours reaches maxneighbour or the number of local minima obtained reaches numlocal
+ It is more efficient and scalable than both PAM and CLARA
+ returns higher quality clusters
+ has the benefit of not confining the search to a restricted area
- Depending on the parameters, it can be very time consuming (close to PAM)
HIERARCHICAL
BASIC CONCEPT
Hierarchical clustering
Construct a hierarchy of clusters, not just a single partition
of the objects
• Use distance matrix as clustering criteria
• Does not require the number of clusters as an input, but
needs a termination condition, e.g., a target number of clusters
or a distance threshold for merging
HIERARCHICAL
CLUSTERING TREE, DENDROGRAM
The hierarchy of clustering is given as a clustering tree or dendrogram
• leaves of the tree represent the individual objects
• internal nodes of the tree represent the clusters
Two main types of hierarchical clustering
• agglomerative (bottom-up)
• place each object in its own cluster (a singleton)
• merge in each step the two most similar clusters until there is only one cluster left or the termination condition is satisfied
• divisive (top-down)
• start with one big cluster containing all the objects
• divide the most distinctive cluster into smaller clusters and proceed until there are n clusters or the termination condition is satisfied
HIERARCHICAL
CLUSTER DISTANCE MEASURES
Single link (nearest neighbor). The distance between two clusters is determined by the distance of the two closest objects (nearest neighbors) in the different clusters.
Complete link (furthest neighbor). The distances between clusters are determined by the greatest distance between any two objects in the different clusters (i.e., by the "furthest neighbors").
Pair-group average link. The distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters.
Pair-group centroid. The distance between two clusters is determined as the distance between centroids.
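The single-link rule above can be sketched as a naive agglomerative loop (my own O(n³) toy implementation on 2-D points; real tools such as SciPy's `linkage` are far more efficient):

```python
def single_link(points, num_clusters):
    """Agglomerative clustering with single-link (nearest-neighbour) distance.

    Start from singletons and repeatedly merge the two closest clusters
    until num_clusters remain.
    """
    clusters = [[p] for p in points]                 # each object is a singleton

    def dist2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    def cluster_dist(c1, c2):                        # single link: closest pair
        return min(dist2(a, b) for a in c1 for b in c2)

    while len(clusters) > num_clusters:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]      # merge the two closest
        del clusters[j]
    return clusters

data = [(0, 0), (0, 1), (10, 10), (10, 11)]
result = single_link(data, 2)
```

Swapping `min` for `max` in `cluster_dist` would give complete link, and an average would give pair-group average link.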
HIERARCHICAL
EXAMPLE WITH DENDROGRAM
HIERARCHICAL
+ Conceptually simple
+ Theoretical properties are well understood
+ When clusters are merged/split, the decision is permanent => the
number of different alternatives that need to be examined is
reduced
- Merging/splitting of clusters is permanent => erroneous decisions
are impossible to correct later
- Divisive methods can be computationally hard
- Methods are not (necessarily) scalable for large data sets
EVALUATION
EVALUATION BASICS
Business
• Segment sizes
• Meaningful segments
Technical
• Compactness
• Separation
COMPACTNESS AND SEPARATION
Compactness
intra-cluster variance
Separation
inter-cluster distance
Sometimes the two measures lead to different results
[Chart: Dens_bw (separation) and Scatt_orig (compactness) vs. number of clusters (2–6)]
INDEX FUNCTIONS
Number of clusters
• by finding the minimum/maximum of an index function we can determine the optimal number of clusters
Comparing clustering methods
• using index functions we can compare the results of different clustering methods on the same database
[Charts: DB index and SD index for KM and TS vs. number of clusters (2–10)]
SAMPLE DATABASE
We generated a sample with 4 clusters
• 2 dimensions
• real values in the range (−10, 15)
• with outliers
TWO-STEP AND K-MEANS
CLUSTERING
Two-step vs. K-means results for k = 3, 4, 5, 6, 7, 8 [scatter-plot panels]
DB (DAVIES-BOULDIN) INDEX
For each cluster, the DB index takes its worst-case similarity ratio with
the most similar other cluster and then averages these over all clusters (lower is better)
[Charts: DB index for TS and for KM vs. number of clusters (2–10)]
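The definition can be turned into a small pure-Python function on 2-D points (a sketch of the standard Davies-Bouldin formula; the cluster assignments and toy partitions are my own):

```python
import math

def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters of 2-D points.

    For each cluster, take its worst (largest) ratio
    (s_i + s_j) / d(c_i, c_j) against any other cluster, then average.
    Lower values mean compact, well-separated clusters.
    """
    def centroid(c):
        return (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))

    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    cents = [centroid(c) for c in clusters]
    # s_i: average distance of cluster members to their centroid (scatter)
    scatter = [sum(dist(p, m) for p in c) / len(c)
               for c, m in zip(clusters, cents)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max((scatter[i] + scatter[j]) / dist(cents[i], cents[j])
                     for j in range(k) if j != i)
    return total / k

good = [[(0, 0), (0, 1)], [(10, 10), (10, 11)]]   # compact, well separated
bad = [[(0, 0), (10, 10)], [(0, 1), (10, 11)]]    # each cluster straddles both blobs
# the compact, well-separated partition scores lower
```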
S_DBW INDEX
2 components
• Dens_bw: cluster separation
• Scatt: the average variance of the
clusters divided by the variance of
all objects
[Charts: Dens_bw and Scatt components for TS and KM, and the combined S_Dbw index, vs. number of clusters (2–10)]
SD INDEX
2 components :
• Scatt: compactness of
the clusters
• Dis: Function of the
centroids of the clusters
We should know the
maximum number of
clusters
[Charts: SD index for TS and for KM vs. number of clusters (2–10)]
RS, RMSSTD INDEXES
RS (R-squared) = variance between clusters / total variance
RMSSTD (root-mean-square standard deviation)
= within-cluster variance
[Charts: RS and RS_diff, RMSSTD and RMSSTD_diff for KM and TS vs. number of clusters (2–10)]
BANK SEGMENTATION
Needs-based segmentation for new tariff plans
With 4 or 5 clusters we get one segment that is too big (ca. 60,000 customers)
Above 6 segments we cannot identify any further significant segments
Balance decrease is the cutting variable
[Charts: separation and diameter vs. number of clusters (2–10)]
BANK SEGMENTATION – INDEXES
[Charts: Dens_bw/Scatt_orig, DB, SD, RMSSTD, RS/RS_diff and RMSSTD/RMSSTD_diff indexes for TS (and KM) vs. number of clusters (2–10)]
Based on the indexes there are 4-6 really different segments
LITERATURE I.
J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, August 2000 (k-means, k-medoids/PAM, deterministic annealing, genetic algorithms).
L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990 (CLARA, AGNES, DIANA).
R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. VLDB'94 (CLARANS).
T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : an efficient data clustering method
for very large databases. SIGMOD'96 (BIRCH).
S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for large
databases. SIGMOD'98 (CURE).
LITERATURE II.
Karypis G., Eui-Hong Han, Kumar V. Chameleon: hierarchical clustering using dynamic modeling (CHAMELEON).
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. KDD'96 (DBSCAN).
M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points to identify the clustering structure, SIGMOD’99 (OPTICS).
A. Hinneburg and D. A. Keim: An Efficient Approach to Clustering in Large Multimedia Databases with Noise. Proceedings of the 4th ICKDDM, New York '98 (DENCLUE).
Abramowitz, M. and Stegun, I. A. (Eds.). "Stirling Numbers of the Second Kind." §24.1.4 in Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, 9th printing. New York: Dover, pp. 824-825, 1972.
P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison-Wesley, 2006.
THANK YOU!
GABOR VERESS
LYNX ANALYTICS