Transcript of Clustering training

Page 1: Clustering training

CLUSTERING

TUTORIAL

GABOR VERESS

2013.10.16

Page 2: Clustering training

2

CONTENTS

What is clustering?

Distance: Similarity and dissimilarity

Data types in cluster analysis

Clustering methods

Evaluation of clustering

Summary


Page 3: Clustering training

3

WHAT IS CLUSTERING?

Grouping of objects


Page 4: Clustering training

4

CLUSTERING I. (BY TYPE)

Fruit Veggie

Page 5: Clustering training

5

CLUSTERING II. (BY COLOR)

Yellow Green

Page 6: Clustering training

6

CLUSTERING III.

(BY SHAPE)

Ball

Chili shape

Longish Bushy

Page 7: Clustering training

7

ANOTHER CLUSTERING EXAMPLE


Page 8: Clustering training

8

IMAGE PROCESSING EXAMPLE

Figure from “Image and video segmentation: the normalised cut framework” by Shi and Malik, copyright IEEE, 1998

Page 9: Clustering training

9

YET ANOTHER EXAMPLE

Original Clustering 1 Clustering 2


Page 10: Clustering training

10

Item Cyan Magenta Yellow Black

Chili 72 0 51 57

Cucumber 11 0 45 19

Broccoli 15 0 23 31

Apple 25 0 74 20

Paprika 0 52 100 11

Lemon 0 20 93 5

Orange 0 18 65 3

Banana 0 1 100 1

CLUSTERING BY COLOR EXAMPLE

Page 11: Clustering training

11

Item Cyan Magenta Yellow Black Cluster

Chili 72 0 51 57 Cluster 1

Cucumber 11 0 45 19 Cluster 1

Broccoli 15 0 23 31 Cluster 1

Apple 25 0 74 20 Cluster 1

Paprika 0 52 100 11 Cluster 2

Lemon 0 20 93 5 Cluster 2

Orange 0 18 65 3 Cluster 2

Banana 0 1 100 1 Cluster 2

CLUSTERING BY COLOR EXAMPLE

Page 12: Clustering training

12

WHAT IS CLUSTERING?

Grouping of objects into classes in such a way that

• objects in the same cluster are similar

• objects in different clusters are dissimilar

Segmentation vs. Clustering

• Clustering is finding the borders between groups,

• Segmenting is using those borders to form groups

Clustering is the method of creating segments

Page 13: Clustering training

13

SUPERVISED VS. UNSUPERVISED

CLASSIFICATION VS. CLUSTERING

Classification – Supervised

Classes are predetermined: we know the labels in advance

For example, we have already diagnosed some disease

Or we know who has churned

Clustering – Unsupervised

Classes are not known in advance: we don’t know the labels in advance

For example, market behaviour segmentation

Or gene analysis

Page 14: Clustering training

14

APPLICATIONS OF CLUSTERING

Marketing: segmentation of customers based on behavior

Banking: ATM Fraud detection (outlier detection)

ATM classification: segmentation based on time series

Gene analysis: Identifying gene responsible for a disease

Chemistry: Periodic table of the elements

Image processing: identifying objects on an image (face detection)

Insurance: identifying groups of car insurance policy holders with a high average claim cost

Houses: identifying groups of houses according to their house type, value, and geographical location

Page 15: Clustering training

15

TYPICAL DATABASE


id age sex region income married children car save_act current_act mortgage pep

ID12101 48 FEMALE INNER_CITY 17,546 NO 1 NO NO NO NO YES

ID12102 40 MALE TOWN 30,085 YES 3 YES NO YES YES NO

ID12103 51 FEMALE INNER_CITY 16,575 YES 0 YES YES YES NO NO

ID12104 23 FEMALE TOWN 20,375 YES 3 NO NO YES NO NO

ID12105 57 FEMALE RURAL 50,576 YES 0 NO YES NO NO NO

ID12106 57 FEMALE TOWN 37,870 YES 2 NO YES YES NO YES

ID12107 22 MALE RURAL 8,877 NO 0 NO NO YES NO YES

ID12108 58 MALE TOWN 24,947 YES 0 YES YES YES NO NO

ID12109 37 FEMALE SUBURBAN 25,304 YES 2 YES NO NO NO NO

ID12110 54 MALE TOWN 24,212 YES 2 YES YES YES NO NO

ID12111 66 FEMALE TOWN 59,804 YES 0 NO YES YES NO NO

How do we define similarity or dissimilarity?

Especially for categorical variables?

Page 16: Clustering training

16

WHAT TO DERIVE FROM THE DATABASE?

id age sex region income married children car save_act current_act mortgage pep

ID12101 48 FEMALE INNER_CITY 17,546 NO 1 NO NO NO NO YES

ID12102 40 MALE TOWN 30,085 YES 3 YES NO YES YES NO

ID12103 51 FEMALE INNER_CITY 16,575 YES 0 YES YES YES NO NO

ID12104 23 FEMALE TOWN 20,375 YES 3 NO NO YES NO NO

ID12105 57 FEMALE RURAL 50,576 YES 0 NO YES NO NO NO

ID12106 57 FEMALE TOWN 37,870 YES 2 NO YES YES NO YES

ID12107 22 MALE RURAL 8,877 NO 0 NO NO YES NO YES

ID12108 58 MALE TOWN 24,947 YES 0 YES YES YES NO NO

ID12109 37 FEMALE SUBURBAN 25,304 YES 2 YES NO NO NO NO

ID12110 54 MALE TOWN 24,212 YES 2 YES YES YES NO NO

ID12111 66 FEMALE TOWN 59,804 YES 0 NO YES YES NO NO

Above: original database of the objects (customers)

Below: similarity or dissimilarity measure of the objects (dissimilarity of the customers)

id ID12101 ID12102 ID12103 ID12104 ID12105

ID12101 0 12 23 19 13

ID12102 12 0 25 13 17

ID12103 23 25 0 9 21

ID12104 19 13 9 0 12

ID12105 13 17 21 12 0
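To make the idea concrete, here is a minimal sketch (Python, pandas + SciPy) that derives such a dissimilarity matrix from the two numeric columns of the table; the column choice and the resulting numbers are illustrative and do not reproduce the matrix shown above:

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Hypothetical numeric extract of the customer table (age and income only)
df = pd.DataFrame(
    {"age": [48, 40, 51, 23, 57], "income": [17546, 30085, 16575, 20375, 50576]},
    index=["ID12101", "ID12102", "ID12103", "ID12104", "ID12105"],
)

# Standardize first, otherwise income dominates age (see the standardization slides)
z = (df - df.mean()) / df.std()

# Pairwise Euclidean distances, arranged as a square dissimilarity matrix
dist = pd.DataFrame(squareform(pdist(z.values, metric="euclidean")),
                    index=df.index, columns=df.index)
print(dist.round(2))
```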

Page 17: Clustering training

17

REQUIREMENTS OF CLUSTERING

• Ability to deal with different types of attributes

• Discovery of clusters with arbitrary shape

• Able to deal with noise and outliers

• Insensitive to order of input records

• High dimensionality

• Scalability

• Minimal requirements for domain knowledge to

determine input parameters

• Incorporation of user-specified constraints

• Interpretability and usability

Page 18: Clustering training

DISTANCE:

SIMILARITY AND

DISSIMILARITY

Page 19: Clustering training

19

SIMILARITY AND DISSIMILARITY

There is no single definition of similarity or

dissimilarity between data objects

The definition of similarity or dissimilarity between

objects depends on

• the type of the data considered

• what kind of similarity we are looking for


Page 20: Clustering training

20

DISTANCE MEASURE

Similarity/dissimilarity between objects is often

expressed in terms of a distance measure d(x,y)

Ideally, every distance measure should be a metric, i.e.,

it should satisfy the following conditions:

1. d(x,y) ≥ 0

2. d(x,y) = 0 if and only if x = y

3. d(x,y) = d(y,x)

4. d(x,z) ≤ d(x,y) + d(y,z)


Page 21: Clustering training

21

TYPE OF VARIABLES

Interval-scaled variables: Age

Binary variables: Car, Mortgage

Nominal, Ordinal, and Ratio variables

Variables of mixed types

Complex data types: Documents, GPS coordinates


id age sex region income married children car save_act current_act mortgage pep

ID12101 48 FEMALE INNER_CITY 17,546 NO 1 NO NO NO NO YES

ID12102 40 MALE TOWN 30,085 YES 3 YES NO YES YES NO

ID12103 51 FEMALE INNER_CITY 16,575 YES 0 YES YES YES NO NO

ID12104 23 FEMALE TOWN 20,375 YES 3 NO NO YES NO NO

ID12105 57 FEMALE RURAL 50,576 YES 0 NO YES NO NO NO

ID12106 57 FEMALE TOWN 37,870 YES 2 NO YES YES NO YES

ID12110 54 MALE TOWN 24,212 YES 2 YES YES YES NO NO

Page 22: Clustering training

22

INTERVAL-SCALED VARIABLES

Continuous measurements of a roughly linear scale

for example, age, weight and height

The measurement unit can affect the cluster analysis

To avoid dependence on the measurement unit, we should

standardize the data


Page 23: Clustering training

23

STANDARDIZATION

To standardize the measurements:

• calculate the mean absolute deviation

$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|\right)$

where $m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \dots + x_{nf}\right)$

• calculate the standardized measurement (z-score)

$z_{if} = \frac{x_{if} - m_f}{s_f}$
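A minimal sketch of this standardization in Python; note that the scale is the mean absolute deviation, not the standard deviation:

```python
import numpy as np

def standardize_mad(x):
    """z-score with the mean absolute deviation as scale: z_if = (x_if - m_f) / s_f."""
    x = np.asarray(x, dtype=float)
    m = x.mean()                      # m_f: mean of the variable
    s = np.abs(x - m).mean()          # s_f: mean absolute deviation
    return (x - m) / s

age = [48, 40, 51, 23, 57, 57, 22]    # ages from the sample database
print(np.round(standardize_mad(age), 2))
```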

Page 24: Clustering training

24

DISTANCE MEASURE I.

One group of popular distance measures for interval-scaled variables are the Minkowski distances

$d(i,j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \dots + |x_{ip} - x_{jp}|^q\right)^{1/q}$

where $i = (x_{i1}, x_{i2}, \dots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \dots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer

Page 25: Clustering training

25

DISTANCE MEASURES II.

If q = 1, the distance measure is the Manhattan (or city block) distance

$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|$

If q = 2, the distance measure is the Euclidean distance

$d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \dots + |x_{ip} - x_{jp}|^2}$

Page 26: Clustering training

26

EXAMPLE: DISTANCE MEASURES

Distance Matrix

point x y

p1 0 2

p2 2 0

p3 3 1

p4 5 1

Manhattan

Distance

p1 p2 p3 p4

p1 0 4 4 6

p2 4 0 2 4

p3 4 2 0 2

p4 6 4 2 0

Euclidean

Distance

p1 p2 p3 p4

p1 0 2.828 3.162 5.099

p2 2.828 0 1.414 3.162

p3 3.162 1.414 0 2

p4 5.099 3.162 2 0

(Scatter plot of the four points p1(0,2), p2(2,0), p3(3,1), p4(5,1) in the x–y plane.)
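A short NumPy sketch that reproduces both distance matrices above:

```python
import numpy as np

def minkowski(x, y, q):
    """Minkowski distance of order q: q=1 is Manhattan, q=2 is Euclidean."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return (np.abs(x - y) ** q).sum() ** (1.0 / q)

points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}
for q, name in [(1, "Manhattan"), (2, "Euclidean")]:
    print(name)
    for a, xa in points.items():
        row = [round(minkowski(xa, xb, q), 3) for xb in points.values()]
        print(" ", a, row)            # e.g. Euclidean row of p1: [0.0, 2.828, 3.162, 5.099]
```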

Page 27: Clustering training

27

WHY STANDARDIZATION?

Age and Income

Without standardization: Income >> Age, so there is no separation on age

With standardization: separation is based on both age and income

Page 28: Clustering training

28

RATIO-SCALED VARIABLES

A positive measurement on a nonlinear scale, approximately on an exponential scale: Ae^{Bt} or Ae^{-Bt}

Methods:

1. treat them like interval-scaled variables – not a good choice!

2. apply logarithmic transformation yif = log(xif)

3. treat them as continuous ordinal data and treat their rank as interval-scaled

4. create a better variable


Object On-net Off-net Ratio Log-Ratio On-net/Total

1 95 6 0.06 -1.20 94%

2 56 15 0.27 -0.57 79%

Dist 1-2 0.04 0.39 0.02

3 12 23 1.92 0.28 34%

4 12 29 2.42 0.38 29%

Dist 3-4 0.25 0.01 0.00
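The table can be reproduced with a few lines of Python; from the numbers shown, the “Dist” rows appear to be squared differences of the derived variables and the logarithm is base 10 (both are assumptions):

```python
import numpy as np

objects = {1: (95, 6), 2: (56, 15), 3: (12, 23), 4: (12, 29)}   # (on-net, off-net)

derived = {}
for oid, (on, off) in objects.items():
    ratio = off / on
    derived[oid] = (ratio, np.log10(ratio), on / (on + off))    # ratio, log-ratio, on-net share
    print(oid, [round(v, 2) for v in derived[oid]])

for a, b in [(1, 2), (3, 4)]:
    sq = [(x - y) ** 2 for x, y in zip(derived[a], derived[b])]
    print(f"Dist {a}-{b}", [round(v, 2) for v in sq])
# The raw ratio makes objects 3 and 4 look far apart (0.25) although their traffic
# mix is similar, while objects 1 and 2 look close (0.04) although theirs changed a lot.
```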

Page 29: Clustering training

29

ORDINAL VARIABLES

An ordinal variable can be discrete or continuous

Order of values is important, e.g., rank

Can be treated like interval-scaled

• replace $x_{if}$ by its rank $r_{if} \in \{1, \dots, M_f\}$

• map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

$z_{if} = \frac{r_{if} - 1}{M_f - 1}$

• compute the dissimilarity using methods for interval-scaled variables
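A small sketch of this rank-based mapping for a hypothetical ordinal variable with the value order low < medium < high:

```python
import pandas as pd

order = ["low", "medium", "high"]                      # assumed ordering of the categories
x = pd.Series(["low", "high", "medium", "high", "low"])

r = x.map({v: i + 1 for i, v in enumerate(order)})     # ranks r_if in {1, ..., M_f}
z = (r - 1) / (len(order) - 1)                         # z_if in [0, 1]
print(pd.DataFrame({"x": x, "rank": r, "z": z}))
```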

Page 30: Clustering training

30

BINARY VARIABLES I.

Binary variables

has 2 outcomes 0/1, Y/N, …

Symmetric binary variable:

No preference on which outcome

should be coded 0 or 1

like gender

Asymmetric binary variable:

Outcomes are not equally important,

or based on one outcome the objects

are similar but based on the other

outcome we can’t tell

Like Has Mortgage or HIV positive


FEMALE MALE

FEMALE 0 1

MALE 1 0

Mortgage No Mortgage

Mortgage 0 1

No Mortgage 1 undef

Distances

Page 31: Clustering training

31

BINARY VARIABLES II.

If we have several binary variables in the database, we can calculate the distance based on the contingency table

A contingency table for binary data


Object j

1 0 SUM

Object i

1 a b a+b

0 c d c+d

SUM a+c b+d t

Page 32: Clustering training

32

BINARY VARIABLES III.

Simple matching coefficient (invariant similarity, if the binary variable is symmetric):

$d(i,j) = \frac{b + c}{a + b + c + d}$, \quad $sim(i,j) = \frac{a + d}{a + b + c + d}$

Jaccard coefficient (non-invariant similarity, if the binary variable is asymmetric):

$d(i,j) = \frac{b + c}{a + b + c}$, \quad $sim(i,j) = \frac{a}{a + b + c}$

Object j

1 0 SUM

Object i

1 a b a+b

0 c d c+d

SUM a+c b+d t
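A minimal sketch that computes both coefficients from two 0/1 vectors via the a, b, c, d counts of this contingency table (the example vectors are made up):

```python
import numpy as np

def binary_dissimilarities(i, j):
    """Return (simple matching distance, Jaccard distance) for two 0/1 vectors."""
    i, j = np.asarray(i), np.asarray(j)
    a = np.sum((i == 1) & (j == 1))
    b = np.sum((i == 1) & (j == 0))
    c = np.sum((i == 0) & (j == 1))
    d = np.sum((i == 0) & (j == 0))
    simple_matching = (b + c) / (a + b + c + d)   # for symmetric binary variables
    jaccard = (b + c) / (a + b + c)               # for asymmetric binary variables
    return simple_matching, jaccard

# e.g. car, save_act, current_act, mortgage, pep encoded as 0/1 for two customers
print(binary_dissimilarities([0, 0, 0, 0, 1], [1, 0, 1, 1, 0]))   # -> (0.8, 1.0)
```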

Page 33: Clustering training

33

NOMINAL VARIABLES

Generalization of the binary variable in that it can take

more than 2 states, e.g., red, yellow, blue

Distance matrix

More variables

Method 1: simple matching

• m: # of matches, p: total # of variables

$d(i,j) = \frac{p - m}{p}$, \quad $sim(i,j) = \frac{m}{p}$

Method 2: use a large number of binary variables

• create new binary variable for each of the k nominal states

Distance Red Yellow Blue

Red 0 1 1

Yellow 1 0 1

Blue 1 1 0

Page 34: Clustering training

34

VARIABLES OF MIXED TYPES

Database usually contains different types of variables

• symmetric binary, asymmetric binary, nominal, ordinal, interval

Approaches

1. Group each type of variable together, performing a separate cluster analysis for each type.

2. Bring different variables onto a common scale of the interval [0.0, 1.0], performing a single cluster analysis

Page 35: Clustering training

35

WEIGHTED FORMULA

Weight δij (f) = 0

• if xif or xjf is missing

• or xif = xjf =0 and variable f is asymmetric binary,

Otherwise Weight δij (f) = 1

Another option is to choose the weights based on business aspects

$d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$
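A minimal sketch of this weighted combination for two objects with mixed variable types; the variable names, values and per-variable dissimilarities (absolute difference on a [0,1]-scaled interval variable, 0/1 matching otherwise) are illustrative assumptions:

```python
import math

def mixed_dissimilarity(x, y, types):
    """d(i,j) = sum_f delta_ij(f) * d_ij(f) / sum_f delta_ij(f)."""
    num = den = 0.0
    for f, t in types.items():
        xi, yj = x.get(f), y.get(f)
        if xi is None or yj is None:                       # missing value  -> delta = 0
            continue
        if t == "asym_binary" and xi == 0 and yj == 0:     # joint absence  -> delta = 0
            continue
        d_f = abs(xi - yj) if t == "interval" else (0.0 if xi == yj else 1.0)
        num += d_f                                         # delta = 1 otherwise
        den += 1.0
    return num / den if den else math.nan

types = {"age01": "interval", "sex": "sym_binary", "mortgage": "asym_binary"}
x = {"age01": 0.60, "sex": "F", "mortgage": 0}
y = {"age01": 0.42, "sex": "M", "mortgage": 0}
print(round(mixed_dissimilarity(x, y, types), 2))          # (0.18 + 1) / 2 = 0.59
```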

Page 36: Clustering training

36

VECTOR OBJECTS:

COSINE SIMILARITY

Vector objects: keywords in documents, gene features in micro-arrays, …

Applications: information retrieval, biologic taxonomy, ...

Cosine measure: If d1 and d2 are two vectors, then

cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),

where · indicates the vector dot product and ||d|| is the length of vector d

Example:

d1 = 3 2 0 5 0 0 0 2 0 0

d2 = 1 0 0 0 0 0 0 1 0 2

d1 · d2 = 3*1+2*0+0*0+5*0+0*0+0*0+0*0+2*1+0*0+0*2 = 5

||d1|| = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481

||d2|| = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)^0.5 = (6)^0.5 = 2.449

cos(d1, d2) = 0.3150
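The example can be verified with a few lines of NumPy:

```python
import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

cos = (d1 @ d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(d1 @ d2, np.linalg.norm(d1), np.linalg.norm(d2), round(cos, 4))
# 5  6.4807...  2.4495...  0.315
```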

Page 37: Clustering training

37

COMPLEX DATA TYPES

All non-relational objects => complex types of data

• examples: spatial data, location data, multimedia data, genetic data, time-series data, text data and data collected from the Web

We can define our own similarity or dissimilarity measures, different from the previous ones

• this can, for example, mean using string and/or sequence matching, or methods of information retrieval

Page 38: Clustering training

CLUSTERING

METHODS

Page 39: Clustering training

39

MAJOR CLUSTERING APPROACHES

Partitioning algorithms: Construct various partitions and then

evaluate them by some criterion

Hierarchy algorithms: Create a hierarchical decomposition of

the set of data (or objects) using some criterion

Density-based: based on connectivity and density functions

Grid-based: based on a multiple-level granularity structure

Model-based: A model is hypothesized for each of the clusters and the idea is to find the best fit of the data to the given model

Page 40: Clustering training

40

PARTITIONING

BASIC CONCEPT

Partitioning method: Construct a partition of a database D of n

objects into k clusters

• each cluster contains at least one object

• each object belongs to exactly one cluster

Given a k, find a partition of k clusters that optimizes the chosen

partitioning criterion (min distance from cluster centers)

• Global optimum: exhaustively enumerate all partitions – their number is the Stirling number S(n,k)

(S(10,3) = 9,330, S(20,3) = 580,606,446, …)

• Heuristic methods: k-means and k-medoids algorithms

• k-means: Each cluster is represented by the center of the cluster

• k-medoids or PAM (Partition around medoids): Each cluster is

represented by one of the objects in the cluster


Page 41: Clustering training

41

PARTITIONING

K-MEANS ALGORITHM

Input: k clusters, n objects of database D.

Output: A set of k clusters which minimizes the squared-error function

Algorithm:

1. Choose k objects as the initial cluster centers

2. Assign each object to the cluster whose mean point (centroid) is closest under the squared Euclidean distance metric

3. When all objects have been assigned, recalculate the positions of the k mean points (centroids)

4. Repeat Steps 2. and 3. until the centroids do not change any

more

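A minimal NumPy sketch of these four steps (for illustration only; in practice a library implementation such as scikit-learn's KMeans would be used):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Plain k-means: pick k objects as centers, assign, recompute, repeat."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]          # step 1
    for _ in range(max_iter):
        # step 2: assign each object to the closest centroid (squared Euclidean)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # step 3: recalculate the k centroids (keep the old one if a cluster is empty)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                        # step 4: stop
            break
        centers = new_centers
    return labels, centers

X = np.vstack([np.random.default_rng(s).normal(m, 0.5, (50, 2))
               for s, m in [(1, 0), (2, 5), (3, 10)]])               # toy data, 3 blobs
labels, centers = k_means(X, k=3)
print(centers.round(2))
```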

Page 42: Clustering training

42

PARTITIONING

K-MEANS ALGORITHM

Source: Clustering: A survey 2008, R. Capaldo F. Collovà


Page 43: Clustering training

43

PARTITIONING

K-MEANS

+ Easy to implement

+ The K-means method is relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations. Normally, k, t << n.

- Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms

- Not applicable to categorical data

- Need to specify k, the number of clusters, in advance

- Unable to handle noisy data and outliers

- Not suitable for discovering clusters with non-convex shapes

To overcome some of these problems, the K-medoids (PAM) method was introduced

Page 44: Clustering training

44

PARTITIONING

K-MEDOID ALGORITHM

The K-medoid or PAM (Partitioning Around Medoids) method is the same as k-means, but instead of the mean it uses the medoid m_q (q = 1, 2, …, k) as the most representative object of the cluster

The medoid is the most centrally located object in a cluster

Source: Clustering: A survey 2008, R. Capaldo F. Collovà

Page 45: Clustering training

45

PARTITIONING

K-MEDOID OR PAM

+ PAM is more robust than K-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean

- PAM works efficiently for small data sets but does not scale well for large data sets. In fact, each iteration costs O(k(n − k)²), where n is the number of objects and k the number of clusters

To overcome these problems, the following were introduced:

CLARA (Clustering LARge Applications) -> sampling-based method

CLARANS -> a clustering algorithm based on randomized search

Page 46: Clustering training

46

PARTITIONING

CLARA

CLARA (Clustering LARge Applications) (Kaufmann and Rousseeuw, 1990) draws multiple samples of the dataset and applies PAM on each sample in order to find the medoids.

+ Deals with larger data sets than PAM

+ Experiments show that 5 samples of size 40+2k give satisfactory results

- Efficiency depends on the sample size, which also has to be determined

- A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased; to avoid this we use multiple sampling

Page 47: Clustering training

47

PARTITIONING

CLARANS

CLARANS (CLustering Algorithm based on RANdomized Search) (Ng and Han’94)

A clustering method that draws a sample of neighbours dynamically

There are 2 parameters: maxneighbour, the maximum number of neighbours examined, and numlocal, the number of local minima obtained

The algorithm keeps searching for new neighbours and replaces the current setup with a lower-cost setup until the number of examined neighbours reaches maxneighbour or the number of local minima obtained reaches numlocal

+ It is more efficient and scalable than both PAM and CLARA

+ Returns higher quality clusters

+ Has the benefit of not confining the search to a restricted area

- Depending on the parameters it can be very time consuming (close to PAM)

Page 48: Clustering training

48

HIERARCHICAL

BASIC CONCEPT

Hierarchical clustering

Construct a hierarchy of clusters, not just a single partition of the objects

• Use the distance matrix as clustering criterion

• Does not require the number of clusters as an input, but needs a termination condition, e.g., a number of clusters or a distance threshold for merging

Page 49: Clustering training

49

HIERARCHICAL

CLUSTERING TREE, DENDROGRAM

The hierarchy of clustering is given as a clustering tree or dendrogram

• leaves of the tree represent the individual objects

• internal nodes of the tree represent the clusters

Two main types of hierarchical clustering

• agglomerative (bottom-up)

• place each object in its own cluster (a singleton)

• merge in each step the two most similar clusters until there is only one cluster left or the termination condition is satisfied

• divisive (top-down)

• start with one big cluster containing all the objects

• divide the most distinctive cluster into smaller clusters and proceed until there are n clusters or the termination condition is satisfied


Page 50: Clustering training

50

HIERARCHICAL

CLUSTER DISTANCE MEASURES

Single link (nearest neighbor). The distance between two clusters is determined by the distance of the two closest objects (nearest neighbors) in the different clusters.

Complete link (furthest neighbor). The distances between clusters are determined by the greatest distance between any two objects in the different clusters (i.e., by the "furthest neighbors").

Pair-group average link. The distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters.

Pair-group centroid (centroid link). The distance between two clusters is determined as the distance between their centroids.
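These four criteria map directly onto SciPy's agglomerative clustering; a minimal sketch, assuming SciPy is available:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(8, 1, (20, 2))])   # two toy groups

# 'single', 'complete', 'average' and 'centroid' correspond to the four measures above
for method in ["single", "complete", "average", "centroid"]:
    Z = linkage(X, method=method)                     # agglomerative merge tree (dendrogram)
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
    print(method, np.bincount(labels)[1:])            # cluster sizes, expected [20 20]
```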

Page 51: Clustering training

51

HIERARCHICAL

EXAMPLE WITH DENDROGRAM

Page 52: Clustering training

52

HIERARCHICAL

+ Conceptually simple

+ Theoretical properties are well understood

+ When clusters are merged/split, the decision is permanent => the number of different alternatives that need to be examined is reduced

- Merging/splitting of clusters is permanent => erroneous decisions are impossible to correct later

- Divisive methods can be computationally hard

- Methods are not (necessarily) scalable for large data sets

Page 53: Clustering training

EVALUATION

Page 54: Clustering training

54

EVALUATION BASICS

Business

• Segment sizes

• Meaningful segments

Technical

• Compactness

• Separation


Page 55: Clustering training

55

COMPACTNESS AND SEPARATION

Compactness

intra-cluster variance

Separation

inter-cluster distance

Sometimes the two measures lead to different results

(Chart: Dens_bw (separation) and Scatt_orig (compactness) for k = 2…6.)

Page 56: Clustering training

56

INDEX FUNCTIONS

Number of clusters

• By finding the minimum/maximum of an index function we can determine the optimal number of clusters

Comparing clustering methods

• Using the index functions we can compare the results of different clustering methods on the same database

(Charts: DB index and SD index for K-means (KM) and Two-step (TS), k = 2…10.)

Page 57: Clustering training

57

SAMPLE DATABASE

We generated a sample with 4 clusters

• 2 dimensions

• real values in the range (−10, 15)

• with outliers

Page 58: Clustering training

58

TWO-STEP AND K-MEANS

CLUSTERING

(Scatter plots: Two-step and K-means clustering results for k = 3…8.)

Page 59: Clustering training


DB (DAVIES-BOULDIN) INDEX

The DB index measures, for each cluster, its similarity to the closest other cluster and then takes the average of these values over all clusters (lower is better)

(Charts: DB index for K-means (KM) and Two-step (TS), k = 2…10.)
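A minimal sketch of how such a curve can be produced with scikit-learn's implementation of the DB index; the data is a synthetic 4-cluster sample, not the one on the slides:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

# Synthetic 2-D sample with 4 true clusters, roughly like the sample database above
X, _ = make_blobs(n_samples=1000, centers=4, cluster_std=1.5, random_state=0)

for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(davies_bouldin_score(X, labels), 3))
# The minimum of the DB curve suggests the number of clusters (expected around k = 4).
```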

Page 60: Clustering training

60

S_DBW INDEX

2 components

• Dens_bw: cluster separation

• Scatt: the average variance of the clusters divided by the variance of all objects

(Charts: Dens_bw, Scatt and S_Dbw for K-means (KM) and Two-step (TS), k = 2…10.)

Page 61: Clustering training

61

SD INDEX

2 components:

• Scatt: compactness of the clusters

• Dis: function of the centroids of the clusters

We should know the maximum number of clusters

(Charts: SD index for K-means (KM) and Two-step (TS), k = 2…10.)

Page 62: Clustering training

62

RS, RMSSTD INDEXES

RS (R-squared) = variance between clusters / total variance

RMSSTD (Root-mean-square standard deviation) = within-cluster variance

(Charts: RS, RS_diff, RMSSTD and RMSSTD_diff for K-means (KM) and Two-step (TS), k = 2…10.)

Page 63: Clustering training

63

SEGMENTATION IN A BANK

Needs-based segmentation for new tariff plans

When the number of clusters is 4 or 5, we get one segment that is too big (ca. 60,000 customers)

Above 6 segments we cannot identify any more significant segments

Balance decrease is the cutting variable

(Charts: Separation and Diameter vs. number of clusters, k = 2…10.)

Page 64: Clustering training

64

BANK SEGMENTATION – INDEXES

(Charts: Dens_bw/Scatt_orig, DB, SD, RS/RS_diff, RMSSTD and RMSSTD_diff indexes for the bank segmentation, k = 2…10, mostly for the Two-step (TS) clustering.)

Based on the indexes there are 4-6 really different segments

Page 65: Clustering training

65

LITERATURE I.

J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, August 2000 (k-means, k-medoids or PAM, deterministic annealing, genetic algorithms).

L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley & Sons, 1990 (CLARA, AGNES, DIANA).

R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. VLDB'94 (CLARANS).

T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: an efficient data clustering method for very large databases. SIGMOD'96 (BIRCH).

S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for large databases. SIGMOD'98 (CURE).

Page 66: Clustering training

66

LITERATURE II.

Karypis G., Eui-Hong Han, Kumar V. Chameleon: hierarchical clustering using dynamic modeling (CHAMELEON).

M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. KDD'96 (DBSCAN).

M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points to identify the clustering structure, SIGMOD’99 (OPTICS).

A. Hinneburg and D. A. Keim: An Efficient Approach to Clustering in Large Multimedia Databases with Noise. Proceedings of the 4th ICKDDM, New York '98 (DENCLUE).

Abramowitz, M. and Stegun, I. A. (Eds.). "Stirling Numbers of the Second Kind." §24.1.4 in Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, 9th printing. New York: Dover, pp. 824-825, 1972.

P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison-Wesley, 2006.

Page 67: Clustering training

THANK YOU!

GABOR VERESS

LYNX ANALYTICS