Cluster Analysis. Different types of data e.g. Continuous data:height Categorical data ordered...


Different types of data, e.g.

Continuous data : height

Categorical data

ordered (ordinal) : growth rate very slow, slow, medium, fast, very fast

not ordered (nominal) : fruit colour yellow, green, purple, red, orange

Binary data : fruit / no fruit

Similarity matrix

We define a similarity between units – like the correlation between continuous variables.

(can also be expressed as a dissimilarity or distance matrix)

A similarity can be constructed as an average of the similarities between the units on each variable.

(can use weighted average)

This provides a way of combining different types of variables.
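The averaging idea above can be sketched in a few lines. This is only an illustration in the spirit of Gower's general coefficient, not a method given on the slides: the function names, the range scaling for continuous variables, the match / no-match rule for categorical and binary variables, and the example plants are all assumptions.

```python
def variable_similarity(a, b, kind, value_range=None):
    """Similarity in [0, 1] between two units on a single variable."""
    if kind == "continuous":
        # Assumed scaling: 1 minus the absolute difference divided by the variable's range
        return 1.0 - abs(a - b) / value_range
    # Categorical and binary (assumed rule): 1 if the values match, otherwise 0
    return 1.0 if a == b else 0.0


def combined_similarity(u, v, kinds, ranges, weights=None):
    """Weighted average of the per-variable similarities between units u and v."""
    if weights is None:
        weights = [1.0] * len(kinds)  # plain average when no weights are given
    total = sum(
        w * variable_similarity(a, b, kind, r)
        for a, b, kind, r, w in zip(u, v, kinds, ranges, weights)
    )
    return total / sum(weights)


# Hypothetical units: (height in cm, fruit colour, fruit present)
kinds = ["continuous", "categorical", "binary"]
ranges = [100.0, None, None]  # assumed height range over all units
plant_a = (150.0, "red", 1)
plant_b = (125.0, "red", 0)

s_equal = combined_similarity(plant_a, plant_b, kinds, ranges)
s_weighted = combined_similarity(plant_a, plant_b, kinds, ranges, weights=[2.0, 1.0, 1.0])
print(round(s_equal, 3), round(s_weighted, 3))
```

Computing this for every pair of units fills in the similarity matrix; one minus each entry gives a corresponding dissimilarity.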

Distance metrics

relevant for continuous variables:

Euclidean

city block or Manhattan

(diagram: points A and B, with the distance between them drawn for each metric)

(also many other variations)
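As a small worked illustration of the two metrics, the sketch below computes both distances between two points. The coordinates chosen for A and B are made up for the example and are not taken from the slide.

```python
import math


def euclidean(p, q):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """City block distance: sum of the absolute differences on each variable."""
    return sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical points A and B measured on two continuous variables
A = (0.0, 0.0)
B = (3.0, 4.0)

d_euclid = euclidean(A, B)  # 5.0
d_city = manhattan(A, B)    # 7.0
print(d_euclid, d_city)
```

The city block distance is never smaller than the Euclidean distance, because it follows the axes rather than the straight line between the points.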

Similarity coefficients for binary data

simple matching

count if both units 0 or both units 1

Jaccard

count only if both units 1

(and many other variants)

simple matching can be extended to categorical data
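The two coefficients differ only in how they treat joint absences, which a short sketch makes concrete. The function names, the example vectors, and the choice to return 0 when neither unit has any 1s are assumptions made for illustration.

```python
def simple_matching(x, y):
    """Proportion of variables on which the two units agree (both 0 or both 1)."""
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)


def jaccard(x, y):
    """Joint presences divided by the variables where at least one unit is 1."""
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    # Assumed convention: similarity 0 when neither unit has any presences
    return both / either if either else 0.0


# Hypothetical presence / absence records for two units
x = [1, 1, 0, 0, 0, 0]
y = [1, 0, 0, 0, 0, 1]

sm = simple_matching(x, y)  # 4 agreements out of 6 variables
jc = jaccard(x, y)          # 1 joint presence out of 3 variables with any presence
print(round(sm, 3), round(jc, 3))

# Simple matching carries over to categorical values unchanged
sm_colour = simple_matching(["red", "green"], ["red", "yellow"])
```

Here the three shared zeros raise the simple matching score but are ignored by Jaccard, which is why Jaccard is often preferred when a shared absence carries little information.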

Clustering methods

hierarchical

divisive : put everything together and split (monothetic / polythetic)

agglomerative : keep everything separate and join the most similar points (classical cluster analysis)

non-hierarchical

k-means clustering
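The slides name k-means without describing it, so the following is only a minimal sketch of the usual textbook procedure (assign each point to its nearest centroid, then move each centroid to the mean of its points). The starting centroids, the fixed iteration count, and the data are assumptions for illustration.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and updating centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its points; leave it in place if its cluster is empty
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return clusters, centroids


# Hypothetical data: two loose groups of three points each
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
start = [(1.0, 1.0), (2.0, 1.0)]  # assumed starting centroids, deliberately both in one group

clusters, centroids = kmeans(points, start)
print(clusters)
```

Unlike the hierarchical methods, the number of clusters is fixed in advance and no tree is produced; the result can also depend on the starting centroids.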

Agglomerative hierarchical

Single linkage or nearest neighbour

finds the minimum spanning tree: the shortest tree that connects all points

prone to chaining

Complete linkage or furthest neighbour

compact clusters of approximately equal size

(makes compact groups even when none exist)

Average linkage methods

between single and complete linkage
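The three linkage rules can be compared with a naive agglomerative sketch. It is written for readability rather than speed, and the function names and the one-dimensional example data (points with steadily growing gaps) are assumptions chosen to make the chaining effect visible.

```python
import math


def cluster_distance(c1, c2, linkage):
    """Distance between two clusters under the chosen linkage rule."""
    pair_distances = [math.dist(p, q) for p in c1 for q in c2]
    if linkage == "single":    # nearest neighbour
        return min(pair_distances)
    if linkage == "complete":  # furthest neighbour
        return max(pair_distances)
    return sum(pair_distances) / len(pair_distances)  # average linkage


def agglomerate(points, n_clusters, linkage="single"):
    """Start with every point separate and repeatedly join the two closest clusters."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: cluster_distance(clusters[ab[0]], clusters[ab[1]], linkage),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Hypothetical points on a line with gaps of 1.0, 1.1, 1.2, 1.3 and 1.4
points = [(0.0,), (1.0,), (2.1,), (3.3,), (4.6,), (6.0,)]

single = agglomerate(points, 2, "single")
complete = agglomerate(points, 2, "complete")
average = agglomerate(points, 2, "average")

print(sorted(len(c) for c in single))    # [1, 5]
print(sorted(len(c) for c in complete))  # [2, 4]
```

On this data single linkage chains five points into one cluster and leaves the last point alone, while complete linkage splits the same points into more evenly sized groups of four and two.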