Cluster Analysis
Different types of data, e.g.
Continuous data : height
Categorical data
ordered (ordinal) : growth rate very slow, slow, medium, fast, very fast
not ordered (nominal) : fruit colour yellow, green, purple, red, orange
Binary data : fruit / no fruit
Similarity matrix
We define a similarity between units – like the correlation between continuous variables.
(also can be a dissimilarity or distance matrix)
A similarity can be constructed as an average of the similarities between the units on each variable.
(can use weighted average)
This provides a way of combining different types of variables.
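The averaging idea above can be sketched as follows. This is a minimal illustration, not a definitive method: the per-variable scores (range-scaled absolute difference for continuous variables, exact match for categorical and binary variables) are one common choice and are an assumption here, as are the function names and the example plants.

```python
def variable_similarity(a, b, kind, value_range=None):
    """Similarity in [0, 1] between two units on a single variable."""
    if kind == "continuous":
        # Assumed scaling: 1 minus the absolute difference divided by the range.
        return 1.0 - abs(a - b) / value_range
    # Categorical and binary variables: 1 if the values match, otherwise 0.
    return 1.0 if a == b else 0.0


def combined_similarity(unit_a, unit_b, kinds, ranges, weights=None):
    """Weighted average of the per-variable similarities."""
    if weights is None:
        weights = [1.0] * len(kinds)  # unweighted average by default
    scores = [
        variable_similarity(a, b, kind, rng)
        for a, b, kind, rng in zip(unit_a, unit_b, kinds, ranges)
    ]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


# Hypothetical plants: height (cm), fruit colour, fruit present (1) or absent (0).
kinds = ["continuous", "categorical", "binary"]
ranges = [100.0, None, None]  # assumed range of height across all units
plant_1 = [150.0, "red", 1]
plant_2 = [175.0, "red", 0]

similarity = combined_similarity(plant_1, plant_2, kinds, ranges)
print(round(similarity, 3))
```

Passing a `weights` list lets some variables count for more than others, which corresponds to the weighted average mentioned above.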
Distance metrics
relevant for continuous variables:
Euclidean
city block or Manhattan
[Figure: the two distances illustrated between points A and B]
(also many other variations)
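As a small sketch of the two distances named above, assuming each unit is a tuple of continuous measurements (the points A and B below are made-up values):

```python
import math


def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """City block distance: sum of the absolute differences on each variable."""
    return sum(abs(x - y) for x, y in zip(a, b))


A = (0.0, 0.0)
B = (3.0, 4.0)

print(euclidean(A, B))  # 5.0
print(manhattan(A, B))  # 7.0
```

The city block distance is never smaller than the Euclidean distance, because it follows the axes instead of cutting across.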
Similarity coefficients for binary data
simple matching
count if both units 0 or both units 1
Jaccard
count only if both units 1
(and many other variants)
simple matching can be extended to categorical data
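The two coefficients can be sketched as below, assuming each unit is a list of 0/1 values of equal length. Returning 0.0 from the Jaccard coefficient when neither unit has any 1 is a convention chosen for this illustration.

```python
def simple_matching(a, b):
    """Proportion of variables on which the two units agree (0-0 or 1-1)."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def jaccard(a, b):
    """Joint presences divided by variables where at least one unit is 1."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    # Assumed convention: similarity is 0.0 when neither unit has a 1.
    return both / either if either else 0.0


unit_1 = [1, 1, 0, 0, 1]
unit_2 = [1, 0, 0, 0, 1]

print(simple_matching(unit_1, unit_2))  # 0.8
print(jaccard(unit_1, unit_2))

# Simple matching needs only an equality test, so it also works on categories.
print(simple_matching(["red", "green"], ["red", "purple"]))  # 0.5
```

Jaccard ignores joint absences (0-0), which is useful when a shared absence says little about how alike two units are.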
Clustering methods
hierarchical
divisive
put everything together and split
monothetic / polythetic
agglomerative
keep everything separate and join the most similar points (classical cluster analysis)
non-hierarchical
k-means clustering
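The slides only name k-means, so the following is a minimal sketch of the standard alternating scheme (assign each point to its nearest centre, then move each centre to the mean of its points). The function name, the fixed number of iterations, and the starting centres are assumptions made for illustration.

```python
import math


def k_means(points, centres, n_iter=10):
    """Return the centres and groups after n_iter assign/update rounds."""
    groups = [[] for _ in centres]
    for _ in range(n_iter):
        # Assignment step: each point joins the group of its nearest centre.
        groups = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)),
                          key=lambda k: math.dist(p, centres[k]))
            groups[nearest].append(p)
        # Update step: each centre moves to the mean of its group.
        # An empty group keeps its previous centre.
        centres = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else centre
            for g, centre in zip(groups, centres)
        ]
    return centres, groups


# Made-up data with two well-separated groups and assumed starting centres.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centres, groups = k_means(points, centres=[(0, 0), (10, 10)])
print(centres)
print(groups)
```

Unlike the hierarchical methods, the number of clusters is fixed in advance, and the result can depend on the starting centres.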
Agglomerative hierarchical
Single linkage or nearest neighbour
finds the minimum spanning tree: shortest tree that connects all points
prone to chaining
Complete linkage or furthest neighbour
compact clusters of approximately equal size
(makes compact groups even when none exist)
Average linkage methods
between single and complete linkage
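The agglomerative procedure and the three linkage rules can be sketched as below. This is a naive illustration that assumes Euclidean distance between points and stops at a requested number of clusters; the function names and example data are hypothetical.

```python
import math


def linkage_distance(cluster_a, cluster_b, points, method):
    """Distance between two clusters, given as lists of point indices."""
    dists = [math.dist(points[i], points[j])
             for i in cluster_a for j in cluster_b]
    if method == "single":      # nearest neighbour
        return min(dists)
    if method == "complete":    # furthest neighbour
        return max(dists)
    if method == "average":     # mean of all between-cluster distances
        return sum(dists) / len(dists)
    raise ValueError(f"unknown linkage method: {method}")


def agglomerate(points, n_clusters, method="single"):
    """Start with every point separate and repeatedly join the closest pair."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage_distance(clusters[a], clusters[b], points, method)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return clusters


# Made-up points on a line: four close together, then two far away.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0), (11, 0)]
for method in ("single", "complete", "average"):
    print(method, agglomerate(points, n_clusters=2, method=method))
```

Only the definition of between-cluster distance changes from one method to the next; the joining loop is the same.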