Multivariate statistical methods: Cluster analysis

Page 1

Multivariate statistical methods

Cluster analysis

Page 2

Multivariate methods

multivariate dataset – a group of n objects described by m variables (as a rule n > m, if possible)

confirmatory vs. exploratory analysis:
confirmatory – aimed at parameter estimation and hypothesis testing
exploratory – aimed at exploring the data, finding patterns and structure

Page 3

Multivariate statistical methods

Unit classification:
- Cluster analysis
- Discriminant analysis

Analysis of relations among variables:
- Canonical correlation analysis
- Factor analysis
- Principal component analysis

Page 4

Unit classification methods

Page 5

Cluster analysis (CA)

the aim is to find groups of objects that are similar to one another and different from the objects in other groups

methods of cluster analysis:
- hierarchical
- nonhierarchical

Page 6

1. Hierarchical methods

creation of clusters at different levels (clusters at the highest level contain clusters at lower levels)

the results of hierarchical methods form a tree structure and are presented as a dendrogram

two things must be specified: a similarity measure and a clustering algorithm

Page 7

Hierarchical methods – expressing similarity

qualitative variables: number of identical values / number of all values

quantitative variables: Euclidean distance, Manhattan distance (Hamming distance), Chebyshev distance

Page 8

Similarity measures

Euclidean distance:
D_E(x_i, x_j) = sqrt( sum_{k=1..n} (x_ik - x_jk)^2 )

Manhattan (Hamming) distance:
D_H(x_i, x_j) = sum_{k=1..n} |x_ik - x_jk|

Chebyshev distance:
D_C(x_i, x_j) = max_k |x_ik - x_jk|

where x_ik and x_jk are the k-th coordinates of the objects x_i and x_j whose distance is measured in n-dimensional space, and n is the number of observed characteristics.
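The three distances above can be sketched in pure Python (the function names are my own, chosen for illustration):

```python
import math

def euclidean(x, y):
    # D_E(x, y) = sqrt( sum_k (x_k - y_k)^2 )
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # D_H(x, y) = sum_k |x_k - y_k|
    return sum(abs(a - b) for a, b in zip(x, y))

def chebyshev(x, y):
    # D_C(x, y) = max_k |x_k - y_k|
    return max(abs(a - b) for a, b in zip(x, y))

# For the points (0, 0) and (3, 4):
# Euclidean 5.0, Manhattan 7.0, Chebyshev 4.0
```

Note that for any pair of objects D_C <= D_E <= D_H.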

Page 9

Distance of objects in 2D

Loci of points at the same distance from a centre:
circle – Euclidean
inner (rotated) square – Manhattan (Hamming)
outer square – Chebyshev

Page 10

Other similarity measures

Power distance:
D(x_i, x_j) = ( sum_{k=1..n} |x_ik - x_jk|^p )^(1/r)

p and r are defined by the user; the higher p is, the greater the weight of larger distances, and hence the lower the significance of smaller distances; parameter r acts in the opposite direction.

1 - Pearson r:
D(x_i, x_j) = 1 - r(x_i, x_j)

unsuitable for a small number of dimensions

Percent disagreement:
D(x_i, x_j) = (number of k with x_ik ≠ x_jk) / n

suitable for categorical variables
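A minimal pure-Python sketch of these three measures, assuming sequences of equal length (names are illustrative):

```python
import math

def power_distance(x, y, p=2, r=2):
    # D(x, y) = ( sum_k |x_k - y_k|^p )^(1/r)
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / r)

def pearson_distance(x, y):
    # D(x, y) = 1 - r(x, y), r being the Pearson correlation coefficient
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

def percent_disagreement(x, y):
    # D(x, y) = (number of positions with x_k != y_k) / n
    return sum(a != b for a, b in zip(x, y)) / len(x)
```

With p = r = 2 the power distance reduces to the Euclidean distance.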

Page 11

Clustering algorithms

Nearest neighbour (single) linkage: the distance between two clusters is defined as the distance between their two nearest objects.

Furthest neighbour (complete) linkage: the distance between two clusters is defined as the distance between their two furthest objects.

Unweighted group average linkage: the distance between two clusters is defined as the average distance over all pairs of objects with one member from each cluster.

Weighted group average linkage: as the previous, but additionally takes cluster sizes (numbers of objects) into account as weights.
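The first three linkage rules can be sketched as follows, using Euclidean distance between objects (a pure-Python illustration, not a particular library's API):

```python
import math

def dist(x, y):
    # Euclidean distance between two objects
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def single_linkage(A, B):
    # nearest neighbour: distance of the two closest objects
    return min(dist(a, b) for a in A for b in B)

def complete_linkage(A, B):
    # furthest neighbour: distance of the two furthest objects
    return max(dist(a, b) for a in A for b in B)

def average_linkage(A, B):
    # unweighted group average: mean distance over all cross-cluster pairs
    return sum(dist(a, b) for a in A for b in B) / (len(A) * len(B))
```

For the same pair of clusters, single linkage always gives the smallest and complete linkage the largest of the three values.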

Page 12

Clustering algorithms (continued)

Unweighted centroid: the distance between two clusters is defined as the distance between their centroids. A centroid is the vector of averages (each coordinate is the average of the corresponding coordinates of the objects in the cluster).

Weighted centroid: as the previous, but additionally takes cluster sizes (numbers of objects) into account as weights.

Ward's method: unlike the previous methods, it uses analysis of variance to compute distances between clusters; at each step the clusters are merged so that the within-cluster sum of squares stays minimal.
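The unweighted centroid distance can be illustrated in a few lines (a sketch with hypothetical names, not from the slides):

```python
import math

def centroid(cluster):
    # the centroid is the vector of coordinate-wise averages
    n = len(cluster)
    return tuple(sum(p[k] for p in cluster) / n for k in range(len(cluster[0])))

def centroid_distance(A, B):
    # unweighted centroid method: Euclidean distance of the two centroids
    ca, cb = centroid(A), centroid(B)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ca, cb)))
```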

Page 13

2. Nonhierarchical methods

The most commonly used nonhierarchical method is the K-means algorithm, which is based on moving objects among clusters:
- the number of clusters is defined beforehand, randomly or according to the analyst's experience
- centroids are defined for all clusters
- in each step all objects are examined: if an object is nearest to its own centroid, it is left in its cluster; if not, it is moved to the cluster whose centroid is nearest (the within-cluster sum of squares should be minimal)
- this procedure is repeated until no object is moved; then we have the final solution

K-means does not work with a distance matrix, so it is suitable for clustering large numbers of objects.
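The procedure above can be sketched as a short pure-Python K-means (a didactic illustration under simplifying assumptions; real work would typically use a library implementation):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Basic K-means: pick k initial centroids from the data, then
    alternate between assigning objects to the nearest centroid and
    recomputing centroids, until no object moves."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # assignment step: each object joins the cluster of its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[j].append(p)
        # update step: recompute each centroid as the coordinate-wise average
        new = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:   # no object moved -> final solution
            break
        centroids = new
    return centroids, clusters

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
cents, groups = kmeans(pts, 2)
```

For the four points above the algorithm separates the two tight pairs, with centroids (0, 0.5) and (10, 10.5).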