Multivariate statistical methods Cluster analysis.
Multivariate statistical methods
Cluster analysis
Multivariate methods
multivariate dataset – a group of n objects described by m variables (as a rule n > m where possible).
confirmatory vs. exploratory analysis:
confirmatory – aimed at parameter estimation and hypothesis testing
exploratory – aimed at exploring the data, finding patterns and structure
Multivariate statistical methods
Unit classification: cluster analysis, discriminant analysis
Analysis of relations among variables: canonical correlation analysis, factor analysis, principal component analysis
Unit classification methods
Cluster analysis (CA)
the aim is to find groups of objects that are similar to each other and different from objects in other groups
methods of cluster analysis: hierarchical, nonhierarchical
1. Hierarchical methods
create clusters at different levels (clusters of a higher level contain clusters of lower levels)
the results form a tree structure and are presented as a dendrogram
two things must be specified: a similarity rate and a clustering algorithm
Hierarchical methods – similarity expression
qualitative variables: number of identical values / number of all values
quantitative variables: Euclidean distance, Manhattan distance (Hamming distance), Chebyshev distance
Similarity rates

Euclidean distance: D_E(x_i, x_j) = sqrt( Σ_{k=1..n} (x_ik − x_jk)² )

Manhattan (Hamming) distance: D_H(x_i, x_j) = Σ_{k=1..n} |x_ik − x_jk|

Chebyshev distance: D_C(x_i, x_j) = max_k |x_ik − x_jk|

where x_ik, x_jk are the k-th coordinates of the objects x_i and x_j whose distance is measured in n dimensions, and n is the number of observed characteristics.
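As a minimal sketch (assuming NumPy is available; the two sample vectors are purely illustrative), the three distance rates above can be computed directly:

```python
import numpy as np

def euclidean(xi, xj):
    # D_E = sqrt( sum_k (x_ik - x_jk)^2 )
    return np.sqrt(np.sum((xi - xj) ** 2))

def manhattan(xi, xj):
    # D_H = sum_k |x_ik - x_jk|
    return np.sum(np.abs(xi - xj))

def chebyshev(xi, xj):
    # D_C = max_k |x_ik - x_jk|
    return np.max(np.abs(xi - xj))

xi = np.array([1.0, 2.0, 3.0])
xj = np.array([4.0, 0.0, 3.0])
print(euclidean(xi, xj))   # sqrt(9 + 4 + 0) = sqrt(13) ≈ 3.6056
print(manhattan(xi, xj))   # 3 + 2 + 0 = 5
print(chebyshev(xi, xj))   # max(3, 2, 0) = 3
```

Note that D_C ≤ D_E ≤ D_H always holds, which matches the nested circle/square picture on the next slide.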
Distance of objects in 2D
Loci of points at equal distance from a centre: circle – Euclidean; inner square – Hamming (Manhattan); outer square – Chebyshev
Other types of similarity rates

Power distance: D(x_i, x_j) = ( Σ_{k=1..n} |x_ik − x_jk|^p )^{1/r}
parameters p and r are defined by the user; the higher p is, the more weight large distances carry (and the less small distances matter), while r acts in the opposite direction.

1 − Pearson r: D(x_i, x_j) = 1 − r(x_i, x_j)
unsuitable for a small number of dimensions.

Percent disagreement: D(x_i, x_j) = (number of x_ik ≠ x_jk) / n
suitable for categorical variables.
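These three rates can be sketched the same way (again assuming NumPy; the parameter defaults and sample arrays are illustrative only):

```python
import numpy as np

def power_distance(xi, xj, p=2, r=2):
    # D = ( sum_k |x_ik - x_jk|^p )^(1/r)
    # larger p emphasises big coordinate differences; r dampens the total
    return np.sum(np.abs(xi - xj) ** p) ** (1.0 / r)

def one_minus_pearson(xi, xj):
    # D = 1 - r(x_i, x_j), where r is the Pearson correlation coefficient
    return 1.0 - np.corrcoef(xi, xj)[0, 1]

def percent_disagreement(xi, xj):
    # D = (number of positions where x_ik != x_jk) / n, for categorical data
    return np.mean(xi != xj)

xi = np.array(["a", "b", "c", "a"])
xj = np.array(["a", "c", "c", "b"])
print(percent_disagreement(xi, xj))  # 2 mismatches out of 4 -> 0.5
```

With p = r = 2 the power distance reduces to the Euclidean distance, which is a quick sanity check for an implementation.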
Algorithms of clustering
Nearest neighbour (single) linkage: the distance between two clusters is defined as the distance of the two nearest objects
Furthest neighbour (complete) linkage: the distance between two clusters is defined as the distance of the two furthest objects
Unweighted group average linkage: the distance between two clusters is defined as the average distance over all pairs with one member from the first cluster and one from the second
Weighted group average linkage: as above, but additionally uses cluster sizes (numbers of objects) as weights
Unweighted centroid: the distance between two clusters is defined as the distance of their centroids. A centroid is a vector of averages (each coordinate is the average of the corresponding coordinates of the objects in the cluster)
Weighted centroid: as above, but additionally uses cluster sizes (numbers of objects) as weights
Ward's method: unlike the previous methods, the distance between clusters is computed via analysis of variance; clusters are merged so that the within-cluster sum of squares stays minimal
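Assuming SciPy is available, the linkage algorithms above map directly onto `scipy.cluster.hierarchy.linkage` method names; the two-blob dataset below is illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# two well-separated illustrative groups of 2D points
X = np.vstack([rng.normal(0, 0.3, (5, 2)),
               rng.normal(3, 0.3, (5, 2))])

# SciPy names: 'single' (nearest neighbour), 'complete' (furthest neighbour),
# 'average', 'weighted', 'centroid', 'ward'
Z = linkage(X, method="ward")

# cut the dendrogram into two clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # objects 0-4 and 5-9 should fall into two different clusters
```

Calling `scipy.cluster.hierarchy.dendrogram(Z)` on the same `Z` draws the tree-structured result mentioned above.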
2. Nonhierarchical methods
The most widely used is the K-means algorithm, based on moving objects among clusters:
the number of clusters is defined beforehand, randomly or according to the analyst's experience
a centroid is defined for each cluster
in each step all objects are examined: if an object is nearest to its own centroid, it stays in its cluster; otherwise it is moved to the cluster whose centroid is nearest, so that the within-cluster sum of squares stays minimal
this procedure is repeated until no object is moved; then we have the final solution
since no distance matrix needs to be stored, the K-means method is suitable for clustering a large number of objects
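The procedure described above can be sketched with plain NumPy (a minimal illustration, not a production implementation; the random initialisation and the sample blobs are assumptions):

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Minimal K-means sketch: k is fixed in advance, objects are moved
    to the nearest centroid until no object moves."""
    rng = np.random.default_rng(seed)
    # initial centroids: k randomly chosen objects
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(n_iter):
        # distance of every object to every centroid (Euclidean)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no object moved -> final solution
        labels = new_labels
        # recompute each centroid as the mean of its cluster
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, (5, 2)),
               rng.normal(5, 0.2, (5, 2))])
labels, centroids = k_means(X, k=2)
print(labels)  # the two blobs should receive two different labels
```

Note that only the k centroid positions are kept between iterations, which is why, unlike the hierarchical methods, no n × n distance matrix is needed.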