Lecture 6 Statistical Lecture ─ Cluster Analysis.


Transcript of Lecture 6 Statistical Lecture ─ Cluster Analysis.

Page 1: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Lecture 6

Statistical Lecture

─ Cluster Analysis

Page 2: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Cluster Analysis

• Grouping similar objects to produce a classification

• Useful when the structure of the data is unknown a priori

• Involves assessing the relative distances between points

Page 3: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Clustering Algorithms

• Partitioning :

Divide the data set into k clusters, where k must be specified beforehand, e.g. k-means.

Page 4: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Clustering Algorithms

• Hierarchical :

– Agglomerative methods :

Start with the situation where each object forms its own little cluster, and then successively merge clusters until only one large cluster is left

– Divisive methods :

Start by considering the whole data set as one cluster, and then split up clusters until each object is separate

Page 5: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Caution

• Most users are interested in the main structure of their data, consisting of a few large clusters

• When forming larger clusters, agglomerative methods may make wrong decisions in the early steps, and once an early step is wrong, everything built on it is wrong

• For divisive methods, the larger clusters are determined first, so they are less likely to suffer from wrong early steps

Page 6: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Agglomerative Hierarchical Clustering Procedure

(1) Each observation begins in a cluster by itself

(2) The two closest clusters are merged to form a new cluster that replaces the two old clusters

(3) Repeat (2) until only one cluster is left

The various clustering methods differ in how

the distance between two clusters is computed.
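This procedure can be sketched in a few lines of Python. The snippet below uses SciPy's scipy.cluster.hierarchy (an illustration only; the worked examples later in this lecture use SAS PROC CLUSTER, and the two point clouds here are made-up data):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (20, 2)),    # one cloud of points
                   rng.normal(5, 1, (20, 2))])   # a second, well-separated cloud

    # Steps (1)-(3): start with singletons and repeatedly merge the two
    # closest clusters; `method` chooses how cluster distance is computed.
    Z = linkage(X, method='average')   # also: 'single', 'complete', 'ward', 'centroid'

    # Cut the tree where only two clusters remain.
    labels = fcluster(Z, t=2, criterion='maxclust')
    print(labels)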

Page 7: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Remarks

• For coordinate data, variables with large variances tend to have more effect on the resulting clusters than those with small variances

• Scaling or transforming the variables may be needed

• Standardization (standardizing the variables to mean 0 and standard deviation 1) or principal components is useful but not always appropriate

• Outliers should be removed before analysis

Page 8: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Remarks (cont.)

• Nonlinear transformations of the variables may change the number of population clusters and should therefore be approached with caution

• For most applications, the variables should be transformed so that equal differences are of equal practical importance

• An interval scale of measurement is required if raw data are used as input. Ordinal or ranked coordinate data are generally not appropriate

Page 9: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Notation

n: number of observations

v: number of variables, if data are coordinates

G: number of clusters at any given level of the hierarchy

$x_i$: the ith observation

$C_k$: the kth cluster, a subset of {1, 2, …, n}

$N_k$: number of observations in $C_k$

Page 10: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Notation (cont.)

$\bar{x}$: the sample mean vector

$\bar{x}_k$: the mean vector for cluster $C_k$

$\|x\|$: the Euclidean length of the vector x, that is, the square root of the sum of the squares of the elements of x

$T = \sum_{i=1}^{n} \|x_i - \bar{x}\|^2$

$W_k = \sum_{i \in C_k} \|x_i - \bar{x}_k\|^2$

Page 11: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Notation (cont.)

$P_G = \sum W_j$, where the summation is over the G clusters at the Gth level of the hierarchy

$B_{kl} = W_m - W_k - W_l$, if $C_m = C_k \cup C_l$

$d(x, y)$: any distance or dissimilarity measure between observations or vectors x and y

$D_{kl}$: any distance or dissimilarity measure between clusters $C_k$ and $C_l$

Page 12: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Clustering Method ─ Average Linkage

The distance between two clusters is defined by

$$D_{kl} = \frac{1}{N_k N_l} \sum_{i \in C_k} \sum_{j \in C_l} d(x_i, x_j)$$

If $d(x, y) = \|x - y\|^2$, then

$$D_{kl} = \|\bar{x}_k - \bar{x}_l\|^2 + W_k/N_k + W_l/N_l$$

The combinatorial formula is

$$D_{jm} = \frac{N_k D_{jk} + N_l D_{jl}}{N_m}$$

if $C_m = C_k \cup C_l$

Page 13: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Average Linkage

• The distance between clusters is the average distance between pairs of observations, one in each cluster

• It tends to join clusters with small variance and is slightly biased toward producing clusters with the same variance
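The identity on the previous page is easy to verify numerically. A minimal check with NumPy on arbitrary made-up data, comparing the average of the squared pairwise distances against the centroid term plus the two within-cluster terms:

    import numpy as np

    rng = np.random.default_rng(1)
    A, B = rng.normal(0, 1, (5, 3)), rng.normal(2, 1, (7, 3))

    # average of ||x_i - x_j||^2 over all pairs, one observation per cluster
    lhs = np.mean([np.linalg.norm(a - b) ** 2 for a in A for b in B])

    # ||xbar_k - xbar_l||^2 + W_k/N_k + W_l/N_l
    Wa = np.sum(np.linalg.norm(A - A.mean(axis=0), axis=1) ** 2)
    Wb = np.sum(np.linalg.norm(B - B.mean(axis=0), axis=1) ** 2)
    rhs = (np.linalg.norm(A.mean(axis=0) - B.mean(axis=0)) ** 2
           + Wa / len(A) + Wb / len(B))

    print(np.isclose(lhs, rhs))   # True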

Page 14: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Centroid Method

The distance between two clusters is defined by

$$D_{kl} = \|\bar{x}_k - \bar{x}_l\|^2$$

If $d(x, y) = \|x - y\|^2$, then the combinatorial formula is

$$D_{jm} = \frac{N_k D_{jk} + N_l D_{jl}}{N_m} - \frac{N_k N_l D_{kl}}{N_m^2}$$

Page 15: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Centroid Method

• The distance between two clusters is defined as the squared Euclidean distance between their centroids or means

• It is more robust to outliers than most other hierarchical methods but in other respects may not perform as well as Ward’s method or average linkage

Page 16: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Complete Linkage

The distance between two clusters is defined by

$$D_{kl} = \max_{i \in C_k} \max_{j \in C_l} d(x_i, x_j)$$

The combinatorial formula is

$$D_{jm} = \max(D_{jk}, D_{jl})$$

Page 17: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Complete Linkage

• The distance between two clusters is the maximum distance between an observation in one cluster and an observation in the other cluster

• It is strongly biased toward producing clusters with roughly equal diameters and can be severely distorted by moderate outliers

Page 18: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Single Linkage

The distance between two clusters is defined by

$$D_{kl} = \min_{i \in C_k} \min_{j \in C_l} d(x_i, x_j)$$

The combinatorial formula is

$$D_{jm} = \min(D_{jk}, D_{jl})$$

Page 19: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Single Linkage

• The distance between two clusters is the minimum distance between an observation in one cluster and an observation in the other cluster

• It sacrifices performance in the recovery of compact clusters in return for the ability to detect elongated and irregular clusters

Page 20: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Ward’s Minimum-Variance Method

The distance between two clusters is defined by

$$D_{kl} = B_{kl} = \frac{\|\bar{x}_k - \bar{x}_l\|^2}{1/N_k + 1/N_l}$$

If $d(x, y) = \|x - y\|^2$, then the combinatorial formula is

$$D_{jm} = \frac{(N_j + N_k) D_{jk} + (N_j + N_l) D_{jl} - N_j D_{kl}}{N_j + N_m}$$

Page 21: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Ward’s Minimum-Variance Method

• The distance between two clusters is the ANOVA sum of squares between the two clusters added up over all the variables

• It tends to join clusters with a small number of observations

• It is strongly biased toward producing clusters with roughly the same number of observations

• It is also very sensitive to outliers

Page 22: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Assumptions for WMVM

• Multivariate normal mixture

• Equal spherical covariance matrices

• Equal sampling probabilities

Page 23: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Remarks

• Single linkage tends to lead to the formation of long straggly clusters

• Average linkage, complete linkage, and Ward's method often find spherical clusters even when the data appear to contain clusters of other shapes

Page 24: Lecture 6 Statistical Lecture ─ Cluster Analysis.

McQuitty’s Similarity Analysis

The combinatorial formula is

Median MethodIf d(x, y)=||x – y||2, then the combinatorial

formula is

2/jljkjm DDD

4/2/ kljljkjm DDDD
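All the combinatorial formulas on pages 12 to 24 have the same shape: when clusters k and l merge into m, the updated distance D(j, m) depends only on D(j, k), D(j, l), D(k, l) and the cluster sizes. A sketch gathering them into one helper (the function name is made up; the 'centroid', 'ward' and 'median' branches assume squared Euclidean d, as stated on the corresponding pages):

    def combinatorial_update(method, Djk, Djl, Dkl, Nj, Nk, Nl):
        # distance from cluster j to the merge m of clusters k and l
        Nm = Nk + Nl
        if method == 'average':
            return (Nk * Djk + Nl * Djl) / Nm
        if method == 'centroid':
            return (Nk * Djk + Nl * Djl) / Nm - Nk * Nl * Dkl / Nm ** 2
        if method == 'complete':
            return max(Djk, Djl)
        if method == 'single':
            return min(Djk, Djl)
        if method == 'ward':
            return ((Nj + Nk) * Djk + (Nj + Nl) * Djl - Nj * Dkl) / (Nj + Nm)
        if method == 'mcquitty':
            return (Djk + Djl) / 2
        if method == 'median':
            return (Djk + Djl) / 2 - Dkl / 4
        raise ValueError(method)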

Page 25: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Kth-nearest Neighbor Method

• Prespecify k

• Let $r_k(x)$ be the distance from point x to the kth nearest observation

• Consider a closed sphere centered at x with radius $r_k(x)$, say $S_k(x)$

Page 26: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Kth-nearest Neighbor Method

• The estimated density at x is defined by

$$\hat{f}(x) = \frac{\#\text{ of observations within } S_k(x)}{\text{Volume of } S_k(x)}$$

• For any two observations $x_i$ and $x_j$,

$$d^*(x_i, x_j) = \begin{cases} \dfrac{1}{2}\left(\dfrac{1}{\hat{f}(x_i)} + \dfrac{1}{\hat{f}(x_j)}\right) & \text{if } d(x_i, x_j) \le \max(r_k(x_i), r_k(x_j)) \\ \infty & \text{otherwise} \end{cases}$$
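A NumPy/SciPy transcription of these definitions (a sketch under the slide's formulas: Euclidean d is assumed, the function name is made up, and the volume of the closed v-dimensional sphere is taken as the standard $\pi^{v/2} r^v / \Gamma(v/2 + 1)$):

    import numpy as np
    from scipy.special import gamma

    def knn_dissimilarity(X, k):
        n, v = X.shape
        # pairwise Euclidean distances
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        # r_k(x): distance to the kth nearest other observation
        r = np.sort(D, axis=1)[:, k]                 # column 0 is the point itself
        # volume of the closed sphere S_k(x) in v dimensions
        vol = np.pi ** (v / 2) * r ** v / gamma(v / 2 + 1)
        # estimated density: observations within S_k(x) over its volume
        counts = (D <= r[:, None]).sum(axis=1) - 1   # exclude the point itself
        f = counts / vol
        # d*(x_i, x_j): finite only when the two observations are close enough
        close = D <= np.maximum(r[:, None], r[None, :])
        return np.where(close, 0.5 * (1 / f[:, None] + 1 / f[None, :]), np.inf)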

Page 27: Lecture 6 Statistical Lecture ─ Cluster Analysis.

K-Means Algorithm

• It is intended for use with large data sets, from approximately 100 to 100,000 observations

• With small data sets, the results may be highly sensitive to the order of the observations in the data set

• It combines an effective method for finding initial clusters with a standard iterative algorithm for minimizing the sum of squared distances from the cluster means

Page 28: Lecture 6 Statistical Lecture ─ Cluster Analysis.

K-Means Algorithm

• Specify the number of clusters, say k

• A set of k points called cluster seeds is selected as a first guess of the means of the k clusters

• Each observation is assigned to the nearest seed to form temporary clusters

• The seeds are then replaced by the means of the temporary clusters

• The process is repeated until no further changes occur in the clusters (a minimal sketch follows below)
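A direct transcription of these steps (an illustration only: the seeds here are simply the first k observations, whereas PROC FASTCLUS selects and replaces seeds as described on the next pages):

    import numpy as np

    def k_means(X, k, max_iter=100):
        seeds = X[:k].astype(float)                  # first guess at the k means
        for _ in range(max_iter):
            # assign each observation to the nearest seed
            d = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=-1)
            labels = d.argmin(axis=1)
            # replace the seeds by the means of the temporary clusters
            # (assumes no cluster ever becomes empty)
            new_seeds = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_seeds, seeds):        # no further changes: stop
                break
            seeds = new_seeds
        return labels, seeds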

Page 29: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Cluster Seeds

• Select the first complete (no missing values) observation as the first seed

• The next complete observation that is separated from the first seed by at least the prespecified distance becomes the second seed

• Later observations are selected as new seeds if they are separated from all previous seeds by at least the radius, as long as the maximum number of seeds is not exceeded

Page 30: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Cluster Seeds

If an observation is complete but fails to qualify as a new seed, two tests can be made to see if the observation can replace one of the old seeds

Page 31: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Cluster Seeds (cont.)

• An old seed is replaced if the distance between the observation and the closest seed is greater than the minimum distance between seeds. The seed that is replaced is selected from the two seeds that are closest to each other: of these two, the one replaced is the one with the shorter distance to the closest of the remaining seeds when the other seed of the pair is replaced by the current observation

Page 32: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Cluster Seeds (cont.)

• If the observation fails the first test for seed replacement, a second test is made. The observation replaces the nearest seed if the smallest distance from the observation to all seeds other than the nearest one is greater than the shortest distance from the nearest seed to all other seeds. If this test also fails, go on to the next observation.

Page 33: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Dissimilarity Matrices

An $n \times n$ dissimilarity matrix

$$\begin{pmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{pmatrix}$$

where $d(i, j) = d(j, i)$ measures the "difference" or dissimilarity between the objects i and j.

Page 34: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Dissimilarity Matrices

d usually satisfies

• d(i, i) = 0

• d(i, j) ≥ 0

• d(i, j) = d(j, i)

Page 35: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Dissimilarity

Interval-scaled variables: continuous measurements on a (roughly) linear scale (temperature, height, weight, etc.)

$$d(i, j) = \sqrt{\sum_{f=1}^{v} (x_{if} - x_{jf})^2} \quad \text{(Euclidean distance)}$$

$$d(i, j) = \sum_{f=1}^{v} |x_{if} - x_{jf}| \quad \text{(Manhattan distance)}$$
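Both dissimilarities, and the n × n matrix of page 33, can be computed with SciPy's pdist; an illustration on arbitrary coordinates:

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    X = np.array([[1.0, 2.0], [3.0, 1.0], [0.0, 5.0]])

    d_euclid = squareform(pdist(X, metric='euclidean'))   # sqrt of sum of squares
    d_manhat = squareform(pdist(X, metric='cityblock'))   # sum of absolute differences
    print(d_euclid)
    print(d_manhat)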

Page 36: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Dissimilarity (cont.)

• The choice of measurement units strongly affects the resulting clustering

• The variable with the largest dispersion will have the largest impact on the clustering

• If all variables are considered equally important, the data need to be standardized first

Page 37: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Standardization

• Mean absolute deviation (robust):

$$Z_{if} = \frac{x_{if} - m_f}{s_f}, \quad \text{where } m_f = \frac{1}{n}\sum_{i=1}^{n} x_{if} \text{ and } s_f = \frac{1}{n}\sum_{i=1}^{n} |x_{if} - m_f|$$

• Median absolute deviation (robust):

$$Z_{if} = \frac{x_{if} - m_f}{s_f}, \quad \text{where } m_f = \operatorname{median}_i\{x_{if}\} \text{ and } s_f = \frac{1}{n}\sum_{i=1}^{n} |x_{if} - m_f|$$

• Usual standard deviation:

$$Z_{if} = \frac{x_{if} - m_f}{s_f}, \quad \text{where } m_f = \frac{1}{n}\sum_{i=1}^{n} x_{if} \text{ and } s_f = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_{if} - m_f)^2}$$
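A sketch of the three standardizations, applied per variable (column); the function name and option strings are made up:

    import numpy as np

    def standardize(X, how='std'):
        # Z_if = (x_if - m_f) / s_f; the methods differ only in m_f and s_f
        if how == 'mean_abs_dev':        # m_f = mean, s_f = mean |x_if - m_f|
            m = X.mean(axis=0)
            s = np.mean(np.abs(X - m), axis=0)
        elif how == 'median_abs_dev':    # m_f = median, s_f = mean |x_if - m_f|
            m = np.median(X, axis=0)
            s = np.mean(np.abs(X - m), axis=0)
        else:                            # usual standard deviation, 1/(n-1)
            m = X.mean(axis=0)
            s = X.std(axis=0, ddof=1)
        return (X - m) / s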

Page 38: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Continuous Ordinal Variables

These are continuous measurements on an unknown scale, or where only the ordering is known but not the actual magnitude.

• Replace the $x_{if}$ by their ranks $r_{if} \in \{1, …, M_f\}$

• Transform the scale to [0, 1] as follows:

$$Z_{if} = \frac{r_{if} - 1}{M_f - 1}$$

• Compute the dissimilarities as for interval-scaled variables
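A sketch of the rank transform (scipy.stats.rankdata gives tied values their average rank, a detail the slide does not address; $M_f$ is taken here as the highest rank):

    import numpy as np
    from scipy.stats import rankdata

    def to_unit_interval(x):
        r = rankdata(x)                  # ranks 1, ..., M_f
        return (r - 1) / (r.max() - 1)   # rescale to [0, 1]

    print(to_unit_interval(np.array([3.1, 7.4, 0.2, 5.5])))
    # [0.3333... 1.  0.  0.6666...]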

Page 39: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Ratio-Scaled Variables

These are positive continuous measurements on a nonlinear scale, such as an exponential scale. One example would be the growth of a bacterial population (say, with a growth function $Ae^{Bt}$). Such variables can be handled in three ways:

• Simply as interval-scaled variables, though this is not recommended as it can distort the measurement scale

• As continuous ordinal data

• By first transforming the data (perhaps by taking logarithms), and then treating the results as interval-scaled variables

Page 40: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Discrete Ordinal Variables

A variable of this type has M possible values (scores), which are ordered.

The dissimilarities are computed in the same way as for continuous ordinal variables.

Page 41: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Nominal Variables

• Such a variable has M possible values, which are not ordered.

• The dissimilarity between objects i and j is usually defined as

$$d(i, j) = \frac{\#\text{ of variables taking different values for } i \text{ and } j}{\text{total number of variables}}$$
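In code this is just the fraction of variables on which the two objects disagree (a sketch; the helper name is made up):

    import numpy as np

    def nominal_dissimilarity(xi, xj):
        xi, xj = np.asarray(xi), np.asarray(xj)
        return np.mean(xi != xj)    # mismatches / total number of variables

    print(nominal_dissimilarity(['red', 'round', 'large'],
                                ['red', 'square', 'small']))   # 2/3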

Page 42: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Symmetric Binary Variables

Two possible values, coded 0 and 1, which are equally important (such as male and female). Consider the contingency table of the objects i and j:

         j=1   j=0
  i=1     a     b
  i=0     c     d

$$d(i, j) = \frac{b + c}{a + b + c + d}$$

Page 43: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Asymmetric Binary Variables

Two possible values, one of which carries more importance than the other. The most meaningful outcome is coded as 1, and the less meaningful outcome as 0. Typically, 1 stands for the presence of a certain attribute (e.g., a particular disease), and 0 for its absence.

Page 44: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Asymmetric Binary Variables

         j=1   j=0
  i=1     a     b
  i=0     c     d

$$d(i, j) = \frac{b + c}{a + b + c}$$
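A sketch covering both binary cases; the only difference is whether the d cell (joint absence) enters the denominator:

    import numpy as np

    def binary_dissimilarity(xi, xj, asymmetric=False):
        xi, xj = np.asarray(xi, bool), np.asarray(xj, bool)
        a = np.sum(xi & xj)       # both 1
        b = np.sum(xi & ~xj)      # i is 1, j is 0
        c = np.sum(~xi & xj)      # i is 0, j is 1
        d = np.sum(~xi & ~xj)     # both 0
        if asymmetric:
            return (b + c) / (a + b + c)       # joint absences ignored
        return (b + c) / (a + b + c + d)       # simple matching

    x, y = [1, 0, 1, 1, 0, 0], [1, 1, 0, 1, 0, 0]
    print(binary_dissimilarity(x, y))                    # 0.333...
    print(binary_dissimilarity(x, y, asymmetric=True))   # 0.5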

Page 45: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Cluster Analysis of Flying Mileages Between 10 American Cities

ATLANTA            0
CHICAGO          587    0
DENVER          1212  920    0
HOUSTON          701  940  879    0
LOS ANGELES     1936 1745  831 1374    0
MIAMI            604 1188 1726  968 2339    0
NEW YORK         748  713 1631 1420 2451 1092    0
SAN FRANCISCO   2139 1858  949 1645  347 2594 2571    0
SEATTLE         2182 1737 1021 1891  959 2734 2408  678    0
WASHINGTON D.C.  543  597 1494 1220 2300  923  205 2442 2329    0

Page 46: Lecture 6 Statistical Lecture ─ Cluster Analysis.

The CLUSTER Procedure
Average Linkage Cluster Analysis

Cluster History

NCL  Clusters Joined                 FREQ   PSF  PST2  Norm RMS Dist  Tie

 9   NEW YORK  WASHINGTON D.C.         2   66.7    .      0.1297
 8   LOS ANGELES  SAN FRANCISCO        2   39.2    .      0.2196
 7   ATLANTA  CHICAGO                  2   21.7    .      0.3715
 6   CL7  CL9                          4   14.5   3.4     0.4149
 5   CL8  SEATTLE                      3   12.4   7.3     0.5255
 4   DENVER  HOUSTON                   2   13.9    .      0.5562
 3   CL6  MIAMI                        5   15.5   3.8     0.6185
 2   CL3  CL4                          7   16.0   5.3     0.8005
 1   CL2  CL5                         10    .    16.0     1.2967

Root-Mean-Square Distance Between Observations = 1580.242

Page 47: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Average Linkage Cluster Analysis

Page 48: Lecture 6 Statistical Lecture ─ Cluster Analysis.

The CLUSTER Procedure
Centroid Hierarchical Cluster Analysis

Cluster History

NCL  Clusters Joined                 FREQ   PSF  PST2  Norm Cent Dist  Tie

 9   NEW YORK  WASHINGTON D.C.         2   66.7    .      0.1297
 8   LOS ANGELES  SAN FRANCISCO        2   39.2    .      0.2196
 7   ATLANTA  CHICAGO                  2   21.7    .      0.3715
 6   CL7  CL9                          4   14.5   3.4     0.3652
 5   CL8  SEATTLE                      3   12.4   7.3     0.5139
 4   DENVER  CL5                       4   12.4   2.1     0.5337
 3   CL6  MIAMI                        5   14.2   3.8     0.5743
 2   CL3  HOUSTON                      6   22.1   2.6     0.6091
 1   CL2  CL4                         10    .    22.1     1.173

Root-Mean-Square Distance Between Observations = 1580.242

Page 49: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Centroid Hierarchical Cluster Analysis

Page 50: Lecture 6 Statistical Lecture ─ Cluster Analysis.

The CLUSTER Procedure
Single Linkage Cluster Analysis

Cluster History

NCL  Clusters Joined                 FREQ  Norm Min Dist  Tie

 9   NEW YORK  WASHINGTON D.C.         2      0.1447
 8   LOS ANGELES  SAN FRANCISCO        2      0.2449
 7   ATLANTA  CL9                      3      0.3832
 6   CL7  CHICAGO                      4      0.4142
 5   CL6  MIAMI                        5      0.4262
 4   CL8  SEATTLE                      3      0.4784
 3   CL5  HOUSTON                      6      0.4947
 2   DENVER  CL4                       4      0.5864
 1   CL3  CL2                         10      0.6203

Mean Distance Between Observations = 1417.133

Page 51: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Single Linkage Cluster Analysis

Page 52: Lecture 6 Statistical Lecture ─ Cluster Analysis.

The CLUSTER Procedure
Ward's Minimum Variance Cluster Analysis

Cluster History

NCL  Clusters Joined                 FREQ  SPRSQ   RSQ   PSF  PST2  Tie

 9   NEW YORK  WASHINGTON D.C.         2   0.0019  .998  66.7    .
 8   LOS ANGELES  SAN FRANCISCO        2   0.0054  .993  39.2    .
 7   ATLANTA  CHICAGO                  2   0.0153  .977  21.7    .
 6   CL7  CL9                          4   0.0296  .948  14.5   3.4
 5   DENVER  HOUSTON                   2   0.0344  .913  13.2    .
 4   CL8  SEATTLE                      3   0.0391  .874  13.9   7.3
 3   CL6  MIAMI                        5   0.0586  .816  15.5   3.8
 2   CL3  CL5                          7   0.1488  .667  16.0   5.3
 1   CL2  CL4                         10   0.6669  .000    .   16.0

Root-Mean-Square Distance Between Observations = 1580.242

Page 53: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Ward's Minimum Variance Cluster Analysis

Page 54: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Fisher (1936) Iris Data

Initial Seeds

Cluster SepalLength SepalWidth PetalLength PetalWidth

1 43.00000000 30.00000000 11.00000000 1.00000000

2 77.00000000 26.00000000 69.00000000 23.00000000

Minimum Distance Between Initial Seeds = 70.85196

The FASTCLUS Procedure

Replace=FULL Radius=0 Maxclusters=2 Maxiter=10 Converge=0.02

Page 55: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Fisher (1936) Iris Data

The FASTCLUS Procedure

Replace=FULL Radius=0 Maxclusters=2 Maxiter=10 Converge=0.02

Iteration History

                         Relative Change in Cluster Seeds
Iteration  Criterion           1          2

1          11.0638        0.1904     0.3163
2           5.3780        0.0596     0.0264
3           5.0718        0.0174     0.00766

Convergence criterion is satisfied.

Criterion Based on Final Seeds = 5.0417

Page 56: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Fisher (1936) Iris Data

The FASTCLUS Procedure

Cluster Summary

                    RMS Std    Max Distance from    Radius    Nearest  Distance Between
Cluster  Frequency  Deviation  Seed to Observation  Exceeded  Cluster  Cluster Centroids

1        53         3.7050     21.1621                        2        39.2879
2        97         5.6779     24.6430                        1        39.2879

Page 57: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Fisher (1936) Iris Data

The FASTCLUS Procedure

Statistics for Variables

Variable Total STD Within STD R-Square RSQ/(1-RSQ)

SepalLength 8.28066 5.49313 0.562896 1.287784

SepalWidth 4.35866 3.70393 0.282710 0.394137

PetalLength 17.65298 6.80331 0.852470 5.778291

PetalWidth 7.62238 3.57200 0.781868 3.584390

OVER-ALL 10.69224 5.07291 0.776410 3.472463

Pseudo F Statistic = 513.92

Approximate Expected Over-All R-Squared = 0.51539

Cubic Clustering Criterion = 14.806

WARNING: The two above values are invalid for correlated variables.

Page 58: Lecture 6 Statistical Lecture ─ Cluster Analysis.

The pseudo F statistic is

$$F = \frac{R^2/(c-1)}{(1-R^2)/(n-c)}$$

where c is the number of clusters and n is the number of observations.
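As a check against the two-cluster FASTCLUS output above (c = 2, n = 150, over-all R² = 0.77641):

$$F = \frac{0.77641/(2-1)}{(1 - 0.77641)/(150 - 2)} = \frac{0.77641}{0.22359/148} \approx 513.9,$$

which matches the reported Pseudo F Statistic of 513.92 up to rounding of R².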

Page 59: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Fisher (1936) Iris Data

The FASTCLUS Procedure

Replace=FULL Radius=0 Maxclusters=2 Maxiter=10 Converge=0.02

Cluster Means

Cluster SepalLength SepalWidth PetalLength PetalWidth

1 50.05660377 33.69811321 15.60377358 2.90566038

2 63.01030928 28.86597938 49.58762887 16.95876289

Cluster Standard Deviations

Cluster SepalLength SepalWidth PetalLength PetalWidth

1 3.427350930 4.396611045 4.404279486 2.105525249

2 6.336887455 3.267991438 7.800577673 4.155612484

Page 60: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Fisher (1936) Iris Data

The FREQ Procedure

Table of CLUSTER by Species
(each cell: Frequency, Percent, Row Pct, Col Pct)

CLUSTER  |  Setosa                  |  Versicolor              |  Virginica               |  Total

1        |  50  33.33  94.34 100.00 |   3   2.00   5.66   6.00 |   0   0.00   0.00   0.00 |   53  35.33
2        |   0   0.00   0.00   0.00 |  47  31.33  48.45  94.00 |  50  33.33  51.55 100.00 |   97  64.67

Total    |  50  33.33               |  50  33.33               |  50  33.33               |  150 100.00

Page 61: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Fisher (1936) Iris Data

The FASTCLUS Procedure

Replace=FULL Radius=0 Maxclusters=3 Maxiter=10 Converge=0.02

Initial Seeds

Cluster  SepalLength  SepalWidth   PetalLength  PetalWidth

1        58.00000000  40.00000000  12.00000000   2.00000000
2        77.00000000  38.00000000  67.00000000  22.00000000
3        49.00000000  25.00000000  45.00000000  17.00000000

Minimum Distance Between Initial Seeds = 38.23611

Page 62: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Fisher (1936) Iris Data

The FASTCLUS Procedure

Replace=FULL Radius=0 Maxclusters=3 Maxiter=10 Converge=0.02

Iteration History

                        Relative Change in Cluster Seeds
Iteration  Criterion          1         2         3

1          6.7591         0.2652    0.3205    0.2985
2          3.7097         0         0.0459    0.0317
3          3.6427         0         0.0182    0.0124

Convergence criterion is satisfied.

Criterion Based on Final Seeds = 3.6289

Page 63: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Fisher (1936) Iris Data

Cluster Summary

                    RMS Std    Max Distance from    Radius    Nearest  Distance Between
Cluster  Frequency  Deviation  Seed to Observation  Exceeded  Cluster  Cluster Centroids

1        50         2.7803     12.4803                        3        33.5693
2        38         4.0168     14.9736                        3        17.9718
3        62         4.0398     16.9272                        2        17.9718

Page 64: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Fisher (1936) Iris Data

Statistics for Variables

Variable Total STD Within STD R-Square RSQ/(1-RSQ)

SepalLength 8.28066 4.39488 0.722096 2.598359

SepalWidth 4.35866 3.24816 0.452102 0.825156

PetalLength 17.65298 4.21431 0.943773 16.784895

PetalWidth 7.62238 2.45244 0.897872 8.791618

OVER-ALL 10.69224 3.66198 0.884275 7.641194

Pseudo F Statistic = 561.63

Approximate Expected Over-All R-Squared = 0.62728

Cubic Clustering Criterion = 25.021

WARNING: The two above values are invalid for correlated variables.

Page 65: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Fisher (1936) Iris Data

The FASTCLUS Procedure

Replace=FULL Radius=0 Maxclusters=3 Maxiter=10 Converge=0.02

Cluster Means

Cluster  SepalLength  SepalWidth   PetalLength  PetalWidth

1        50.06000000  34.28000000  14.62000000   2.46000000
2        68.50000000  30.73684211  57.42105263  20.71052632
3        59.01612903  27.48387097  43.93548387  14.33870968

Cluster Standard Deviations

Cluster  SepalLength  SepalWidth   PetalLength  PetalWidth

1        3.524896872  3.790643691  1.736639965  1.053855894
2        4.941550255  2.900924461  4.885895746  2.798724562
3        4.664100551  2.962840548  5.088949673  2.974997167

Page 66: Lecture 6 Statistical Lecture ─ Cluster Analysis.

Fisher (1936) Iris Data

The FREQ Procedure

Table of CLUSTER by Species
(each cell: Frequency, Percent, Row Pct, Col Pct)

CLUSTER  |  Setosa                   |  Versicolor              |  Virginica               |  Total

1        |  50  33.33 100.00 100.00  |   0   0.00   0.00   0.00 |   0   0.00   0.00   0.00 |   50  33.33
2        |   0   0.00   0.00   0.00  |   2   1.33   5.26   4.00 |  36  24.00  94.74  72.00 |   38  25.33
3        |   0   0.00   0.00   0.00  |  48  32.00  77.42  96.00 |  14   9.33  22.58  28.00 |   62  41.33

Total    |  50  33.33                |  50  33.33               |  50  33.33               |  150 100.00