
Math 5364 Notes
Chapter 8: Cluster Analysis

Jesse Crawford

Department of Mathematics
Tarleton State University

Today's Topics

• Overview of Cluster Analysis

• K-means clustering

What is Cluster Analysis?

• Dividing objects into clusters
• Distances within clusters are small
• Distances between clusters are large

• Training data has no class labels!

• Cluster analysis is also called unsupervised classification

Cluster Centers

• Cluster centers: prototypes, centroids, medoids

Purposes of Cluster Analysis

• Understanding
  • Biology: Divide organisms into different classes (kingdom, phylum, class, etc.)
  • Business: Divide customers into clusters for marketing purposes
  • Weather: Identify patterns in atmosphere and ocean

Purposes of Cluster Analysis

• Utility
  • Replace data points with cluster centers for summarization/compression

K-Means Clustering

K-Means Algorithm

• Select K initial centroids
• Repeat the following:
  • Form K clusters (assign each point to the closest centroid)
  • Recompute the centroid of each cluster
• Stop when the centroids converge

Forming the clusters requires a distance metric (example: Euclidean distance). How the centroid is recomputed depends on that metric (example: centroid = mean for Euclidean distance).
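A minimal NumPy sketch of the algorithm above (illustrative only, not code from the notes; it assumes Euclidean distance and does not handle empty clusters):

import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Basic K-means with Euclidean distance."""
    rng = np.random.default_rng(seed)
    # Select K initial centroids at random from the data points
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # Form K clusters: assign each point to the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute the centroid (mean) of each cluster
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Stop when the centroids converge
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids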

[Figures: K-means on a two-dimensional data set (x and y axes), showing the cluster assignments and centroids at Iterations 1 through 6.]

Sums of Squares for K-Means

Notation

• $K$ = number of clusters
• $N_k$ = number of points in the $k$th cluster
• $x_{ki}$ = $i$th point in the $k$th cluster
• $\bar{x}_k$ = mean of the $k$th cluster
• $\bar{x}$ = mean of all points

Total SS

$$\mathrm{SST} = \sum_{k=1}^{K} \sum_{i=1}^{N_k} \| x_{ki} - \bar{x} \|^2$$

Within SS

$$\mathrm{SSW} = \sum_{k=1}^{K} \sum_{i=1}^{N_k} \| x_{ki} - \bar{x}_k \|^2$$

Between SS

$$\mathrm{SSB} = \sum_{k=1}^{K} N_k \| \bar{x}_k - \bar{x} \|^2$$

Total SS = Within SS + Between SS:

$$\begin{aligned}
\mathrm{SST} &= \sum_{k=1}^{K} \sum_{i=1}^{N_k} \| x_{ki} - \bar{x} \|^2 \\
&= \sum_{k=1}^{K} \sum_{i=1}^{N_k} \| (x_{ki} - \bar{x}_k) + (\bar{x}_k - \bar{x}) \|^2 \\
&= \sum_{k=1}^{K} \sum_{i=1}^{N_k} \| x_{ki} - \bar{x}_k \|^2
 + 2 \sum_{k=1}^{K} \sum_{i=1}^{N_k} (x_{ki} - \bar{x}_k)'(\bar{x}_k - \bar{x})
 + \sum_{k=1}^{K} \sum_{i=1}^{N_k} \| \bar{x}_k - \bar{x} \|^2 \\
&= \mathrm{SSW} + 0 + \sum_{k=1}^{K} N_k \| \bar{x}_k - \bar{x} \|^2
 = \mathrm{SSW} + \mathrm{SSB},
\end{aligned}$$

where the cross term vanishes because $\sum_{i=1}^{N_k} (x_{ki} - \bar{x}_k) = 0$ for each $k$.

Goal of K-means: Minimize Within SS

Equivalent goal: Maximize Between SS
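The decomposition is easy to check numerically. An illustrative sketch, reusing the kmeans function above on simulated data (the data set is assumed for illustration):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 2))
K = 3
labels, centroids = kmeans(X, K)

xbar = X.mean(axis=0)                                   # mean of all points
means = np.array([X[labels == k].mean(axis=0) for k in range(K)])  # cluster means
SST = ((X - xbar) ** 2).sum()
SSW = sum(((X[labels == k] - means[k]) ** 2).sum() for k in range(K))
SSB = sum((labels == k).sum() * ((means[k] - xbar) ** 2).sum() for k in range(K))
print(SST, SSW + SSB)   # the two agree up to floating-point rounding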

A Problem with K-Means

• Different initial centroids can result in different clusterings.

• Some choices of initial centroids may lead only to a local minimum.

• Possible solution: Repeat with randomly chosen initial centroids.

• Let m = number of repetitions. If the K true clusters are equally sized, the probability that at least one repetition picks one initial centroid from each cluster is approximately $1 - \left(1 - \frac{K!}{K^K}\right)^m$.
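A sketch of the repeated-initialization fix, reusing the kmeans function above and keeping the run with the smallest Within SS:

import numpy as np

def kmeans_restarts(X, K, m=10):
    """Run K-means m times from random starts; keep the clustering with smallest SSW."""
    best_ssw, best = np.inf, None
    for seed in range(m):
        labels, centroids = kmeans(X, K, seed=seed)
        ssw = sum(((X[labels == k] - centroids[k]) ** 2).sum() for k in range(K))
        if ssw < best_ssw:
            best_ssw, best = ssw, (labels, centroids)
    return best_ssw, best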

Today's Topics

• Cluster Evaluation

• Unsupervised Evaluation Measures
  • SSW
  • Silhouette Coefficient

• Supervised Evaluation Measures
  • Entropy
  • Purity

• Significance Tests

Unsupervised Evaluation Measures

• Do not use class labels

• SSW = Within Sum of Squares

• Silhouette Coefficient

Interpreting SSW

• SSW → 0 as K → ∞

• SSW = 0 when K = number of points in the data set

• Solution: Look for an "elbow" in the plot of SSW vs. K

• Optimal value in the example plot: K = 3
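A sketch of the elbow plot, reusing the kmeans_restarts function and the simulated X from earlier (matplotlib is an assumed dependency):

import matplotlib.pyplot as plt

Ks = range(1, 11)
ssws = [kmeans_restarts(X, K)[0] for K in Ks]   # SSW for each candidate K
plt.plot(Ks, ssws, marker="o")
plt.xlabel("K")
plt.ylabel("SSW")
plt.show()   # look for the K where the curve bends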

Silhouette Coefficient

1. For the ith data object, calculate its average distance to all other objects in its cluster. Call this value $a_i$.

2. For the ith data object and any cluster not containing that object, calculate the object's average distance to all the objects in the given cluster.

3. The minimum value from Step 2 is called $b_i$.

4. For the ith object, the silhouette coefficient is

$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$$

Silhouette Coefficient

• $-1 \le s_i \le 1$

• The silhouette coefficient for the clustering is the average of the individual silhouette coefficients:

$$s = \frac{1}{n} \sum_{i=1}^{n} s_i$$

• Silhouette coefficients near 1 indicate strong clustering.
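In practice these can be computed with scikit-learn (an assumed dependency, not part of the notes), using the X and labels from the earlier sketches:

from sklearn.metrics import silhouette_samples, silhouette_score

s_i = silhouette_samples(X, labels)   # one silhouette coefficient per object
s = silhouette_score(X, labels)       # their average: the coefficient for the clustering
print(s_i.min(), s_i.max(), s)        # each s_i lies in [-1, 1]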

Distance Matrix for a Data Set

• Suppose $X$ is an $n \times p$ data matrix

• $X_i$ = $i$th row of $X$

• The distance matrix $D$ is $n \times n$, with $D_{ij} = \| X_i - X_j \|$

Statistical Significance of the Silhouette Coefficient

• Generate 100 uniform data sets with the same data range and sample size as the original data.

• Calculate the silhouette coefficient for each uniform sample.

• Find the percentile rank of the silhouette coefficient for the original data among the randomly generated ones.

• If the percentile rank is at least 95% (a test at the 0.05 significance level), there is statistically significant evidence of clustering (we can reject the null hypothesis of no clustering).
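A sketch of this simulation, assuming scikit-learn's KMeans and silhouette_score for the clustering and scoring steps:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_percentile(X, K, n_sim=100, seed=0):
    """Percentile rank of the data's silhouette coefficient among uniform data sets."""
    rng = np.random.default_rng(seed)
    score = lambda Z: silhouette_score(Z, KMeans(n_clusters=K, n_init=10).fit_predict(Z))
    s_obs = score(X)
    lo, hi = X.min(axis=0), X.max(axis=0)   # same data range and sample size as X
    s_sim = [score(rng.uniform(lo, hi, size=X.shape)) for _ in range(n_sim)]
    return 100 * np.mean([s < s_obs for s in s_sim])

A percentile rank of at least 95 corresponds to rejecting the null hypothesis of no clustering at the 0.05 level.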

Supervised Evaluation Measures

• Entropy $= -\sum_{i=1}^{c} p_i \log_2 p_i$

• Purity $= \max_i \, [\, p_i \,]$

• Any classification metric (precision, recall, etc.)

Example: 150 data points are divided into two clusters. Cluster 1 contains 50, 3, and 0 points from classes 1, 2, and 3, respectively; Cluster 2 contains 0, 47, and 50. (By convention, $0 \log_2 0 = 0$.)

$$\text{Entropy} = -\sum_{i=1}^{c} p_i \log_2 p_i$$

$$\text{Entropy(Cluster 1)} = -[\tfrac{50}{53}\log_2(\tfrac{50}{53}) + \tfrac{3}{53}\log_2(\tfrac{3}{53}) + 0\log_2(0)] \approx 0.314$$

$$\text{Entropy(Cluster 2)} = -[0\log_2(0) + \tfrac{47}{97}\log_2(\tfrac{47}{97}) + \tfrac{50}{97}\log_2(\tfrac{50}{97})] \approx 0.999$$

$$\text{Weighted Entropy} = \tfrac{53}{150}(0.314) + \tfrac{97}{150}(0.999) \approx 0.757$$

$$\text{Purity} = \max_i \, [\, p_i \,]$$

$$\text{Purity(Cluster 1)} = \tfrac{50}{53} \approx 0.943 \qquad \text{Purity(Cluster 2)} = \tfrac{50}{97} \approx 0.515$$

$$\text{Weighted Purity} = \tfrac{53}{150} \cdot \tfrac{50}{53} + \tfrac{97}{150} \cdot \tfrac{50}{97} = \tfrac{2}{3} \approx 0.667$$
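A NumPy check of these calculations (the count matrix restates the example above; rows are clusters, columns are classes):

import numpy as np

counts = np.array([[50,  3,  0],    # Cluster 1
                   [ 0, 47, 50]])   # Cluster 2

p = counts / counts.sum(axis=1, keepdims=True)          # class proportions within clusters
terms = np.where(p > 0, p * np.log2(np.where(p > 0, p, 1)), 0.0)  # 0*log2(0) treated as 0
entropy = -terms.sum(axis=1)                            # [0.314, 0.999]
purity = p.max(axis=1)                                  # [0.943, 0.515]

w = counts.sum(axis=1) / counts.sum()                   # cluster weights 53/150, 97/150
print(entropy @ w)   # weighted entropy, about 0.757
print(purity @ w)    # weighted purity, about 0.667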

Today's Topics

• Chi-squared Test for Cluster Evaluation

• DBSCAN

Chi-square Test for Independence

              Engineering   Science and Tech   Business   Other   Totals
In State           16              14              13       13       56
Out of State       14               6              10        8       38
Totals             30              20              23       21       94

How can we test independence of these two variables?

Chi-square Test for Independence

          Column 1   Column 2   Column 3   ...   Column c   Totals
Row 1       O11        O12        O13      ...     O1c        R1
Row 2       O21        O22        O23      ...     O2c        R2
...         ...        ...        ...      ...     ...        ...
Row r       Or1        Or2        Or3      ...     Orc        Rr
Totals      C1         C2         C3       ...     Cc         N

Let $p_i = \Pr(\text{Row } i)$, $p_j = \Pr(\text{Column } j)$, and $p_{ij} = \Pr(\text{Row } i \text{ and Column } j)$.

$H_0$: Rows and columns are independent, i.e., $p_{ij} = p_i p_j$ for all $i, j$.

The row and column probabilities are estimated by

$$\hat{p}_i = \frac{R_i}{N}, \qquad \hat{p}_j = \frac{C_j}{N}$$

Under $H_0$, the expected cell counts are

$$E_{ij} = N \hat{p}_{ij} = N \hat{p}_i \hat{p}_j = N \cdot \frac{R_i}{N} \cdot \frac{C_j}{N} = \frac{R_i C_j}{N}$$

Chi-square Test for Independence

Define

$$E_{ij} = \frac{R_i C_j}{N}$$

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

Under $H_0$, $\chi^2$ has an approximate chi-square distribution with $(r - 1)(c - 1)$ degrees of freedom (assuming $E_{ij} \ge 5$ for all $i, j$).

Chi-square Test for Independence

Observed      Engineering   Science and Tech   Business   Other   Totals
In State           16              14              13       13       56
Out of State       14               6              10        8       38
Totals             30              20              23       21       94

$$E_{11} = \frac{R_1 C_1}{N} = \frac{56 \cdot 30}{94} \approx 17.87$$

Expected      Engineering   Science and Tech   Business   Other   Totals
In State         17.87           11.91           13.70     12.51     56
Out of State     12.13            8.09            9.30      8.49     38
Totals             30              20              23       21       94

$$\chi^2 = \frac{(16 - 17.87)^2}{17.87} + \frac{(14 - 11.91)^2}{11.91} + \cdots + \frac{(8 - 8.49)^2}{8.49} \approx 1.52$$

$p$-value $\approx 0.68 \Rightarrow$ Do not reject the null hypothesis that rows and columns are independent.
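The same test via SciPy (an assumed dependency, not part of the notes); chi2_contingency computes the expected counts, the statistic, and the p-value in one call:

import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[16, 14, 13, 13],    # In State
                     [14,  6, 10,  8]])   # Out of State

chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p, dof)   # about 1.52, about 0.68, 3 degrees of freedom
print(expected)       # matches the expected-count table above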

DBSCAN

• Clustering algorithm

• Density-Based Spatial Clustering of Applications with Noise

DBSCAN: Parameters and Types of Points

• Requires two parameters:
  • Eps (must be chosen)
  • MinPts (default value = 5)

• Three types of points:
  • Core points: those with at least MinPts neighbors within their Eps neighborhood
  • Border points: not core points, but within the Eps neighborhood of a core point
  • Noise points: neither core points nor border points

DBSCAN: Parameters and Types of Points

Example with Eps = 0.2 and MinPts = 5:

• Core point: its Eps neighborhood contains at least 5 points.

• Border point: its Eps neighborhood contains fewer than 5 points, but contains a core point.

• Noise point: its Eps neighborhood contains fewer than 5 points and contains no core points.

[Figures: a core point, a border point, and a noise point highlighted on a scatter plot.]

DBSCAN Algorithm

• Identify all core points, border points, and noise points.

• Two core points within Eps of each other are assigned to the same cluster.

• Each border point is assigned to one of the clusters of its associated core points.

• Noise points are not assigned to clusters. They are simply classified as noise.
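scikit-learn's DBSCAN (an assumed dependency) implements this algorithm; note that its min_samples count includes the point itself. Applied to the simulated X from earlier:

from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                                  # noise points are labeled -1
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())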

Today's Topics

• Agglomerative Hierarchical Clustering

Hierarchical Clustering

[Figures: the taxonomy of living organisms and a dendrogram, two examples of hierarchical structure.]

Agglomerative Hierarchical Clustering

[Figure sequence: starting with each point in its own cluster, the two closest clusters are merged at each step until a single cluster remains.]

Distances Between Clusters

Single Link:

$$d(C_1, C_2) = \min\{ d(x, y) : x \in C_1,\ y \in C_2 \}$$

Complete Link:

$$d(C_1, C_2) = \max\{ d(x, y) : x \in C_1,\ y \in C_2 \}$$

Average:

$$d(C_1, C_2) = \frac{1}{|C_1| \, |C_2|} \sum_{x \in C_1} \sum_{y \in C_2} d(x, y)$$

Agglomerative Hierarchical Clustering

[Figure: dendrogram with merge heights 1.0, 1.4, 3.0, 3.6, 5.6, 8.1, 13.0, 20.3.]
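A sketch with SciPy (an assumed dependency); the method argument selects among the linkage definitions above, and the merge heights appear on the dendrogram's vertical axis:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

Z = linkage(X, method="average")   # also "single" or "complete"
dendrogram(Z)                      # each row of Z records one merge and its height
plt.ylabel("Height")
plt.show()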

Today's Topics

• Gaussian Mixture EM Clustering

Setting for Gaussian Mixture EM Clustering

• p.m.f. for $Y$ (the prior distribution for $Y$):

$$P(Y = y) = f(y), \quad \text{for } y = 1, 2, \ldots, c$$

• Joint conditional distribution of the $X_j$'s given $Y$:

$$P(X_1 = x_1, \ldots, X_p = x_p \mid Y = y) = f(x_1, \ldots, x_p \mid y)$$

• Assume the conditional distribution of $X$ given $Y = y$ is $N(\mu_y, \Sigma_y)$:

$$f(x \mid y) = (2\pi)^{-p/2} \, |\Sigma_y|^{-1/2} \exp\{ -\tfrac{1}{2} (x - \mu_y)' \Sigma_y^{-1} (x - \mu_y) \}$$

Setting for Gaussian Mixture EM Clustering

• Prior distribution for $Y$: $P(Y = y) = f(y)$, for $y = 1, 2, \ldots, c$

• Posterior distribution for $Y$:

$$P(Y = y \mid X = x) = \frac{f(y) \, f(x \mid y)}{\sum_{y'=1}^{c} f(y') \, f(x \mid y')}$$
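A sketch of this posterior computation with SciPy (illustrative; the arguments f, mus, and Sigmas, holding the prior probabilities, mean vectors, and covariance matrices, are assumed inputs):

import numpy as np
from scipy.stats import multivariate_normal

def posterior(x, f, mus, Sigmas):
    """P(Y = y | X = x) for each cluster y, via Bayes' rule."""
    num = np.array([f[y] * multivariate_normal.pdf(x, mean=mus[y], cov=Sigmas[y])
                    for y in range(len(f))])
    return num / num.sum()   # normalize by the sum over all clusters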

Parameters for the model:

$$\theta = (\, f(y),\ \mu_y,\ \Sigma_y \,), \quad y = 1, \ldots, c$$

Here $Y = (Y_1, \ldots, Y_n)'$ is the vector of (unobserved) cluster labels, with each $Y_i \in \{1, \ldots, c\}$, and $X$ is the $n \times p$ data matrix with entries $X_{ij}$ and rows $X_i$.

The complete-data likelihood is

$$L(\theta; Y, X) = \prod_{i=1}^{n} f(Y_i) \, f(X_i \mid Y_i)$$

Want to maximize this. Problem: we don't know the $Y_i$'s.

Expectation Maximization (EM) Algorithm

• E Step:

$$Q(\theta \mid \theta^{(t)}) = E_{Y \mid X, \theta^{(t)}} [\, \log L(\theta; Y, X) \,]$$

• M Step:

$$\theta^{(t+1)} = \arg\max_{\theta} \, Q(\theta \mid \theta^{(t)})$$

The expectation in the E step is computed using the posterior distribution of the labels under the current parameter estimate:

$$P(Y_i = y \mid X_i = x_i, \theta^{(t)}) = \frac{f(y) \, f(x_i \mid y)}{\sum_{y'=1}^{c} f(y') \, f(x_i \mid y')}$$
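In practice the EM fit can be done with scikit-learn's GaussianMixture (an assumed dependency, not part of the notes), again using the simulated X:

from sklearn.mixture import GaussianMixture

gm = GaussianMixture(n_components=3, covariance_type="full").fit(X)
posteriors = gm.predict_proba(X)   # P(Y = y | X = x) for each point and cluster
labels = gm.predict(X)             # assign each point to its most probable cluster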

Further Reading

• Dempster, A.P., Laird, N.M., and Rubin, D.B. (1977). "Maximum Likelihood from Incomplete Data via the EM Algorithm". Journal of the Royal Statistical Society, Series B, 39(1), 1–38.

• Ledolter, J. (2013). Data Mining and Business Analytics with R. Wiley.