
Math 5364 Notes
Chapter 8: Cluster Analysis

Jesse Crawford

Department of Mathematics
Tarleton State University

Today's Topics

• Overview of Cluster Analysis

• K-means clustering

What is Cluster Analysis?

• Dividing objects into clusters
• Distances within clusters are small
• Distances between clusters are large

• Training data has no class labels!

• Cluster analysis is also called unsupervised classification

Cluster Centers

• Cluster centers: prototypes, centroids, medoids

Purposes of Cluster Analysis

• Understanding
  • Biology: Divide organisms into different classes (kingdom, phylum, class, etc.)
  • Business: Divide customers into clusters for marketing purposes
  • Weather: Identify patterns in atmosphere and ocean

Purposes of Cluster Analysis

• Utility
  • Replace data points with cluster centers for summarization/compression

K-Means Clustering

K-Means Algorithm

• Select K initial centroids
• Repeat the following:
  • Form K clusters (assign each point to the closest centroid)
  • Recompute the centroid of each cluster
• Stop when the centroids converge

Forming the clusters requires a distance metric (example: Euclidean distance). How the centroid is recomputed depends on that metric (example: centroid = mean for Euclidean distance).
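A minimal NumPy sketch of the algorithm above (illustrative only, not code from the notes; it assumes Euclidean distance and does not handle empty clusters):

import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Basic K-means with Euclidean distance."""
    rng = np.random.default_rng(seed)
    # Select K initial centroids at random from the data points
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # Form K clusters: assign each point to the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute the centroid (mean) of each cluster
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Stop when the centroids converge
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids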

[Figures: K-means on a two-dimensional data set (x and y axes), showing the cluster assignments and centroids at Iterations 1 through 6.]

Sums of Squares for K-Means

Notation

• $K$ = number of clusters
• $N_k$ = number of points in the $k$th cluster
• $x_{ki}$ = $i$th point in the $k$th cluster
• $\bar{x}_k$ = mean of the $k$th cluster
• $\bar{x}$ = mean of all points

Total SS

$$\mathrm{SST} = \sum_{k=1}^{K} \sum_{i=1}^{N_k} \| x_{ki} - \bar{x} \|^2$$

Within SS

$$\mathrm{SSW} = \sum_{k=1}^{K} \sum_{i=1}^{N_k} \| x_{ki} - \bar{x}_k \|^2$$

Between SS

$$\mathrm{SSB} = \sum_{k=1}^{K} N_k \| \bar{x}_k - \bar{x} \|^2$$

Total SS = Within SS + Between SS:

$$\begin{aligned}
\mathrm{SST} &= \sum_{k=1}^{K} \sum_{i=1}^{N_k} \| x_{ki} - \bar{x} \|^2 \\
&= \sum_{k=1}^{K} \sum_{i=1}^{N_k} \| (x_{ki} - \bar{x}_k) + (\bar{x}_k - \bar{x}) \|^2 \\
&= \sum_{k=1}^{K} \sum_{i=1}^{N_k} \| x_{ki} - \bar{x}_k \|^2
 + 2 \sum_{k=1}^{K} \sum_{i=1}^{N_k} (x_{ki} - \bar{x}_k)'(\bar{x}_k - \bar{x})
 + \sum_{k=1}^{K} \sum_{i=1}^{N_k} \| \bar{x}_k - \bar{x} \|^2 \\
&= \mathrm{SSW} + 0 + \sum_{k=1}^{K} N_k \| \bar{x}_k - \bar{x} \|^2
 = \mathrm{SSW} + \mathrm{SSB},
\end{aligned}$$

where the cross term vanishes because $\sum_{i=1}^{N_k} (x_{ki} - \bar{x}_k) = 0$ for each $k$.

Goal of K-means: Minimize Within SS

Equivalent goal: Maximize Between SS
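The decomposition is easy to check numerically. An illustrative sketch, reusing the kmeans function above on simulated data (the data set is assumed for illustration):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 2))
K = 3
labels, centroids = kmeans(X, K)

xbar = X.mean(axis=0)                                   # mean of all points
means = np.array([X[labels == k].mean(axis=0) for k in range(K)])  # cluster means
SST = ((X - xbar) ** 2).sum()
SSW = sum(((X[labels == k] - means[k]) ** 2).sum() for k in range(K))
SSB = sum((labels == k).sum() * ((means[k] - xbar) ** 2).sum() for k in range(K))
print(SST, SSW + SSB)   # the two agree up to floating-point rounding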

A Problem with K-Means

• Different initial centroids can result in different clusterings.

• Some choices of initial centroids may lead only to a local minimum.

• Possible solution: Repeat with randomly chosen initial centroids.

• Let m = number of repetitions. If the K true clusters are equally sized, the probability that at least one repetition picks one initial centroid from each cluster is approximately $1 - \left(1 - \frac{K!}{K^K}\right)^m$.
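A sketch of the repeated-initialization fix, reusing the kmeans function above and keeping the run with the smallest Within SS:

import numpy as np

def kmeans_restarts(X, K, m=10):
    """Run K-means m times from random starts; keep the clustering with smallest SSW."""
    best_ssw, best = np.inf, None
    for seed in range(m):
        labels, centroids = kmeans(X, K, seed=seed)
        ssw = sum(((X[labels == k] - centroids[k]) ** 2).sum() for k in range(K))
        if ssw < best_ssw:
            best_ssw, best = ssw, (labels, centroids)
    return best_ssw, best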

Today's Topics

• Cluster Evaluation

• Unsupervised Evaluation Measures
  • SSW
  • Silhouette Coefficient

• Supervised Evaluation Measures
  • Entropy
  • Purity

• Significance Tests

Unsupervised Evaluation Measures

• Do not use class labels

• SSW = Within Sum of Squares

• Silhouette Coefficient

Interpreting SSW

• SSW → 0 as K → ∞

• SSW = 0 when K = number of points in the data set

• Solution: Look for an "elbow" in the plot of SSW vs. K

• Optimal value in the example plot: K = 3
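A sketch of the elbow plot, reusing the kmeans_restarts function and the simulated X from earlier (matplotlib is an assumed dependency):

import matplotlib.pyplot as plt

Ks = range(1, 11)
ssws = [kmeans_restarts(X, K)[0] for K in Ks]   # SSW for each candidate K
plt.plot(Ks, ssws, marker="o")
plt.xlabel("K")
plt.ylabel("SSW")
plt.show()   # look for the K where the curve bends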

Silhouette Coefficient

1. For the ith data object, calculate its average distance to all other objects in its cluster. Call this value $a_i$.

2. For the ith data object and any cluster not containing that object, calculate the object's average distance to all the objects in the given cluster.

3. The minimum value from Step 2 is called $b_i$.

4. For the ith object, the silhouette coefficient is

$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$$

Silhouette Coefficient

• $-1 \le s_i \le 1$

• The silhouette coefficient for the clustering is the average of the individual silhouette coefficients:

$$s = \frac{1}{n} \sum_{i=1}^{n} s_i$$

• Silhouette coefficients near 1 indicate strong clustering.
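In practice these can be computed with scikit-learn (an assumed dependency, not part of the notes), using the X and labels from the earlier sketches:

from sklearn.metrics import silhouette_samples, silhouette_score

s_i = silhouette_samples(X, labels)   # one silhouette coefficient per object
s = silhouette_score(X, labels)       # their average: the coefficient for the clustering
print(s_i.min(), s_i.max(), s)        # each s_i lies in [-1, 1]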

Distance Matrix for a Data Set

• Suppose $X$ is an $n \times p$ data matrix

• $X_i$ = $i$th row of $X$

• The distance matrix $D$ is $n \times n$, with $D_{ij} = \| X_i - X_j \|$

Statistical Significance of the Silhouette Coefficient

• Generate 100 uniform data sets with the same data range and sample size as the original data.

• Calculate the silhouette coefficient for each uniform sample.

• Find the percentile rank of the silhouette coefficient for the original data among the randomly generated ones.

• If the percentile rank is at least 95% (a test at the 0.05 significance level), there is statistically significant evidence of clustering (we can reject the null hypothesis of no clustering).
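A sketch of this simulation, assuming scikit-learn's KMeans and silhouette_score for the clustering and scoring steps:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_percentile(X, K, n_sim=100, seed=0):
    """Percentile rank of the data's silhouette coefficient among uniform data sets."""
    rng = np.random.default_rng(seed)
    score = lambda Z: silhouette_score(Z, KMeans(n_clusters=K, n_init=10).fit_predict(Z))
    s_obs = score(X)
    lo, hi = X.min(axis=0), X.max(axis=0)   # same data range and sample size as X
    s_sim = [score(rng.uniform(lo, hi, size=X.shape)) for _ in range(n_sim)]
    return 100 * np.mean([s < s_obs for s in s_sim])

A percentile rank of at least 95 corresponds to rejecting the null hypothesis of no clustering at the 0.05 level.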

Supervised Evaluation Measures

• Entropy $= -\sum_{i=1}^{c} p_i \log_2 p_i$

• Purity $= \max_i \, [\, p_i \,]$

• Any classification metric (precision, recall, etc.)

Example: 150 data points are divided into two clusters. Cluster 1 contains 50, 3, and 0 points from classes 1, 2, and 3, respectively; Cluster 2 contains 0, 47, and 50. (By convention, $0 \log_2 0 = 0$.)

$$\text{Entropy} = -\sum_{i=1}^{c} p_i \log_2 p_i$$

$$\text{Entropy(Cluster 1)} = -[\tfrac{50}{53}\log_2(\tfrac{50}{53}) + \tfrac{3}{53}\log_2(\tfrac{3}{53}) + 0\log_2(0)] \approx 0.314$$

$$\text{Entropy(Cluster 2)} = -[0\log_2(0) + \tfrac{47}{97}\log_2(\tfrac{47}{97}) + \tfrac{50}{97}\log_2(\tfrac{50}{97})] \approx 0.999$$

$$\text{Weighted Entropy} = \tfrac{53}{150}(0.314) + \tfrac{97}{150}(0.999) \approx 0.757$$

$$\text{Purity} = \max_i \, [\, p_i \,]$$

$$\text{Purity(Cluster 1)} = \tfrac{50}{53} \approx 0.943 \qquad \text{Purity(Cluster 2)} = \tfrac{50}{97} \approx 0.515$$

$$\text{Weighted Purity} = \tfrac{53}{150} \cdot \tfrac{50}{53} + \tfrac{97}{150} \cdot \tfrac{50}{97} = \tfrac{2}{3} \approx 0.667$$
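A NumPy check of these calculations (the count matrix restates the example above; rows are clusters, columns are classes):

import numpy as np

counts = np.array([[50,  3,  0],    # Cluster 1
                   [ 0, 47, 50]])   # Cluster 2

p = counts / counts.sum(axis=1, keepdims=True)          # class proportions within clusters
terms = np.where(p > 0, p * np.log2(np.where(p > 0, p, 1)), 0.0)  # 0*log2(0) treated as 0
entropy = -terms.sum(axis=1)                            # [0.314, 0.999]
purity = p.max(axis=1)                                  # [0.943, 0.515]

w = counts.sum(axis=1) / counts.sum()                   # cluster weights 53/150, 97/150
print(entropy @ w)   # weighted entropy, about 0.757
print(purity @ w)    # weighted purity, about 0.667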

Today's Topics

• Chi-squared Test for Cluster Evaluation

• DBSCAN

Chi-square Test for Independence

              Engineering   Science and Tech   Business   Other   Totals
In State           16              14              13       13       56
Out of State       14               6              10        8       38
Totals             30              20              23       21       94

How can we test independence of these two variables?

Chi-square Test for Independence

          Column 1   Column 2   Column 3   ...   Column c   Totals
Row 1       O11        O12        O13      ...     O1c        R1
Row 2       O21        O22        O23      ...     O2c        R2
...         ...        ...        ...      ...     ...        ...
Row r       Or1        Or2        Or3      ...     Orc        Rr
Totals      C1         C2         C3       ...     Cc         N

Let $p_i = \Pr(\text{Row } i)$, $p_j = \Pr(\text{Column } j)$, and $p_{ij} = \Pr(\text{Row } i \text{ and Column } j)$.

$H_0$: Rows and columns are independent, i.e., $p_{ij} = p_i p_j$ for all $i, j$.

The row and column probabilities are estimated by

$$\hat{p}_i = \frac{R_i}{N}, \qquad \hat{p}_j = \frac{C_j}{N}$$

Under $H_0$, the expected cell counts are

$$E_{ij} = N \hat{p}_{ij} = N \hat{p}_i \hat{p}_j = N \cdot \frac{R_i}{N} \cdot \frac{C_j}{N} = \frac{R_i C_j}{N}$$

Chi-square Test for Independence

Define

$$E_{ij} = \frac{R_i C_j}{N}$$

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

Under $H_0$, $\chi^2$ has an approximate chi-square distribution with $(r - 1)(c - 1)$ degrees of freedom (assuming $E_{ij} \ge 5$ for all $i, j$).

Chi-square Test for Independence

Observed      Engineering   Science and Tech   Business   Other   Totals
In State           16              14              13       13       56
Out of State       14               6              10        8       38
Totals             30              20              23       21       94

$$E_{11} = \frac{R_1 C_1}{N} = \frac{56 \cdot 30}{94} \approx 17.87$$

Expected      Engineering   Science and Tech   Business   Other   Totals
In State         17.87           11.91           13.70     12.51     56
Out of State     12.13            8.09            9.30      8.49     38
Totals             30              20              23       21       94

$$\chi^2 = \frac{(16 - 17.87)^2}{17.87} + \frac{(14 - 11.91)^2}{11.91} + \cdots + \frac{(8 - 8.49)^2}{8.49} \approx 1.52$$

$p$-value $\approx 0.68 \Rightarrow$ Do not reject the null hypothesis that rows and columns are independent.
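The same test via SciPy (an assumed dependency, not part of the notes); chi2_contingency computes the expected counts, the statistic, and the p-value in one call:

import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[16, 14, 13, 13],    # In State
                     [14,  6, 10,  8]])   # Out of State

chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p, dof)   # about 1.52, about 0.68, 3 degrees of freedom
print(expected)       # matches the expected-count table above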

DBSCAN

• Clustering algorithm

• Density-Based Spatial Clustering of Applications with Noise

DBSCAN: Parameters and Types of Points

• Requires two parameters:
  • Eps (must be chosen)
  • MinPts (default value = 5)

• Three types of points:
  • Core points: those with at least MinPts neighbors within their Eps neighborhood
  • Border points: not core points, but within the Eps neighborhood of a core point
  • Noise points: neither core points nor border points

DBSCAN: Parameters and Types of Points

Example with Eps = 0.2 and MinPts = 5:

• Core point: its Eps neighborhood contains at least 5 points.

• Border point: its Eps neighborhood contains fewer than 5 points, but contains a core point.

• Noise point: its Eps neighborhood contains fewer than 5 points and contains no core points.

[Figures: a core point, a border point, and a noise point highlighted on a scatter plot.]

DBSCAN Algorithm

• Identify all core points, border points, and noise points.

• Two core points within Eps of each other are assigned to the same cluster.

• Each border point is assigned to one of the clusters of its associated core points.

• Noise points are not assigned to clusters. They are simply classified as noise.
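scikit-learn's DBSCAN (an assumed dependency) implements this algorithm; note that its min_samples count includes the point itself. Applied to the simulated X from earlier:

from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                                  # noise points are labeled -1
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())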

Today's Topics

• Agglomerative Hierarchical Clustering

Hierarchical Clustering

[Figures: the taxonomy of living organisms and a dendrogram, two examples of hierarchical structure.]

Agglomerative Hierarchical Clustering

[Figure sequence: starting with each point in its own cluster, the two closest clusters are merged at each step until a single cluster remains.]

Distances Between Clusters

Single Link:

$$d(C_1, C_2) = \min\{ d(x, y) : x \in C_1,\ y \in C_2 \}$$

Complete Link:

$$d(C_1, C_2) = \max\{ d(x, y) : x \in C_1,\ y \in C_2 \}$$

Average:

$$d(C_1, C_2) = \frac{1}{|C_1| \, |C_2|} \sum_{x \in C_1} \sum_{y \in C_2} d(x, y)$$

Agglomerative Hierarchical Clustering

[Figure: dendrogram with merge heights 1.0, 1.4, 3.0, 3.6, 5.6, 8.1, 13.0, 20.3.]
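A sketch with SciPy (an assumed dependency); the method argument selects among the linkage definitions above, and the merge heights appear on the dendrogram's vertical axis:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

Z = linkage(X, method="average")   # also "single" or "complete"
dendrogram(Z)                      # each row of Z records one merge and its height
plt.ylabel("Height")
plt.show()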

Today's Topics

• Gaussian Mixture EM Clustering

Setting for Gaussian Mixture EM Clustering

• p.m.f. for $Y$ (the prior distribution for $Y$):

$$P(Y = y) = f(y), \quad \text{for } y = 1, 2, \ldots, c$$

• Joint conditional distribution of the $X_j$'s given $Y$:

$$P(X_1 = x_1, \ldots, X_p = x_p \mid Y = y) = f(x_1, \ldots, x_p \mid y)$$

• Assume the conditional distribution of $X$ given $Y = y$ is $N(\mu_y, \Sigma_y)$:

$$f(x \mid y) = (2\pi)^{-p/2} \, |\Sigma_y|^{-1/2} \exp\{ -\tfrac{1}{2} (x - \mu_y)' \Sigma_y^{-1} (x - \mu_y) \}$$

Setting for Gaussian Mixture EM Clustering

• Prior distribution for $Y$: $P(Y = y) = f(y)$, for $y = 1, 2, \ldots, c$

• Posterior distribution for $Y$:

$$P(Y = y \mid X = x) = \frac{f(y) \, f(x \mid y)}{\sum_{y'=1}^{c} f(y') \, f(x \mid y')}$$
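A sketch of this posterior computation with SciPy (illustrative; the arguments f, mus, and Sigmas, holding the prior probabilities, mean vectors, and covariance matrices, are assumed inputs):

import numpy as np
from scipy.stats import multivariate_normal

def posterior(x, f, mus, Sigmas):
    """P(Y = y | X = x) for each cluster y, via Bayes' rule."""
    num = np.array([f[y] * multivariate_normal.pdf(x, mean=mus[y], cov=Sigmas[y])
                    for y in range(len(f))])
    return num / num.sum()   # normalize by the sum over all clusters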

Parameters for the model:

$$\theta = (\, f(y),\ \mu_y,\ \Sigma_y \,), \quad y = 1, \ldots, c$$

Here $Y = (Y_1, \ldots, Y_n)'$ is the vector of (unobserved) cluster labels, with each $Y_i \in \{1, \ldots, c\}$, and $X$ is the $n \times p$ data matrix with entries $X_{ij}$ and rows $X_i$.

The complete-data likelihood is

$$L(\theta; Y, X) = \prod_{i=1}^{n} f(Y_i) \, f(X_i \mid Y_i)$$

Want to maximize this. Problem: we don't know the $Y_i$'s.

Expectation Maximization (EM) Algorithm

• E Step:

$$Q(\theta \mid \theta^{(t)}) = E_{Y \mid X, \theta^{(t)}} [\, \log L(\theta; Y, X) \,]$$

• M Step:

$$\theta^{(t+1)} = \arg\max_{\theta} \, Q(\theta \mid \theta^{(t)})$$

The expectation in the E step is computed using the posterior distribution of the labels under the current parameter estimate:

$$P(Y_i = y \mid X_i = x_i, \theta^{(t)}) = \frac{f(y) \, f(x_i \mid y)}{\sum_{y'=1}^{c} f(y') \, f(x_i \mid y')}$$
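In practice the EM fit can be done with scikit-learn's GaussianMixture (an assumed dependency, not part of the notes), again using the simulated X:

from sklearn.mixture import GaussianMixture

gm = GaussianMixture(n_components=3, covariance_type="full").fit(X)
posteriors = gm.predict_proba(X)   # P(Y = y | X = x) for each point and cluster
labels = gm.predict(X)             # assign each point to its most probable cluster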

Further Reading

• Dempster, A.P., Laird, N.M., and Rubin, D.B. (1977). "Maximum Likelihood from Incomplete Data via the EM Algorithm". Journal of the Royal Statistical Society, Series B, 39(1), 1–38.

• Ledolter, J. (2013). Data Mining and Business Analytics with R. Wiley.