Math 5364 Notes
Chapter 8: Cluster Analysis
Jesse Crawford
Department of Mathematics, Tarleton State University
Today's Topics
• Overview of Cluster Analysis
• K-means clustering
What is Cluster Analysis?
• Dividing objects into clusters
• Distances within clusters are small
• Distances between clusters are large
• Training data has no class labels!
• Cluster analysis is also called unsupervised classification
Cluster Centers
• Cluster centers: prototypes, centroids, medoids
Purposes of Cluster Analysis
• Understanding
  • Biology: Divide organisms into different classes (kingdom, phylum, class, etc.)
• Business: Divide customers into clusters for marketing purposes
• Weather: Identify patterns in atmosphere and ocean
Purposes of Cluster Analysis
• Utility
  • Replace data points with cluster centers for summarization/compression
K-Means Clustering
K-Means Algorithm
• Select K initial centroids
• Repeat the following:
  • Form K clusters (assign each point to the closest centroid)
  • Recompute the centroid of each cluster
• Stop when centroids converge

Requires a distance metric (Example: Euclidean distance).
The centroid computation depends on the metric (Example: centroid = mean for Euclidean distance).
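The steps above can be sketched directly in NumPy. This is a minimal illustration, not code from the course; the random initialization and the `np.allclose` convergence test are simple choices among many.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Basic K-means with Euclidean distance (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Select K initial centroids (random data points)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Form K clusters: assign each point to the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute the centroid (mean) of each cluster; keep the old
        # centroid if a cluster ends up empty
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        # Stop when centroids converge
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids
```

With a different distance metric, both the assignment step and the centroid update would change, as the note above points out.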
[Figure: K-means iterations 1 through 6 on a two-dimensional data set (axes x and y); the centroids move at each iteration until convergence.]
Sums of Squares for K-Means

Notation
• $K$ = number of clusters
• $N_k$ = number of points in the $k$th cluster
• $x_{ki}$ = $i$th point in the $k$th cluster
• $\bar{x}_k$ = mean of the $k$th cluster
• $\bar{x}$ = mean of all points

Total SS: $SST = \sum_{k=1}^{K} \sum_{i=1}^{N_k} \|x_{ki} - \bar{x}\|^2$

Within SS: $SSW = \sum_{k=1}^{K} \sum_{i=1}^{N_k} \|x_{ki} - \bar{x}_k\|^2$

Between SS: $SSB = \sum_{k=1}^{K} N_k \|\bar{x}_k - \bar{x}\|^2$

Total SS = Within SS + Between SS:
$$
\begin{aligned}
SST &= \sum_{k=1}^{K} \sum_{i=1}^{N_k} \|x_{ki} - \bar{x}\|^2 \\
    &= \sum_{k=1}^{K} \sum_{i=1}^{N_k} \|(x_{ki} - \bar{x}_k) + (\bar{x}_k - \bar{x})\|^2 \\
    &= \sum_{k=1}^{K} \sum_{i=1}^{N_k} \|x_{ki} - \bar{x}_k\|^2
       + 2 \sum_{k=1}^{K} \sum_{i=1}^{N_k} (x_{ki} - \bar{x}_k)'(\bar{x}_k - \bar{x})
       + \sum_{k=1}^{K} \sum_{i=1}^{N_k} \|\bar{x}_k - \bar{x}\|^2 \\
    &= SSW + 0 + SSB,
\end{aligned}
$$
since $\sum_{i=1}^{N_k} (x_{ki} - \bar{x}_k) = 0$ for each $k$, and the last term equals $\sum_{k=1}^{K} N_k \|\bar{x}_k - \bar{x}\|^2 = SSB$.

Goal of K-means: Minimize Within SS
Equivalent goal: Maximize Between SS
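The decomposition Total SS = Within SS + Between SS is easy to verify numerically. The sketch below is my own illustration; the identity holds for any grouping of the points, not just a K-means solution.

```python
import numpy as np

# Numerical check that Total SS = Within SS + Between SS.
X = np.random.default_rng(0).normal(size=(30, 2))
labels = np.arange(30) % 3                     # K = 3 clusters, 10 points each
xbar = X.mean(axis=0)                          # mean of all points

sst = ((X - xbar) ** 2).sum()                  # Total SS
ssw = sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
          for k in range(3))                   # Within SS
ssb = sum((labels == k).sum() * ((X[labels == k].mean(axis=0) - xbar) ** 2).sum()
          for k in range(3))                   # Between SS

assert np.isclose(sst, ssw + ssb)
```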
A Problem with K-Means
• Different initial centroids can result in different clusterings
• Some choices of initial centroids may lead only to a local minimum of the Within SS.
• Possible solution: Repeat with randomly chosen initial centroids.
• Let m = number of repetitions. If the data contain K equally sized clusters, the probability that a single run selects one initial centroid from each cluster is $K!/K^K$, so the probability that at least one of the m runs does is $1 - (1 - K!/K^K)^m \to 1$ as $m \to \infty$.
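This back-of-the-envelope probability is easy to compute. The sketch below assumes the idealized setting of K equally sized, well-separated clusters (the function name is mine):

```python
from math import factorial

def p_good_init(K, m):
    """Probability that at least one of m random initializations selects
    exactly one centroid from each of K equally sized true clusters."""
    p_one_run = factorial(K) / K**K        # chance a single run starts well
    return 1 - (1 - p_one_run)**m

# For K = 10 a single run almost never starts with one centroid per cluster,
# but enough repetitions make a good start very likely.
single = p_good_init(10, 1)
many = p_good_init(10, 5000)
```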
Today's Topics
• Cluster Evaluation
• Unsupervised Evaluation Measures
  • SSW
  • Silhouette Coefficient
• Supervised Evaluation Measures
  • Entropy
  • Purity
• Significance Tests
Unsupervised Evaluation Measures
• Do not use class labels
• SSW = Within Sum of Squares
• Silhouette Coefficient
Interpreting SSW
• SSW → 0 as K → ∞
• SSW = 0 when K = number of points in the data set
• Solution: Look for an "elbow" in the plot of SSW vs. K
• Optimal value: K = 3 in the example plot
Silhouette Coefficient
1. For the ith data object, calculate its average distance to all other objects in its cluster. Call this value ai
2. For the ith data object and any cluster not containing that object, calculate the object's average distance to all the objects in the given cluster.
3. The minimum value from Step 2 is called bi
4. For the ith object, the silhouette coefficient is
$s_i = (b_i - a_i) / \max(a_i, b_i)$
Silhouette Coefficient
• $-1 \le s_i \le 1$
• The silhouette coefficient for the clustering is the average of the silhouette coefficients: $s = \frac{1}{n} \sum_{i=1}^{n} s_i$
• Silhouette coefficients near 1 indicate strong clustering.
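The four steps above translate directly into code. This is a from-scratch sketch of the definition (Euclidean distance assumed; every cluster must contain at least two points):

```python
import numpy as np

def silhouette(X, labels):
    """Silhouette coefficient s_i for every data object, per the definition."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    n = len(X)
    s = np.empty(n)
    for i in range(n):
        own = labels == labels[i]
        # a_i: average distance to the other objects in i's own cluster
        a = D[i, own & (np.arange(n) != i)].mean()
        # b_i: smallest average distance to the objects of any other cluster
        b = min(D[i, labels == k].mean() for k in set(labels) if k != labels[i])
        s[i] = (b - a) / max(a, b)
    return s
```

For two tight, well-separated clusters the coefficients come out close to 1, matching the interpretation above.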
Distance Matrix for a Data Set
• Suppose $X$ is $n \times p$
• $X_i$ = $i$th row of $X$
• Distance matrix $D$ is $n \times n$, with $D_{ij} = \|X_i - X_j\|_2$
Statistical Significance of the Silhouette Coefficient
• Generate 100 uniform data sets with the same data range and sample size as the original data.
• Calculate the silhouette coefficient for each uniform sample.
• Find the percentile rank of the silhouette coefficient for the original data among the randomly generated ones.
• If the percentile rank is sufficiently high, there is statistically significant evidence of clustering (we can reject the null hypothesis of no clustering).
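The whole procedure can be sketched end to end in NumPy. Here the clustering step is a bare-bones 2-means with a deterministic initialization and the silhouette is computed from its definition; both are my simplifications, not anything specified in the notes.

```python
import numpy as np

def mean_silhouette(X, labels):
    # average silhouette coefficient (Euclidean distance); assumes every
    # cluster has at least two points
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    n = len(X)
    s = []
    for i in range(n):
        own = labels == labels[i]
        a = D[i, own & (np.arange(n) != i)].mean()
        b = min(D[i, labels == k].mean() for k in set(labels) if k != labels[i])
        s.append((b - a) / max(a, b))
    return float(np.mean(s))

def two_means(X, n_iter=50):
    # bare-bones 2-means; deterministic init at the extremes of the first axis
    c = np.array([X[X[:, 0].argmin()], X[X[:, 0].argmax()]])
    for _ in range(n_iter):
        lab = np.linalg.norm(X[:, None] - c[None], axis=2).argmin(axis=1)
        for j in (0, 1):
            if np.any(lab == j):
                c[j] = X[lab == j].mean(axis=0)
    return lab

rng = np.random.default_rng(0)
# "original" data: two well-separated Gaussian blobs
X = np.vstack([rng.normal(0, 0.2, (25, 2)), rng.normal(3, 0.2, (25, 2))])
s_obs = mean_silhouette(X, two_means(X))

# 100 uniform data sets with the same range and sample size as the original
lo, hi = X.min(axis=0), X.max(axis=0)
s_null = [mean_silhouette(U, two_means(U))
          for U in (rng.uniform(lo, hi, X.shape) for _ in range(100))]

# percentile rank of the original silhouette among the uniform ones
rank = np.mean(np.array(s_null) < s_obs)
```

Genuinely clustered data should outrank nearly all of the uniform samples.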
Supervised Evaluation Measures
• Entropy $= -\sum_{i=1}^{c} p_i \log_2 p_i$
• Purity $= \max_i [p_i]$
• Any classification metric (precision, recall, etc.)
Example: two clusters with class counts (50, 3, 0) and (0, 47, 50).

Entropy $= -\sum_{i=1}^{c} p_i \log_2 p_i$

$\text{Entropy(Cluster 1)} = -[\tfrac{50}{53}\log_2(\tfrac{50}{53}) + \tfrac{3}{53}\log_2(\tfrac{3}{53}) + 0 \log_2 0] \approx 0.314$
$\text{Entropy(Cluster 2)} = -[0 \log_2 0 + \tfrac{47}{97}\log_2(\tfrac{47}{97}) + \tfrac{50}{97}\log_2(\tfrac{50}{97})] \approx 0.999$
$\text{Weighted Entropy} = \tfrac{53}{150}(0.314) + \tfrac{97}{150}(0.999) \approx 0.757$

Purity $= \max_i [p_i]$

$\text{Purity(Cluster 1)} = \tfrac{50}{53} \approx 0.943$
$\text{Purity(Cluster 2)} = \tfrac{50}{97} \approx 0.515$
$\text{Weighted Purity} = \tfrac{53}{150}(\tfrac{50}{53}) + \tfrac{97}{150}(\tfrac{50}{97}) = \tfrac{100}{150} \approx 0.667$
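These numbers are quick to reproduce. A short sketch (the class counts are taken from the example above; 0 log₂ 0 is treated as 0 by the usual convention):

```python
import numpy as np

# Class counts per cluster: Cluster 1 holds 50, 3, 0 objects of the three
# classes (53 points); Cluster 2 holds 0, 47, 50 (97 points).
counts = np.array([[50, 3, 0],
                   [0, 47, 50]], dtype=float)

def entropy(row):
    p = row / row.sum()
    p = p[p > 0]                          # 0 * log2(0) is taken as 0
    return -(p * np.log2(p)).sum()

sizes = counts.sum(axis=1)                       # 53, 97
weights = sizes / sizes.sum()
ent = np.array([entropy(r) for r in counts])     # approx. 0.314, 0.999
purity = counts.max(axis=1) / sizes              # approx. 0.943, 0.515

weighted_entropy = (weights * ent).sum()         # approx. 0.757
weighted_purity = (weights * purity).sum()       # = 100/150, approx. 0.667
```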
Today's Topics
• Chi-squared Test for Cluster Evaluation
• DBSCAN
Chi-square Test for Independence
                Engineering   Science and Tech   Business   Other   Totals
In State            16              14              13        13       56
Out of State        14               6              10         8       38
Totals              30              20              23        21       94

How can we test independence of these two variables?
Chi-square Test for Independence
Column 1 Column 2 Column 3 … Column c Totals
Row 1 O11 O12 O13 … O1c R1
Row 2 O21 O22 O23 … O2c R2
… … … … … … …
Row r Or1 Or2 Or3 … Orc Rr
Totals C1 C2 C3 … Cc N
$p_i = \Pr(\text{Row } i)$, $p_j = \Pr(\text{Column } j)$
$H_0$: Rows/columns independent
$H_0$: $p_{ij} = p_i p_j$
Chi-square Test for Independence
Estimate the marginal probabilities from the table: $\hat{p}_i = R_i / N$, $\hat{p}_j = C_j / N$
Chi-square Test for Independence
Under $H_0$: $E_{ij} = N p_i p_j \approx N \hat{p}_i \hat{p}_j = N \frac{R_i}{N} \frac{C_j}{N} = \frac{R_i C_j}{N}$
Chi-square Test for Independence
Define $E_{ij} = \frac{R_i C_j}{N}$ and
$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
Under $H_0$, $\chi^2$ has an approximate chi-square distribution with $(r-1)(c-1)$ degrees of freedom (assuming $E_{ij} \ge 5$ for all $i$, $j$).
Chi-square Test for Independence

Observed        Engineering   Science and Tech   Business   Other   Totals
In State            16              14              13        13       56
Out of State        14               6              10         8       38
Totals              30              20              23        21       94

$E_{11} = \frac{R_1 C_1}{N} = \frac{(56)(30)}{94} \approx 17.87$

Expected        Engineering   Science and Tech   Business   Other   Totals
In State          17.87           11.91            13.70     12.51     56
Out of State      12.13            8.09             9.30      8.49     38
Totals              30              20                23        21      94
Chi-square Test for Independence
$$\chi^2 = \frac{(16 - 17.87)^2}{17.87} + \frac{(14 - 11.91)^2}{11.91} + \cdots + \frac{(8 - 8.49)^2}{8.49} \approx 1.52$$
$p$-value $\approx 0.68$. Do not reject the null hypothesis that rows and columns are independent.
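The worked example is easy to reproduce from the formulas above. This sketch builds the expected table with an outer product of the margins and assumes SciPy is available for the chi-square tail probability:

```python
import numpy as np
from scipy.stats import chi2 as chi2_dist   # assumes SciPy is installed

# Observed counts from the table (rows: In State / Out of State)
O = np.array([[16, 14, 13, 13],
              [14,  6, 10,  8]], dtype=float)
R = O.sum(axis=1, keepdims=True)            # row totals: 56, 38
C = O.sum(axis=0, keepdims=True)            # column totals: 30, 20, 23, 21
N = O.sum()                                 # 94

E = R @ C / N                               # E_ij = R_i C_j / N  (E_11 = 17.87)
stat = ((O - E) ** 2 / E).sum()             # chi-square statistic, about 1.52
dof = (O.shape[0] - 1) * (O.shape[1] - 1)   # (r - 1)(c - 1) = 3
p = chi2_dist.sf(stat, dof)                 # p-value, about 0.68
```

The same result comes from `scipy.stats.chi2_contingency(O)` in one call.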
DBSCAN
• Clustering Algorithm
• Density-Based Spatial Clustering of Applications with Noise
DBSCAN: Parameters and Types of Points
• Requires two parameters:
  • Eps (must be chosen)
  • MinPts (default value = 5)
• Three types of points:
  • Core points: those with at least MinPts neighbors within their Eps neighborhood
  • Border points: not a core point, but within the Eps neighborhood of a core point
  • Noise points: neither a core point nor a border point
DBSCAN: Parameters and Types of Points
• Example: Eps = 0.2, MinPts = 5
• Core point: its Eps neighborhood contains at least MinPts points
• Border point: its Eps neighborhood contains fewer than MinPts points but contains a core point
• Noise point: its Eps neighborhood contains fewer than MinPts points and no core points
DBSCAN Algorithm
• Identify all core points, border points, and noise points.
• Two core points within Eps of each other are assigned to the same cluster.
• Border points are assigned to one of the clusters of its associated core points.
• Noise points are not assigned to clusters. They are simply classified as noise.
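The first step of the algorithm, classifying every point as core, border, or noise, can be sketched with a pairwise distance matrix. One assumption to flag: a point is counted as its own neighbor here, and conventions vary on that detail.

```python
import numpy as np

def classify_points(X, eps=0.2, min_pts=5):
    """Return boolean masks (core, border, noise) per the DBSCAN definitions."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    neighbors = D <= eps                         # a point neighbors itself
    core = neighbors.sum(axis=1) >= min_pts
    # border: not core, but at least one core point lies within Eps
    border = ~core & (neighbors & core[None, :]).any(axis=1)
    noise = ~core & ~border
    return core, border, noise
```

A full DBSCAN would then link core points within Eps of each other into clusters; `sklearn.cluster.DBSCAN` implements the complete algorithm.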
Today's Topics
• Agglomerative Hierarchical Clustering
Hierarchical Clustering
Taxonomy of Living Organisms
Dendrogram
Agglomerative Hierarchical Clustering
[Figure: a sequence of slides showing the closest pair of clusters being merged at each step until one cluster remains.]
Distances Between Clusters
Single Link: $d(C_1, C_2) = \min\{\, d(x, y) : x \in C_1,\ y \in C_2 \,\}$
Complete Link: $d(C_1, C_2) = \max\{\, d(x, y) : x \in C_1,\ y \in C_2 \,\}$
Average: $d(C_1, C_2) = \frac{1}{|C_1|\,|C_2|} \sum_{x \in C_1} \sum_{y \in C_2} d(x, y)$
Agglomerative Hierarchical Clustering
Heights = 1.0, 1.4, 3.0, 3.6, 5.6, 8.1, 13.0, 20.3
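The three linkage rules differ only in how the cross-cluster pairwise distances are aggregated, which makes them one line each (a sketch with Euclidean distance; `scipy.cluster.hierarchy.linkage` provides the full agglomerative procedure):

```python
import numpy as np

def _pairwise(C1, C2):
    # all distances d(x, y) with x in C1, y in C2
    return np.linalg.norm(C1[:, None, :] - C2[None, :, :], axis=2)

def single_link(C1, C2):
    return _pairwise(C1, C2).min()      # closest cross-cluster pair

def complete_link(C1, C2):
    return _pairwise(C1, C2).max()      # farthest cross-cluster pair

def average_link(C1, C2):
    return _pairwise(C1, C2).mean()     # average over all cross-cluster pairs
```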
Today's Topics
• Gaussian Mixture EM Clustering
Setting for Gaussian Mixture EM Clustering
1 1 1
/2 1/2 1/212
for 1,2, ,
( , , ) ( , , )
Assume the conditional distribution of given i
( ) ( ),
| | ) (
s ( , )
( | ) (2 ) | exp{ ( ) ( )| }
|p p p
y y
py y y y
y c
P X x X x Y y f x x y
X Y y N
f
P Y y f y
y
x y
f x
x x
p.m.f. for YPrior distribution for Y
Joint conditional distributionof Xj's given Y
Setting for Gaussian Mixture EM Clustering
$$f(x \mid y) = (2\pi)^{-p/2} |\Sigma_y|^{-1/2} \exp\{-\tfrac{1}{2}(x - \mu_y)' \Sigma_y^{-1} (x - \mu_y)\}, \quad y = 1, 2, \ldots, c$$
Prior distribution for $Y$: $P(Y = y) = f(y)$
Posterior distribution for $Y$:
$$P(Y = y \mid X = x) = \frac{f(y) f(x \mid y)}{\sum_{y'=1}^{c} f(y') f(x \mid y')}$$
$$f(x \mid y) = (2\pi)^{-p/2} |\Sigma_y|^{-1/2} \exp\{-\tfrac{1}{2}(x - \mu_y)' \Sigma_y^{-1} (x - \mu_y)\}$$
Parameters for the model: $\theta = (f(y), \mu_y, \Sigma_y)$, $y = 1, \ldots, c$
$Y = (Y_1, \ldots, Y_n)' \in \{1, \ldots, c\}^n$, and $X = (X_{ij})$ is the $n \times p$ data matrix
Complete-data likelihood:
$$L(\theta; Y, X) = \prod_{i=1}^{n} f(Y_i) f(X_i \mid Y_i)$$
Want to maximize this.
Problem: Don't know the $Y$'s.
Expectation Maximization (EM) Algorithm
E Step: $Q(\theta \mid \theta^{(t)}) = E_{Y \mid X, \theta^{(t)}}[\log L(\theta; Y, X)]$
M Step: $\theta^{(t+1)} = \arg\max_\theta Q(\theta \mid \theta^{(t)})$
The expectation in the E step is taken with respect to the posterior distribution
$$P(Y = y \mid X = x) = \frac{f(y) f(x \mid y)}{\sum_{y'=1}^{c} f(y') f(x \mid y')}$$
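For Gaussian mixtures the E step reduces to computing the posterior probabilities ("responsibilities") and the M step to weighted maximum-likelihood updates. Below is a minimal one-dimensional, two-component sketch; the data, initialization, and variable names are my own illustration, not the lecture's example.

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic data: two well-separated 1-D Gaussian components
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(6, 1, 200)])

w = np.array([0.5, 0.5])          # f(y): mixing weights
mu = np.array([x.min(), x.max()]) # crude but effective initialization
var = np.array([1.0, 1.0])

for _ in range(100):
    # E step: posterior P(Y = y | X = x_i), the responsibilities
    dens = (w / np.sqrt(2 * np.pi * var) *
            np.exp(-0.5 * (x[:, None] - mu) ** 2 / var))
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M step: responsibility-weighted maximum-likelihood updates
    nk = resp.sum(axis=0)
    w = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
```

With well-separated components the estimated means land near the true component means, and each row of `resp` sums to 1 by construction.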
Further Reading
• Dempster, A.P., Laird, N.M., and Rubin, D.B. (1977). "Maximum Likelihood from Incomplete Data via the EM Algorithm". Journal of the Royal Statistical Society, Series B, 39 (1): 1–38.
• Ledolter, J. (2013). Data Mining and Business Analytics with R. Hoboken, NJ: Wiley.