Clustering Lecture
8/16/2019 Clustering Lecture
Unsupervised Clustering

Clustering is a very general problem that appears in many different settings (not necessarily in a data mining context):
Grouping "similar products" together to improve the efficiency of a production line
Packing "similar items" into containers
Grouping "similar customers" together
Grouping "similar stocks" together

The Similarity Concept

Obviously, the concept of similarity is key to clustering. Using similarity definitions that are specific to a domain may generate more acceptable clusters. E.g., products that require the same or similar tools/processes in the production line are similar. Articles that are in the course pack of the same course are similar.
General similarity measures are required for general-purpose algorithms.

Clustering: The K-Means Algorithm (Lloyd, 1982)

1. Choose a value for K, the total number of clusters.
2. Randomly choose K points as cluster centers.
3. Assign the remaining instances to their closest cluster center.
4. Calculate a new cluster center for each cluster.
5. Repeat steps 3-5 until the cluster centers do not change.

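The five steps above can be sketched in plain Python. This is a minimal illustration, not the lecture's reference implementation; the helper names and the stopping rule (exact equality of consecutive centers) are my own choices.

```python
# Minimal sketch of Lloyd's K-means, following steps 1-5 above.
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def kmeans(points, k, seed=0):
    rng = random.Random(seed)
    # Step 2: randomly choose K points as initial cluster centers.
    centers = rng.sample(points, k)
    while True:
        # Step 3: assign each instance to its closest cluster center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: euclidean(p, centers[i]))
            clusters[i].append(p)
        # Step 4: recompute each cluster center as the mean of its points.
        new_centers = [
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        # Step 5: stop when the centers no longer change.
        if new_centers == centers:
            return centers, clusters
        centers = new_centers
```

On two well-separated pairs of points, e.g. `kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], 2)`, the centers converge to (0, 0.5) and (10, 10.5) regardless of which initial sample is drawn.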
Distance Measure

The similarity is captured by a distance measure in this algorithm. The originally proposed measure of distance is the Euclidean distance:
d(X, Y) = [ Σ_{i=1}^{n} (x_i − y_i)² ]^(1/2)

where X = (x_1, x_2, ..., x_n) and Y = (y_1, y_2, ..., y_n).
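Transcribed directly into code (the sample vectors are illustrative, not from the lecture):

```python
# Euclidean distance d(X, Y) = sqrt(sum_i (x_i - y_i)^2).
import math

def euclidean(x, y):
    assert len(x) == len(y)
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

print(euclidean((0, 0), (3, 4)))  # → 5.0
```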
An Example Using K-Means

Table 3.6 • K-Means Input Values

Instance   X     Y
1          1.0   1.5
2          1.0   4.5
3          2.0   1.5
4          2.0   3.5
5          3.0   2.5
6          5.0   6.0

[Figure: scatter plot of the six input instances, f(x) vs. x, labeled 1-6]

Table 3.7 • Several Applications of the K-Means Algorithm (K = 2)

Outcome   Cluster Centers   Cluster Points   Squared Error
1         (2.67, 4.67)      2, 4, 6          14.50
          (2.00, 1.83)      1, 3, 5
2         (1.5, 1.5)        1, 3             15.94
          (2.75, 4.125)     2, 4, 5, 6
3         (1.8, 2.7)        1, 2, 3, 4, 5    9.60
          (5, 6)            6

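The squared-error column of Table 3.7 can be reproduced directly from the Table 3.6 instances. Here is outcome 3, where cluster center (1.8, 2.7) holds instances 1-5 and center (5, 6) holds instance 6:

```python
# Reproducing the squared error reported for outcome 3 in Table 3.7.
points = [(1.0, 1.5), (1.0, 4.5), (2.0, 1.5), (2.0, 3.5), (3.0, 2.5), (5.0, 6.0)]
assignment = {(1.8, 2.7): points[:5], (5.0, 6.0): [points[5]]}

sse = sum(
    (px - cx) ** 2 + (py - cy) ** 2
    for (cx, cy), members in assignment.items()
    for (px, py) in members
)
print(round(sse, 2))  # → 9.6
```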
[Figure: scatter plot of the clustered instances, f(x) vs. x]

General Considerations

Works best when the clusters in the data are of approximately equal size.
Attribute significance cannot be determined.
Lacks explanation capabilities.
Requires real-valued data. Categorical data can be converted into real values, but the distance function needs to be worked out carefully.
We must select the number of clusters present in the data.
Data normalization may be required if attribute ranges vary significantly.
Alternative distance measures may generate different clusters.

K-means Clustering

Partitional clustering approach.
Each cluster is associated with a centroid (center point).
Each point is assigned to the cluster with the closest centroid.

K-means Clustering – Details

Initial centroids are often chosen randomly. The clusters produced vary from one run to another.
The centroid is (typically) the mean of the points in the cluster.
"Closeness" is measured by Euclidean distance, cosine similarity, correlation, etc.
K-means will converge for the common similarity measures mentioned above. Most of the convergence happens in the first few iterations, so the stopping condition is often changed to "until relatively few points change clusters".
Complexity is O(n · K · I · d), where n = number of points, K = number of clusters, I = number of iterations, d = number of attributes.

Two Different K-means Clusterings

[Figure: the same original points clustered two ways; panels: Original Points, Sub-optimal Clustering, Optimal Clustering]

Importance of Choosing Initial Centroids

[Figure: snapshots of K-means at Iterations 1 through 6]

Importance of Choosing Initial Centroids

[Figure: the same run shown as one panel per iteration, Iterations 1 through 6]

Evaluating K-means Clusters

The most common measure is the Sum of Squared Error (SSE).
For each point, the error is the distance to the nearest cluster center. To get the SSE, we square these errors and sum them.
x is a data point in cluster C_i, and m_i is the representative point for cluster C_i.
One can show that m_i corresponds to the center (mean) of the cluster.
Given two clusterings, we can choose the one with the smallest error.
SSE = Σ_{i=1}^{K} Σ_{x ∈ C_i} dist²(m_i, x)

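The claim that m_i is the cluster mean can be checked numerically: for any cluster, the mean yields a smaller SSE contribution than any other representative point. A quick illustration (the cluster data is mine, not from the lecture):

```python
# SSE contribution of one cluster: sum of squared distances to its
# representative point. The mean (1.0, 1.0) should minimize it.
cluster = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]

def sse(center, points):
    return sum(sum((c - x) ** 2 for c, x in zip(center, p)) for p in points)

mean = tuple(sum(c) / len(cluster) for c in zip(*cluster))  # (1.0, 1.0)
assert sse(mean, cluster) < sse((0.5, 0.5), cluster)  # any other point does worse
```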
Importance of Choosing Initial Centroids …

[Figure: snapshots of K-means at Iterations 1 through 5]

Importance of Choosing Initial Centroids …

[Figure: the same run shown as one panel per iteration, Iterations 1 through 5]

Problems with Selecting Initial Points

If there are K "real" clusters, then the chance of selecting one centroid from each cluster is small. The chance is especially small when K is large.
If the clusters are all of the same size, n, the probability of selecting one centroid from each cluster is K!/K^K.
For example, if K = 10, then the probability = 10!/10^10 = 0.00036.
Sometimes the initial centroids will readjust themselves in the "right" way, and sometimes they don't.
Consider an example of five pairs of clusters.
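The K = 10 figure quoted above is easy to reproduce:

```python
# With 10 equal-size clusters, the probability of drawing one initial
# centroid from each cluster is 10!/10^10.
import math

p = math.factorial(10) / 10 ** 10
print(p)  # → 0.00036288
```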
Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree.
Can be visualized as a dendrogram: a tree-like diagram that records the sequences of merges or splits.
[Figure: nested clusters and the corresponding dendrogram]

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters: any desired number of clusters can be obtained by "cutting" the dendrogram at the proper level.
They may correspond to meaningful taxonomies, e.g. in the biological sciences (animal kingdom, phylogeny reconstruction, …).

Hierarchical Clustering

Two main types of hierarchical clustering:
Agglomerative: start with the points as individual clusters; at each step, merge the closest pair of clusters until only one cluster (or k clusters) is left.
Divisive: start with one, all-inclusive cluster; at each step, split a cluster until each cluster contains a single point (or there are k clusters).
Traditional hierarchical algorithms use a similarity or distance matrix, and merge or split one cluster at a time.

Agglomerative Clustering Algorithm

The more popular hierarchical clustering technique. The basic algorithm is straightforward:
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains

The key operation is the computation of the proximity of two clusters. Different approaches to defining the distance between clusters distinguish the different algorithms.
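The six steps above can be sketched compactly. This sketch uses single link (MIN) as the cluster proximity and a small illustrative distance matrix; rather than maintaining and updating a stored proximity matrix (step 5), it recomputes cluster proximities from the point-level matrix, which is simpler but slower.

```python
# Minimal sketch of agglomerative clustering with single-link proximity.
def agglomerate(dist):
    """dist: {(i, j): distance} for point indices i < j."""
    n = 1 + max(j for _, j in dist)
    # Step 2: let each data point be a cluster.
    clusters = [frozenset([i]) for i in range(n)]
    merges = []

    def proximity(a, b):
        # Single link: distance of the closest cross-cluster pair.
        return min(dist[tuple(sorted((i, j)))] for i in a for j in b)

    # Steps 3-6: repeat until only a single cluster remains.
    while len(clusters) > 1:
        # Step 4: merge the two closest clusters.
        a, b = min(
            ((x, y) for x in clusters for y in clusters if x != y),
            key=lambda pair: proximity(*pair),
        )
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a | b)
        merges.append((set(a), set(b)))
    return merges

dist = {(0, 1): 1.0, (0, 2): 5.0, (1, 2): 4.0,
        (0, 3): 9.0, (1, 3): 8.0, (2, 3): 2.0}
merges = agglomerate(dist)  # merges {0} with {1} first, then {2} with {3}
```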
Starting Situation

Start with clusters of individual points and a proximity matrix.

[Figure: individual points p1-p5 and the initial proximity matrix]

Intermediate Situation

After some merging steps, we have some clusters.

[Figure: clusters C1-C5 and the current proximity matrix]

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.

[Figure: clusters C1-C5 with C2 and C5 about to be merged, and the proximity matrix]

After Merging

The question is: "How do we update the proximity matrix?"

[Figure: clusters C1, C3, C4, and the merged cluster C2 ∪ C5, with the affected rows and columns of the proximity matrix marked]

How to Define Inter-Cluster Similarity

MIN
MAX
Group Average
Distance Between Centroids
Other methods driven by an objective function (Ward's Method uses squared error)

[Figure: two clusters of points and the proximity matrix entry that defines their similarity]

Cluster Similarity: MIN or Single Link

The similarity of two clusters is based on the two most similar (closest) points in the different clusters. It is determined by one pair of points, i.e., by one link in the proximity graph.
     I1   I2   I3   I4   I5
I1  1.00 0.90 0.10 0.65 0.20
I2  0.90 1.00 0.70 0.60 0.50
I3  0.10 0.70 1.00 0.40 0.30
I4  0.65 0.60 0.40 1.00 0.80
I5  0.20 0.50 0.30 0.80 1.00
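On the similarity table above, single link between two example clusters {I1, I2} and {I3, I4, I5} picks the LARGEST cross-pair value, since the table holds similarities rather than distances (the cluster split itself is my illustration):

```python
# Single link (MIN) on the slide's similarity table.
sim = {
    ("I1", "I3"): 0.10, ("I1", "I4"): 0.65, ("I1", "I5"): 0.20,
    ("I2", "I3"): 0.70, ("I2", "I4"): 0.60, ("I2", "I5"): 0.50,
}
single_link = max(sim[(a, b)] for a in ("I1", "I2") for b in ("I3", "I4", "I5"))
print(single_link)  # → 0.7
```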
Hierarchical Clustering: MIN

[Figure: nested clusters and the dendrogram produced by single link]

Strength of MIN

[Figure: Original Points / Two Clusters]
Can handle non-elliptical shapes.

Limitations of MIN

[Figure: Original Points / Two Clusters]
Sensitive to noise and outliers.

Cluster Similarity: MAX or Complete Linkage

The similarity of two clusters is based on the two least similar (most distant) points in the different clusters. It is determined by all pairs of points in the two clusters.

     I1   I2   I3   I4   I5
I1  1.00 0.90 0.10 0.65 0.20
I2  0.90 1.00 0.70 0.60 0.50
I3  0.10 0.70 1.00 0.40 0.30
I4  0.65 0.60 0.40 1.00 0.80
I5  0.20 0.50 0.30 0.80 1.00
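On the same similarity table, complete link between the example clusters {I1, I2} and {I3, I4, I5} picks the SMALLEST cross-pair similarity (the least similar pair):

```python
# Complete link (MAX) on the slide's similarity table.
sim = {
    ("I1", "I3"): 0.10, ("I1", "I4"): 0.65, ("I1", "I5"): 0.20,
    ("I2", "I3"): 0.70, ("I2", "I4"): 0.60, ("I2", "I5"): 0.50,
}
complete_link = min(sim[(a, b)] for a in ("I1", "I2") for b in ("I3", "I4", "I5"))
print(complete_link)  # → 0.1
```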
Hierarchical Clustering: MAX

[Figure: nested clusters and the dendrogram produced by complete link]

Strength of MAX

[Figure: Original Points / Two Clusters]
Less susceptible to noise and outliers.

Limitations of MAX

[Figure: Original Points / Two Clusters]
Tends to break large clusters.
Biased towards globular clusters.

Cluster Similarity: Group Average

The proximity of two clusters is the average of the pairwise proximities between points in the two clusters.
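Using the similarity table from the MIN/MAX slides and the same example split {I1, I2} vs. {I3, I4, I5}, group average simply averages all six cross-pair similarities:

```python
# Group average on the slide's similarity table.
sim = {
    ("I1", "I3"): 0.10, ("I1", "I4"): 0.65, ("I1", "I5"): 0.20,
    ("I2", "I3"): 0.70, ("I2", "I4"): 0.60, ("I2", "I5"): 0.50,
}
group_avg = sum(sim.values()) / len(sim)
print(round(group_avg, 4))  # → 0.4583
```

Note how the result falls between the single-link value (0.70) and the complete-link value (0.10), reflecting its role as a compromise between the two.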
Hierarchical Clustering: Group Average

[Figure: nested clusters and the dendrogram produced by group average]

Hierarchical Clustering: Group Average

A compromise between Single and Complete Link.
Strengths: less susceptible to noise and outliers.
Limitations: biased towards globular clusters.

Cluster Similarity: Ward's Method

The similarity of two clusters is based on the increase in squared error when the two clusters are merged. Similar to group average if the distance between points is the distance squared.
Less susceptible to noise and outliers.
Biased towards globular clusters.
Hierarchical analogue of K-means: can be used to initialize K-means.
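The "increase in squared error" can be computed directly, or via a standard closed form that is not stated on the slide: for clusters A and B with means mA and mB, merging increases the SSE by |A||B|/(|A|+|B|) · ||mA − mB||². A quick numerical check on two illustrative clusters:

```python
# Ward's merge cost: direct SSE difference vs. the closed-form identity.
A = [(0.0, 0.0), (2.0, 0.0)]
B = [(4.0, 0.0), (6.0, 0.0)]

def mean(pts):
    return tuple(sum(c) / len(pts) for c in zip(*pts))

def sse(pts):
    m = mean(pts)
    return sum(sum((c - x) ** 2 for c, x in zip(m, p)) for p in pts)

increase = sse(A + B) - (sse(A) + sse(B))   # direct: 20 - (2 + 2) = 16
mA, mB = mean(A), mean(B)
formula = (len(A) * len(B) / (len(A) + len(B))) * sum(
    (a - b) ** 2 for a, b in zip(mA, mB)
)                                            # (2*2/4) * 16 = 16
assert increase == formula == 16.0
```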
Hierarchical Clustering: Comparison

[Figure: the same data set clustered with MIN, MAX, Group Average, and Ward's Method, shown as nested clusters]

Hierarchical Clustering: Time and Space Requirements

O(N²) space, since it uses the proximity matrix (N is the number of points).
O(N³) time in many cases: there are N steps, and at each step a proximity matrix of size N² must be searched and updated. This can be reduced to O(N² log N) time for some approaches.

Hierarchical Clustering: Problems and Limitations

Once a decision is made to combine two clusters, it cannot be undone.