Clustering
Panagiotis Papapetrou, PhD
Associate Professor, Stockholm University
Adjunct Professor, Aalto University
TODAY…
Nov 4 Introduction to data mining
Nov 5 Association Rules
Nov 10, 14 Clustering and Data Representation
Nov 17 Exercise session 1 (Homework 1 due)
Nov 19 Classification
Nov 24, 26 Similarity Matching and Model Evaluation
Dec 1 Exercise session 2 (Homework 2 due)
Dec 3 Combining Models
Dec 8, 10 Time Series Analysis
Dec 15 Exercise session 3 (Homework 3 due)
Dec 17 Ranking
Jan 13 Review
Jan 14 EXAM
Feb 23 Re-EXAM
What is clustering?
• A grouping of data objects such that the objects within a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups
• Inter-cluster distances are maximized; intra-cluster distances are minimized
Outliers
• Outliers are objects that do not belong to any cluster or form clusters of very small cardinality
• In some applications we are interested in discovering outliers, not clusters (outlier analysis)

[Figure: a cluster and a few outliers]
Applications of clustering?
• Image Processing
– cluster images based on their visual content
• Web
– Cluster groups of users based on their access patterns on webpages
– Cluster webpages based on their content
• Bioinformatics
– Cluster similar proteins together (similarity w.r.t. chemical structure and/or functionality, etc.)
• Many more…
The clustering task
• Group observations into groups so that observations belonging to the same group are similar, whereas observations in different groups are different
• Basic questions:
– What does "similar" mean?
– What is a good partition of the objects? I.e., how is the quality of a solution measured?
– How to find a good partition of the observations?
Observations to cluster
• Usually data objects consist of a set of attributes (also known as dimensions)
• J. Smith, 20, 200K
• If all d dimensions are real-valued then we can visualize the data as points in a d-dimensional space
• If all d dimensions are binary then we can think of each data point as a binary vector
Distance functions for binary vectors

    Q1 Q2 Q3 Q4 Q5 Q6
X    1  0  0  1  1  0
Y    0  1  1  0  1  0

• Jaccard similarity between binary vectors X and Y:

    JSim(X, Y) = |X ∩ Y| / |X ∪ Y|

• Jaccard distance between binary vectors X and Y:

    Jdist(X, Y) = 1 − JSim(X, Y)

• Example (vectors above): |X ∩ Y| = 1 (only Q5) and |X ∪ Y| = 5, so JSim = 1/5 and Jdist = 4/5
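The Jaccard computations can be sketched in Python (a minimal illustration; the function names are mine, not from the lecture):

```python
def jaccard_sim(x, y):
    """Jaccard similarity of two equal-length binary vectors."""
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)   # |X ∩ Y|
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)  # |X ∪ Y|
    return both / either

def jaccard_dist(x, y):
    """Jaccard distance: 1 - Jaccard similarity."""
    return 1 - jaccard_sim(x, y)

# the X and Y from the slide
X = [1, 0, 0, 1, 1, 0]
Y = [0, 1, 1, 0, 1, 0]
print(jaccard_sim(X, Y))   # 0.2  (1/5)
print(jaccard_dist(X, Y))  # 0.8  (4/5)
```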
Distance functions for real-valued vectors
• Lp norms or Minkowski distance:

    Lp(x, y) = ( Σ_{i=1}^{d} |x_i − y_i|^p )^(1/p)

where p is a positive integer
Distance functions for real-valued vectors
• If p = 1, L1 is the Manhattan (or city block) distance:

    L1(x, y) = Σ_{i=1}^{d} |x_i − y_i|

• If p = 2, L2 is the Euclidean distance:

    L2(x, y) = ( |x_1 − y_1|² + |x_2 − y_2|² + … + |x_d − y_d|² )^(1/2)
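All of these distances can be sketched with one Python helper (a minimal illustration):

```python
def minkowski(x, y, p):
    """Lp (Minkowski) distance between two real-valued d-dimensional vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x, y = (0.0, 0.0), (3.0, 4.0)
print(minkowski(x, y, 1))  # p = 1, Manhattan: 3 + 4 = 7.0
print(minkowski(x, y, 2))  # p = 2, Euclidean: sqrt(9 + 16) = 5.0
```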
Data Structures
• Data matrix (n tuples/objects × d attributes/dimensions):

    [ x_11  …  x_1ℓ  …  x_1d ]
    [  …    …   …    …   …   ]
    [ x_i1  …  x_iℓ  …  x_id ]
    [  …    …   …    …   …   ]
    [ x_n1  …  x_nℓ  …  x_nd ]

• Distance matrix (n objects × n objects; symmetric, so only the lower triangle is stored):

    [   0                          ]
    [ d(2,1)    0                  ]
    [ d(3,1)  d(3,2)    0          ]
    [   …       …       …          ]
    [ d(n,1)  d(n,2)    …      0   ]
Partitioning algorithms
• Basic concept:
– Construct a partition of a set of n objects into a set of k clusters
– Each object belongs to exactly one cluster
– The number of clusters k is given in advance
The k-means problem
• Given a set X of n points in a d-dimensional space and an integer k
• Task: choose a set of k points {c1, c2, …, ck} in the d-dimensional space to form clusters C = {C1, C2, …, Ck} such that

    Cost(C) = Σ_{i=1}^{k} Σ_{x∈Ci} L2²(x, ci) = Σ_{i=1}^{k} Σ_{x∈Ci} (x − ci)²

is minimized
• Some special cases: k = 1, k = n
Algorithmic properties of the k-means problem
• NP-hard if the dimensionality of the data is at least 2 (d ≥ 2)
• Finding the best solution in polynomial time is infeasible
• For d = 1 the problem is solvable in polynomial time
• A simple iterative algorithm works quite well in practice
The k-means algorithm
• Randomly pick k cluster centers {c1, …, ck}
• For each i, set the cluster Ci to be the set of points in X that are closer to ci than they are to cj for all j ≠ i
• For each i, let ci be the center of cluster Ci (mean of the vectors in Ci)
• Repeat until convergence
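The steps above can be sketched in Python (a minimal illustration of this iterative algorithm; keeping a stale center when a cluster becomes empty is just one possible choice):

```python
import random

def kmeans(X, k, iters=100, seed=0):
    """Plain iterative k-means on a list of d-dimensional points (tuples)."""
    rnd = random.Random(seed)
    centers = rnd.sample(X, k)                 # randomly pick k cluster centers
    for _ in range(iters):
        # assignment step: each point joins the cluster of its closest center
        clusters = [[] for _ in range(k)]
        for x in X:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centers[j])))
            clusters[i].append(x)
        # update step: each center becomes the mean of the vectors in its cluster
        new_centers = [
            tuple(sum(coords) / len(c) for coords in zip(*c)) if c else centers[j]
            for j, c in enumerate(clusters)
        ]
        if new_centers == centers:             # convergence: nothing moved
            break
        centers = new_centers
    return centers, clusters

X = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, clusters = kmeans(X, 2)
print(sorted(centers))
```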
[Figures: K-means Clustering, Steps 1–4. Algorithm: k-means; Distance Metric: Euclidean Distance. Points on a 0–5 × 0–5 grid with three centers k1, k2, k3; at each step points are reassigned to their closest center and the centers are recomputed, until the clusters no longer change.]
Properties of the k-means algorithm
• Finds a local optimum
• Often converges quickly (but not always)
• It is guaranteed to converge after at most k^N iterations (N is the number of objects)
• The choice of initial points can have a large influence on the result
Discussion of the k-means algorithm

[Figures: four points x1–x4 with K = 2 and a bad initial choice of centers c1 and c2. K-means converges immediately, but to a bad result, which gets even worse as the horizontal lines between the points stretch.]
K-means++
• Choose one center uniformly at random from X
• For each data point x, compute d(x)
– the distance between x and the nearest center that has already been chosen
• Choose "the farthest" data point as a new center
– using a weighted probability distribution where a point x is chosen with probability proportional to d(x)²
• Repeat Steps 2 and 3 until k centers have been chosen
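The seeding procedure can be sketched in Python (a minimal illustration, assuming the squared-distance weighting d(x)² of the original k-means++; the function name is mine):

```python
import random

def kmeanspp_centers(X, k, seed=0):
    """k-means++ seeding: pick k initial centers from the points X (tuples)."""
    rnd = random.Random(seed)
    centers = [rnd.choice(X)]                  # step 1: one center uniformly at random
    while len(centers) < k:
        # d(x)^2: squared distance to the nearest already-chosen center
        d2 = [min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers)
              for x in X]
        # step 3: sample a point with probability proportional to d(x)^2
        r = rnd.random() * sum(d2)
        acc = 0.0
        for x, w in zip(X, d2):
            acc += w
            if acc >= r:
                centers.append(x)
                break
    return centers

X = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
print(kmeanspp_centers(X, 2))
```

The sampling makes far-away points likely to become the next center, which tends to spread the initial centers across the data.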
K-means
• Limitations?
• What if the space is not known?
• Example: images…

[Figure: the 0–5 × 0–5 grid of points with centers k1, k2, k3]
Clustering images
Clustering images: scientist or not
What are the centroids?
The k-median problem
• Given a set X of n points in a d-dimensional space and an integer k
• Task: choose a set of k points {c1, c2, …, ck} from X and form clusters {C1, C2, …, Ck} such that

    Cost(C) = Σ_{i=1}^{k} Σ_{x∈Ci} L1(x, ci)

is minimized
The k-medoids algorithm
• Or … PAM (Partitioning Around Medoids, 1987)
– Choose randomly k medoids from the original dataset X
– Assign each of the n − k remaining points in X to their closest medoid
– Iteratively replace one of the medoids by one of the non-medoids if it improves the total clustering cost
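A minimal sketch of this swap loop in Python (illustrative only; real PAM implementations compute swap costs incrementally rather than from scratch):

```python
from itertools import product

def clustering_cost(X, medoids, dist):
    """Total cost: each point contributes its distance to its closest medoid."""
    return sum(min(dist(x, m) for m in medoids) for x in X)

def pam(X, k, dist):
    """Minimal PAM: swap a medoid for a non-medoid while the total cost improves."""
    medoids = list(X[:k])                      # crude initialization for the sketch
    improved = True
    while improved:
        improved = False
        for m, x in product(list(medoids), X):
            if x in medoids:
                continue
            candidate = [x if c == m else c for c in medoids]
            if clustering_cost(X, candidate, dist) < clustering_cost(X, medoids, dist):
                medoids = candidate
                improved = True
    return medoids

# 1-D example with an extreme outlier; Manhattan distance
print(pam([1, 2, 3, 4, 100000], 1, lambda a, b: abs(a - b)))  # [3]
```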
Discussion of PAM algorithm
• The algorithm is very similar to the k-means algorithm
• Advantages and disadvantages?
K-means vs. K-medoids
• K-medoids is more flexible:
– Can be used with any distance/similarity measure
– K-means works only with measures that are consistent with the mean (e.g., Euclidean Distance)
• K-medoids is more robust to outliers:
– Example: [1 2 3 4 100000]
– mean = 20002 vs. medoid = 3
• K-medoids is more expensive:
– All pairwise distances are computed at each iteration
CLARA (Clustering Large Applications)
• It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output
• Strength: deals with larger data sets than PAM
• Weakness:
– Efficiency depends on the sample size
– A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased
What is the right number of clusters?
• …or who sets the value of k?
• For n points to be clustered, consider the case where k = n. What is the value of the error function?
• What happens when k = 1?
• Since we want to minimize the error, why don't we always select k = n?
• X-means:
– based on the Bayesian Information Criterion (BIC)
– Detects the optimal number of clusters
Hierarchical Clustering
• Produces a set of nested clusters organized as a hierarchical tree
• Can be visualized as a dendrogram
– A tree-like diagram that records the sequences of merges or splits

[Figure: a dendrogram over six points, with merge heights between 0.05 and 0.2]
Strengths of Hierarchical Clustering
• No assumptions on the number of clusters
– Any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level
• Hierarchical clusterings may correspond to meaningful taxonomies
– Examples on the web (e.g., product catalogs), etc.
Hierarchical Clustering
• Two main types of hierarchical clustering
– Agglomerative:
• Start with the points as individual clusters
• At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
– Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a point (or there are k clusters)
• Traditional hierarchical algorithms use a similarity or distance matrix
– Merge or split one cluster at a time
Complexity of hierarchical clustering
• The distance matrix is used for deciding which clusters to merge/split
• At least quadratic in the number of data points
• Not usable for large datasets
Agglomerative clustering algorithm
• Most popular hierarchical clustering technique
• Basic algorithm:
1. Compute the distance matrix between the input data points
2. Let each data point be a cluster
3. Repeat
4.   Merge the two closest clusters
5.   Update the distance matrix
6. Until only a single cluster remains
• Key operation is the computation of the distance between two clusters
– Different definitions of the distance between clusters lead to different algorithms
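The basic algorithm can be sketched in Python (a deliberately simple, cubic-time illustration; the `linkage` argument is my own device for switching between cluster-distance definitions, e.g. `min` for single-link and `max` for complete-link):

```python
def agglomerative(X, dist, linkage=min):
    """Agglomerative clustering: return the sequence of merges performed."""
    clusters = [[x] for x in X]        # step 2: every point starts as its own cluster
    merges = []
    while len(clusters) > 1:           # steps 3-6: merge until one cluster remains
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # cluster distance under the chosen linkage
                d = linkage(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)] + [merged]
    return merges

# 1-D toy example: 0 and 1 merge first, then the merged pair meets 5
merges = agglomerative([0.0, 1.0, 5.0], lambda a, b: abs(a - b))
print(merges)
```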
Input / Initial setting
• Start with clusters of individual points and a distance/proximity matrix

[Figure: points p1–p5 (among p1, …, p12) and the distance/proximity matrix indexed by p1, p2, p3, p4, p5, …]
Intermediate State
• After some merging steps, we have some clusters

[Figure: clusters C1–C5 and the distance/proximity matrix indexed by C1, …, C5]
Intermediate State
• Merge the two closest clusters (C2 and C5) and update the distance matrix

[Figure: clusters C1–C5, with C2 and C5 marked for merging, and the distance/proximity matrix indexed by C1, …, C5]
After Merging
• "How do we update the distance matrix?"

[Figure: clusters C1, C3, C4 and the merged cluster C2 ∪ C5; the matrix entries involving C2 ∪ C5 are marked with "?"]
Distance between two clusters
• Each cluster is a set of points
• How do we define the distance between two sets of points?
– Lots of alternatives
– Not an easy task
Distance between two clusters
• Single-link distance between clusters Ci and Cj: the minimum distance (or highest similarity) between any object in Ci and any object in Cj
• The distance is defined by the two most similar objects

    D_sl(Ci, Cj) = min{ d(x, y) | x ∈ Ci, y ∈ Cj }
Single-link clustering: example
• Determined by one pair of points, i.e., by one link in the proximity graph, using the similarity matrix below:

         I1    I2    I3    I4    I5
    I1  1.00  0.90  0.10  0.65  0.20
    I2  0.90  1.00  0.70  0.60  0.50
    I3  0.10  0.70  1.00  0.40  0.30
    I4  0.65  0.60  0.40  1.00  0.80
    I5  0.20  0.50  0.30  0.80  1.00
Strengths of single-link clustering

[Figure: original points vs. the two clusters found]

• Can handle non-elliptical shapes
Limitations of single-link clustering

[Figure: original points vs. the two clusters found]

• Sensitive to noise and outliers
• It produces long, elongated clusters
Distance between two clusters
• Complete-link distance between clusters Ci and Cj: the maximum distance (minimum similarity) between any object in Ci and any object in Cj
• The distance is defined by the two most dissimilar objects

    D_cl(Ci, Cj) = max{ d(x, y) | x ∈ Ci, y ∈ Cj }
Complete-link clustering: example
• Distance between clusters is determined by the two most distant points in the different clusters, using the similarity matrix below:

         I1    I2    I3    I4    I5
    I1  1.00  0.90  0.10  0.65  0.20
    I2  0.90  1.00  0.70  0.60  0.50
    I3  0.10  0.70  1.00  0.40  0.30
    I4  0.65  0.60  0.40  1.00  0.80
    I5  0.20  0.50  0.30  0.80  1.00
Strengths of complete-link clustering

[Figure: original points vs. the two clusters found]

• More balanced clusters (with equal diameter)
• Less susceptible to noise
Limitations of complete-link clustering

[Figure: original points vs. the two clusters found]

• Tends to break large clusters
• All clusters tend to have the same diameter – small clusters are merged with larger ones
Distance between two clusters
• Average-link distance between clusters Ci and Cj: the average distance between any object in Ci and any object in Cj

    D_avg(Ci, Cj) = (1 / (|Ci| × |Cj|)) Σ_{x∈Ci, y∈Cj} d(x, y)
Average-link clustering: example
• Proximity of two clusters is the average of the pairwise proximities between points in the two clusters, using the similarity matrix below:

         I1    I2    I3    I4    I5
    I1  1.00  0.90  0.10  0.65  0.20
    I2  0.90  1.00  0.70  0.60  0.50
    I3  0.10  0.70  1.00  0.40  0.30
    I4  0.65  0.60  0.40  1.00  0.80
    I5  0.20  0.50  0.30  0.80  1.00
Average-link clustering: discussion
• Compromise between Single and Complete Link
• Strengths
– Less susceptible to noise and outliers
• Limitations
– Biased towards globular clusters
Distance between two clusters
• Ward's distance between clusters Ci and Cj: the difference between
– the within-cluster sum of squares resulting from merging the two clusters into cluster Cij, and
– the total within-cluster sum of squares for the two clusters separately
• ri: centroid (mean) of Ci
• rj: centroid (mean) of Cj
• rij: centroid (mean) of Cij

    D_w(Ci, Cj) = Σ_{x∈Cij} (x − rij)² − Σ_{x∈Ci} (x − ri)² − Σ_{x∈Cj} (x − rj)²
Ward’s distance for clusters
• Similar to average-‐link
• Less suscep4ble to noise and outliers
• Biased towards globular clusters
• Hierarchical analogue of k-‐means – Can be used to ini4alize k-‐means
59
Hierarchical Clustering: Time and Space requirements
• For a dataset X consisting of n points
• O(n²) space; it requires storing the distance matrix
• O(n³) time in most of the cases
– There are n steps, and at each step the size-n² distance matrix must be updated and searched
– Complexity can be reduced to O(n² log n) time by using appropriate data structures
How good is a clustering?
• Cluster purity
– Each cluster is assigned to the class label that is most frequent in the cluster
– Purity is measured by counting the number of correctly assigned objects and dividing by the total number of objects
• Example: Purity = (5 + 4 + 3) / 17 = 12 / 17
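Purity can be computed as below (a minimal sketch; the class labels are hypothetical, chosen only to reproduce the slide's majority counts of 5, 4 and 3 out of 17 objects):

```python
from collections import Counter

def purity(clusters):
    """clusters: list of lists of class labels, one inner list per cluster."""
    # each cluster contributes the count of its most frequent (majority) label
    correct = sum(Counter(c).most_common(1)[0][1] for c in clusters)
    total = sum(len(c) for c in clusters)
    return correct / total

# hypothetical label assignment with majorities 5, 4 and 3 out of 17 objects
clusters = [
    ['x'] * 5 + ['o'] * 1,               # majority label 'x', count 5
    ['o'] * 4 + ['x'] * 1 + ['d'] * 1,   # majority label 'o', count 4
    ['d'] * 3 + ['x'] * 2,               # majority label 'd', count 3
]
print(purity(clusters))  # (5 + 4 + 3) / 17
```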
How good is purity?
• Cluster purity
– Simple!
– Intuitive
• Trade-off between the number of clusters and purity?
– If each object belongs to its own cluster, then purity is 1.0
– Is that desirable?
• Purity cannot trade off the quality of the clusters against the number of clusters
• Alternatives:
– Normalized Mutual Information
– Rand Index
What if we don't have class labels?
• a(i): average dissimilarity (distance) of i to all other data within the same cluster
• b(i): the lowest average dissimilarity (distance) of i to any other cluster of which i is not a member
• The Silhouette of object i:

    s(i) = (b(i) − a(i)) / max{a(i), b(i)}

• Values between −1 and 1
– if b(i) >> a(i) then s(i) ≈ 1
– if b(i) << a(i) then s(i) ≈ −1
• Total Silhouette:
– average silhouette value of all objects
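A minimal sketch of the silhouette of one object (assumes each cluster has at least two objects and `dist` is any dissimilarity function; names are mine):

```python
def silhouette(i, clusters, dist):
    """Silhouette of object i, given a list of clusters (lists of points)."""
    own = next(c for c in clusters if i in c)
    # a(i): average distance to the other members of i's own cluster
    a = sum(dist(i, x) for x in own if x != i) / (len(own) - 1)
    # b(i): lowest average distance to any cluster i does not belong to
    b = min(sum(dist(i, x) for x in c) / len(c)
            for c in clusters if c is not own)
    return (b - a) / max(a, b)

clusters = [[0.0, 1.0], [10.0, 11.0]]
d = lambda p, q: abs(p - q)
print(silhouette(0.0, clusters, d))  # a = 1, b = 10.5 -> (10.5 - 1) / 10.5
```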
Correct number of clusters
• The silhouette can be used to identify the correct number of clusters
Clustering aggregation
• Many different clusterings for the same dataset!
– Different objective functions
– Different algorithms
– Different numbers of clusters
• Which clustering is the best?
– Aggregation: we do not need to decide, but rather find a reconciliation between the different outputs
The clustering-aggregation problem
• Input
– n objects X = {x1, x2, …, xn}
– m clusterings of the objects {C1, …, Cm}
– clustering: a collection of disjoint sets that cover X
• Output
– a single clustering C that is as close as possible to all input clusterings
• How do we measure the closeness of clusterings?
– disagreement distance
Disagreement distance
• For two partitions C and P, and objects x, y in X, define

    I_C,P(x, y) = 1 if (C(x) = C(y) and P(x) ≠ P(y)) or (C(x) ≠ C(y) and P(x) = P(y))
    I_C,P(x, y) = 0 otherwise

• if I_C,P(x, y) = 1 we say that x, y create a disagreement between clusterings C and P
• The disagreement distance between C and P is:

    D(C, P) = Σ_{(x,y)} I_C,P(x, y)

• Example:

     U   C  P
    x1   1  1
    x2   1  2
    x3   2  1
    x4   3  3
    x5   3  4

    D(C, P) = 3
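The definition can be checked directly in Python, using the example table above:

```python
from itertools import combinations

def disagreement(C, P):
    """Disagreement distance between two clusterings, given as dicts object -> cluster id."""
    D = 0
    for x, y in combinations(C.keys(), 2):
        same_C = C[x] == C[y]
        same_P = P[x] == P[y]
        if same_C != same_P:   # exactly one of the two clusterings separates x and y
            D += 1
    return D

# the example from the slide
C = {'x1': 1, 'x2': 1, 'x3': 2, 'x4': 3, 'x5': 3}
P = {'x1': 1, 'x2': 2, 'x3': 1, 'x4': 3, 'x5': 4}
print(disagreement(C, P))  # 3
```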
Clustering aggregation
• Given m clusterings C1, …, Cm, find C such that the aggregation cost

    D(C) = Σ_{i=1}^{m} D(C, Ci)

is minimized
• Example:

     U   C1  C2  C3 | C
    x1   1   1   1  | 1
    x2   1   2   2  | 2
    x3   2   1   1  | 1
    x4   2   2   2  | 2
    x5   3   3   3  | 3
    x6   3   4   3  | 3
Why clustering aggregation?
• Clustering categorical data

     U   City         Profession  Nationality
    x1   New York     Doctor      U.S.
    x2   New York     Teacher     Canada
    x3   Boston       Doctor      U.S.
    x4   Boston       Teacher     Canada
    x5   Los Angeles  Lawyer      Mexican
    x6   Los Angeles  Actor       Mexican
Why clustering aggregation?
• Detect outliers
– outliers are defined as points for which there is no consensus among the clusterings
• Improve the robustness of clustering algorithms
– different algorithms have different weaknesses
– combining them can produce a better result
• Privacy-preserving clustering
– different companies have data for the same users
– compute an aggregate clustering without sharing the actual data
Complexity of Clustering Aggregation
• The clustering aggregation problem is NP-hard
– the median partition problem [Barthelemy and LeClerc 1995]
• Look for heuristics and approximate solutions with a guarantee of the form

    ALG(I) ≤ c · OPT(I)
A simple 2-approximation algorithm
• The disagreement distance D(C, P) is a metric
– What does that mean? We will revisit this in a few lectures…
• The algorithm BEST:
– select among the input clusterings the clustering C* that minimizes D(C*)
– this gives a 2-approximate solution
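BEST can be sketched in a few lines of Python (a minimal illustration, reusing the disagreement distance and the three input clusterings from the aggregation example; function names are mine):

```python
def disagreement(C, P):
    """Disagreement distance between two clusterings (dicts: object -> cluster id)."""
    keys = list(C)
    return sum(
        1
        for i in range(len(keys))
        for j in range(i + 1, len(keys))
        if (C[keys[i]] == C[keys[j]]) != (P[keys[i]] == P[keys[j]])
    )

def best_aggregation(clusterings):
    """BEST: return the input clustering with the smallest total disagreement to all inputs."""
    return min(clusterings, key=lambda C: sum(disagreement(C, Ci) for Ci in clusterings))

# the three input clusterings from the aggregation example
C1 = {'x1': 1, 'x2': 1, 'x3': 2, 'x4': 2, 'x5': 3, 'x6': 3}
C2 = {'x1': 1, 'x2': 2, 'x3': 1, 'x4': 2, 'x5': 3, 'x6': 4}
C3 = {'x1': 1, 'x2': 2, 'x3': 1, 'x4': 2, 'x5': 3, 'x6': 3}
print(best_aggregation([C1, C2, C3]))  # C3, which matches the aggregate clustering C
```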
TODOs
• Assignment 1:
– Due next week [Nov. 17 at 11:00am]
– Discussion forum on this assignment now available on iLearn2
• Exercise Session 1:
– Nov. 17 at 11:00am