Transcript of Introduction to Machine Learning · Lecture 28: Unsupervised Learning, Clustering
Teacher: Gianni A. Di Caro
Lecture 28: Unsupervised Learning, Clustering
Introduction to Machine Learning, 10-315, Fall '19
Disclaimer: These slides can include material from different sources. I'll be happy to explicitly acknowledge a source if required. Contact me for requests.
Unsupervised learning
Unsupervised learning aims to find hidden structures and association relationships in the data
It works in an unsupervised manner: no labels / responses!
It can be seen as learning a new representation of data, typically a compressed representation
Unsupervised learning: representative (typical) tasks
Dimensionality Reduction: reduce the dimensionality (#descriptors) of the data by identifying latent variables/relationships (PCA!)
Unsupervised learning: representative (typical) tasks
Clustering: based on some measure of similarity/dissimilarity, group data items into n clusters, where n is (usually) not known in advance.
Unsupervised learning: representative (typical) tasks
Generative models: estimate the probability density of the data, i.e., the probability distribution underlying the data, which amounts to identifying (probabilistic) relations and correlations among the data
Typical Workflow of Unsupervised Learning for Clustering tasks
Unsupervised ML algorithm for data clustering
Test phase: given a new input image, assign it to an existing cluster
Clustering problem: Partitional (flat) clustering
Given: A set of N unlabeled examples, {x_1, x_2, ⋯, x_N}, where each x_i ∈ ℝ^D
Measure of similarity / dissimilarity among the inputs
Goal: Group the examples into K partitions based on the given measure of similarity, where each partition should feature inputs that are similar / homogeneous. K can be given or not
Clustering ~ Classification without labels
A good clustering is one that achieves:
High similarity among data within the same cluster
Low similarity among data in different clusters
Clustering problem: Hierarchical clustering
Given: A set of N unlabeled examples, {x_1, x_2, ⋯, x_N}, where each x_i ∈ ℝ^D
Measure of similarity / dissimilarity among the inputs
Goal: Group the examples into a hierarchy of clusters based on the given measure of similarity, where each cluster incrementally features inputs that are similar / homogeneous. No need to fix the number of clusters in advance! No need to seed the partitions
Can be constructed (we’ll see this next time):
Agglomerative (bottom-up)
Divisive (top-down)
Similarity needs to be well defined, otherwise it is subjective
o Clustering is based on the given notion of similarity, no labels are given
o However, without labels, a meaningful measure of similarity can be hard to define
Similarity measure and number of clusters ~ inductive bias, the hypotheses that are set
Similarity can be subjective!
Remove subjectivity by defining similarity in terms of a distance between vectors or correlations between random variables.
Distance metrics
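The "distance between vectors" idea above can be made concrete. A minimal plain-Python sketch of three common metrics (the function names are illustrative, not from the slides):

```python
import math

def euclidean(x, y):
    # L2 distance: square root of the sum of squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # L1 distance: sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_distance(x, y):
    # 1 - cosine similarity: 0 for parallel vectors, 1 for orthogonal ones
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (nx * ny)

print(euclidean((0, 0), (3, 4)))  # 5.0
```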
Correlation measures
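Likewise, correlation between random variables can serve as a similarity measure. A sketch of the Pearson correlation coefficient (an illustrative helper, not from the slides):

```python
import math

def pearson(x, y):
    # Pearson correlation: covariance normalized by the two standard deviations;
    # ranges from -1 (perfect anti-correlation) to +1 (perfect correlation)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(pearson([1, 2, 3], [2, 4, 6]))
```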
Clustering examples
o Clustering groups of Documents / Images / Webpages / Web search
o Image Segmentation (clustering pixels)
o Clustering (people) nodes in (social) networks/graphs
Types of clustering
o Hard / partitional clustering: a data item belongs to one and only one cluster (membership is the indicator function). Partitions/clusters are independent of each other.
o Soft clustering: fractional or probabilistic membership, a data item belongs to different clusters with a different degree of membership (fuzzy clustering) or with a different probability (probabilistic clustering)
o Overlapping clustering: a data item belongs (in the hard sense) to different clusters
o Hierarchical clustering: data is organized in a hierarchical manner, such that the goal is to identify a hierarchy of clusters, represented using a dendrogram. Individual data items are at the leaf nodes of the tree structure.
Clustering algorithms
(next time)
Flat / Partitional clustering: K-means problem
K-means problem:
Given a set of N unlabeled examples, {x_1, x_2, ⋯, x_N}, where each x_i ∈ ℝ^D
Given a target number K of clusters (the size of the data partitioning)
Find the assignment of the N examples to K clusters (sets) C = {C_1, C_2, ⋯, C_K}, such that the sets in C minimize the within-cluster sum of squares (the variance of each set, using the squared L2 / Euclidean distance):

arg min_C Σ_{i=1}^{K} Σ_{x_j ∈ C_i} ‖x_j − μ_i‖² = arg min_C Σ_{i=1}^{K} |C_i| Var(C_i)

where μ = (μ_1, μ_2, ⋯, μ_K) are the K cluster means (cluster prototypes) and Var(C_i) is the variance (the spread of the data about the center) in set C_i
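The within-cluster sum of squares objective above can be sketched in plain Python, with clusters given as lists of points and means as tuples (illustrative helper names, not from the slides):

```python
def wcss(clusters, means):
    # Within-cluster sum of squares: for each cluster, add the squared L2
    # distance from every point to that cluster's mean.
    total = 0.0
    for C, mu in zip(clusters, means):
        for x in C:
            total += sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return total

clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 0.0)]]
means = [(1.0, 0.0), (10.0, 0.0)]
print(wcss(clusters, means))  # 2.0
```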
Flat clustering: K-means problem
K-means problem:
Given a set of N unlabeled examples, {x_1, x_2, ⋯, x_N}, where each x_i ∈ ℝ^D
Given a target number K of clusters (the size of the data partitioning)
Find the assignment of the N examples to K clusters (sets) C = {C_1, C_2, ⋯, C_K}, such that the sets in C minimize the within-cluster sum of squares (the variance of each set, using the squared L2 / Euclidean distance):

arg min_C Σ_{i=1}^{K} Σ_{x_j ∈ C_i} ‖x_j − μ_i‖² = arg min_C Σ_{i=1}^{K} |C_i| Var(C_i) = arg min_C Σ_{i=1}^{K} (1 / (2|C_i|)) Σ_{x,y ∈ C_i} ‖x − y‖²

Sum of intra-cluster variances weighted by the number of elements in each cluster
Sum of pair-wise squared deviations of points in the same cluster, weighted by the inverse of the number of elements in each cluster
A difficult problem: NP-hard!
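The equality between the variance form and the pairwise-deviation form of the objective can be checked numerically. A small sanity check (illustrative code, not part of the lecture):

```python
def variance_form(cluster):
    # Sum over points of the squared L2 distance to the cluster mean
    n = len(cluster)
    mu = [sum(c) / n for c in zip(*cluster)]
    return sum(sum((xi - mi) ** 2 for xi, mi in zip(x, mu)) for x in cluster)

def pairwise_form(cluster):
    # Sum of pairwise squared deviations, weighted by 1 / (2 |C|)
    n = len(cluster)
    s = sum(sum((xi - yi) ** 2 for xi, yi in zip(x, y))
            for x in cluster for y in cluster)
    return s / (2 * n)

C = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
print(abs(variance_form(C) - pairwise_form(C)) < 1e-9)  # True
```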
Naïve K-means algorithm: Step 1
Initialize: guess initial cluster means μ_1, μ_2, ⋯, μ_K (k_1, k_2, k_3 in the figure)
o Usually initialized randomly, but good initialization is crucial (see later); many smarter initialization heuristics exist (e.g., K-means++, Arthur & Vassilvitskii, 2007)
Naïve K-means algorithm: Step 2
(Re)-Assign: based on the current cluster centers, (re)assign each example to the closest (in the L2 sense) cluster center, in order to minimize the within-cluster distances
o This is equivalent to building the Voronoi diagram with the cluster centers as cell centers
Naïve K-means algorithm: Step 3
(Re)-Estimate: based on the current assignment of examples to clusters, (re)-estimate cluster centers (centroids), in order to best reflect the new assignment
Naïve K-means algorithm: Step 2.1
(Re)-Assign: based on the updated cluster centers, (re)assign each example to the closest cluster center in order to minimize the new within-cluster distances
o This is equivalent to building a new Voronoi diagram with the cluster centers as cell centers
Naïve K-means algorithm: Step 3.1
(Re)-Estimate: based on the current assignment of examples to clusters, (re)-estimate cluster centers, in order to best reflect the new assignment
Naïve K-means algorithm: Step 2.2
Stop: with the current assignment of examples to clusters and cluster centers, no example gets reassigned and the Voronoi diagram won't change; the algorithm has reached a stable configuration and stops.
(Naïve) K-means algorithm (Lloyd, 1957)
Input:
o A set of N unlabeled examples, {x_1, x_2, ⋯, x_N}, where each x_i ∈ ℝ^D
o A target number K of clusters
Initialize: Cluster means μ_1, μ_2, ⋯, μ_K
Repeat:
o Assign each example to the closest cluster center based on the Euclidean metric, which minimizes the within-cluster distance for the current cluster centers:

C_k = {x_n : k = arg min_j ‖x_n − μ_j‖²}, k = 1, ⋯, K

o Update the estimates of the cluster centers as the means based on the new assignment of examples to clusters:

μ_k = mean(C_k) = (1/|C_k|) Σ_{x_n ∈ C_k} x_n, k = 1, ⋯, K

o Stop when cluster centers / assignments don't change (or don't change too much)
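The pseudocode above could be sketched in plain Python roughly as follows (a toy implementation, not the lecture's reference code; random initialization from the data points and first-index tie-breaking are arbitrary choices):

```python
import random

def kmeans(X, K, iters=100, seed=0):
    # Naive (Lloyd) K-means: alternate assignment and mean updates until stable.
    rng = random.Random(seed)
    mu = [tuple(p) for p in rng.sample(X, K)]   # init: K distinct data points
    clusters = []
    for _ in range(iters):
        # Assign step: each point joins the cluster with the closest center
        clusters = [[] for _ in range(K)]
        for x in X:
            k = min(range(K),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(x, mu[j])))
            clusters[k].append(x)
        # Update step: each center becomes the mean of its current cluster
        new_mu = [tuple(sum(c) / len(C) for c in zip(*C)) if C else mu[j]
                  for j, C in enumerate(clusters)]
        if new_mu == mu:                        # no change: converged
            break
        mu = new_mu
    return mu, clusters

centers, _ = kmeans([(0, 0), (1, 0), (10, 0), (11, 0)], K=2)
print(sorted(centers))  # [(0.5, 0.0), (10.5, 0.0)]
```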
K-means = Prototype classification
K-means example
[Figure: cluster assignments over iterations 1–4]
Complexity, Completeness, and Optimality of K-means
At each iteration:
o Computing distances between each of the N data items and the K cluster centers → O(KN)
o Computing cluster centers: each item gets added once to some cluster → O(N)
Assume these two steps are each done once in each of T iterations → O(TKN)
Questions:
1. Is K-means guaranteed to converge (i.e., is the procedure guaranteed to terminate)?
2. If it does converge, is the final assignment the optimal one for the K-means problem?
Effect of the initial guess and local minima
[Figure: iterations T = 1 and T = 2]
The results of the K-means algorithm can vary based on the initial placement of centers
o Some placements can result in a poor convergence rate, or convergence to a sub-optimal clustering → the K-means algorithm can easily get stuck in local minima
Countermeasures:
o Select initial centers using a heuristic (e.g., K-means++: choose centers that are far apart)
o Try out multiple starting points (very important!!!)
o Initialize with the results of another method
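The multiple-starting-points countermeasure can be sketched as follows (toy code; `kmeans_run` is an illustrative compact K-means, not from the slides):

```python
import random

def kmeans_run(X, K, seed, iters=50):
    # One run of naive K-means from a random initialization: (loss, centers).
    rng = random.Random(seed)
    mu = [tuple(p) for p in rng.sample(X, K)]
    for _ in range(iters):
        clusters = [[] for _ in range(K)]
        for x in X:
            k = min(range(K),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(x, mu[j])))
            clusters[k].append(x)
        mu = [tuple(sum(c) / len(C) for c in zip(*C)) if C else mu[j]
              for j, C in enumerate(clusters)]
    loss = sum(min(sum((a - b) ** 2 for a, b in zip(x, m)) for m in mu)
               for x in X)
    return loss, mu

def kmeans_restarts(X, K, n_restarts=10):
    # Try several starting points; keep the lowest within-cluster sum of squares.
    return min(kmeans_run(X, K, seed) for seed in range(n_restarts))

loss, centers = kmeans_restarts([(0, 0), (1, 0), (10, 0), (11, 0)], K=2)
print(round(loss, 6))  # 1.0
```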
Loss function of the K-means problem
o A set of N unlabeled examples, {x_1, x_2, ⋯, x_N}, where each x_i ∈ ℝ^D
o C_k = {x_n : k = arg min_j ‖x_n − μ_j‖²}, k = 1, ⋯, K (C_k is the set of points in cluster k)
o Cluster means μ_1, μ_2, ⋯, μ_K, where each μ_i ∈ ℝ^D, μ_i = mean(C_i) = (1/|C_i|) Σ_{x_n ∈ C_i} x_n
o a_nk ∈ {0, 1}, where a_nk = 1 if x_n ∈ C_k, 0 otherwise
o a_n = (a_n1, a_n2, ⋯, a_nK) (length-K one-hot encoding of x_n's cluster assignment)
o a(n) = k, where k is the index of the centroid assigned to x_n

The loss (distortion, variance) for clustering example x_n: ℓ(μ, x_n, a_n) = Σ_{k=1}^{K} a_nk ‖x_n − μ_k‖²

The loss (variance) for clustering all examples: ℓ(μ, X, A) = Σ_{n=1}^{N} Σ_{k=1}^{K} a_nk ‖x_n − μ_k‖² = ‖X − Aμ‖²

K-means objective: minimize the loss ℓ(μ, X, A) with respect to μ and A (X is given), where X is N×D, A is N×K, and μ is K×D
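With the assignments as an explicit one-hot matrix A, the loss ℓ(μ, X, A) could be computed as follows (an illustrative sketch, not the lecture's code):

```python
def kmeans_loss(X, A, mu):
    # loss = sum_n sum_k a_nk * ||x_n - mu_k||^2, with A the N x K one-hot matrix
    return sum(A[n][k] * sum((xi - mi) ** 2 for xi, mi in zip(X[n], mu[k]))
               for n in range(len(X)) for k in range(len(mu)))

X = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
A = [[1, 0], [1, 0], [0, 1]]          # first two points in cluster 0
mu = [(1.0, 0.0), (10.0, 0.0)]
print(kmeans_loss(X, A, mu))  # 2.0
```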
Loss function
ℓ(μ, A) = Σ_{n=1}^{N} Σ_{k=1}^{K} a_nk ‖x_n − μ_k‖²

Equivalent forms:

ℓ(μ, A) = Σ_{n=1}^{N} ‖x_n − μ_{a(n)}‖²   (sum of total distortion: distances from centers)

ℓ(μ, A) = Σ_{k=1}^{K} Σ_{n: a(n)=k} ‖x_n − μ_k‖²   (within-cluster variance: sum of intra-cluster distances from centers)

Â, μ̂ = arg min_{A,μ} ℓ(μ, A)

Replacing the L2 (Euclidean) distance with the L1 distance (absolute values) gives the K-median algorithm
Loss function and Optimization: Alternating Optimization
K-means optimization problem:

Â, μ̂ = arg min_{A,μ} ℓ(μ, A) = arg min_{A,μ} Σ_{n=1}^{N} Σ_{k=1}^{K} a_nk ‖x_n − μ_k‖²

Difficult problem: two coupled variables; we can't optimize jointly for A and μ

K-means (sub-optimal) approach: alternating optimization between A and μ

1. Fix μ at the current μ̂ and find the optimal A as:
   Â = arg min_A ℓ(μ̂, A; X)
2. Fix A at the found Â and find the optimal μ as:
   μ̂ = arg min_μ ℓ(μ, Â; X)
3. Go to 1 if not yet converged
Solving for 𝑨𝑨: Expectation step
1. Fix μ at the current μ̂ and find the optimal A as:

Â = arg min_A ℓ(μ̂, A; X) = arg min_A Σ_{n=1}^{N} Σ_{k=1}^{K} a_nk ‖x_n − μ̂_k‖²

o The a_nk are discrete: K^N possibilities for the N×K matrix A

Greedy approach: exploit the fact that the terms involving different n are independent of each other, and optimize A one row at a time, a_n, for one single input x_n, keeping all the other a_m, m ≠ n, and μ̂ fixed:

â_n = arg min_{a_n} Σ_{k=1}^{K} a_nk ‖x_n − μ̂_k‖² = arg min_{a(n)} ‖x_n − μ̂_{a(n)}‖²

This is the same as assigning x_n to the closest centroid. This is what K-means does!

Expectation step: expected cluster assignment
â_nk = 1 if k = arg min_j ‖x_n − μ̂_j‖², â_nk = 0 otherwise
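The expectation (assignment) step could look like this in plain Python (an illustrative sketch, not the lecture's code):

```python
def assign_step(X, mu):
    # Expectation step: for each x_n, set a_nk = 1 for the closest center k
    # (squared L2 distance), 0 for all other k.
    A = []
    for x in X:
        k = min(range(len(mu)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(x, mu[j])))
        A.append([1 if j == k else 0 for j in range(len(mu))])
    return A

print(assign_step([(0, 0), (9, 0)], [(1, 0), (10, 0)]))  # [[1, 0], [0, 1]]
```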
Solving for 𝝁𝝁: Maximization step
2. Fix A at the found Â and find the optimal μ as:

μ̂ = arg min_μ ℓ(μ, Â; X) = arg min_μ Σ_{k=1}^{K} Σ_{n: â(n)=k} ‖x_n − μ_k‖²

o Easy to solve since the μ are real-valued vectors, the problem is convex, and each μ_k can be optimized independently since there are no cross dependencies:

μ̂_k = arg min_{μ_k} Σ_{n: â(n)=k} ‖x_n − μ_k‖² = arg min_{μ_k} Σ_{n=1}^{N} â_nk ‖x_n − μ_k‖²

o This is what K-means does! Maximization step: maximum likelihood for the centers

Computing the gradient of ℓ(μ, Â; X) with respect to μ_k and equating it to zero:

∂ℓ/∂μ_k = −2 Σ_{n=1}^{N} â_nk (x_n − μ_k) = 0

which gives: μ̂_k = (Σ_{n=1}^{N} â_nk x_n) / (Σ_{n=1}^{N} â_nk)

→ each μ̂_k is the mean of the points currently in cluster k
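The closed-form mean update (maximization step) could be sketched as follows (illustrative code, not from the slides):

```python
def update_step(X, A, K):
    # Maximization step: mu_k = (sum_n a_nk x_n) / (sum_n a_nk),
    # i.e., each center becomes the mean of the points assigned to cluster k.
    mu = []
    for k in range(K):
        members = [x for x, a in zip(X, A) if a[k] == 1]
        mu.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return mu

X = [(0.0, 0.0), (2.0, 0.0), (10.0, 4.0)]
A = [[1, 0], [1, 0], [0, 1]]
print(update_step(X, A, K=2))  # [(1.0, 0.0), (10.0, 4.0)]
```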
Convergence of K-means
Each step (either updating A or updating μ) can never increase the K-means loss
The K-means algorithm monotonically decreases the objective
(prove it!)
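The monotone decrease can be observed empirically by tracing the loss over the iterations (a toy sketch, not a proof; data and seed are made up):

```python
import random

def kmeans_loss_trace(X, K, iters=10, seed=1):
    # Run naive K-means and record the loss after each (assign, update) pass.
    rng = random.Random(seed)
    mu = [tuple(p) for p in rng.sample(X, K)]
    losses = []
    for _ in range(iters):
        clusters = [[] for _ in range(K)]
        for x in X:
            k = min(range(K),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(x, mu[j])))
            clusters[k].append(x)
        mu = [tuple(sum(c) / len(C) for c in zip(*C)) if C else mu[j]
              for j, C in enumerate(clusters)]
        losses.append(sum(min(sum((a - b) ** 2 for a, b in zip(x, m)) for m in mu)
                          for x in X))
    return losses

losses = kmeans_loss_trace([(0, 0), (1, 1), (8, 8), (9, 9), (4, 5)], K=2)
print(all(b <= a + 1e-12 for a, b in zip(losses, losses[1:])))  # True
```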
Choosing K
One way to select K for the K-means algorithm is to try different values of K, plot the K-means objective versus K, and look for the elbow point
[Figure: K-means loss vs. K, with an elbow at K = 6]
o Note: the larger K is, the lower the loss; in the limit each point is a centroid and the loss is zero!
o AIC (Akaike Information Criterion) defines AIC = 2ℓ + KD and chooses the K with the smallest AIC, which discourages large K (the larger K is, the larger the penalty term)
o … many other criteria
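The slide's AIC criterion could be sketched as follows (toy code with made-up loss values; the 2ℓ + KD formula is the one stated on the slide):

```python
def choose_k_aic(losses_by_k, D):
    # AIC variant from the slide: AIC(K) = 2 * loss(K) + K * D;
    # pick the K with the smallest AIC (the K * D term penalizes large K).
    return min(losses_by_k, key=lambda K: 2 * losses_by_k[K] + K * D)

# Hypothetical losses that keep shrinking as K grows; the penalty term stops
# the criterion from always preferring the largest K.
losses = {2: 100.0, 4: 30.0, 6: 8.0, 8: 7.0, 10: 6.5}
print(choose_k_aic(losses, D=2))  # 6
```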
Hard vs. Soft assignments
K-means decision boundaries and cluster shapes
Kernel K-means