Transcript of Introduction to Machine Learning · Lecture 28: Unsupervised Learning, Clustering
Teacher: Gianni A. Di Caro
Lecture 28: Unsupervised Learning, Clustering
Introduction to Machine Learning, 10-315, Fall '19
Disclaimer: These slides can include material from different sources. I'll be happy to explicitly acknowledge a source if required. Contact me for requests.
Unsupervised learning
Unsupervised learning aims to find hidden structures and association relationships in the data
It works in an unsupervised manner: no labels / responses!
It can be seen as learning a new representation of data, typically a compressed representation
Unsupervised learning: representative (typical) tasks
Dimensionality Reduction: reduce the dimensionality (#descriptors) of the data by identifying latent variables/relationships (PCA!)
Unsupervised learning: representative (typical) tasks
Clustering: based on some measure of similarity/dissimilarity, group data items into n clusters, where n is (usually) not known in advance.
Unsupervised learning: representative (typical) tasks
Generative models: estimate the probability density of the data, i.e., the probability distribution underlying the data, which amounts to identifying (probabilistic) relations and correlations among the data
Typical Workflow of Unsupervised Learning for Clustering tasks
Unsupervised ML algorithm for data clustering
Test phase: given a new input image, assign it to an existing cluster
Clustering problem: Partitional (flat) clustering
Given: A set of N unlabeled examples, {x_1, x_2, ⋯, x_N}, where each x_i ∈ ℝ^D
Measure of similarity / dissimilarity among the inputs
Goal: Group the examples into K partitions based on the given measure of similarity, where each partition should feature inputs that are similar / homogeneous. K can be given or not
Clustering ~ Classification without labels
A good clustering is one that achieves:
High similarity among data within the same cluster
Low similarity among data in different clusters
Clustering problem: Hierarchical clustering
Given: A set of N unlabeled examples, {x_1, x_2, ⋯, x_N}, where each x_i ∈ ℝ^D
Measure of similarity / dissimilarity among the inputs
Goal: Group the examples into a hierarchy of clusters based on the given measure of similarity, where each cluster incrementally features inputs that are similar / homogeneous. No need to fix the number of clusters in advance! No need to seed the partitions
Can be constructed (we’ll see this next time):
Agglomerative (bottom-up)
Divisive (top-down)
Similarity needs to be well defined, otherwise it is subjective
o Clustering is based on the given notion of similarity, no labels are given
o However, without labels, a meaningful measure of similarity can be hard to define
Similarity measure and number of clusters ~ inductive bias, the hypotheses that are set
Similarity can be subjective!
Remove subjectivity by defining similarity in terms of a distance between vectors or correlations between random variables.
Distance metrics
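The "distance between vectors" idea above can be made concrete. A minimal plain-Python sketch of three common metrics (the function names are illustrative, not from the slides):

```python
import math

def euclidean(x, y):
    # L2 distance: square root of the sum of squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # L1 distance: sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_distance(x, y):
    # 1 - cosine similarity: 0 for parallel vectors, 1 for orthogonal ones
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (nx * ny)

print(euclidean((0, 0), (3, 4)))  # 5.0
```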
Correlation measures
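Likewise, correlation between random variables can serve as a similarity measure. A sketch of the Pearson correlation coefficient (an illustrative helper, not from the slides):

```python
import math

def pearson(x, y):
    # Pearson correlation: covariance normalized by the two standard deviations;
    # ranges from -1 (perfect anti-correlation) to +1 (perfect correlation)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(pearson([1, 2, 3], [2, 4, 6]))
```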
Clustering examples
o Clustering groups of Documents / Images / Webpages / Web search
o Image Segmentation (clustering pixels)
o Clustering (people) nodes in (social) networks/graphs
Types of clustering
o Hard / partitional clustering: a data item belongs to one and only one cluster (membership is the indicator function). Partitions/clusters are independent of each other.
o Soft clustering: fractional or probabilistic membership, a data item belongs to different clusters with a different degree of membership (fuzzy clustering) or with a different probability (probabilistic clustering)
o Overlapping clustering: a data item belongs (in the hard sense) to different clusters
o Hierarchical clustering: data is organized in a hierarchical manner, such that the goal is to identify a hierarchy of clusters, represented using a dendrogram. Individual data items are at the leaf nodes of the tree structure.
Clustering algorithms
(next time)
Flat / Partitional clustering: K-means problem
K-means problem:
Given a set of N unlabeled examples, {x_1, x_2, ⋯, x_N}, where each x_i ∈ ℝ^D
Given a target number K of clusters (the size of the data partitioning)
Find the assignment of the N examples to K clusters (sets) C = {C_1, C_2, ⋯, C_K}, such that the sets in C minimize the within-cluster sum of squares (the variance of each set, using the squared L2 / Euclidean distance):

arg min_C Σ_{i=1}^{K} Σ_{x_j ∈ C_i} ‖x_j − μ_i‖² = arg min_C Σ_{i=1}^{K} |C_i| Var(C_i)

where μ = (μ_1, μ_2, ⋯, μ_K) are the K cluster means (cluster prototypes) and Var(C_i) is the variance (the spread of the data about the center) in set C_i
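The within-cluster sum of squares objective above can be sketched in plain Python, with clusters given as lists of points and means as tuples (illustrative helper names, not from the slides):

```python
def wcss(clusters, means):
    # Within-cluster sum of squares: for each cluster, add the squared L2
    # distance from every point to that cluster's mean.
    total = 0.0
    for C, mu in zip(clusters, means):
        for x in C:
            total += sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return total

clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 0.0)]]
means = [(1.0, 0.0), (10.0, 0.0)]
print(wcss(clusters, means))  # 2.0
```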
Flat clustering: K-means problem
K-means problem:
Given a set of N unlabeled examples, {x_1, x_2, ⋯, x_N}, where each x_i ∈ ℝ^D
Given a target number K of clusters (the size of the data partitioning)
Find the assignment of the N examples to K clusters (sets) C = {C_1, C_2, ⋯, C_K}, such that the sets in C minimize the within-cluster sum of squares (the variance of each set, using the squared L2 / Euclidean distance):

arg min_C Σ_{i=1}^{K} Σ_{x_j ∈ C_i} ‖x_j − μ_i‖² = arg min_C Σ_{i=1}^{K} |C_i| Var(C_i) = arg min_C Σ_{i=1}^{K} (1 / (2|C_i|)) Σ_{x,y ∈ C_i} ‖x − y‖²

Sum of intra-cluster variances weighted by the number of elements in each cluster
Sum of pair-wise squared deviations of points in the same cluster, weighted by the inverse of the number of elements in each cluster
A difficult problem: NP-hard!
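The equality between the variance form and the pairwise-deviation form of the objective can be checked numerically. A small sanity check (illustrative code, not part of the lecture):

```python
def variance_form(cluster):
    # Sum over points of the squared L2 distance to the cluster mean
    n = len(cluster)
    mu = [sum(c) / n for c in zip(*cluster)]
    return sum(sum((xi - mi) ** 2 for xi, mi in zip(x, mu)) for x in cluster)

def pairwise_form(cluster):
    # Sum of pairwise squared deviations, weighted by 1 / (2 |C|)
    n = len(cluster)
    s = sum(sum((xi - yi) ** 2 for xi, yi in zip(x, y))
            for x in cluster for y in cluster)
    return s / (2 * n)

C = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
print(abs(variance_form(C) - pairwise_form(C)) < 1e-9)  # True
```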
Naïve K-means algorithm: Step 1
Initialize: guess initial cluster means μ_1, μ_2, ⋯, μ_K (k_1, k_2, k_3 in the figure)
o Usually initialized randomly, but good initialization is crucial (see later); many smarter initialization heuristics exist (e.g., K-means++, Arthur & Vassilvitskii, 2007)
Naïve K-means algorithm: Step 2
(Re)-Assign: based on the current cluster centers, (re)assign each example to the closest (in the L2 sense) cluster center, in order to minimize the within-cluster distances
o This is equivalent to building the Voronoi diagram with the cluster centers as cell centers
Naïve K-means algorithm: Step 3
(Re)-Estimate: based on the current assignment of examples to clusters, (re)-estimate cluster centers (centroids), in order to best reflect the new assignment
Naïve K-means algorithm: Step 2.1
(Re)-Assign: based on the updated cluster centers, (re)assign each example to the closest cluster center in order to minimize the new within-cluster distances
o This is equivalent to building a new Voronoi diagram with the cluster centers as cell centers
Naïve K-means algorithm: Step 3.1
(Re)-Estimate: based on the current assignment of examples to clusters, (re)-estimate cluster centers, in order to best reflect the new assignment
Naïve K-means algorithm: Step 2.2
Stop: with the current assignment of examples to clusters and cluster centers, no example gets reassigned and the Voronoi diagram won't change; the algorithm has reached a stable configuration and stops.
(Naïve) K-means algorithm (Lloyd, 1957)
Input:
o A set of N unlabeled examples, {x_1, x_2, ⋯, x_N}, where each x_i ∈ ℝ^D
o A target number K of clusters
Initialize: Cluster means μ_1, μ_2, ⋯, μ_K
Repeat:
o Assign each example to the closest cluster center based on the Euclidean metric, which minimizes the within-cluster distance for the current cluster centers:

C_k = {x_n : k = arg min_j ‖x_n − μ_j‖²}, k = 1, ⋯, K

o Update the estimates of the cluster centers as the means based on the new assignment of examples to clusters:

μ_k = mean(C_k) = (1/|C_k|) Σ_{x_n ∈ C_k} x_n, k = 1, ⋯, K

o Stop when cluster centers / assignments don't change (or don't change too much)
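The pseudocode above could be sketched in plain Python roughly as follows (a toy implementation, not the lecture's reference code; random initialization from the data points and first-index tie-breaking are arbitrary choices):

```python
import random

def kmeans(X, K, iters=100, seed=0):
    # Naive (Lloyd) K-means: alternate assignment and mean updates until stable.
    rng = random.Random(seed)
    mu = [tuple(p) for p in rng.sample(X, K)]   # init: K distinct data points
    clusters = []
    for _ in range(iters):
        # Assign step: each point joins the cluster with the closest center
        clusters = [[] for _ in range(K)]
        for x in X:
            k = min(range(K),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(x, mu[j])))
            clusters[k].append(x)
        # Update step: each center becomes the mean of its current cluster
        new_mu = [tuple(sum(c) / len(C) for c in zip(*C)) if C else mu[j]
                  for j, C in enumerate(clusters)]
        if new_mu == mu:                        # no change: converged
            break
        mu = new_mu
    return mu, clusters

centers, _ = kmeans([(0, 0), (1, 0), (10, 0), (11, 0)], K=2)
print(sorted(centers))  # [(0.5, 0.0), (10.5, 0.0)]
```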
K-means = Prototype classification
K-means example
[Figure: cluster assignments over iterations 1–4]
Complexity, Completeness, and Optimality of K-means
At each iteration:
o Computing distances between each of the N data items and the K cluster centers → O(KN)
o Computing cluster centers: each item gets added once to some cluster → O(N)
Assume these two steps are each done once in each of T iterations → O(TKN)
Questions:
1. Is K-means guaranteed to converge (i.e., is the procedure guaranteed to terminate)?
2. If it does converge, is the final assignment the optimal one for the K-means problem?
Effect of the initial guess and local minima
[Figure: iterations T = 1 and T = 2]
The results of the K-means algorithm can vary based on the initial placement of centers
o Some placements can result in a poor convergence rate, or convergence to a sub-optimal clustering → the K-means algorithm can easily get stuck in local minima
Countermeasures:
o Select initial centers using a heuristic (e.g., K-means++: choose centers that are far apart)
o Try out multiple starting points (very important!!!)
o Initialize with the results of another method
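The multiple-starting-points countermeasure can be sketched as follows (toy code; `kmeans_run` is an illustrative compact K-means, not from the slides):

```python
import random

def kmeans_run(X, K, seed, iters=50):
    # One run of naive K-means from a random initialization: (loss, centers).
    rng = random.Random(seed)
    mu = [tuple(p) for p in rng.sample(X, K)]
    for _ in range(iters):
        clusters = [[] for _ in range(K)]
        for x in X:
            k = min(range(K),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(x, mu[j])))
            clusters[k].append(x)
        mu = [tuple(sum(c) / len(C) for c in zip(*C)) if C else mu[j]
              for j, C in enumerate(clusters)]
    loss = sum(min(sum((a - b) ** 2 for a, b in zip(x, m)) for m in mu)
               for x in X)
    return loss, mu

def kmeans_restarts(X, K, n_restarts=10):
    # Try several starting points; keep the lowest within-cluster sum of squares.
    return min(kmeans_run(X, K, seed) for seed in range(n_restarts))

loss, centers = kmeans_restarts([(0, 0), (1, 0), (10, 0), (11, 0)], K=2)
print(round(loss, 6))  # 1.0
```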
Loss function of the K-means problem
o A set of N unlabeled examples, {x_1, x_2, ⋯, x_N}, where each x_i ∈ ℝ^D
o C_k = {x_n : k = arg min_j ‖x_n − μ_j‖²}, k = 1, ⋯, K (C_k is the set of points in cluster k)
o Cluster means μ_1, μ_2, ⋯, μ_K, where each μ_i ∈ ℝ^D, μ_i = mean(C_i) = (1/|C_i|) Σ_{x_n ∈ C_i} x_n
o a_nk ∈ {0, 1}, where a_nk = 1 if x_n ∈ C_k, 0 otherwise
o a_n = (a_n1, a_n2, ⋯, a_nK) (length-K one-hot encoding of x_n's cluster assignment)
o a(n) = k, where k is the index of the centroid assigned to x_n

The loss (distortion, variance) for clustering example x_n: ℓ(μ, x_n, a_n) = Σ_{k=1}^{K} a_nk ‖x_n − μ_k‖²

The loss (variance) for clustering all examples: ℓ(μ, X, A) = Σ_{n=1}^{N} Σ_{k=1}^{K} a_nk ‖x_n − μ_k‖² = ‖X − Aμ‖²

K-means objective: minimize the loss ℓ(μ, X, A) with respect to μ and A (X is given), where X is N×D, A is N×K, and μ is K×D
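With the assignments as an explicit one-hot matrix A, the loss ℓ(μ, X, A) could be computed as follows (an illustrative sketch, not the lecture's code):

```python
def kmeans_loss(X, A, mu):
    # loss = sum_n sum_k a_nk * ||x_n - mu_k||^2, with A the N x K one-hot matrix
    return sum(A[n][k] * sum((xi - mi) ** 2 for xi, mi in zip(X[n], mu[k]))
               for n in range(len(X)) for k in range(len(mu)))

X = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
A = [[1, 0], [1, 0], [0, 1]]          # first two points in cluster 0
mu = [(1.0, 0.0), (10.0, 0.0)]
print(kmeans_loss(X, A, mu))  # 2.0
```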
Loss function
ℓ(μ, A) = Σ_{n=1}^{N} Σ_{k=1}^{K} a_nk ‖x_n − μ_k‖²

Equivalent forms:

ℓ(μ, A) = Σ_{n=1}^{N} ‖x_n − μ_{a(n)}‖²   (sum of total distortion: distances from centers)

ℓ(μ, A) = Σ_{k=1}^{K} Σ_{n: a(n)=k} ‖x_n − μ_k‖²   (within-cluster variance: sum of intra-cluster distances from centers)

Â, μ̂ = arg min_{A,μ} ℓ(μ, A)

Replacing the L2 (Euclidean) distance with the L1 distance (absolute values) gives the K-median algorithm
Loss function and Optimization: Alternating Optimization
K-means optimization problem:

Â, μ̂ = arg min_{A,μ} ℓ(μ, A) = arg min_{A,μ} Σ_{n=1}^{N} Σ_{k=1}^{K} a_nk ‖x_n − μ_k‖²

Difficult problem: two coupled variables; we can't optimize jointly for A and μ

K-means (sub-optimal) approach: alternating optimization between A and μ

1. Fix μ at the current μ̂ and find the optimal A as:
   Â = arg min_A ℓ(μ̂, A; X)
2. Fix A at the found Â and find the optimal μ as:
   μ̂ = arg min_μ ℓ(μ, Â; X)
3. Go to 1 if not yet converged
Solving for 𝑨𝑨: Expectation step
1. Fix μ at the current μ̂ and find the optimal A as:

Â = arg min_A ℓ(μ̂, A; X) = arg min_A Σ_{n=1}^{N} Σ_{k=1}^{K} a_nk ‖x_n − μ̂_k‖²

o The a_nk are discrete: K^N possibilities for the N×K matrix A

Greedy approach: exploit the fact that the terms involving different n are independent of each other, and optimize A one row at a time, a_n, for one single input x_n, keeping all the other a_m, m ≠ n, and μ̂ fixed:

â_n = arg min_{a_n} Σ_{k=1}^{K} a_nk ‖x_n − μ̂_k‖² = arg min_{a(n)} ‖x_n − μ̂_{a(n)}‖²

This is the same as assigning x_n to the closest centroid. This is what K-means does!

Expectation step: expected cluster assignment
â_nk = 1 if k = arg min_j ‖x_n − μ̂_j‖², â_nk = 0 otherwise
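The expectation (assignment) step could look like this in plain Python (an illustrative sketch, not the lecture's code):

```python
def assign_step(X, mu):
    # Expectation step: for each x_n, set a_nk = 1 for the closest center k
    # (squared L2 distance), 0 for all other k.
    A = []
    for x in X:
        k = min(range(len(mu)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(x, mu[j])))
        A.append([1 if j == k else 0 for j in range(len(mu))])
    return A

print(assign_step([(0, 0), (9, 0)], [(1, 0), (10, 0)]))  # [[1, 0], [0, 1]]
```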
Solving for 𝝁𝝁: Maximization step
2. Fix A at the found Â and find the optimal μ as:

μ̂ = arg min_μ ℓ(μ, Â; X) = arg min_μ Σ_{k=1}^{K} Σ_{n: â(n)=k} ‖x_n − μ_k‖²

o Easy to solve since the μ are real-valued vectors, the problem is convex, and each μ_k can be optimized independently since there are no cross dependencies:

μ̂_k = arg min_{μ_k} Σ_{n: â(n)=k} ‖x_n − μ_k‖² = arg min_{μ_k} Σ_{n=1}^{N} â_nk ‖x_n − μ_k‖²

o This is what K-means does! Maximization step: maximum likelihood for the centers

Computing the gradient of ℓ(μ, Â; X) with respect to μ_k and equating it to zero:

∂ℓ/∂μ_k = −2 Σ_{n=1}^{N} â_nk (x_n − μ_k) = 0

which gives: μ̂_k = (Σ_{n=1}^{N} â_nk x_n) / (Σ_{n=1}^{N} â_nk)

→ each μ̂_k is the mean of the points currently in cluster k
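The closed-form mean update (maximization step) could be sketched as follows (illustrative code, not from the slides):

```python
def update_step(X, A, K):
    # Maximization step: mu_k = (sum_n a_nk x_n) / (sum_n a_nk),
    # i.e., each center becomes the mean of the points assigned to cluster k.
    mu = []
    for k in range(K):
        members = [x for x, a in zip(X, A) if a[k] == 1]
        mu.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return mu

X = [(0.0, 0.0), (2.0, 0.0), (10.0, 4.0)]
A = [[1, 0], [1, 0], [0, 1]]
print(update_step(X, A, K=2))  # [(1.0, 0.0), (10.0, 4.0)]
```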
Convergence of K-means
Each step (either updating A or updating μ) can never increase the K-means loss
The K-means algorithm monotonically decreases the objective
(prove it!)
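The monotone decrease can be observed empirically by tracing the loss over the iterations (a toy sketch, not a proof; data and seed are made up):

```python
import random

def kmeans_loss_trace(X, K, iters=10, seed=1):
    # Run naive K-means and record the loss after each (assign, update) pass.
    rng = random.Random(seed)
    mu = [tuple(p) for p in rng.sample(X, K)]
    losses = []
    for _ in range(iters):
        clusters = [[] for _ in range(K)]
        for x in X:
            k = min(range(K),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(x, mu[j])))
            clusters[k].append(x)
        mu = [tuple(sum(c) / len(C) for c in zip(*C)) if C else mu[j]
              for j, C in enumerate(clusters)]
        losses.append(sum(min(sum((a - b) ** 2 for a, b in zip(x, m)) for m in mu)
                          for x in X))
    return losses

losses = kmeans_loss_trace([(0, 0), (1, 1), (8, 8), (9, 9), (4, 5)], K=2)
print(all(b <= a + 1e-12 for a, b in zip(losses, losses[1:])))  # True
```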
Choosing K
One way to select K for the K-means algorithm is to try different values of K, plot the K-means objective versus K, and look for the elbow point
[Figure: K-means loss vs. K, with an elbow at K = 6]
o Note: the larger K is, the lower the loss; in the limit each point is a centroid and the loss is zero!
o AIC (Akaike Information Criterion) defines AIC = 2ℓ + KD and chooses the K with the smallest AIC, which discourages large K (the larger K is, the larger the penalty term)
o … many other criteria
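The slide's AIC criterion could be sketched as follows (toy code with made-up loss values; the 2ℓ + KD formula is the one stated on the slide):

```python
def choose_k_aic(losses_by_k, D):
    # AIC variant from the slide: AIC(K) = 2 * loss(K) + K * D;
    # pick the K with the smallest AIC (the K * D term penalizes large K).
    return min(losses_by_k, key=lambda K: 2 * losses_by_k[K] + K * D)

# Hypothetical losses that keep shrinking as K grows; the penalty term stops
# the criterion from always preferring the largest K.
losses = {2: 100.0, 4: 30.0, 6: 8.0, 8: 7.0, 10: 6.5}
print(choose_k_aic(losses, D=2))  # 6
```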
Hard vs. Soft assignments
K-means decision boundaries and cluster shapes
Kernel K-means