lec18 · 2017. 11. 9. · Title: lec18 Created Date: 11/9/2017 2:38:35 AM
Transcript of lec18 · 2017. 11. 9. · Title: lec18 Created Date: 11/9/2017 2:38:35 AM
-
Department of Computer ScienceCSCI 5622: Machine Learning
Chenhao TanLecture 18: Clustering
Slides adapted from Jordan Boyd-Graber, Chris Ketelsen
1
-
Learning objectives
• Learn about general clustering
• Learn about the K-Means algorithm
• Learn about Gaussian Mixture Models
2
-
Supervised learning
3
Unsupervised learning
Data: X Labels: Y Data: X
Latent structure: Z
-
Clustering
• One important unsupervised method is clustering
• Goal: Organize data in classes
4
-
Clustering applications – Microarray Gene Expression data
5
From: “Skin layer-specific transcriptional profiles in normal and recessive yellow (Mc1re/Mc1re) mice'' by April and Barsh in Pigment Cell Research (2006)
-
Clustering applications – Medical Imaging
6
-
Clustering applications – Community detection
7
-
News Media
8
-
Clustering
• One important unsupervised method is clustering
• Goal: Organize data in classes
• Classes are hard to define
• Different data representation may lead to different clusterings
9
-
Clustering
• One important unsupervised method is clustering
• Goal: Organize data in classes
• Data have high in-class similarity
• Data have low out-of-class similarity
10
-
Clustering - Similarity
11
-
Clustering - Similarity
12
-
K-Means
• Simplest clustering method• Iterative in nature• Reasonably fast• Very popular in practice (though with more bells and whistles)• Requires real-valued data
13
-
K-Means
14
-
K-Means
15
-
16
-
17
-
18
-
19
-
20
-
21
-
22
-
More K-means
• Animations:http://shabal.in/visuals/kmeans/4.html
23
-
K-Means in numbers
24
-
K-Means in numbers
25
-
K-Means in numbers
26
-
K-Means in numbers
27
-
K-Means in numbers
28
-
K-Means in numbers
29
-
K-Means in numbers
30
-
K-Means in numbers
31
-
K-Means in numbers
32
-
K-Means in numbers
33
-
K-Means in numbers
34
-
K-Means in numbers
35
-
K-Means
36
-
K-Means
37
-
K-Means
• Weaknesses• Doesn't really work with categorical data• Usually only converges to local minimum• Have to determine number of clusters• Can be sensitive to outliers• Only generates convex clusters
38
-
K-means - Weaknesses
• Doesn't really work with categorical data
39
-
K-means - Weaknesses
• Doesn't really work with categorical data • Fix: Do K-Modes instead
40
-
K-means - Weaknesses
• Usually only converges to local minimum
41
-
K-means - Weaknesses
• Usually only converges to local minimum • Fix: Do several runs with random inits. and choose best
42
-
K-means - Weaknesses
• Have to determine number of clusters
43
-
K-means - Weaknesses
• Have to determine number of clusters• Fix: Use the elbow methodRun K-Means for different values of k and look at loss function
44
-
45
-
46
-
47
-
48
-
49
-
50
-
Gaussian Mixture Models
51
-
Gaussian Mixture Models
52
-
Gaussian Mixture Models
53
-
Gaussian Mixture Models
54
-
Gaussian Mixture Models
55
-
Gaussian Mixture Models
56
-
Gaussian Mixture Models
57
-
Gaussian Mixture Models
58
-
Gaussian Mixture Models
59
-
Recap
• K-means is the most commonly used clustering algorithm• We learned the Gaussian Mixture Model’s generative story
• We will learn EM-algorithm next week
60