The Clustering Problem
![Page 1: The Clustering Problem](https://reader030.fdocuments.us/reader030/viewer/2022032606/56812d69550346895d927b6d/html5/thumbnails/1.jpg)
The Clustering Problem
Yongsub Lim
Applied Algorithm Laboratory
KAIST
![Page 2: The Clustering Problem](https://reader030.fdocuments.us/reader030/viewer/2022032606/56812d69550346895d927b6d/html5/thumbnails/2.jpg)
04/19/2023 The Clustering Problem 2
Contents
• The Clustering Problem
• Basic Algorithms
  K-Means
  K-Clustering of Max. Spacing
• Two-Phase Algorithms
• Other Algorithms
![Page 3: The Clustering Problem](https://reader030.fdocuments.us/reader030/viewer/2022032606/56812d69550346895d927b6d/html5/thumbnails/3.jpg)
The Clustering Problem
• Given data, the goal is to discover “meaningful” groups
• Data in the same group are similar, and
• Data in different groups are not similar
![Page 4: The Clustering Problem](https://reader030.fdocuments.us/reader030/viewer/2022032606/56812d69550346895d927b6d/html5/thumbnails/4.jpg)
Example of clustering
[Figure: example data plotted on axes x1 and x2]
![Page 5: The Clustering Problem](https://reader030.fdocuments.us/reader030/viewer/2022032606/56812d69550346895d927b6d/html5/thumbnails/5.jpg)
Example of clustering
[Figure: example data plotted on axes x1 and x2]
![Page 6: The Clustering Problem](https://reader030.fdocuments.us/reader030/viewer/2022032606/56812d69550346895d927b6d/html5/thumbnails/6.jpg)
Example of clustering
[Figure: example data plotted on axes x1 and x2]
![Page 7: The Clustering Problem](https://reader030.fdocuments.us/reader030/viewer/2022032606/56812d69550346895d927b6d/html5/thumbnails/7.jpg)
Applications of Clustering
• The image segmentation problem can be viewed as clustering the pixels of an image
• In unsupervised learning, before constructing a decision rule, we group unlabeled training data through clustering
![Page 8: The Clustering Problem](https://reader030.fdocuments.us/reader030/viewer/2022032606/56812d69550346895d927b6d/html5/thumbnails/8.jpg)
Applications of Clustering
• In a network or graph, we can group vertices so that the vertices within one group are highly connected
• Clustering is also useful in biology to classify genes
![Page 9: The Clustering Problem](https://reader030.fdocuments.us/reader030/viewer/2022032606/56812d69550346895d927b6d/html5/thumbnails/9.jpg)
Basic Algorithms
• Two algorithms will be introduced
• K-Means iteratively computes the centers of K clusters
• K-Clustering of Max. Spacing uses a minimum spanning tree
• The objective functions of the two are different
![Page 10: The Clustering Problem](https://reader030.fdocuments.us/reader030/viewer/2022032606/56812d69550346895d927b6d/html5/thumbnails/10.jpg)
K-Means
• Choose the means of the K clusters randomly
• At each iteration,
  Assign every data point to the cluster whose mean is the nearest among the K means
  Re-compute the means of all clusters
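The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the function name `k_means`, the `iterations` cap, the `seed` parameter, and the choice to draw the initial means from the data points themselves are all assumptions made for the example.

```python
import math
import random


def k_means(points, k, iterations=100, seed=0):
    """Lloyd-style K-Means on a list of equal-length coordinate tuples."""
    rng = random.Random(seed)
    # Random initialization (assumption: pick k distinct data points as means).
    means = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest mean.
        labels = [
            min(range(k), key=lambda c: math.dist(p, means[c])) for p in points
        ]
        # Update step: re-compute each mean from its current members.
        new_means = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_means.append(
                    tuple(sum(x) / len(members) for x in zip(*members))
                )
            else:
                new_means.append(means[c])  # keep the old mean if a cluster is empty
        if new_means == means:  # nothing moved, so the assignment is stable
            break
        means = new_means
    return means, labels
```

Calling `k_means(points, 2)` returns the final means and one cluster label per point. Because the start is random, different seeds can give different results.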
![Page 11: The Clustering Problem](https://reader030.fdocuments.us/reader030/viewer/2022032606/56812d69550346895d927b6d/html5/thumbnails/11.jpg)
K-Means
• The objective is to minimize the sum of squared distances between each cluster center and its members
• It clusters for high density within each cluster
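As a rough illustration, the K-Means objective can be evaluated for a given assignment as follows. The sketch assumes squared Euclidean distance, which is the usual K-Means cost. The name `k_means_cost` and its argument layout are made up for this example.

```python
import math


def k_means_cost(points, labels, means):
    """Sum of squared distances from each point to the mean of its cluster."""
    return sum(math.dist(p, means[lab]) ** 2 for p, lab in zip(points, labels))
```

A lower value means the members sit more densely around their centers.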
![Page 12: The Clustering Problem](https://reader030.fdocuments.us/reader030/viewer/2022032606/56812d69550346895d927b6d/html5/thumbnails/12.jpg)
K-Means Algorithm
• Worst case
Initial two centers randomly chosen
This may not be what we want!
![Page 13: The Clustering Problem](https://reader030.fdocuments.us/reader030/viewer/2022032606/56812d69550346895d927b6d/html5/thumbnails/13.jpg)
K-Clustering of Max. Spacing
• Given data, find K clusters that maximize the minimum distance between any pair of clusters
• Spacing: the minimum distance between any pair of data points in different clusters
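The definition of spacing translates directly into a brute-force check over all pairs. The sketch below assumes Euclidean distance and at least two clusters. The function name `spacing` and the label-list representation are illustrative choices.

```python
import math
from itertools import combinations


def spacing(points, labels):
    """Minimum distance between two points that lie in different clusters."""
    return min(
        math.dist(points[i], points[j])
        for i, j in combinations(range(len(points)), 2)
        if labels[i] != labels[j]
    )
```

If every point carries the same label there is no cross-cluster pair, and `min` raises `ValueError`.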
![Page 14: The Clustering Problem](https://reader030.fdocuments.us/reader030/viewer/2022032606/56812d69550346895d927b6d/html5/thumbnails/14.jpg)
K-Clustering of Max. Spacing
![Page 15: The Clustering Problem](https://reader030.fdocuments.us/reader030/viewer/2022032606/56812d69550346895d927b6d/html5/thumbnails/15.jpg)
K-Clustering of Max. Spacing
• Treat the given data as a complete graph whose edge weights are Euclidean distances
• Compute an MST
• Delete the K-1 most expensive edges of the MST
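One way to sketch this procedure is to run Kruskal's algorithm and stop as soon as K components remain. Up to tie-breaking, this gives the same clusters as building the whole MST and then deleting its K-1 most expensive edges. The function name `max_spacing_clusters` and the union-find helper are assumptions for the example, not part of the slides.

```python
import math
from itertools import combinations


def max_spacing_clusters(points, k):
    """Return one cluster label per point using Kruskal merging stopped at k."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All edges of the complete graph, cheapest first (Euclidean weights).
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    components = n
    for _, i, j in edges:
        if components == k:
            break  # the k-1 most expensive MST edges are never added
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            components -= 1
    # Relabel the roots as 0, 1, ... in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]
```

Sorting all pairs costs O(n^2 log n), which is fine for small inputs. A dedicated Euclidean MST routine would be the natural replacement for large ones.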
![Page 16: The Clustering Problem](https://reader030.fdocuments.us/reader030/viewer/2022032606/56812d69550346895d927b6d/html5/thumbnails/16.jpg)
K-Clustering of Max. Spacing
[Figure: clusterings C_alg and C_opt; label: ≤ spacing of C_alg]
![Page 17: The Clustering Problem](https://reader030.fdocuments.us/reader030/viewer/2022032606/56812d69550346895d927b6d/html5/thumbnails/17.jpg)
K-Clustering of Max. Spacing
• It involves no randomness
• Its objective seems better, or more reasonable, than that of K-Means
![Page 18: The Clustering Problem](https://reader030.fdocuments.us/reader030/viewer/2022032606/56812d69550346895d927b6d/html5/thumbnails/18.jpg)
K-Means vs. Max. Spacing
• Good clustering is
  High density in one cluster (K-Means)
  Long distance between clusters (Max. Spacing)
>
![Page 19: The Clustering Problem](https://reader030.fdocuments.us/reader030/viewer/2022032606/56812d69550346895d927b6d/html5/thumbnails/19.jpg)
K-Means vs. Max. Spacing
![Page 20: The Clustering Problem](https://reader030.fdocuments.us/reader030/viewer/2022032606/56812d69550346895d927b6d/html5/thumbnails/20.jpg)
Two-Phase Algorithms
• Two algorithms will be introduced
• In the first phase, both cluster the data without any restriction on K
• In the second phase, if the number of clusters is larger than K, clusters are merged using Max. Spacing
![Page 21: The Clustering Problem](https://reader030.fdocuments.us/reader030/viewer/2022032606/56812d69550346895d927b6d/html5/thumbnails/21.jpg)
Hierarchical EMST
Oleksandr Grygorash, Yan Zhou, Zach Jorgensen, Minimum Spanning Tree Based Clustering Algorithms
![Page 22: The Clustering Problem](https://reader030.fdocuments.us/reader030/viewer/2022032606/56812d69550346895d927b6d/html5/thumbnails/22.jpg)
Hierarchical EMST
• HEMST removes all MST edges whose weights exceed a threshold (mean + standard deviation of the edge weights)
• If the number of clusters is less than the given K, it proceeds as in Max. Spacing
• If not, it runs Max. Spacing on a reduced data set of one point per cluster, each being the point nearest to the center of its cluster
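The thresholding step in the first bullet might look like the sketch below. It builds an MST with Kruskal's algorithm, drops every edge heavier than mean + standard deviation, and labels the remaining connected components. The function names are hypothetical. Using the population standard deviation (`statistics.pstdev`) is an assumption, since the slides only say "std." The follow-up handling of K is left out.

```python
import math
import statistics
from itertools import combinations


def find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def mst_edges(points):
    """Kruskal's algorithm on the complete Euclidean graph; returns (weight, i, j)."""
    parent = list(range(len(points)))
    candidates = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    tree = []
    for w, i, j in candidates:
        ri, rj = find(parent, i), find(parent, j)
        if ri != rj:
            parent[ri] = rj
            tree.append((w, i, j))
    return tree


def hemst_first_phase(points):
    """Drop MST edges heavier than mean + std and return one label per point."""
    tree = mst_edges(points)
    weights = [w for w, _, _ in tree]
    # Assumption: population standard deviation of the MST edge weights.
    threshold = statistics.mean(weights) + statistics.pstdev(weights)
    parent = list(range(len(points)))
    for w, i, j in tree:
        if w <= threshold:  # keep only the edges at or below the threshold
            ri, rj = find(parent, i), find(parent, j)
            parent[ri] = rj
    ids = {}
    return [ids.setdefault(find(parent, i), len(ids)) for i in range(len(points))]
```

The sketch needs at least two points, since the mean of an empty edge list is undefined.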
![Page 23: The Clustering Problem](https://reader030.fdocuments.us/reader030/viewer/2022032606/56812d69550346895d927b6d/html5/thumbnails/23.jpg)
Hierarchical EMST
![Page 24: The Clustering Problem](https://reader030.fdocuments.us/reader030/viewer/2022032606/56812d69550346895d927b6d/html5/thumbnails/24.jpg)
Hierarchical EMST
![Page 25: The Clustering Problem](https://reader030.fdocuments.us/reader030/viewer/2022032606/56812d69550346895d927b6d/html5/thumbnails/25.jpg)
Hierarchical EMST
![Page 26: The Clustering Problem](https://reader030.fdocuments.us/reader030/viewer/2022032606/56812d69550346895d927b6d/html5/thumbnails/26.jpg)
Hierarchical EMST
![Page 27: The Clustering Problem](https://reader030.fdocuments.us/reader030/viewer/2022032606/56812d69550346895d927b6d/html5/thumbnails/27.jpg)
Modified K-Means Process
• MKF, in the first phase, is similar to K-Means
• The difference is that if a data point is far enough from all clusters, it becomes the center of a new cluster
• While running, if the number of clusters is larger than a threshold, the two nearest clusters are merged
• In the second phase, apply Max. Spacing
M.F. Jiang, S.S. Tseng, C.M. Su, Two-phase clustering process for outliers detection
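A single pass of the first phase could be sketched as below. This is only an illustration of the three rules on the slide. The parameters `far` (how distant a point must be to seed a new cluster) and `max_clusters` (the merge threshold), the helper `mean_point`, and the decision to update means once at the end of the pass are all assumptions, not details taken from the paper.

```python
import math


def mean_point(pts):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(x) / len(pts) for x in zip(*pts))


def modified_k_means_pass(points, centers, far, max_clusters):
    """One assignment pass that may create clusters and merge the nearest two."""
    centers = [tuple(c) for c in centers]
    members = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
        if math.dist(p, centers[nearest]) <= far:
            members[nearest].append(p)  # ordinary K-Means assignment
            continue
        # Far from every cluster: the point becomes the center of a new cluster.
        centers.append(tuple(p))
        members.append([p])
        if len(centers) > max_clusters:
            # Too many clusters: merge the two whose centers are nearest.
            pairs = [
                (i, j)
                for i in range(len(centers))
                for j in range(i + 1, len(centers))
            ]
            a, b = min(
                pairs, key=lambda ij: math.dist(centers[ij[0]], centers[ij[1]])
            )
            merged = members[a] + members[b]
            if merged:
                centers[a] = mean_point(merged)
            else:
                centers[a] = mean_point([centers[a], centers[b]])
            members[a] = merged
            del centers[b], members[b]  # b > a, so index a stays valid
    # Re-compute the mean of every non-empty cluster, as in K-Means.
    centers = [mean_point(m) if m else c for m, c in zip(members, centers)]
    return centers, members
```

Repeating the pass until the centers stop changing would give the first phase. The second phase then applies Max. Spacing to the resulting centers.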
![Page 28: The Clustering Problem](https://reader030.fdocuments.us/reader030/viewer/2022032606/56812d69550346895d927b6d/html5/thumbnails/28.jpg)
Modified K-Means Process
• This scheme can identify outliers by using Max. Spacing
![Page 29: The Clustering Problem](https://reader030.fdocuments.us/reader030/viewer/2022032606/56812d69550346895d927b6d/html5/thumbnails/29.jpg)
Modified K-Means Process
![Page 30: The Clustering Problem](https://reader030.fdocuments.us/reader030/viewer/2022032606/56812d69550346895d927b6d/html5/thumbnails/30.jpg)
Modified K-Means Process
![Page 31: The Clustering Problem](https://reader030.fdocuments.us/reader030/viewer/2022032606/56812d69550346895d927b6d/html5/thumbnails/31.jpg)
Two-Phase Algorithms
• Both give more weight to members of small sets in the first phase
• A small set is most likely data that belong together, so it is reasonable to decrease the distances between its members
![Page 32: The Clustering Problem](https://reader030.fdocuments.us/reader030/viewer/2022032606/56812d69550346895d927b6d/html5/thumbnails/32.jpg)
Other Algorithms
• HCS uses the min-cut of a graph
• It recursively separates the data into two disjoint subsets (min-cut) until all clusters are highly connected
• A graph is highly connected if the minimum number of edges whose removal disconnects the graph is greater than |V|/2
Erez Hartuv, Ron Shamir, A Clustering Algorithm Based on Graph Connectivity
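The "highly connected" test can be sketched by computing the size of a global minimum cut and comparing it with |V|/2. The example below uses the Stoer-Wagner algorithm for the min-cut. That choice is an assumption made here for brevity and is not necessarily what the cited paper uses. The function names are hypothetical, and the recursive splitting step of HCS is not shown.

```python
def edge_connectivity(n, edges):
    """Size of a global minimum cut (Stoer-Wagner) on vertices 0..n-1."""
    w = [[0] * n for _ in range(n)]
    for u, v in edges:
        w[u][v] += 1
        w[v][u] += 1
    vertices = list(range(n))
    best = float("inf")
    while len(vertices) > 1:
        # One phase: repeatedly add the vertex most tightly connected to the set.
        attach = {v: 0 for v in vertices}
        remaining = set(vertices)
        order = []
        while remaining:
            v = max(remaining, key=lambda x: attach[x])
            remaining.remove(v)
            order.append(v)
            for u in remaining:
                attach[u] += w[v][u]
        s, t = order[-2], order[-1]
        # Cut of the phase: the last vertex t against all the others.
        best = min(best, attach[t])
        # Contract t into s before the next phase.
        for u in vertices:
            if u != s and u != t:
                w[s][u] += w[t][u]
                w[u][s] = w[s][u]
        vertices.remove(t)
    return best


def is_highly_connected(n, edges):
    """True if more than n/2 edges must be removed to disconnect the graph."""
    return edge_connectivity(n, edges) > n / 2
```

For instance, the complete graph on four vertices passes the test, while a four-vertex cycle, whose minimum cut has size 2, does not.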
![Page 33: The Clustering Problem](https://reader030.fdocuments.us/reader030/viewer/2022032606/56812d69550346895d927b6d/html5/thumbnails/33.jpg)
Other Algorithms
• Voting
• Apply K-Means N times
• If a pair of data points fell in the same cluster more than a threshold t times, they are grouped
Ana L.N. Fred, Anil K. Jain, Data Clustering Using Evidence Accumulation
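The voting rule can be sketched as follows, given the label lists produced by N clustering runs. The function name `group_by_votes` is hypothetical. Treating "grouped" as transitive (connected components over the qualifying pairs) is an assumption of this sketch.

```python
from itertools import combinations


def group_by_votes(labelings, t):
    """Group points that share a cluster in more than t of the labelings."""
    n = len(labelings[0])
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        # Count the runs in which points i and j ended up together.
        votes = sum(1 for labels in labelings if labels[i] == labels[j])
        if votes > t:
            ri, rj = find(i), find(j)
            parent[ri] = rj
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]
```

Each entry of `labelings` is one run's labels. Only co-membership within a run matters, so label numbers need not agree across runs.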
![Page 34: The Clustering Problem](https://reader030.fdocuments.us/reader030/viewer/2022032606/56812d69550346895d927b6d/html5/thumbnails/34.jpg)
Thanks