Clustering
http://net.pku.edu.cn/~course/cs402/2009
Hongfei Yan, School of EECS, Peking University
7/8/2009
Refer to Aaron Kimball’s slides
Google News
• They didn’t pick all 3,400,217 related articles by hand…
• Or Amazon.com
• Or Netflix…
Other less glamorous things...
• Hospital Records
• Scientific Imaging
  – Related genes, related stars, related sequences
• Market Research
  – Segmenting markets, product positioning
• Social Network Analysis
• Data Mining
• Image Segmentation…
The Distance Measure
• How the similarity of two elements in a set is determined, e.g.
  – Euclidean Distance
  – Manhattan Distance
  – Inner Product Space
  – Maximum Norm
  – Or any metric you define over the space…
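The measures above can be written out in a few lines of plain Python (the helper names are illustrative, not from the slides):

```python
import math

def euclidean(a, b):
    # Straight-line (L2) distance between two equal-length vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute coordinate differences (L1).
    return sum(abs(x - y) for x, y in zip(a, b))

def max_norm(a, b):
    # Largest single coordinate difference (L-infinity).
    return max(abs(x - y) for x, y in zip(a, b))

def inner_product(a, b):
    # Dot product: a similarity score (bigger = more alike), not a distance.
    return sum(x * y for x, y in zip(a, b))
```

Note that the inner product runs the opposite way from the others: it grows with similarity, so it needs different handling if an algorithm expects a distance.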
Types of Algorithms
• Hierarchical Clustering vs.
• Partitional Clustering
Hierarchical Clustering
• Builds up or breaks down a hierarchy of clusters (agglomerative vs. divisive).
Partitional Clustering
• Partitions the set into all clusters simultaneously.
K-Means Clustering
• Simple Partitional Clustering
• Choose the number of clusters, k
• Choose k points to be cluster centers
• Then…
K-Means Clustering
iterate {
    compute the distance from every point to each of the k centers
    assign each point to its nearest center
    compute the average of the points assigned to each center
    replace the k centers with the new averages
}
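The loop above, written out as a runnable sketch (plain Python, not the course's code; the function name, seed, and iteration count are illustrative — real code would test for convergence instead of a fixed iteration budget):

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means over 2-D points, following the loop above."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # choose k data points as initial centers
    for _ in range(iterations):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        # Replace each center with the mean of its assigned points.
        for i, cl in enumerate(clusters):
            if cl:  # keep the old center if a cluster went empty
                centers[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return centers
```

For example, `kmeans([(0,0), (0,1), (1,0), (9,9), (9,10), (10,9)], 2)` settles on the means of the two obvious groups.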
But!
• The complexity is pretty high:
  – k × n × O(distance metric) × number of iterations
• Moreover, it can be necessary to send tons of data to each mapper node. Depending on the bandwidth and memory available, this could be impossible.
Furthermore
• There are three big ways a data set can be large:
  – There are a large number of elements in the set.
  – Each element can have many features.
  – There can be many clusters to discover.
• Conclusion – Clustering can be huge, even when you distribute it.
Canopy Clustering
• Preliminary step to help parallelize computation.
• Clusters the data into overlapping canopies using a super-cheap distance metric.
• Efficient
• Accurate
Canopy Clustering
while there are unmarked points {
    pick a point that is not strongly marked
    call it a canopy center
    mark all points within some threshold of it as in its canopy
    strongly mark all points within some stronger (tighter) threshold
}
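The same loop as a runnable sketch (illustrative Python, not the slides' code; `t_loose` and `t_tight` name the two thresholds described above, with `t_tight < t_loose`):

```python
def canopy(points, distance, t_loose, t_tight):
    """Greedy canopy clustering. Returns {center: members}; a point can
    appear in several canopies, which is what makes them overlap."""
    candidates = list(points)      # points not yet strongly marked
    canopies = {}
    while candidates:
        center = candidates.pop(0)  # pick a point that is not strongly marked
        # Loose mark: everything within t_loose joins this canopy.
        canopies[center] = [p for p in points if distance(center, p) < t_loose]
        # Strong mark: points within t_tight can no longer become centers.
        candidates = [p for p in candidates
                      if distance(center, p) >= t_tight]
    return canopies
```

On 1-D data, `canopy([1, 2, 8, 9, 20], lambda a, b: abs(a - b), 4, 3)` yields three canopies centered at 1, 8, and 20.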
After the canopy clustering…
• Resume hierarchical or partitional clustering as usual.
• Treat objects in separate canopies as being at infinite distances.
MapReduce Implementation
• Problem – Efficiently partition a large data set (say… movies with user ratings!) into a fixed number of clusters using Canopy Clustering, K-Means Clustering, and a Euclidean distance measure.
The Distance Metric
• The Canopy Metric ($)
• The K-Means Metric ($$$)
Steps!
• Get data into a form you can use (MR)
• Picking canopy centers (MR)
• Assign data points to canopies (MR)
• Pick k-means cluster centers
• K-Means algorithm (MR)
– Iterate!
Data Massage
• This isn't interesting, but it has to be done.
Selecting Canopy Centers
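One common way to parallelize this step (an assumed formulation, not necessarily the slides'): each mapper runs the greedy canopy pass over its own split and emits the centers it picked, then a reducer runs the same pass over all candidate centers to merge near-duplicates. A minimal sketch, with `t_tight` as the tight threshold:

```python
def canopy_centers(points, distance, t_tight):
    """One greedy pass: keep a point as a center unless it sits within
    t_tight of an already-chosen center."""
    centers = []
    for p in points:
        if all(distance(p, c) >= t_tight for c in centers):
            centers.append(p)
    return centers

def select_centers_mr(splits, distance, t_tight):
    # "Map": each split proposes centers locally.
    candidates = [c for split in splits
                  for c in canopy_centers(split, distance, t_tight)]
    # "Reduce": merge proposals from all splits with the same pass.
    return canopy_centers(candidates, distance, t_tight)
```

The reduce pass is cheap because only candidate centers, not the raw data, flow to it.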
Assigning Points to Canopies
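A plausible mapper for this step (illustrative, not the slides' code): for each data point, emit one (canopy, point) record per canopy center within the loose threshold, so that the expensive k-means work later only compares points that share a canopy.

```python
def assign_map(point, centers, distance, t_loose):
    # Mapper: tag the point with every canopy it falls inside.
    # A point near two centers is emitted twice - canopies overlap.
    for i, c in enumerate(centers):
        if distance(point, c) < t_loose:
            yield i, point
```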
K-Means Map
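A hedged sketch of the map/reduce pair this slide names (illustrative signatures; real Hadoop code would read key-grouped streams): the mapper needs only the small list of current centers, not the whole data set, and the reducer averages the points assigned to each center to produce the next iteration's centers.

```python
def kmeans_map(point, centers):
    """Mapper: emit (index of nearest center, point) for one input point."""
    nearest = min(range(len(centers)),
                  key=lambda i: sum((p - c) ** 2
                                    for p, c in zip(point, centers[i])))
    yield nearest, point

def kmeans_reduce(center_index, points):
    """Reducer: average all points assigned to one center -> new center."""
    points = list(points)
    dim = len(points[0])
    yield center_index, tuple(sum(p[d] for p in points) / len(points)
                              for d in range(dim))
```

Running one simulated map/shuffle/reduce round over a few 2-D points moves each center to the mean of its assigned points, which is exactly one k-means iteration.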
Elbow Criterion
• Choose a number of clusters such that adding a cluster doesn't add interesting information.
• Rule of thumb to determine what number of clusters should be chosen.
• Initial assignment of cluster seeds has bearing on final model performance.
• Often required to run clustering several times to get maximal performance.
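The elbow rule of thumb can be made concrete by computing the within-cluster sum of squares for k = 1, 2, 3, … and picking the k where the curve flattens out. A minimal sketch (names are illustrative; `cluster_fn` stands in for any k-means routine that returns a list of centers):

```python
def wcss(points, centers):
    # Within-cluster sum of squares: each point is scored against
    # its nearest center.
    return sum(min(sum((p - c) ** 2 for p, c in zip(pt, ctr))
                   for ctr in centers)
               for pt in points)

def elbow(points, cluster_fn, max_k=6):
    """Report the (k, WCSS) curve; the 'elbow' is where adding a
    cluster stops paying off. cluster_fn(points, k) -> centers."""
    return [(k, wcss(points, cluster_fn(points, k)))
            for k in range(1, max_k + 1)]
```

Since seed placement matters, each point on the curve would ideally be the best of several clustering runs.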
Clustering Conclusions
• Clustering is slick
• And it can be done super efficiently
• And in lots of different ways
Homework
• Lab 4 – Clustering the Netflix Movie Data
• Hw4 – Read IIR Chapter 16
– Flat clustering