Lecture 9 Clustering
-
8/13/2019 Lecture 9 Clustering
Clustering
-
Pattern Recognition
Learning/Testing
Supervised learning (data with class labels): Bayesian, KNN, neural networks, ... These are the "Classifiers" in Weka.
Unsupervised learning (data with no labels): K-Means, Cobweb, EM, ... These appear under "Cluster" in Weka, so this topic is generally called clustering.
-
Supervised learning vs. unsupervised learning
Supervised learning: discover patterns in the data that relate data attributes to a target (class) attribute.
These patterns are then used to predict the values of the target attribute in future data instances.
Unsupervised learning: the data have no target attribute.
We want to explore the data to find some intrinsic structure in them.
-
Clustering
Clustering is a technique for finding similarity groups in data, called clusters. That is, it groups data instances that are similar to (near) each other into one cluster, and data instances that are very different (far away) from each other into different clusters.
Clustering is often called an unsupervised learning task because no class values denoting an a priori grouping of the data instances are given, as they are in supervised learning.
-
An illustration
The data set has three natural groups of data points, i.e., 3 natural clusters.
-
What is clustering for?
Let us see some real-life examples.
Example 1: group people of similar sizes together to make small, medium, and large T-shirts.
Tailor-made for each person: too expensive.
One-size-fits-all: does not fit all.
Example 2: in marketing, segment customers according to their similarities, to do targeted marketing.
-
What is clustering for? (cont.)
Example 3: given a collection of text documents, we want to organize them according to their content similarities, to produce a topic hierarchy.
In fact, clustering is one of the most widely used data mining techniques.
It has a long history and is used in almost every field, e.g., medicine, psychology, botany, sociology, biology, archeology, marketing, insurance, libraries, etc.
In recent years, due to the rapid increase of online documents, text clustering has become important.
-
Aspects of clustering
A clustering algorithm:
Partitional clustering
Hierarchical clustering
A distance (similarity, or dissimilarity) function
Clustering quality:
Inter-cluster distance: maximized
Intra-cluster distance: minimized
The quality of a clustering result depends on the algorithm, the distance function, and the application.
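The distance function and the two quality criteria can be made concrete with a small sketch. The slides do not say how the distances are measured, so everything specific here is an assumption: Euclidean distance, average pairwise distance as the intra-cluster measure, and centroid-to-centroid distance as the inter-cluster measure. The function names and the toy clusters are hypothetical.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def intra_cluster_distance(points):
    """Average pairwise distance inside one cluster (assumed measure)."""
    pairs = list(combinations(points, 2))
    if not pairs:
        return 0.0
    return sum(euclidean(a, b) for a, b in pairs) / len(pairs)

def inter_cluster_distance(cluster_a, cluster_b):
    """Distance between the centroids of two clusters (assumed measure)."""
    return euclidean(centroid(cluster_a), centroid(cluster_b))

# Hypothetical toy clusters, for illustration only.
cluster_1 = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]
cluster_2 = [(8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]

print("intra-cluster 1:", intra_cluster_distance(cluster_1))
print("intra-cluster 2:", intra_cluster_distance(cluster_2))
print("inter-cluster:  ", inter_cluster_distance(cluster_1, cluster_2))
```

A good clustering of this toy data shows a small intra-cluster value and a much larger inter-cluster value.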
-
K-means clustering
K-means is a partitional clustering algorithm.
Let the set of data points (or instances) $D$ be $\{x_1, x_2, \ldots, x_n\}$, where $x_i = (x_{i1}, x_{i2}, \ldots, x_{ir})$ is a vector in a real-valued space $X \subseteq \mathbb{R}^r$, and $r$ is the number of attributes (dimensions) in the data.
The k-means algorithm partitions the given data into k clusters.
Each cluster has a cluster center, called the centroid.
k is specified by the user.
-
Algorithm k-means
1. Decide on a value for k.
2. Initialize the k cluster centers (randomly, if necessary).
3. Decide the class memberships of the N objects by assigning them to the nearest cluster center.
4. Re-estimate the k cluster centers, by assuming the memberships found above are correct.
5. If none of the N objects changed membership in the last iteration, exit. Otherwise go to 3.
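The five steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation. Several details are assumptions the slide leaves open: the initial centers are k randomly chosen data points, distance is Euclidean, an empty cluster keeps its old center, and `max_iter` caps the loop as a safety net. The function names and the toy data are hypothetical.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def k_means(points, k, seed=0, max_iter=100):
    """Return (centers, labels) for the given points and k (step 1)."""
    rng = random.Random(seed)
    # Step 2: pick k distinct data points as the initial centers (assumption).
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 3: assign every object to its nearest cluster center.
        new_labels = [
            min(range(k), key=lambda j: euclidean(p, centers[j]))
            for p in points
        ]
        # Step 5: exit when no object changed membership.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: re-estimate each center as the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its old center (assumption)
                centers[j] = mean_point(members)
    return centers, labels

# Hypothetical toy data: two well-separated groups of three points.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
        (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels = k_means(data, k=2)
print("centers:", centers)
print("labels: ", labels)
```

Which group receives label 0 depends on the random initialization, so only the grouping itself is meaningful, not the label numbers.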
-
K-means algorithm (cont.)
-
K-means Clustering: Step 1
Algorithm: k-means, Distance Metric: Euclidean Distance
[Figure: scatter plot on axes from 0 to 5, showing cluster centers k1, k2, and k3]
-
K-means Clustering: Step 2
Algorithm: k-means, Distance Metric: Euclidean Distance
[Figure: scatter plot on axes from 0 to 5, showing cluster centers k1, k2, and k3]
-
K-means Clustering: Step 3
Algorithm: k-means, Distance Metric: Euclidean Distance
[Figure: scatter plot on axes from 0 to 5, showing cluster centers k1, k2, and k3]
-
K-means Clustering: Step 4
Algorithm: k-means, Distance Metric: Euclidean Distance
[Figure: scatter plot on axes from 0 to 5, showing cluster centers k1, k2, and k3]
-
K-means Clustering: Step 5
Algorithm: k-means, Distance Metric: Euclidean Distance
[Figure: scatter plot on axes from 0 to 5, labelled "expression in condition 1" and "expression in condition 2", showing cluster centers k1, k2, and k3]
-
How can we tell the right number of clusters?
In general, this is an unsolved problem. However, there are many approximate methods. In the next few slides we will see an example.
For our example, we will use the dataset on the left.
[Figure: scatter plot of the example dataset, both axes from 1 to 10]
However, in this case we are imagining that we do NOT know the class labels. We are only clustering on the X and Y axis values.
-
When k = 1, the objective function is 873.0
-
When k = 2, the objective function is 173.1
-
When k = 3, the objective function is 133.6
-
We can plot the objective function values for k equals 1 to 6.
[Figure: plot of the objective function (vertical axis, 0 to 1,000) against k (horizontal axis, 1 to 6)]
The abrupt change at k = 2 is highly suggestive of two clusters in the data. This technique for determining the number of clusters is known as knee finding or elbow finding.
Note that the results are not always as clear-cut as in this toy example.
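Elbow finding can be sketched as follows. The slides do not define the objective function, so this sketch assumes the usual choice for k-means: the sum of squared Euclidean distances from each point to its cluster center (SSE). The data set is a hypothetical one with two obvious groups, not the data set from the slides, and each k is run from several random seeds, keeping the lowest SSE, which is also an assumption.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def k_means_sse(points, k, seed, max_iter=100):
    """Run k-means once and return the assumed objective (SSE)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = mean_point(members)
    return sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))

def best_sse(points, k, restarts=10):
    """Lowest SSE over several random initializations."""
    return min(k_means_sse(points, k, seed) for seed in range(restarts))

# Hypothetical data: two compact groups of five points each.
data = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0), (1.5, 1.5),
        (8.0, 8.0), (8.0, 9.0), (9.0, 8.0), (9.0, 9.0), (8.5, 8.5)]

sse = [best_sse(data, k) for k in range(1, 7)]
for k, value in enumerate(sse, start=1):
    print(f"k = {k}: objective = {value:.2f}")

# The elbow is taken to be the k reached by the largest single drop.
drops = [sse[i] - sse[i + 1] for i in range(len(sse) - 1)]
elbow_k = drops.index(max(drops)) + 2
print("suggested number of clusters:", elbow_k)
```

Picking the largest drop is only one crude way to locate the knee. On real data the curve is often smoother and the choice is usually made by inspecting the plot.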
-
Image Segmentation Results
[Figure: an image (I), and a three-cluster image (J) obtained by clustering on the gray values of I]
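How a three-cluster image might be produced from gray values can be sketched with one-dimensional k-means. The slides do not describe the procedure, so this is an assumption throughout: the tiny image is a hypothetical nested list of gray levels, the initial centers are spread evenly over the gray range, and each pixel in the output is replaced by the rounded mean gray value of its cluster.

```python
def segment_gray(image, k=3, max_iter=50):
    """Cluster gray values into k groups and return the recolored image."""
    pixels = [v for row in image for v in row]
    lo, hi = min(pixels), max(pixels)
    # Assumed initialization: k centers spread evenly over the gray range.
    centers = [lo + (hi - lo) * (j + 0.5) / k for j in range(k)]
    labels = None
    for _ in range(max_iter):
        # Assign each pixel to the nearest center (1-D distance).
        new_labels = [
            min(range(k), key=lambda j: abs(v - centers[j]))
            for v in pixels
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Move each center to the mean gray value of its pixels.
        for j in range(k):
            members = [v for v, lab in zip(pixels, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    # Rebuild the rows, painting each pixel with its cluster's mean gray.
    width = len(image[0])
    flat = [round(centers[lab]) for lab in labels]
    return [flat[i:i + width] for i in range(0, len(flat), width)]

# Hypothetical 3x4 gray image (I): dark, mid-gray, and bright regions.
image_i = [
    [10, 12, 11, 200],
    [9, 120, 125, 205],
    [13, 118, 122, 210],
]
image_j = segment_gray(image_i, k=3)
for row in image_j:
    print(row)
```

The output image (J) contains only three distinct gray levels, one per cluster.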
-
Strengths of k-means
Simple: easy to understand and to implement.
Efficient: time complexity is O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations.
Since both k and t are usually small compared with n, k-means is considered a linear algorithm.
K-means is one of the most popular clustering algorithms.
-
Weaknesses of k-means
The algorithm is only applicable if the mean is defined.
For categorical data, use k-modes, in which the centroid is represented by the most frequent values.
The user needs to specify k.
The algorithm is sensitive to outliers.
Outliers are data points that are very far away from other data points.
Outliers could be errors in the data recording or some special data points with very different values.
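The sensitivity to outliers follows directly from the centroid being a mean. A small worked example, using hypothetical points, shows how far a single distant point drags the centroid of an otherwise compact cluster:

```python
def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

# Hypothetical compact cluster and one far-away outlier.
cluster = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0)]
outlier = (50.0, 50.0)

without_outlier = mean_point(cluster)
with_outlier = mean_point(cluster + [outlier])

print("centroid without outlier:", without_outlier)
print("centroid with outlier:   ", with_outlier)
```

With the outlier included, the centroid lies nowhere near any of the four original points, so it no longer represents the cluster well.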
-
Weaknesses of k-means: Problems with outliers
-
Weaknesses of k-means: To deal with outliers
One method is to remove, during the clustering process, data points that are much further away from the centroids than other data points.
Another method is to perform random sampling. Since sampling selects only a small subset of the data points, the chance of selecting an outlier is very small.
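The first method can be sketched for a single cluster. The slides do not say how "much further away" is decided, so the rule used here is an assumption: a point is dropped when its distance to the centroid exceeds a hypothetical `factor` times the median distance. The function name and the data are also hypothetical.

```python
import math
from statistics import median

def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def split_far_points(points, factor=3.0):
    """Split a cluster into (kept, removed) using an assumed cutoff rule."""
    center = mean_point(points)
    dists = [euclidean(p, center) for p in points]
    # Assumed rule: "much further away" means beyond factor * median distance.
    cutoff = factor * median(dists)
    kept = [p for p, d in zip(points, dists) if d <= cutoff]
    removed = [p for p, d in zip(points, dists) if d > cutoff]
    return kept, removed

# Hypothetical cluster containing one obvious outlier.
cluster = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0), (50.0, 50.0)]
kept, removed = split_far_points(cluster)

print("removed:", removed)
print("centroid after removal:", mean_point(kept))
```

In a full k-means run this check would be applied to every cluster, and the removed points are usually set aside for inspection rather than discarded, since they may be genuine special cases.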
-
Weaknesses of k-means (cont.)
The algorithm is sensitive to initial seeds.
-
Weaknesses of k-means (cont.)
If we use different seeds: good results.
There are some methods to help choose good seeds.
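The slides do not name a particular seeding method. One simple heuristic, shown here purely as an illustration, is farthest-first seeding: start from one data point, then repeatedly add the point that is farthest from all seeds chosen so far, so that the seeds are spread across the data. The function name, the choice of first seed, and the data are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first_seeds(points, k, first_index=0):
    """Pick k seeds, each one the point farthest from the seeds so far."""
    seeds = [points[first_index]]
    while len(seeds) < k:
        # Score every point by its distance to the nearest existing seed
        # and take the point with the largest score.
        next_seed = max(points, key=lambda p: min(sq_dist(p, s) for s in seeds))
        seeds.append(next_seed)
    return seeds

# Hypothetical data with three groups of three points.
data = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
        (8.0, 8.0), (9.0, 8.0), (8.0, 9.0),
        (1.0, 9.0), (2.0, 9.0), (1.0, 8.0)]

seeds = farthest_first_seeds(data, k=3)
print("seeds:", seeds)
```

On this data each seed lands in a different group. Because the heuristic always picks extreme points, it is itself drawn to outliers, so it works best after outliers have been handled. Another common option is simply to run k-means from several random seeds and keep the best result.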