ENHANCED K-MEANS ALGORITHM ON SPATIAL DATASET
OVERVIEW
The k-means algorithm, introduced by J.B. MacQueen in 1967, is one of the most common clustering algorithms and is considered one of the simplest unsupervised learning algorithms. It partitions feature vectors into k clusters so that the within-group sum of squares is minimized.
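For reference, the within-group sum of squares can be written in the standard formulation (stated here for orientation; it is not spelled out in the slides) as

WGSS = Σ_{j=1..k} Σ_{xi in Cj} d2(xi, mj)

where Cj is the set of points assigned to cluster j, mj is the mean of Cj, and d2 is the squared Euclidean distance used throughout the pseudocode below.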
VARIATIONS OF THE K-MEANS ALGORITHM
There are several variants of the k-means clustering algorithm, but most variants involve an iterative scheme that operates over a fixed number of clusters while attempting to satisfy the following properties:
1. Each class has a center, which is the mean position of all the samples in that class.
2. Each sample is in the class whose center it is closest to.
PROCEDURE OF THE K-MEANS ALGORITHM
Step 1: Place the initial group centroids randomly in the 2D space.
Step 2: Assign each object to the group that has the closest centroid.
Step 3: Recalculate the positions of the centroids.
Step 4: If the positions of the centroids did not change, go to the next step; otherwise go to Step 2.
Step 5: End.
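A minimal Python sketch of this procedure follows. It is illustrative only: the function and variable names are our own, and the random-row initialization corresponds to Step 1.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    # Illustrative sketch of Steps 1-5 above; X is an (n, d) array.
    rng = np.random.default_rng(seed)
    # Step 1: choose k random rows of X as the initial centroids.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Step 2: assign each object to the group with the closest centroid.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 3: recalculate the positions of the centroids.
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:      # keep the old center if a cluster empties
                new_centers[j] = members.mean(axis=0)
        # Step 4: if the centroids did not change, stop; else repeat Step 2.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels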
FLOW CHART
[Figure: flow chart of the k-means procedure above; not reproduced in this transcript.]
HOW THE K-MEANS ALGORITHM WORKS
1. It accepts the number of clusters to group the data into, and the dataset to cluster, as input values.
2. It then creates the first K initial clusters (K = number of clusters needed) by choosing K rows of data randomly from the dataset. For example, if there are 10,000 rows of data in the dataset and 3 clusters need to be formed, then the first K = 3 initial clusters are created by selecting 3 records randomly from the dataset. Each of the 3 initial clusters formed will have just one row of data.
3. The k-means algorithm calculates the arithmetic mean of each cluster formed in the dataset. The arithmetic mean of a cluster is the mean of all the individual records in the cluster. In each of the first K initial clusters there is only one record, so the arithmetic mean of such a cluster is simply the set of values that make up that record. For example, if the dataset is a set of Height, Weight, and Age measurements for students in a university, a record P in the dataset S is represented as P = {Age, Height, Weight}. A record containing the measurements of a student John would be John = {20, 170, 80}, where John's Age = 20 years, Height = 170 centimeters, and Weight = 80 pounds. Since there is only one record in each initial cluster, the arithmetic mean of a cluster whose only member is the record for John is {20, 170, 80}.
4. Next, k-means assigns each record in the dataset to only one of the initial clusters. Each record is assigned to the nearest cluster (the cluster it is most similar to) using a measure of distance or similarity such as the Euclidean distance or the Manhattan (city-block) distance, as in the sketch below.
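The two distance measures named above, in a few lines of Python. The record for John is taken from the example in step 3; the cluster center is a hypothetical value chosen for illustration.

import numpy as np

def euclidean(p, q):
    # Straight-line distance between two records.
    return np.sqrt(np.sum((np.asarray(p) - np.asarray(q)) ** 2))

def manhattan(p, q):
    # City-block distance: sum of absolute coordinate differences.
    return np.sum(np.abs(np.asarray(p) - np.asarray(q)))

john = np.array([20, 170, 80])       # record from the example above
center = np.array([25, 165, 100])    # hypothetical cluster center
print(euclidean(john, center))       # ~21.21
print(manhattan(john, center))       # 30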
5. Next, k-means re-assigns each record in the dataset to the most similar cluster and re-calculates the arithmetic mean of all the clusters in the dataset. For example, if a cluster contains two records, John = {20, 170, 80} and Henry = {30, 160, 120}, then the arithmetic mean P_mean is represented as P_mean = {Age_mean, Height_mean, Weight_mean}, where Age_mean = (20 + 30)/2, Height_mean = (170 + 160)/2, and Weight_mean = (80 + 120)/2. The arithmetic mean of this cluster is therefore {25, 165, 100}, and this new arithmetic mean becomes the center of the cluster. Following the same procedure, new cluster centers are formed for all the existing clusters (see the numpy check below).
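The same computation in numpy, reproducing the worked example (record names taken from the text above):

import numpy as np

john = np.array([20, 170, 80])
henry = np.array([30, 160, 120])
cluster = np.stack([john, henry])
print(cluster.mean(axis=0))    # [ 25. 165. 100.] -- matches the worked example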
6. K-means then re-assigns each record in the dataset to only one of the new clusters formed. A record or data point is assigned to the nearest cluster (the cluster it is most similar to) using a measure of distance or similarity such as the Euclidean or Manhattan (city-block) distance.
7. The preceding steps are repeated until stable clusters are formed and the k-means clustering procedure is complete. Stable clusters are formed when new iterations of the algorithm do not create new clusters, i.e. the cluster center (arithmetic mean) of each cluster is the same as the old cluster center. There are different techniques for determining when stable clusters have formed; one common test is sketched below.
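One such stopping test (a standard one, not prescribed by the slides; the tolerance value is an assumption) simply compares old and new centers:

import numpy as np

def is_stable(old_centers, new_centers, tol=1e-6):
    # Clusters are stable when no center has moved by more than tol.
    return np.allclose(old_centers, new_centers, atol=tol)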
COMPUTATIONAL COMPLEXITY
NP-hard in a general Euclidean space of dimension d, even for 2 clusters.
NP-hard for a general number of clusters k, even in the plane.
If k and d are fixed, the problem can be solved exactly in time O(n^(dk+1) log n), where n is the number of entities to be clustered.
ADVANTAGES
Relatively efficient: O(tkn), where n is the number of instances, k is the number of clusters, and t is the number of iterations. Normally, k, t << n.
Often terminates at a local optimum. The global optimum may be found using techniques such as simulated annealing or genetic algorithms.
DISADVANTAGES
Applicable only when a mean is defined.
The number of clusters, k, must be specified in advance.
Unable to handle noisy data and outliers.
Not suitable for discovering clusters with non-convex shapes.
K-MEANS FOR SPHERICAL CLUSTERS
Clustering is a technique for exploratory data analysis, for summary generation, and as a preprocessing step for other data mining tasks.
Clusters may be of arbitrary shapes and can be nested within one another.
EXAMPLES OF SUCH SHAPES
Chain-like patterns (representing active and inactive volcanoes).
[Figure: (a) chain-like patterns; (b) clusters detected by k-means.]
The k-means algorithm discovers spherical clusters whose center is the gravity center (centroid) of the points in that cluster.
The center moves as new points are added to or removed from the cluster.
This motion brings the center closer to some points and farther from others: the points that become closer to the center stay in the cluster, while the points far from the center may change cluster.
SPHERICAL CLUSTERS WITH LARGE VARIANCE IN SIZES
However, the algorithm is suitable only for spherical clusters of similar sizes and densities.
The quality of the resulting clusters decreases when the dataset contains spherical clusters with a large variance in sizes.
CONT.
The proposed method is based on shifting the center of the large cluster toward the small cluster and re-computing the membership of the small cluster's points.
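The slides give no pseudocode for this step, so the following is only one possible reading: the large cluster's center is moved a fraction of the way toward the small cluster's center, and the small cluster's points are then re-tested. The shift factor alpha, the function name, and the in-place update scheme are all our assumptions.

import numpy as np

def shift_and_recompute(X, centers, labels, large_j, small_j, alpha=0.5):
    # Hypothetical reading of the proposed method: move the large
    # cluster's center a fraction alpha of the way toward the small
    # cluster's center, then re-test the small cluster's points.
    centers = centers.copy()
    centers[large_j] += alpha * (centers[small_j] - centers[large_j])
    for i in np.where(labels == small_j)[0]:
        d = np.linalg.norm(X[i] - centers, axis=1)
        labels[i] = int(d.argmin())   # a point may now join the shifted cluster
    return centers, labels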
SPATIAL AUTOCORRELATION
Spatial autocorrelation is determined both by similarities in position and by similarities in attributes.
ENHANCED K-MEANS ALGORITHM
Spatial databases hold huge amounts of collected and stored data, which increases the need for effective analysis methods.
Cluster analysis is a primary data analysis task.
Goal of the enhancement: improve the computational speed of the k-means algorithm.
The idea is to use a simple data structure (e.g., arrays) to keep some information from each iteration for use in the next iteration.
Standard k-means computes the distances between each data point and all centers in every iteration, which is computationally very expensive.
Why not benefit from the previous iteration of the k-means algorithm?
Each point keeps an entry in a simple table:

Point_ID | K_ID | Distance
If (new distance <= previous distance)
    The point stays in its cluster.
Else
    Apply the usual k-means functionality.
For each data point, we keep the distance to the nearest cluster center.
This saves the time otherwise spent computing distances to the other k−1 cluster centers.
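Concretely, the table above corresponds to two per-point arrays. A minimal sketch, with names matching the pseudocode below and an example value for n:

import numpy as np

n = 10_000                              # number of points (example value)
Clusterid = np.zeros(n, dtype=int)      # K_ID: index of the nearest center
Pointdis = np.full(n, np.inf)           # Distance: distance to that center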
FUNCTION “DISTANCE”

Function distance()
// Assign each point to its nearest cluster.
For i = 1 to n
    For j = 1 to k
        Compute the squared Euclidean distance d2(xi, mj);
    endfor
    Find the closest centroid mj to xi;
    mj = mj + xi;  nj = nj + 1;
    MSE = MSE + d2(xi, mj);
    Clusterid[i] = number of the closest centroid;    // keep the number of the closest cluster
    Pointdis[i] = Euclidean distance to the closest centroid;    // and the distance to it
endfor
For j = 1 to k
    mj = mj / nj;
endfor
FUNCTION “DISTANCE_NEW”

Function distance_new()
// Assign each point to its nearest cluster, reusing the previous iteration.
For i = 1 to n
    Compute the squared Euclidean distance d2(xi, m_Clusterid[i]);
    If (d2(xi, m_Clusterid[i]) <= Pointdis[i])
        The point stays in its cluster;    // no need to compute the distances to the other k−1 centers
    Else
        For j = 1 to k
            Compute the squared Euclidean distance d2(xi, mj);
        endfor
        Find the closest centroid mj to xi;
        mj = mj + xi;  nj = nj + 1;
        MSE = MSE + d2(xi, mj);
        Clusterid[i] = number of the closest centroid;
        Pointdis[i] = Euclidean distance to the closest centroid;
endfor
For j = 1 to k
    mj = mj / nj;
endfor
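A runnable Python sketch of the same idea, as one possible translation of the pseudocode: the array names Clusterid and Pointdis follow the slides; the function name, initialization, and convergence handling are our assumptions.

import numpy as np

def enhanced_kmeans(X, centers, max_iter=100):
    # X is an (n, d) array; centers is an initial (k, d) array.
    n, k = len(X), len(centers)
    centers = centers.astype(float).copy()
    # Initial full assignment (function "distance" above).
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    Clusterid = d.argmin(axis=1)
    Pointdis = d[np.arange(n), Clusterid]
    for _ in range(max_iter):
        # Recompute each center as the mean of its members.
        for j in range(k):
            members = X[Clusterid == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
        moved = 0
        for i in range(n):
            # Check only the point's previous center first.
            d_prev = np.linalg.norm(X[i] - centers[Clusterid[i]])
            if d_prev <= Pointdis[i]:
                Pointdis[i] = d_prev          # point stays: O(1) work
                continue
            # Otherwise fall back to the full k-way search: O(k) work.
            dists = np.linalg.norm(X[i] - centers, axis=1)
            j = int(dists.argmin())
            if j != Clusterid[i]:
                moved += 1
            Clusterid[i], Pointdis[i] = j, dists[j]
        if moved == 0:                        # stable clusters: done
            break
    return centers, Clusterid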
COMPLEXITY
K-means complexity: O(nkl), where n is the number of points, k is the number of clusters, and l is the number of iterations.
If a point stays in its cluster, this requires O(1); otherwise it requires O(k).
If we suppose that half the points move from their clusters, this requires O(nk/2).
Since the algorithm converges to a local minimum, the number of points that move from their clusters decreases in each iteration.
So we expect the total cost to be about nk Σ_{i=1..l} (1/i). Even for a large number of iterations, nk Σ_{i=1..l} (1/i) is much less than nkl.
Enhanced k-means algorithm complexity: O(nk).
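As a rough illustration with assumed numbers: for n = 10,000 points, k = 10 clusters, and l = 100 iterations, standard k-means performs nkl = 10,000,000 distance computations, while nk Σ_{i=1..l} (1/i) ≈ nk · ln(l) ≈ 100,000 × 5.19 ≈ 520,000, roughly a twenty-fold reduction.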