ENHANCED K-MEANS ALGORITHM ON SPATIAL DATASET
OVERVIEW
The k-means algorithm, introduced by J.B. MacQueen in 1967, is one of the most common clustering algorithms and is considered one of the simplest unsupervised learning algorithms. It partitions feature vectors into k clusters so that the within-group sum of squares is minimized.
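For reference, the within-group sum of squares can be written in the standard formulation (stated here for orientation; it is not spelled out in the slides) as

WGSS = Σ_{j=1..k} Σ_{xi in Cj} d2(xi, mj)

where Cj is the set of points assigned to cluster j, mj is the mean of Cj, and d2 is the squared Euclidean distance used throughout the pseudocode below.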
VARIATIONS OF THE K-MEANS ALGORITHM
There are several variants of the k-means clustering algorithm, but most variants involve an iterative scheme that operates over a fixed number of clusters while attempting to satisfy the following properties:
1. Each class has a center, which is the mean position of all the samples in that class.
2. Each sample is in the class whose center it is closest to.
PROCEDURE OF THE K-MEANS ALGORITHM
Step 1: Place the initial group centroids randomly in the 2D space.
Step 2: Assign each object to the group that has the closest centroid.
Step 3: Recalculate the positions of the centroids.
Step 4: If the positions of the centroids did not change, go to the next step; otherwise go to Step 2.
Step 5: End.
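A minimal Python sketch of this procedure follows. It is illustrative only: the function and variable names are our own, and the random-row initialization corresponds to Step 1.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    # Illustrative sketch of Steps 1-5 above; X is an (n, d) array.
    rng = np.random.default_rng(seed)
    # Step 1: choose k random rows of X as the initial centroids.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Step 2: assign each object to the group with the closest centroid.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 3: recalculate the positions of the centroids.
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:      # keep the old center if a cluster empties
                new_centers[j] = members.mean(axis=0)
        # Step 4: if the centroids did not change, stop; else repeat Step 2.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels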
FLOW CHART
[Figure: flow chart of the k-means procedure above; not reproduced in this transcript.]
HOW THE K-MEANS ALGORITHM WORKS
1. It accepts the number of clusters to group the data into, and the dataset to cluster, as input values.
2. It then creates the first K initial clusters (K = number of clusters needed) by choosing K rows of data randomly from the dataset. For example, if there are 10,000 rows of data in the dataset and 3 clusters need to be formed, then the first K = 3 initial clusters are created by selecting 3 records randomly from the dataset. Each of the 3 initial clusters formed will have just one row of data.
3. The k-means algorithm calculates the arithmetic mean of each cluster formed in the dataset. The arithmetic mean of a cluster is the mean of all the individual records in the cluster. In each of the first K initial clusters there is only one record, so the arithmetic mean of such a cluster is simply the set of values that make up that record. For example, if the dataset is a set of Height, Weight, and Age measurements for students in a university, a record P in the dataset S is represented as P = {Age, Height, Weight}. A record containing the measurements of a student John would be John = {20, 170, 80}, where John's Age = 20 years, Height = 170 centimeters, and Weight = 80 pounds. Since there is only one record in each initial cluster, the arithmetic mean of a cluster whose only member is the record for John is {20, 170, 80}.
4. Next, k-means assigns each record in the dataset to only one of the initial clusters. Each record is assigned to the nearest cluster (the cluster it is most similar to) using a measure of distance or similarity such as the Euclidean distance or the Manhattan (city-block) distance, as in the sketch below.
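The two distance measures named above, in a few lines of Python. The record for John is taken from the example in step 3; the cluster center is a hypothetical value chosen for illustration.

import numpy as np

def euclidean(p, q):
    # Straight-line distance between two records.
    return np.sqrt(np.sum((np.asarray(p) - np.asarray(q)) ** 2))

def manhattan(p, q):
    # City-block distance: sum of absolute coordinate differences.
    return np.sum(np.abs(np.asarray(p) - np.asarray(q)))

john = np.array([20, 170, 80])       # record from the example above
center = np.array([25, 165, 100])    # hypothetical cluster center
print(euclidean(john, center))       # ~21.21
print(manhattan(john, center))       # 30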
5. Next, k-means re-assigns each record in the dataset to the most similar cluster and re-calculates the arithmetic mean of all the clusters in the dataset. For example, if a cluster contains two records, John = {20, 170, 80} and Henry = {30, 160, 120}, then the arithmetic mean P_mean is represented as P_mean = {Age_mean, Height_mean, Weight_mean}, where Age_mean = (20 + 30)/2, Height_mean = (170 + 160)/2, and Weight_mean = (80 + 120)/2. The arithmetic mean of this cluster is therefore {25, 165, 100}, and this new arithmetic mean becomes the center of the cluster. Following the same procedure, new cluster centers are formed for all the existing clusters (see the numpy check below).
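The same computation in numpy, reproducing the worked example (record names taken from the text above):

import numpy as np

john = np.array([20, 170, 80])
henry = np.array([30, 160, 120])
cluster = np.stack([john, henry])
print(cluster.mean(axis=0))    # [ 25. 165. 100.] -- matches the worked example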
6. K-means then re-assigns each record in the dataset to only one of the new clusters formed. A record or data point is assigned to the nearest cluster (the cluster it is most similar to) using a measure of distance or similarity such as the Euclidean or Manhattan (city-block) distance.
7. The preceding steps are repeated until stable clusters are formed and the k-means clustering procedure is complete. Stable clusters are formed when new iterations of the algorithm do not create new clusters, i.e. the cluster center (arithmetic mean) of each cluster is the same as the old cluster center. There are different techniques for determining when stable clusters have formed; one common test is sketched below.
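One such stopping test (a standard one, not prescribed by the slides; the tolerance value is an assumption) simply compares old and new centers:

import numpy as np

def is_stable(old_centers, new_centers, tol=1e-6):
    # Clusters are stable when no center has moved by more than tol.
    return np.allclose(old_centers, new_centers, atol=tol)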
COMPUTATIONAL COMPLEXITY
NP-hard in a general Euclidean space of dimension d, even for 2 clusters.
NP-hard for a general number of clusters k, even in the plane.
If k and d are fixed, the problem can be solved exactly in time O(n^(dk+1) log n), where n is the number of entities to be clustered.
ADVANTAGES
Relatively efficient: O(tkn), where n is the number of instances, k is the number of clusters, and t is the number of iterations. Normally, k, t << n.
Often terminates at a local optimum. The global optimum may be found using techniques such as simulated annealing or genetic algorithms.
DISADVANTAGES
Applicable only when a mean is defined.
The number of clusters, k, must be specified in advance.
Unable to handle noisy data and outliers.
Not suitable for discovering clusters with non-convex shapes.
K-MEANS FOR SPHERICAL CLUSTERS
Clustering is a technique for exploratory data analysis, for summary generation, and as a preprocessing step for other data mining tasks.
Clusters may be of arbitrary shapes and can be nested within one another.
EXAMPLES OF SUCH SHAPES
Chain-like patterns (representing active and inactive volcanoes).
[Figure: (a) chain-like patterns; (b) clusters detected by k-means.]
The k-means algorithm discovers spherical clusters whose center is the gravity center (centroid) of the points in that cluster.
The center moves as new points are added to or removed from the cluster.
This motion brings the center closer to some points and farther from others: the points that become closer to the center stay in the cluster, while the points far from the center may change cluster.
SPHERICAL CLUSTERS WITH LARGE VARIANCE IN SIZES
However, the algorithm is suitable only for spherical clusters of similar sizes and densities.
The quality of the resulting clusters decreases when the dataset contains spherical clusters with a large variance in sizes.
CONT.
The proposed method is based on shifting the center of the large cluster toward the small cluster and re-computing the membership of the small cluster's points.
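The slides give no pseudocode for this step, so the following is only one possible reading: the large cluster's center is moved a fraction of the way toward the small cluster's center, and the small cluster's points are then re-tested. The shift factor alpha, the function name, and the in-place update scheme are all our assumptions.

import numpy as np

def shift_and_recompute(X, centers, labels, large_j, small_j, alpha=0.5):
    # Hypothetical reading of the proposed method: move the large
    # cluster's center a fraction alpha of the way toward the small
    # cluster's center, then re-test the small cluster's points.
    centers = centers.copy()
    centers[large_j] += alpha * (centers[small_j] - centers[large_j])
    for i in np.where(labels == small_j)[0]:
        d = np.linalg.norm(X[i] - centers, axis=1)
        labels[i] = int(d.argmin())   # a point may now join the shifted cluster
    return centers, labels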
SPATIAL AUTOCORRELATION
Spatial autocorrelation is determined both by similarities in position and by similarities in attributes.
ENHANCED K-MEANS ALGORITHM
Spatial databases hold huge amounts of collected and stored data, which increases the need for effective analysis methods.
Cluster analysis is a primary data analysis task.
Goal of the enhancement: improve the computational speed of the k-means algorithm.
The idea is to use a simple data structure (e.g., arrays) to keep some information from each iteration for use in the next iteration.
Standard k-means computes the distances between each data point and all centers in every iteration, which is computationally very expensive.
Why not benefit from the previous iteration of the k-means algorithm?
Each point keeps an entry in a simple table:

Point_ID | K_ID | Distance
If (new distance <= previous distance)
    The point stays in its cluster.
Else
    Apply the usual k-means functionality.
For each data point, we keep the distance to the nearest cluster center.
This saves the time otherwise spent computing distances to the other k−1 cluster centers.
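Concretely, the table above corresponds to two per-point arrays. A minimal sketch, with names matching the pseudocode below and an example value for n:

import numpy as np

n = 10_000                              # number of points (example value)
Clusterid = np.zeros(n, dtype=int)      # K_ID: index of the nearest center
Pointdis = np.full(n, np.inf)           # Distance: distance to that center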
FUNCTION “DISTANCE”

Function distance()
// Assign each point to its nearest cluster.
For i = 1 to n
    For j = 1 to k
        Compute the squared Euclidean distance d2(xi, mj);
    endfor
    Find the closest centroid mj to xi;
    mj = mj + xi;  nj = nj + 1;
    MSE = MSE + d2(xi, mj);
    Clusterid[i] = number of the closest centroid;    // keep the number of the closest cluster
    Pointdis[i] = Euclidean distance to the closest centroid;    // and the distance to it
endfor
For j = 1 to k
    mj = mj / nj;
endfor
FUNCTION “DISTANCE_NEW”

Function distance_new()
// Assign each point to its nearest cluster, reusing the previous iteration.
For i = 1 to n
    Compute the squared Euclidean distance d2(xi, m_Clusterid[i]);
    If (d2(xi, m_Clusterid[i]) <= Pointdis[i])
        The point stays in its cluster;    // no need to compute the distances to the other k−1 centers
    Else
        For j = 1 to k
            Compute the squared Euclidean distance d2(xi, mj);
        endfor
        Find the closest centroid mj to xi;
        mj = mj + xi;  nj = nj + 1;
        MSE = MSE + d2(xi, mj);
        Clusterid[i] = number of the closest centroid;
        Pointdis[i] = Euclidean distance to the closest centroid;
endfor
For j = 1 to k
    mj = mj / nj;
endfor
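A runnable Python sketch of the same idea, as one possible translation of the pseudocode: the array names Clusterid and Pointdis follow the slides; the function name, initialization, and convergence handling are our assumptions.

import numpy as np

def enhanced_kmeans(X, centers, max_iter=100):
    # X is an (n, d) array; centers is an initial (k, d) array.
    n, k = len(X), len(centers)
    centers = centers.astype(float).copy()
    # Initial full assignment (function "distance" above).
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    Clusterid = d.argmin(axis=1)
    Pointdis = d[np.arange(n), Clusterid]
    for _ in range(max_iter):
        # Recompute each center as the mean of its members.
        for j in range(k):
            members = X[Clusterid == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
        moved = 0
        for i in range(n):
            # Check only the point's previous center first.
            d_prev = np.linalg.norm(X[i] - centers[Clusterid[i]])
            if d_prev <= Pointdis[i]:
                Pointdis[i] = d_prev          # point stays: O(1) work
                continue
            # Otherwise fall back to the full k-way search: O(k) work.
            dists = np.linalg.norm(X[i] - centers, axis=1)
            j = int(dists.argmin())
            if j != Clusterid[i]:
                moved += 1
            Clusterid[i], Pointdis[i] = j, dists[j]
        if moved == 0:                        # stable clusters: done
            break
    return centers, Clusterid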
COMPLEXITY
K-means complexity: O(nkl), where n is the number of points, k is the number of clusters, and l is the number of iterations.
If a point stays in its cluster, this requires O(1); otherwise it requires O(k).
If we suppose that half the points move from their clusters, this requires O(nk/2).
Since the algorithm converges to a local minimum, the number of points that move from their clusters decreases in each iteration.
So we expect the total cost to be about nk Σ_{i=1..l} (1/i). Even for a large number of iterations, nk Σ_{i=1..l} (1/i) is much less than nkl.
Enhanced k-means algorithm complexity: O(nk).
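As a rough illustration with assumed numbers: for n = 10,000 points, k = 10 clusters, and l = 100 iterations, standard k-means performs nkl = 10,000,000 distance computations, while nk Σ_{i=1..l} (1/i) ≈ nk · ln(l) ≈ 100,000 × 5.19 ≈ 520,000, roughly a twenty-fold reduction.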