
Genetic Clustering Algorithm Based on Dynamic Granularity

Jia Zhen School of Electronic and Information Engineering,

Liaoning Technical University, Huludao Liaoning 125105, China

e-mail:[email protected]

Wang Yong Gui School of Electronic and Information Engineering,

Liaoning Technical University, Huludao Liaoning 125105, China

e-mail: [email protected]

Abstract—From the viewpoint of granularity, this paper presents a genetic clustering algorithm based on dynamic granularity. Exploiting the parallel, random-search, global-optimization and diversity characteristics of the genetic algorithm, it is combined with a dynamic granularity model. As the granularity changes, an appropriate granulation can be obtained by coarsening and refining the granularity, which ensures the clustering efficiency and quality of the algorithm. Experimental data show that the method effectively improves the local search ability and convergence speed of the genetic-algorithm-based clustering algorithm.

Keywords-dynamic granularity; genetic algorithm; K-medoids algorithm; harmony

I. INTRODUCTION

Clustering is a basic human cognitive activity and an important problem in machine learning. Clustering [1] groups data objects into two or more classes or clusters; the principle of the division is that objects in the same cluster have a high degree of similarity, while objects in different clusters differ greatly. The difference from classification is that the classes into which a clustering operation divides the data are not known in advance. Class formation is completely data-driven, which makes clustering a form of unsupervised learning. There are many typical cluster analysis methods, such as partitioning clustering, density-based clustering, hierarchical clustering, grid-based clustering, model-based clustering and fuzzy clustering. K-medoids clustering is one of the most typical clustering algorithms. It solves the sensitivity of K-means to isolated points, which greatly improves clustering accuracy. However, the method is strongly affected by the initial value, tends to converge to a local optimum and has difficulty reaching the global optimal solution.

Genetic algorithms are randomized search methods that evolved from the laws of biological evolution. Their main characteristics are that they operate directly on the structure of the object, with no requirement of continuity or differentiability of the objective function; they have inherent implicit parallelism and a good ability for global optimization; and they use probability-oriented optimization, so they can automatically acquire and guide the optimal search space and adaptively adjust the search direction without predetermined rules. The global optimization capability of the genetic algorithm can overcome the shortcoming that the K-medoids clustering algorithm is sensitive to the initial value, and genetic algorithms have begun to be applied to clustering.

This paper combines a dynamic granularity model with parallel search, constructing a new clustering algorithm from the point of view of information granularity. Experiments show that the algorithm not only eliminates the incompatibility between the clustering results in the feature space and the a priori knowledge, but also improves the accuracy of clustering. The algorithm is applied to cluster data and compared with other algorithms, verifying the effectiveness of the genetic clustering algorithm based on dynamic granularity.

II. CLUSTERING ALGORITHM BASED ON DYNAMIC GRANULARITY

A. Information granularity principle and the incompatibility of clustering

1) Information granularity theory

Reference [2] uses a triple (X, F, Г) to describe a problem, in which X is the domain of the problem, that is, the collection of basic elements to be considered; F is an attribute function, defined as F: X → Y, where Y represents the collection of properties of the basic elements; and Г indicates the structure of the domain, defined as the relationships among the basic elements.

Viewing the problem from a "coarser" point of view is in fact a simplification of X: elements of a similar nature are regarded as equivalent and treated as a whole, as a new element, so the problem is translated into a domain [X] of larger granularity. Thus the original problem (X, F, Г) is transformed into a problem ([X], [F], [Г]) on a new level.

Granularity and equivalence relations are very closely linked. In fact, the simplification described above is identical to the concept of the quotient set. Let R denote the set of all equivalence relations on X; the following order relation can then be defined on R, capturing what it means for one granularity to be "coarser" or "finer" than another.

Definition 1. Let R1, R2 ∈ R. If for any x, y ∈ X, x R1 y ⇒ x R2 y, then R1 is said to be smaller (finer) than R2, denoted R1 ≤ R2.

Theorem 1. R, under the relation "≤" defined above, forms a complete semi-order lattice.

According to this theorem we can obtain sequences of the form Rn ≤ Rn-1 ≤ … ≤ R1 ≤ R0.

Viewed intuitively, such a sequence of relations corresponds to an n-layer tree. Let T be an n-layer tree whose leaf nodes constitute the set X; then the nodes of each layer correspond to a partition of X.



The pedigree diagram (dendrogram) produced by a clustering operation is exactly such an n-layer tree, so there must exist a corresponding sequence of equivalence relations; this is the reason clustering and granularity are interlinked.
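To make the correspondence concrete, the following minimal Python sketch (an illustration, not part of the original paper) cuts a hierarchical-clustering tree at a sequence of decreasing thresholds; each cut yields a partition of the leaf set X, and the resulting partitions form exactly a chain of equivalence relations ordered by "≤". The toy data, the thresholds and the use of SciPy's hierarchical clustering routines are assumptions made for the example.

# Illustrative sketch: cutting a cluster pedigree tree at successive thresholds
# yields a chain of partitions, i.e. a sequence of equivalence relations.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))            # toy data set (hypothetical)
Z = linkage(X, method="single")         # build the pedigree (dendrogram) tree

# Decreasing distance thresholds: a large threshold gives a coarse partition,
# a small threshold gives a fine one, so the label vectors form a chain.
for t in [3.0, 1.0, 0.5, 0.1]:
    labels = fcluster(Z, t=t, criterion="distance")
    print(f"threshold {t}: {len(set(labels))} clusters -> {labels.tolist()}")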

2) The incompatibility between clustering results and a priori knowledge

The ideal a priori knowledge is the following: in the feature space, sample points that belong to different classes are separated by clear boundaries and have a small similarity measure, while sample points that belong to the same class are gathered together and have a large similarity measure. From the clustering point of view, the sample points that the a priori knowledge assigns to a certain class should, under the selected feature space and similarity measure, be gathered into one class. Unfortunately, this ideal state is rarely achieved. It frequently happens that points which experts believe should be classified into one class are far apart in the feature space, while sample points considered to belong to different classes are very close together. In other words, there is often some kind of incompatibility between the clustering results and the a priori knowledge.

Let R be an equivalence relation on the domain U. A cluster pedigree chart actually defines a sequence of equivalence relations whose granularity changes from coarse to fine. Selecting a threshold value amounts to selecting an R, from which the knowledge system U/R, i.e. the quotient set, is obtained. If a subset X of U can be expressed exactly by the existing knowledge, the clustering results and the a priori knowledge are coordinated; if the upper approximation and the lower approximation of X differ, the clustering results and the a priori knowledge are inconsistent.

B. Clustering algorithm based on dynamic granularity

The purpose of introducing dynamic granularity is to complete the clustering task more effectively. Granularity can be refined and coarsened as it changes. If the granularity is too fine, the description of the problem is too detailed: each sample stands on its own and no knowledge can be mined from the samples. If the granularity is too coarse, the description of the problem is too rough: some of the essential properties of the problem are blurred. Choosing the right granularity is therefore the key to clustering. To obtain a suitable granularity, the granularity is selected with the help of the following two composition operations on equivalence relations.

Definition 2. Let R1 and R2 be two equivalence relations on the domain U, and let R satisfy the following two conditions:

R1 < R and R2 < R;
for any R' such that R1 < R' and R2 < R', R < R'.

Then R is called the product of R1 and R2, recorded as R = R1 × R2.

Definition 3. Let R1 and R2 be two equivalence relations on the domain U, and let R satisfy the following two conditions:

R < R1 and R < R2;
for any R' such that R' < R1 and R' < R2, R' < R.

Then R is called the sum of R1 and R2, recorded as R = R1 + R2.

For a specific cluster analysis problem, the data set is first associated with an equivalence relation R0 (with corresponding granularity △0), from which a preliminary conclusion A0 is obtained. If the result meets the requirements, the clustering granularity is appropriate and the problem is solved. Otherwise, two situations are considered:

If the granularity is too fine compared with △0, take a coarser equivalence relation R0' and let R1 = R0 × R0'. Analyse on R1 to obtain conclusion A1 and granularity △1. If A1 is still too fine, take a coarser equivalence relation R1', let R2 = R1 × R1', and analyse on R2.

If the granularity is too coarse compared with △0, take a finer equivalence relation R0' and let R1 = R0 + R0'. Analyse on R1 to obtain conclusion A1 and granularity △1. If A1 is still too coarse, take a finer equivalence relation R1', let R2 = R1 + R1', and analyse on R2.

The above process is repeated until a suitable granularity is reached. A sequence of equivalence relations P = {Rn, Rn-1, …, R1} is then obtained, satisfying the partial order Rn ≤ Rn-1 ≤ … ≤ R1.
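The product and sum of Definitions 2 and 3 can be illustrated with a short Python sketch (an illustration, not from the paper), representing each equivalence relation on U = {0, …, n-1} by a label vector. Reading the definitions literally, the product is the finest common coarsening of the two relations and the sum is their common refinement; the example labellings are hypothetical.

# Illustrative sketch: composing equivalence relations given as label vectors.
def sum_relation(labels1, labels2):
    """Sum (Definition 3): common refinement; x ~ y iff x, y agree under both relations."""
    mapping, out = {}, []
    for pair in zip(labels1, labels2):
        if pair not in mapping:
            mapping[pair] = len(mapping)
        out.append(mapping[pair])
    return out

def product_relation(labels1, labels2):
    """Product (Definition 2): finest common coarsening, via connected components
    of the merged classes (union-find)."""
    n = len(labels1)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for labels in (labels1, labels2):
        first = {}
        for i, a in enumerate(labels):
            if a in first:
                union(i, first[a])
            else:
                first[a] = i

    roots = [find(i) for i in range(n)]
    renumber = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    return [renumber[r] for r in roots]

# One step of the dynamic-granularity loop (hypothetical labellings):
R0       = [0, 0, 1, 1, 2, 3]        # current partition, judged too fine
R0_prime = [0, 0, 0, 1, 1, 2]        # another relation chosen for the synthesis
print(product_relation(R0, R0_prime))  # R1 = R0 x R0'  (coarsen)
print(sum_relation(R0, R0_prime))      # R1 = R0 + R0'  (refine)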

Definition 4. Given a knowledge base K = (U, R), X ⊆ U, and a sequence of equivalence relations P = (Rn, Rn-1, …, R1) satisfying the partial order Rn ≤ Rn-1 ≤ … ≤ R1, the coordination degree between the clustering results and X is defined as

H(P, X) = |P_(X)| / |X|,

where P_(X) denotes the lower approximation of X with respect to P and |·| is the set cardinality. The coordination degree satisfies H(P, X) ∈ [0, 1]. When H(P, X) = 0, the clustering results and the a priori knowledge are completely uncoordinated; when H(P, X) = 1, they are completely coordinated, that is, the existing knowledge can describe the a priori knowledge exactly.
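The coordination degree can be illustrated with a short Python sketch (not from the paper). Here the lower approximation is computed for a single partition given as a label vector, which is one plausible reading of P_(X); the clustering result and the target set X are hypothetical.

# Illustrative sketch: coordination degree H(P, X) = |P_(X)| / |X|.
def lower_approximation(labels, X):
    """Union of the equivalence classes that are entirely contained in X."""
    X = set(X)
    classes = {}
    for i, lab in enumerate(labels):
        classes.setdefault(lab, set()).add(i)
    approx = set()
    for members in classes.values():
        if members <= X:
            approx |= members
    return approx

def coordination_degree(labels, X):
    """H(P, X) for a partition given as a label vector."""
    return len(lower_approximation(labels, X)) / len(X)

labels = [0, 0, 0, 1, 1, 2, 2, 2]      # clustering result (hypothetical)
X = {0, 1, 2, 3}                       # samples the prior knowledge puts in one class
print(coordination_degree(labels, X))  # 0.75: class {0,1,2} lies inside X, class {3,4} does not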

III. GENETIC CLUSTERING ALGORITHM BASED ON DYNAMIC GRANULARITY.

This paper introduces the genetic algorithm into the clustering algorithm based on dynamic granularity and proposes the genetic clustering algorithm based on dynamic granularity. The algorithm not only eliminates the incompatibility between the clustering results in the feature space and the a priori knowledge, but also improves the accuracy of clustering.

A. K-medoids algorithm

Suppose a database of N data objects forms a collection S = (S1, S2, …, SN). The algorithm flow is as follows:

a) Choose K objects from the N data objects as the initial cluster centers (m1, m2, …, mK);



b) According to the minimum-distance principle, assign each remaining object to the category represented by the nearest center point;

c) For each class i, consider the objects mr in turn, compute the value of FK (the sum of the distances of every object to its nearest cluster center) that would result from replacing mi with mr, and replace mi with the mr that gives the smallest FK;

d) Repeat b) and c) until the K center points no longer change.

The function FK is expressed as follows:

FK = Σ_{i=1}^{N} min_{1≤j≤K} d(Si, mj).

A sketch of this procedure is given below.
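The following minimal Python sketch follows the flow above; it is an illustration rather than the authors' implementation, and the Euclidean distance and the random toy data are assumptions.

# Illustrative K-medoids sketch: assign objects to the nearest medoid, then try
# swapping medoids with non-medoids and keep any swap that lowers F_K.
import math
import random

def total_cost(S, medoids):
    """F_K: sum over all objects of the distance to the nearest medoid."""
    return sum(min(math.dist(s, m) for m in medoids) for s in S)

def k_medoids(S, K, max_iter=100, seed=0):
    rng = random.Random(seed)
    medoids = rng.sample(S, K)                     # a) choose K initial centers
    for _ in range(max_iter):
        best_cost, best_medoids = total_cost(S, medoids), medoids
        for i in range(K):                         # c) try replacing m_i by every non-medoid
            for s in S:
                if s in medoids:
                    continue
                candidate = medoids[:i] + [s] + medoids[i + 1:]
                cost = total_cost(S, candidate)
                if cost < best_cost:
                    best_cost, best_medoids = cost, candidate
        if best_medoids == medoids:                # d) stop when the K centers are fixed
            break
        medoids = best_medoids
    # final assignment (step b) applied once the centers are fixed
    labels = [min(range(K), key=lambda j: math.dist(s, medoids[j])) for s in S]
    return medoids, labels, total_cost(S, medoids)

rng = random.Random(1)
S = [(rng.random(), rng.random()) for _ in range(20)]   # hypothetical data
medoids, labels, fk = k_medoids(S, K=3)
print(fk, labels)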

B. K-medoids algorithm based on genetic algorithm

Using floating-point encoding combined with the K-center operation and a shortest-distance gene-matching arithmetic crossover operator not only solves the problems that the K-medoids algorithm is sensitive to the initial value and easily falls into local optima, but also greatly improves the local search ability and convergence speed of the genetic clustering algorithm.

The basic flow of the K-medoids algorithm based on the genetic algorithm is as follows:

Step 0: Initialize the operation parameters;

Step 1: According to the coding rules, randomly generate the initial population;

K points are randomly selected from the data points as one solution of the problem and encoded as a chromosome. This operation is repeated until all popsize (initial population size) chromosomes have been initialized. The initial population is denoted pop[i].

Step 2: Perform the K-medoids operation for each individual and then calculate each individual's fitness value;

The genetic algorithm searches by means of the fitness values of the individuals in the population, so the choice of the fitness function has a direct impact on the convergence speed of the algorithm. The inverse of the total intra-cluster distance is used as the fitness function, that is,

fi = 1 / FK(i),  i = 1, …, popsize,

where FK(i) = Σ_{l=1}^{N} min_{1≤j≤K} d(Sl, mj) is the value of FK computed with the K medoids encoded by individual i. The overall fitness is F = Σ_{i=1}^{popsize} fi.

Step 3: Choose popsize individuals from the popsize parents to form a temporary population according to the roulette-wheel selection method;

The probability that individual i is selected is Pi = fi / F. According to these probabilities, individuals with small selection probability tend to be eliminated, and popsize individuals are selected to form the temporary population.

Step 4: In the temporary population, randomly select individuals for the crossover operation in accordance with the crossover rate;

Taking two individuals pop[i] and pop[i+1] as a basis, two new individuals temp[i] and temp[i+1] are generated, where pop[i] = {d[1], d[2], ..., d[K]}, i = 1, ..., popsize. The generation rule [2] is as follows: the k-th gene of the first individual is compared with every gene of the second individual, and the gene pop[i+1].d[j] with the smallest distance is chosen as temp[i].d[k]; temp[i+1] is generated in the same way. In this way popsize new individuals temp[1], ..., temp[popsize] are obtained, and then pop[i] and temp[i] are crossed to obtain the new group Newpop[i], i = 1, ..., popsize. The principle of the crossover operator is: Newpop[i].d[j] = a * pop[i].d[j] + (1 - a) * temp[i].d[j], where a is a random number between 0 and 1. A code sketch of these selection and crossover operators is given after Step 6.

Step 5: Apply the mutation operation, in accordance with the mutation rate, to the individuals of the temporary population produced by the crossover operation;

Step 6: Create the new generation of the population and judge whether the predetermined number of iterations has been reached; if so, end the optimization process, otherwise go to Step 2.
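To make Steps 2-4 concrete, the following Python sketch (an illustration under stated assumptions, not the authors' code) implements the fitness fi = 1/FK(i), roulette-wheel selection with Pi = fi/F, and the shortest-distance gene matching followed by the arithmetic crossover Newpop[i].d[j] = a * pop[i].d[j] + (1 - a) * temp[i].d[j]. Chromosomes are represented as lists of K medoid vectors; the Euclidean distance and the toy data are assumptions.

# Illustrative sketch of the GA operators used in Steps 2-4.
import math
import random

rng = random.Random(0)

def fk(chromosome, S):
    """F_K for one individual: total distance of each object to its nearest medoid."""
    return sum(min(math.dist(s, m) for m in chromosome) for s in S)

def roulette_select(population, S):
    """Step 3: f_i = 1 / F_K(i), P_i = f_i / F; sample popsize individuals."""
    fitness = [1.0 / fk(ind, S) for ind in population]
    F = sum(fitness)
    probs = [f / F for f in fitness]
    return rng.choices(population, weights=probs, k=len(population))

def match_genes(parent_a, parent_b):
    """Gene matching: for each medoid of parent_a, take the closest medoid of parent_b."""
    return [min(parent_b, key=lambda g: math.dist(gene, g)) for gene in parent_a]

def arithmetic_crossover(parent, matched):
    """Newpop.d[j] = a * pop.d[j] + (1 - a) * temp.d[j], with a random in (0, 1)."""
    a = rng.random()
    return [tuple(a * x + (1 - a) * y for x, y in zip(g1, g2))
            for g1, g2 in zip(parent, matched)]

# Hypothetical data set and a population of two chromosomes (K = 2 medoids each).
S = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.8)]
population = [[(0.0, 0.0), (5.0, 5.0)], [(4.9, 5.1), (0.1, 0.2)]]

selected = roulette_select(population, S)
temp = match_genes(selected[0], selected[1])
print(arithmetic_crossover(selected[0], temp))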

C. Genetic clustering algorithm based on dynamic granularity

The selection, crossover and mutation operators of the genetic clustering algorithm are combined with the dynamic granularity model. As the granularity changes, an appropriate granulation is obtained by coarsening and refining the granularity, and a new clustering algorithm is thus constructed.

The algorithm can be described in detail as follows:

1) Initialize the parameters and use the above K-medoids algorithm based on the genetic algorithm to perform the clustering operation;

2) Obtain a series of thresholds after the clustering operation is completed. Selecting a group of threshold values is equivalent to selecting an equivalence relation. By granular synthesis using Definition 2 and Definition 3, find an appropriate granulation and obtain a sequence of equivalence relations P = (Rn, Rn-1, …, R1) with Rn ≤ Rn-1 ≤ … ≤ R1;

3) Calculate the coordination degree Hi(P, X) between the clustering results and X;

4) If Hi(P, X) = 1, output the clustering results; otherwise let i = i + 1 and go to step 3). A sketch of this control flow is given below.
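The outer loop of steps 1)-4) can be organized as in the following Python sketch. The three helpers are trivial stand-ins, not the authors' implementation: in a real implementation they would be the GA-based K-medoids of Section III.B, the granulation step built from Definitions 2 and 3, and the coordination degree of Definition 4.

# Illustrative outer loop of the proposed algorithm (stand-in helpers only).
def ga_kmedoids(data, K):
    """Stand-in: run the GA-based K-medoids and return cluster labels."""
    return [i % K for i in range(len(data))]

def regranulate(labels, step):
    """Stand-in: coarsen or refine the partition (product/sum of relations)."""
    return labels

def coordination_degree(labels, X):
    """Stand-in: H(P, X) of Definition 4; 1.0 means fully coordinated."""
    return 1.0

def dynamic_granularity_clustering(data, K, X, max_steps=50):
    labels = ga_kmedoids(data, K)            # steps 1)-2): initial clustering and granulation
    for step in range(max_steps):            # steps 3)-4): adjust until coordinated
        if coordination_degree(labels, X) == 1.0:
            return labels                    # results agree with the prior knowledge
        labels = regranulate(labels, step)
    return labels

data = [(0.1, 0.2), (0.2, 0.1), (5.0, 5.1), (5.2, 4.9)]  # hypothetical data
X = {0, 1}                                               # hypothetical prior knowledge
print(dynamic_granularity_clustering(data, K=2, X=X))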

IV. EXPERIMENT AND RESULTS ANALYSIS

Fisher's IRIS plant sample collection is selected as the test data [5]. It consists of 150 sample points belonging to three different types of plants, and each sample point has four attributes. The relevant parameter settings are as follows: crossover probability Pc = 0.9, mutation probability Pm = 0.005, population size popsize = 30, and a maximum of 200 iterations. Clustering experiments on the IRIS data were run for several clustering algorithms; the results of the comparison are shown in TABLE 1.

TABLE 1. COMPARISON OF CLUSTERING ALGORITHMS

Algorithm                               Population size    Convergence time (s)    Correct rate (%)
Artificial Immune                       30                 7.38                    85
K-medoids based on genetic algorithm    30                 1.9                     90
This algorithm                          30                 1.5                     94
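For reference, the experimental setup above could be reproduced along the following lines; loading the IRIS data through scikit-learn is an assumption, since the paper does not specify the tooling.

# Illustrative experiment setup for the IRIS test (tooling is an assumption).
from sklearn.datasets import load_iris

iris = load_iris()
data, true_labels = iris.data, iris.target   # 150 samples, 4 attributes, 3 species

params = {
    "K": 3,                  # number of clusters (three plant types)
    "popsize": 30,           # population size
    "crossover_rate": 0.9,   # Pc
    "mutation_rate": 0.005,  # Pm
    "max_iterations": 200,
}
print(data.shape, params)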

V. CONCLUSION

The K-medoids algorithm based on the genetic algorithm is combined with the clustering algorithm based on dynamic granularity, which provides a new way of thinking about simulated-evolution clustering algorithms based on granularity. Information granularity theory is described, and the granularity adjustment method given for clustering makes it possible to choose an appropriate granulation faster and better. At the same time, in view of the parallel and random-search characteristics of the genetic algorithm, it is combined with dynamic granularity, and a new clustering algorithm is thus constructed. This method not only solves the problems that the K-medoids algorithm is sensitive to the initial value and easily falls into local optima, but also greatly improves the local search ability and convergence speed of the genetic clustering algorithm.

REFERENCES

[1] Xiao-Dong Kang. Data mining technology based on data warehouse [M]. Beijing: Mechanical Industry Press, 2004.
[2] Lin Lu flowers, Bo. An improved genetic clustering algorithm [J]. Computer Engineering and Applications, 2007, 43(21): 170-172.
[3] Hu locks, Zong-Hai Chen. Cluster analysis based on a hybrid genetic algorithm [J]. Pattern Recognition and Artificial Intelligence, 2001, 14(3): 353-354.
[4] Zhang Bo, Zhang Ling. Theory of Problem Solving and Its Application. Beijing: Tsinghua University Press, 1990 (in Chinese).
[5] Gao Jian. A parallel multi-population adaptive ant colony clustering algorithm [J]. Computer Engineering and Applications, 2003, 39(25): 78.
[6] Tang Xixi. A new hybrid clustering algorithm [J]. Journal of Guangxi University of Technology, 2006, 17(3): 77-81.
[7] Rahila H. Sheikh, M. M. Raghuwanshi, Anil N. Jaiswal. Genetic Algorithm Based Clustering: A Survey [J]. First International Conference on Emerging Trends in Engineering and Technology.
