Social Spider Algorithm Approach for Clustering (ceur-ws.org/Vol-1743/paper14.pdf)




Social Spider Algorithm Approach for Clustering

Harley Vera-Olivera*, Jose Luis Soncco-Alvarez*, Lauro Enciso-Rodas*

[email protected], [email protected], [email protected]

National University of San Antonio Abad del Cusco*

Abstract

Clustering is a popular data analysis technique to identify homogeneous groups of objects based on the values of their attributes, used in many disciplines and applications. This extended abstract of our undergraduate thesis for obtaining the engineering degree in informatics and systems presents an approach based on the Social Spider Optimization (SSO) algorithm for optimizing clusters of data, taking as metric the sum of Euclidean distances. Other important algorithms from the literature were implemented in order to make comparisons: the K-means algorithm and a Genetic Algorithm (GA) for clustering. Experiments were performed using 5 datasets taken from the UCI Machine Learning Repository; each algorithm was executed many times, and then the following measures were calculated: mean, median, minimum, and maximum values of the results. These experiments showed that the SSO algorithm outperforms the K-means algorithm and has results as competitive as the GA. All these results were confirmed by statistical tests performed over the outputs of the algorithms.

1 Introduction

Clustering is useful in several kinds of analysis, such as exploration of patterns, machine learning including data mining, document retrieval, image segmentation, and pattern classification. However, in many of these problems there is little prior information, such as statistical models. It is under these restrictions that clustering is particularly appropriate for the exploration of interrelationships between data in order to make a preliminary evaluation of their structure (Jain et al., 1999). Thus, new conditions imposed by Big Data have presented new challenges at different levels, including clustering.

This work was started when the first and second authors were undergraduate students at the National University of San Antonio Abad del Cusco and was finished when the authors were graduate students at the University of Brasilia, both receiving a CAPES scholarship.

The term clustering is used in several communities to describe methods for grouping unlabeled data (Jain et al., 1999). Clustering is the task of discovering groups and data structures that are in some way or another "similar", without using known structures (Vijayalakshmi and Renuka Devi, 2012). Intuitively, patterns within a group are more similar to each other than to patterns belonging to a different group. Here, the goal is to develop an automatic algorithm that can accurately classify an unlabeled dataset into groups.

Recent literature classifies clustering algorithms into hierarchical, partitioning, and overlapping (Xu and Wunsch, 2009). A partitional algorithm divides a dataset into a finite number of clusters based on a certain criterion known as a fitness measure. The fitness measure directly affects the natural formation of the groups; once a measure is selected, the partitioning task becomes an optimization problem.

The K-means algorithm, the most fundamental partitional clustering method, was published in 1957 by Lloyd (Lloyd, 1982). In this case, the minimization of the Euclidean distance between the elements of a cluster and its center was considered as the optimization criterion. Inspired by K-means, many algorithms were developed, such as Bisecting K-means (Steinbach et al., 2000), sort-means (Phillips, 2002), and X-means (Pelleg and Moore, 1999), among others.

Recent studies reveal a new trend, named stochastic algorithms with randomized and local search meta-heuristics. The random process generates arbitrary solutions that explore the search space and are responsible for reaching a global solution (Nanda and Panda, 2014). The first meta-heuristic inspired by nature was the genetic algorithm developed by Holland and his colleagues in 1975 (Holland, 1975); this algorithm is classified as an evolutionary algorithm. On the other hand, new bio-inspired optimization algorithms are being introduced, such as the algorithm inspired by the social behavior of spiders (Cuevas et al., 2013), a swarm intelligence algorithm proposed in 2013, which had not been applied to the clustering problem until our proposal.

This extended abstract of our undergraduate thesis for obtaining the engineering degree in informatics and systems (Vera-Olivera and Soncco-Alvarez, 2016) presents an approach based on the SSO algorithm for the clustering problem. The contribution of this work is to show that the SSO algorithm can produce competitive results with respect to classic approaches such as: (a) the K-means algorithm, implemented as presented in (Maulik and Bandyopadhyay, 2000); and (b) a genetic algorithm approach for the clustering problem, proposed by Maulik and Bandyopadhyay (2000). The metric used for the comparisons is the sum of the Euclidean distances from the elements of the clusters to their respective centers; this metric is the output of the algorithms. For the experiments, 5 datasets from the UCI Machine Learning Repository were used; for each of these datasets the algorithms were executed several times, and then the following measures were calculated: mean, median, minimum, and maximum values. These experiments showed that the SSO algorithm has better results than the K-means algorithm, and equally competitive results as the GA. Additionally, since we are working with stochastic algorithms, a statistical analysis was performed using the Kolmogorov-Smirnov test and the Wilcoxon rank sum test, as discussed in (Demsar, 2006), (Durillo et al., 2009), and (Munoz et al., 2011). The results of these statistical tests confirmed the results of the experiments.

This paper is organized as follows: Section 2 gives some definitions related to the clustering problem; Section 3 presents the original proposal of the SSO algorithm; Section 4 presents the details of our approach based on the SSO algorithm for the clustering problem, together with the pseudo-code of the algorithms; Section 5 presents the experiments and results, a discussion of these results, and a statistical analysis; finally, Section 6 presents the conclusions and future work.

2 The Clustering Problem

According to Mirkin (Mirkin, 2012), clustering is a discipline dedicated to revealing and describing the structures of groups in a dataset, and four important concepts are involved: data, group structures, revealing a group structure, and describing a group structure. The following definitions were taken from (Maulik and Bandyopadhyay, 2000; De Falco et al., 2007; Karaboga and Ozturk, 2011; Senthilnath et al., 2011).

Suppose S = {x_1, x_2, ..., x_n} is a set of n N-dimensional points and C = {c_1, c_2, ..., c_k} is a set of k N-dimensional elements. The clustering problem in an N-dimensional space R^N consists in partitioning the set S into a number k of clusters based on a similarity metric, where each cluster has as center an element c_i from C.

Suppose that G_i, i = 1, ..., k, represents a cluster; then the following properties hold:

• G_i ≠ ∅, for i = 1, ..., k;

• G_i ∩ G_j = ∅, for i, j = 1, ..., k, such that i ≠ j;

• ∪_{i=1}^{k} G_i = S.

The clustering metric adopted in this work is the sum of the Euclidean distances of the points of a group to their respective center. The definition of this clustering metric M for k clusters G_1, G_2, ..., G_k is given by the following expression:

M(G_1, G_2, ..., G_k) = Σ_{i=1}^{k} Σ_{x_j ∈ G_i} ||x_j − c_i||
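The metric M above can be sketched directly in code. The following is a minimal illustration (the function name `clustering_metric` is ours, not from the paper); it assumes the points have already been assigned to their clusters:

```python
import math

def clustering_metric(clusters, centers):
    """Sum of Euclidean distances from every point to the center of its cluster.

    clusters: list of k lists of points (each point a tuple of floats)
    centers:  list of k center points, centers[i] is the center of clusters[i]
    """
    return sum(
        math.dist(x, c)              # Euclidean distance ||x - c||
        for pts, c in zip(clusters, centers)
        for x in pts
    )

# Tiny 2-D example: two clusters, each point 1 or 0 units from its center
clusters = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0)]]
centers = [(0.0, 1.0), (10.0, 0.0)]
print(clustering_metric(clusters, centers))  # 1 + 1 + 0 = 2.0
```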

3 Algorithm Based on the Social Behavior of Spiders

Cuevas et al. (2013) proposed a new optimization algorithm, called Social Spider Optimization (SSO), whose development was guided by the operational principle of the social behavior of spiders. The SSO algorithm assumes that the solution space is a community network (spider web), where spiders interact with each other. The main features of this approach are:

• Each solution within the space of solutions represents the position of a spider in the community network.

• Each spider receives a weight according to the fitness value of the solution it represents.

• The algorithm models two types of search agents (spiders): male and female. Depending on its gender, each individual performs different types of operations that simulate its social behavior within the colony.

An important feature of colonies of social spiders is that they have a high number of female agents. This fact is simulated by defining the number of females N_f randomly within the range of 65 to 90% of N, which is the number of elements of the total population. The number of males N_m is calculated as the complement of N_f with respect to N.

The total population S is divided into two subgroups F and M. The group F is the set of female spiders, and the group M is the set of male spiders:

F = {f_1, f_2, ..., f_{N_f}}

M = {m_1, m_2, ..., m_{N_m}}

where S = F ∪ M = {s_1, s_2, ..., s_N}.

3.1 Calculation of Fitness

Each individual (spider) i of the population S receives a weight w_i that represents the quality of its solution. This weight can be calculated as follows:

w_i = (J(s_i) − worst_s) / (best_s − worst_s)

where J(s_i) is the fitness value calculated by evaluating the position of a spider s_i with respect to the function J. Considering a maximization problem, the values best_s and worst_s are defined as follows:

best_s = max(J(s_k)), k ∈ {1, 2, ..., N}

worst_s = min(J(s_k)), k ∈ {1, 2, ..., N}

3.2 Modeling of Vibrations Through the Community Network

The community network is used as a mechanism for transmitting information between the members of the colony. This information is coded as small vibrations that are critical for the collective coordination of all individuals. The vibrations depend on the weight and the distance of the spider that generated them. The vibrations perceived by an individual i as a result of information transmitted by an individual j are modeled by the following expression:

Vib_{i,j} = w_j · e^{−d_{i,j}^2}

where d_{i,j} is the Euclidean distance between spiders i and j. Three special types of vibrations are considered in the SSO algorithm:

• Vib_{i,c} vibrations, where c is the member closest to i that has a higher weight than i (w_c > w_i).

• Vib_{i,b} vibrations, where b is the individual with the best weight (best fitness value) of the whole population S.

• Vib_{i,f} vibrations, where f is the female individual closest to i.
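The weight normalization of Section 3.1 and the vibration model above translate almost directly into code. A minimal sketch follows (function names are ours; the guard against a zero span, when all fitness values are equal, is our addition):

```python
import math

def weights(fitness):
    """w_i = (J(s_i) - worst_s) / (best_s - worst_s), maximization form."""
    best, worst = max(fitness), min(fitness)
    span = (best - worst) or 1.0  # avoid division by zero when all values are equal
    return [(f - worst) / span for f in fitness]

def vibration(w_j, d_ij):
    """Vib_{i,j} = w_j * exp(-d_{i,j}^2): stronger for heavier, closer spiders."""
    return w_j * math.exp(-d_ij ** 2)

w = weights([3.0, 1.0, 2.0])
print(w)                     # [1.0, 0.0, 0.5]
print(vibration(w[0], 0.0))  # 1.0: the best spider perceived at distance zero
```

Note that the vibration decays very quickly with distance, which is why only the three special vibrations (Vib_{i,c}, Vib_{i,b}, Vib_{i,f}) matter in practice.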

3.3 Initialization of the Population

The SSO algorithm starts by initializing the set S, which contains N spider positions. Each position f_i or m_i is an n-dimensional vector containing the values to be optimized. These values are distributed uniformly between the bounds p_low and p_high, which are previously specified.

3.4 Cooperative Operators

3.4.1 Cooperative Operator for Female Spiders

To emulate the cooperative behavior of the female spiders, a new operator is defined. The operator considers the change in position of a female spider i at each iteration; this change can be attractive or repulsive and is calculated by combining three elements:

• The first element considers the change with respect to the nearest member to i that has a higher weight and produces vibration Vib_{i,c};

• The second element considers the change with respect to the best individual of the population S, which produces vibration Vib_{i,b};

• The third element incorporates a random movement.

These three elements can be considered as one movement; we use the "+" symbol for attraction and the "−" symbol for repulsion. The change in position can be calculated as follows:

f_i^{k+1} = f_i^k ± movement

where k represents the iteration number.
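The paper only specifies the update as f^{k+1} = f^k ± movement; one plausible instantiation of the combined movement is sketched below. The coefficients `alpha`, `beta`, `delta` and the exact way the three elements are mixed are our illustrative assumptions, not the formula from (Cuevas et al., 2013):

```python
import random

def female_move(f, s_c, s_b, vib_c, vib_b, attract,
                alpha=0.6, beta=0.6, delta=0.1, rng=random):
    """One possible sketch of f^{k+1} = f^k +/- movement.

    Combines attraction toward the nearest heavier spider s_c (scaled by
    Vib_{i,c}), toward the best spider s_b (scaled by Vib_{i,b}), and a
    small random perturbation; `attract` selects the +/- sign.
    alpha, beta, delta are illustrative constants (assumptions).
    """
    sign = 1.0 if attract else -1.0
    return [
        fi + sign * (alpha * vib_c * (ci - fi) + beta * vib_b * (bi - fi))
        + delta * (rng.random() - 0.5)      # random component
        for fi, ci, bi in zip(f, s_c, s_b)
    ]

random.seed(0)
print(female_move([0.0, 0.0], [1.0, 1.0], [2.0, 2.0], 0.5, 0.25, attract=True))
```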

3.4.2 Cooperative Operator for Male Spiders

To emulate the cooperative behavior of the male spiders, they are divided into two groups: dominant D and non-dominant ND. This division is made according to their position with respect to the median of all male individuals: individuals whose weight is above the median are considered dominant; otherwise they are considered non-dominant.

For dominant males, two movements are defined: (a) a movement of attraction to the nearest female f, which produces a vibration Vib_{i,f}, and (b) a random movement. These two movements can be considered as one, and then the change in a dominant male can be calculated as follows:

m_i^{k+1} = m_i^k + D_movement

where k represents the iteration number.

For non-dominant males, just one movement is defined: an attraction to the weighted average of the male spiders. Then the change in a non-dominant male can be calculated as follows:

m_i^{k+1} = m_i^k + ND_movement

where k represents the iteration number.

3.5 Mating Operator

Mating in a colony of spiders takes place between females and dominant males. When a dominant male m_g finds a set of female spiders E^g within a mating range r, it mates, forming a new offspring s_new. This new offspring is generated from the set T^g, which is formed by the union of E^g and m_g. When the set T^g is empty, the mating operation is canceled.

The weight of each spider involved in the mating process, i.e. the spiders from the set T^g, defines a probability of influence on the new offspring. The probability of influence P_{s_i} is assigned using roulette-wheel selection, which is defined as follows:

P_{s_i} = w_i / Σ_{j ∈ T^g} w_j

where s_i ∈ T^g.

A spider is a solution within the solution space, so a new spider is formed by choosing a value for each variable, each chosen among the values defined by the roulette method. For example, let s_new = {v_1, v_2, ..., v_n} be the new spider; each variable v_i is determined using roulette-wheel selection.
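Roulette-wheel selection with P_{s_i} = w_i / Σ w_j can be sketched as follows (function and variable names are ours); the usage example shows how each component of an offspring is inherited from a parent drawn by the roulette:

```python
import random

def roulette_select(weights, rng=random):
    """Pick an index with probability proportional to its weight."""
    total = sum(weights)
    r = rng.uniform(0.0, total)
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if r <= acc:
            return i
    return len(weights) - 1  # guard against floating-point round-off

# Build a new spider: each of the k = 2 cluster centers is inherited
# from a parent in the mating set, chosen by roulette on the weights.
mating_set = [[(0.0, 0.0), (9.0, 9.0)],   # parent spider 1 (k = 2 centers)
              [(1.0, 1.0), (8.0, 8.0)]]   # parent spider 2
w = [0.7, 0.3]
random.seed(1)
offspring = [mating_set[roulette_select(w)][j] for j in range(2)]
print(offspring)
```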

Once a new spider s_new has been formed, it is compared with the worst spider s_worst of the colony according to their weights, where w_worst = min_{l ∈ {1,2,...,N}}(w_l). If the new spider s_new is better than the worst spider s_worst, then s_worst is replaced by s_new. Otherwise, the new spider is discarded and the colony does not suffer alterations. If a replacement occurs, the new spider takes the gender and index of the replaced spider.

4 Optimization Algorithm Based on the Social Behaviour of Spiders for Clustering Problems

As our proposal, we present an SSO (Cuevas et al., 2013) approach to solve the clustering problem. This optimization algorithm based on the social behaviour of spiders is a general-purpose meta-heuristic, so it is necessary to modify several elements of the algorithm, such as the representation of the individuals, the calculation of the fitness function, etc. Below we present the elements of the original algorithm proposed in (Cuevas et al., 2013) that had to be modified.

4.1 Representation of Spiders (Individuals)

The first consideration to take into account is the representation of each spider. Each spider (male or female) represents a set of k cluster centers, which is a feasible solution to the clustering problem.

For instance, let x = {(10.5; 20.4), (15.2; 25.0)} be a spider that contains k = 2 cluster centers, namely (10.5; 20.4) and (15.2; 25.0); in this particular case each center has dimension n = 2.


Each spider of the initial population is generated by taking k random points of a given dataset, where k is the number of clusters to be found.
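This initialization, sampling k distinct points of the dataset as the initial centers of a spider, can be sketched in a few lines (the helper name `init_spider` is ours):

```python
import random

def init_spider(dataset, k, rng=random):
    """A spider is a set of k distinct cluster centers sampled from the dataset itself."""
    return rng.sample(dataset, k)

data = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0), (7.0, 8.0)]
random.seed(42)
spider = init_spider(data, 2)
print(spider)  # two distinct points drawn from `data`
```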

4.2 Distance between Two Spiders

It is necessary to define the distance between two spiders, since a spider is formed by a set of cluster centers and not by a single point. We define the distance between two spiders as the sum of the Euclidean distances between their cluster centers.

For instance, let a = {(a_{x1}; a_{y1}), (a_{x2}; a_{y2})} and b = {(b_{x1}; b_{y1}), (b_{x2}; b_{y2})} be two spiders that have k = 2 cluster centers, each center having dimension 2. Then the distance between these two spiders is:

d_{a,b} = d((a_{x1}; a_{y1}), (b_{x1}; b_{y1})) + d((a_{x2}; a_{y2}), (b_{x2}; b_{y2}))

where d((a_{x1}; a_{y1}), (b_{x1}; b_{y1})) is the Euclidean distance between the centers (a_{x1}; a_{y1}) and (b_{x1}; b_{y1}).
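The spider distance above is a one-liner; a minimal sketch (function name ours) pairing the centers positionally, as in the definition:

```python
import math

def spider_distance(a, b):
    """Distance between two spiders: sum of Euclidean distances
    between their corresponding cluster centers."""
    return sum(math.dist(ca, cb) for ca, cb in zip(a, b))

a = [(0.0, 0.0), (5.0, 0.0)]
b = [(3.0, 4.0), (5.0, 0.0)]
print(spider_distance(a, b))  # 5.0 + 0.0 = 5.0
```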

4.3 Fitness and Weight of a Spider

The fitness of each spider, which indicates how good the solution represented by that spider is, is calculated using the metric M. The aim of the SSO algorithm is to minimize the fitness of the population; thus, the spider with the minimum fitness is the best within the population.

The pseudocode for calculating the fitness of a spider is presented in Algorithm 1. The weight of a spider i was re-defined, because fitness and weight are negatively correlated, and it is calculated in the following way:

w_i = (worst_s − J(s_i)) / (worst_s − best_s)

best_s = min(J(s_k)), k ∈ {1, 2, ..., N}

worst_s = max(J(s_k)), k ∈ {1, 2, ..., N}

where J(s_i) is the fitness value of the spider i, calculated using Algorithm 1.

4.4 Mating of Spiders

In the mating stage, a mating set T is defined, formed by a dominant male spider and the female spiders that are within its mating range. From this set T new spiders are created; a new spider represents a set of cluster centers, where each cluster center is inherited from a spider within the set T. The spider from which the new spider inherits a cluster center is chosen using roulette-wheel selection.

Algorithm 1: Algorithm for calculating the fitness of a spider
Input: An array of cluster centers C (spider C); a set D of m n-dimensional points; an integer k > 0 that represents the number of clusters
Output: Metric M of spider C
1 Create the set of empty clusters G = {G_1, G_2, ..., G_k}
2 foreach point x of the set D do
3   Assign the point x to the cluster G_i whose center C_i is the nearest to x;
4 foreach cluster G_i do
5   Calculate a new center C*_i;
6 Calculate the metric M for the set of clusters G as defined in Section 2;
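The fitness calculation of Algorithm 1 can be sketched as follows (a minimal Python rendering; the centroid update and the keep-the-old-center rule for an empty cluster are our reading of steps 4-5, since the paper does not spell out the empty-cluster case):

```python
import math

def spider_fitness(centers, data):
    """Algorithm 1: assign each point to its nearest center, recompute each
    center as its cluster mean, then return the metric M (sum of distances)."""
    k = len(centers)
    clusters = [[] for _ in range(k)]
    for x in data:                                    # steps 2-3: nearest-center assignment
        i = min(range(k), key=lambda j: math.dist(x, centers[j]))
        clusters[i].append(x)
    new_centers = []
    for c, pts in zip(centers, clusters):             # steps 4-5: update each cluster center
        if pts:
            new_centers.append(tuple(sum(v) / len(pts) for v in zip(*pts)))
        else:
            new_centers.append(c)                     # assumption: keep center of an empty cluster
    return sum(math.dist(x, new_centers[i])           # step 6: metric M
               for i, pts in enumerate(clusters) for x in pts)

data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
print(spider_fitness([(0.0, 0.0), (10.0, 0.0)], data))  # centers move to (0,1), (10,1): M = 4.0
```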

4.5 Substitution of Spiders

In order to decide which spiders will be replaced by the new spiders produced in the mating stage, the roulette-wheel selection method is also used, where spiders with lower weight (greater fitness) have a higher probability of being replaced. It is important to note that the weight of a spider is negatively correlated with its fitness value, since we are working with a minimization problem and not with a maximization problem as originally proposed by (Cuevas et al., 2013).

The pseudocode of our proposal is shown in Algorithm 2.

5 Experiments and Results

To compare the algorithms, five datasets were taken from the UCI Machine Learning Repository: Balance, Cancer-Int, Dermatology, Diabetes, and Iris.

The Balance dataset was generated to model psychological experiments; each example is classified as having the balance scale tip to the right, tip to the left, or be balanced. The attributes are the left weight, the left distance, the right weight, and the right distance. The correct way to find the class is to compare (left-distance * left-weight) and (right-distance * right-weight): the greater one determines the side, and if they are equal, it is balanced.


Algorithm 2: Social Spider Optimization algorithm for the clustering problem
Input: A dataset D of m n-dimensional points; an integer k > 0 that represents the number of clusters
Output: Metric M of the clusters found
1 foreach spider C of population P do
2   Choose randomly k points from dataset D and create the array C (spider C) of cluster centers;
3 Calculate fitness of population P;
4 Calculate weight of population P;
5 for i = 2 to numberGenerations do
6   Cooperative operator for female spiders;
7   Cooperative operator for male spiders;
8   Mating operator;
9   Replacement of spiders in P;
10  Calculate fitness of population P;
11  Calculate weight of population P;
12 Return fitness (metric M) of the best solution found;

The Cancer-Int dataset is one of three domains provided by the Oncology Institute that have repeatedly appeared in the machine learning literature. This dataset includes 201 instances of one class and 85 instances of another class.

The Dermatology dataset contains diagnoses of erythemato-squamous diseases.

The Diabetes patient records were obtained from two sources: an automatic electronic recording device and paper records. The automatic device had an internal clock to timestamp events, whereas the paper records only provided "logical time" slots (breakfast, lunch, dinner, bedtime).

Finally, Iris contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.

More features of the datasets are shown in Table 1.

For the experiments, the number of generations was fixed at 100 for the three algorithms (K-means, GA, SSO). For the GA and SSO algorithms, the population size was fixed at 100.

The experiments were performed as follows: for each dataset, the three algorithms (K-means,

Table 1: Properties of datasets

Dataset      Size  Attributes  Classes
Balance      625   4           3
Cancer-Int   699   9           2
Dermatology  366   34          6
Diabetes     768   8           2
Iris         150   4           3

GA, SSO) were executed 50 times. Each execution of an algorithm returns the metric M of the best solution found. Then the following measures were calculated: mean, median, minimum, and maximum values of the results.

The results of the experiments for each dataset are shown in Tables 2, 3, 4, 5, and 6, where the best results are highlighted in bold.

Table 2: Results of the experiments for the dataset Balance

          K-means   Genetic Alg.  SSO Alg.
Mean      1426,544  1423,860      1423,851
Median    1425,804  1423,851      1423,851
Minimum   1423,851  1423,851      1423,851
Maximum   1442,669  1424,071      1423,851

Table 3: Results of the experiments for the dataset Cancer-Int

          K-means   Genetic Alg.  SSO Alg.
Mean      2824,135  2820,319      2820,302
Median    2824,136  2820,302      2820,302
Minimum   2824,136  2820,302      2820,302
Maximum   2824,136  2821,138      2820,302

5.1 Discussion

In the experiments for the Balance dataset (see Table 2), the SSO algorithm has the best results with respect to all measures. Furthermore, with respect to the median and minimum values, the SSO algorithm has the same values as the GA.

From the results for the Cancer-Int dataset, shown in Table 3, we can see that the SSO algorithm has the best results with respect to all measures. Furthermore, with respect to the median and minimum values, the SSO algorithm has the same results as the GA.

In the case of the Dermatology dataset, shown in Table 4, the GA has the best results with respect to all measures. Furthermore, with respect to the minimum value, the SSO algorithm has the same result as the GA, similarly to the two previous experiments.


Table 4: Results of the experiments for the dataset Dermatology

          K-means   Genetic Alg.  SSO Alg.
Mean      1127,390  1092,353      1092,355
Median    1121,087  1092,341      1092,356
Minimum   1092,644  1092,341      1092,341
Maximum   1415,274  1092,373      1092,373

Table 5: Results of the experiments for the dataset Diabetes

          K-means    Genetic Alg.  SSO Alg.
Mean      52072,244  49160,016     49159,956
Median    52072,243  49160,214     49159,939
Minimum   52072,244  49157,441     49157,441
Maximum   52072,244  49161,999     49165,111

Table 5 shows the results for the Diabetes dataset: the SSO algorithm has the best results with respect to the mean and median, and, with respect to the minimum value, the SSO algorithm has the same result as the GA.

Finally, in the results for the Iris dataset, shown in Table 6, the SSO algorithm also has the best results with respect to all measures. Furthermore, with respect to the median and minimum values, the SSO algorithm has the same results as the GA.

5.2 Statistical Analysis

An additional statistical analysis was performed to compare the algorithms, since we are working with stochastic algorithms.

The following methodology was used: first, the Kolmogorov-Smirnov test was applied to determine whether the results (of 50 executions) of each algorithm follow a normal distribution. After determining that they do not, the non-parametric Wilcoxon rank sum test was applied to compare the medians of two algorithms. This methodology was discussed and applied in other works (Demsar, 2006), (Durillo et al., 2009), (Munoz et al., 2011).
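The rank sum comparison can be sketched with the standard library alone; the usual tool is `scipy.stats.ranksums`, so the following is only a minimal stdlib approximation (two-sided, normal approximation, average ranks for ties), reasonable for samples of 50 executions as used here:

```python
from statistics import NormalDist

def rank_sum_test(xs, ys):
    """Two-sided Wilcoxon rank-sum test via the normal approximation.
    Returns the approximate p-value; tied values receive average ranks."""
    pooled = sorted((v, src) for src in (0, 1) for v in (xs if src == 0 else ys))
    ranks = {}
    i = 0
    while i < len(pooled):                 # assign average ranks to runs of ties
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg = (i + 1 + j) / 2              # mean of ranks i+1 .. j
        for t in range(i, j):
            ranks[t] = avg
        i = j
    w = sum(ranks[t] for t, (_, src) in enumerate(pooled) if src == 0)
    n, m = len(xs), len(ys)
    mu = n * (n + m + 1) / 2               # mean of the rank sum under H0
    sigma = (n * m * (n + m + 1) / 12) ** 0.5
    z = (w - mu) / sigma
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Clearly separated samples reject H0 at the 5% level; identical ones do not
print(rank_sum_test([1, 2, 3, 4, 5], [11, 12, 13, 14, 15]) < 0.05)  # True
print(rank_sum_test([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]) < 0.05)       # False
```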

The Wilcoxon rank sum test is used to test the null hypothesis (H_0) that the samples (of 50 executions) of two algorithms come from distributions with the same median. If the null hypothesis is rejected, the alternative hypothesis (H_A) is assumed: that the samples come from distributions with different medians.

A significance level of 5% (p-value less than or equal to 0.05) was used for the Wilcoxon rank sum test. If the test is successful, then the null hypothesis is rejected and the alternative hypothesis is assumed; this result is shown using the 's+' symbol. Otherwise, the p-value is greater than 0.05

Table 6: Results of the experiments for the dataset Iris

          K-means  Genetic Alg.  SSO Alg.
Mean      102.495  97.223        97.222
Median    97.326   97.222        97.222
Minimum   97.326   97.222        97.222
Maximum   124.182  97.232        97.222

and the null hypothesis is assumed; this result is shown using the symbol 's-'.

Table 7 shows the results of the Wilcoxon statistical test between the SSO and K-means clustering algorithms. In this table we can see that there is a statistical difference between the SSO and K-means algorithms in all cases. So we can conclude that the SSO algorithm presents significantly better results than the K-means algorithm.

Table 7: Results of the Wilcoxon rank sum test between the SSO algorithm and the K-means algorithm

Dataset      SSO Alg. (median)  K-means (median)
Balance      1423,851           1425,804          s+
Cancer-Int   2820,302           2824,136          s+
Dermatology  1092,356           1121,086          s+
Diabetes     49159,939          52072,244         s+
Iris         97,222             97,326            s+

Table 8 shows the results of the Wilcoxon statistical test between the SSO algorithm and the GA. In this table we can see that for most cases the SSO and GA algorithms have similar results. Only in the case of the Iris dataset is there statistical significance, but we cannot conclude that one algorithm is better than the other, since they have the same median; we can only say that the algorithms have different behavior.

Table 8: Results of the Wilcoxon rank sum test between the SSO algorithm and the genetic algorithm

Dataset      SSO Alg. (median)  GA (median)
Balance      1423,851           1423,851      s-
Cancer-Int   2820,302           2820,302      s-
Dermatology  1092,356           1092,341      s-
Diabetes     49159,939          49160,214     s-
Iris         97,222             97,222        s+

6 Conclusions and Future Work

In this work, an SSO approach for the clustering problem has been proposed. To evaluate its performance, experiments were performed over 5 datasets from the UCI repository (Balance, Cancer-Int, Dermatology, Diabetes, and Iris), and comparisons were performed with two classic approaches for the clustering problem: the K-means algorithm and a genetic algorithm for clustering.

The experiments showed that the SSO algorithm obtains better results than the K-means algorithm, and results equally competitive with the genetic algorithm. All these results were validated statistically using the non-parametric Wilcoxon rank sum test. Thus, the main contribution of this work was to show that the SSO algorithm can produce competitive results when compared with classic algorithms.

As future work, we will include comparisons with newer algorithms from the literature. It would also be interesting to include bigger datasets in the experiments. Finally, additional experiments will be performed using other metrics, such as the Classification Error Percentage used in other works: (De Falco et al., 2007), (Karaboga and Ozturk, 2011), and (Senthilnath et al., 2011).

References

Erik Cuevas, Miguel Cienfuegos, Daniel Zaldívar, and Marco Pérez-Cisneros. 2013. A swarm optimization algorithm inspired in the behavior of the social-spider. Expert Systems with Applications, 40(16):6374–6384.

Ivanoe De Falco, Antonio Della Cioppa, and Ernesto Tarantino. 2007. Facing classification problems with particle swarm optimization. Applied Soft Computing, 7(3):652–658.

Janez Demšar. 2006. Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research, 7:1–30.

Juan J. Durillo, José García-Nieto, Antonio J. Nebro, Carlos A. Coello Coello, Francisco Luna, and Enrique Alba. 2009. Multi-objective particle swarm optimizers: An experimental comparison. In Evolutionary Multi-Criterion Optimization, pages 495–509. Springer.

John H. Holland. 1975. Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence. University of Michigan Press.

Anil K. Jain, M. Narasimha Murty, and Patrick J. Flynn. 1999. Data clustering: a review. ACM Computing Surveys (CSUR), 31(3):264–323.

Dervis Karaboga and Celal Ozturk. 2011. A novel clustering approach: Artificial bee colony (ABC) algorithm. Applied Soft Computing, 11(1):652–657.

S. Lloyd. 1982. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, March.

Ujjwal Maulik and Sanghamitra Bandyopadhyay. 2000. Genetic algorithm-based clustering technique. Pattern Recognition, 33(9):1455–1465.

Boris Mirkin. 2012. Clustering: A Data Recovery Approach. CRC Press.

Daniel M. Muñoz, Carlos H. Llanos, L. D. S. Coelho, and Mauricio Ayala-Rincón. 2011. Opposition-based shuffled PSO with passive congregation applied to FM matching synthesis. In Evolutionary Computation (CEC), 2011 IEEE Congress on, pages 2775–2781. IEEE.

Satyasai Jagannath Nanda and Ganapati Panda. 2014. A survey on nature inspired metaheuristic algorithms for partitional clustering. Swarm and Evolutionary Computation, 16:1–18.

Dan Pelleg and Andrew Moore. 1999. Accelerating exact k-means algorithms with geometric reasoning. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 277–281. ACM.

Steven J. Phillips. 2002. Acceleration of k-means and related clustering algorithms. In Algorithm Engineering and Experiments, pages 166–177. Springer.

J. Senthilnath, S. N. Omkar, and V. Mani. 2011. Clustering using firefly algorithm: Performance study. Swarm and Evolutionary Computation, 1(3):164–171.

Michael Steinbach, George Karypis, Vipin Kumar, et al. 2000. A comparison of document clustering techniques. In KDD Workshop on Text Mining, volume 400, pages 525–526. Boston.

Harley Vera-Olivera and Jose Luis Soncco-Alvarez. 2016. Algoritmo de optimización basado en el comportamiento social de arañas para clustering. Universidad Nacional de San Antonio Abad del Cusco. Undergraduate thesis for obtaining the engineer degree in informatics and systems.

M. Vijayalakshmi and M. Renuka Devi. 2012. A survey of different issues of different clustering algorithms used in large data sets. In International Journal of Advanced Research in Computer Science and Software Engineering, pages 305–307.

R. Xu and D. C. Wunsch. 2009. Clustering. Wiley.
