
Research Article
Fuzzy c-Means and Cluster Ensemble with Random Projection for Big Data Clustering

Mao Ye,1,2 Wenfen Liu,1,2 Jianghong Wei,1 and Xuexian Hu1

1 State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou 450002, China
2 State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China

Correspondence should be addressed to Mao Ye; yemao119@gmail.com

Received 19 April 2016; Revised 17 June 2016; Accepted 19 June 2016

Academic Editor: Stefan Balint

Copyright © 2016 Mao Ye et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Because of its positive effects on dealing with the curse of dimensionality in big data, random projection for dimensionality reduction has become a popular method recently. In this paper, a theoretical analysis of the influence of random projection on the variability of the data set and the dependence between dimensions is presented. Together with the theoretical analysis, a new fuzzy c-means (FCM) clustering algorithm with random projection is presented. Empirical results verify that the new algorithm not only preserves the accuracy of the original FCM clustering but also is more efficient than the original clustering and clustering with singular value decomposition. At the same time, a new cluster ensemble approach based on FCM clustering with random projection is also proposed. The new aggregation method can efficiently compute the spectral embedding of the data with a cluster-centers-based representation, which scales linearly with the data size. Experimental results reveal the efficiency, effectiveness, and robustness of our algorithm compared to the state-of-the-art methods.

1. Introduction

With the rapid development of mobile Internet, cloud computing, the Internet of things, social network services, and other emerging services, data is growing at an explosive rate. How to achieve fast and effective analyses of data and then maximize the value of the data has become the focus of attention. The "four Vs" model [1] of big data, variety, volume, velocity, and value, has made traditional methods of data analysis inapplicable. Therefore, new techniques for big data analysis, such as distributed or parallelized processing [2, 3], feature extraction [4, 5], and sampling [6], have received wide attention.

Clustering is an essential method of data analysis, through which the original data set can be partitioned into several data subsets according to the similarities of data points. It has become an underlying tool for outlier detection [7], biology [8], indexing [9], and so on. In the context of fuzzy clustering analysis, each object in the data set no longer belongs to a single group but may belong to any group; the degree of an object belonging to a group is denoted by a value in [0, 1]. Among the various methods of fuzzy clustering, fuzzy c-means (FCM) clustering [10] has received particular attention for its special features. In recent years, based on different sampling and extension methods, many modified FCM algorithms [11–13] designed for big data analysis have been proposed. However, these algorithms are unsatisfactory in efficiency for high dimensional data, since they do not take the problem of the "curse of dimensionality" into account.

In 1984, Johnson and Lindenstrauss [14] used the projection generated by a random orthogonal matrix to reduce the dimensionality of data. This method can preserve the pairwise distances of the points within a factor of 1 ± ε. Subsequently, [15] stated that such a projection could be produced by a random Gaussian matrix. Moreover, Achlioptas showed that even a projection from a random scaled sign matrix satisfies the property of preserving pairwise distances [16]. These results laid the theoretical foundation for applying random projection to clustering analysis based on pairwise distances. Recently, Boutsidis et al. [17] designed a provably accurate dimensionality reduction method for k-means clustering based on random projection. Since that method was analyzed for crisp partitions, the effect of random projection on the FCM clustering algorithm is still unknown.



As it can combine multiple base clustering solutions of the same object set into a single consensus solution, cluster ensemble has many attractive properties such as improved quality of solution, robust clustering, and knowledge reuse [18]. Ensemble approaches of fuzzy clustering with random projection have been proposed in [19–21]. These methods are all based on multiple random projections of the original data set and then integrate all fuzzy clustering results of the projected data sets. Reference [21] pointed out that their method used less memory and ran faster than the ones of [19, 20]. However, to obtain a crisp partition solution, their method still needs to compute and store the product of membership matrices, which requires time and space complexity quadratic in the data size.

Our Contribution. In this paper, our contributions can be divided into two parts: one is the analysis of the impact of random projection on FCM clustering; the other is the proposal of a cluster ensemble method with random projection, which is more efficient, more robust, and suitable for a wider range of geometrical data sets. Concretely, the contributions are as follows.

(i) We theoretically show that random projection preserves the entire variability of the data and prove the effectiveness of random projection for dimensionality reduction from the linear independence of the dimensions of the projected data. Together with the property of preserving pairwise distances of points, we obtain a modified FCM clustering algorithm with random projection. The accuracy and efficiency of the modified algorithm have been verified through experiments on both synthetic and real data sets.

(ii) We propose a new cluster ensemble algorithm for FCM clustering with random projection, which obtains the spectral embedding efficiently through singular value decomposition (SVD) of the concatenation of membership matrices. The new method avoids the construction of a similarity or distance matrix, so it is more efficient and space-saving than the method in [21] with respect to crisp partition and the methods in [19, 20] for large scale data sets. In addition, the improvements in robustness and efficiency of our approach are also verified by the experimental results on both synthetic and real data sets. At the same time, our algorithm is not only as accurate as the existing ones on the Gaussian mixture data set but also clearly more accurate than the existing ones on the real data set, which indicates that our approach is suitable for a wider range of data sets.

2. Preliminaries

In this section, we present some notations used throughout this paper, introduce the FCM clustering algorithm, and review some traditional cluster ensemble methods using random projection.

2.1. Matrix Notations. We use X to denote the data matrix, x_i to denote the ith row vector of X (the ith point), and x_{ij} to denote the (i, j)th element of X. E(ξ) means the expectation of a random variable ξ, and Pr(A) denotes the probability of an event A. Let cov(ξ, η) be the covariance of random variables ξ and η, and let var(ξ) be the variance of a random variable ξ.

We denote the trace of a matrix by tr(·); given A ∈ R^{n×n}, then

tr(A) = \sum_{i=1}^{n} a_{ii}. (1)

For any matrices A, B ∈ R^{n×n}, we have the following property:

tr(AB) = tr(BA). (2)

Singular value decomposition is a popular dimensionality reduction method through which one can get a projection f: X → R^t with f(x_i) = x_i V_t, where V_t contains the top t right singular vectors of matrix X. The exact SVD of X takes cubic time in the dimension size and quadratic time in the data size.

2.2. Fuzzy c-Means Clustering Algorithm (FCM). The goal of fuzzy clustering is to get a flexible partition where each point has membership in more than one cluster with values in [0, 1]. Among the various fuzzy clustering algorithms, the FCM clustering algorithm is widely used for low dimensional data because of its efficiency and effectiveness [22]. We start from giving the definition of the fuzzy c-means clustering problem and then describe the FCM clustering algorithm precisely.

Definition 1 (the fuzzy c-means clustering problem). Given a data set of n points with d features denoted by an n × d matrix X, a positive integer c regarded as the number of clusters, and a fuzzy constant m > 1, find the partition matrix U_opt ∈ R^{c×n} and centers of clusters V_opt = {v_{opt,1}, v_{opt,2}, ..., v_{opt,c}} such that

(U, V)_{opt} = \arg\min_{U,V} \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} \|x_j - v_i\|^2. (3)

Here ‖·‖ denotes a norm, usually the Euclidean norm; the element u_{ij} of the partition matrix denotes the membership of point j in cluster i. Moreover, for any j ∈ [1, n], \sum_{i=1}^{c} u_{ij} = 1. The objective function is defined as \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} \|x_j - v_i\|^2 ≜ obj.

The FCM clustering algorithm first computes the degrees of membership through the distances between points and centers of clusters and then updates the center of each cluster based on the membership degrees. By computing cluster centers and the partition matrix iteratively, a solution is obtained. It should be noted that FCM clustering can only get a locally optimal solution and the final clustering result depends on the initialization. The detailed procedure of FCM clustering is shown in Algorithm 1.

2.3. Ensemble Aggregations for Multiple Fuzzy Clustering Solutions with Random Projection


Input: data set X (an n × d matrix), number of clusters c, fuzzy constant m
Output: partition matrix U, centers of clusters V
Initialize: sample U (or V) randomly from the proper space
While |obj_old − obj_new|^2 > ε do
    u_{ij} = [ \sum_{k=1}^{c} ( \|x_j - v_i\| / \|x_j - v_k\| )^{2/(m-1)} ]^{-1}, ∀i, j
    v_i = \sum_{j=1}^{n} (u_{ij})^{m} x_j / \sum_{j=1}^{n} (u_{ij})^{m}, ∀i
    obj = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} \|x_j - v_i\|^2

Algorithm 1: FCM clustering algorithm.
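To make Algorithm 1 concrete, the following is a minimal Python/NumPy sketch of the iteration above. It is an illustration rather than the authors' implementation (the paper's experiments were run in Matlab); the function name fcm, the random initialization, and the iteration cap are our own choices.

import numpy as np

def fcm(X, c, m=2.0, eps=1e-5, max_iter=100, rng=None):
    # X: (n, d) data matrix; c: number of clusters; m: fuzzy constant (m > 1)
    rng = np.random.default_rng(rng)
    # initialize the partition matrix U (c x n) randomly; columns sum to 1
    U = rng.random((c, X.shape[0]))
    U /= U.sum(axis=0, keepdims=True)
    obj_old = np.inf
    for _ in range(max_iter):
        Um = U ** m
        # v_i = sum_j u_ij^m x_j / sum_j u_ij^m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)
        # squared distances ||x_j - v_i||^2, shape (c, n)
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)
        # u_ij = [ sum_k (||x_j - v_i|| / ||x_j - v_k||)^(2/(m-1)) ]^(-1)
        U = 1.0 / ((d2[:, None, :] / d2[None, :, :]) ** (1.0 / (m - 1.0))).sum(axis=1)
        obj = float((U ** m * d2).sum())
        if (obj_old - obj) ** 2 < eps:   # matches the |obj_old - obj|^2 > eps loop condition
            break
        obj_old = obj
    return U, V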

There are several algorithms proposed for aggregating the multiple fuzzy clustering results with random projection. The main strategy is to generate data membership matrices through multiple fuzzy clustering solutions on the different projected data sets and then to aggregate the resulting membership matrices. Therefore, different methods of generating and aggregating membership matrices lead to various ensemble approaches for fuzzy clustering.

The first cluster ensemble approach using random projection was proposed in [20]. After projecting the data into a low dimensional space with random projection, the membership matrices were calculated through the probabilistic model θ of a mixture of c Gaussians obtained by EM clustering. Subsequently, the similarity of points i and j was computed as P^θ_{ij} = \sum_{l=1}^{c} P(l | i, θ) × P(l | j, θ), where P(l | i, θ) denoted the probability of point i belonging to cluster l under model θ and P^θ_{ij} denoted the probability that points i and j belonged to the same cluster under model θ. The aggregated similarity matrix was obtained by averaging across the multiple runs, and the final clustering solution was produced by a hierarchical clustering method called complete linkage. For mixture models, the estimation of the cluster number and of the values of the unknown parameters is often complicated [23]. In addition, this approach needs O(n^2) space for storing the similarity matrix of the data points.

Another approach, which was used to find genes in DNA microarray data, was presented in [19]. Similarly, the data was projected into a low dimensional space with a random matrix. Then the method employed FCM clustering to partition the projected data and generated membership matrices U_i ∈ R^{c×n}, i = 1, 2, ..., r, with r multiple runs. For each run i, the similarity matrix was computed as M_i = U_i^T U_i. Then the combined similarity matrix M was calculated by averaging, M = (1/r) \sum_{i=1}^{r} M_i. A distance matrix was computed as D = 1 − M, and the final partition matrix was gained by FCM clustering on the distance matrix D. Since this method needs to compute the product of the partition matrix and its transpose, the time complexity is O(r·c·n^2) and the space complexity is O(n^2).

Considering the large scale data sets in the context of big data, [21] proposed a new method for aggregating partition matrices from FCM clustering. They concatenated the partition matrices as Ucon = [U_1^T, U_2^T, ...] instead of averaging the agreement matrices. Finally, they got the ensemble result as U_f = FCM(Ucon, c). This algorithm avoids the products of partition matrices and is more suitable than [19] for large scale data sets. However, it still needs the multiplication of the concatenated partition matrix when a crisp partition result is wanted.

3. Random Projection

Dimensionality reduction is a common technique for the analysis of high dimensional data. The most popular technique is SVD (or principal component analysis), where the original features are replaced by a small number of principal components in order to compress the data. But SVD takes cubic time in the number of dimensions. Recently, several works stated that random projection can be applied to dimensionality reduction and preserves pairwise distances within a small factor [15, 16]. Low computational complexity and preservation of the metric structure have made random projection receive much attention. Lemma 2 indicates that there are three kinds of simple random projections possessing the above properties.

Lemma 2 (see [15, 16]). Let matrix X ∈ R^{n×d} be a data set of n points and d features. Given ε, β > 0, let

k_0 = (4 + 2β) / (ε^2/2 − ε^3/3) · log n. (4)

For an integer t ≥ k_0, let matrix R be a d × t (t ≤ d) random matrix whose elements R_{ij} are independently identically distributed random variables from one of the following three probability distributions:

R_{ij} ~ N(0, 1),

R_{ij} = +1 with probability 1/2, −1 with probability 1/2, (5)


R_{ij} = \sqrt{3} × { +1 with probability 1/6; 0 with probability 2/3; −1 with probability 1/6 }. (6)

Let f: R^d → R^t with f(x_i) = (1/\sqrt{t}) x_i R. For any u, v ∈ X, with probability at least 1 − n^{−β}, it holds that

(1 − ε) \|u − v\|_2^2 ≤ \|f(u) − f(v)\|_2^2 ≤ (1 + ε) \|u − v\|_2^2. (7)

Lemma 2 implies that if the number of dimensions of the data reduced by random projection is larger than a certain bound, then the pairwise squared Euclidean distances are preserved within a multiplicative factor of 1 ± ε.
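As a quick illustration of Lemma 2, the following Python sketch projects a data set with a random sign matrix and compares one pairwise squared distance before and after projection; the sizes n, d, and t are arbitrary choices for the example and not values used in the paper.

import numpy as np

rng = np.random.default_rng(0)
n, d, t = 200, 5000, 300                 # illustrative sizes; t should be at least k_0
X = rng.standard_normal((n, d))

# random sign matrix, entries +1/-1 with probability 1/2 each (distribution (5))
R = rng.choice([-1.0, 1.0], size=(d, t))
Y = X @ R / np.sqrt(t)                   # f(x_i) = (1/sqrt(t)) x_i R

orig = ((X[0] - X[1]) ** 2).sum()        # ||u - v||_2^2
proj = ((Y[0] - Y[1]) ** 2).sum()        # ||f(u) - f(v)||_2^2
print(proj / orig)                       # close to 1, i.e., within 1 +/- eps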

With the above properties, researchers have checked the feasibility of applying random projection to k-means clustering in terms of theory and experiment [17, 24]. However, as membership degrees for FCM clustering and k-means clustering are defined differently, that analysis cannot be directly used for assessing the effect of random projection on FCM clustering. Motivated by the idea of principal component analysis, we draw the conclusion, based on the analysis of the variance difference, that the compressed data retains the whole variability of the original data in a probabilistic sense. Besides, the variables referring to the dimensions of the projected data are linearly independent. As a result, we can achieve dimensionality reduction by replacing the original data with the compressed data as "principal components".

Next, we give a useful lemma for the proof of the subsequent theorem.

Lemma 3. Let ξ_i (1 ≤ i ≤ n) be independently distributed random variables from one of the three probability distributions described in Lemma 2; then

Pr{ lim_{n→∞} (1/n) \sum_{i=1}^{n} ξ_i^2 = 1 } = 1. (8)

Proof. According to the probability distribution of the random variable ξ_i, it is easy to know that

E(ξ_i^2) = 1 (1 ≤ i ≤ n),
E( (1/n) \sum_{i=1}^{n} ξ_i^2 ) = 1. (9)

Then {ξ_i^2} obeys the law of large numbers, namely,

Pr{ lim_{n→∞} (1/n) \sum_{i=1}^{n} ξ_i^2 = E( (1/n) \sum_{i=1}^{n} ξ_i^2 ) } = Pr{ lim_{n→∞} (1/n) \sum_{i=1}^{n} ξ_i^2 = 1 } = 1. (10)
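A small Monte Carlo check of Lemma 3 (illustrative only, with our own choice of sample size): drawing ξ_i from the sparse distribution (6), the sample mean of ξ_i^2 approaches 1 as n grows.

import numpy as np

rng = np.random.default_rng(1)
n = 100000
# distribution (6): sqrt(3) * (+1 w.p. 1/6, 0 w.p. 2/3, -1 w.p. 1/6)
xi = np.sqrt(3.0) * rng.choice([1.0, 0.0, -1.0], size=n, p=[1/6, 2/3, 1/6])
print(xi.mean(), (xi ** 2).mean())   # roughly 0 and 1, consistent with E(xi^2) = 1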

Since centralization of the data does not change the distance between any two points, and the FCM clustering algorithm partitions data points based on pairwise distances, we assume that the expectation of the data input is 0. In practice, the covariance matrix of the population is likely unknown. Therefore, we investigate the effect of random projection on the variability of both the population and the sample.

Theorem 4. Let the data set X ∈ R^{n×d} be n independent samples of a d-dimensional random vector (X_1, X_2, ..., X_d), and let S denote the sample covariance matrix of X. The random projection induced by a random matrix R ∈ R^{d×t} maps the d-dimensional random vector to the t-dimensional random vector (Y_1, Y_2, ..., Y_t) = (1/\sqrt{t})(X_1, X_2, ..., X_d)·R, and S* denotes the sample covariance matrix of the projected data. If the elements of the random matrix R obey a distribution demanded by Lemma 2 and are mutually independent with the random vector (X_1, X_2, ..., X_d), then

(1) the dimensions of the projected data are linearly independent: cov(Y_i, Y_j) = 0, ∀i ≠ j;

(2) random projection maintains the whole variability: \sum_{i=1}^{t} var(Y_i) = \sum_{i=1}^{d} var(X_i); when t → ∞, with probability 1, tr(S*) = tr(S).

Proof. It is easy to know that the expectation of any element of the random matrix is E(R_{ij}) = 0, 1 ≤ i ≤ d, 1 ≤ j ≤ t. As the elements of the random matrix R and the random vector (X_1, X_2, ..., X_d) are mutually independent, the covariance of the random vector induced by random projection is

cov(Y_i, Y_j) = cov( (1/\sqrt{t}) \sum_{k=1}^{d} X_k R_{ki}, (1/\sqrt{t}) \sum_{l=1}^{d} X_l R_{lj} )
= (1/t) \sum_{k=1}^{d} \sum_{l=1}^{d} cov(X_k R_{ki}, X_l R_{lj})
= (1/t) \sum_{k=1}^{d} \sum_{l=1}^{d} E(X_k R_{ki} X_l R_{lj}) − (1/t) \sum_{k=1}^{d} \sum_{l=1}^{d} E(X_k R_{ki}) E(X_l R_{lj})
= (1/t) \sum_{k=1}^{d} \sum_{l=1}^{d} E(X_k R_{ki} X_l R_{lj})
= (1/t) \sum_{k=1}^{d} \sum_{l=1}^{d} E(X_k X_l) E(R_{ki} R_{lj})
= (1/t) \sum_{k=1}^{d} E(X_k^2) E(R_{ki} R_{kj}). (11)

(1) If i ≠ j, then

cov(Y_i, Y_j) = (1/t) \sum_{k=1}^{d} E(X_k^2) E(R_{ki}) E(R_{kj}) = 0. (12)


(2) If i = j, then

cov(Y_i, Y_i) = var(Y_i) = (1/t) \sum_{k=1}^{d} E(X_k^2) E(R_{ki}^2) = (1/t) \sum_{k=1}^{d} E(X_k^2). (13)

Thus, by the assumption E(X_i) = 0 (1 ≤ i ≤ d), we can get

\sum_{i=1}^{t} var(Y_i) = \sum_{i=1}^{d} var(X_i). (14)

We denote the spectral decomposition of the sample covariance matrix S by S = VΛV^T, where V is the matrix of eigenvectors and Λ is a diagonal matrix whose diagonal elements are λ_1, λ_2, ..., λ_d with λ_1 ≥ λ_2 ≥ ... ≥ λ_d. Supposing the data samples have been centralized, namely, their means are 0s, we can get the covariance matrix S = (1/n)X^T X. For convenience, we still denote a sample of the random matrix by R. Thus, the projected data is Y = (1/\sqrt{t})XR, and the sample covariance matrix of the projected data is S* = (1/n)((1/\sqrt{t})XR)^T((1/\sqrt{t})XR) = (1/t)R^T S R. Then we can get

tr(S*) = tr( (1/t) R^T VΛV^T R ) = tr( (1/t) R^T ΛVV^T R ) = tr( (1/t) R^T ΛR ) = \sum_{i=1}^{d} λ_i · ( (1/t) \sum_{j=1}^{t} r_{ij}^2 ), (15)

where r_{ij} (1 ≤ i ≤ d, 1 ≤ j ≤ t) is a sample of an element of the random matrix R.

In practice, the spectrum of a covariance matrix often displays a distinct decay after a few large eigenvalues. So we assume that there exist an integer p and a bounded constant q > 0 such that for all i > p it holds that λ_i ≤ q. Then

|tr(S*) − tr(S)| = | \sum_{i=1}^{d} λ_i ( (1/t) \sum_{j=1}^{t} r_{ij}^2 − 1 ) |
≤ | \sum_{i=1}^{p} λ_i ( (1/t) \sum_{j=1}^{t} r_{ij}^2 − 1 ) | + | \sum_{i=p+1}^{d} λ_i ( (1/t) \sum_{j=1}^{t} r_{ij}^2 − 1 ) |
≤ | \sum_{i=1}^{p} λ_i (1/t) \sum_{j=1}^{t} (r_{ij}^2 − 1) | + q | \sum_{i=p+1}^{d} ( (1/t) \sum_{j=1}^{t} (r_{ij}^2 − 1) ) |. (16)

By Lemma 3, with probability 1,

lim_{t→∞} ( (1/t) \sum_{j=1}^{t} (r_{ij}^2 − 1) ) = 0,
lim_{t→∞} \sum_{i=p+1}^{d} ( (1/t) \sum_{j=1}^{t} (r_{ij}^2 − 1) ) = 0. (17)

Combining the above arguments, we obtain tr(S*) = tr(S) with probability 1 when t → ∞.

Part (1) of Theorem 4 indicates that the compressed data produced by random projection can carry much information with low dimensionality, owing to the linear independence of the reduced dimensions. Part (2) manifests that the sum of the variances of the dimensions of the original data is consistent with that of the projected data, namely, random projection retains the variability of the primal data. Combining the results of Lemma 2 with those of Theorem 4, we consider that random projection can be employed to improve the efficiency of the FCM clustering algorithm through low dimensionality, and the modified algorithm can approximately keep the accuracy of the partition.
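The trace-preservation statement of Theorem 4(2) can also be checked numerically. The sketch below (with arbitrarily chosen sizes, not the paper's settings) compares tr(S) and tr(S*) for a centralized data matrix and a large reduced dimension t.

import numpy as np

rng = np.random.default_rng(2)
n, d, t = 1000, 2000, 1500            # t <= d, taken large to approach the limit t -> infinity
X = rng.standard_normal((n, d))
X -= X.mean(axis=0)                   # centralize the data, as assumed in the proof

R = rng.choice([-1.0, 1.0], size=(d, t))
Y = X @ R / np.sqrt(t)                # projected data

S = (X.T @ X) / n                     # sample covariance of the original data
S_star = (Y.T @ Y) / n                # sample covariance of the projected data
print(np.trace(S), np.trace(S_star))  # the two traces are close for large t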

4. FCM Clustering with Random Projection and an Efficient Cluster Ensemble Approach

4.1. FCM Clustering via Random Projection. According to the results of Section 3, we design an improved FCM clustering algorithm with random projection for dimensionality reduction. The procedure of the new algorithm is shown in Algorithm 2.

Algorithm 2 reduces the dimensions of the input data by multiplying it with a random matrix. Compared with the O(cnd^2) time for running each iteration in the original FCM clustering, the new algorithm needs O(cn(ε^{−2} ln n)^2) time for each iteration. Thus, the time complexity of the new algorithm decreases markedly for high dimensional data in the case ε^{−2} ln n ≪ d. Another common dimensionality reduction method is SVD. Compared with the O(d^3 + nd^2) time of running SVD on the data matrix X, the new algorithm only needs O(ε^{−2} d ln n) time to generate the random matrix R. This indicates that random projection is a cost-effective method of dimensionality reduction for the FCM clustering algorithm.

4.2. Ensemble Approach Based on Graph Partition. As different random projections may result in different clustering solutions [20], it is attractive to design a cluster ensemble framework with random projection for improved and robust clustering performance. Although it uses less memory and runs faster than the ensemble method in [19], the cluster ensemble algorithm in [21] still needs the product of the concatenated partition matrix for crisp grouping, which leads to high time and space costs in the context of big data.

In this section, we propose a more efficient and effective aggregation method for multiple FCM clustering results. The overview of our new ensemble approach is presented in Figure 1. The new ensemble method is based on partitioning a similarity graph.


Input: data set X (an n × d matrix), number of clusters c, fuzzy constant m, FCM clustering algorithm
Output: partition matrix U, centers of clusters V
(1) sample a d × t (t ≤ d, t = Ω(ε^{−2} ln n)) random projection R meeting the requirements of Lemma 2
(2) compute the product Y = (1/\sqrt{t}) X R
(3) run the FCM algorithm on Y, get the partition matrix U
(4) compute the centers of clusters through the original data X and U

Algorithm 2: FCM clustering with random projection.
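A minimal Python sketch of Algorithm 2, reusing the fcm sketch given after Algorithm 1; the random sign matrix is one of the choices allowed by Lemma 2, and the function name is our own.

import numpy as np

def fcm_random_projection(X, c, t, m=2.0, rng=None):
    # Algorithm 2 (sketch): run FCM on a randomly projected copy of X.
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    R = rng.choice([-1.0, 1.0], size=(d, t))       # step (1): random sign matrix
    Y = X @ R / np.sqrt(t)                         # step (2): Y = (1/sqrt(t)) X R
    U, _ = fcm(Y, c, m=m, rng=rng)                 # step (3): partition the projected data
    Um = U ** m
    V = (Um @ X) / Um.sum(axis=1, keepdims=True)   # step (4): centers from the original data
    return U, V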

Input: data set X (an n × d matrix), number of clusters c, reduced dimension t, number of random projections r, FCM clustering algorithm
Output: cluster label vector u
(1) at each iteration i ∈ [1, r], run Algorithm 2, get membership matrix U_i ∈ R^{c×n}
(2) concatenate the membership matrices: Ucon = [U_1^T, ..., U_r^T] ∈ R^{n×cr}
(3) compute the first c left singular vectors of Ũcon, denoted by A = [a_1, a_2, ..., a_c] ∈ R^{n×c}, where Ũcon = Ucon(r·D)^{−1/2}, D is a diagonal matrix, and d_{ii} = \sum_{j} u_{con,ji}
(4) treat each row of A as a data point and apply k-means to obtain the cluster label vector

Algorithm 3: Cluster ensemble for FCM clustering with random projection.
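The following Python sketch mirrors Algorithm 3, building on the fcm_random_projection sketch above; scikit-learn's KMeans is used for step (4) as a stand-in for any k-means implementation, and all names are our own.

import numpy as np
from sklearn.cluster import KMeans

def efcm_s(X, c, t, r, m=2.0, rng=None):
    # Algorithm 3 (sketch): ensemble of r runs of Algorithm 2, aggregated by spectral embedding.
    rng = np.random.default_rng(rng)
    blocks = []
    for _ in range(r):
        U, _ = fcm_random_projection(X, c, t, m=m, rng=rng)   # step (1)
        blocks.append(U.T)                                    # each U_i^T is (n, c)
    U_con = np.hstack(blocks)                                 # step (2): (n, c*r)
    # step (3): normalize by (r*D)^(-1/2), where d_ii are the column sums of U_con
    U_tilde = U_con / np.sqrt(r * U_con.sum(axis=0))
    A = np.linalg.svd(U_tilde, full_matrices=False)[0][:, :c] # first c left singular vectors
    # step (4): k-means on the rows of the spectral embedding A
    return KMeans(n_clusters=c, n_init=10).fit_predict(A)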

For each random projection, a new data set is generated. After performing FCM clustering on the new data sets, membership matrices are output. The elements of a membership matrix are treated as the similarity measure between the points and the cluster centers. Through SVD on the concatenation of the membership matrices, we get the spectral embedding of the data points efficiently. The detailed procedure of the new cluster ensemble approach is shown in Algorithm 3.

In step (3) of the procedure in Algorithm 3, the left singular vectors of Ũcon are equivalent to the eigenvectors of Ũcon Ũcon^T. It implies that we regard the matrix product as the construction of an affinity matrix of the data points. This method is motivated by the research on landmark-based representation [25, 26]. In our approach, we treat the cluster centers of each FCM clustering run as landmarks and the membership matrix as a landmark-based representation. Thus, the concatenation of the membership matrices forms a combined landmark-based representation matrix. In this way, the graph similarity matrix is computed as

W = Ũcon Ũcon^T, (18)

which can create the spectral embedding efficiently through step (3). To normalize the graph similarity matrix, we multiply Ucon by (r·D)^{−1/2}. As a result, the degree matrix of W is an identity matrix.

There are two perspectives to explain why our approach works. Considering the similarity measure defined by u_{ij} in FCM clustering, Proposition 3 in [26] demonstrated that the singular vectors of U_i converge to the eigenvectors of W_s as c converges to n, where W_s is the affinity matrix generated in standard spectral clustering. As a result, the singular vectors of Ucon converge to the eigenvectors of the normalized affinity matrix W_s. Thus, our final output will converge to the one of standard spectral clustering as c converges to n. Another explanation concerns the similarity measure defined by K(x_i, x_j) = x_i^T x_j, where x_i and x_j are data points. We can treat each row of Ucon as a transformed data point. As a result, the affinity matrix obtained here is the same as the one of standard spectral embedding, and our output is just the partition result of standard spectral clustering.

To facilitate the comparison of different ensemble methods for FCM clustering solutions with random projection, we denote the approach of [19] by EFCM-A (average the products of membership matrices), the algorithm of [21] by EFCM-C (concatenate the membership matrices), and our new method by EFCM-S (spectral clustering on the membership matrices). In the cluster ensemble phase, the main computations of the EFCM-A method are multiplications of membership matrices. Similarly, the EFCM-C algorithm also needs the product of the concatenated membership matrix in order to get the crisp partition result. Thus, both methods need O(n^2) space and O(crn^2) time. However, the main computations of EFCM-S are the SVD of Ucon and the k-means clustering of A. The overall space is O(crn), the SVD time is O((cr)^2 n), and the k-means clustering time is O(lc^2 n), where l is the iteration number of k-means. Therefore, the computational complexity of EFCM-S is clearly lower than those of EFCM-A and EFCM-C, considering cr ≪ n and l ≪ n for large scale data sets.

5. Experiments

In this section, we present the experimental evaluations of the new algorithms proposed in Section 4. We implemented the related algorithms in the Matlab computing environment and conducted our experiments on a Windows-based system with an Intel Core 3.6 GHz processor and 16 GB of RAM.

5.1. Data Sets and Parameter Settings. We conducted the experiments on synthetic and real data sets which both have relatively high dimensionality.


Figure 1: Framework of the new ensemble approach based on graph partition. (The original data set is mapped by r random projections to r generated data sets; FCM clustering on each yields a membership matrix; from the consensus matrix, the first c left singular vectors A are computed, and k-means on A produces the final result.)

The synthetic data set had 10,000 data points with 1,000 dimensions, which were generated from 3 Gaussian mixture components in proportions (0.25, 0.5, 0.25). The means of the components were (2, 2, ..., 2)_{1000}, (0, 0, ..., 0)_{1000}, and (−2, −2, ..., −2)_{1000}, and the standard deviations were (1, 1, ..., 1)_{1000}, (2, 2, ..., 2)_{1000}, and (3, 3, ..., 3)_{1000}. The real data set is the daily and sports activities data (ACT) published on the UCI machine learning repository (the ACT data set can be found at http://archive.ics.uci.edu/ml/datasets/Daily+and+Sports+Activities). These are data of 19 activities collected by 45 motion sensors in 5 minutes at a 25 Hz sampling frequency. Each activity was performed by 8 subjects in their own styles. To get high dimensional data sets, we treated 1 minute and 5 seconds of activity data as an instance, respectively. As a result, we got 760 × 67500 (ACT1) and 9120 × 5625 (ACT2) data matrices whose rows were activity instances and columns were features.

For the parameters of FCM clustering, we let ε = 10^{−5}, the maximum iteration number be 100, the fuzzy factor m be 2, and the number of clusters be c = 3 for the synthetic data set and c = 19 for the ACT data sets. We also normalized the objective function as obj* = obj / ‖X‖_F^2, where ‖·‖_F is the Frobenius norm of a matrix [27]. To minimize the influence introduced by different initializations, we present the average values of the evaluation indices over 20 independent experiments.

In order to compare different dimensionality reduction methods for FCM clustering, we initialized the algorithms by choosing c points randomly as the cluster centers and made sure that every algorithm began with the same initialization. In addition, we ran Algorithm 2 with t = 10, 20, ..., 100 for the synthetic data set and t = 100, 200, ..., 1000 for the ACT1 data set. The two kinds of random projections (with random variables from (5) in Lemma 2) were both tested to verify their feasibility. We also compared Algorithm 2 against another popular method of dimensionality reduction, SVD. What calls for special attention is that the number of eigenvectors corresponding to nonzero eigenvalues of the ACT1 data is only 760, so we only took t = 100, 200, ..., 700 for FCM clustering with SVD on the ACT1 data set.

Among the comparisons of different cluster ensemble algorithms, we set the dimension number of the projected data as t = 10, 20, ..., 100 for both the synthetic and ACT2 data sets. In order to meet cr ≪ n for Algorithm 3, the number of random projections r was set as 20 for the synthetic data set and 5 for the ACT2 data set, respectively.

5.2. Evaluation Criteria. For clustering algorithms, clustering validation and running time are two important indices for judging their performance. Clustering validation measures evaluate the goodness of clustering results [28] and can often be divided into two categories: external clustering validation and internal clustering validation. External validation measures use external information, such as the given class labels, to evaluate the goodness of the solution output by a clustering


algorithm. On the contrary, internal measures evaluate the clustering results using features inherited from the data sets. In this paper, the validity evaluation criteria used are the rand index and the clustering validation index based on nearest neighbors for crisp partitions, together with the fuzzy rand index and the Xie-Beni index for fuzzy partitions. Here the rand index and the fuzzy rand index are external validation measures, whereas the clustering validation index based on nearest neighbors and the Xie-Beni index are internal validation measures.

(1) Rand Index (RI) [29]. RI describes the similarity of the clustering solution and the correct labels through pairs of points. It takes into account the numbers of point pairs that are in the same and in different clusters. The RI is defined as

RI = (n_{11} + n_{00}) / C_n^2, (19)

where n_{11} is the number of pairs of points that are in the same cluster in both the clustering result and the given class labels, n_{00} is the number of pairs of points that are in different subsets in both the clustering result and the given class labels, and C_n^2 equals n(n − 1)/2. The value of RI ranges from 0 to 1, and a higher value implies a better clustering solution.
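A direct (O(n^2) pairwise) Python implementation of RI as defined in (19); this is an illustrative sketch, not the evaluation code used in the paper.

from itertools import combinations

def rand_index(labels_a, labels_b):
    # RI = (n11 + n00) / C(n, 2), counted over all unordered pairs of points
    n = len(labels_a)
    n11 = n00 = 0
    for i, j in combinations(range(n), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            n11 += 1
        elif not same_a and not same_b:
            n00 += 1
    return (n11 + n00) / (n * (n - 1) / 2)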

(2) Fuzzy Rand Index (FRI) [30]. FRI is a generalization of RI with respect to soft partitions. It also measures the proportion of pairs of points which are in the same and in different clusters in both the clustering solution and the true class labels. It needs to compute the analogous n_{11} and n_{00} through the contingency table described in [30]. Therefore, the range of FRI is also [0, 1], and a larger value means a more accurate cluster solution.

(3) Xie-Beni Index (XB) [31]. XB takes the minimum squared distance between cluster centers as the separation of the partition and the average squared fuzzy deviation of the data points as the compactness of the partition. XB is calculated as follows:

XB = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} \|x_j − v_i\|^2 / ( n · min_{i≠j} \|v_i − v_j\|^2 ), (20)

where \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} \|x_j − v_i\|^2 is just the objective function of FCM clustering and v_i is the center of cluster i. The smallest XB indicates the optimal cluster partition.
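An illustrative NumPy sketch of the XB index in (20), taking the membership matrix U (c × n), the centers V (c × d), and the data X (n × d) as produced by the FCM sketches above; not the paper's evaluation code.

import numpy as np

def xie_beni(X, U, V, m=2.0):
    # XB = sum_ij u_ij^m ||x_j - v_i||^2 / (n * min_{i != j} ||v_i - v_j||^2)
    n = X.shape[0]
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)      # (c, n)
    compactness = ((U ** m) * d2).sum()
    center_d2 = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(center_d2, np.inf)                          # exclude i == j
    return compactness / (n * center_d2.min())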

(4) Clustering Validation Index Based on Nearest Neighbors (CVNN) [32]. The separation of CVNN is based on the situation of objects that carry the geometrical information of each cluster, and the compactness is the mean pairwise distance between objects in the same cluster. CVNN is computed as follows:

CVNN(c, k) = Sep(c, k) / max_{c_min ≤ c ≤ c_max} Sep(c, k) + Com(c) / max_{c_min ≤ c ≤ c_max} Com(c), (21)

where Sep(c, k) = max_{i=1,2,...,c} ( (1/n_i) \sum_{j=1}^{n_i} (q_j / k) ) and Com(c) = \sum_{i=1}^{c} ( (2 / (n_i(n_i − 1))) \sum_{x,y ∈ Clu_i} d(x, y) ). Here c is the number of clusters in the partition result, c_max is the maximum cluster number given, c_min is the minimum cluster number given, k is the number of nearest neighbors, n_i is the number of objects in the ith cluster Clu_i, q_j denotes the number of nearest neighbors of Clu_i's jth object which are not in Clu_i, and d(x, y) denotes the distance between x and y. A lower CVNN value indicates a better clustering solution.
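The separation and compactness terms of CVNN can be computed as in the Python sketch below for a single crisp partition; the normalization over a range of candidate cluster numbers in (21) is left to the caller. This is our own illustrative code under the definitions above, not the paper's implementation.

import numpy as np
from scipy.spatial.distance import cdist

def cvnn_terms(X, labels, k=10):
    # Returns Sep(c, k) and Com(c) for one partition; CVNN in (21) divides each term
    # by its maximum over the candidate numbers of clusters before summing.
    labels = np.asarray(labels)
    D = cdist(X, X)
    nn = np.argsort(D, axis=1)[:, 1:k + 1]        # k nearest neighbors of each object
    sep_terms, com_terms = [], []
    for lab in np.unique(labels):
        idx = np.where(labels == lab)[0]
        n_i = len(idx)
        # q_j: how many of object j's k nearest neighbors lie outside its own cluster
        q = np.array([(labels[nn[j]] != lab).sum() for j in idx])
        sep_terms.append((q / k).mean())
        intra = D[np.ix_(idx, idx)]               # pairwise distances inside the cluster
        com_terms.append(intra.sum() / max(n_i * (n_i - 1), 1))
    return max(sep_terms), sum(com_terms)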

The objective function is a special evaluation criterion of validity for the FCM clustering algorithm. A smaller objective function indicates that the points inside clusters are more "similar".

Running time is also an important evaluation criterion, often related to the scalability of an algorithm. One main target of random projection for dimensionality reduction is to decrease the runtime and enhance the applicability of algorithms in the context of big data.

5.3. Performance of FCM Clustering with Random Projection. The experimental results for FCM clustering with random projection are presented in Figure 2, where (a), (c), (e), and (g) correspond to the synthetic data set and (b), (d), (f), and (h) correspond to the ACT1 data set. The evaluation criteria used to assess the proposed algorithms are FRI ((a) and (b)), XB ((c) and (d)), the objective function ((e) and (f)), and running time ((g) and (h)). "SignRP" denotes the proposed algorithm with a random sign matrix, "GaussRP" denotes FCM clustering with a random Gaussian matrix, "FCM" denotes the original FCM clustering algorithm, and "SVD" denotes FCM clustering with dimensionality reduction through SVD. It should be noted that the true XB value of FCM clustering in subfigure (d) is 4.03e+12, not 0.

From Figure 2 we can see that FCM clustering with random projection is clearly more efficient than the original FCM clustering. When the number of dimensions t is above a certain bound, the validity indices are nearly stable and similar to those of the naive FCM clustering for both data sets. This verifies the conclusion that the accuracy of the clustering algorithm can be preserved when the dimensionality exceeds a certain bound. The effectiveness of the random projection method is also verified by the smallness of that bound compared to the total number of dimensions (30 out of 1,000 for the synthetic data and 300 out of 67,500 for the ACT1 data). Besides, the two different kinds of random projection have a similar impact on FCM clustering, as shown by their analogous plots.

The higher objective function values and the smaller XB indices of the SVD method for the synthetic data set indicate that the generated clustering solution has a better degree of separation between clusters. The external cluster validation indices also verify that the SVD method has better clustering results for the synthetic data. These observations state that the SVD method is more suitable for Gaussian mixture data sets than FCM clustering with random projection and naive FCM clustering.

Although the SVD method has a higher FRI for the synthetic data set, the random projection methods have analogous FRI values for the ACT1 data set and better objective function values for both data sets. In addition, the random projection approaches are clearly more efficient, as SVD needs cubic time in the dimensionality. Hence, these observations indicate that our algorithm is quite encouraging in practice.

Figure 2: Performance of clustering algorithms with different dimensionality. Subfigures (a), (c), (e), and (g) show FRI, XB, the objective function, and running time versus the number of dimensions t for the synthetic data set; (b), (d), (f), and (h) show the same criteria for the ACT1 data set. Curves compare SVD, FCM, GaussRP, and SignRP.

Table 1: CVNN indices for different ensemble approaches on ACT2 data.

Dimension t   10      20      30      40      50      60      70      80      90      100
EFCM-A        1.7315  1.7383  1.7449  1.7789  1.819   1.83    1.7623  1.8182  1.8685  1.8067
EFCM-C        1.7938  1.7558  1.7584  1.8351  1.8088  1.8353  1.8247  1.8385  1.8105  1.8381
EFCM-S        1.3975  1.3144  1.2736  1.2974  1.3112  1.3643  1.3533  1.409   1.3701  1.3765

5.4. Comparisons of Different Cluster Ensemble Methods. The comparisons of different cluster ensemble approaches are shown in Figure 3 and Table 1. Similarly, (a) and (c) of the figure correspond to the synthetic data set and (b) and (d) correspond to the ACT2 data set. We use RI ((a) and (b)) and running time ((c) and (d)) to present the performance of the ensemble methods. Meanwhile, the meanings of EFCM-A, EFCM-C, and EFCM-S are identical to the ones in Section 4.2. In order to get crisp partitions for EFCM-A and EFCM-C, we used the hierarchical clustering complete-linkage method after getting the distance matrix, as in [21]. Since all three cluster ensemble methods get perfect partition results on the synthetic data set, we only compare the CVNN indices of the different ensemble methods on the ACT2 data set, which are presented in Table 1.

In Figure 3, the running time of our algorithm is shorter for both data sets. This verifies the result of the time complexity analysis for the different algorithms in Section 4.2. The three cluster ensemble methods all get the perfect partition for the synthetic data set, whereas our method is more accurate than the other two methods for the ACT2 data set. The perfect partition results suggest that all three ensemble methods are suitable for Gaussian mixture data sets. However, the almost 18% improvement in RI for the ACT2 data set should be due to the different grouping ideas. Our method is based on the graph partition such that the edges between different clusters have low weight and the edges within a cluster have high weight. This clustering way of spectral embedding is more suitable for the ACT2 data set. In Table 1, the smaller CVNN values of our new method also show that the new approach has better partition results on the ACT2 data set. These observations indicate that our algorithm has the advantage in efficiency and adapts to a wider range of geometries.

We also compare the stability of the three ensemble methods, presented in Table 2. From the table we can see that the standard deviation of RI for EFCM-S is an order of magnitude lower than those of the other methods. Hence, this result shows that our algorithm is more robust.

Aiming at the situation of an unknown number of clusters, we also varied the number of clusters c in FCM clustering and spectral embedding for our new method. We denote this version of the new method as EFCM-SV. Since the number of random projections was set as 5 for the ACT2 data set, we changed the clusters' number from 17 to 21 as the input of the FCM clustering algorithm. In addition, we set the clusters' number from 14 to 24 as the input of the spectral embedding and applied CVNN to estimate the most plausible number of clusters. The experimental results are presented in Table 3.


Table 2: Standard deviations of RI over 20 runs with different dimensions on ACT2 data.

Dimension t   10      20      30      40      50      60      70      80      90      100
EFCM-A        0.0222  0.0174  0.018   0.0257  0.0171  0.0251  0.0188  0.0172  0.0218  0.0184
EFCM-C        0.0217  0.0189  0.0128  0.0232  0.0192  0.0200  0.0175  0.0194  0.0151  0.0214
EFCM-S        0.0044  0.0018  0.0029  0.0030  0.0028  0.0024  0.0026  0.0020  0.0024  0.0019

Table 3: RI values for EFCM-S and EFCM-SV on ACT2 data.

Dimension t   10        20        30        40        50        60        70        80        90        100
EFCM-S        0.9227    0.922     0.9223    0.923     0.9215    0.9218    0.9226    0.9225    0.9231    0.9237
EFCM-SV       0.9257    0.9257    0.9165    0.9257    0.927     0.9165    0.9268    0.927     0.9105    0.9245
+CVNN         c = 18.5  c = 20.7  c = 19.4  c = 19.3  c = 19.3  c = 18.2  c = 19.2  c = 18.3  c = 19.4  c = 20.2

Figure 3: Performance of cluster ensemble approaches with different dimensionality. Subfigures (a) and (b) show RI versus the number of dimensions t for the synthetic and ACT2 data sets, respectively; (c) and (d) show running time versus the number of dimensions t. Curves compare EFCM-A, EFCM-C, and EFCM-S.


In Table 3, the values in the row "EFCM-SV" are the average RI values with the estimated clusters' numbers over 20 individual runs. The values in the row "+CVNN" are the average clusters' numbers decided by the CVNN cluster validity index. Using the clusters' numbers estimated by CVNN, our method gets results similar to those of the ensemble method with the correct clusters' number. In addition, the average estimates of the clusters' number are close to the true one. This indicates that our cluster ensemble method EFCM-SV is attractive when the number of clusters is unknown.

6. Conclusion and Future Work

The "curse of dimensionality" in big data has recently given new challenges to clustering, and feature extraction for dimensionality reduction is a popular way to deal with these challenges. We studied the feature extraction method of random projection for FCM clustering. By analyzing the effects of random projection on the entire variability of the data theoretically, and verifying them on both synthetic and real world data empirically, we designed an enhanced FCM clustering algorithm with random projection. The new algorithm can maintain nearly the same clustering solution as the preliminary FCM clustering and is more efficient than the feature extraction method of SVD. Furthermore, we also proposed a cluster ensemble approach that is more applicable to large scale data sets than the existing ones. The new ensemble approach can achieve the spectral embedding efficiently from the SVD of the concatenation of membership matrices. The experiments showed that the new ensemble method ran faster, had more robust partition solutions, and fitted a wider range of geometrical data sets.

A future research direction is to design provably accurate feature extraction and feature selection methods for FCM clustering. Another remaining question is how to choose a proper number of random projections for the cluster ensemble method in order to get a trade-off between clustering accuracy and efficiency.

Competing Interests

The authors declare that they have no competing interests

Acknowledgments

This work was supported in part by the National Key Basic Research Program (973 Program) under Grant 2012CB315905, in part by the National Nature Science Foundation of China under Grants 61502527 and 61379150, and in part by the Open Foundation of the State Key Laboratory of Networking and Switching Technology (Beijing University of Posts and Telecommunications) (no. SKLNST-2013-1-06).

References

[1] M. Chen, S. Mao, and Y. Liu, "Big data: a survey," Mobile Networks and Applications, vol. 19, no. 2, pp. 171–209, 2014.
[2] J. Zhang, X. Tao, and H. Wang, "Outlier detection from large distributed databases," World Wide Web, vol. 17, no. 4, pp. 539–568, 2014.
[3] C. Ordonez, N. Mohanam, and C. Garcia-Alvarado, "PCA for large data sets with parallel data summarization," Distributed and Parallel Databases, vol. 32, no. 3, pp. 377–403, 2014.
[4] D.-S. Pham, S. Venkatesh, M. Lazarescu, and S. Budhaditya, "Anomaly detection in large-scale data stream networks," Data Mining and Knowledge Discovery, vol. 28, no. 1, pp. 145–189, 2014.
[5] F. Murtagh and P. Contreras, "Random projection towards the Baire metric for high dimensional clustering," in Statistical Learning and Data Sciences, pp. 424–431, Springer, Berlin, Germany, 2015.
[6] T. C. Havens, J. C. Bezdek, C. Leckie, L. O. Hall, and M. Palaniswami, "Fuzzy c-means algorithms for very large data," IEEE Transactions on Fuzzy Systems, vol. 20, no. 6, pp. 1130–1146, 2012.
[7] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Elsevier, 2011.
[8] S. Khan, G. Situ, K. Decker, and C. J. Schmidt, "GoFigure: automated gene ontology annotation," Bioinformatics, vol. 19, no. 18, pp. 2484–2485, 2003.
[9] S. Gunnemann, H. Kremer, D. Lenhard, and T. Seidl, "Subspace clustering for indexing high dimensional data: a main memory index based on local reductions and individual multi-representations," in Proceedings of the 14th International Conference on Extending Database Technology (EDBT '11), pp. 237–248, ACM, Uppsala, Sweden, March 2011.
[10] J. C. Bezdek, R. Ehrlich, and W. Full, "FCM: the fuzzy c-means clustering algorithm," Computers & Geosciences, vol. 10, no. 2-3, pp. 191–203, 1984.
[11] R. J. Hathaway and J. C. Bezdek, "Extending fuzzy and probabilistic clustering to very large data sets," Computational Statistics & Data Analysis, vol. 51, no. 1, pp. 215–234, 2006.
[12] P. Hore, L. O. Hall, and D. B. Goldgof, "Single pass fuzzy c means," in Proceedings of the IEEE International Fuzzy Systems Conference (FUZZ '07), pp. 1–7, London, UK, July 2007.
[13] P. Hore, L. O. Hall, D. B. Goldgof, Y. Gu, A. A. Maudsley, and A. Darkazanli, "A scalable framework for segmenting magnetic resonance images," Journal of Signal Processing Systems, vol. 54, no. 1–3, pp. 183–203, 2009.
[14] W. B. Johnson and J. Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space," Contemporary Mathematics, vol. 26, pp. 189–206, 1984.
[15] P. Indyk and R. Motwani, "Approximate nearest neighbors: towards removing the curse of dimensionality," in Proceedings of the 13th Annual ACM Symposium on Theory of Computing, pp. 604–613, ACM, 1998.
[16] D. Achlioptas, "Database-friendly random projections: Johnson-Lindenstrauss with binary coins," Journal of Computer and System Sciences, vol. 66, no. 4, pp. 671–687, 2003.
[17] C. Boutsidis, A. Zouzias, and P. Drineas, "Random projections for k-means clustering," in Advances in Neural Information Processing Systems, pp. 298–306, MIT Press, 2010.
[18] C. C. Aggarwal and C. K. Reddy, Data Clustering: Algorithms and Applications, CRC Press, New York, NY, USA, 2013.
[19] R. Avogadri and G. Valentini, "Fuzzy ensemble clustering based on random projections for DNA microarray data analysis," Artificial Intelligence in Medicine, vol. 45, no. 2-3, pp. 173–183, 2009.

[20] X. Z. Fern and C. E. Brodley, "Random projection for high dimensional data clustering: a cluster ensemble approach," in Proceedings of the 20th International Conference on Machine Learning (ICML '03), vol. 3, pp. 186–193, August 2003.

[21] M. Popescu, J. Keller, J. Bezdek, and A. Zare, "Random projections fuzzy c-means (RPFCM) for big data clustering," in Proceedings of the IEEE International Conference on Fuzzy Systems (FUZZ-IEEE '15), pp. 1–6, Istanbul, Turkey, August 2015.

[22] A. Fahad, N. Alshatri, Z. Tari et al., "A survey of clustering algorithms for big data: taxonomy and empirical analysis," IEEE Transactions on Emerging Topics in Computing, vol. 2, no. 3, pp. 267–279, 2014.

[23] R. A. Johnson and D. W. Wichern, Applied Multivariate Statistical Analysis, vol. 4, Pearson Prentice Hall, Upper Saddle River, NJ, USA, 6th edition, 2007.

[24] C. Boutsidis, A. Zouzias, M. W. Mahoney, and P. Drineas, "Randomized dimensionality reduction for k-means clustering," IEEE Transactions on Information Theory, vol. 61, no. 2, pp. 1045–1062, 2015.

[25] X. Chen and D. Cai, "Large scale spectral clustering with landmark-based representation," in Proceedings of the 25th AAAI Conference on Artificial Intelligence, pp. 313–318, 2011.

[26] D. Cai and X. Chen, "Large scale spectral clustering via landmark-based sparse representation," IEEE Transactions on Cybernetics, vol. 45, no. 8, pp. 1669–1680, 2015.

[27] G. H. Golub and C. F. Van Loan, Matrix Computations, vol. 3, JHU Press, 2012.

[28] U. Maulik and S. Bandyopadhyay, "Performance evaluation of some clustering algorithms and validity indices," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 12, pp. 1650–1654, 2002.

[29] W. M. Rand, "Objective criteria for the evaluation of clustering methods," Journal of the American Statistical Association, vol. 66, no. 336, pp. 846–850, 1971.

[30] D. T. Anderson, J. C. Bezdek, M. Popescu, and J. M. Keller, "Comparing fuzzy, probabilistic, and possibilistic partitions," IEEE Transactions on Fuzzy Systems, vol. 18, no. 5, pp. 906–918, 2010.

[31] X. L. Xie and G. Beni, "A validity measure for fuzzy clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 8, pp. 841–847, 1991.

[32] Y. Liu, Z. Li, H. Xiong, X. Gao, J. Wu, and S. Wu, "Understanding and enhancement of internal clustering validation measures," IEEE Transactions on Cybernetics, vol. 43, no. 3, pp. 982–994, 2013.

In Table 3 the values with respect to ldquoEFCM-SVrdquo are theaverage RI values with the estimated clustersrsquo numbers for20 individual runs The values of ldquo+CVNNrdquo are the averageclustersrsquo numbers decided by the CVNN cluster validityindex Using the estimated clustersrsquo numbers by CVNN ourmethod gets the similar results of ensemble method withcorrect clustersrsquo number In addition the average estimates ofclustersrsquo number are close to the true one This indicates thatour cluster ensemble method EFCM-SV is attractive whenthe number of clusters is unknown

6 Conclusion and Future Work

The ldquocurse of dimensionalityrdquo in big data gives new chal-lenges for clustering recently and feature extraction fordimensionality reduction is a popular way to deal with thesechallenges We studied the feature extraction method ofrandom projection for FCM clustering Through analyzingthe effects of random projection on the entire variabilityof data theoretically and verification both on syntheticand real world data empirically we designed an enhancedFCM clustering algorithm with random projection The newalgorithm can maintain nearly the same clustering solutionof preliminary FCM clustering and be more efficient thanfeature extraction method of SVD What is more we alsoproposed a cluster ensemble approach that is more applicableto large scale data sets than existing ones The new ensembleapproach can achieve spectral embedding efficiently fromSVD on the concatenation of membership matrices Theexperiments showed that the new ensemble method ranfaster had more robust partition solutions and fitted a widerrange of geometrical data sets

A future research content is to design the provablyaccurate feature extraction and feature selection methodsfor FCM clustering Another remaining question is thathow to choose proper number of random projections forcluster ensemble method in order to get a trade-off betweenclustering accuracy and efficiency

Competing Interests

The authors declare that they have no competing interests

Acknowledgments

This work was supported in part by the National KeyBasic Research Program (973 programme) under Grant2012CB315905 and in part by the National Nature ScienceFoundation of China under Grants 61502527 and 61379150and in part by the Open Foundation of State Key Laboratoryof Networking and Switching Technology (Beijing Universityof Posts and Telecommunications) (no SKLNST-2013-1-06)

References

[1] M Chen S Mao and Y Liu ldquoBig data a surveyrdquo MobileNetworks and Applications vol 19 no 2 pp 171ndash209 2014

[2] J Zhang X Tao and H Wang ldquoOutlier detection from largedistributed databasesrdquoWorld Wide Web vol 17 no 4 pp 539ndash568 2014

[3] C Ordonez N Mohanam and C Garcia-Alvarado ldquoPCA forlarge data sets with parallel data summarizationrdquo Distributedand Parallel Databases vol 32 no 3 pp 377ndash403 2014

[4] D-S Pham S Venkatesh M Lazarescu and S BudhadityaldquoAnomaly detection in large-scale data stream networksrdquo DataMining and Knowledge Discovery vol 28 no 1 pp 145ndash1892014

[5] F Murtagh and P Contreras ldquoRandom projection towardsthe baire metric for high dimensional clusteringrdquo in StatisticalLearning and Data Sciences pp 424ndash431 Springer BerlinGermany 2015

[6] T C Havens J C Bezdek C Leckie L O Hall and MPalaniswami ldquoFuzzy c-means algorithms for very large datardquoIEEETransactions on Fuzzy Systems vol 20 no 6 pp 1130ndash11462012



Input: data set $X$ (an $n \times d$ matrix), number of clusters $c$, fuzzy constant $m$.
Output: partition matrix $U$, centers of clusters $V$.
Initialize: sample $U$ (or $V$) randomly from the proper space.
While $|obj_{old} - obj_{new}| > \varepsilon$ do
  $u_{ij} = \left[\sum_{k=1}^{c} \left(\|\mathbf{x}_j - \mathbf{v}_i\| / \|\mathbf{x}_j - \mathbf{v}_k\|\right)^{2/(m-1)}\right]^{-1}$, $\forall i, j$;
  $\mathbf{v}_i = \sum_{j=1}^{n} (u_{ij})^m \mathbf{x}_j \big/ \sum_{j=1}^{n} (u_{ij})^m$, $\forall i$;
  $obj = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^m \|\mathbf{x}_j - \mathbf{v}_i\|^2$.

Algorithm 1: FCM clustering algorithm.
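For concreteness, the following NumPy sketch mirrors the FCM iteration of Algorithm 1; it is an illustrative implementation, not the authors' Matlab code, and the helper name `fcm` is our own.

```python
import numpy as np

def fcm(X, c, m=2.0, eps=1e-5, max_iter=100, rng=np.random.default_rng(0)):
    """Minimal fuzzy c-means sketch: X is n x d, returns (U, V) with U of shape c x n."""
    n, d = X.shape
    V = X[rng.choice(n, c, replace=False)]                 # initialize centers from data points
    obj_old = np.inf
    for _ in range(max_iter):
        dist = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2) + 1e-12   # c x n distances
        # u_ij = [ sum_k (||x_j - v_i|| / ||x_j - v_k||)^(2/(m-1)) ]^(-1)
        ratio = (dist[:, None, :] / dist[None, :, :]) ** (2.0 / (m - 1.0))     # c x c x n
        U = 1.0 / ratio.sum(axis=1)                                            # memberships, c x n
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)                           # update centers
        obj = np.sum(Um * dist ** 2)                                           # FCM objective
        if abs(obj_old - obj) <= eps:
            break
        obj_old = obj
    return U, V
```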

proposed for aggregating the multiple fuzzy clustering results with random projection. The main strategy is to generate data membership matrices through multiple fuzzy clustering solutions on the different projected data sets and then to aggregate the resulting membership matrices. Therefore, different methods of generating and aggregating the membership matrices lead to various ensemble approaches for fuzzy clustering.

The first cluster ensemble approach using random projection was proposed in [20]. After projecting the data into a low dimensional space with random projection, the membership matrices were calculated through the probabilistic model $\theta$ of a mixture of $c$ Gaussians obtained by EM clustering. Subsequently, the similarity of points $i$ and $j$ was computed as $P^{\theta}_{ij} = \sum_{l=1}^{c} P(l \mid i, \theta) \times P(l \mid j, \theta)$, where $P(l \mid i, \theta)$ denotes the probability of point $i$ belonging to cluster $l$ under model $\theta$ and $P^{\theta}_{ij}$ denotes the probability that points $i$ and $j$ belong to the same cluster under model $\theta$. The aggregated similarity matrix was obtained by averaging across the multiple runs, and the final clustering solution was produced by a hierarchical clustering method, complete linkage. For a mixture model, estimating the cluster number and the values of the unknown parameters is often complicated [23]. In addition, this approach needs $O(n^2)$ space for storing the similarity matrix of the data points.

Another approach, which was used to find genes in DNA microarray data, was presented in [19]. Similarly, the data was projected into a low dimensional space with a random matrix. The method then employed FCM clustering to partition the projected data and generated membership matrices $U_i \in \mathbb{R}^{c \times n}$, $i = 1, 2, \ldots, r$, over $r$ runs. For each run $i$ the similarity matrix was computed as $M_i = U_i^T U_i$. The combined similarity matrix $M$ was then calculated by averaging, $M = (1/r)\sum_{i=1}^{r} M_i$. A distance matrix was computed as $D = 1 - M$, and the final partition matrix was obtained by FCM clustering on the distance matrix $D$. Since this method needs to compute the product of the partition matrix and its transpose, the time complexity is $O(r \cdot c \cdot n^2)$ and the space complexity is $O(n^2)$.

Considering the large scale data sets in the context of big data, [21] proposed a new method for aggregating partition matrices from FCM clustering. They concatenated the partition matrices as $U_{con} = [U_1^T, U_2^T, \ldots]$ instead of averaging the agreement matrices. Finally, they obtained the ensemble result as $U_f = \mathrm{FCM}(U_{con}, c)$. This algorithm avoids the products of partition matrices and is more suitable than [19] for large scale data sets. However, it still needs the multiplication of the concatenated partition matrix when a crisp partition result is wanted.
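The following NumPy sketch contrasts the two aggregation strategies just described; it is only an illustration under our reading of [19] and [21], not the referenced implementations, and `U_list` is a hypothetical list holding the $r$ membership matrices (each $c \times n$).

```python
import numpy as np

def aggregate_average(U_list):
    """Style of [19]: average the pairwise-similarity products; needs an n x n matrix."""
    n = U_list[0].shape[1]
    M = np.zeros((n, n))
    for U in U_list:
        M += U.T @ U                       # point-pair similarity under one run
    M /= len(U_list)
    return 1.0 - M                         # distance matrix fed to a final clustering step

def aggregate_concatenate(U_list):
    """Style of [21]: stack the memberships as features for one more FCM run."""
    return np.vstack(U_list).T             # n x (c*r) matrix
```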

3. Random Projection

Dimensionality reduction is a common technique for the analysis of high dimensional data. The most popular technique is SVD (or principal component analysis), where the original features are replaced by a small number of principal components in order to compress the data. But SVD takes time cubic in the number of dimensions. Recently, several works stated that random projection can be applied to dimensionality reduction and preserves pairwise distances within a small factor [15, 16]. Low computational complexity and preservation of the metric structure have earned random projection much attention. Lemma 2 indicates that three kinds of simple random projection possess the above properties.

Lemma 2 (see [15, 16]). Let matrix $X \in \mathbb{R}^{n \times d}$ be a data set of $n$ points and $d$ features. Given $\varepsilon, \beta > 0$, let
$$k_0 = \frac{4 + 2\beta}{\varepsilon^2/2 - \varepsilon^3/3} \log n. \quad (4)$$
For integer $t \ge k_0$, let matrix $R$ be a $d \times t$ $(t \le d)$ random matrix whose elements $R_{ij}$ are independently identically distributed random variables from any one of the following three probability distributions:
$$R_{ij} \sim N(0, 1), \qquad R_{ij} = \begin{cases} +1 & \text{with probability } 1/2, \\ -1 & \text{with probability } 1/2, \end{cases} \quad (5)$$
$$R_{ij} = \sqrt{3} \times \begin{cases} +1 & \text{with probability } 1/6, \\ 0 & \text{with probability } 2/3, \\ -1 & \text{with probability } 1/6. \end{cases} \quad (6)$$
Let $f: \mathbb{R}^d \to \mathbb{R}^t$ with $f(\mathbf{x}_i) = (1/\sqrt{t})\,\mathbf{x}_i R$. For any $\mathbf{u}, \mathbf{v} \in X$, with probability at least $1 - n^{-\beta}$ it holds that
$$(1 - \varepsilon)\|\mathbf{u} - \mathbf{v}\|_2^2 \le \|f(\mathbf{u}) - f(\mathbf{v})\|_2^2 \le (1 + \varepsilon)\|\mathbf{u} - \mathbf{v}\|_2^2. \quad (7)$$

Lemma 2 implies that if the number of dimensions of the data reduced by random projection is bigger than a certain bound, then pairwise squared Euclidean distances are preserved within a multiplicative factor of $1 \pm \varepsilon$.
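As an illustration of Lemma 2 (not part of the paper's experiments), the sketch below generates the three kinds of random matrices and applies the map $f$; the toy sizes and the helper name `random_projection` are our own.

```python
import numpy as np

def random_projection(d, t, kind="gaussian", rng=np.random.default_rng(0)):
    """Sample a d x t random matrix from one of the three distributions in Lemma 2."""
    if kind == "gaussian":                             # R_ij ~ N(0, 1)
        return rng.standard_normal((d, t))
    if kind == "sign":                                 # R_ij = +/-1 with probability 1/2 each
        return rng.choice([-1.0, 1.0], size=(d, t))
    # sparse case: sqrt(3) * {+1, 0, -1} with probabilities {1/6, 2/3, 1/6}
    return np.sqrt(3) * rng.choice([1.0, 0.0, -1.0], size=(d, t), p=[1/6, 2/3, 1/6])

# toy check of (approximate) distance preservation
X = np.random.default_rng(1).standard_normal((50, 1000))
R = random_projection(1000, 200, "sign")
Y = X @ R / np.sqrt(200)                               # f applied row-wise
print(np.linalg.norm(X[0] - X[1]), np.linalg.norm(Y[0] - Y[1]))   # values should be close
```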

With the above properties, researchers have verified the feasibility of applying random projection to $k$-means clustering both in theory and in experiments [17, 24]. However, as membership degrees for FCM clustering and $k$-means clustering are defined differently, that analysis cannot be directly used for assessing the effect of random projection on FCM clustering. Motivated by the idea of principal component analysis, we conclude, based on an analysis of the variance difference, that the compressed data retains the whole variability of the original data in a probabilistic sense. Besides, the variables corresponding to the dimensions of the projected data are linearly independent. As a result, we can achieve dimensionality reduction by replacing the original data with the compressed data, which acts as "principal components".

Next we give a useful lemma for the proof of the subsequent theorem.

Lemma 3. Let $\xi_i$ $(1 \le i \le n)$ be independently distributed random variables from one of the three probability distributions described in Lemma 2; then
$$\Pr\left\{\lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^{n} \xi_i^2 = 1\right\} = 1. \quad (8)$$

Proof. According to the probability distribution of the random variable $\xi_i$, it is easy to see that
$$E(\xi_i^2) = 1 \ (1 \le i \le n), \qquad E\left(\frac{1}{n}\sum_{i=1}^{n} \xi_i^2\right) = 1. \quad (9)$$
Then $\{\xi_i^2\}$ obeys the law of large numbers, namely,
$$\Pr\left\{\lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^{n} \xi_i^2 = E\left(\frac{1}{n}\sum_{i=1}^{n} \xi_i^2\right)\right\} = \Pr\left\{\lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^{n} \xi_i^2 = 1\right\} = 1. \quad (10)$$

Since centralization of the data does not change the distance between any two points, and the FCM clustering algorithm partitions data points based on pairwise distances, we assume that the expectation of the input data is 0. In practice the covariance matrix of the population is usually unknown; therefore we investigate the effect of random projection on the variability of both the population and the sample.

Theorem 4. Let data set $X \in \mathbb{R}^{n \times d}$ consist of $n$ independent samples of the $d$-dimensional random vector $(X_1, X_2, \ldots, X_d)$, and let $S$ denote the sample covariance matrix of $X$. The random projection induced by a random matrix $R \in \mathbb{R}^{d \times t}$ maps the $d$-dimensional random vector to the $t$-dimensional random vector $(Y_1, Y_2, \ldots, Y_t) = (1/\sqrt{t})(X_1, X_2, \ldots, X_d) \cdot R$, and $S^{*}$ denotes the sample covariance matrix of the projected data. If the elements of the random matrix $R$ obey a distribution demanded by Lemma 2 and are mutually independent of the random vector $(X_1, X_2, \ldots, X_d)$, then

(1) the dimensions of the projected data are linearly independent: $\mathrm{cov}(Y_i, Y_j) = 0$, $\forall i \ne j$;

(2) random projection maintains the whole variability: $\sum_{i=1}^{t} \mathrm{var}(Y_i) = \sum_{i=1}^{d} \mathrm{var}(X_i)$; when $t \to \infty$, with probability 1, $\mathrm{tr}(S^{*}) = \mathrm{tr}(S)$.

Proof. It is easy to see that the expectation of any element of the random matrix is $E(R_{ij}) = 0$, $1 \le i \le d$, $1 \le j \le t$. As the elements of the random matrix $R$ and the random vector $(X_1, X_2, \ldots, X_d)$ are mutually independent, the covariance of the random vector induced by the random projection is
$$\begin{aligned}
\mathrm{cov}(Y_i, Y_j) &= \mathrm{cov}\left(\frac{1}{\sqrt{t}}\sum_{k=1}^{d} X_k R_{ki},\ \frac{1}{\sqrt{t}}\sum_{l=1}^{d} X_l R_{lj}\right) = \frac{1}{t}\sum_{k=1}^{d}\sum_{l=1}^{d} \mathrm{cov}(X_k R_{ki},\, X_l R_{lj}) \\
&= \frac{1}{t}\sum_{k=1}^{d}\sum_{l=1}^{d} E(X_k R_{ki} X_l R_{lj}) - \frac{1}{t}\sum_{k=1}^{d}\sum_{l=1}^{d} E(X_k R_{ki})\, E(X_l R_{lj}) \\
&= \frac{1}{t}\sum_{k=1}^{d}\sum_{l=1}^{d} E(X_k R_{ki} X_l R_{lj}) = \frac{1}{t}\sum_{k=1}^{d}\sum_{l=1}^{d} E(X_k X_l)\, E(R_{ki} R_{lj}) \\
&= \frac{1}{t}\sum_{k=1}^{d} E(X_k^2)\, E(R_{ki} R_{kj}).
\end{aligned} \quad (11)$$

(1) If $i \ne j$, then
$$\mathrm{cov}(Y_i, Y_j) = \frac{1}{t}\left(\sum_{k=1}^{d} E(X_k^2) \cdot E(R_{ki}) \cdot E(R_{kj})\right) = 0. \quad (12)$$

(2) If $i = j$, then
$$\mathrm{cov}(Y_i, Y_i) = \mathrm{var}(Y_i) = \frac{1}{t}\left(\sum_{k=1}^{d} E(X_k^2) \cdot E(R_{ki}^2)\right) = \frac{1}{t}\sum_{k=1}^{d} E(X_k^2). \quad (13)$$

Thus, by the assumption $E(X_i) = 0$ $(1 \le i \le d)$, we get
$$\sum_{i=1}^{t} \mathrm{var}(Y_i) = \sum_{i=1}^{d} \mathrm{var}(X_i). \quad (14)$$

We denote the spectral decomposition of the sample covariance matrix $S$ by $S = V\Lambda V^T$, where $V$ is the matrix of eigenvectors and $\Lambda$ is a diagonal matrix whose diagonal elements are $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$. Supposing the data samples have been centralized, namely, their means are 0s, we get the covariance matrix $S = (1/n)X^T X$. For convenience we still denote a sample of the random matrix by $R$. Thus the projected data are $Y = (1/\sqrt{t})XR$, and the sample covariance matrix of the projected data is $S^{*} = (1/n)((1/\sqrt{t})XR)^T((1/\sqrt{t})XR) = (1/t)R^T S R$. Then we get
$$\mathrm{tr}(S^{*}) = \mathrm{tr}\left(\frac{1}{t}R^T V\Lambda V^T R\right) = \mathrm{tr}\left(\frac{1}{t}R^T \Lambda V V^T R\right) = \mathrm{tr}\left(\frac{1}{t}R^T \Lambda R\right) = \sum_{i=1}^{d} \lambda_i \left(\frac{1}{t}\sum_{j=1}^{t} r_{ij}^2\right), \quad (15)$$

where $r_{ij}$ $(1 \le i \le d,\ 1 \le j \le t)$ is a sample of an element of the random matrix $R$. In practice the spectrum of a covariance matrix often displays a distinct decay after a few large eigenvalues. So we assume that there exist an integer $p$ and a finite constant $q > 0$ such that for all $i > p$ it holds that $\lambda_i \le q$. Then

$$\begin{aligned}
\left|\mathrm{tr}(S^{*}) - \mathrm{tr}(S)\right| &= \left|\sum_{i=1}^{d} \lambda_i \left(\frac{1}{t}\sum_{j=1}^{t} r_{ij}^2 - 1\right)\right| \\
&\le \left|\sum_{i=1}^{p} \lambda_i \left(\frac{1}{t}\sum_{j=1}^{t} r_{ij}^2 - 1\right)\right| + \left|\sum_{i=p+1}^{d} \lambda_i \left(\frac{1}{t}\sum_{j=1}^{t} r_{ij}^2 - 1\right)\right| \\
&\le \left|\sum_{i=1}^{p} \lambda_i \cdot \frac{1}{t}\sum_{j=1}^{t} \left(r_{ij}^2 - 1\right)\right| + q\left|\sum_{i=p+1}^{d} \left(\frac{1}{t}\sum_{j=1}^{t} \left(r_{ij}^2 - 1\right)\right)\right|.
\end{aligned} \quad (16)$$

By Lemma 3, with probability 1,
$$\lim_{t \to \infty} \left(\frac{1}{t}\sum_{j=1}^{t} \left(r_{ij}^2 - 1\right)\right) = 0, \qquad \lim_{t \to \infty} \sum_{i=p+1}^{d} \left(\frac{1}{t}\sum_{j=1}^{t} \left(r_{ij}^2 - 1\right)\right) = 0. \quad (17)$$

Combining the above arguments, we obtain $\mathrm{tr}(S^{*}) = \mathrm{tr}(S)$ with probability 1 when $t \to \infty$.

Part (1) of Theorem 4 indicates that the compressed data produced by random projection can carry much information with low dimensionality, owing to the linear independence of the reduced dimensions. Part (2) shows that the sum of variances over the dimensions of the original data is consistent with that of the projected data; namely, random projection preserves the variability of the original data. Combining the results of Lemma 2 with those of Theorem 4, we consider that random projection can be employed to improve the efficiency of the FCM clustering algorithm through low dimensionality, while the modified algorithm approximately keeps the accuracy of the partition.
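The following toy check (our own illustration, not part of the proof or the paper's experiments) shows numerically that the total variance $\mathrm{tr}(S)$ is approximately preserved under a sign random projection when $t$ is moderately large.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, t = 2000, 500, 400
X = rng.standard_normal((n, d)) * rng.uniform(0.5, 3.0, size=d)   # roughly centered data, varied scales
S = (X.T @ X) / n                                                  # sample covariance of the original data
R = rng.choice([-1.0, 1.0], size=(d, t))                           # sign random projection (Lemma 2)
Y = X @ R / np.sqrt(t)
S_star = (Y.T @ Y) / n                                             # sample covariance of the projected data
print(np.trace(S), np.trace(S_star))                               # the two traces should be close
```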

4. FCM Clustering with Random Projection and an Efficient Cluster Ensemble Approach

4.1. FCM Clustering via Random Projection. According to the results of Section 3, we design an improved FCM clustering algorithm with random projection for dimensionality reduction. The procedure of the new algorithm is shown in Algorithm 2.

Algorithm 2 reduces the dimensionality of the input data by multiplying it with a random matrix. Compared with the $O(cnd^2)$ time for running each iteration of the original FCM clustering, the new algorithm needs $O(cn(\varepsilon^{-2}\ln n)^2)$ time per iteration. Thus the time complexity of the new algorithm decreases markedly for high dimensional data whenever $\varepsilon^{-2}\ln n \ll d$. Another common dimensionality reduction method is SVD. Compared with the $O(d^3 + nd^2)$ time of running SVD on the data matrix $X$, the new algorithm only needs $O(\varepsilon^{-2} d \ln n)$ time to generate the random matrix $R$. This indicates that random projection is a cost-effective dimensionality reduction method for the FCM clustering algorithm.

4.2. Ensemble Approach Based on Graph Partition. As different random projections may result in different clustering solutions [20], it is attractive to design a cluster ensemble framework with random projection for improved and robust clustering performance. Although it uses less memory and runs faster than the ensemble method in [19], the cluster ensemble algorithm in [21] still needs the product of the concatenated partition matrix for crisp grouping, which leads to high time and space costs in the setting of big data.

In this section we propose a more efficient and effective aggregation method for multiple FCM clustering results. An overview of our new ensemble approach is presented in Figure 1.


Input: data set $X$ (an $n \times d$ matrix), number of clusters $c$, fuzzy constant $m$, FCM clustering algorithm.
Output: partition matrix $U$, centers of clusters $V$.
(1) Sample a $d \times t$ $(t \le d,\ t = \Omega(\varepsilon^{-2}\ln n))$ random projection matrix $R$ meeting the requirements of Lemma 2.
(2) Compute the product $Y = (1/\sqrt{t})XR$.
(3) Run the FCM algorithm on $Y$; get the partition matrix $U$.
(4) Compute the centers of the clusters from the original data $X$ and $U$.

Algorithm 2: FCM clustering with random projection.
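A minimal sketch of Algorithm 2 follows, reusing the hypothetical `fcm()` and `random_projection()` helpers sketched earlier in this document; it is an illustration, not the authors' Matlab implementation.

```python
import numpy as np

def rp_fcm(X, c, t, m=2.0, rng=np.random.default_rng(0)):
    """FCM clustering with random projection (sketch of Algorithm 2)."""
    n, d = X.shape
    R = random_projection(d, t, "sign", rng)        # step (1): R as required by Lemma 2
    Y = X @ R / np.sqrt(t)                          # step (2): project the data
    U, _ = fcm(Y, c, m)                             # step (3): FCM on the projected data
    Um = U ** m
    V = (Um @ X) / Um.sum(axis=1, keepdims=True)    # step (4): centers from the original data
    return U, V
```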

Input: data set $X$ (an $n \times d$ matrix), number of clusters $c$, reduced dimension $t$, number of random projections $r$, FCM clustering algorithm.
Output: cluster label vector $\mathbf{u}$.
(1) At each iteration $i \in [1, r]$, run Algorithm 2; get the membership matrix $U_i \in \mathbb{R}^{c \times n}$.
(2) Concatenate the membership matrices: $U_{con} = [U_1^T, \ldots, U_r^T] \in \mathbb{R}^{n \times cr}$.
(3) Compute the first $c$ left singular vectors of $\widetilde{U}_{con}$, denoted by $A = [\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_c] \in \mathbb{R}^{n \times c}$, where $\widetilde{U}_{con} = U_{con}(r \cdot D)^{-1/2}$, $D$ is a diagonal matrix, and $d_{ii} = \sum_{j} u_{con,\,ji}$.
(4) Treat each row of $A$ as a data point and apply $k$-means to obtain the cluster label vector.

Algorithm 3: Cluster ensemble for FCM clustering with random projection.
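The sketch below illustrates steps (2)-(4) of Algorithm 3 (our illustration, not the authors' code); `U_list` is assumed to hold the $r$ membership matrices produced by Algorithm 2, and scikit-learn's KMeans stands in for the $k$-means step.

```python
import numpy as np
from sklearn.cluster import KMeans

def ensemble_labels(U_list, c, seed=0):
    """Aggregate r membership matrices (each c x n) into a crisp labeling (Algorithm 3)."""
    r = len(U_list)
    U_con = np.vstack(U_list).T                          # step (2): n x (c*r) concatenation
    d = U_con.sum(axis=0)                                # column sums d_ii of the diagonal matrix D
    U_tilde = U_con / np.sqrt(r * d)                     # normalization U_con (r*D)^(-1/2)
    left, _, _ = np.linalg.svd(U_tilde, full_matrices=False)
    A = left[:, :c]                                      # step (3): first c left singular vectors
    return KMeans(n_clusters=c, n_init=10, random_state=seed).fit_predict(A)   # step (4)
```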

The new ensemble method is based on the partition of a similarity graph. For each random projection a new data set is generated. After performing FCM clustering on the new data sets, membership matrices are output. The elements of a membership matrix are treated as the similarity measure between the points and the cluster centers. Through SVD on the concatenation of the membership matrices we obtain the spectral embedding of the data points efficiently. The detailed procedure of the new cluster ensemble approach is shown in Algorithm 3.

In step (3) of the procedure in Algorithm 3, the left singular vectors of $\widetilde{U}_{con}$ are equivalent to the eigenvectors of $\widetilde{U}_{con}\widetilde{U}_{con}^T$. This implies that we regard the matrix product as the construction of an affinity matrix of the data points. The method is motivated by research on landmark-based representation [25, 26]. In our approach we treat the cluster centers of each FCM clustering run as landmarks and the membership matrix as a landmark-based representation. Thus the concatenation of the membership matrices forms a combined landmark-based representation matrix. In this way the graph similarity matrix is computed as
$$W = \widetilde{U}_{con}\widetilde{U}_{con}^T, \quad (18)$$
whose spectral embedding can be created efficiently through step (3). To normalize the graph similarity matrix, we multiply $U_{con}$ by $(r \cdot D)^{-1/2}$; as a result, the degree matrix of $W$ is an identity matrix.

There are two perspectives that explain why our approach works. Considering the similarity measure defined by $u_{ij}$ in FCM clustering, Proposition 3 in [26] demonstrated that the singular vectors of $U_i$ converge to the eigenvectors of $W_s$ as $c$ converges to $n$, where $W_s$ is the affinity matrix generated in standard spectral clustering. As a result, the singular vectors of $U_{con}$ converge to the eigenvectors of the normalized affinity matrix $W_s$; thus our final output converges to that of standard spectral clustering as $c$ converges to $n$. The other explanation concerns the similarity measure defined by $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T \mathbf{x}_j$, where $\mathbf{x}_i$ and $\mathbf{x}_j$ are data points. We can treat each row of $U_{con}$ as a transformed data point. With this view, the affinity matrix obtained here is the same as that of standard spectral embedding, and our output is just the partition result of standard spectral clustering.

To facilitate the comparison of different ensemble methods for FCM clustering solutions with random projection, we denote the approach of [19] by EFCM-A (averaging the products of membership matrices), the algorithm of [21] by EFCM-C (concatenating the membership matrices), and our new method by EFCM-S (spectral clustering on the membership matrices). In the cluster ensemble phase, the main computations of the EFCM-A method are the multiplications of membership matrices. Similarly, the EFCM-C algorithm also needs the product of the concatenated membership matrix in order to get the crisp partition result. Thus both of the above methods need $O(n^2)$ space and $O(crn^2)$ time. However, the main computations of EFCM-S are the SVD of $\widetilde{U}_{con}$ and the $k$-means clustering of $A$. The overall space is $O(crn)$, the SVD time is $O((cr)^2 n)$, and the $k$-means clustering time is $O(lc^2 n)$, where $l$ is the iteration number of $k$-means. Therefore the computational complexity of EFCM-S is obviously decreased compared with those of EFCM-A and EFCM-C, considering that $cr \ll n$ and $l \ll n$ for large scale data sets.

5. Experiments

In this section we present the experimental evaluation of the new algorithms proposed in Section 4. We implemented the related algorithms in the Matlab computing environment and conducted our experiments on a Windows-based system with an Intel Core 3.6 GHz processor and 16 GB of RAM.

5.1. Data Sets and Parameter Settings. We conducted the experiments on synthetic and real data sets, both of which have relatively high dimensionality.

[Figure 1: Framework of the new ensemble approach based on graph partition. The original data set is mapped by $r$ random projections to $r$ generated data sets; FCM clustering on each yields a membership matrix, the membership matrices are combined into the consensus matrix, its first $c$ left singular vectors $A$ are computed, and $k$-means on $A$ gives the final result.]

The synthetic data set had 10000 data points with 1000 dimensions, generated from a mixture of 3 Gaussians with mixing proportions (0.25, 0.5, 0.25). The means of the components were $(2, 2, \ldots, 2)_{1000}$, $(0, 0, \ldots, 0)_{1000}$, and $(-2, -2, \ldots, -2)_{1000}$, and the standard deviations were $(1, 1, \ldots, 1)_{1000}$, $(2, 2, \ldots, 2)_{1000}$, and $(3, 3, \ldots, 3)_{1000}$. The real data set is the daily and sports activities data (ACT) published on the UCI machine learning repository (the ACT data set can be found at http://archive.ics.uci.edu/ml/datasets/Daily+and+Sports+Activities). These are data of 19 activities collected by 45 motion sensors over 5 minutes at a 25 Hz sampling frequency; each activity was performed by 8 subjects in their own styles. To obtain high dimensional data sets, we treated 1 minute and 5 seconds of activity data as an instance, respectively. As a result we obtained $760 \times 67500$ (ACT1) and $9120 \times 5625$ (ACT2) data matrices, whose rows are activity instances and whose columns are features.
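The synthetic set described above can be generated, for example, as follows; this is an illustrative sketch, since the authors' exact generation code is not given in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10000, 1000
sizes = [int(0.25 * n), int(0.5 * n), int(0.25 * n)]      # mixing proportions 0.25 / 0.5 / 0.25
means = [2.0, 0.0, -2.0]                                   # component means, replicated over d dims
stds = [1.0, 2.0, 3.0]                                     # component standard deviations
X = np.vstack([rng.normal(mu, sd, size=(sz, d)) for sz, mu, sd in zip(sizes, means, stds)])
labels = np.repeat([0, 1, 2], sizes)                       # ground-truth component labels
```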

For the parameters of FCM clustering, we let $\varepsilon = 10^{-5}$, the maximum iteration number be 100, the fuzzy factor $m$ be 2, and the number of clusters be $c = 3$ for the synthetic data set and $c = 19$ for the ACT data sets. We also normalized the objective function as $obj^{*} = obj / \|X\|_F^2$, where $\|\cdot\|_F$ is the Frobenius norm of a matrix [27]. To minimize the influence of different initializations, we report the average values of the evaluation indices over 20 independent experiments.

In order to compare different dimensionality reduction methods for FCM clustering, we initialized the algorithms by choosing $c$ points randomly as the cluster centers and made sure that every algorithm began with the same initialization. In addition, we ran Algorithm 2 with $t = 10, 20, \ldots, 100$ for the synthetic data set and $t = 100, 200, \ldots, 1000$ for the ACT1 data set. The two kinds of random projections (with random variables from (5) in Lemma 2) were both tested to verify their feasibility. We also compared Algorithm 2 against another popular dimensionality reduction method, SVD. Note that the number of eigenvectors corresponding to nonzero eigenvalues of the ACT1 data is only 760, so we only took $t = 100, 200, \ldots, 700$ for FCM clustering with SVD on the ACT1 data set.

For the comparison of the different cluster ensemble algorithms, we set the dimensionality of the projected data as $t = 10, 20, \ldots, 100$ for both the synthetic and ACT2 data sets. In order to meet $cr \ll n$ for Algorithm 3, the number of random projections $r$ was set to 20 for the synthetic data set and 5 for the ACT2 data set, respectively.

5.2. Evaluation Criteria. For clustering algorithms, clustering validation and running time are two important indices for judging performance. Clustering validation measures evaluate the goodness of clustering results [28] and can be divided into two categories: external clustering validation and internal clustering validation. External validation measures use external information, such as the given class labels, to evaluate the goodness of the solution output by a clustering algorithm. On the contrary, internal measures evaluate the clustering results using only features inherited from the data sets. In this paper, the validity criteria used are the Rand index and the clustering validation index based on nearest neighbors for crisp partitions, together with the fuzzy Rand index and the Xie-Beni index for fuzzy partitions. Here the Rand index and the fuzzy Rand index are external validation measures, whereas the clustering validation index based on nearest neighbors and the Xie-Beni index are internal validation measures.

(1) Rand Index (RI) [29]. RI describes the similarity of the clustering solution and the correct labels through pairs of points. It takes into account the numbers of point pairs that are in the same and in different clusters. The RI is defined as
$$\mathrm{RI} = \frac{n_{11} + n_{00}}{C_n^2}, \quad (19)$$
where $n_{11}$ is the number of pairs of points that are in the same cluster in both the clustering result and the given class labels, $n_{00}$ is the number of pairs of points that are in different clusters in both the clustering result and the given class labels, and $C_n^2$ equals $n(n-1)/2$. The value of RI ranges from 0 to 1, and a higher value implies a better clustering solution.
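A direct ($O(n^2)$-memory) sketch of RI in (19) follows; it is an illustration, not the evaluation code used in the paper.

```python
import numpy as np

def rand_index(pred, truth):
    """Rand index between two integer label vectors of length n."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    n = len(pred)
    same_pred = pred[:, None] == pred[None, :]
    same_truth = truth[:, None] == truth[None, :]
    iu = np.triu_indices(n, k=1)                       # all C(n, 2) unordered pairs
    n11 = np.sum(same_pred[iu] & same_truth[iu])       # together in both partitions
    n00 = np.sum(~same_pred[iu] & ~same_truth[iu])     # separated in both partitions
    return (n11 + n00) / (n * (n - 1) / 2)
```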

(2) Fuzzy Rand Index (FRI) [30]. FRI is a generalization of RI to soft partitions. It also measures the proportion of pairs of points that are in the same and in different clusters in both the clustering solution and the true class labels; the analogous $n_{11}$ and $n_{00}$ are computed through the contingency table described in [30]. Therefore the range of FRI is also $[0, 1]$, and a larger value means a more accurate clustering solution.

(3) Xie-Beni Index (XB) [31]. XB takes the minimum squared distance between cluster centers as the separation of the partition and the average squared fuzzy deviation of the data points as the compactness of the partition. XB is calculated as follows:
$$\mathrm{XB} = \frac{\sum_{i=1}^{c}\sum_{j=1}^{n} u_{ij}^m \|\mathbf{x}_j - \mathbf{v}_i\|^2}{n \cdot \min_{i \ne j} \|\mathbf{v}_i - \mathbf{v}_j\|^2}, \quad (20)$$
where $\sum_{i=1}^{c}\sum_{j=1}^{n} u_{ij}^m \|\mathbf{x}_j - \mathbf{v}_i\|^2$ is just the objective function of FCM clustering and $\mathbf{v}_i$ is the center of cluster $i$. The smallest XB indicates the optimal cluster partition.
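A sketch of XB in (20), assuming `U` is the $c \times n$ membership matrix, `V` the $c \times d$ center matrix, and `X` the $n \times d$ data; it is an illustration, not the paper's evaluation code.

```python
import numpy as np

def xie_beni(X, U, V, m=2.0):
    """Xie-Beni index: compactness (FCM objective) over n times the minimum center separation."""
    dist2 = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2) ** 2      # c x n squared distances
    compactness = np.sum((U ** m) * dist2)                                  # FCM objective value
    centers2 = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=2) ** 2   # c x c center distances
    np.fill_diagonal(centers2, np.inf)                                      # exclude i == j
    return compactness / (X.shape[0] * centers2.min())
```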

(4) Clustering Validation Index Based on Nearest Neighbors (CVNN) [32]. The separation term of CVNN reflects how the objects of each cluster are situated geometrically with respect to their nearest neighbors, and the compactness term is the mean pairwise distance between objects in the same cluster. CVNN is computed as follows:
$$\mathrm{CVNN}(c, k) = \frac{\mathrm{Sep}(c, k)}{\max_{c_{\min} \le c \le c_{\max}} \mathrm{Sep}(c, k)} + \frac{\mathrm{Com}(c)}{\max_{c_{\min} \le c \le c_{\max}} \mathrm{Com}(c)}, \quad (21)$$
where $\mathrm{Sep}(c, k) = \max_{i=1,2,\ldots,c} \left((1/n_i)\sum_{j=1}^{n_i} (q_j / k)\right)$ and $\mathrm{Com}(c) = \sum_{i=1}^{c} \left((2/(n_i(n_i - 1))) \sum_{x, y \in \mathrm{Clu}_i} d(x, y)\right)$. Here $c$ is the number of clusters in the partition result, $c_{\max}$ is the maximum cluster number given, $c_{\min}$ is the minimum cluster number given, $k$ is the number of nearest neighbors, $n_i$ is the number of objects in the $i$th cluster $\mathrm{Clu}_i$, $q_j$ denotes the number of nearest neighbors of $\mathrm{Clu}_i$'s $j$th object that are not in $\mathrm{Clu}_i$, and $d(x, y)$ denotes the distance between $x$ and $y$. A lower CVNN value indicates a better clustering solution.
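A sketch of the Sep and Com terms in (21) for a single partition (our own illustration, not the paper's code); in (21) each term is further normalized by its maximum over the candidate cluster numbers.

```python
import numpy as np
from scipy.spatial.distance import cdist

def cvnn_terms(X, labels, k=10):
    """Return Sep(c, k) and Com(c) for one partition, before the normalization used in (21)."""
    dist = cdist(X, X)
    nn = np.argsort(dist, axis=1)[:, 1:k + 1]            # k nearest neighbors of each object
    sep_per_cluster, com_total = [], 0.0
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        n_i = len(idx)
        # average fraction of each member's k nearest neighbors lying outside the cluster
        outside = np.mean([np.mean(~np.isin(nn[j], idx)) for j in idx])
        sep_per_cluster.append(outside)
        intra = dist[np.ix_(idx, idx)]
        com_total += intra.sum() / (n_i * (n_i - 1)) if n_i > 1 else 0.0
    return max(sep_per_cluster), com_total               # Sep(c, k) and Com(c)
```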

The objective function itself is a special validity criterion for the FCM clustering algorithm: a smaller objective function indicates that the points inside the clusters are more "similar".

Running time is also an important evaluation criterion, often related to the scalability of an algorithm. One main goal of random projection for dimensionality reduction is to decrease the runtime and enhance the applicability of the algorithm in the context of big data.

5.3. Performance of FCM Clustering with Random Projection. The experimental results for FCM clustering with random projection are presented in Figure 2, where (a), (c), (e), and (g) correspond to the synthetic data set and (b), (d), (f), and (h) correspond to the ACT1 data set. The evaluation criteria used to assess the proposed algorithms are FRI ((a) and (b)), XB ((c) and (d)), objective function ((e) and (f)), and running time ((g) and (h)). "SignRP" denotes the proposed algorithm with a random sign matrix, "GaussRP" denotes FCM clustering with a random Gaussian matrix, "FCM" denotes the original FCM clustering algorithm, and "SVD" denotes FCM clustering with dimensionality reduction through SVD. It should be noted that the true XB value of FCM clustering in subfigure (d) is 4.03e+12, not 0.

From Figure 2 we can see that FCM clustering with random projection is clearly more efficient than the original FCM clustering. When the number of dimensions $t$ is above a certain bound, the validity indices are nearly stable and similar to those of the plain FCM clustering for both data sets. This verifies the conclusion that the accuracy of the clustering algorithm can be preserved once the dimensionality exceeds a certain bound. The effectiveness of the random projection method is also supported by how small this bound is compared to the total number of dimensions (30 out of 1000 for the synthetic data and 300 out of 67500 for the ACT1 data). Besides, the two different kinds of random projection have a similar impact on FCM clustering, as shown by their analogous plots.

The higher objective function values and the smaller XB indices of the SVD method on the synthetic data set indicate that the generated clustering solution has a better degree of separation between clusters. The external cluster validation indices also verify that the SVD method produces better clustering results for the synthetic data. These observations suggest that the SVD method is more suitable for Gaussian mixture data sets than FCM clustering with random projection and plain FCM clustering.

Although the SVD method has a higher FRI for the synthetic data set, the random projection methods have analogous FRI values for the ACT1 data set and better objective function values for both data sets. In addition, the random projection approaches are obviously more efficient, as SVD needs time cubic in the dimensionality. Hence these observations indicate that our algorithm is quite encouraging in practice.

[Figure 2: Performance of the clustering algorithms with different dimensionality. Panels (a)/(b) show FRI, (c)/(d) XB, (e)/(f) the objective function, and (g)/(h) the running time versus the number of dimensions $t$, for the synthetic data set ((a), (c), (e), (g)) and the ACT1 data set ((b), (d), (f), (h)); each panel compares SVD, FCM, GaussRP, and SignRP.]

Table 1: CVNN indices for different ensemble approaches on ACT2 data.

Dimension $t$ | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100
EFCM-A | 1.7315 | 1.7383 | 1.7449 | 1.7789 | 1.819 | 1.83 | 1.7623 | 1.8182 | 1.8685 | 1.8067
EFCM-C | 1.7938 | 1.7558 | 1.7584 | 1.8351 | 1.8088 | 1.8353 | 1.8247 | 1.8385 | 1.8105 | 1.8381
EFCM-S | 1.3975 | 1.3144 | 1.2736 | 1.2974 | 1.3112 | 1.3643 | 1.3533 | 1.409 | 1.3701 | 1.3765

5.4. Comparisons of Different Cluster Ensemble Methods. The comparisons of the different cluster ensemble approaches are shown in Figure 3 and Table 1. As before, (a) and (c) of the figure correspond to the synthetic data set and (b) and (d) correspond to the ACT2 data set. We use RI ((a) and (b)) and running time ((c) and (d)) to present the performance of the ensemble methods; the meanings of EFCM-A, EFCM-C, and EFCM-S are identical to those in Section 4.2. In order to get crisp partitions for EFCM-A and EFCM-C, we used the hierarchical clustering (complete linkage) method after obtaining the distance matrix, as in [21]. Since all three cluster ensemble methods produce perfect partition results on the synthetic data set, we compare the CVNN indices of the different ensemble methods only on the ACT2 data set, as presented in Table 1.

In Figure 3, the running time of our algorithm is shorter for both data sets. This verifies the time complexity analysis of the different algorithms in Section 4.2. The three cluster ensemble methods all obtain the perfect partition for the synthetic data set, whereas our method is more accurate than the other two methods for the ACT2 data set. The perfect partition results suggest that all three ensemble methods are suitable for Gaussian mixture data sets. However, the almost 18% improvement in RI for the ACT2 data set should be attributed to the different grouping ideas. Our method is based on a graph partition such that the edges between different clusters have low weight and the edges within a cluster have high weight; this clustering style of spectral embedding is more suitable for the ACT2 data set. In Table 1, the smaller CVNN values of our new method also show that the new approach produces better partition results on the ACT2 data set. These observations indicate that our algorithm has an advantage in efficiency and adapts to a wider range of geometries.

We also compare the stability of the three ensemble methods, as presented in Table 2. From the table we can see that the standard deviation of RI for EFCM-S is an order of magnitude lower than those of the other methods. Hence this result shows that our algorithm is more robust.

For the situation where the number of clusters is unknown, we also varied the number of clusters $c$ in the FCM clustering and in the spectral embedding of our new method. We denote this version of the new method by EFCM-SV. Since the number of random projections was set to 5 for the ACT2 data set, we varied the cluster number from 17 to 21 as the input of the FCM clustering algorithm. In addition, we set the cluster number from 14 to 24 as the input of the spectral embedding and applied CVNN to estimate the most plausible number of clusters. The experimental results are presented in Table 3.


Table 2: Standard deviations of RI over 20 runs with different dimensions on ACT2 data.

Dimension $t$ | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100
EFCM-A | 0.0222 | 0.0174 | 0.018 | 0.0257 | 0.0171 | 0.0251 | 0.0188 | 0.0172 | 0.0218 | 0.0184
EFCM-C | 0.0217 | 0.0189 | 0.0128 | 0.0232 | 0.0192 | 0.0200 | 0.0175 | 0.0194 | 0.0151 | 0.0214
EFCM-S | 0.0044 | 0.0018 | 0.0029 | 0.0030 | 0.0028 | 0.0024 | 0.0026 | 0.0020 | 0.0024 | 0.0019

Table 3: RI values for EFCM-S and EFCM-SV on ACT2 data.

Dimension $t$ | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100
EFCM-S | 0.9227 | 0.922 | 0.9223 | 0.923 | 0.9215 | 0.9218 | 0.9226 | 0.9225 | 0.9231 | 0.9237
EFCM-SV | 0.9257 | 0.9257 | 0.9165 | 0.9257 | 0.927 | 0.9165 | 0.9268 | 0.927 | 0.9105 | 0.9245
+CVNN | $c$ = 18.5 | $c$ = 20.7 | $c$ = 19.4 | $c$ = 19.3 | $c$ = 19.3 | $c$ = 18.2 | $c$ = 19.2 | $c$ = 18.3 | $c$ = 19.4 | $c$ = 20.2

[Figure 3: Performance of the cluster ensemble approaches with different dimensionality. Panels (a)/(b) show RI and (c)/(d) the running time versus the number of dimensions $t$, for the synthetic data set ((a), (c)) and the ACT2 data set ((b), (d)); each panel compares EFCM-A, EFCM-C, and EFCM-S.]


In Table 3, the values for "EFCM-SV" are the average RI values obtained with the estimated cluster numbers over 20 individual runs. The values in the "+CVNN" row are the average cluster numbers decided by the CVNN cluster validity index. Using the cluster numbers estimated by CVNN, our method obtains results similar to those of the ensemble method with the correct cluster number. In addition, the average estimates of the cluster number are close to the true one. This indicates that our cluster ensemble method EFCM-SV is attractive when the number of clusters is unknown.

6. Conclusion and Future Work

The "curse of dimensionality" in big data has recently posed new challenges for clustering, and feature extraction for dimensionality reduction is a popular way to deal with these challenges. We studied random projection as a feature extraction method for FCM clustering. By analyzing the effect of random projection on the entire variability of the data theoretically, and verifying it empirically on both synthetic and real world data, we designed an enhanced FCM clustering algorithm with random projection. The new algorithm maintains nearly the same clustering solution as the preliminary FCM clustering and is more efficient than the feature extraction method based on SVD. Furthermore, we proposed a cluster ensemble approach that is more applicable to large scale data sets than existing ones. The new ensemble approach obtains the spectral embedding efficiently from the SVD of the concatenation of membership matrices. The experiments showed that the new ensemble method runs faster, produces more robust partition solutions, and fits a wider range of geometrical data sets.

A direction for future research is to design provably accurate feature extraction and feature selection methods for FCM clustering. Another remaining question is how to choose the proper number of random projections for the cluster ensemble method in order to obtain a trade-off between clustering accuracy and efficiency.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

This work was supported in part by the National Key Basic Research Program (973 Program) under Grant 2012CB315905, in part by the National Nature Science Foundation of China under Grants 61502527 and 61379150, and in part by the Open Foundation of the State Key Laboratory of Networking and Switching Technology (Beijing University of Posts and Telecommunications) (no. SKLNST-2013-1-06).

References

[1] M. Chen, S. Mao, and Y. Liu, "Big data: a survey," Mobile Networks and Applications, vol. 19, no. 2, pp. 171–209, 2014.

[2] J. Zhang, X. Tao, and H. Wang, "Outlier detection from large distributed databases," World Wide Web, vol. 17, no. 4, pp. 539–568, 2014.

[3] C. Ordonez, N. Mohanam, and C. Garcia-Alvarado, "PCA for large data sets with parallel data summarization," Distributed and Parallel Databases, vol. 32, no. 3, pp. 377–403, 2014.

[4] D.-S. Pham, S. Venkatesh, M. Lazarescu, and S. Budhaditya, "Anomaly detection in large-scale data stream networks," Data Mining and Knowledge Discovery, vol. 28, no. 1, pp. 145–189, 2014.

[5] F. Murtagh and P. Contreras, "Random projection towards the Baire metric for high dimensional clustering," in Statistical Learning and Data Sciences, pp. 424–431, Springer, Berlin, Germany, 2015.

[6] T. C. Havens, J. C. Bezdek, C. Leckie, L. O. Hall, and M. Palaniswami, "Fuzzy c-means algorithms for very large data," IEEE Transactions on Fuzzy Systems, vol. 20, no. 6, pp. 1130–1146, 2012.

[7] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Elsevier, 2011.

[8] S. Khan, G. Situ, K. Decker, and C. J. Schmidt, "GoFigure: automated gene ontology annotation," Bioinformatics, vol. 19, no. 18, pp. 2484–2485, 2003.

[9] S. Gunnemann, H. Kremer, D. Lenhard, and T. Seidl, "Subspace clustering for indexing high dimensional data: a main memory index based on local reductions and individual multi-representations," in Proceedings of the 14th International Conference on Extending Database Technology (EDBT '11), pp. 237–248, ACM, Uppsala, Sweden, March 2011.

[10] J. C. Bezdek, R. Ehrlich, and W. Full, "FCM: the fuzzy c-means clustering algorithm," Computers & Geosciences, vol. 10, no. 2-3, pp. 191–203, 1984.

[11] R. J. Hathaway and J. C. Bezdek, "Extending fuzzy and probabilistic clustering to very large data sets," Computational Statistics & Data Analysis, vol. 51, no. 1, pp. 215–234, 2006.

[12] P. Hore, L. O. Hall, and D. B. Goldgof, "Single pass fuzzy c means," in Proceedings of the IEEE International Fuzzy Systems Conference (FUZZ '07), pp. 1–7, London, UK, July 2007.

[13] P. Hore, L. O. Hall, D. B. Goldgof, Y. Gu, A. A. Maudsley, and A. Darkazanli, "A scalable framework for segmenting magnetic resonance images," Journal of Signal Processing Systems, vol. 54, no. 1–3, pp. 183–203, 2009.

[14] W. B. Johnson and J. Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space," Contemporary Mathematics, vol. 26, pp. 189–206, 1984.

[15] P. Indyk and R. Motwani, "Approximate nearest neighbors: towards removing the curse of dimensionality," in Proceedings of the 13th Annual ACM Symposium on Theory of Computing, pp. 604–613, ACM, 1998.

[16] D. Achlioptas, "Database-friendly random projections: Johnson-Lindenstrauss with binary coins," Journal of Computer and System Sciences, vol. 66, no. 4, pp. 671–687, 2003.

[17] C. Boutsidis, A. Zouzias, and P. Drineas, "Random projections for k-means clustering," in Advances in Neural Information Processing Systems, pp. 298–306, MIT Press, 2010.

[18] C. C. Aggarwal and C. K. Reddy, Data Clustering: Algorithms and Applications, CRC Press, New York, NY, USA, 2013.

[19] R. Avogadri and G. Valentini, "Fuzzy ensemble clustering based on random projections for DNA microarray data analysis," Artificial Intelligence in Medicine, vol. 45, no. 2-3, pp. 173–183, 2009.

[20] X. Z. Fern and C. E. Brodley, "Random projection for high dimensional data clustering: a cluster ensemble approach," in Proceedings of the 20th International Conference on Machine Learning (ICML '03), vol. 3, pp. 186–193, August 2003.

[21] M. Popescu, J. Keller, J. Bezdek, and A. Zare, "Random projections fuzzy c-means (RPFCM) for big data clustering," in Proceedings of the IEEE International Conference on Fuzzy Systems (FUZZ-IEEE '15), pp. 1–6, Istanbul, Turkey, August 2015.

[22] A. Fahad, N. Alshatri, Z. Tari et al., "A survey of clustering algorithms for big data: taxonomy and empirical analysis," IEEE Transactions on Emerging Topics in Computing, vol. 2, no. 3, pp. 267–279, 2014.

[23] R. A. Johnson and D. W. Wichern, Applied Multivariate Statistical Analysis, vol. 4, Pearson Prentice Hall, Upper Saddle River, NJ, USA, 6th edition, 2007.

[24] C. Boutsidis, A. Zouzias, M. W. Mahoney, and P. Drineas, "Randomized dimensionality reduction for k-means clustering," IEEE Transactions on Information Theory, vol. 61, no. 2, pp. 1045–1062, 2015.

[25] X. Chen and D. Cai, "Large scale spectral clustering with landmark-based representation," in Proceedings of the 25th AAAI Conference on Artificial Intelligence, pp. 313–318, 2011.

[26] D. Cai and X. Chen, "Large scale spectral clustering via landmark-based sparse representation," IEEE Transactions on Cybernetics, vol. 45, no. 8, pp. 1669–1680, 2015.

[27] G. H. Golub and C. F. Van Loan, Matrix Computations, vol. 3, JHU Press, 2012.

[28] U. Maulik and S. Bandyopadhyay, "Performance evaluation of some clustering algorithms and validity indices," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 12, pp. 1650–1654, 2002.

[29] W. M. Rand, "Objective criteria for the evaluation of clustering methods," Journal of the American Statistical Association, vol. 66, no. 336, pp. 846–850, 1971.

[30] D. T. Anderson, J. C. Bezdek, M. Popescu, and J. M. Keller, "Comparing fuzzy, probabilistic, and possibilistic partitions," IEEE Transactions on Fuzzy Systems, vol. 18, no. 5, pp. 906–918, 2010.

[31] X. L. Xie and G. Beni, "A validity measure for fuzzy clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 8, pp. 841–847, 1991.

[32] Y. Liu, Z. Li, H. Xiong, X. Gao, J. Wu, and S. Wu, "Understanding and enhancement of internal clustering validation measures," IEEE Transactions on Cybernetics, vol. 43, no. 3, pp. 982–994, 2013.


Page 4: Research Article Fuzzy -Means and Cluster Ensemble with Random Projection for Big Data ...downloads.hindawi.com/journals/mpe/2016/6529794.pdf · 2019. 7. 30. · Research Article

4 Mathematical Problems in Engineering

119877119894119895= radic3 times

+1 with probablity 16

0 with probablity 23

minus1 with probablity 16

(6)

Let 119891 R119889 rarr R119905 with 119891(x119894) = (1radic119905)x

119894R For any u k isin X

with probability at least 1 minus 119899minus120573 it holds that

(1 minus 120576) u minus k22le1003817100381710038171003817119891 (u) minus 119891 (k)100381710038171003817

1003817

2

2le (1 + 120576) u minus k2

2 (7)

Lemma 2 implies that if the number of dimensionsof data reduced by random projection is bigger than acertain bound then pairwise Euclidean distance squares arepreserved within a multiplicative factor of 1 plusmn 120576

With the above properties researchers have checked thefeasibility of applying random projection to 119896-means clus-tering in terms of theory and experiment [17 24] Howeveras membership degrees for FCM clustering and 119896-meansclustering are defined differently the analysis method cannot be directly used for assessing the effect of randomprojection on FCM clustering Motivated by the idea ofprincipal component analysis we draw the conclusion thatthe compressed data gains the whole variability of originaldata in probabilistic sense based on the analysis of the vari-ance difference Besides variables referring to dimensions ofprojected data are linear independent As a result we canachieve dimensionality reduction via replacing original databy compressed data as ldquoprincipal componentsrdquo

Next we give a useful lemma for proof of the subsequenttheorem

Lemma 3 Let 120585119894(1 le 119894 le 119899) be independently distributed

randomvariables fromone of the three probability distributionsdescribed in Lemma 2 then

Pr lim119899rarrinfin

1

119899

119899

sum

119894=1

1205852

119894= 1 = 1 (8)

Proof According to the probability distribution of randomvariable 120585

119894 it is easy to know that

119864 (1205852

119894) = 1 (1 le 119894 le 119899)

119864(

1

119899

119899

sum

119894=1

1205852

119894) = 1

(9)

Then 1205852119894 obeys the law of large numbers namely

Pr lim119899rarrinfin

1

119899

119899

sum

119894=1

1205852

119894= 119864(

1

119899

119899

sum

119894=1

1205852

119894)

= Pr lim119899rarrinfin

1

119899

119899

sum

119894=1

1205852

119894= 1 = 1

(10)

Since centralization of data does not change the distanceof any two points and the FCM clustering algorithm is

based on pairwise distances to partition data points weassume that expectation of the data input is 0 In practicecovariancematrix of population is likely unknownThereforewe investigate the effect of random projection on variabilityof both population and sample

Theorem 4 Let data set X isin R119899times119889 be 119899 independentsamples of119889-dimensional randomvector (119883

1 1198832 119883

119889) and

S denotes the sample covariance matrix of X The randomprojection induced by random matrix R isin R119889times119905 mapsthe 119889-dimensional random vector to 119905-dimensional randomvector (119884

1 1198842 119884

119905) = (1radic119905)(119883

1 1198832 119883

119889) sdot R and Slowast

denotes the sample covariance matrix of projected data Ifelements of random matrix R obey distribution demanded byLemma 2 and are mutually independent with random vector(1198831 1198832 119883

119889) then

(1) dimensions of projected data are linearly independentcov(119884

119894 119884119895) = 0 forall119894 = 119895

(2) random projection maintains the whole variabilitysum119905

119894=1var(119884119894) = sum

119889

119894=1var(119883

119894) when 119905 rarr infin with

probability 1 tr(Slowast) = tr(S)

Proof It is easy to know that the expectation of any element ofrandommatrix 119864(119877

119894119895) = 0 1 le 119894 le 119889 1 le 119895 le 119905 As elements

of random matrix R and random vector (1198831 1198832 119883

119889)

are mutually independent the covariance of random vectorinduced by random projection is

cov (119884119894 119884119895) = cov( 1

radic119905

sdot

119889

sum

119896=1

119883119896sdot 119877119896119894

1

radic119905

sdot

119889

sum

119897=1

119883119897sdot 119877119897119895)

=

1

119905

sdot

119889

sum

119896=1

119889

sum

119897=1

cov (119883119896sdot 119877119896119894 119883119897sdot 119877119897119895)

=

1

119905

sdot

119889

sum

119896=1

119889

sum

119897=1

119864 (119883119896sdot 119877119896119894sdot 119883119897sdot 119877119897119895) minus

1

119905

sdot

119889

sum

119896=1

119889

sum

119897=1

(119864 (119883119896sdot 119877119896119894) sdot 119864 (119883

119897sdot 119877119897119895))

=

1

119905

sdot

119889

sum

119896=1

119889

sum

119897=1

119864 (119883119896sdot 119877119896119894sdot 119883119897sdot 119877119897119895)

=

1

119905

sdot

119889

sum

119896=1

119889

sum

119897=1

(119864 (119883119896sdot 119883119897) sdot 119864 (119877

119896119894sdot 119877119897119895))

=

1

119905

sdot

119889

sum

119896=1

(119864 (1198832

119896) sdot 119864 (119877

119896119894sdot 119877119896119895))

(11)

(1) If 119894 = 119895 then

cov (119884119894 119884119895) =

1

119905

sdot (

119889

sum

119896=1

119864 (1198832

119896) sdot 119864 (119877

119896119894) sdot 119864 (119877

119897119895))

= 0

(12)

Mathematical Problems in Engineering 5

(2) If $i = j$, then

$$\operatorname{cov}(Y_i, Y_i) = \operatorname{var}(Y_i) = \frac{1}{t}\Bigl(\sum_{k=1}^{d}E(X_k^2)\,E(R_{ki}^2)\Bigr) = \frac{1}{t}\sum_{k=1}^{d}E(X_k^2). \tag{13}$$

Thus, by the assumption $E(X_i) = 0$ $(1 \le i \le d)$, we can get

$$\sum_{i=1}^{t}\operatorname{var}(Y_i) = \sum_{i=1}^{d}\operatorname{var}(X_i). \tag{14}$$

We denote the spectral decomposition of the sample covariance matrix $\mathbf{S}$ by $\mathbf{S} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^{T}$, where $\mathbf{V}$ is the matrix of eigenvectors and $\boldsymbol{\Lambda}$ is a diagonal matrix whose diagonal elements are $\lambda_1, \lambda_2, \ldots, \lambda_d$ with $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$. Supposing the data samples have been centralized, namely, their means are zeros, we can get the covariance matrix $\mathbf{S} = (1/n)\mathbf{X}^{T}\mathbf{X}$. For convenience, we still denote a sample of the random matrix by $\mathbf{R}$. Thus, the projected data are $\mathbf{Y} = (1/\sqrt{t})\mathbf{X}\mathbf{R}$, and the sample covariance matrix of the projected data is $\mathbf{S}^{*} = (1/n)((1/\sqrt{t})\mathbf{X}\mathbf{R})^{T}((1/\sqrt{t})\mathbf{X}\mathbf{R}) = (1/t)\mathbf{R}^{T}\mathbf{S}\mathbf{R}$. Then we can get

$$\operatorname{tr}(\mathbf{S}^{*}) = \operatorname{tr}\Bigl(\frac{1}{t}\mathbf{R}^{T}\mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^{T}\mathbf{R}\Bigr) = \operatorname{tr}\Bigl(\frac{1}{t}\mathbf{R}^{T}\boldsymbol{\Lambda}\mathbf{V}\mathbf{V}^{T}\mathbf{R}\Bigr) = \operatorname{tr}\Bigl(\frac{1}{t}\mathbf{R}^{T}\boldsymbol{\Lambda}\mathbf{R}\Bigr) = \sum_{i=1}^{d}\lambda_i\Bigl(\frac{1}{t}\sum_{j=1}^{t}r_{ij}^2\Bigr), \tag{15}$$

where $r_{ij}$ $(1 \le i \le d,\ 1 \le j \le t)$ is a sample of an element of the random matrix $\mathbf{R}$. In practice, the spectrum of a covariance matrix often displays a distinct decay after a few large eigenvalues. So we assume that there exist an integer $p$ and a limited constant $q > 0$ such that, for all $i > p$, it holds that $\lambda_i \le q$. Then

$$\begin{aligned}
\bigl|\operatorname{tr}(\mathbf{S}^{*}) - \operatorname{tr}(\mathbf{S})\bigr| &= \Biggl|\sum_{i=1}^{d}\lambda_i\Bigl(\frac{1}{t}\sum_{j=1}^{t}r_{ij}^2 - 1\Bigr)\Biggr| \\
&\le \Biggl|\sum_{i=1}^{p}\lambda_i\Bigl(\frac{1}{t}\sum_{j=1}^{t}r_{ij}^2 - 1\Bigr)\Biggr| + \Biggl|\sum_{i=p+1}^{d}\lambda_i\Bigl(\frac{1}{t}\sum_{j=1}^{t}r_{ij}^2 - 1\Bigr)\Biggr| \\
&\le \Biggl|\sum_{i=1}^{p}\lambda_i\cdot\frac{1}{t}\sum_{j=1}^{t}\bigl(r_{ij}^2 - 1\bigr)\Biggr| + q\,\Biggl|\sum_{i=p+1}^{d}\Bigl(\frac{1}{t}\sum_{j=1}^{t}\bigl(r_{ij}^2 - 1\bigr)\Bigr)\Biggr|.
\end{aligned} \tag{16}$$

By Lemma 3, with probability 1,

$$\lim_{t\to\infty}\Bigl(\frac{1}{t}\sum_{j=1}^{t}\bigl(r_{ij}^2 - 1\bigr)\Bigr) = 0, \qquad \lim_{t\to\infty}\sum_{i=p+1}^{d}\Bigl(\frac{1}{t}\sum_{j=1}^{t}\bigl(r_{ij}^2 - 1\bigr)\Bigr) = 0. \tag{17}$$

Combining the above arguments, we achieve $\operatorname{tr}(\mathbf{S}^{*}) = \operatorname{tr}(\mathbf{S})$ with probability 1 when $t \to \infty$.

Part (1) of Theorem 4 indicates that the compressed data produced by random projection can carry much information with low dimensionality, owing to the linear independence of the reduced dimensions. Part (2) manifests that the sum of variances of the dimensions of the original data is consistent with that of the projected data; namely, random projection holds the variability of the primal data. Combining the results of Lemma 2 with those of Theorem 4, we consider that random projection can be employed to improve the efficiency of the FCM clustering algorithm with low dimensionality, and the modified algorithm can keep the accuracy of the partition approximately.
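As a quick numerical illustration of part (2) of Theorem 4 (our own sketch, not part of the original experiments), the following code projects a centered data matrix with a random sign matrix, one of the matrix types permitted by Lemma 2, and compares tr(S*) with tr(S); the two traces agree increasingly well as t grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 500
X = rng.normal(size=(n, d)) @ np.diag(np.linspace(3.0, 0.1, d))  # anisotropic toy data
X -= X.mean(axis=0)                      # centralize, so S = (1/n) X^T X

S_trace = np.trace(X.T @ X) / n          # tr(S) of the original data

for t in (10, 50, 200, 1000):
    R = rng.choice([-1.0, 1.0], size=(d, t))   # random sign projection matrix
    Y = X @ R / np.sqrt(t)                     # projected data, Y = (1/sqrt(t)) X R
    S_star_trace = np.trace(Y.T @ Y) / n       # tr(S*) of the projected data
    print(t, round(S_trace, 3), round(S_star_trace, 3))
```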

4. FCM Clustering with Random Projection and an Efficient Cluster Ensemble Approach

4.1. FCM Clustering via Random Projection. According to the results of Section 3, we design an improved FCM clustering algorithm with random projection for dimensionality reduction. The procedure of the new algorithm is shown in Algorithm 2.

Algorithm 2 reduces the dimensions of the input data by multiplying it with a random matrix. Compared with the $O(cnd^2)$ time for running each iteration in the original FCM clustering, the new algorithm implies an $O(cn(\varepsilon^{-2}\ln n)^2)$ time for each iteration. Thus, the time complexity of the new algorithm decreases obviously for high dimensional data in the case $\varepsilon^{-2}\ln n \ll d$. Another common dimensionality reduction method is SVD. Compared with the $O(d^3 + nd^2)$ time of running SVD on the data matrix $\mathbf{X}$, the new algorithm only needs $O(\varepsilon^{-2}d\ln n)$ time to generate the random matrix $\mathbf{R}$. This indicates that random projection is a cost-effective method of dimensionality reduction for the FCM clustering algorithm.
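A minimal NumPy sketch of Algorithm 2 follows. It is an illustration only: the helper names `fcm` and `rp_fcm` are ours, the FCM loop is a bare-bones implementation written for readability (the paper's experiments used a MATLAB implementation), and the random sign matrix is one admissible choice under Lemma 2.

```python
import numpy as np

def fcm(Y, c, m=2.0, max_iter=100, tol=1e-5, rng=None):
    """Bare-bones fuzzy c-means; returns membership matrix U (c x n) and centers V (c x dim)."""
    rng = rng or np.random.default_rng()
    n = Y.shape[0]
    V = Y[rng.choice(n, c, replace=False)]                      # random initial centers
    for _ in range(max_iter):
        dist = np.linalg.norm(Y[None, :, :] - V[:, None, :], axis=2) + 1e-12   # c x n
        U = 1.0 / (dist ** (2.0 / (m - 1.0)))
        U /= U.sum(axis=0, keepdims=True)                       # memberships sum to 1 per point
        V_new = (U ** m) @ Y / (U ** m).sum(axis=1, keepdims=True)
        if np.linalg.norm(V_new - V) < tol:
            V = V_new
            break
        V = V_new
    return U, V

def rp_fcm(X, c, t, m=2.0, rng=None):
    """Algorithm 2 sketch: project X to t dimensions with a random sign matrix, then run FCM."""
    rng = rng or np.random.default_rng()
    d = X.shape[1]
    R = rng.choice([-1.0, 1.0], size=(d, t))                    # step (1): random projection
    Y = X @ R / np.sqrt(t)                                      # step (2): Y = (1/sqrt(t)) X R
    U, _ = fcm(Y, c, m=m, rng=rng)                              # step (3): FCM on projected data
    V = (U ** m) @ X / (U ** m).sum(axis=1, keepdims=True)      # step (4): centers from original X
    return U, V
```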

4.2. Ensemble Approach Based on Graph Partition. As different random projections may result in different clustering solutions [20], it is attractive to design a cluster ensemble framework with random projection for improved and robust clustering performance. Although it uses smaller memory and runs faster than the ensemble method in [19], the cluster ensemble algorithm in [21] still needs the product of the concatenated partition matrix for crisp grouping, which leads to high time and space costs under the circumstances of big data.

In this section, we propose a more efficient and effective aggregation method for multiple FCM clustering results. The overview of our new ensemble approach is presented in Figure 1. The new ensemble method is based on partitioning a similarity graph.


Input: data set $\mathbf{X}$ (an $n \times d$ matrix); number of clusters $c$; fuzzy constant $m$; FCM clustering algorithm
Output: partition matrix $\mathbf{U}$; centers of clusters $\mathbf{V}$
(1) sample a $d \times t$ ($t \le d$, $t = \Omega(\varepsilon^{-2}\ln n)$) random projection $\mathbf{R}$ meeting the requirements of Lemma 2
(2) compute the product $\mathbf{Y} = (1/\sqrt{t})\mathbf{X}\mathbf{R}$
(3) run the FCM algorithm on $\mathbf{Y}$; get the partition matrix $\mathbf{U}$
(4) compute the centers of the clusters through the original data $\mathbf{X}$ and $\mathbf{U}$

Algorithm 2: FCM clustering with random projection.

Input: data set $\mathbf{X}$ (an $n \times d$ matrix); number of clusters $c$; reduced dimension $t$; number of random projections $r$; FCM clustering algorithm
Output: cluster label vector $\mathbf{u}$
(1) at each iteration $i \in [1, r]$, run Algorithm 2; get membership matrix $\mathbf{U}_i \in \mathbb{R}^{c\times n}$
(2) concatenate the membership matrices: $\mathbf{U}_{\mathrm{con}} = [\mathbf{U}_1^{T}, \ldots, \mathbf{U}_r^{T}] \in \mathbb{R}^{n\times cr}$
(3) compute the first $c$ left singular vectors of $\bar{\mathbf{U}}_{\mathrm{con}}$, denoted by $\mathbf{A} = [\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_c] \in \mathbb{R}^{n\times c}$, where $\bar{\mathbf{U}}_{\mathrm{con}} = \mathbf{U}_{\mathrm{con}}(r\cdot\mathbf{D})^{-1/2}$, $\mathbf{D}$ is a diagonal matrix, and $d_{ii} = \sum_{j} u_{\mathrm{con},ji}$
(4) treat each row of $\mathbf{A}$ as a data point and apply $k$-means to obtain the cluster label vector

Algorithm 3: Cluster ensemble for FCM clustering with random projection.
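The ensemble step of Algorithm 3 can be sketched as below, building on the `rp_fcm` sketch given after Section 4.1. This is an illustration under the same assumptions, not the authors' code; scikit-learn's KMeans is used purely as a stand-in for the final k-means step.

```python
import numpy as np
from sklearn.cluster import KMeans

def efcm_s(X, c, t, r, m=2.0, seed=0):
    """Algorithm 3 sketch (EFCM-S): ensemble of r random-projection FCM runs via spectral embedding."""
    rng = np.random.default_rng(seed)
    # steps (1)-(2): run Algorithm 2 r times and concatenate the transposed membership matrices
    U_con = np.hstack([rp_fcm(X, c, t, m=m, rng=rng)[0].T for _ in range(r)])   # n x (c*r)
    # step (3): column-normalize, U_bar = U_con (r D)^{-1/2}, with d_ii the column sums of U_con
    D = U_con.sum(axis=0)
    U_bar = U_con / np.sqrt(r * D)
    A, _, _ = np.linalg.svd(U_bar, full_matrices=False)    # left singular vectors, largest first
    A = A[:, :c]                                           # spectral embedding (n x c)
    # step (4): k-means on the rows of the embedding gives the final crisp labels
    return KMeans(n_clusters=c, n_init=10).fit_predict(A)
```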

For each random projection, a new data set is generated. After performing FCM clustering on the new data sets, membership matrices are output. The elements of a membership matrix are treated as the similarity measure between the points and the cluster centers. Through SVD on the concatenation of the membership matrices, we get the spectral embedding of the data points efficiently. The detailed procedure of the new cluster ensemble approach is shown in Algorithm 3.

In step (3) of the procedure in Algorithm 3, the left singular vectors of $\bar{\mathbf{U}}_{\mathrm{con}}$ are equivalent to the eigenvectors of $\bar{\mathbf{U}}_{\mathrm{con}}\bar{\mathbf{U}}_{\mathrm{con}}^{T}$. This implies that we regard the matrix product as the construction of an affinity matrix of the data points. The method is motivated by the research on landmark-based representation [25, 26]. In our approach, we treat the cluster centers of each FCM clustering run as landmarks and the membership matrix as the landmark-based representation. Thus, the concatenation of the membership matrices forms a combinational landmark-based representation matrix. In this way, the graph similarity matrix is computed as

$$\mathbf{W} = \bar{\mathbf{U}}_{\mathrm{con}}\bar{\mathbf{U}}_{\mathrm{con}}^{T}, \tag{18}$$

which can create the spectral embedding efficiently through step (3). To normalize the graph similarity matrix, we multiply $\mathbf{U}_{\mathrm{con}}$ by $(r\cdot\mathbf{D})^{-1/2}$. As a result, the degree matrix of $\mathbf{W}$ is an identity matrix.
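This normalization can be checked directly (our own verification): each row of a membership matrix sums to 1, so each row of $\mathbf{U}_{\mathrm{con}}$ sums to $r$, and $\mathbf{U}_{\mathrm{con}}^{T}\mathbf{1} = (d_{11}, d_{22}, \ldots)^{T}$ by the definition of $\mathbf{D}$. Hence

$$\mathbf{W}\mathbf{1} = \mathbf{U}_{\mathrm{con}}(r\mathbf{D})^{-1}\mathbf{U}_{\mathrm{con}}^{T}\mathbf{1} = \frac{1}{r}\,\mathbf{U}_{\mathrm{con}}\mathbf{1} = \frac{1}{r}\cdot r\,\mathbf{1} = \mathbf{1},$$

so every row of $\mathbf{W}$ sums to 1 and the degree matrix is indeed the identity.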

There are two perspectives to explain why our approach works. Considering the similarity measure defined by $u_{ij}$ in FCM clustering, Proposition 3 in [26] demonstrated that the singular vectors of $\mathbf{U}_i$ converge to the eigenvectors of $\mathbf{W}_s$ as $c$ converges to $n$, where $\mathbf{W}_s$ is the affinity matrix generated in standard spectral clustering. As a result, the singular vectors of $\mathbf{U}_{\mathrm{con}}$ converge to the eigenvectors of the normalized affinity matrix $\mathbf{W}_s$. Thus, our final output converges to that of standard spectral clustering as $c$ converges to $n$. Another explanation concerns the similarity measure defined by $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^{T}\mathbf{x}_j$, where $\mathbf{x}_i$ and $\mathbf{x}_j$ are data points. We can treat each row of $\mathbf{U}_{\mathrm{con}}$ as a transformed data point. As a result, the affinity matrix obtained here is the same as that of standard spectral embedding, and our output is just the partition result of standard spectral clustering.

To facilitate comparison of the different ensemble methods for FCM clustering solutions with random projection, we denote the approach of [19] by EFCM-A (average the products of membership matrices), the algorithm of [21] by EFCM-C (concatenate the membership matrices), and our new method by EFCM-S (spectral clustering on the membership matrices). In the cluster ensemble phase, the main computations of the EFCM-A method are multiplications of membership matrices. Similarly, the algorithm of EFCM-C also needs the product of the concatenated membership matrices in order to get the crisp partition result. Thus, the above methods both need $O(n^2)$ space and $O(crn^2)$ time. However, the main computation of EFCM-S is the SVD of $\bar{\mathbf{U}}_{\mathrm{con}}$ and the $k$-means clustering of $\mathbf{A}$. The overall space is $O(crn)$, the SVD time is $O((cr)^2 n)$, and the $k$-means clustering time is $lc^2 n$, where $l$ is the iteration number of $k$-means. Therefore, the computational complexity of EFCM-S is obviously decreased compared with those of EFCM-A and EFCM-C, considering that $cr \ll n$ and $l \ll n$ in large scale data sets.

5. Experiments

In this section, we present the experimental evaluations of the new algorithms proposed in Section 4. We implemented the related algorithms in the Matlab computing environment and conducted our experiments on a Windows-based system with an Intel Core 3.6 GHz processor and 16 GB of RAM.

5.1. Data Sets and Parameter Settings. We conducted the experiments on synthetic and real data sets, which both have relatively high dimensionality. The synthetic data


[Figure 1: Framework of the new ensemble approach based on graph partition. The original data set is mapped by random projections 1, 2, ..., r to generated data sets 1, 2, ..., r; FCM clustering on each generated data set outputs a membership matrix; the membership matrices form the consensus matrix, whose first c left singular vectors A are clustered by k-means to give the final result.]

set had 10000 data points with 1000 dimensions, which were generated from 3 Gaussian mixture components in proportions (0.25, 0.5, 0.25). The means of the components were $(2, 2, \ldots, 2)_{1000}$, $(0, 0, \ldots, 0)_{1000}$, and $(-2, -2, \ldots, -2)_{1000}$, and the standard deviations were $(1, 1, \ldots, 1)_{1000}$, $(2, 2, \ldots, 2)_{1000}$, and $(3, 3, \ldots, 3)_{1000}$. The real data set is the daily and sports activities data (ACT) published on the UCI machine learning repository (the ACT data set can be found at http://archive.ics.uci.edu/ml/datasets/Daily+and+Sports+Activities). These are data of 19 activities collected by 45 motion sensors over 5 minutes at a 25 Hz sampling frequency. Each activity was performed by 8 subjects in their own styles. To get high dimensional data sets, we treated 1 minute and 5 seconds of activity data as an instance, respectively. As a result, we got 760 × 67500 (ACT1) and 9120 × 5625 (ACT2) data matrices, whose rows were activity instances and whose columns were features.
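For concreteness, the synthetic set described above can be reproduced along the following lines (a sketch under the stated proportions, means, and standard deviations; the authors' exact sampling code is not given in the paper).

```python
import numpy as np

def make_synthetic(n=10000, d=1000, seed=0):
    """3-component Gaussian mixture: proportions (0.25, 0.5, 0.25), means 2 / 0 / -2, stds 1 / 2 / 3."""
    rng = np.random.default_rng(seed)
    sizes = (n // 4, n // 2, n // 4)
    means = (2.0, 0.0, -2.0)
    stds = (1.0, 2.0, 3.0)
    X = np.vstack([rng.normal(mu, sd, size=(sz, d)) for sz, mu, sd in zip(sizes, means, stds)])
    labels = np.repeat([0, 1, 2], sizes)
    return X, labels
```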

For the parameters of FCM clustering, we let $\varepsilon = 10^{-5}$, the maximum iteration number be 100, the fuzzy factor $m$ be 2, and the number of clusters be $c = 3$ for the synthetic data set and $c = 19$ for the ACT data sets. We also normalized the objective function as $\mathrm{obj}^{*} = \mathrm{obj}/\|\mathbf{X}\|_F^2$, where $\|\cdot\|_F$ is the Frobenius norm of a matrix [27]. To minimize the influence introduced by different initializations, we present the average values of the evaluation indices over 20 independent experiments.

In order to compare different dimensionality reduction methods for FCM clustering, we initialized the algorithms by choosing $c$ points randomly as the cluster centers and made sure that every algorithm began with the same initialization. In addition, we ran Algorithm 2 with $t = 10, 20, \ldots, 100$ for the synthetic data set and $t = 100, 200, \ldots, 1000$ for the ACT1 data set. Two kinds of random projections (with random variables from (5) in Lemma 2) were both tested to verify their feasibility. We also compared Algorithm 2 against another popular method of dimensionality reduction, SVD. What calls for special attention is that the number of eigenvectors corresponding to nonzero eigenvalues of the ACT1 data is only 760, so we only took $t = 100, 200, \ldots, 700$ for FCM clustering with SVD on the ACT1 data set.

Among the comparisons of different cluster ensemble algorithms, we set the dimension number of the projected data as $t = 10, 20, \ldots, 100$ for both the synthetic and ACT2 data sets. In order to meet $cr \ll n$ for Algorithm 3, the number of random projections $r$ was set as 20 for the synthetic data set and 5 for the ACT2 data set, respectively.

5.2. Evaluation Criteria. For clustering algorithms, clustering validation and running time are two important indices for judging their performance. Clustering validation measures evaluate the goodness of clustering results [28] and can often be divided into two categories: external clustering validation and internal clustering validation. External validation measures use external information such as the given class labels to evaluate the goodness of the solution output by a clustering


algorithm. On the contrary, internal measures evaluate the clustering results using features inherited from the data sets. In this paper, the validity evaluation criteria used are the rand index and the clustering validation index based on nearest neighbors for crisp partitions, together with the fuzzy rand index and the Xie-Beni index for fuzzy partitions. Here, the rand index and fuzzy rand index are external validation measures, whereas the clustering validation index based on nearest neighbors and the Xie-Beni index are internal validation measures.

(1) Rand Index (RI) [29]. RI describes the similarity of a clustering solution and the correct labels through pairs of points. It takes into account the numbers of point pairs that are in the same and in different clusters. The RI is defined as

$$\mathrm{RI} = \frac{n_{11} + n_{00}}{C_n^2}, \tag{19}$$

where $n_{11}$ is the number of pairs of points that exist in the same cluster in both the clustering result and the given class labels, $n_{00}$ is the number of pairs of points that are in different subsets for both the clustering result and the given class labels, and $C_n^2$ equals $n(n-1)/2$. The value of RI ranges from 0 to 1, and a higher value implies a better clustering solution.
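A direct pair-counting implementation of RI (our own helper for illustration, not code from the paper; a contingency-table version would be preferred for large n):

```python
from itertools import combinations

def rand_index(labels_pred, labels_true):
    """RI = (n11 + n00) / C(n, 2), counting pairs on which the two labelings agree."""
    n11 = n00 = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_pred = labels_pred[i] == labels_pred[j]
        same_true = labels_true[i] == labels_true[j]
        if same_pred and same_true:
            n11 += 1
        elif not same_pred and not same_true:
            n00 += 1
    n = len(labels_true)
    return (n11 + n00) / (n * (n - 1) / 2)
```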

(2) Fuzzy Rand Index (FRI) [30]. FRI is a generalization of RI with respect to soft partitions. It also measures the proportion of pairs of points which exist in the same and in different clusters in both the clustering solution and the true class labels. It needs to compute the analogous $n_{11}$ and $n_{00}$ through the contingency table described in [30]. Therefore, the range of FRI is also $[0, 1]$, and a larger value means a more accurate cluster solution.

(3) Xie-Beni Index (XB) [31]. XB takes the minimum square distance between cluster centers as the separation of the partition and the average square fuzzy deviation of the data points as the compactness of the partition. XB is calculated as follows:

$$\mathrm{XB} = \frac{\sum_{i=1}^{c}\sum_{j=1}^{n} u_{ij}^{m}\,\bigl\|\mathbf{x}_j - \mathbf{v}_i\bigr\|^2}{n\cdot\min_{i\neq j}\bigl\|\mathbf{v}_i - \mathbf{v}_j\bigr\|^2}, \tag{20}$$

where $\sum_{i=1}^{c}\sum_{j=1}^{n} u_{ij}^{m}\|\mathbf{x}_j - \mathbf{v}_i\|^2$ is just the objective function of FCM clustering and $\mathbf{v}_i$ is the center of cluster $i$. The smallest XB indicates the optimal cluster partition.
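A direct transcription of (20) as a sketch (our own helper; U is the c × n membership matrix, V the c × dim center matrix, X the n × dim data):

```python
import numpy as np

def xie_beni(X, U, V, m=2.0):
    """XB = FCM objective / (n * min squared distance between distinct centers); smaller is better."""
    dist2 = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2) ** 2   # c x n squared distances
    objective = np.sum((U ** m) * dist2)                                 # FCM objective function
    center_d2 = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=2) ** 2
    np.fill_diagonal(center_d2, np.inf)                                  # ignore i == j
    return objective / (X.shape[0] * center_d2.min())
```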

(4) Clustering Validation Index Based on Nearest Neighbors (CVNN) [32]. The separation of CVNN concerns the situation of objects that carry geometrical information of each cluster, and the compactness is the mean pairwise distance between objects in the same cluster. CVNN is computed as follows:

$$\mathrm{CVNN}(c, k) = \frac{\mathrm{Sep}(c, k)}{\max_{c_{\min}\le c\le c_{\max}} \mathrm{Sep}(c, k)} + \frac{\mathrm{Com}(c)}{\max_{c_{\min}\le c\le c_{\max}} \mathrm{Com}(c)}, \tag{21}$$

where $\mathrm{Sep}(c, k) = \max_{i=1,2,\ldots,c}\bigl((1/n_i)\sum_{j=1}^{n_i}(q_j/k)\bigr)$ and $\mathrm{Com}(c) = \sum_{i=1}^{c}\bigl((2/(n_i(n_i-1)))\sum_{x,y\in \mathrm{Clu}_i} d(x, y)\bigr)$. Here, $c$ is the number of clusters in the partition result, $c_{\max}$ is the maximum cluster number given, $c_{\min}$ is the minimum cluster number given, $k$ is the number of nearest neighbors, $n_i$ is the number of objects in the $i$th cluster $\mathrm{Clu}_i$, $q_j$ denotes the number of nearest neighbors of $\mathrm{Clu}_i$'s $j$th object which are not in $\mathrm{Clu}_i$, and $d(x, y)$ denotes the distance between $x$ and $y$. A lower CVNN value indicates a better clustering solution.
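The separation and compactness terms of (21) can be sketched as below (our reading of the definition; the final CVNN score then divides each term by its maximum over the candidate range c_min ≤ c ≤ c_max before summing, and the sketch assumes every cluster has at least two objects and distinct points).

```python
import numpy as np

def cvnn_terms(X, labels, k=10):
    """Return (Sep(c, k), Com(c)) for one partition; normalization across candidate c is done outside."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    nn = np.argsort(dist, axis=1)[:, 1:k + 1]        # k nearest neighbors of each object (self excluded)
    sep_per_cluster, com_terms = [], []
    for c_id in np.unique(labels):
        idx = np.where(labels == c_id)[0]
        n_i = len(idx)
        # q_j: how many of object j's k nearest neighbors fall outside its own cluster
        q = np.array([(labels[nn[j]] != c_id).sum() for j in idx])
        sep_per_cluster.append(q.mean() / k)
        # mean pairwise distance inside the cluster
        pair_sum = dist[np.ix_(idx, idx)].sum() / 2.0
        com_terms.append(2.0 * pair_sum / (n_i * (n_i - 1)))
    return max(sep_per_cluster), sum(com_terms)
```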

The objective function is a special validity criterion for the FCM clustering algorithm. A smaller objective function indicates that the points inside the clusters are more "similar."

Running time is also an important evaluation criterion, often related to the scalability of an algorithm. One main target of random projection for dimensionality reduction is to decrease the runtime and enhance the applicability of the algorithm in the context of big data.

5.3. Performance of FCM Clustering with Random Projection. The experimental results for FCM clustering with random projection are presented in Figure 2, where (a), (c), (e), and (g) correspond to the synthetic data set and (b), (d), (f), and (h) correspond to the ACT1 data set. The evaluation criteria used to assess the proposed algorithms are FRI ((a) and (b)), XB ((c) and (d)), objective function ((e) and (f)), and running time ((g) and (h)). "SignRP" denotes the proposed algorithm with a random sign matrix, "GaussRP" denotes FCM clustering with a random Gaussian matrix, "FCM" denotes the original FCM clustering algorithm, and "SVD" denotes FCM clustering with dimensionality reduction through SVD. It should be noted that the true XB value of FCM clustering in subfigure (d) is 4.03e+12, not 0.

From Figure 2, we can see that FCM clustering with random projection is clearly more efficient than the original FCM clustering. When the number of dimensions $t$ is above a certain bound, the validity indices are nearly stable and similar to those of naive FCM clustering for both data sets. This verifies the conclusion that "the accuracy of the clustering algorithm can be preserved when the dimensionality exceeds a certain bound." The effectiveness of the random projection method is also verified by the small bound compared to the total number of dimensions (30/1000 for the synthetic data and 300/67500 for the ACT1 data). Besides, the two different kinds of random projection methods have a similar impact on FCM clustering, as their plots are analogous.

The higher objective function values and the smaller XB indices of the SVD method for the synthetic data set indicate that the generated clustering solution has a better degree of separation between clusters. The external cluster validation indices also verify that the SVD method has better clustering results for the synthetic data. These observations state that the SVD method is more suitable for Gaussian mixture data sets than FCM clustering with random projection and naive FCM clustering.

Although the SVD method has a higher FRI for the synthetic data set, the random projection methods have analogous FRI values for the ACT1 data set and better objective function values for both data sets. In addition, the random projection approaches are obviously more efficient, as SVD needs time cubic in the dimensionality. Hence, these observations indicate that our algorithm is quite encouraging in practice.


[Figure 2: Performance of clustering algorithms with different dimensionality. Panels (a)/(b) plot FRI, (c)/(d) XB, (e)/(f) the objective function, and (g)/(h) running time (s) against the number of dimensions t for SVD, FCM, GaussRP, and SignRP; panels (a), (c), (e), and (g) are for the synthetic data set and (b), (d), (f), and (h) for the ACT1 data set.]

Table 1: CVNN indices for different ensemble approaches on ACT2 data.

Dimension t    10      20      30      40      50      60      70      80      90      100
EFCM-A         1.7315  1.7383  1.7449  1.7789  1.819   1.83    1.7623  1.8182  1.8685  1.8067
EFCM-C         1.7938  1.7558  1.7584  1.8351  1.8088  1.8353  1.8247  1.8385  1.8105  1.8381
EFCM-S         1.3975  1.3144  1.2736  1.2974  1.3112  1.3643  1.3533  1.409   1.3701  1.3765

5.4. Comparisons of Different Cluster Ensemble Methods. The comparisons of the different cluster ensemble approaches are shown in Figure 3 and Table 1. Similarly, (a) and (c) of the figure correspond to the synthetic data set and (b) and (d) correspond to the ACT2 data set. We use RI ((a) and (b)) and running time ((c) and (d)) to present the performance of the ensemble methods. Meanwhile, the meanings of EFCM-A, EFCM-C, and EFCM-S are identical to those in Section 4.2. In order to get crisp partitions for EFCM-A and EFCM-C, we used the hierarchical clustering complete-linkage method after obtaining the distance matrix, as in [21]. Since all three cluster ensemble methods get perfect partition results on the synthetic data set, we only compare the CVNN indices of the different ensemble methods on the ACT2 data set, which are presented in Table 1.

In Figure 3, the running time of our algorithm is shorter for both data sets. This verifies the result of the time complexity analysis for the different algorithms in Section 4.2. The three cluster ensemble methods all obtain the perfect partition for the synthetic data set, whereas our method is more accurate than the other two methods for the ACT2 data set. The perfect partition results suggest that all three ensemble methods are suitable for Gaussian mixture data sets. However, the almost 18% improvement in RI for the ACT2 data set should be due to the different grouping ideas. Our method is based on the graph partition, such that the edges between different clusters have low weight and the edges within a cluster have high weight. This clustering style of spectral embedding is more suitable for the ACT2 data set. In Table 1, the smaller CVNN values of our new method also show that the new approach has better partition results on the ACT2 data set. These observations indicate that our algorithm has the advantage in efficiency and adapts to a wider range of geometries.

We also compare the stability of the three ensemble methods, as presented in Table 2. From the table, we can see that the standard deviation of RI for EFCM-S is a lower order of magnitude than those of the other methods. Hence, this result shows that our algorithm is more robust.

Aiming at the situation of an unknown number of clusters, we also varied the number of clusters $c$ in FCM clustering and in the spectral embedding for our new method. We denote this version of the new method by EFCM-SV. Since the number of random projections was set as 5 for the ACT2 data set, we changed the clusters' number from 17 to 21 as the input of the FCM clustering algorithm. In addition, we set the clusters' number from 14 to 24 as the input of the spectral embedding and applied CVNN to estimate the most plausible number of clusters. The experimental results are presented in Table 3.
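The EFCM-SV idea can be sketched as a small model-selection loop around the ensemble and CVNN sketches given earlier (an illustration only: in the paper the FCM runs used cluster numbers 17 to 21 while the embedding varied 14 to 24, whereas this sketch ties both to one candidate c for brevity, and the neighbor count k is an assumed value).

```python
import numpy as np

def efcm_sv(X, t, r, c_candidates=range(14, 25), k=10, seed=0):
    """Try several cluster numbers, score each partition by CVNN, and keep the best one."""
    runs = {}
    for c in c_candidates:
        labels = efcm_s(X, c, t, r, seed=seed)          # ensemble sketch from Section 4.2
        runs[c] = (cvnn_terms(X, labels, k=k), labels)  # (Sep, Com) terms plus the labels
    seps = [sep for (sep, com), _ in runs.values()]
    coms = [com for (sep, com), _ in runs.values()]
    sep_max, com_max = max(seps), max(coms)
    cvnn = {c: sep / sep_max + com / com_max for c, ((sep, com), _) in runs.items()}
    best_c = min(cvnn, key=cvnn.get)                    # smallest CVNN wins
    return best_c, runs[best_c][1]
```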


Table 2: Standard deviations of RI over 20 runs with different dimensions on ACT2 data.

Dimension t    10      20      30      40      50      60      70      80      90      100
EFCM-A         0.0222  0.0174  0.018   0.0257  0.0171  0.0251  0.0188  0.0172  0.0218  0.0184
EFCM-C         0.0217  0.0189  0.0128  0.0232  0.0192  0.0200  0.0175  0.0194  0.0151  0.0214
EFCM-S         0.0044  0.0018  0.0029  0.0030  0.0028  0.0024  0.0026  0.0020  0.0024  0.0019

Table 3: RI values for EFCM-S and EFCM-SV on ACT2 data.

Dimension t    10       20       30       40       50       60       70       80       90       100
EFCM-S         0.9227   0.922    0.9223   0.923    0.9215   0.9218   0.9226   0.9225   0.9231   0.9237
EFCM-SV        0.9257   0.9257   0.9165   0.9257   0.927    0.9165   0.9268   0.927    0.9105   0.9245
+CVNN          c=18.5   c=20.7   c=19.4   c=19.3   c=19.3   c=18.2   c=19.2   c=18.3   c=19.4   c=20.2

[Figure 3: Performance of cluster ensemble approaches with different dimensionality. Panels (a)/(b) plot RI and (c)/(d) running time (s) against the number of dimensions t for EFCM-A, EFCM-C, and EFCM-S; (a) and (c) are for the synthetic data set and (b) and (d) for the ACT2 data set.]


In Table 3, the values for "EFCM-SV" are the average RI values with the estimated clusters' numbers over 20 individual runs. The values of "+CVNN" are the average clusters' numbers decided by the CVNN cluster validity index. Using the clusters' numbers estimated by CVNN, our method gets results similar to the ensemble method with the correct clusters' number. In addition, the average estimates of the clusters' number are close to the true one. This indicates that our cluster ensemble method EFCM-SV is attractive when the number of clusters is unknown.

6. Conclusion and Future Work

The "curse of dimensionality" in big data has recently posed new challenges for clustering, and feature extraction for dimensionality reduction is a popular way to deal with these challenges. We studied the feature extraction method of random projection for FCM clustering. Through analyzing the effects of random projection on the entire variability of the data theoretically and verifying them on both synthetic and real-world data empirically, we designed an enhanced FCM clustering algorithm with random projection. The new algorithm can maintain nearly the same clustering solution as the preliminary FCM clustering and is more efficient than the feature extraction method of SVD. What is more, we also proposed a cluster ensemble approach that is more applicable to large scale data sets than the existing ones. The new ensemble approach can achieve the spectral embedding efficiently from SVD on the concatenation of the membership matrices. The experiments showed that the new ensemble method ran faster, had more robust partition solutions, and fitted a wider range of geometrical data sets.

A future research direction is to design provably accurate feature extraction and feature selection methods for FCM clustering. Another remaining question is how to choose the proper number of random projections for the cluster ensemble method in order to reach a trade-off between clustering accuracy and efficiency.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

This work was supported in part by the National Key Basic Research Program (973 Programme) under Grant 2012CB315905, in part by the National Nature Science Foundation of China under Grants 61502527 and 61379150, and in part by the Open Foundation of the State Key Laboratory of Networking and Switching Technology (Beijing University of Posts and Telecommunications) (no. SKLNST-2013-1-06).

References

[1] M. Chen, S. Mao, and Y. Liu, "Big data: a survey," Mobile Networks and Applications, vol. 19, no. 2, pp. 171–209, 2014.
[2] J. Zhang, X. Tao, and H. Wang, "Outlier detection from large distributed databases," World Wide Web, vol. 17, no. 4, pp. 539–568, 2014.
[3] C. Ordonez, N. Mohanam, and C. Garcia-Alvarado, "PCA for large data sets with parallel data summarization," Distributed and Parallel Databases, vol. 32, no. 3, pp. 377–403, 2014.
[4] D.-S. Pham, S. Venkatesh, M. Lazarescu, and S. Budhaditya, "Anomaly detection in large-scale data stream networks," Data Mining and Knowledge Discovery, vol. 28, no. 1, pp. 145–189, 2014.
[5] F. Murtagh and P. Contreras, "Random projection towards the Baire metric for high dimensional clustering," in Statistical Learning and Data Sciences, pp. 424–431, Springer, Berlin, Germany, 2015.
[6] T. C. Havens, J. C. Bezdek, C. Leckie, L. O. Hall, and M. Palaniswami, "Fuzzy c-means algorithms for very large data," IEEE Transactions on Fuzzy Systems, vol. 20, no. 6, pp. 1130–1146, 2012.
[7] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Elsevier, 2011.
[8] S. Khan, G. Situ, K. Decker, and C. J. Schmidt, "GoFigure: automated gene ontology annotation," Bioinformatics, vol. 19, no. 18, pp. 2484–2485, 2003.
[9] S. Gunnemann, H. Kremer, D. Lenhard, and T. Seidl, "Subspace clustering for indexing high dimensional data: a main memory index based on local reductions and individual multi-representations," in Proceedings of the 14th International Conference on Extending Database Technology (EDBT '11), pp. 237–248, ACM, Uppsala, Sweden, March 2011.
[10] J. C. Bezdek, R. Ehrlich, and W. Full, "FCM: the fuzzy c-means clustering algorithm," Computers & Geosciences, vol. 10, no. 2-3, pp. 191–203, 1984.
[11] R. J. Hathaway and J. C. Bezdek, "Extending fuzzy and probabilistic clustering to very large data sets," Computational Statistics & Data Analysis, vol. 51, no. 1, pp. 215–234, 2006.
[12] P. Hore, L. O. Hall, and D. B. Goldgof, "Single pass fuzzy c means," in Proceedings of the IEEE International Fuzzy Systems Conference (FUZZ '07), pp. 1–7, London, UK, July 2007.
[13] P. Hore, L. O. Hall, D. B. Goldgof, Y. Gu, A. A. Maudsley, and A. Darkazanli, "A scalable framework for segmenting magnetic resonance images," Journal of Signal Processing Systems, vol. 54, no. 1–3, pp. 183–203, 2009.
[14] W. B. Johnson and J. Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space," Contemporary Mathematics, vol. 26, pp. 189–206, 1984.
[15] P. Indyk and R. Motwani, "Approximate nearest neighbors: towards removing the curse of dimensionality," in Proceedings of the 13th Annual ACM Symposium on Theory of Computing, pp. 604–613, ACM, 1998.
[16] D. Achlioptas, "Database-friendly random projections: Johnson-Lindenstrauss with binary coins," Journal of Computer and System Sciences, vol. 66, no. 4, pp. 671–687, 2003.
[17] C. Boutsidis, A. Zouzias, and P. Drineas, "Random projections for k-means clustering," in Advances in Neural Information Processing Systems, pp. 298–306, MIT Press, 2010.
[18] C. C. Aggarwal and C. K. Reddy, Data Clustering: Algorithms and Applications, CRC Press, New York, NY, USA, 2013.
[19] R. Avogadri and G. Valentini, "Fuzzy ensemble clustering based on random projections for DNA microarray data analysis," Artificial Intelligence in Medicine, vol. 45, no. 2-3, pp. 173–183, 2009.


[20] X. Z. Fern and C. E. Brodley, "Random projection for high dimensional data clustering: a cluster ensemble approach," in Proceedings of the 20th International Conference on Machine Learning (ICML '03), vol. 3, pp. 186–193, August 2003.
[21] M. Popescu, J. Keller, J. Bezdek, and A. Zare, "Random projections fuzzy c-means (RPFCM) for big data clustering," in Proceedings of the IEEE International Conference on Fuzzy Systems (FUZZ-IEEE '15), pp. 1–6, Istanbul, Turkey, August 2015.
[22] A. Fahad, N. Alshatri, Z. Tari et al., "A survey of clustering algorithms for big data: taxonomy and empirical analysis," IEEE Transactions on Emerging Topics in Computing, vol. 2, no. 3, pp. 267–279, 2014.
[23] R. A. Johnson and D. W. Wichern, Applied Multivariate Statistical Analysis, vol. 4, Pearson Prentice Hall, Upper Saddle River, NJ, USA, 6th edition, 2007.
[24] C. Boutsidis, A. Zouzias, M. W. Mahoney, and P. Drineas, "Randomized dimensionality reduction for k-means clustering," IEEE Transactions on Information Theory, vol. 61, no. 2, pp. 1045–1062, 2015.
[25] X. Chen and D. Cai, "Large scale spectral clustering with landmark-based representation," in Proceedings of the 25th AAAI Conference on Artificial Intelligence, pp. 313–318, 2011.
[26] D. Cai and X. Chen, "Large scale spectral clustering via landmark-based sparse representation," IEEE Transactions on Cybernetics, vol. 45, no. 8, pp. 1669–1680, 2015.
[27] G. H. Golub and C. F. Van Loan, Matrix Computations, vol. 3, JHU Press, 2012.
[28] U. Maulik and S. Bandyopadhyay, "Performance evaluation of some clustering algorithms and validity indices," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 12, pp. 1650–1654, 2002.
[29] W. M. Rand, "Objective criteria for the evaluation of clustering methods," Journal of the American Statistical Association, vol. 66, no. 336, pp. 846–850, 1971.
[30] D. T. Anderson, J. C. Bezdek, M. Popescu, and J. M. Keller, "Comparing fuzzy, probabilistic, and possibilistic partitions," IEEE Transactions on Fuzzy Systems, vol. 18, no. 5, pp. 906–918, 2010.
[31] X. L. Xie and G. Beni, "A validity measure for fuzzy clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 8, pp. 841–847, 1991.
[32] Y. Liu, Z. Li, H. Xiong, X. Gao, J. Wu, and S. Wu, "Understanding and enhancement of internal clustering validation measures," IEEE Transactions on Cybernetics, vol. 43, no. 3, pp. 982–994, 2013.

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 5: Research Article Fuzzy -Means and Cluster Ensemble with Random Projection for Big Data ...downloads.hindawi.com/journals/mpe/2016/6529794.pdf · 2019. 7. 30. · Research Article

Mathematical Problems in Engineering 5

(2) If 119894 = 119895 then

cov (119884119894 119884119894) = var (119884

119894) =

1

119905

sdot (

119889

sum

119896=1

119864 (1198832

119896) sdot 119864 (119877

2

119896119894))

=

1

119905

sdot

119889

sum

119896=1

119864 (1198832

119896)

(13)

Thus by assumption 119864(119883119894) = 0 (1 le 119894 le 119889) we can get

119905

sum

119894=1

var (119884119894) =

119889

sum

119894=1

var (119883119894) (14)

We denote spectral decomposition of sample covariancematrice S by S = VΛV119879 whereV is thematrix of eigenvectorsand Λ is a diagonal matrix in which the diagonal elementsare 1205821 1205822 120582

119889and 120582

1ge 1205822ge sdot sdot sdot ge 120582

119889 Supposing the data

samples have been centralized namely their means are 0119904 wecan get covariance matrix S = (1119899)X119879X For conveniencewe still denote a sample of random matrix by R Thusprojected data Y = (1radic119905)XR and sample covariance matrixof projected data Slowast = (1119899)((1radic119905)XR)119879((1radic119905)XR) =

(1119905)R119879SR Then we can get

tr (Slowast) = tr(1119905

R119879VΛV119879R) = tr(1119905

R119879ΛVV119879R)

= tr(1119905

R119879ΛR) =119889

sum

119894=1

120582119894sdot (

1

119905

sdot

119905

sum

119895=1

1199032

119894119895)

(15)

where 119903119894119895(1 le 119894 le 119889 1 le 119895 le 119905) is sample of element of

random matrix RIn practice the spectrum of a covariance often displays a

distinct decay after few large eigenvalues So we assume thatthere exists an integer 119901 limited constant 119902 gt 0 such that forall 119894 gt 119901 it holds that 120582

119894le 119902 Then

1003816100381610038161003816tr (Slowast) minus tr (S)100381610038161003816

1003816=

10038161003816100381610038161003816100381610038161003816100381610038161003816

119889

sum

119894=1

120582119894sdot (

1

119905

119905

sum

119895=1

1199032

119894119895minus 1)

10038161003816100381610038161003816100381610038161003816100381610038161003816

le

10038161003816100381610038161003816100381610038161003816100381610038161003816

119901

sum

119894=1

120582119894sdot (

1

119905

119905

sum

119895=1

1199032

119894119895minus 1)

10038161003816100381610038161003816100381610038161003816100381610038161003816

+

10038161003816100381610038161003816100381610038161003816100381610038161003816

119889

sum

119894=119901+1

120582119894sdot (

1

119905

119905

sum

119895=1

1199032

119894119895minus 1)

10038161003816100381610038161003816100381610038161003816100381610038161003816

le

10038161003816100381610038161003816100381610038161003816100381610038161003816

119901

sum

119894=1

120582119894sdot

1

119905

119905

sum

119895=1

(1199032

119894119895minus 1)

10038161003816100381610038161003816100381610038161003816100381610038161003816

+ 119902

10038161003816100381610038161003816100381610038161003816100381610038161003816

119889

sum

119894=119901+1

(

1

119905

119905

sum

119895=1

(1199032

119894119895minus 1))

10038161003816100381610038161003816100381610038161003816100381610038161003816

(16)

By Lemma 3 with probability 1

lim119905rarrinfin

(

1

119905

119905

sum

119895=1

(1199032

119894119895minus 1)) = 0

lim119905rarrinfin

119889

sum

119894=119901+1

(

1

119905

119905

sum

119895=1

(1199032

119894119895minus 1)) = 0

(17)

Combining the above arguments we achieve tr(Slowast) =tr(S) with probability 1 when 119905 rarr infin

Part (1) of Theorem 4 indicates that compressed dataproduced by random projection can take much informationwith low dimensionality owing to linear independence ofreduced dimensions Part (2) manifests that sum of variancesof dimensions of original data is consistent with the oneof projected data namely random projection holds thevariability of primal data Combining results of Lemma 2with those ofTheorem 4 we consider that random projectioncan be employed to improve the efficiency of FCM clusteringalgorithm with low dimensionality and the modified algo-rithm can keep the accuracy of partition approximately

4 FCM Clustering with Random Projectionand an Efficient Cluster Ensemble Approach

41 FCM Clustering via Random Projection According tothe results of Section 3 we design an improved FCMclustering algorithm with random projection for dimen-sionality reduction The procedure of new algorithm isshown in Algorithm 2

Algorithm 2 reduces the dimensions of input data viamultiplying a random matrix Compared with the 119874(1198881198991198892)time for running each iteration in original FCM clusteringthe new algorithm would imply an 119874(119888119899(120576minus2 ln 119899)2) time foreach iteration Thus the time complexity of new algorithmdecreases obviously for high dimensional data in the case120576minus2 ln 119899 ≪ 119889 Another common dimensionality reductionmethod is SVD Compared with the 119874(1198893 + 1198991198892) time ofrunning SVD on data matrix X the new algorithm onlyneeds 119874(120576minus2119889 ln 119899) time to generate random matrix R Itindicates that random projection is a cost-effective methodof dimensionality reduction for FCM clustering algorithm

42 Ensemble Approach Based on Graph Partition As dif-ferent random projections may result in different clusteringsolutions [20] it is attractive to design the cluster ensembleframework with random projection for improved and robustclustering performance Although it uses smaller memoryand runs faster than ensemble method in [19] the clusterensemble algorithm in [21] still needs product of concate-nated partition matrix for crisp grouping which leads to ahigh time and space costs under the circumstances of big data

In this section we propose a more efficient and effectiveaggregation method for multiple FCM clustering results Theoverview of our new ensemble approach is presented inFigure 1 The new ensemble method is based on partition

6 Mathematical Problems in Engineering

Input data set X (an 119899 times 119889matrix) number of clusters 119888 fuzzy constant119898 FCM clustering algorithmOutput partition matrix U centers of clusters V(1) sample a 119889 times 119905 (119905 le 119889 119905 = Ω(120576minus2 ln 119899)) random projection Rmeeting the requirements of Lemma 2(2) compute the product Y = (1radic119905)XR(3) run FCM algorithm on Y get the partition matrix U(4) compute the centers of clusters through original data X and U

Algorithm 2 FCM clustering with random projection

Input data set X (an 119899 times 119889matrix) number of clusters 119888 reduced dimension 119905 number ofrandom projection 119903 FCM clustering algorithmOutput cluster label vector u(1) at each iteration 119894 isin [1 119903] run Algorithm 2 get membership matrix U

119894isin R119888times119899

(2) concatenate the membership matrices Ucon = [U1198791 U119879

119903] isin R119899times119888119903

(3) compute the first 119888 left singular vectors of Ucon denoted by A = [a1 a2 a

119888] isin R119899times119888

where Ucon = Ucon(119903 sdot D)minus12 D is a diagonal matrix and 119889119894119894= sum119895119906con 119895119894

(4) treat each row of A as a data point and apply 119896-means to obtain cluster label vector

Algorithm 3 Cluster ensemble for FCM clustering with random projection

on similarity graph For each random projection a new dataset is generated After performing FCM clustering on thenew data sets membershipmatrices are outputThe elementsof membership matrix are treated as the similarity measurebetween points and the cluster centers Through SVD on theconcatenation of membership matrices we get the spectralembedding of data point efficiently The detailed procedureof new cluster ensemble approach is shown in Algorithm 3

In step (3) of the procedure in Algorithm 3 the left sin-gular vectors of Ucon are equivalent to the eigenvectors ofUconU

119879

con It implies that we regard the matrix product as aconstruction of affinity matrix of data points This method ismotivated by the research on landmark-based representation[25 26] In our approach we treat the cluster centers ofeach FCM clustering run as landmarks and the membershipmatrix as landmark-based representation Thus the con-catenation of membership matrices forms a combinationallandmark-based representationmatrix In this way the graphsimilarity matrix is computed as

W =UconU

119879

con (18)

which can create spectral embedding efficiently through step(3) To normalize the graph similarity matrix we multiplyUcon by (119903 sdot D)minus12 As a result the degree matrix of W is anidentity matrix

There are two perspectives to explain why our approachworks Considering the similarity measure defined by 119906

119894119895

in FCM clustering proposition 3 in [26] demonstrated thatsingular vectors of U

119894converged to eigenvectors of W

119904as 119888

converges to 119899 where W119904was affinity matrix generated in

standard spectral clustering As a result singular vectors ofUcon converge to eigenvectors of normalized affinity matrixW119904Thus our final outputwill converge to the one of standard

spectral clustering as 119888 converges to 119899 Another explanationis about the similarity measure defined by 119870(x

119894 x119895) = x119879119894x119895

where x119894and x119895are data pointsWe can treat each row of Ucon

as a transformational data point As a result affinity matrixobtained here is the same as the one of standard spectralembedding and our output is just the partition result ofstandard spectral clustering

To facilitate comparison of different ensemble methodsfor FCM clustering solutions with random projection wedenote the approach of [19] byEFCM-A (average the productsof membership matrices) the algorithm of [21] by EFCM-C (concatenate the membership matrices) and our newmethod by EFCM-S (spectral clustering on the membershipmatrices) In the cluster ensemble phase the main computa-tions of EFCM-Amethod are multiplications of membershipmatrices Similarly the algorithm of EFCM-C also needsthe product of concatenated membership matrices in orderto get the crisp partition result Thus the above methodsboth need 119874(119899

2) space and 119874(119888119903119899

2) time However the

main computation of EFCM-S is SVD for Ucon and 119896-meansclustering for A The overall space is 119874(119888119903119899) the SVD timeis 119874((119888119903)2119899) and the 119896-means clustering time is 1198971198882119899 where119897 is iteration number of 119896-means Therefore computationalcomplexity of EFCM-S is obviously decreased comparedwiththe ones of EFCM-A and EFCM-C considering 119888119903 ≪ 119899 and119897 ≪ 119899 in large scale data set

5 Experiments

In this section we present the experimental evaluations ofnew algorithms proposed in Section 4 We implemented therelated algorithms in Matlab computing environment andconducted our experiments on aWindows-based systemwiththe Intel Core 36GHz processor and 16GB of RAM

51 Data Sets and Parameter Settings We conducted theexperiments on synthetic and real data sets which bothhave relatively high dimensionality The synthetic data

Mathematical Problems in Engineering 7

Original dataset

Randomprojection 1

Randomprojection 2

Randomprojection r

Generateddataset 1

Generateddataset 2

Generateddataset r

FCM clustering

Membershipmatrix 1

Consensus matrix

Final result

Membershipmatrix 1

Membershipmatrix 1

FCM clusteringFCM clustering

k-means

First c left singularvectors A

middot middot middot

middot middot middot

middot middot middot

middot middot middot

Figure 1 Framework of the new ensemble approach based on graph partition

set had 10000 data points with 1000 dimensions whichwere generated from 3 Gaussian mixtures in propor-tions (025 05 025) The means of components were(2 2 2)

1000 (0 0 0)

1000 and (minus2 minus2 minus2)

1000and

the standard deviations were (1 1 1)1000

(2 2 2)1000

and (3 3 3)

1000 The real data set is the daily and sports

activities data (ACT) published on UCI machine learningrepository (theACTdata set can be found at httparchiveicsuciedumldatasetsDaily+and+Sports+Activities)Thesearedata of 19 activities collected by 45 motion sensors in 5minutes at 25Hz sampling frequency Each activity wasperformed by 8 subjects in their own styles To get highdimensional data sets we treated 1 minute and 5 secondsof activity data as an instance respectively As a result wegot 760 times 67500 (ACT1) and 9120 times 5625 (ACT2) datamatrices whose rows were activity instances and columnswere features

For the parameters of FCM clustering we let 120576 = 10minus5 welet maximum iteration number be 100 we let fuzzy factor 119898be 2 and we let the number of clusters be 119888 = 3 for syntheticdata set and 119888 = 19 for ACT data sets We also normalizedthe objective function as objlowast = objX2

119865 where sdot

119865is

Frobenius norm of matrix [27] To minimize the influenceintroduced by different initializations we present the averagevalues of evaluation indices of 20 independent experiments

In order to compare different dimensionality reductionmethods for FCM clustering we initialized algorithms by

choosing 119888 points randomly as the cluster centers and madesure that every algorithm began with the same initializationIn addition we ran Algorithm 2 with 119905 = 10 20 100 forsynthetic data set and 119905 = 100 200 1000 for ACT1 dataset Two kinds of random projections (with random variablesfrom (5) in Lemma 2) were both tested for verifying theirfeasibility We also compared Algorithm 2 against anotherpopular method of dimensionality reductionmdashSVD Whatcalls for special attention is that the number of eigenvectorscorresponding to nonzero eigenvalues of ACT1 data is only760 so we just took 119905 = 100 200 700 on FCM clusteringwith SVD for ACT1 data set

Among comparisons of different cluster ensemble algo-rithms we set dimension number of projected data as 119905 =10 20 100 for both synthetic andACT2data sets In orderto meet 119888119903 ≪ 119899 for Algorithm 3 the number of randomprojection 119903 was set as 20 for the synthetic data set and 5 forthe ACT2 data set respectively

52 Evaluation Criteria For clustering algorithms clusteringvalidation and running time are two important indices forjudging their performances Clustering validation measuresevaluate the goodness of clustering results [28] and often canbe divided into two categories external clustering validationand internal clustering validation External validation mea-sures use external information such as the given class labelsto evaluate the goodness of solution output by a clustering

8 Mathematical Problems in Engineering

algorithm On the contrary internal measures are to evaluatethe clustering results using feature inherited from data setsIn this paper validity evaluation criteria used are rand indexand clustering validation index based on nearest neighborsfor crisp partition together with fuzzy rand index and Xie-Beni index for fuzzy partition Here rand index and fuzzyrand index are external validation measures whereas theclustering validation index based on nearest neighbors indexand Xie-Beni index are internal validation measures

(1) Rand Index (RI) [29] RI describes the similarity ofclustering solution and correct labels through pairs of pointsIt takes into account the numbers of point pairs that are in thesame and different clusters The RI is defined as

RI =11989911+ 11989900

1198622

119899

(19)

where 11989911

is the number of pairs of points that exist in thesame cluster in both clustering result and given class labels11989900is the number of pairs of points that are in different subsets

for both clustering result and given class labels and1198622119899equals

119899(119899 minus 1)2 The value of RI ranges from 0 to 1 and the highervalue implies the better clustering solution

(2) Fuzzy Rand Index (FRI) [30] FRI is a generalizationof RI with respect to soft partition It also measures theproportion of pairs of points which exist in the same anddifferent clusters in both clustering solution and true classlabels It needs to compute the analogous 119899

11and 11989900through

contingency table described in [30] Therefore the range ofFRI is also [0 1] and the larger value means more accuratecluster solution

(3) Xie-Beni Index (XB) [31] XB takes the minimum squaredistance between cluster centers as the separation of thepartition and the average square fuzzy deviation of datapoints as the compactness of the partition XB is calculatedas follows

XB =sum119888

119894=1sum119899

119895=1119906119898

119894119895

10038171003817100381710038171003817x119895minus k119894

10038171003817100381710038171003817

2

119899 sdotmin119894119895

10038171003817100381710038171003817k119894minus k119895

10038171003817100381710038171003817

2 (20)

where sum119888119894=1sum119899

119895=1119906119898

119894119895x119895minus k1198942 is just the objective function of

FCM clustering and k119894is the center of cluster 119894 The smallest

XB indicates the optimal cluster partition

(4) Clustering Validation Index Based on Nearest Neighbors(CVNN) [32] The separation of CVNN is about the situationof objects that have geometrical information of each clusterand the compactness is the mean pairwise distance betweenobjects in the same cluster CVNN is computed as follows

CVNN (119888 119896) =Sep (119888 119896)

(max119888 minle119888le119888 maxSep (119888 119896))

+

Com (119888)(max119888 minle119888le119888 maxCom (119888))

(21)

where Sep(119888 119896) = max119894=12119888

((1119899119894) sdot sum

119899119894

119895=1(119902119895119896)) and

Com(119888) = sum119888119894=1((2119899119894(119899119894minus1)) sdotsum

119909119910isinClu119894 119889(119909 119910)) Here 119888 is the

number of clusters in partition result 119888 max is the maximumcluster number given 119888 min is the minimum cluster numbergiven 119896 is the number of nearest neighbors 119899

119894is the number

of objects in the 119894th cluster Clu119894 119902119895denotes the number of

nearest neighbors of Clu119894rsquos 119895th object which are not in Clu

119894

and 119889(119909 119910) denotes the distance between 119909 and 119910 The lowerCVNN value indicates the better clustering solution

Objective function is a special evaluation criterion ofvalidity for FCM clustering algorithm The smaller objectivefunction indicates that the points inside clusters are moreldquosimilarrdquo

Running time is also an important evaluation criterionoften related to the scalability of algorithm One maintarget of random projection for dimensionality reductionis to decrease the runtime and enhance the applicability ofalgorithm in the context of big data

53 Performance of FCM Clustering with Random ProjectionThe experimental results about FCM clustering with randomprojection are presented in Figure 2 where (a) (c) (e) and (g)correspond to the synthetic data set and (b) (d) (f) and (h)correspond to the ACT1 data setThe evaluation criteria usedto assess proposed algorithms are FRI (a) and (b) XB (c) and(d) objective function (e) and (f) and running time (g) and(h) ldquoSignRPrdquo denotes the proposed algorithm with randomsign matrix ldquoGaussRPrdquo denotes the FCM clustering withrandom Gaussian matrix ldquoFCMrdquo denotes the original FCMclustering algorithm and ldquoSVDrdquo denotes the FCM clusteringwith dimensionality reduction through SVD It should benoted that true XB value of FCM clustering in subfigure (d)is 403e + 12 not 0

From Figure 2 we can see that FCM clustering with random projection is clearly more efficient than the original FCM clustering. When the number of dimensions t is above a certain bound, the validity indices are nearly stable and similar to those of naive FCM clustering for both data sets. This verifies the conclusion that "the accuracy of the clustering algorithm can be preserved when the dimensionality exceeds a certain bound." The effectiveness of the random projection method is also verified by how small this bound is compared with the total number of dimensions (30/1000 for the synthetic data and 300/67500 for the ACT1 data). Besides, the two kinds of random projection methods have a similar impact on FCM clustering, as shown by their analogous plots.

The higher objective function values and the smaller XB indices of the SVD method for the synthetic data set indicate that the generated clustering solution has a better degree of separation between clusters. The external cluster validation indices also verify that the SVD method has better clustering results for the synthetic data. These observations suggest that the SVD method is more suitable for Gaussian mixture data sets than FCM clustering with random projection and naive FCM clustering.

Although the SVD method has a higher FRI for the synthetic data set, the random projection methods have analogous FRI values for the ACT1 data set and better objective function values for both data sets. In addition, the random projection approaches are obviously more efficient, as SVD needs time cubic in the dimensionality. Hence, these observations indicate that our algorithm is quite encouraging in practice.

[Figure 2: Performance of clustering algorithms with different dimensionality. Panels (a)-(h) plot FRI, XB, objective function, and running time (s) against the number of dimensions t, for the synthetic data set ((a), (c), (e), (g)) and the ACT1 data set ((b), (d), (f), (h)); the curves compare SVD, FCM, GaussRP, and SignRP.]

Table 1: CVNN indices for different ensemble approaches on ACT2 data.

Dimension t   10      20      30      40      50      60      70      80      90      100
EFCM-A        1.7315  1.7383  1.7449  1.7789  1.819   1.83    1.7623  1.8182  1.8685  1.8067
EFCM-C        1.7938  1.7558  1.7584  1.8351  1.8088  1.8353  1.8247  1.8385  1.8105  1.8381
EFCM-S        1.3975  1.3144  1.2736  1.2974  1.3112  1.3643  1.3533  1.409   1.3701  1.3765

5.4. Comparisons of Different Cluster Ensemble Methods. The comparisons of different cluster ensemble approaches are shown in Figure 3 and Table 1. Similarly, (a) and (c) of the figure correspond to the synthetic data set and (b) and (d) correspond to the ACT2 data set. We use RI ((a) and (b)) and running time ((c) and (d)) to present the performance of the ensemble methods. Meanwhile, the meanings of EFCM-A, EFCM-C, and EFCM-S are identical to the ones in Section 4.2. In order to get a crisp partition for EFCM-A and EFCM-C, we used the hierarchical clustering (complete linkage) method after obtaining the distance matrix, as in [21]. Since all three cluster ensemble methods get perfect partition results on the synthetic data set, we only compare the CVNN indices of the different ensemble methods on the ACT2 data set, which are presented in Table 1.

In Figure 3, the running time of our algorithm is shorter for both data sets. This verifies the result of the time complexity analysis for the different algorithms in Section 4.2. The three cluster ensemble methods all get the perfect partition for the synthetic data set, whereas our method is more accurate than the other two methods for the ACT2 data set. The perfect partition results suggest that all three ensemble methods are suitable for Gaussian mixture data sets. However, the almost 18% improvement in RI for the ACT2 data set should be due to the different grouping ideas. Our method is based on graph partitioning, such that the edges between different clusters have low weight and the edges within a cluster have high weight. This clustering style of spectral embedding is more suitable for the ACT2 data set. In Table 1, the smaller CVNN values of our new method also show that the new approach has better partition results on the ACT2 data set. These observations indicate that our algorithm has an advantage in efficiency and adapts to a wider range of geometries.

We also compare the stability of the three ensemble methods, presented in Table 2. From the table we can see that the standard deviation of RI for EFCM-S is an order of magnitude lower than those of the other methods. Hence, this result shows that our algorithm is more robust.

Aiming at the situation where the number of clusters is unknown, we also varied the number of clusters c in the FCM clustering and spectral embedding steps of our new method. We denote this version of the new method as EFCM-SV. Since the number of random projections was set to 5 for the ACT2 data set, we changed the number of clusters from 17 to 21 as the input of the FCM clustering algorithm. In addition, we set the number of clusters from 14 to 24 as the input of the spectral embedding and applied CVNN to estimate the most plausible number of clusters. The experimental results are presented in Table 3.


Table 2: Standard deviations of RI over 20 runs with different dimensions on ACT2 data.

Dimension t   10      20      30      40      50      60      70      80      90      100
EFCM-A        0.0222  0.0174  0.018   0.0257  0.0171  0.0251  0.0188  0.0172  0.0218  0.0184
EFCM-C        0.0217  0.0189  0.0128  0.0232  0.0192  0.0200  0.0175  0.0194  0.0151  0.0214
EFCM-S        0.0044  0.0018  0.0029  0.0030  0.0028  0.0024  0.0026  0.0020  0.0024  0.0019

Table 3: RI values for EFCM-S and EFCM-SV on ACT2 data.

Dimension t   10        20        30        40        50        60        70        80        90        100
EFCM-S        0.9227    0.922     0.9223    0.923     0.9215    0.9218    0.9226    0.9225    0.9231    0.9237
EFCM-SV       0.9257    0.9257    0.9165    0.9257    0.927     0.9165    0.9268    0.927     0.9105    0.9245
+CVNN         c = 18.5  c = 20.7  c = 19.4  c = 19.3  c = 19.3  c = 18.2  c = 19.2  c = 18.3  c = 19.4  c = 20.2

[Figure 3: Performance of cluster ensemble approaches with different dimensionality. Panels (a) and (b) plot RI versus the number of dimensions t for the synthetic and ACT2 data sets, respectively; panels (c) and (d) plot running time (s) versus the number of dimensions t; the curves compare EFCM-A, EFCM-C, and EFCM-S.]


In Table 3, the values for "EFCM-SV" are the average RI values obtained with the estimated numbers of clusters over 20 individual runs. The values for "+CVNN" are the average numbers of clusters decided by the CVNN cluster validity index. Using the numbers of clusters estimated by CVNN, our method gets results similar to those of the ensemble method with the correct number of clusters. In addition, the average estimates of the number of clusters are close to the true one. This indicates that our cluster ensemble method EFCM-SV is attractive when the number of clusters is unknown.

6. Conclusion and Future Work

The "curse of dimensionality" in big data has recently posed new challenges for clustering, and feature extraction for dimensionality reduction is a popular way to deal with these challenges. We studied the feature extraction method of random projection for FCM clustering. By analyzing the effects of random projection on the entire variability of the data theoretically, and verifying them empirically on both synthetic and real-world data, we designed an enhanced FCM clustering algorithm with random projection. The new algorithm maintains nearly the same clustering solution as the preliminary FCM clustering while being more efficient than the feature extraction method of SVD. Moreover, we also proposed a cluster ensemble approach that is more applicable to large scale data sets than existing ones. The new ensemble approach achieves the spectral embedding efficiently from an SVD of the concatenation of membership matrices. The experiments showed that the new ensemble method ran faster, had more robust partition solutions, and fitted a wider range of geometrical data sets.

One direction for future research is to design provably accurate feature extraction and feature selection methods for FCM clustering. Another remaining question is how to choose a proper number of random projections for the cluster ensemble method in order to get a trade-off between clustering accuracy and efficiency.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

This work was supported in part by the National Key Basic Research Program (973 Program) under Grant 2012CB315905, in part by the National Natural Science Foundation of China under Grants 61502527 and 61379150, and in part by the Open Foundation of the State Key Laboratory of Networking and Switching Technology (Beijing University of Posts and Telecommunications) (no. SKLNST-2013-1-06).

References

[1] M. Chen, S. Mao, and Y. Liu, "Big data: a survey," Mobile Networks and Applications, vol. 19, no. 2, pp. 171-209, 2014.

[2] J. Zhang, X. Tao, and H. Wang, "Outlier detection from large distributed databases," World Wide Web, vol. 17, no. 4, pp. 539-568, 2014.

[3] C. Ordonez, N. Mohanam, and C. Garcia-Alvarado, "PCA for large data sets with parallel data summarization," Distributed and Parallel Databases, vol. 32, no. 3, pp. 377-403, 2014.

[4] D.-S. Pham, S. Venkatesh, M. Lazarescu, and S. Budhaditya, "Anomaly detection in large-scale data stream networks," Data Mining and Knowledge Discovery, vol. 28, no. 1, pp. 145-189, 2014.

[5] F. Murtagh and P. Contreras, "Random projection towards the Baire metric for high dimensional clustering," in Statistical Learning and Data Sciences, pp. 424-431, Springer, Berlin, Germany, 2015.

[6] T. C. Havens, J. C. Bezdek, C. Leckie, L. O. Hall, and M. Palaniswami, "Fuzzy c-means algorithms for very large data," IEEE Transactions on Fuzzy Systems, vol. 20, no. 6, pp. 1130-1146, 2012.

[7] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Elsevier, 2011.

[8] S. Khan, G. Situ, K. Decker, and C. J. Schmidt, "GoFigure: automated gene ontology annotation," Bioinformatics, vol. 19, no. 18, pp. 2484-2485, 2003.

[9] S. Gunnemann, H. Kremer, D. Lenhard, and T. Seidl, "Subspace clustering for indexing high dimensional data: a main memory index based on local reductions and individual multi-representations," in Proceedings of the 14th International Conference on Extending Database Technology (EDBT '11), pp. 237-248, ACM, Uppsala, Sweden, March 2011.

[10] J. C. Bezdek, R. Ehrlich, and W. Full, "FCM: the fuzzy c-means clustering algorithm," Computers & Geosciences, vol. 10, no. 2-3, pp. 191-203, 1984.

[11] R. J. Hathaway and J. C. Bezdek, "Extending fuzzy and probabilistic clustering to very large data sets," Computational Statistics & Data Analysis, vol. 51, no. 1, pp. 215-234, 2006.

[12] P. Hore, L. O. Hall, and D. B. Goldgof, "Single pass fuzzy c means," in Proceedings of the IEEE International Fuzzy Systems Conference (FUZZ '07), pp. 1-7, London, UK, July 2007.

[13] P. Hore, L. O. Hall, D. B. Goldgof, Y. Gu, A. A. Maudsley, and A. Darkazanli, "A scalable framework for segmenting magnetic resonance images," Journal of Signal Processing Systems, vol. 54, no. 1-3, pp. 183-203, 2009.

[14] W. B. Johnson and J. Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space," Contemporary Mathematics, vol. 26, pp. 189-206, 1984.

[15] P. Indyk and R. Motwani, "Approximate nearest neighbors: towards removing the curse of dimensionality," in Proceedings of the Annual ACM Symposium on Theory of Computing, pp. 604-613, ACM, 1998.

[16] D. Achlioptas, "Database-friendly random projections: Johnson-Lindenstrauss with binary coins," Journal of Computer and System Sciences, vol. 66, no. 4, pp. 671-687, 2003.

[17] C. Boutsidis, A. Zouzias, and P. Drineas, "Random projections for k-means clustering," in Advances in Neural Information Processing Systems, pp. 298-306, MIT Press, 2010.

[18] C. C. Aggarwal and C. K. Reddy, Data Clustering: Algorithms and Applications, CRC Press, New York, NY, USA, 2013.

[19] R. Avogadri and G. Valentini, "Fuzzy ensemble clustering based on random projections for DNA microarray data analysis," Artificial Intelligence in Medicine, vol. 45, no. 2-3, pp. 173-183, 2009.

[20] X. Z. Fern and C. E. Brodley, "Random projection for high dimensional data clustering: a cluster ensemble approach," in Proceedings of the 20th International Conference on Machine Learning (ICML '03), vol. 3, pp. 186-193, August 2003.

[21] M. Popescu, J. Keller, J. Bezdek, and A. Zare, "Random projections fuzzy c-means (RPFCM) for big data clustering," in Proceedings of the IEEE International Conference on Fuzzy Systems (FUZZ-IEEE '15), pp. 1-6, Istanbul, Turkey, August 2015.

[22] A. Fahad, N. Alshatri, Z. Tari et al., "A survey of clustering algorithms for big data: taxonomy and empirical analysis," IEEE Transactions on Emerging Topics in Computing, vol. 2, no. 3, pp. 267-279, 2014.

[23] R. A. Johnson and D. W. Wichern, Applied Multivariate Statistical Analysis, vol. 4, Pearson Prentice Hall, Upper Saddle River, NJ, USA, 6th edition, 2007.

[24] C. Boutsidis, A. Zouzias, M. W. Mahoney, and P. Drineas, "Randomized dimensionality reduction for k-means clustering," IEEE Transactions on Information Theory, vol. 61, no. 2, pp. 1045-1062, 2015.

[25] X. Chen and D. Cai, "Large scale spectral clustering with landmark-based representation," in Proceedings of the 25th AAAI Conference on Artificial Intelligence, pp. 313-318, 2011.

[26] D. Cai and X. Chen, "Large scale spectral clustering via landmark-based sparse representation," IEEE Transactions on Cybernetics, vol. 45, no. 8, pp. 1669-1680, 2015.

[27] G. H. Golub and C. F. Van Loan, Matrix Computations, vol. 3, JHU Press, 2012.

[28] U. Maulik and S. Bandyopadhyay, "Performance evaluation of some clustering algorithms and validity indices," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 12, pp. 1650-1654, 2002.

[29] W. M. Rand, "Objective criteria for the evaluation of clustering methods," Journal of the American Statistical Association, vol. 66, no. 336, pp. 846-850, 1971.

[30] D. T. Anderson, J. C. Bezdek, M. Popescu, and J. M. Keller, "Comparing fuzzy, probabilistic, and possibilistic partitions," IEEE Transactions on Fuzzy Systems, vol. 18, no. 5, pp. 906-918, 2010.

[31] X. L. Xie and G. Beni, "A validity measure for fuzzy clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 8, pp. 841-847, 1991.

[32] Y. Liu, Z. Li, H. Xiong, X. Gao, J. Wu, and S. Wu, "Understanding and enhancement of internal clustering validation measures," IEEE Transactions on Cybernetics, vol. 43, no. 3, pp. 982-994, 2013.


Page 6: Research Article Fuzzy -Means and Cluster Ensemble with Random Projection for Big Data ...downloads.hindawi.com/journals/mpe/2016/6529794.pdf · 2019. 7. 30. · Research Article

6 Mathematical Problems in Engineering

Input data set X (an 119899 times 119889matrix) number of clusters 119888 fuzzy constant119898 FCM clustering algorithmOutput partition matrix U centers of clusters V(1) sample a 119889 times 119905 (119905 le 119889 119905 = Ω(120576minus2 ln 119899)) random projection Rmeeting the requirements of Lemma 2(2) compute the product Y = (1radic119905)XR(3) run FCM algorithm on Y get the partition matrix U(4) compute the centers of clusters through original data X and U

Algorithm 2 FCM clustering with random projection

Input data set X (an 119899 times 119889matrix) number of clusters 119888 reduced dimension 119905 number ofrandom projection 119903 FCM clustering algorithmOutput cluster label vector u(1) at each iteration 119894 isin [1 119903] run Algorithm 2 get membership matrix U

119894isin R119888times119899

(2) concatenate the membership matrices Ucon = [U1198791 U119879

119903] isin R119899times119888119903

(3) compute the first 119888 left singular vectors of Ucon denoted by A = [a1 a2 a

119888] isin R119899times119888

where Ucon = Ucon(119903 sdot D)minus12 D is a diagonal matrix and 119889119894119894= sum119895119906con 119895119894

(4) treat each row of A as a data point and apply 119896-means to obtain cluster label vector

Algorithm 3 Cluster ensemble for FCM clustering with random projection

on similarity graph For each random projection a new dataset is generated After performing FCM clustering on thenew data sets membershipmatrices are outputThe elementsof membership matrix are treated as the similarity measurebetween points and the cluster centers Through SVD on theconcatenation of membership matrices we get the spectralembedding of data point efficiently The detailed procedureof new cluster ensemble approach is shown in Algorithm 3

In step (3) of the procedure in Algorithm 3 the left sin-gular vectors of Ucon are equivalent to the eigenvectors ofUconU

119879

con It implies that we regard the matrix product as aconstruction of affinity matrix of data points This method ismotivated by the research on landmark-based representation[25 26] In our approach we treat the cluster centers ofeach FCM clustering run as landmarks and the membershipmatrix as landmark-based representation Thus the con-catenation of membership matrices forms a combinationallandmark-based representationmatrix In this way the graphsimilarity matrix is computed as

W =UconU

119879

con (18)

which can create spectral embedding efficiently through step(3) To normalize the graph similarity matrix we multiplyUcon by (119903 sdot D)minus12 As a result the degree matrix of W is anidentity matrix

There are two perspectives to explain why our approachworks Considering the similarity measure defined by 119906

119894119895

in FCM clustering proposition 3 in [26] demonstrated thatsingular vectors of U

119894converged to eigenvectors of W

119904as 119888

converges to 119899 where W119904was affinity matrix generated in

standard spectral clustering As a result singular vectors ofUcon converge to eigenvectors of normalized affinity matrixW119904Thus our final outputwill converge to the one of standard

spectral clustering as 119888 converges to 119899 Another explanationis about the similarity measure defined by 119870(x

119894 x119895) = x119879119894x119895

where x119894and x119895are data pointsWe can treat each row of Ucon

as a transformational data point As a result affinity matrixobtained here is the same as the one of standard spectralembedding and our output is just the partition result ofstandard spectral clustering

To facilitate comparison of different ensemble methodsfor FCM clustering solutions with random projection wedenote the approach of [19] byEFCM-A (average the productsof membership matrices) the algorithm of [21] by EFCM-C (concatenate the membership matrices) and our newmethod by EFCM-S (spectral clustering on the membershipmatrices) In the cluster ensemble phase the main computa-tions of EFCM-Amethod are multiplications of membershipmatrices Similarly the algorithm of EFCM-C also needsthe product of concatenated membership matrices in orderto get the crisp partition result Thus the above methodsboth need 119874(119899

2) space and 119874(119888119903119899

2) time However the

main computation of EFCM-S is SVD for Ucon and 119896-meansclustering for A The overall space is 119874(119888119903119899) the SVD timeis 119874((119888119903)2119899) and the 119896-means clustering time is 1198971198882119899 where119897 is iteration number of 119896-means Therefore computationalcomplexity of EFCM-S is obviously decreased comparedwiththe ones of EFCM-A and EFCM-C considering 119888119903 ≪ 119899 and119897 ≪ 119899 in large scale data set

5 Experiments

In this section we present the experimental evaluations ofnew algorithms proposed in Section 4 We implemented therelated algorithms in Matlab computing environment andconducted our experiments on aWindows-based systemwiththe Intel Core 36GHz processor and 16GB of RAM

51 Data Sets and Parameter Settings We conducted theexperiments on synthetic and real data sets which bothhave relatively high dimensionality The synthetic data

Mathematical Problems in Engineering 7

Original dataset

Randomprojection 1

Randomprojection 2

Randomprojection r

Generateddataset 1

Generateddataset 2

Generateddataset r

FCM clustering

Membershipmatrix 1

Consensus matrix

Final result

Membershipmatrix 1

Membershipmatrix 1

FCM clusteringFCM clustering

k-means

First c left singularvectors A

middot middot middot

middot middot middot

middot middot middot

middot middot middot

Figure 1 Framework of the new ensemble approach based on graph partition

set had 10000 data points with 1000 dimensions whichwere generated from 3 Gaussian mixtures in propor-tions (025 05 025) The means of components were(2 2 2)

1000 (0 0 0)

1000 and (minus2 minus2 minus2)

1000and

the standard deviations were (1 1 1)1000

(2 2 2)1000

and (3 3 3)

1000 The real data set is the daily and sports

activities data (ACT) published on UCI machine learningrepository (theACTdata set can be found at httparchiveicsuciedumldatasetsDaily+and+Sports+Activities)Thesearedata of 19 activities collected by 45 motion sensors in 5minutes at 25Hz sampling frequency Each activity wasperformed by 8 subjects in their own styles To get highdimensional data sets we treated 1 minute and 5 secondsof activity data as an instance respectively As a result wegot 760 times 67500 (ACT1) and 9120 times 5625 (ACT2) datamatrices whose rows were activity instances and columnswere features

For the parameters of FCM clustering we let 120576 = 10minus5 welet maximum iteration number be 100 we let fuzzy factor 119898be 2 and we let the number of clusters be 119888 = 3 for syntheticdata set and 119888 = 19 for ACT data sets We also normalizedthe objective function as objlowast = objX2

119865 where sdot

119865is

Frobenius norm of matrix [27] To minimize the influenceintroduced by different initializations we present the averagevalues of evaluation indices of 20 independent experiments

In order to compare different dimensionality reductionmethods for FCM clustering we initialized algorithms by

choosing 119888 points randomly as the cluster centers and madesure that every algorithm began with the same initializationIn addition we ran Algorithm 2 with 119905 = 10 20 100 forsynthetic data set and 119905 = 100 200 1000 for ACT1 dataset Two kinds of random projections (with random variablesfrom (5) in Lemma 2) were both tested for verifying theirfeasibility We also compared Algorithm 2 against anotherpopular method of dimensionality reductionmdashSVD Whatcalls for special attention is that the number of eigenvectorscorresponding to nonzero eigenvalues of ACT1 data is only760 so we just took 119905 = 100 200 700 on FCM clusteringwith SVD for ACT1 data set

Among comparisons of different cluster ensemble algo-rithms we set dimension number of projected data as 119905 =10 20 100 for both synthetic andACT2data sets In orderto meet 119888119903 ≪ 119899 for Algorithm 3 the number of randomprojection 119903 was set as 20 for the synthetic data set and 5 forthe ACT2 data set respectively

52 Evaluation Criteria For clustering algorithms clusteringvalidation and running time are two important indices forjudging their performances Clustering validation measuresevaluate the goodness of clustering results [28] and often canbe divided into two categories external clustering validationand internal clustering validation External validation mea-sures use external information such as the given class labelsto evaluate the goodness of solution output by a clustering

8 Mathematical Problems in Engineering

algorithm On the contrary internal measures are to evaluatethe clustering results using feature inherited from data setsIn this paper validity evaluation criteria used are rand indexand clustering validation index based on nearest neighborsfor crisp partition together with fuzzy rand index and Xie-Beni index for fuzzy partition Here rand index and fuzzyrand index are external validation measures whereas theclustering validation index based on nearest neighbors indexand Xie-Beni index are internal validation measures

(1) Rand Index (RI) [29] RI describes the similarity ofclustering solution and correct labels through pairs of pointsIt takes into account the numbers of point pairs that are in thesame and different clusters The RI is defined as

RI =11989911+ 11989900

1198622

119899

(19)

where 11989911

is the number of pairs of points that exist in thesame cluster in both clustering result and given class labels11989900is the number of pairs of points that are in different subsets

for both clustering result and given class labels and1198622119899equals

119899(119899 minus 1)2 The value of RI ranges from 0 to 1 and the highervalue implies the better clustering solution

(2) Fuzzy Rand Index (FRI) [30] FRI is a generalizationof RI with respect to soft partition It also measures theproportion of pairs of points which exist in the same anddifferent clusters in both clustering solution and true classlabels It needs to compute the analogous 119899

11and 11989900through

contingency table described in [30] Therefore the range ofFRI is also [0 1] and the larger value means more accuratecluster solution

(3) Xie-Beni Index (XB) [31] XB takes the minimum squaredistance between cluster centers as the separation of thepartition and the average square fuzzy deviation of datapoints as the compactness of the partition XB is calculatedas follows

XB =sum119888

119894=1sum119899

119895=1119906119898

119894119895

10038171003817100381710038171003817x119895minus k119894

10038171003817100381710038171003817

2

119899 sdotmin119894119895

10038171003817100381710038171003817k119894minus k119895

10038171003817100381710038171003817

2 (20)

where sum119888119894=1sum119899

119895=1119906119898

119894119895x119895minus k1198942 is just the objective function of

FCM clustering and k119894is the center of cluster 119894 The smallest

XB indicates the optimal cluster partition

(4) Clustering Validation Index Based on Nearest Neighbors(CVNN) [32] The separation of CVNN is about the situationof objects that have geometrical information of each clusterand the compactness is the mean pairwise distance betweenobjects in the same cluster CVNN is computed as follows

CVNN (119888 119896) =Sep (119888 119896)

(max119888 minle119888le119888 maxSep (119888 119896))

+

Com (119888)(max119888 minle119888le119888 maxCom (119888))

(21)

where Sep(119888 119896) = max119894=12119888

((1119899119894) sdot sum

119899119894

119895=1(119902119895119896)) and

Com(119888) = sum119888119894=1((2119899119894(119899119894minus1)) sdotsum

119909119910isinClu119894 119889(119909 119910)) Here 119888 is the

number of clusters in partition result 119888 max is the maximumcluster number given 119888 min is the minimum cluster numbergiven 119896 is the number of nearest neighbors 119899

119894is the number

of objects in the 119894th cluster Clu119894 119902119895denotes the number of

nearest neighbors of Clu119894rsquos 119895th object which are not in Clu

119894

and 119889(119909 119910) denotes the distance between 119909 and 119910 The lowerCVNN value indicates the better clustering solution

Objective function is a special evaluation criterion ofvalidity for FCM clustering algorithm The smaller objectivefunction indicates that the points inside clusters are moreldquosimilarrdquo

Running time is also an important evaluation criterionoften related to the scalability of algorithm One maintarget of random projection for dimensionality reductionis to decrease the runtime and enhance the applicability ofalgorithm in the context of big data

53 Performance of FCM Clustering with Random ProjectionThe experimental results about FCM clustering with randomprojection are presented in Figure 2 where (a) (c) (e) and (g)correspond to the synthetic data set and (b) (d) (f) and (h)correspond to the ACT1 data setThe evaluation criteria usedto assess proposed algorithms are FRI (a) and (b) XB (c) and(d) objective function (e) and (f) and running time (g) and(h) ldquoSignRPrdquo denotes the proposed algorithm with randomsign matrix ldquoGaussRPrdquo denotes the FCM clustering withrandom Gaussian matrix ldquoFCMrdquo denotes the original FCMclustering algorithm and ldquoSVDrdquo denotes the FCM clusteringwith dimensionality reduction through SVD It should benoted that true XB value of FCM clustering in subfigure (d)is 403e + 12 not 0

From Figure 2 we can see that FCM clustering withrandom projection is clearly more efficient than the originalFCM clustering When number of dimensions 119905 is abovecertain bound the validity indices are nearly stable andsimilar to the ones of naive FCM clustering for both datasets This verifies the conclusion that ldquoaccuracy of clusteringalgorithm can be preserved when the dimensionality exceedsa certain boundrdquo The effectiveness for random projectionmethod is also verified by the small bound compared to thetotal dimensions (301000 for synthetic data and 30067500for ACT1 data) Besides the two different kinds of randomprojection methods have the similar impact on FCM cluster-ing because of the analogous plot

The higher objective function values and the smaller XBindices of SVD method for synthetic data set indicate thatthe generated clustering solution has better separation degreebetween clusters The external cluster validation indices alsoverify that SVD method has better clustering results forsynthetic data These observations state that SVD methodis more suitable for Gaussian mixture data sets than FCMclustering with randomprojection and naive FCM clustering

Although the SVDmethod has a higher FRI for syntheticdata set the random projection methods have analogousFRI values for ACT1 data set and better objective functionvalues for both data sets In addition the random projectionapproaches are obviously more efficient as the SVD needscubic time of dimensionality Hence these observationsindicate that our algorithm is quite encouraging in practice

Mathematical Problems in Engineering 9

10 20 30 40 50 60 70 80 90 100

065

07

075

08

085

09

095

1FR

I

SVDFCM

GaussRPSignRP

Number of dimensions t

(a) FRI versus number of dimensions

SVDFCM

GaussRPSignRP

100 200 300 400 500 600 700 800 900 100008998

09

09002

09004

09006

09008

0901

FRI

Number of dimensions t

(b) FRI versus number of dimensions

SVDFCM

GaussRPSignRP

10 20 30 40 50 60 70 80 90 1000

05

1

15

2

25

3

35

XB

Number of dimensions t

(c) FRI versus number of dimensions

SVDFCM

GaussRPSignRP

100 200 300 400 500 600 700 800 900 10000

2

4

6

8

10

12

14XB

(1015)

Number of dimensions t

(d) FRI versus number of dimensions

10 20 30 40 50 60 70 80 90 100

032

034

036

038

04

042

044

046

048

05

Obj

ectiv

e fun

ctio

n

Number of dimensions t

SVDFCM

GaussRPSignRP

(e) FRI versus number of dimensions

100 200 300 400 500 600 700 800 900 100000285

0029

00295

003

00305

0031

00315

0032

00325

Obj

ectiv

e fun

ctio

n

Number of dimensions t

SVDFCM

GaussRPSignRP

(f) FRI versus number of dimensions

Figure 2 Continued

10 Mathematical Problems in Engineering

24262830

Runn

ing

time

(s)

SVDFCM

10 20 30 40 50 60 70 80 90 1001618

222242628

332

Runn

ing

time (

s)

GaussRPSignRP

10 20 30 40 50 60 70 80 90 100Number of dimensions t

Number of dimensions t

(g) FRI versus number of dimensions

SVDFCM

GaussRPSignRP

820830840850

Runn

ing

time

(s)

100 200 300 400 500 600 700 800 900 1000

100 200 300 400 500 600 700 800 900 1000

0

2

4

6

8

10

12

14

Runn

ing

time (

s)

Number of dimensions t

Number of dimensions t

(h) FRI versus number of dimensions

Figure 2 Performance of clustering algorithms with different dimensionality

Table 1 CVNN indices for different ensemble approaches on ACT2 data

Dimension 119905 10 20 30 40 50 60 70 80 90 100EFCM-A 17315 17383 17449 17789 1819 183 17623 18182 18685 18067EFCM-C 17938 17558 17584 18351 18088 18353 18247 18385 18105 18381EFCM-S 13975 13144 12736 12974 13112 13643 13533 1409 13701 13765

54 Comparisons of Different Cluster Ensemble Methods Thecomparisons of different cluster ensemble approaches areshown in Figure 3 and Table 1 Similarly (a) and (c) of thefigure correspond to the synthetic data set and (b) and (d)corresponds to the ACT2 data set We use RI (a) and (b)and running time (c) and (d) to present the performanceof ensemble methods Meanwhile the meanings of EFCM-A EFCM-C and EFCM-S are identical to the ones inSection 42 In order to get crisp partition for EFCM-A andEFCM-C we used hierarchical clustering-complete linkagemethod after getting the distance matrix as in [21] Since allthree cluster ensemble methods get perfect partition resultson synthetic data set we only compare CVNN indices ofdifferent ensemble methods on ACT2 data set which ispresented in Table 1

In Figure 3 running time of our algorithm is shorterfor both data sets This verifies the result of time complexityanalysis for different algorithms in Section 42 The threecluster ensemble methods all get the perfect partition forsynthetic data set whereas our method is more accuratethan the other two methods for ACT2 data set The perfectpartition results suggest that all three ensemble methods aresuitable for Gaussian mixture data set However the almost18 improvement on RI for ACT2 data set should be due

to the different grouping ideas Our method is based on thegraph partition such that the edges between different clustershave low weight and the edges within a cluster have highweight This clustering way of spectral embedding is moresuitable for ACT2 data set In Table 1 the smaller values ofCVNN of our new method also show that new approach hasbetter partition results on ACT2 data set These observationsindicate that our algorithm has the advantage on efficiencyand adapts to a wider range of geometries

We also compare the stability for three ensemble meth-ods presented in Table 2 From the table we can see that thestandard deviation of RI about EFCM-S is a lower order ofmagnitude than the ones of the other methods Hence thisresult shows that our algorithm is more robust

Aiming at the situation of unknown clustersrsquo numberwe also varied the number of clusters 119888 in FCM clusteringand spectral embedding for our new method We denotethis version of new method as EFCM-SV Since the numberof random projections was set as 5 for ACT2 data set wechanged the clustersrsquo number from 17 to 21 as the input ofFCM clustering algorithm In addition we set the clustersrsquonumber from 14 to 24 as the input of spectral embeddingand applied CVNN to estimate the most plausible number ofclusters The experimental results are presented in Table 3

Mathematical Problems in Engineering 11

Table 2 Standard deviations of RI of 20 runs with different dimensions on ACT2 data

Dimension 119905 10 20 30 40 50 60 70 80 90 100EFCM-A 00222 00174 0018 00257 00171 00251 00188 00172 00218 00184EFCM-C 00217 00189 00128 00232 00192 00200 00175 00194 00151 00214EFCM-S 00044 00018 00029 00030 00028 00024 00026 00020 00024 00019

Table 3 RI values for EFCM-S and EFCM-Sv on ACT2 data

Dimension 119905 10 20 30 40 50 60 70 80 90 100EFCM-S 09227 0922 09223 0923 09215 09218 09226 09225 09231 09237EFCM-SV 09257 09257 09165 09257 0927 09165 09268 0927 09105 09245+CVNN 119888 = 185 119888 = 207 119888 = 194 119888 = 193 119888 = 193 119888 = 182 119888 = 192 119888 = 183 119888 = 194 119888 = 202

10 20 30 40 50 60 70 80 90 1000

02

04

06

08

1

12

14

16

18

2

RI

EFCM-AEFCM-CEFCM-S

Number of dimensions t

(a) RI versus number of dimensions

10 20 30 40 50 60 70 80 90 100074

076

078

08

082

084

086

088

09

092

094

RI

Number of dimensions t

EFCM-AEFCM-CEFCM-S

(b) RI versus number of dimensions

10 20 30 40 50 60 70 80 90 1000

10

20

30

40

50

60

70

Runn

ing

time (

s)

Number of dimensions t

EFCM-AEFCM-CEFCM-S

(c) Running time versus number of dimensions

10 20 30 40 50 60 70 80 90 1000

10

20

30

40

50

60

70

80

90

100

Runn

ing

time (

s)

Number of dimensions t

EFCM-AEFCM-CEFCM-S

(d) Running time versus number of dimensions

Figure 3 Performance of cluster ensemble approaches with different dimensionality

12 Mathematical Problems in Engineering

In Table 3 the values with respect to ldquoEFCM-SVrdquo are theaverage RI values with the estimated clustersrsquo numbers for20 individual runs The values of ldquo+CVNNrdquo are the averageclustersrsquo numbers decided by the CVNN cluster validityindex Using the estimated clustersrsquo numbers by CVNN ourmethod gets the similar results of ensemble method withcorrect clustersrsquo number In addition the average estimates ofclustersrsquo number are close to the true one This indicates thatour cluster ensemble method EFCM-SV is attractive whenthe number of clusters is unknown

6 Conclusion and Future Work

The ldquocurse of dimensionalityrdquo in big data gives new chal-lenges for clustering recently and feature extraction fordimensionality reduction is a popular way to deal with thesechallenges We studied the feature extraction method ofrandom projection for FCM clustering Through analyzingthe effects of random projection on the entire variabilityof data theoretically and verification both on syntheticand real world data empirically we designed an enhancedFCM clustering algorithm with random projection The newalgorithm can maintain nearly the same clustering solutionof preliminary FCM clustering and be more efficient thanfeature extraction method of SVD What is more we alsoproposed a cluster ensemble approach that is more applicableto large scale data sets than existing ones The new ensembleapproach can achieve spectral embedding efficiently fromSVD on the concatenation of membership matrices Theexperiments showed that the new ensemble method ranfaster had more robust partition solutions and fitted a widerrange of geometrical data sets

A future research content is to design the provablyaccurate feature extraction and feature selection methodsfor FCM clustering Another remaining question is thathow to choose proper number of random projections forcluster ensemble method in order to get a trade-off betweenclustering accuracy and efficiency

Competing Interests

The authors declare that they have no competing interests

Acknowledgments

This work was supported in part by the National KeyBasic Research Program (973 programme) under Grant2012CB315905 and in part by the National Nature ScienceFoundation of China under Grants 61502527 and 61379150and in part by the Open Foundation of State Key Laboratoryof Networking and Switching Technology (Beijing Universityof Posts and Telecommunications) (no SKLNST-2013-1-06)

References

[1] M Chen S Mao and Y Liu ldquoBig data a surveyrdquo MobileNetworks and Applications vol 19 no 2 pp 171ndash209 2014

[2] J Zhang X Tao and H Wang ldquoOutlier detection from largedistributed databasesrdquoWorld Wide Web vol 17 no 4 pp 539ndash568 2014

[3] C Ordonez N Mohanam and C Garcia-Alvarado ldquoPCA forlarge data sets with parallel data summarizationrdquo Distributedand Parallel Databases vol 32 no 3 pp 377ndash403 2014

[4] D-S Pham S Venkatesh M Lazarescu and S BudhadityaldquoAnomaly detection in large-scale data stream networksrdquo DataMining and Knowledge Discovery vol 28 no 1 pp 145ndash1892014

[5] F Murtagh and P Contreras ldquoRandom projection towardsthe baire metric for high dimensional clusteringrdquo in StatisticalLearning and Data Sciences pp 424ndash431 Springer BerlinGermany 2015

[6] T C Havens J C Bezdek C Leckie L O Hall and MPalaniswami ldquoFuzzy c-means algorithms for very large datardquoIEEETransactions on Fuzzy Systems vol 20 no 6 pp 1130ndash11462012

[7] J Han M Kamber and J Pei Data Mining Concepts andTechniques Concepts and Techniques Elsevier 2011

[8] S Khan G Situ K Decker and C J Schmidt ldquoGoFigureautomated gene ontology annotationrdquo Bioinformatics vol 19no 18 pp 2484ndash2485 2003

[9] S Gunnemann H Kremer D Lenhard and T Seidl ldquoSub-space clustering for indexing high dimensional data a mainmemory index based on local reductions and individual multi-representationsrdquo in Proceedings of the 14th International Confer-ence on Extending Database Technology (EDBT rsquo11) pp 237ndash248ACM Uppsala Sweden March 2011

[10] J C Bezdek R Ehrlich and W Full ldquoFCM the fuzzy c-meansclustering algorithmrdquo Computers amp Geosciences vol 10 no 2-3pp 191ndash203 1984

[11] R J Hathaway and J C Bezdek ldquoExtending fuzzy andprobabilistic clustering to very large data setsrdquo ComputationalStatistics amp Data Analysis vol 51 no 1 pp 215ndash234 2006

[12] P Hore L O Hall and D B Goldgof ldquoSingle pass fuzzy cmeansrdquo in Proceedings of the IEEE International Fuzzy SystemsConference (FUZZ rsquo07) pp 1ndash7 London UK July 2007

[13] P Hore L O Hall D B Goldgof Y Gu A A Maudsley andA Darkazanli ldquoA scalable framework for segmenting magneticresonance imagesrdquo Journal of Signal Processing Systems vol 54no 1ndash3 pp 183ndash203 2009

[14] W B Johnson and J Lindenstrauss ldquoExtensions of lipschitzmappings into aHilbert spacerdquoContemporaryMathematics vol26 pp 189ndash206 1984

[15] P Indyk and R Motwani ldquoApproximate nearest neighborstowards removing the curse of dimensionalityrdquo in Proceedingsof the 13th Annual ACM Symposium on Theory of Computingpp 604ndash613 ACM 1998

[16] D Achlioptas ldquoDatabase-friendly random projectionsJohnson-Lindenstrauss with binary coinsrdquo Journal of Computerand System Sciences vol 66 no 4 pp 671ndash687 2003

[17] C Boutsidis A Zouzias and P Drineas ldquoRandom projectionsfor k-means clusteringrdquo in Advances in Neural InformationProcessing Systems pp 298ndash306 MIT Press 2010

[18] C C Aggarwal and C K Reddy Data Clustering Algorithmsand Applications CRC Press New York NY USA 2013

[19] R Avogadri and G Valentini ldquoFuzzy ensemble clustering basedon random projections for DNA microarray data analysisrdquoArtificial Intelligence in Medicine vol 45 no 2-3 pp 173ndash1832009

Mathematical Problems in Engineering 13

[20] X Z Fern and C E Brodley ldquoRandom projection for highdimensional data clustering a cluster ensemble approachrdquo inProceedings of the 20th International Conference on MachineLearning (ICML rsquo03) vol 3 pp 186ndash193 August 2003

[21] M Popescu J Keller J Bezdek and A Zare ldquoRandomprojections fuzzy c-means (RPFCM) for big data clusteringrdquoin Proceedings of the IEEE International Conference on FuzzySystems (FUZZ-IEEE rsquo15) pp 1ndash6 Istanbul Turkey August 2015

[22] A Fahad N Alshatri Z Tari et al ldquoA survey of clusteringalgorithms for big data taxonomy and empirical analysisrdquo IEEETransactions on Emerging Topics in Computing vol 2 no 3 pp267ndash279 2014

[23] R A Johnson and D W Wichern Applied Multivariate Statis-tical Analysis vol 4 Pearson Prentice Hall Upper Saddle RiverNJ USA 6th edition 2007

[24] C Boutsidis A Zouzias M W Mahoney and P DrineasldquoRandomized dimensionality reduction for k-means cluster-ingrdquo IEEE Transactions on InformationTheory vol 61 no 2 pp1045ndash1062 2015

[25] X Chen and D Cai ldquoLarge scale spectral clustering withlandmark-based representationrdquo in Proceedings of the 25thAAAI Conference on Artificial Intelligence pp 313ndash318 2011

[26] D Cai and X Chen ldquoLarge scale spectral clustering vialandmark-based sparse representationrdquo IEEE Transactions onCybernetics vol 45 no 8 pp 1669ndash1680 2015

[27] G H Golub and C F Van Loan Matrix Computations vol 3JHU Press 2012

[28] U Maulik and S Bandyopadhyay ldquoPerformance evaluation ofsome clustering algorithms and validity indicesrdquo IEEE Transac-tions on Pattern Analysis and Machine Intelligence vol 24 no12 pp 1650ndash1654 2002

[29] W M Rand ldquoObjective criteria for the evaluation of clusteringmethodsrdquo Journal of the American Statistical Association vol66 no 336 pp 846ndash850 1971

[30] D T Anderson J C Bezdek M Popescu and J M KellerldquoComparing fuzzy probabilistic and possibilistic partitionsrdquoIEEE Transactions on Fuzzy Systems vol 18 no 5 pp 906ndash9182010

[31] X L Xie and G Beni ldquoA validity measure for fuzzy clusteringrdquoIEEE Transactions on Pattern Analysis andMachine Intelligencevol 13 no 8 pp 841ndash847 1991

[32] Y Liu Z LiH Xiong XGao JWu and SWu ldquoUnderstandingand enhancement of internal clustering validation measuresrdquoIEEE Transactions on Cybernetics vol 43 no 3 pp 982ndash9942013

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 7: Research Article Fuzzy -Means and Cluster Ensemble with Random Projection for Big Data ...downloads.hindawi.com/journals/mpe/2016/6529794.pdf · 2019. 7. 30. · Research Article

Mathematical Problems in Engineering 7

Original dataset

Randomprojection 1

Randomprojection 2

Randomprojection r

Generateddataset 1

Generateddataset 2

Generateddataset r

FCM clustering

Membershipmatrix 1

Consensus matrix

Final result

Membershipmatrix 1

Membershipmatrix 1

FCM clusteringFCM clustering

k-means

First c left singularvectors A

middot middot middot

middot middot middot

middot middot middot

middot middot middot

Figure 1 Framework of the new ensemble approach based on graph partition

set had 10000 data points with 1000 dimensions whichwere generated from 3 Gaussian mixtures in propor-tions (025 05 025) The means of components were(2 2 2)

1000 (0 0 0)

1000 and (minus2 minus2 minus2)

1000and

the standard deviations were (1 1 1)1000

(2 2 2)1000

and (3 3 3)

1000 The real data set is the daily and sports

activities data (ACT) published on UCI machine learningrepository (theACTdata set can be found at httparchiveicsuciedumldatasetsDaily+and+Sports+Activities)Thesearedata of 19 activities collected by 45 motion sensors in 5minutes at 25Hz sampling frequency Each activity wasperformed by 8 subjects in their own styles To get highdimensional data sets we treated 1 minute and 5 secondsof activity data as an instance respectively As a result wegot 760 times 67500 (ACT1) and 9120 times 5625 (ACT2) datamatrices whose rows were activity instances and columnswere features

For the parameters of FCM clustering we let 120576 = 10minus5 welet maximum iteration number be 100 we let fuzzy factor 119898be 2 and we let the number of clusters be 119888 = 3 for syntheticdata set and 119888 = 19 for ACT data sets We also normalizedthe objective function as objlowast = objX2

119865 where sdot

119865is

Frobenius norm of matrix [27] To minimize the influenceintroduced by different initializations we present the averagevalues of evaluation indices of 20 independent experiments

In order to compare different dimensionality reductionmethods for FCM clustering we initialized algorithms by

choosing 119888 points randomly as the cluster centers and madesure that every algorithm began with the same initializationIn addition we ran Algorithm 2 with 119905 = 10 20 100 forsynthetic data set and 119905 = 100 200 1000 for ACT1 dataset Two kinds of random projections (with random variablesfrom (5) in Lemma 2) were both tested for verifying theirfeasibility We also compared Algorithm 2 against anotherpopular method of dimensionality reductionmdashSVD Whatcalls for special attention is that the number of eigenvectorscorresponding to nonzero eigenvalues of ACT1 data is only760 so we just took 119905 = 100 200 700 on FCM clusteringwith SVD for ACT1 data set

Among comparisons of different cluster ensemble algo-rithms we set dimension number of projected data as 119905 =10 20 100 for both synthetic andACT2data sets In orderto meet 119888119903 ≪ 119899 for Algorithm 3 the number of randomprojection 119903 was set as 20 for the synthetic data set and 5 forthe ACT2 data set respectively

52 Evaluation Criteria For clustering algorithms clusteringvalidation and running time are two important indices forjudging their performances Clustering validation measuresevaluate the goodness of clustering results [28] and often canbe divided into two categories external clustering validationand internal clustering validation External validation mea-sures use external information such as the given class labelsto evaluate the goodness of solution output by a clustering

8 Mathematical Problems in Engineering

algorithm On the contrary internal measures are to evaluatethe clustering results using feature inherited from data setsIn this paper validity evaluation criteria used are rand indexand clustering validation index based on nearest neighborsfor crisp partition together with fuzzy rand index and Xie-Beni index for fuzzy partition Here rand index and fuzzyrand index are external validation measures whereas theclustering validation index based on nearest neighbors indexand Xie-Beni index are internal validation measures

(1) Rand Index (RI) [29] RI describes the similarity ofclustering solution and correct labels through pairs of pointsIt takes into account the numbers of point pairs that are in thesame and different clusters The RI is defined as

RI =11989911+ 11989900

1198622

119899

(19)

where 11989911

is the number of pairs of points that exist in thesame cluster in both clustering result and given class labels11989900is the number of pairs of points that are in different subsets

for both clustering result and given class labels and1198622119899equals

119899(119899 minus 1)2 The value of RI ranges from 0 to 1 and the highervalue implies the better clustering solution

(2) Fuzzy Rand Index (FRI) [30] FRI is a generalizationof RI with respect to soft partition It also measures theproportion of pairs of points which exist in the same anddifferent clusters in both clustering solution and true classlabels It needs to compute the analogous 119899

11and 11989900through

contingency table described in [30] Therefore the range ofFRI is also [0 1] and the larger value means more accuratecluster solution

(3) Xie-Beni Index (XB) [31] XB takes the minimum squaredistance between cluster centers as the separation of thepartition and the average square fuzzy deviation of datapoints as the compactness of the partition XB is calculatedas follows

XB =sum119888

119894=1sum119899

119895=1119906119898

119894119895

10038171003817100381710038171003817x119895minus k119894

10038171003817100381710038171003817

2

119899 sdotmin119894119895

10038171003817100381710038171003817k119894minus k119895

10038171003817100381710038171003817

2 (20)

where sum119888119894=1sum119899

119895=1119906119898

119894119895x119895minus k1198942 is just the objective function of

FCM clustering and k119894is the center of cluster 119894 The smallest

XB indicates the optimal cluster partition

(4) Clustering Validation Index Based on Nearest Neighbors (CVNN) [32]. The separation of CVNN reflects how the objects of each cluster are situated with respect to their nearest neighbors, and the compactness is the mean pairwise distance between objects in the same cluster. CVNN is computed as follows:

$$\mathrm{CVNN}(c, k) = \frac{\mathrm{Sep}(c, k)}{\max_{c_{\min} \le c \le c_{\max}} \mathrm{Sep}(c, k)} + \frac{\mathrm{Com}(c)}{\max_{c_{\min} \le c \le c_{\max}} \mathrm{Com}(c)}, \tag{21}$$

where $\mathrm{Sep}(c, k) = \max_{i=1,2,\ldots,c} \left( (1/n_i) \sum_{j=1}^{n_i} (q_j / k) \right)$ and $\mathrm{Com}(c) = \sum_{i=1}^{c} \left( (2/(n_i(n_i-1))) \sum_{x,y \in \mathrm{Clu}_i} d(x, y) \right)$. Here $c$ is the number of clusters in the partition result, $c_{\max}$ is the maximum cluster number given, $c_{\min}$ is the minimum cluster number given, $k$ is the number of nearest neighbors, $n_i$ is the number of objects in the $i$th cluster $\mathrm{Clu}_i$, $q_j$ denotes the number of nearest neighbors of $\mathrm{Clu}_i$'s $j$th object which are not in $\mathrm{Clu}_i$, and $d(x, y)$ denotes the distance between $x$ and $y$. A lower CVNN value indicates a better clustering solution.
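The raw Sep and Com terms of (21) can be sketched as follows (Python with SciPy; the helper and its defaults are ours, and the normalisation over the range c_min..c_max is left to the caller):

```python
import numpy as np
from scipy.spatial.distance import cdist

def cvnn_terms(X, labels, k=10):
    """Raw separation and compactness terms of CVNN for one crisp partition."""
    labels = np.asarray(labels)
    D = cdist(X, X)                               # pairwise distances
    # indices of the k nearest neighbours of every object (excluding itself)
    nn = np.argsort(D, axis=1)[:, 1:k + 1]
    sep_per_cluster = []
    com_total = 0.0
    for c_id in np.unique(labels):
        idx = np.flatnonzero(labels == c_id)
        n_i = idx.size
        # q_j: how many of object j's k nearest neighbours lie outside its cluster
        q = (labels[nn[idx]] != c_id).sum(axis=1)
        sep_per_cluster.append((q / k).mean())
        if n_i > 1:
            # mean pairwise distance inside the cluster, as in Com(c)
            com_total += 2.0 / (n_i * (n_i - 1)) * np.triu(D[np.ix_(idx, idx)], 1).sum()
    return max(sep_per_cluster), com_total        # Sep(c, k), Com(c)
```

A caller would evaluate these two terms for every candidate cluster number in $[c_{\min}, c_{\max}]$, divide each term by its maximum over the candidates, and sum the two ratios to obtain $\mathrm{CVNN}(c, k)$.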

The objective function is a special validity criterion for the FCM clustering algorithm: a smaller objective function indicates that the points inside the clusters are more "similar".

Running time is also an important evaluation criterion, often related to the scalability of an algorithm. One main target of random projection for dimensionality reduction is to decrease the running time and enhance the applicability of an algorithm in the context of big data.

5.3. Performance of FCM Clustering with Random Projection. The experimental results for FCM clustering with random projection are presented in Figure 2, where (a), (c), (e), and (g) correspond to the synthetic data set and (b), (d), (f), and (h) correspond to the ACT1 data set. The evaluation criteria used to assess the proposed algorithms are FRI ((a) and (b)), XB ((c) and (d)), objective function ((e) and (f)), and running time ((g) and (h)). "SignRP" denotes the proposed algorithm with a random sign matrix, "GaussRP" denotes FCM clustering with a random Gaussian matrix, "FCM" denotes the original FCM clustering algorithm, and "SVD" denotes FCM clustering with dimensionality reduction through SVD. It should be noted that the true XB value of FCM clustering in subfigure (d) is 4.03e+12, not 0.
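For reference, the two projection variants compared here can be generated as in the following sketch (Python/NumPy; the function, its parameters, and the 1/sqrt(t) scaling follow the standard constructions of [15, 16] and are our illustration, not code from the paper):

```python
import numpy as np

def project(X, t, kind="sign", seed=0):
    """Reduce n x d data X to n x t with a random Gaussian or sign matrix."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    if kind == "gauss":
        R = rng.standard_normal((d, t))             # GaussRP
    else:
        R = rng.choice([-1.0, 1.0], size=(d, t))    # SignRP
    # scaling by 1/sqrt(t) keeps pairwise distances roughly unchanged
    return X @ (R / np.sqrt(t))
```

FCM would then be run on the projected n x t data, for instance with an off-the-shelf implementation such as scikit-fuzzy's cmeans.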

From Figure 2 we can see that FCM clustering with random projection is clearly more efficient than the original FCM clustering. When the number of dimensions $t$ is above a certain bound, the validity indices are nearly stable and similar to those of naive FCM clustering for both data sets. This verifies the conclusion that the accuracy of the clustering algorithm can be preserved when the reduced dimensionality exceeds a certain bound. The effectiveness of the random projection method is also confirmed by how small this bound is compared to the total number of dimensions (30/1000 for the synthetic data and 300/67500 for the ACT1 data). Besides, the two different kinds of random projection methods have a similar impact on FCM clustering, as their plots are analogous.

The higher objective function values and the smaller XB indices of the SVD method for the synthetic data set indicate that the generated clustering solution has a better degree of separation between clusters. The external cluster validation indices also confirm that the SVD method yields better clustering results for the synthetic data. These observations suggest that the SVD method is more suitable for Gaussian mixture data sets than FCM clustering with random projection and naive FCM clustering.

Although the SVD method has a higher FRI for the synthetic data set, the random projection methods have analogous FRI values for the ACT1 data set and better objective function values for both data sets. In addition, the random projection approaches are obviously more efficient, since SVD requires time cubic in the dimensionality. Hence, these observations indicate that our algorithm is quite encouraging in practice.
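The efficiency gap can be illustrated with a rough, self-contained timing comparison (sizes are arbitrary and a full SVD stands in for the SVD-based reduction; this is an assumption-laden sketch, not the paper's benchmark):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 5000))   # n points in d dimensions
t = 50

t0 = time.perf_counter()
Xr = X @ (rng.choice([-1.0, 1.0], size=(5000, t)) / np.sqrt(t))   # random sign projection
t1 = time.perf_counter()
_, _, Vt = np.linalg.svd(X, full_matrices=False)                  # SVD-based reduction
Xs = X @ Vt[:t].T                                                 # project onto top-t right singular vectors
t2 = time.perf_counter()
print(f"RP: {t1 - t0:.2f}s, SVD: {t2 - t1:.2f}s")
```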

Figure 2: Performance of clustering algorithms with different dimensionality. Panels (a)/(b) plot FRI, (c)/(d) XB, (e)/(f) the objective function, and (g)/(h) the running time against the number of dimensions $t$; the left column corresponds to the synthetic data set and the right column to ACT1, with curves for SignRP, GaussRP, FCM, and SVD.

Table 1: CVNN indices for different ensemble approaches on ACT2 data.

Dimension t    10      20      30      40      50      60      70      80      90      100
EFCM-A         1.7315  1.7383  1.7449  1.7789  1.819   1.83    1.7623  1.8182  1.8685  1.8067
EFCM-C         1.7938  1.7558  1.7584  1.8351  1.8088  1.8353  1.8247  1.8385  1.8105  1.8381
EFCM-S         1.3975  1.3144  1.2736  1.2974  1.3112  1.3643  1.3533  1.409   1.3701  1.3765

5.4. Comparisons of Different Cluster Ensemble Methods. The comparisons of different cluster ensemble approaches are shown in Figure 3 and Table 1. Similarly, (a) and (c) of the figure correspond to the synthetic data set, and (b) and (d) correspond to the ACT2 data set. We use RI ((a) and (b)) and running time ((c) and (d)) to present the performance of the ensemble methods. The meanings of EFCM-A, EFCM-C, and EFCM-S are identical to the ones in Section 4.2. In order to get a crisp partition for EFCM-A and EFCM-C, we used the hierarchical clustering (complete linkage) method after obtaining the distance matrix, as in [21]; a sketch of this step is given below. Since all three cluster ensemble methods obtain perfect partition results on the synthetic data set, we only compare the CVNN indices of the different ensemble methods on the ACT2 data set, which are presented in Table 1.
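The crisp-partition step for EFCM-A and EFCM-C can be sketched with standard SciPy calls (the helper name and the assumption of a precomputed n x n distance matrix D are ours):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def crisp_partition(D, n_clusters):
    """Cut a complete-linkage dendrogram built from a precomputed
    n x n distance matrix D into the requested number of clusters."""
    condensed = squareform(D, checks=False)       # condensed upper-triangular form
    Z = linkage(condensed, method="complete")     # complete-linkage hierarchy
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```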

In Figure 3, the running time of our algorithm is shorter for both data sets. This verifies the result of the time complexity analysis for the different algorithms in Section 4.2. The three cluster ensemble methods all obtain the perfect partition for the synthetic data set, whereas our method is more accurate than the other two methods for the ACT2 data set. The perfect partition results suggest that all three ensemble methods are suitable for Gaussian mixture data sets. However, the almost 18% improvement in RI for the ACT2 data set should be attributed to the different grouping ideas: our method is based on a graph partition such that the edges between different clusters have low weight and the edges within a cluster have high weight, and this clustering behaviour of the spectral embedding is more suitable for the ACT2 data set. In Table 1, the smaller CVNN values of our new method also show that the new approach produces better partition results on the ACT2 data set. These observations indicate that our algorithm has an advantage in efficiency and adapts to a wider range of geometries.
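To make the grouping idea concrete, the following is a minimal sketch of such a spectral-embedding aggregation (our own illustration under simplifying assumptions, not the paper's exact EFCM-S algorithm): the fuzzy membership matrices from the r projected FCM runs are stacked, the top left singular vectors of the stacked matrix serve as the embedding, and a final k-means produces the crisp partition.

```python
import numpy as np
from sklearn.cluster import KMeans

def ensemble_spectral_embedding(memberships, n_clusters, seed=0):
    """Stack the fuzzy membership matrices (each c x n) column-wise,
    take the top left singular vectors of the n x (r*c) stack as a
    spectral embedding, and cluster the embedding with k-means."""
    B = np.hstack([U.T for U in memberships])      # n x (r*c)
    Uo, _, _ = np.linalg.svd(B, full_matrices=False)
    embedding = Uo[:, :n_clusters]                 # n x c spectral embedding
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return km.fit_predict(embedding)
```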

We also compare the stability of the three ensemble methods, as presented in Table 2. From the table we can see that the standard deviation of RI for EFCM-S is an order of magnitude lower than those of the other methods. Hence, this result shows that our algorithm is more robust.

Aiming at the situation where the number of clusters is unknown, we also varied the number of clusters $c$ in the FCM clustering and spectral embedding steps of our new method. We denote this version of the new method as EFCM-SV. Since the number of random projections was set to 5 for the ACT2 data set, we changed the cluster number from 17 to 21 as the input of the FCM clustering algorithm. In addition, we set the cluster number from 14 to 24 as the input of the spectral embedding and applied CVNN to estimate the most plausible number of clusters. The experimental results are presented in Table 3.
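The CVNN-driven choice of the cluster number in EFCM-SV can be sketched as a simple search over candidate partitions, reusing the cvnn_terms helper sketched earlier (names and the exact normalisation are our assumptions):

```python
import numpy as np

def select_c_by_cvnn(X, partitions, k=10):
    """Pick the candidate partition with the smallest CVNN index;
    `partitions` maps a cluster number c -> crisp label vector."""
    cands = sorted(partitions)
    seps, coms = zip(*(cvnn_terms(X, partitions[c], k) for c in cands))
    # normalise each term by its maximum over the candidates, then sum
    cvnn = np.array(seps) / max(seps) + np.array(coms) / max(coms)
    best = cands[int(np.argmin(cvnn))]
    return best, partitions[best]
```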


Table 2: Standard deviations of RI over 20 runs with different dimensions on ACT2 data.

Dimension t    10      20      30      40      50      60      70      80      90      100
EFCM-A         0.0222  0.0174  0.018   0.0257  0.0171  0.0251  0.0188  0.0172  0.0218  0.0184
EFCM-C         0.0217  0.0189  0.0128  0.0232  0.0192  0.0200  0.0175  0.0194  0.0151  0.0214
EFCM-S         0.0044  0.0018  0.0029  0.0030  0.0028  0.0024  0.0026  0.0020  0.0024  0.0019

Table 3: RI values for EFCM-S and EFCM-SV on ACT2 data.

Dimension t    10        20        30        40        50        60        70        80        90        100
EFCM-S         0.9227    0.922     0.9223    0.923     0.9215    0.9218    0.9226    0.9225    0.9231    0.9237
EFCM-SV        0.9257    0.9257    0.9165    0.9257    0.927     0.9165    0.9268    0.927     0.9105    0.9245
+CVNN          c = 18.5  c = 20.7  c = 19.4  c = 19.3  c = 19.3  c = 18.2  c = 19.2  c = 18.3  c = 19.4  c = 20.2

Figure 3: Performance of cluster ensemble approaches with different dimensionality. Panels (a)/(b) plot RI and (c)/(d) running time against the number of dimensions $t$; the left column corresponds to the synthetic data set and the right column to ACT2, with curves for EFCM-A, EFCM-C, and EFCM-S.


In Table 3, the values with respect to "EFCM-SV" are the average RI values obtained with the estimated cluster numbers over 20 individual runs. The values of "+CVNN" are the average cluster numbers decided by the CVNN cluster validity index. Using the cluster numbers estimated by CVNN, our method obtains results similar to those of the ensemble method with the correct cluster number. In addition, the average estimates of the cluster number are close to the true one. This indicates that our cluster ensemble method EFCM-SV is attractive when the number of clusters is unknown.

6. Conclusion and Future Work

The "curse of dimensionality" in big data has recently posed new challenges for clustering, and feature extraction for dimensionality reduction is a popular way to deal with these challenges. We studied the feature extraction method of random projection for FCM clustering. By analyzing the effects of random projection on the entire variability of the data theoretically, and verifying them empirically on both synthetic and real-world data, we designed an enhanced FCM clustering algorithm with random projection. The new algorithm maintains nearly the same clustering solution as the preliminary FCM clustering while being more efficient than the feature extraction method based on SVD. Moreover, we also proposed a cluster ensemble approach that is more applicable to large-scale data sets than existing ones. The new ensemble approach achieves the spectral embedding efficiently from an SVD of the concatenation of membership matrices. The experiments showed that the new ensemble method runs faster, produces more robust partition solutions, and fits a wider range of data geometries.

One direction for future research is to design provably accurate feature extraction and feature selection methods for FCM clustering. Another open question is how to choose a proper number of random projections for the cluster ensemble method in order to obtain a trade-off between clustering accuracy and efficiency.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

This work was supported in part by the National Key Basic Research Program (973 Program) under Grant 2012CB315905, in part by the National Natural Science Foundation of China under Grants 61502527 and 61379150, and in part by the Open Foundation of the State Key Laboratory of Networking and Switching Technology (Beijing University of Posts and Telecommunications) (no. SKLNST-2013-1-06).

References

[1] M. Chen, S. Mao, and Y. Liu, "Big data: a survey," Mobile Networks and Applications, vol. 19, no. 2, pp. 171–209, 2014.

[2] J. Zhang, X. Tao, and H. Wang, "Outlier detection from large distributed databases," World Wide Web, vol. 17, no. 4, pp. 539–568, 2014.

[3] C. Ordonez, N. Mohanam, and C. Garcia-Alvarado, "PCA for large data sets with parallel data summarization," Distributed and Parallel Databases, vol. 32, no. 3, pp. 377–403, 2014.

[4] D.-S. Pham, S. Venkatesh, M. Lazarescu, and S. Budhaditya, "Anomaly detection in large-scale data stream networks," Data Mining and Knowledge Discovery, vol. 28, no. 1, pp. 145–189, 2014.

[5] F. Murtagh and P. Contreras, "Random projection towards the Baire metric for high dimensional clustering," in Statistical Learning and Data Sciences, pp. 424–431, Springer, Berlin, Germany, 2015.

[6] T. C. Havens, J. C. Bezdek, C. Leckie, L. O. Hall, and M. Palaniswami, "Fuzzy c-means algorithms for very large data," IEEE Transactions on Fuzzy Systems, vol. 20, no. 6, pp. 1130–1146, 2012.

[7] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Elsevier, 2011.

[8] S. Khan, G. Situ, K. Decker, and C. J. Schmidt, "GoFigure: automated gene ontology annotation," Bioinformatics, vol. 19, no. 18, pp. 2484–2485, 2003.

[9] S. Gunnemann, H. Kremer, D. Lenhard, and T. Seidl, "Subspace clustering for indexing high dimensional data: a main memory index based on local reductions and individual multi-representations," in Proceedings of the 14th International Conference on Extending Database Technology (EDBT '11), pp. 237–248, ACM, Uppsala, Sweden, March 2011.

[10] J. C. Bezdek, R. Ehrlich, and W. Full, "FCM: the fuzzy c-means clustering algorithm," Computers & Geosciences, vol. 10, no. 2-3, pp. 191–203, 1984.

[11] R. J. Hathaway and J. C. Bezdek, "Extending fuzzy and probabilistic clustering to very large data sets," Computational Statistics & Data Analysis, vol. 51, no. 1, pp. 215–234, 2006.

[12] P. Hore, L. O. Hall, and D. B. Goldgof, "Single pass fuzzy c means," in Proceedings of the IEEE International Fuzzy Systems Conference (FUZZ '07), pp. 1–7, London, UK, July 2007.

[13] P. Hore, L. O. Hall, D. B. Goldgof, Y. Gu, A. A. Maudsley, and A. Darkazanli, "A scalable framework for segmenting magnetic resonance images," Journal of Signal Processing Systems, vol. 54, no. 1–3, pp. 183–203, 2009.

[14] W. B. Johnson and J. Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space," Contemporary Mathematics, vol. 26, pp. 189–206, 1984.

[15] P. Indyk and R. Motwani, "Approximate nearest neighbors: towards removing the curse of dimensionality," in Proceedings of the 13th Annual ACM Symposium on Theory of Computing, pp. 604–613, ACM, 1998.

[16] D. Achlioptas, "Database-friendly random projections: Johnson-Lindenstrauss with binary coins," Journal of Computer and System Sciences, vol. 66, no. 4, pp. 671–687, 2003.

[17] C. Boutsidis, A. Zouzias, and P. Drineas, "Random projections for k-means clustering," in Advances in Neural Information Processing Systems, pp. 298–306, MIT Press, 2010.

[18] C. C. Aggarwal and C. K. Reddy, Data Clustering: Algorithms and Applications, CRC Press, New York, NY, USA, 2013.

[19] R. Avogadri and G. Valentini, "Fuzzy ensemble clustering based on random projections for DNA microarray data analysis," Artificial Intelligence in Medicine, vol. 45, no. 2-3, pp. 173–183, 2009.

[20] X. Z. Fern and C. E. Brodley, "Random projection for high dimensional data clustering: a cluster ensemble approach," in Proceedings of the 20th International Conference on Machine Learning (ICML '03), vol. 3, pp. 186–193, August 2003.

[21] M. Popescu, J. Keller, J. Bezdek, and A. Zare, "Random projections fuzzy c-means (RPFCM) for big data clustering," in Proceedings of the IEEE International Conference on Fuzzy Systems (FUZZ-IEEE '15), pp. 1–6, Istanbul, Turkey, August 2015.

[22] A. Fahad, N. Alshatri, Z. Tari, et al., "A survey of clustering algorithms for big data: taxonomy and empirical analysis," IEEE Transactions on Emerging Topics in Computing, vol. 2, no. 3, pp. 267–279, 2014.

[23] R. A. Johnson and D. W. Wichern, Applied Multivariate Statistical Analysis, vol. 4, Pearson Prentice Hall, Upper Saddle River, NJ, USA, 6th edition, 2007.

[24] C. Boutsidis, A. Zouzias, M. W. Mahoney, and P. Drineas, "Randomized dimensionality reduction for k-means clustering," IEEE Transactions on Information Theory, vol. 61, no. 2, pp. 1045–1062, 2015.

[25] X. Chen and D. Cai, "Large scale spectral clustering with landmark-based representation," in Proceedings of the 25th AAAI Conference on Artificial Intelligence, pp. 313–318, 2011.

[26] D. Cai and X. Chen, "Large scale spectral clustering via landmark-based sparse representation," IEEE Transactions on Cybernetics, vol. 45, no. 8, pp. 1669–1680, 2015.

[27] G. H. Golub and C. F. Van Loan, Matrix Computations, vol. 3, JHU Press, 2012.

[28] U. Maulik and S. Bandyopadhyay, "Performance evaluation of some clustering algorithms and validity indices," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 12, pp. 1650–1654, 2002.

[29] W. M. Rand, "Objective criteria for the evaluation of clustering methods," Journal of the American Statistical Association, vol. 66, no. 336, pp. 846–850, 1971.

[30] D. T. Anderson, J. C. Bezdek, M. Popescu, and J. M. Keller, "Comparing fuzzy, probabilistic, and possibilistic partitions," IEEE Transactions on Fuzzy Systems, vol. 18, no. 5, pp. 906–918, 2010.

[31] X. L. Xie and G. Beni, "A validity measure for fuzzy clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 8, pp. 841–847, 1991.

[32] Y. Liu, Z. Li, H. Xiong, X. Gao, J. Wu, and S. Wu, "Understanding and enhancement of internal clustering validation measures," IEEE Transactions on Cybernetics, vol. 43, no. 3, pp. 982–994, 2013.

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 8: Research Article Fuzzy -Means and Cluster Ensemble with Random Projection for Big Data ...downloads.hindawi.com/journals/mpe/2016/6529794.pdf · 2019. 7. 30. · Research Article

8 Mathematical Problems in Engineering

algorithm On the contrary internal measures are to evaluatethe clustering results using feature inherited from data setsIn this paper validity evaluation criteria used are rand indexand clustering validation index based on nearest neighborsfor crisp partition together with fuzzy rand index and Xie-Beni index for fuzzy partition Here rand index and fuzzyrand index are external validation measures whereas theclustering validation index based on nearest neighbors indexand Xie-Beni index are internal validation measures

(1) Rand Index (RI) [29] RI describes the similarity ofclustering solution and correct labels through pairs of pointsIt takes into account the numbers of point pairs that are in thesame and different clusters The RI is defined as

RI =11989911+ 11989900

1198622

119899

(19)

where 11989911

is the number of pairs of points that exist in thesame cluster in both clustering result and given class labels11989900is the number of pairs of points that are in different subsets

for both clustering result and given class labels and1198622119899equals

119899(119899 minus 1)2 The value of RI ranges from 0 to 1 and the highervalue implies the better clustering solution

(2) Fuzzy Rand Index (FRI) [30] FRI is a generalizationof RI with respect to soft partition It also measures theproportion of pairs of points which exist in the same anddifferent clusters in both clustering solution and true classlabels It needs to compute the analogous 119899

11and 11989900through

contingency table described in [30] Therefore the range ofFRI is also [0 1] and the larger value means more accuratecluster solution

(3) Xie-Beni Index (XB) [31] XB takes the minimum squaredistance between cluster centers as the separation of thepartition and the average square fuzzy deviation of datapoints as the compactness of the partition XB is calculatedas follows

XB =sum119888

119894=1sum119899

119895=1119906119898

119894119895

10038171003817100381710038171003817x119895minus k119894

10038171003817100381710038171003817

2

119899 sdotmin119894119895

10038171003817100381710038171003817k119894minus k119895

10038171003817100381710038171003817

2 (20)

where sum119888119894=1sum119899

119895=1119906119898

119894119895x119895minus k1198942 is just the objective function of

FCM clustering and k119894is the center of cluster 119894 The smallest

XB indicates the optimal cluster partition

(4) Clustering Validation Index Based on Nearest Neighbors(CVNN) [32] The separation of CVNN is about the situationof objects that have geometrical information of each clusterand the compactness is the mean pairwise distance betweenobjects in the same cluster CVNN is computed as follows

CVNN (119888 119896) =Sep (119888 119896)

(max119888 minle119888le119888 maxSep (119888 119896))

+

Com (119888)(max119888 minle119888le119888 maxCom (119888))

(21)

where Sep(119888 119896) = max119894=12119888

((1119899119894) sdot sum

119899119894

119895=1(119902119895119896)) and

Com(119888) = sum119888119894=1((2119899119894(119899119894minus1)) sdotsum

119909119910isinClu119894 119889(119909 119910)) Here 119888 is the

number of clusters in partition result 119888 max is the maximumcluster number given 119888 min is the minimum cluster numbergiven 119896 is the number of nearest neighbors 119899

119894is the number

of objects in the 119894th cluster Clu119894 119902119895denotes the number of

nearest neighbors of Clu119894rsquos 119895th object which are not in Clu

119894

and 119889(119909 119910) denotes the distance between 119909 and 119910 The lowerCVNN value indicates the better clustering solution

Objective function is a special evaluation criterion ofvalidity for FCM clustering algorithm The smaller objectivefunction indicates that the points inside clusters are moreldquosimilarrdquo

Running time is also an important evaluation criterionoften related to the scalability of algorithm One maintarget of random projection for dimensionality reductionis to decrease the runtime and enhance the applicability ofalgorithm in the context of big data

53 Performance of FCM Clustering with Random ProjectionThe experimental results about FCM clustering with randomprojection are presented in Figure 2 where (a) (c) (e) and (g)correspond to the synthetic data set and (b) (d) (f) and (h)correspond to the ACT1 data setThe evaluation criteria usedto assess proposed algorithms are FRI (a) and (b) XB (c) and(d) objective function (e) and (f) and running time (g) and(h) ldquoSignRPrdquo denotes the proposed algorithm with randomsign matrix ldquoGaussRPrdquo denotes the FCM clustering withrandom Gaussian matrix ldquoFCMrdquo denotes the original FCMclustering algorithm and ldquoSVDrdquo denotes the FCM clusteringwith dimensionality reduction through SVD It should benoted that true XB value of FCM clustering in subfigure (d)is 403e + 12 not 0

From Figure 2 we can see that FCM clustering withrandom projection is clearly more efficient than the originalFCM clustering When number of dimensions 119905 is abovecertain bound the validity indices are nearly stable andsimilar to the ones of naive FCM clustering for both datasets This verifies the conclusion that ldquoaccuracy of clusteringalgorithm can be preserved when the dimensionality exceedsa certain boundrdquo The effectiveness for random projectionmethod is also verified by the small bound compared to thetotal dimensions (301000 for synthetic data and 30067500for ACT1 data) Besides the two different kinds of randomprojection methods have the similar impact on FCM cluster-ing because of the analogous plot

The higher objective function values and the smaller XBindices of SVD method for synthetic data set indicate thatthe generated clustering solution has better separation degreebetween clusters The external cluster validation indices alsoverify that SVD method has better clustering results forsynthetic data These observations state that SVD methodis more suitable for Gaussian mixture data sets than FCMclustering with randomprojection and naive FCM clustering

Although the SVDmethod has a higher FRI for syntheticdata set the random projection methods have analogousFRI values for ACT1 data set and better objective functionvalues for both data sets In addition the random projectionapproaches are obviously more efficient as the SVD needscubic time of dimensionality Hence these observationsindicate that our algorithm is quite encouraging in practice

Mathematical Problems in Engineering 9

10 20 30 40 50 60 70 80 90 100

065

07

075

08

085

09

095

1FR

I

SVDFCM

GaussRPSignRP

Number of dimensions t

(a) FRI versus number of dimensions

SVDFCM

GaussRPSignRP

100 200 300 400 500 600 700 800 900 100008998

09

09002

09004

09006

09008

0901

FRI

Number of dimensions t

(b) FRI versus number of dimensions

SVDFCM

GaussRPSignRP

10 20 30 40 50 60 70 80 90 1000

05

1

15

2

25

3

35

XB

Number of dimensions t

(c) FRI versus number of dimensions

SVDFCM

GaussRPSignRP

100 200 300 400 500 600 700 800 900 10000

2

4

6

8

10

12

14XB

(1015)

Number of dimensions t

(d) FRI versus number of dimensions

10 20 30 40 50 60 70 80 90 100

032

034

036

038

04

042

044

046

048

05

Obj

ectiv

e fun

ctio

n

Number of dimensions t

SVDFCM

GaussRPSignRP

(e) FRI versus number of dimensions

100 200 300 400 500 600 700 800 900 100000285

0029

00295

003

00305

0031

00315

0032

00325

Obj

ectiv

e fun

ctio

n

Number of dimensions t

SVDFCM

GaussRPSignRP

(f) FRI versus number of dimensions

Figure 2 Continued

10 Mathematical Problems in Engineering

24262830

Runn

ing

time

(s)

SVDFCM

10 20 30 40 50 60 70 80 90 1001618

222242628

332

Runn

ing

time (

s)

GaussRPSignRP

10 20 30 40 50 60 70 80 90 100Number of dimensions t

Number of dimensions t

(g) FRI versus number of dimensions

SVDFCM

GaussRPSignRP

820830840850

Runn

ing

time

(s)

100 200 300 400 500 600 700 800 900 1000

100 200 300 400 500 600 700 800 900 1000

0

2

4

6

8

10

12

14

Runn

ing

time (

s)

Number of dimensions t

Number of dimensions t

(h) FRI versus number of dimensions

Figure 2 Performance of clustering algorithms with different dimensionality

Table 1 CVNN indices for different ensemble approaches on ACT2 data

Dimension 119905 10 20 30 40 50 60 70 80 90 100EFCM-A 17315 17383 17449 17789 1819 183 17623 18182 18685 18067EFCM-C 17938 17558 17584 18351 18088 18353 18247 18385 18105 18381EFCM-S 13975 13144 12736 12974 13112 13643 13533 1409 13701 13765

54 Comparisons of Different Cluster Ensemble Methods Thecomparisons of different cluster ensemble approaches areshown in Figure 3 and Table 1 Similarly (a) and (c) of thefigure correspond to the synthetic data set and (b) and (d)corresponds to the ACT2 data set We use RI (a) and (b)and running time (c) and (d) to present the performanceof ensemble methods Meanwhile the meanings of EFCM-A EFCM-C and EFCM-S are identical to the ones inSection 42 In order to get crisp partition for EFCM-A andEFCM-C we used hierarchical clustering-complete linkagemethod after getting the distance matrix as in [21] Since allthree cluster ensemble methods get perfect partition resultson synthetic data set we only compare CVNN indices ofdifferent ensemble methods on ACT2 data set which ispresented in Table 1

In Figure 3 running time of our algorithm is shorterfor both data sets This verifies the result of time complexityanalysis for different algorithms in Section 42 The threecluster ensemble methods all get the perfect partition forsynthetic data set whereas our method is more accuratethan the other two methods for ACT2 data set The perfectpartition results suggest that all three ensemble methods aresuitable for Gaussian mixture data set However the almost18 improvement on RI for ACT2 data set should be due

to the different grouping ideas Our method is based on thegraph partition such that the edges between different clustershave low weight and the edges within a cluster have highweight This clustering way of spectral embedding is moresuitable for ACT2 data set In Table 1 the smaller values ofCVNN of our new method also show that new approach hasbetter partition results on ACT2 data set These observationsindicate that our algorithm has the advantage on efficiencyand adapts to a wider range of geometries

We also compare the stability for three ensemble meth-ods presented in Table 2 From the table we can see that thestandard deviation of RI about EFCM-S is a lower order ofmagnitude than the ones of the other methods Hence thisresult shows that our algorithm is more robust

Aiming at the situation of unknown clustersrsquo numberwe also varied the number of clusters 119888 in FCM clusteringand spectral embedding for our new method We denotethis version of new method as EFCM-SV Since the numberof random projections was set as 5 for ACT2 data set wechanged the clustersrsquo number from 17 to 21 as the input ofFCM clustering algorithm In addition we set the clustersrsquonumber from 14 to 24 as the input of spectral embeddingand applied CVNN to estimate the most plausible number ofclusters The experimental results are presented in Table 3

Mathematical Problems in Engineering 11

Table 2 Standard deviations of RI of 20 runs with different dimensions on ACT2 data

Dimension 119905 10 20 30 40 50 60 70 80 90 100EFCM-A 00222 00174 0018 00257 00171 00251 00188 00172 00218 00184EFCM-C 00217 00189 00128 00232 00192 00200 00175 00194 00151 00214EFCM-S 00044 00018 00029 00030 00028 00024 00026 00020 00024 00019

Table 3 RI values for EFCM-S and EFCM-Sv on ACT2 data

Dimension 119905 10 20 30 40 50 60 70 80 90 100EFCM-S 09227 0922 09223 0923 09215 09218 09226 09225 09231 09237EFCM-SV 09257 09257 09165 09257 0927 09165 09268 0927 09105 09245+CVNN 119888 = 185 119888 = 207 119888 = 194 119888 = 193 119888 = 193 119888 = 182 119888 = 192 119888 = 183 119888 = 194 119888 = 202

10 20 30 40 50 60 70 80 90 1000

02

04

06

08

1

12

14

16

18

2

RI

EFCM-AEFCM-CEFCM-S

Number of dimensions t

(a) RI versus number of dimensions

10 20 30 40 50 60 70 80 90 100074

076

078

08

082

084

086

088

09

092

094

RI

Number of dimensions t

EFCM-AEFCM-CEFCM-S

(b) RI versus number of dimensions

10 20 30 40 50 60 70 80 90 1000

10

20

30

40

50

60

70

Runn

ing

time (

s)

Number of dimensions t

EFCM-AEFCM-CEFCM-S

(c) Running time versus number of dimensions

10 20 30 40 50 60 70 80 90 1000

10

20

30

40

50

60

70

80

90

100

Runn

ing

time (

s)

Number of dimensions t

EFCM-AEFCM-CEFCM-S

(d) Running time versus number of dimensions

Figure 3 Performance of cluster ensemble approaches with different dimensionality

12 Mathematical Problems in Engineering

In Table 3 the values with respect to ldquoEFCM-SVrdquo are theaverage RI values with the estimated clustersrsquo numbers for20 individual runs The values of ldquo+CVNNrdquo are the averageclustersrsquo numbers decided by the CVNN cluster validityindex Using the estimated clustersrsquo numbers by CVNN ourmethod gets the similar results of ensemble method withcorrect clustersrsquo number In addition the average estimates ofclustersrsquo number are close to the true one This indicates thatour cluster ensemble method EFCM-SV is attractive whenthe number of clusters is unknown

6 Conclusion and Future Work

The ldquocurse of dimensionalityrdquo in big data gives new chal-lenges for clustering recently and feature extraction fordimensionality reduction is a popular way to deal with thesechallenges We studied the feature extraction method ofrandom projection for FCM clustering Through analyzingthe effects of random projection on the entire variabilityof data theoretically and verification both on syntheticand real world data empirically we designed an enhancedFCM clustering algorithm with random projection The newalgorithm can maintain nearly the same clustering solutionof preliminary FCM clustering and be more efficient thanfeature extraction method of SVD What is more we alsoproposed a cluster ensemble approach that is more applicableto large scale data sets than existing ones The new ensembleapproach can achieve spectral embedding efficiently fromSVD on the concatenation of membership matrices Theexperiments showed that the new ensemble method ranfaster had more robust partition solutions and fitted a widerrange of geometrical data sets

A future research content is to design the provablyaccurate feature extraction and feature selection methodsfor FCM clustering Another remaining question is thathow to choose proper number of random projections forcluster ensemble method in order to get a trade-off betweenclustering accuracy and efficiency

Competing Interests

The authors declare that they have no competing interests

Acknowledgments

This work was supported in part by the National KeyBasic Research Program (973 programme) under Grant2012CB315905 and in part by the National Nature ScienceFoundation of China under Grants 61502527 and 61379150and in part by the Open Foundation of State Key Laboratoryof Networking and Switching Technology (Beijing Universityof Posts and Telecommunications) (no SKLNST-2013-1-06)

References

[1] M Chen S Mao and Y Liu ldquoBig data a surveyrdquo MobileNetworks and Applications vol 19 no 2 pp 171ndash209 2014

[2] J Zhang X Tao and H Wang ldquoOutlier detection from largedistributed databasesrdquoWorld Wide Web vol 17 no 4 pp 539ndash568 2014

[3] C Ordonez N Mohanam and C Garcia-Alvarado ldquoPCA forlarge data sets with parallel data summarizationrdquo Distributedand Parallel Databases vol 32 no 3 pp 377ndash403 2014

[4] D-S Pham S Venkatesh M Lazarescu and S BudhadityaldquoAnomaly detection in large-scale data stream networksrdquo DataMining and Knowledge Discovery vol 28 no 1 pp 145ndash1892014

[5] F Murtagh and P Contreras ldquoRandom projection towardsthe baire metric for high dimensional clusteringrdquo in StatisticalLearning and Data Sciences pp 424ndash431 Springer BerlinGermany 2015

[6] T C Havens J C Bezdek C Leckie L O Hall and MPalaniswami ldquoFuzzy c-means algorithms for very large datardquoIEEETransactions on Fuzzy Systems vol 20 no 6 pp 1130ndash11462012

[7] J Han M Kamber and J Pei Data Mining Concepts andTechniques Concepts and Techniques Elsevier 2011

[8] S Khan G Situ K Decker and C J Schmidt ldquoGoFigureautomated gene ontology annotationrdquo Bioinformatics vol 19no 18 pp 2484ndash2485 2003

[9] S Gunnemann H Kremer D Lenhard and T Seidl ldquoSub-space clustering for indexing high dimensional data a mainmemory index based on local reductions and individual multi-representationsrdquo in Proceedings of the 14th International Confer-ence on Extending Database Technology (EDBT rsquo11) pp 237ndash248ACM Uppsala Sweden March 2011

[10] J C Bezdek R Ehrlich and W Full ldquoFCM the fuzzy c-meansclustering algorithmrdquo Computers amp Geosciences vol 10 no 2-3pp 191ndash203 1984

[11] R J Hathaway and J C Bezdek ldquoExtending fuzzy andprobabilistic clustering to very large data setsrdquo ComputationalStatistics amp Data Analysis vol 51 no 1 pp 215ndash234 2006

[12] P Hore L O Hall and D B Goldgof ldquoSingle pass fuzzy cmeansrdquo in Proceedings of the IEEE International Fuzzy SystemsConference (FUZZ rsquo07) pp 1ndash7 London UK July 2007

[13] P Hore L O Hall D B Goldgof Y Gu A A Maudsley andA Darkazanli ldquoA scalable framework for segmenting magneticresonance imagesrdquo Journal of Signal Processing Systems vol 54no 1ndash3 pp 183ndash203 2009

[14] W B Johnson and J Lindenstrauss ldquoExtensions of lipschitzmappings into aHilbert spacerdquoContemporaryMathematics vol26 pp 189ndash206 1984

[15] P Indyk and R Motwani ldquoApproximate nearest neighborstowards removing the curse of dimensionalityrdquo in Proceedingsof the 13th Annual ACM Symposium on Theory of Computingpp 604ndash613 ACM 1998

[16] D Achlioptas ldquoDatabase-friendly random projectionsJohnson-Lindenstrauss with binary coinsrdquo Journal of Computerand System Sciences vol 66 no 4 pp 671ndash687 2003

[17] C Boutsidis A Zouzias and P Drineas ldquoRandom projectionsfor k-means clusteringrdquo in Advances in Neural InformationProcessing Systems pp 298ndash306 MIT Press 2010

[18] C C Aggarwal and C K Reddy Data Clustering Algorithmsand Applications CRC Press New York NY USA 2013

[19] R Avogadri and G Valentini ldquoFuzzy ensemble clustering basedon random projections for DNA microarray data analysisrdquoArtificial Intelligence in Medicine vol 45 no 2-3 pp 173ndash1832009

Mathematical Problems in Engineering 13

[20] X Z Fern and C E Brodley ldquoRandom projection for highdimensional data clustering a cluster ensemble approachrdquo inProceedings of the 20th International Conference on MachineLearning (ICML rsquo03) vol 3 pp 186ndash193 August 2003

[21] M Popescu J Keller J Bezdek and A Zare ldquoRandomprojections fuzzy c-means (RPFCM) for big data clusteringrdquoin Proceedings of the IEEE International Conference on FuzzySystems (FUZZ-IEEE rsquo15) pp 1ndash6 Istanbul Turkey August 2015

[22] A Fahad N Alshatri Z Tari et al ldquoA survey of clusteringalgorithms for big data taxonomy and empirical analysisrdquo IEEETransactions on Emerging Topics in Computing vol 2 no 3 pp267ndash279 2014

[23] R A Johnson and D W Wichern Applied Multivariate Statis-tical Analysis vol 4 Pearson Prentice Hall Upper Saddle RiverNJ USA 6th edition 2007

[24] C Boutsidis A Zouzias M W Mahoney and P DrineasldquoRandomized dimensionality reduction for k-means cluster-ingrdquo IEEE Transactions on InformationTheory vol 61 no 2 pp1045ndash1062 2015

[25] X Chen and D Cai ldquoLarge scale spectral clustering withlandmark-based representationrdquo in Proceedings of the 25thAAAI Conference on Artificial Intelligence pp 313ndash318 2011

[26] D Cai and X Chen ldquoLarge scale spectral clustering vialandmark-based sparse representationrdquo IEEE Transactions onCybernetics vol 45 no 8 pp 1669ndash1680 2015

[27] G H Golub and C F Van Loan Matrix Computations vol 3JHU Press 2012

[28] U Maulik and S Bandyopadhyay ldquoPerformance evaluation ofsome clustering algorithms and validity indicesrdquo IEEE Transac-tions on Pattern Analysis and Machine Intelligence vol 24 no12 pp 1650ndash1654 2002

[29] W M Rand ldquoObjective criteria for the evaluation of clusteringmethodsrdquo Journal of the American Statistical Association vol66 no 336 pp 846ndash850 1971

[30] D T Anderson J C Bezdek M Popescu and J M KellerldquoComparing fuzzy probabilistic and possibilistic partitionsrdquoIEEE Transactions on Fuzzy Systems vol 18 no 5 pp 906ndash9182010

[31] X L Xie and G Beni ldquoA validity measure for fuzzy clusteringrdquoIEEE Transactions on Pattern Analysis andMachine Intelligencevol 13 no 8 pp 841ndash847 1991

[32] Y Liu Z LiH Xiong XGao JWu and SWu ldquoUnderstandingand enhancement of internal clustering validation measuresrdquoIEEE Transactions on Cybernetics vol 43 no 3 pp 982ndash9942013

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 9: Research Article Fuzzy -Means and Cluster Ensemble with Random Projection for Big Data ...downloads.hindawi.com/journals/mpe/2016/6529794.pdf · 2019. 7. 30. · Research Article

Mathematical Problems in Engineering 9

10 20 30 40 50 60 70 80 90 100

065

07

075

08

085

09

095

1FR

I

SVDFCM

GaussRPSignRP

Number of dimensions t

(a) FRI versus number of dimensions

SVDFCM

GaussRPSignRP

100 200 300 400 500 600 700 800 900 100008998

09

09002

09004

09006

09008

0901

FRI

Number of dimensions t

(b) FRI versus number of dimensions

SVDFCM

GaussRPSignRP

10 20 30 40 50 60 70 80 90 1000

05

1

15

2

25

3

35

XB

Number of dimensions t

(c) FRI versus number of dimensions

SVDFCM

GaussRPSignRP

100 200 300 400 500 600 700 800 900 10000

2

4

6

8

10

12

14XB

(1015)

Number of dimensions t

(d) FRI versus number of dimensions

10 20 30 40 50 60 70 80 90 100

032

034

036

038

04

042

044

046

048

05

Obj

ectiv

e fun

ctio

n

Number of dimensions t

SVDFCM

GaussRPSignRP

(e) FRI versus number of dimensions

100 200 300 400 500 600 700 800 900 100000285

0029

00295

003

00305

0031

00315

0032

00325

Obj

ectiv

e fun

ctio

n

Number of dimensions t

SVDFCM

GaussRPSignRP

(f) FRI versus number of dimensions

Figure 2 Continued

10 Mathematical Problems in Engineering

24262830

Runn

ing

time

(s)

SVDFCM

10 20 30 40 50 60 70 80 90 1001618

222242628

332

Runn

ing

time (

s)

GaussRPSignRP

10 20 30 40 50 60 70 80 90 100Number of dimensions t

Number of dimensions t

(g) FRI versus number of dimensions

SVDFCM

GaussRPSignRP

820830840850

Runn

ing

time

(s)

100 200 300 400 500 600 700 800 900 1000

100 200 300 400 500 600 700 800 900 1000

0

2

4

6

8

10

12

14

Runn

ing

time (

s)

Number of dimensions t

Number of dimensions t

(h) FRI versus number of dimensions

Figure 2 Performance of clustering algorithms with different dimensionality

Table 1 CVNN indices for different ensemble approaches on ACT2 data

Dimension 119905 10 20 30 40 50 60 70 80 90 100EFCM-A 17315 17383 17449 17789 1819 183 17623 18182 18685 18067EFCM-C 17938 17558 17584 18351 18088 18353 18247 18385 18105 18381EFCM-S 13975 13144 12736 12974 13112 13643 13533 1409 13701 13765

54 Comparisons of Different Cluster Ensemble Methods Thecomparisons of different cluster ensemble approaches areshown in Figure 3 and Table 1 Similarly (a) and (c) of thefigure correspond to the synthetic data set and (b) and (d)corresponds to the ACT2 data set We use RI (a) and (b)and running time (c) and (d) to present the performanceof ensemble methods Meanwhile the meanings of EFCM-A EFCM-C and EFCM-S are identical to the ones inSection 42 In order to get crisp partition for EFCM-A andEFCM-C we used hierarchical clustering-complete linkagemethod after getting the distance matrix as in [21] Since allthree cluster ensemble methods get perfect partition resultson synthetic data set we only compare CVNN indices ofdifferent ensemble methods on ACT2 data set which ispresented in Table 1

In Figure 3 running time of our algorithm is shorterfor both data sets This verifies the result of time complexityanalysis for different algorithms in Section 42 The threecluster ensemble methods all get the perfect partition forsynthetic data set whereas our method is more accuratethan the other two methods for ACT2 data set The perfectpartition results suggest that all three ensemble methods aresuitable for Gaussian mixture data set However the almost18 improvement on RI for ACT2 data set should be due

to the different grouping ideas Our method is based on thegraph partition such that the edges between different clustershave low weight and the edges within a cluster have highweight This clustering way of spectral embedding is moresuitable for ACT2 data set In Table 1 the smaller values ofCVNN of our new method also show that new approach hasbetter partition results on ACT2 data set These observationsindicate that our algorithm has the advantage on efficiencyand adapts to a wider range of geometries

We also compare the stability for three ensemble meth-ods presented in Table 2 From the table we can see that thestandard deviation of RI about EFCM-S is a lower order ofmagnitude than the ones of the other methods Hence thisresult shows that our algorithm is more robust

Aiming at the situation of unknown clustersrsquo numberwe also varied the number of clusters 119888 in FCM clusteringand spectral embedding for our new method We denotethis version of new method as EFCM-SV Since the numberof random projections was set as 5 for ACT2 data set wechanged the clustersrsquo number from 17 to 21 as the input ofFCM clustering algorithm In addition we set the clustersrsquonumber from 14 to 24 as the input of spectral embeddingand applied CVNN to estimate the most plausible number ofclusters The experimental results are presented in Table 3

Mathematical Problems in Engineering 11

Table 2 Standard deviations of RI of 20 runs with different dimensions on ACT2 data

Dimension 119905 10 20 30 40 50 60 70 80 90 100EFCM-A 00222 00174 0018 00257 00171 00251 00188 00172 00218 00184EFCM-C 00217 00189 00128 00232 00192 00200 00175 00194 00151 00214EFCM-S 00044 00018 00029 00030 00028 00024 00026 00020 00024 00019

Table 3 RI values for EFCM-S and EFCM-Sv on ACT2 data

Dimension 119905 10 20 30 40 50 60 70 80 90 100EFCM-S 09227 0922 09223 0923 09215 09218 09226 09225 09231 09237EFCM-SV 09257 09257 09165 09257 0927 09165 09268 0927 09105 09245+CVNN 119888 = 185 119888 = 207 119888 = 194 119888 = 193 119888 = 193 119888 = 182 119888 = 192 119888 = 183 119888 = 194 119888 = 202

10 20 30 40 50 60 70 80 90 1000

02

04

06

08

1

12

14

16

18

2

RI

EFCM-AEFCM-CEFCM-S

Number of dimensions t

(a) RI versus number of dimensions

10 20 30 40 50 60 70 80 90 100074

076

078

08

082

084

086

088

09

092

094

RI

Number of dimensions t

EFCM-AEFCM-CEFCM-S

(b) RI versus number of dimensions

10 20 30 40 50 60 70 80 90 1000

10

20

30

40

50

60

70

Runn

ing

time (

s)

Number of dimensions t

EFCM-AEFCM-CEFCM-S

(c) Running time versus number of dimensions

10 20 30 40 50 60 70 80 90 1000

10

20

30

40

50

60

70

80

90

100

Runn

ing

time (

s)

Number of dimensions t

EFCM-AEFCM-CEFCM-S

(d) Running time versus number of dimensions

Figure 3 Performance of cluster ensemble approaches with different dimensionality

12 Mathematical Problems in Engineering

In Table 3 the values with respect to ldquoEFCM-SVrdquo are theaverage RI values with the estimated clustersrsquo numbers for20 individual runs The values of ldquo+CVNNrdquo are the averageclustersrsquo numbers decided by the CVNN cluster validityindex Using the estimated clustersrsquo numbers by CVNN ourmethod gets the similar results of ensemble method withcorrect clustersrsquo number In addition the average estimates ofclustersrsquo number are close to the true one This indicates thatour cluster ensemble method EFCM-SV is attractive whenthe number of clusters is unknown

6 Conclusion and Future Work

The ldquocurse of dimensionalityrdquo in big data gives new chal-lenges for clustering recently and feature extraction fordimensionality reduction is a popular way to deal with thesechallenges We studied the feature extraction method ofrandom projection for FCM clustering Through analyzingthe effects of random projection on the entire variabilityof data theoretically and verification both on syntheticand real world data empirically we designed an enhancedFCM clustering algorithm with random projection The newalgorithm can maintain nearly the same clustering solutionof preliminary FCM clustering and be more efficient thanfeature extraction method of SVD What is more we alsoproposed a cluster ensemble approach that is more applicableto large scale data sets than existing ones The new ensembleapproach can achieve spectral embedding efficiently fromSVD on the concatenation of membership matrices Theexperiments showed that the new ensemble method ranfaster had more robust partition solutions and fitted a widerrange of geometrical data sets

A future research content is to design the provablyaccurate feature extraction and feature selection methodsfor FCM clustering Another remaining question is thathow to choose proper number of random projections forcluster ensemble method in order to get a trade-off betweenclustering accuracy and efficiency

Competing Interests

The authors declare that they have no competing interests

Acknowledgments

This work was supported in part by the National KeyBasic Research Program (973 programme) under Grant2012CB315905 and in part by the National Nature ScienceFoundation of China under Grants 61502527 and 61379150and in part by the Open Foundation of State Key Laboratoryof Networking and Switching Technology (Beijing Universityof Posts and Telecommunications) (no SKLNST-2013-1-06)

References

[1] M Chen S Mao and Y Liu ldquoBig data a surveyrdquo MobileNetworks and Applications vol 19 no 2 pp 171ndash209 2014

[2] J Zhang X Tao and H Wang ldquoOutlier detection from largedistributed databasesrdquoWorld Wide Web vol 17 no 4 pp 539ndash568 2014

[3] C Ordonez N Mohanam and C Garcia-Alvarado ldquoPCA forlarge data sets with parallel data summarizationrdquo Distributedand Parallel Databases vol 32 no 3 pp 377ndash403 2014

[4] D-S Pham S Venkatesh M Lazarescu and S BudhadityaldquoAnomaly detection in large-scale data stream networksrdquo DataMining and Knowledge Discovery vol 28 no 1 pp 145ndash1892014

[5] F Murtagh and P Contreras ldquoRandom projection towardsthe baire metric for high dimensional clusteringrdquo in StatisticalLearning and Data Sciences pp 424ndash431 Springer BerlinGermany 2015

[6] T C Havens J C Bezdek C Leckie L O Hall and MPalaniswami ldquoFuzzy c-means algorithms for very large datardquoIEEETransactions on Fuzzy Systems vol 20 no 6 pp 1130ndash11462012

[7] J Han M Kamber and J Pei Data Mining Concepts andTechniques Concepts and Techniques Elsevier 2011

[8] S Khan G Situ K Decker and C J Schmidt ldquoGoFigureautomated gene ontology annotationrdquo Bioinformatics vol 19no 18 pp 2484ndash2485 2003

[9] S Gunnemann H Kremer D Lenhard and T Seidl ldquoSub-space clustering for indexing high dimensional data a mainmemory index based on local reductions and individual multi-representationsrdquo in Proceedings of the 14th International Confer-ence on Extending Database Technology (EDBT rsquo11) pp 237ndash248ACM Uppsala Sweden March 2011

[10] J C Bezdek R Ehrlich and W Full ldquoFCM the fuzzy c-meansclustering algorithmrdquo Computers amp Geosciences vol 10 no 2-3pp 191ndash203 1984

[11] R J Hathaway and J C Bezdek ldquoExtending fuzzy andprobabilistic clustering to very large data setsrdquo ComputationalStatistics amp Data Analysis vol 51 no 1 pp 215ndash234 2006

[12] P Hore L O Hall and D B Goldgof ldquoSingle pass fuzzy cmeansrdquo in Proceedings of the IEEE International Fuzzy SystemsConference (FUZZ rsquo07) pp 1ndash7 London UK July 2007

[13] P Hore L O Hall D B Goldgof Y Gu A A Maudsley andA Darkazanli ldquoA scalable framework for segmenting magneticresonance imagesrdquo Journal of Signal Processing Systems vol 54no 1ndash3 pp 183ndash203 2009

[14] W B Johnson and J Lindenstrauss ldquoExtensions of lipschitzmappings into aHilbert spacerdquoContemporaryMathematics vol26 pp 189ndash206 1984

[15] P Indyk and R Motwani ldquoApproximate nearest neighborstowards removing the curse of dimensionalityrdquo in Proceedingsof the 13th Annual ACM Symposium on Theory of Computingpp 604ndash613 ACM 1998

[16] D Achlioptas ldquoDatabase-friendly random projectionsJohnson-Lindenstrauss with binary coinsrdquo Journal of Computerand System Sciences vol 66 no 4 pp 671ndash687 2003

[17] C Boutsidis A Zouzias and P Drineas ldquoRandom projectionsfor k-means clusteringrdquo in Advances in Neural InformationProcessing Systems pp 298ndash306 MIT Press 2010

[18] C C Aggarwal and C K Reddy Data Clustering Algorithmsand Applications CRC Press New York NY USA 2013

[19] R Avogadri and G Valentini ldquoFuzzy ensemble clustering basedon random projections for DNA microarray data analysisrdquoArtificial Intelligence in Medicine vol 45 no 2-3 pp 173ndash1832009

Mathematical Problems in Engineering 13

[20] X Z Fern and C E Brodley ldquoRandom projection for highdimensional data clustering a cluster ensemble approachrdquo inProceedings of the 20th International Conference on MachineLearning (ICML rsquo03) vol 3 pp 186ndash193 August 2003

[21] M Popescu J Keller J Bezdek and A Zare ldquoRandomprojections fuzzy c-means (RPFCM) for big data clusteringrdquoin Proceedings of the IEEE International Conference on FuzzySystems (FUZZ-IEEE rsquo15) pp 1ndash6 Istanbul Turkey August 2015

[22] A Fahad N Alshatri Z Tari et al ldquoA survey of clusteringalgorithms for big data taxonomy and empirical analysisrdquo IEEETransactions on Emerging Topics in Computing vol 2 no 3 pp267ndash279 2014

[23] R A Johnson and D W Wichern Applied Multivariate Statis-tical Analysis vol 4 Pearson Prentice Hall Upper Saddle RiverNJ USA 6th edition 2007

[24] C Boutsidis A Zouzias M W Mahoney and P DrineasldquoRandomized dimensionality reduction for k-means cluster-ingrdquo IEEE Transactions on InformationTheory vol 61 no 2 pp1045ndash1062 2015

[25] X Chen and D Cai ldquoLarge scale spectral clustering withlandmark-based representationrdquo in Proceedings of the 25thAAAI Conference on Artificial Intelligence pp 313ndash318 2011

[26] D Cai and X Chen ldquoLarge scale spectral clustering vialandmark-based sparse representationrdquo IEEE Transactions onCybernetics vol 45 no 8 pp 1669ndash1680 2015

[27] G H Golub and C F Van Loan Matrix Computations vol 3JHU Press 2012

[28] U Maulik and S Bandyopadhyay ldquoPerformance evaluation ofsome clustering algorithms and validity indicesrdquo IEEE Transac-tions on Pattern Analysis and Machine Intelligence vol 24 no12 pp 1650ndash1654 2002

[29] W M Rand ldquoObjective criteria for the evaluation of clusteringmethodsrdquo Journal of the American Statistical Association vol66 no 336 pp 846ndash850 1971

[30] D T Anderson J C Bezdek M Popescu and J M KellerldquoComparing fuzzy probabilistic and possibilistic partitionsrdquoIEEE Transactions on Fuzzy Systems vol 18 no 5 pp 906ndash9182010

[31] X L Xie and G Beni ldquoA validity measure for fuzzy clusteringrdquoIEEE Transactions on Pattern Analysis andMachine Intelligencevol 13 no 8 pp 841ndash847 1991

[32] Y Liu Z LiH Xiong XGao JWu and SWu ldquoUnderstandingand enhancement of internal clustering validation measuresrdquoIEEE Transactions on Cybernetics vol 43 no 3 pp 982ndash9942013


[Figure 2, panels (g) and (h): FRI and running time (s) versus the number of dimensions t for FCM, SVD, GaussRP, and SignRP.]

Figure 2: Performance of clustering algorithms with different dimensionality.

Table 1: CVNN indices for different ensemble approaches on ACT2 data.

Dimension t    10      20      30      40      50      60      70      80      90      100
EFCM-A         1.7315  1.7383  1.7449  1.7789  1.819   1.83    1.7623  1.8182  1.8685  1.8067
EFCM-C         1.7938  1.7558  1.7584  1.8351  1.8088  1.8353  1.8247  1.8385  1.8105  1.8381
EFCM-S         1.3975  1.3144  1.2736  1.2974  1.3112  1.3643  1.3533  1.409   1.3701  1.3765

5.4. Comparisons of Different Cluster Ensemble Methods. The comparisons of different cluster ensemble approaches are shown in Figure 3 and Table 1. As before, (a) and (c) of the figure correspond to the synthetic data set, and (b) and (d) correspond to the ACT2 data set. We use RI ((a) and (b)) and running time ((c) and (d)) to present the performance of the ensemble methods. The meanings of EFCM-A, EFCM-C, and EFCM-S are identical to those in Section 4.2. To obtain a crisp partition for EFCM-A and EFCM-C, we applied hierarchical clustering with complete linkage to the distance matrix, as in [21]. Since all three cluster ensemble methods produce perfect partition results on the synthetic data set, we compare the CVNN indices of the different ensemble methods only on the ACT2 data set; the results are presented in Table 1.
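
As an illustration of this post-processing step, the following sketch turns a set of fuzzy membership matrices into a pairwise distance matrix and then into a crisp partition with complete-linkage hierarchical clustering. The co-association construction, the normalization, and the function name are our own assumptions for illustration, not the exact procedure of [21].

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def crisp_partition_from_ensemble(memberships, n_clusters):
    """Turn a list of fuzzy membership matrices (each c x n) into a crisp
    partition via a pairwise distance matrix and complete-linkage clustering.
    Illustrative sketch only; not the authors' exact implementation."""
    n = memberships[0].shape[1]
    # Fuzzy co-association averaged over ensemble members: s_ij = sum_k u_ki * u_kj
    similarity = np.zeros((n, n))
    for U in memberships:
        similarity += U.T @ U
    similarity /= len(memberships)
    # Convert similarities to distances (assumed normalization choice)
    distance = 1.0 - similarity / similarity.max()
    np.fill_diagonal(distance, 0.0)
    condensed = squareform(distance, checks=False)   # condensed form for linkage
    Z = linkage(condensed, method="complete")        # complete-linkage merge tree
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```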

In Figure 3, the running time of our algorithm is the shortest on both data sets, which confirms the time complexity analysis of the different algorithms in Section 4.2. All three cluster ensemble methods obtain the perfect partition on the synthetic data set, whereas our method is more accurate than the other two on the ACT2 data set. The perfect partition results suggest that all three ensemble methods are suitable for the Gaussian mixture data set. However, the improvement of almost 18% in RI on the ACT2 data set should be attributed to the different grouping ideas: our method is based on graph partitioning, in which edges between different clusters have low weight and edges within a cluster have high weight. This clustering principle of spectral embedding is more suitable for the ACT2 data set. In Table 1, the smaller CVNN values of the new method also show that it produces better partitions on the ACT2 data set. These observations indicate that our algorithm has an advantage in efficiency and adapts to a wider range of geometries.
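
To make the graph-partition idea concrete, the sketch below shows one way the spectral embedding could be computed from the concatenated membership matrices by a thin SVD and then clustered. The row normalization and the use of scikit-learn's KMeans are illustrative assumptions, not necessarily the exact procedure of the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_embedding_ensemble(memberships, n_clusters, random_state=0):
    """Illustrative sketch: stack the c x n membership matrices obtained from
    the individual random projections, take a thin SVD, and run k-means on the
    leading right singular vectors (one embedding coordinate per data point)."""
    M = np.vstack(memberships)                      # (r*c) x n concatenation
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    embedding = Vt[:n_clusters].T                   # n x n_clusters coordinates
    # Normalizing rows to unit length is a common (assumed) post-processing step.
    norms = np.linalg.norm(embedding, axis=1, keepdims=True)
    embedding = embedding / np.maximum(norms, 1e-12)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    return km.fit_predict(embedding)
```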

We also compare the stability of the three ensemble methods, as presented in Table 2. From the table, we can see that the standard deviation of RI for EFCM-S is an order of magnitude lower than those of the other methods. Hence, this result shows that our algorithm is more robust.
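
For completeness, the stability comparison only requires recomputing the Rand index (RI) [29] over repeated runs and examining its spread. A minimal sketch of the index, assuming crisp label vectors, is given below.

```python
import numpy as np
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Rand index [29]: fraction of point pairs on which two labelings agree
    (both grouped together or both kept apart).  Simple O(n^2) illustration."""
    agree, total = 0, 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += int(same_a == same_b)
        total += 1
    return agree / total

# Spread over repeated runs, e.g. 20 runs as in Table 2:
# ri_values = [rand_index(run_labels, ground_truth) for run_labels in runs]
# robustness = np.std(ri_values)
```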

Aiming at the situation where the number of clusters is unknown, we also varied the number of clusters c both in FCM clustering and in the spectral embedding for our new method; we denote this version of the new method as EFCM-SV. Since the number of random projections was set to 5 for the ACT2 data set, we changed the clusters' number from 17 to 21 as the input of the FCM clustering algorithm. In addition, we set the clusters' number from 14 to 24 as the input of the spectral embedding and applied CVNN to estimate the most plausible number of clusters. The experimental results are presented in Table 3.
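
Conceptually, EFCM-SV is a model-selection loop: each candidate number of clusters is tried and scored with a validity index (CVNN [32] in our experiments, where smaller is better), and the best-scoring partition is kept. The sketch below uses placeholder callbacks rather than the actual implementation; the CVNN computation itself follows [32].

```python
def select_cluster_number(candidates, cluster_with_c, validity_index):
    """Try each candidate number of clusters, score the resulting partition,
    and keep the best.  `cluster_with_c(c)` should return a partition for c
    clusters (e.g. spectral embedding + k-means) and `validity_index(labels, c)`
    a CVNN-style score where smaller means a more plausible partition."""
    best = (None, float("inf"), None)               # (c, score, labels)
    for c in candidates:
        labels = cluster_with_c(c)
        score = validity_index(labels, c)
        if score < best[1]:
            best = (c, score, labels)
    return best

# Example: best_c, _, best_labels = select_cluster_number(range(14, 25), run_efcm_s, cvnn)
```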


Table 2: Standard deviations of RI over 20 runs with different dimensions on ACT2 data.

Dimension t    10      20      30      40      50      60      70      80      90      100
EFCM-A         0.0222  0.0174  0.0180  0.0257  0.0171  0.0251  0.0188  0.0172  0.0218  0.0184
EFCM-C         0.0217  0.0189  0.0128  0.0232  0.0192  0.0200  0.0175  0.0194  0.0151  0.0214
EFCM-S         0.0044  0.0018  0.0029  0.0030  0.0028  0.0024  0.0026  0.0020  0.0024  0.0019

Table 3: RI values for EFCM-S and EFCM-SV on ACT2 data.

Dimension t    10      20      30      40      50      60      70      80      90      100
EFCM-S         0.9227  0.9220  0.9223  0.9230  0.9215  0.9218  0.9226  0.9225  0.9231  0.9237
EFCM-SV        0.9257  0.9257  0.9165  0.9257  0.9270  0.9165  0.9268  0.9270  0.9105  0.9245
+CVNN          c=18.5  c=20.7  c=19.4  c=19.3  c=19.3  c=18.2  c=19.2  c=18.3  c=19.4  c=20.2

[Figure 3, panels (a)-(d): RI versus the number of dimensions t ((a) synthetic data set, (b) ACT2 data set) and running time (s) versus the number of dimensions t ((c) synthetic data set, (d) ACT2 data set) for EFCM-A, EFCM-C, and EFCM-S.]

Figure 3: Performance of cluster ensemble approaches with different dimensionality.


In Table 3, the values for "EFCM-SV" are the average RI values obtained with the estimated clusters' numbers over 20 individual runs, and the values of "+CVNN" are the average clusters' numbers selected by the CVNN cluster validity index. Using the clusters' numbers estimated by CVNN, our method obtains results similar to those of the ensemble method with the correct clusters' number. In addition, the average estimates of the clusters' number are close to the true one. This indicates that our cluster ensemble method EFCM-SV is attractive when the number of clusters is unknown.

6. Conclusion and Future Work

The "curse of dimensionality" in big data poses new challenges for clustering, and feature extraction for dimensionality reduction is a popular way to deal with them. We studied random projection as a feature extraction method for FCM clustering. By analyzing the effects of random projection on the entire variability of the data theoretically, and verifying them empirically on both synthetic and real-world data, we designed an enhanced FCM clustering algorithm with random projection. The new algorithm maintains nearly the same clustering solution as the preliminary FCM clustering while being more efficient than the SVD-based feature extraction method. Moreover, we proposed a cluster ensemble approach that is more applicable to large-scale data sets than existing ones. The new ensemble approach obtains the spectral embedding efficiently from an SVD of the concatenation of membership matrices. The experiments showed that the new ensemble method runs faster, yields more robust partition solutions, and fits a wider range of data geometries.

A direction for future research is to design provably accurate feature extraction and feature selection methods for FCM clustering. Another open question is how to choose a proper number of random projections for the cluster ensemble method so as to balance clustering accuracy against efficiency.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

This work was supported in part by the National Key Basic Research Program (973 Program) under Grant 2012CB315905, in part by the National Natural Science Foundation of China under Grants 61502527 and 61379150, and in part by the Open Foundation of the State Key Laboratory of Networking and Switching Technology (Beijing University of Posts and Telecommunications) (no. SKLNST-2013-1-06).

References

[1] M. Chen, S. Mao, and Y. Liu, "Big data: a survey," Mobile Networks and Applications, vol. 19, no. 2, pp. 171–209, 2014.

[2] J. Zhang, X. Tao, and H. Wang, "Outlier detection from large distributed databases," World Wide Web, vol. 17, no. 4, pp. 539–568, 2014.

[3] C. Ordonez, N. Mohanam, and C. Garcia-Alvarado, "PCA for large data sets with parallel data summarization," Distributed and Parallel Databases, vol. 32, no. 3, pp. 377–403, 2014.

[4] D.-S. Pham, S. Venkatesh, M. Lazarescu, and S. Budhaditya, "Anomaly detection in large-scale data stream networks," Data Mining and Knowledge Discovery, vol. 28, no. 1, pp. 145–189, 2014.

[5] F. Murtagh and P. Contreras, "Random projection towards the Baire metric for high dimensional clustering," in Statistical Learning and Data Sciences, pp. 424–431, Springer, Berlin, Germany, 2015.

[6] T. C. Havens, J. C. Bezdek, C. Leckie, L. O. Hall, and M. Palaniswami, "Fuzzy c-means algorithms for very large data," IEEE Transactions on Fuzzy Systems, vol. 20, no. 6, pp. 1130–1146, 2012.

[7] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Elsevier, 2011.

[8] S. Khan, G. Situ, K. Decker, and C. J. Schmidt, "GoFigure: automated gene ontology annotation," Bioinformatics, vol. 19, no. 18, pp. 2484–2485, 2003.

[9] S. Gunnemann, H. Kremer, D. Lenhard, and T. Seidl, "Subspace clustering for indexing high dimensional data: a main memory index based on local reductions and individual multi-representations," in Proceedings of the 14th International Conference on Extending Database Technology (EDBT '11), pp. 237–248, ACM, Uppsala, Sweden, March 2011.

[10] J. C. Bezdek, R. Ehrlich, and W. Full, "FCM: the fuzzy c-means clustering algorithm," Computers & Geosciences, vol. 10, no. 2-3, pp. 191–203, 1984.

[11] R. J. Hathaway and J. C. Bezdek, "Extending fuzzy and probabilistic clustering to very large data sets," Computational Statistics & Data Analysis, vol. 51, no. 1, pp. 215–234, 2006.

[12] P. Hore, L. O. Hall, and D. B. Goldgof, "Single pass fuzzy c means," in Proceedings of the IEEE International Fuzzy Systems Conference (FUZZ '07), pp. 1–7, London, UK, July 2007.

[13] P. Hore, L. O. Hall, D. B. Goldgof, Y. Gu, A. A. Maudsley, and A. Darkazanli, "A scalable framework for segmenting magnetic resonance images," Journal of Signal Processing Systems, vol. 54, no. 1–3, pp. 183–203, 2009.

[14] W. B. Johnson and J. Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space," Contemporary Mathematics, vol. 26, pp. 189–206, 1984.

[15] P. Indyk and R. Motwani, "Approximate nearest neighbors: towards removing the curse of dimensionality," in Proceedings of the 30th Annual ACM Symposium on Theory of Computing, pp. 604–613, ACM, 1998.

[16] D. Achlioptas, "Database-friendly random projections: Johnson-Lindenstrauss with binary coins," Journal of Computer and System Sciences, vol. 66, no. 4, pp. 671–687, 2003.

[17] C. Boutsidis, A. Zouzias, and P. Drineas, "Random projections for k-means clustering," in Advances in Neural Information Processing Systems, pp. 298–306, MIT Press, 2010.

[18] C. C. Aggarwal and C. K. Reddy, Data Clustering: Algorithms and Applications, CRC Press, New York, NY, USA, 2013.

[19] R. Avogadri and G. Valentini, "Fuzzy ensemble clustering based on random projections for DNA microarray data analysis," Artificial Intelligence in Medicine, vol. 45, no. 2-3, pp. 173–183, 2009.

[20] X. Z. Fern and C. E. Brodley, "Random projection for high dimensional data clustering: a cluster ensemble approach," in Proceedings of the 20th International Conference on Machine Learning (ICML '03), vol. 3, pp. 186–193, August 2003.

[21] M. Popescu, J. Keller, J. Bezdek, and A. Zare, "Random projections fuzzy c-means (RPFCM) for big data clustering," in Proceedings of the IEEE International Conference on Fuzzy Systems (FUZZ-IEEE '15), pp. 1–6, Istanbul, Turkey, August 2015.

[22] A. Fahad, N. Alshatri, Z. Tari, et al., "A survey of clustering algorithms for big data: taxonomy and empirical analysis," IEEE Transactions on Emerging Topics in Computing, vol. 2, no. 3, pp. 267–279, 2014.

[23] R. A. Johnson and D. W. Wichern, Applied Multivariate Statistical Analysis, vol. 4, Pearson Prentice Hall, Upper Saddle River, NJ, USA, 6th edition, 2007.

[24] C. Boutsidis, A. Zouzias, M. W. Mahoney, and P. Drineas, "Randomized dimensionality reduction for k-means clustering," IEEE Transactions on Information Theory, vol. 61, no. 2, pp. 1045–1062, 2015.

[25] X. Chen and D. Cai, "Large scale spectral clustering with landmark-based representation," in Proceedings of the 25th AAAI Conference on Artificial Intelligence, pp. 313–318, 2011.

[26] D. Cai and X. Chen, "Large scale spectral clustering via landmark-based sparse representation," IEEE Transactions on Cybernetics, vol. 45, no. 8, pp. 1669–1680, 2015.

[27] G. H. Golub and C. F. Van Loan, Matrix Computations, vol. 3, JHU Press, 2012.

[28] U. Maulik and S. Bandyopadhyay, "Performance evaluation of some clustering algorithms and validity indices," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 12, pp. 1650–1654, 2002.

[29] W. M. Rand, "Objective criteria for the evaluation of clustering methods," Journal of the American Statistical Association, vol. 66, no. 336, pp. 846–850, 1971.

[30] D. T. Anderson, J. C. Bezdek, M. Popescu, and J. M. Keller, "Comparing fuzzy, probabilistic, and possibilistic partitions," IEEE Transactions on Fuzzy Systems, vol. 18, no. 5, pp. 906–918, 2010.

[31] X. L. Xie and G. Beni, "A validity measure for fuzzy clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 8, pp. 841–847, 1991.

[32] Y. Liu, Z. Li, H. Xiong, X. Gao, J. Wu, and S. Wu, "Understanding and enhancement of internal clustering validation measures," IEEE Transactions on Cybernetics, vol. 43, no. 3, pp. 982–994, 2013.
